Fojiba-Jabba Notes

From SlugWiki
Revision as of 22:31, 25 August 2015 by Ivanaf (Talk | contribs) (18 revisions imported)

Fojiba-Jabba is the module of Cruft Alarm supporting Automatic Text Generation. Its name comes from an Old Welsh word meaning "onomatopoeia".

Theoretical Foundations

Fojiba-Jabba uses techniques from the theory of Markov Chains and Recursive Transition Networks.

Markov Chains

One method of text generation involves Markov Chains. In theory, Markov Chains can produce delightfully quirky text; in practice, they sort of suck.

Process

The process can be summarized as follows:

  • The user specifies an initial word and the number of sentences desired in the text.
  • Fojiba-Jabba, having previously analyzed a set of texts in order to gather statistics on which words follow which words, uses these data to generate the next word.
  • This process repeats until the desired number of sentences is obtained.
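
The steps above can be sketched roughly as follows. This is a minimal illustration, not Fojiba-Jabba's actual code: the names `build_bigram_table` and `generate`, the toy corpus, the whitespace tokenization, and the rule that a sentence ends at `.`, `!` or `?` are all assumptions made for the example.

```python
import random
from collections import defaultdict

# Assumption: a sentence ends whenever one of these tokens is generated.
SENTENCE_END = {".", "!", "?"}

def build_bigram_table(tokens):
    """Map each word to the list of words observed directly after it."""
    followers = defaultdict(list)
    for prev, nxt in zip(tokens, tokens[1:]):
        followers[prev].append(nxt)  # duplicates preserve observed frequencies
    return followers

def generate(followers, first_word, num_sentences, rng=random, max_words=200):
    """Walk the chain from first_word until num_sentences sentences exist."""
    words = [first_word]
    sentences = 1 if first_word in SENTENCE_END else 0
    while sentences < num_sentences and len(words) < max_words:
        candidates = followers.get(words[-1])
        if not candidates:  # dead end: nothing was ever seen after this word
            break
        words.append(rng.choice(candidates))
        if words[-1] in SENTENCE_END:
            sentences += 1
    return " ".join(words)

# Toy corpus standing in for the analyzed texts.
corpus = "the cat sat . the dog sat . the cat ran .".split()
table = build_bigram_table(corpus)
print(generate(table, "the", 2, random.Random(0)))
```

Storing followers as a list with repeats means `rng.choice` samples each next word in proportion to how often it followed the current word, which is all an Order-1 chain needs.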

Problems

There are, however, several problems with this method:

  • The corpus available is too limited to attempt anything but an Order-1 Markov Chain (anything higher results in what is essentially the original text itself).
  • An Order-1 Markov Chain is often too simplistic to produce anything but rather ungrammatical (and clearly fake) sentences, since each word depends only on the single word before it.

Possible Solutions/To Do

  • Use highly advanced linguistic knowledge to improve grammaticality (e.g., a noun or an adjective must follow a determiner). A Brill Part-of-Speech Tagger or the Stanford Parser may be useful.
  • Use Google to find likely following words, or to enlarge the dataset somehow.
  • Find a better POS tagger.
  • Improve the equivalence classes (currently a history of one word) through stemming or semantic grouping. (Maybe syntactic grouping too, though that may be too similar to what was tried with the RTNs.)
  • Make the next word depend on the previous word AND the syntactic group (not just one of the two).
  • Add some sort of tag to the beginnings and ends of sentences: the beginning of one sentence should not depend on the end of the last, and sentences often begin (and end?) in certain ways.
  • Dynamic-order chaining: use higher-order n-grams when there is enough data, and lower-order n-grams when there is not.
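
Two of the items above, sentence-boundary tags and dynamic-order chaining, could be combined along these lines. This is only a sketch under assumptions: the `<s>` and `</s>` markers, the `min_count` threshold, and all function names are invented for illustration, and the back-off rule is a deliberately crude one (take the longest history seen often enough, else shorten it).

```python
import random
from collections import defaultdict

# Assumed boundary markers; any tokens absent from the corpus would do.
START, END = "<s>", "</s>"

def build_tables(sentences, max_order=2):
    """For each order k, map a k-word history to the words seen after it."""
    tables = {k: defaultdict(list) for k in range(1, max_order + 1)}
    for sentence in sentences:
        # Pad each sentence separately so no history crosses a sentence boundary.
        tokens = [START] * max_order + sentence.split() + [END]
        for i in range(max_order, len(tokens)):
            for k in range(1, max_order + 1):
                tables[k][tuple(tokens[i - k:i])].append(tokens[i])
    return tables

def next_word(tables, history, min_count=2, rng=random):
    """Use the longest history with at least min_count observations."""
    for k in sorted(tables, reverse=True):
        candidates = tables[k].get(tuple(history[-k:]))
        # Order 1 is the last resort, so it is accepted with any amount of data.
        if candidates and (len(candidates) >= min_count or k == 1):
            return rng.choice(candidates)
    return END  # history never seen, even at order 1

def generate_sentence(tables, rng=random, max_words=50):
    """Generate one sentence, starting from the start-of-sentence padding."""
    history = [START] * max(tables)
    words = []
    while len(words) < max_words:
        word = next_word(tables, history, rng=rng)
        if word == END:
            break
        words.append(word)
        history.append(word)
    return " ".join(words)

sentences = ["the cat sat", "the cat ran", "a dog sat"]
tables = build_tables(sentences)
print(generate_sentence(tables, random.Random(0)))
```

Here the history `("the", "cat")` has two observations and is used at order 2, while `("a", "dog")` has only one, so the generator falls back to the order-1 history `("dog",)`.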

Recursive Transition Networks

Another method of text generation involves Recursive Transition Networks. While more grammatical than Markov Chains by design (in theory; in practice, they also sort of suck), they are slightly more difficult to implement and, unless cleverly manufactured, have less of the idiosyncratic charm that Markov Chain-fans find so endearing.

Process

Text generation using Recursive Transition Networks proceeds as follows:

  • Fojiba-Jabba takes a set of texts and runs a part-of-speech tagger over them. Two hash maps are created: one mapping each word to its possible parts-of-speech (e.g., 'can' to 'noun' and 'modal') and one mapping each part-of-speech to its possible words (e.g., 'verb' to 'lick' and 'frolic'). A rules array is also created, holding the sequence of part-of-speech tags of every sentence in the texts.
  • To generate a sentence, Fojiba-Jabba picks a random element (a sequence of part-of-speech tags) from the rules array. It then maps each of these tags to an English word using the part-of-speech-to-word hash map, thereby generating a sentence.
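
A rough sketch of the two steps above follows. It skips the tagger entirely and starts from a tiny hand-tagged corpus, which is an assumption made to keep the example self-contained; the tag names and the function names `build_tables` and `generate_sentence` are likewise hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical hand-tagged sentences standing in for real tagger output.
tagged_sentences = [
    [("the", "DET"), ("dog", "NOUN"), ("can", "MODAL"), ("frolic", "VERB")],
    [("a", "DET"), ("can", "NOUN"), ("fell", "VERB")],
]

def build_tables(tagged_sentences):
    """Build the word->tags map, the tag->words map, and the rules array."""
    word_to_tags = defaultdict(set)
    tag_to_words = defaultdict(set)
    rules = []
    for sentence in tagged_sentences:
        for word, tag in sentence:
            word_to_tags[word].add(tag)
            tag_to_words[tag].add(word)
        rule = tuple(tag for _, tag in sentence)
        if rule not in rules:  # keep each distinct tag sequence once
            rules.append(rule)
    return word_to_tags, tag_to_words, rules

def generate_sentence(tag_to_words, rules, rng=random):
    """Pick a random rule, then fill each tag with a random word for that tag."""
    rule = rng.choice(rules)
    # sorted() gives rng.choice a sequence with a stable order.
    return " ".join(rng.choice(sorted(tag_to_words[tag])) for tag in rule)

word_to_tags, tag_to_words, rules = build_tables(tagged_sentences)
print(generate_sentence(tag_to_words, rules, random.Random(0)))
```

Even on this toy data the sketch can emit "a dog can fell", which fits a rule tag for tag and still reads badly.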

Problems

  • Parts-of-speech alone are not enough to guarantee grammaticality, so the sentences produced by the RTN are still very ungrammatical.

Possible Solutions

  • Create a set of rules by hand.

Ideas

  • Percentage-based text generation with a large corpus (e.g., Google). (The author has forgotten what this means.)
  • Match rarities (capitalization, slang) and content (e.g., the kitchen).
  • Work backwards from writer identification (removing headers and identifiers).
  • Use Nomlex (or Celex or NomBank or PropBank or...).
  • Start by generating NPs, PPs, etc.
  • Determine user-specific collocations (using relative frequency ratios).
  • Microsoft(!)'s AMALGAM.
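
The relative-frequency-ratio idea in the list above might look something like this: compare how often a word pair occurs in one user's writing with how often it occurs in a reference corpus. Everything here is an assumption made for illustration, including the restriction to adjacent word pairs, the additive `smoothing` term (used only to avoid dividing by zero for pairs the reference corpus lacks), and the function names.

```python
from collections import Counter

def bigrams(tokens):
    """Return the adjacent word pairs of a token list."""
    return list(zip(tokens, tokens[1:]))

def relative_frequency_ratios(user_tokens, reference_tokens, smoothing=1.0):
    """Score each user bigram by (rate in user text) / (rate in reference text)."""
    user_counts = Counter(bigrams(user_tokens))
    ref_counts = Counter(bigrams(reference_tokens))
    user_total = sum(user_counts.values())
    ref_total = sum(ref_counts.values())
    ratios = {}
    for bigram, count in user_counts.items():
        user_rate = count / user_total
        # Assumed ad hoc smoothing so unseen reference bigrams get a nonzero rate.
        ref_rate = (ref_counts[bigram] + smoothing) / (ref_total + smoothing)
        ratios[bigram] = user_rate / ref_rate
    return ratios

user = "oh boy oh boy good sir".split()
reference = "the good sir said oh no".split()
ratios = relative_frequency_ratios(user, reference)
for bigram, ratio in sorted(ratios.items(), key=lambda item: -item[1]):
    print(bigram, round(ratio, 2))
```

Pairs with a ratio well above 1 are candidates for user-specific collocations; pairs near or below 1 are just ordinary English.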

Current

  • Statistical analyses are being performed on Lyric Doshi's infamous and poopy writings. The goal, good sirs, is unknown. O:-) (To do: vowel elongation (yaaay), likey?, stock phrases (oh boy!, goodness, sir))

To Do

  • Work on Smoke Alarm.