Fojiba-Jabba Notes

Fojiba-Jabba is the module of Cruft Alarm supporting Automatic Text Generation. Its name comes from an Old Welsh word meaning "onomatopoeia".

Theoretical Foundations

Fojiba-Jabba uses techniques from the theory of Markov Chains and Recursive Transition Networks.

Markov Chains

One method of text generation involves Markov Chains. In theory, Markov Chains can produce a delightfully quirky text; in practice, they sort of suck.

Process

The process can be summarized as follows:

  • The user specifies an initial word and the number of sentences desired in the text.
  • Fojiba-Jabba, having previously analyzed a set of texts to gather statistics on which words follow which, uses these data to generate the next word.
  • This process repeats until the desired number of sentences is obtained (a minimal sketch of this loop follows the list).
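
A minimal sketch of this process in Python, assuming a raw text corpus; the function names, data structures, and end-of-sentence test here are illustrative guesses, not Fojiba-Jabba's actual code:

    import random
    from collections import defaultdict

    def build_chain(text):
        """Map each word to the list of words observed to follow it (Order-1)."""
        chain = defaultdict(list)
        words = text.split()
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
        return chain

    def generate(chain, seed, num_sentences):
        """Walk the chain from the seed word until enough sentences have ended."""
        word, output, sentences = seed, [seed], 0
        while sentences < num_sentences:
            followers = chain.get(word)
            if not followers:                    # dead end: restart at a random word
                word = random.choice(list(chain))
            else:
                word = random.choice(followers)
            output.append(word)
            if word.endswith(('.', '!', '?')):   # crude sentence boundary
                sentences += 1
        return ' '.join(output)

Because repeated followers appear multiple times in each list, choosing uniformly from a list reproduces the bigram frequencies observed in the corpus.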

Problems

There are, however, several problems with this method:

  • The corpus available is too limited to attempt anything but an Order-1 Markov Chain (anything higher essentially reproduces the original text).
  • An Order-1 Markov Chain retains too little context to produce anything but rather ungrammatical (and clearly fake) sentences.

Possible Solutions/To Do

  • Use highly advanced linguistic knowledge to improve grammaticality (e.g., a noun or an adjective must follow a determiner). A Brill Part-of-Speech Tagger or the Stanford Parser may be useful.
  • Use Google to find likely following words, or to increase the dataset somehow.
  • Find a better POS tagger.
  • Improve the equivalence classes (currently a history of one word) through stemming or semantic grouping. (Maybe syntactic grouping too, but that may be too similar to what was tried with the RTNs.)
  • Make the next word depend on both the previous word AND its syntactic group, not on the word alone.
  • Add some sort of tag to the beginnings and ends of sentences: the beginning of one sentence shouldn't depend on the end of the last, and sentences often begin (and end?) in characteristic ways (see the sketch after this list).
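
For the sentence-boundary item, a hedged sketch: pad each sentence with sentinel tokens when building the chain (the names <s> and </s> are my invention, not Fojiba-Jabba's), so generation starts from the <s> state rather than from the last word of the previous sentence:

    import random
    from collections import defaultdict

    START, END = '<s>', '</s>'    # hypothetical boundary tags

    def build_chain(sentences):
        """sentences: a list of token lists, e.g. [['the', 'cat', 'sat', '.'], ...]"""
        chain = defaultdict(list)
        for tokens in sentences:
            padded = [START] + tokens + [END]
            for current, following in zip(padded, padded[1:]):
                chain[current].append(following)
        return chain

    def generate_sentence(chain):
        """Walk from START until END; the sentence never sees its predecessor."""
        word, out = START, []
        while True:
            word = random.choice(chain[word])
            if word == END:
                return ' '.join(out)
            out.append(word)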

Recursive Transition Networks

Another method of text generation involves Recursive Transition Networks. While more grammatical than Markov Chains by design (in theory; in practice, they also sort of suck), they are slightly more difficult to implement and, unless cleverly manufactured, have less of the idiosyncratic charm that Markov Chain fans find so endearing.

Process

Text generation using Recursive Transition Networks proceeds as follows:

  • Fojiba-Jabba takes a set of texts and runs a part-of-speech tagger over them. Two hash maps are created: one mapping each word to its possible parts-of-speech (e.g., 'can' to 'noun' and 'modal') and one mapping each part-of-speech to its possible words (e.g., 'verb' to 'lick' and 'frolic'). A rules array containing every sequence of part-of-speech tags that any sentence contains is also created.
  • To generate a sentence, Fojiba-Jabba picks a random element (a sequence of part-of-speech tags) from the rules array. It then maps each of these tags to an English word using its hash map, thereby generating a sentence (a sketch of both steps follows).
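
A minimal sketch of these two steps, assuming the tagger's output is available as (word, tag) pairs; the names are illustrative, not Fojiba-Jabba's actual code:

    import random
    from collections import defaultdict

    def analyze(tagged_sentences):
        """tagged_sentences: a list of [(word, tag), ...] lists from a POS tagger."""
        word_to_tags = defaultdict(set)    # e.g. 'can' -> {'noun', 'modal'}
        tag_to_words = defaultdict(list)   # e.g. 'verb' -> ['lick', 'frolic']
        rules = []                         # every observed tag sequence
        for sentence in tagged_sentences:
            rules.append([tag for _, tag in sentence])
            for word, tag in sentence:
                word_to_tags[word].add(tag)
                tag_to_words[tag].append(word)
        return word_to_tags, tag_to_words, rules

    def generate(tag_to_words, rules):
        """Pick a random tag sequence and fill each slot with a matching word."""
        rule = random.choice(rules)
        return ' '.join(random.choice(tag_to_words[tag]) for tag in rule)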

Problems

  • Part-of-speech tags alone are not enough to guarantee grammaticality, so the sentences produced by the RTN are still very ungrammatical.

Possible Solutions

  • Create a set of rules by hand.

Ideas

  • Percentage-based text generation with a large corpus (e.g., Google). (The author has forgotten what this means.)
  • Match rarities (capitalization, slang) and content (e.g., the kitchen).
  • Work backwards from writer identification (removing headers and identifiers).
  • Use Nomlex (or Celex or NomBank or PropBank or...).
  • Start by generating NPs, PPs, etc.
  • Determine user-specific collocations (using relative frequency ratios; see the sketch after this list).
  • Microsoft(!)'s AMALGAM.
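
For the collocation item, a relative frequency ratio scores a phrase by how much more often it appears in one user's writing than in a background corpus; a hedged sketch, where the tokenization and the add-one smoothing are my assumptions:

    from collections import Counter

    def bigrams(tokens):
        return list(zip(tokens, tokens[1:]))

    def relative_frequency_ratios(user_tokens, background_tokens):
        """Score each of the user's bigrams by its normalized frequency in the
        user's corpus divided by its normalized frequency in the background
        corpus; high scores mark phrases peculiar to that user."""
        user = Counter(bigrams(user_tokens))
        background = Counter(bigrams(background_tokens))
        n_user, n_bg = len(user_tokens), len(background_tokens)
        scores = {}
        for bigram, count in user.items():
            bg_count = background.get(bigram, 0) + 1   # add-one smoothing for unseen bigrams
            scores[bigram] = (count / n_user) / (bg_count / n_bg)
        return scores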

Current

  • Statistical analyses are being performed on Lyric Doshi's infamous and poopy writings. The goal, good sirs, is unknown. O:-) (To do: vowel elongation (yaaay), likey?, stock phrases (oh boy!, goodness, sir); see the sketch below.)
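
A hedged sketch of what those counts might look like; the regular expression and the phrase list are guesses reconstructed from the examples in the note above, not Fojiba-Jabba's actual code:

    import re
    from collections import Counter

    ELONGATED = re.compile(r'\b\w*([aeiou])\1{2,}\w*\b', re.IGNORECASE)  # e.g. 'yaaay'
    STOCK_PHRASES = ['oh boy!', 'goodness', 'sir', 'likey']              # from the note above

    def stylistic_features(text):
        """Count vowel-elongated words and stock phrases in a writing sample."""
        lowered = text.lower()
        return {
            'elongations': Counter(m.group(0) for m in ELONGATED.finditer(text)),
            'stock_phrases': {p: lowered.count(p) for p in STOCK_PHRASES},
        }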

To Do

  • Work on Smoke Alarm.