Fojiba-Jabba is the module of Cruft Alarm supporting Automatic Text Generation. Its name comes from an Old Welsh word meaning "onomatopoeia".
Fojiba-Jabba uses techniques from Markov chain theory and from recursive transition network theory.
Markov Chains
One method of text generation involves Markov Chains. In theory, Markov Chains can produce a delightfully quirky text; in practice, they sort of suck.
The process can be summarized as follows:
- The user specifies an initial word and the number of sentences desired in the text.
- Fojiba-Jabba, having previously analyzed a set of texts in order to gather statistics on which words follow which words, uses these data to generate the next word.
- This process repeats until the desired number of sentences is obtained.
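The steps above can be sketched as a minimal order-1 Markov chain in Python. This is an illustration, not Fojiba-Jabba's actual code; the toy corpus and function names are invented for the example, and it counts words rather than sentences for simplicity.

```python
import random
from collections import defaultdict

def build_model(words):
    """Map each word to the list of words observed to follow it (order 1).

    Repeats in the list make frequent successors proportionally more likely.
    """
    model = defaultdict(list)
    for prev, nxt in zip(words, words[1:]):
        model[prev].append(nxt)
    return model

def generate(model, start, n_words):
    """Walk the chain from `start`, sampling each next word from the model."""
    out = [start]
    for _ in range(n_words - 1):
        followers = model.get(out[-1])
        if not followers:  # dead end: this word was never followed by anything
            break
        out.append(random.choice(followers))
    return " ".join(out)

# Toy stand-in for the analyzed texts.
corpus = "the cat sat on the mat and the cat ran".split()
model = build_model(corpus)
print(generate(model, "the", 6))
```

Because "the" is followed by "cat" twice and "mat" once in this corpus, "cat" is sampled twice as often after "the".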
There are, however, several problems with this method:
- The corpus available is too limited to attempt anything but an Order-1 Markov Chain (anything higher results in what is essentially the original text itself).
- An Order-1 Markov Chain is often too weak to produce anything but rather ungrammatical (and clearly fake) sentences.
Possible Solutions/To Do
- Use highly advanced linguistic knowledge to improve grammaticality (e.g., a noun or an adjective must follow a determiner). A Brill Part-of-Speech Tagger or the Stanford Parser may be useful.
- Use Google to find likely following words, or to increase the dataset somehow.
- Find a better POS tagger.
- Improve the equivalence classes (currently history of one word) through stemming or semantic grouping. (Maybe syntactic grouping too...but maybe too similar to what was tried with the RTNs).
- Make the next word depend on both the previous word AND its syntactic group (not just one of the two).
- Add some sort of tag to the beginnings and ends of sentences: we don't want the beginning of one sentence to depend on the end of the last, and sentences often begin (and end?) in certain ways.
- Dynamic-order chaining: use higher order n-grams when enough data, and lower order n-grams when not.
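The dynamic-order idea above could be sketched as a simple backoff: try the longest available history, and fall back to shorter ones when the data are too sparse. Everything here (the `min_count` threshold, the names, the tiny corpus) is an illustrative assumption, not Fojiba-Jabba's implementation.

```python
import random
from collections import defaultdict

def build_ngram_models(words, max_order=3):
    """models[k] maps each k-word history tuple to the words observed after it."""
    models = {k: defaultdict(list) for k in range(1, max_order + 1)}
    for k in range(1, max_order + 1):
        for i in range(len(words) - k):
            models[k][tuple(words[i:i + k])].append(words[i + k])
    return models

def next_word(models, history, min_count=2):
    """Back off from the longest history to shorter ones until enough data is found.

    `min_count` is an arbitrary sparsity threshold for this sketch: a history is
    only trusted if it was seen at least that often (except at order 1, where we
    take whatever we have).
    """
    for k in range(min(len(history), max(models)), 0, -1):
        followers = models[k].get(tuple(history[-k:]), [])
        if len(followers) >= min_count or (k == 1 and followers):
            return random.choice(followers)
    return None  # word never seen at all
```

With enough data for a history, this behaves like a higher-order chain; on rare histories it degrades gracefully to order 1 instead of parroting the original text.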
Recursive Transition Networks
Another method of text generation involves Recursive Transition Networks. While more grammatical than Markov Chains by design (in theory; in practice, they also sort of suck), they are slightly more difficult to implement and, unless cleverly manufactured, have less of the idiosyncratic charm that Markov Chain fans find so endearing.
Text generation using Recursive Transition Networks proceeds as follows:
- Fojiba-Jabba takes a set of texts, and runs a part-of-speech tagger through them. Two hash maps are created: one mapping each word to its possible parts-of-speech (e.g., 'can' to 'noun' and 'modal') and one mapping each part-of-speech to its possible words (e.g., 'verb' to 'lick' and 'frolic'). A rules array containing every sequence of part-of-speech tags that any sentence contains is also created.
- To generate a sentence, Fojiba-Jabba picks a random element (a sequence of part-of-speech tags) from the rules array. It then maps each of these tags to an English word using its hash map, thereby generating a sentence.
- Parts-of-speech alone are not enough to guarantee grammaticality, so the sentences produced by the RTN are still very ungrammatical.
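The RTN-style procedure above can be sketched in a few lines of Python. The hand-tagged mini-corpus and the names are invented for the example (a real run would use a POS tagger over actual texts, as described above); note that only the tag-to-words map is needed for generation, which is also why the output is so ungrammatical.

```python
import random
from collections import defaultdict

# Tiny hand-tagged stand-in for a POS-tagged corpus.
tagged_sentences = [
    [("the", "DET"), ("cat", "NOUN"), ("licks", "VERB"), ("milk", "NOUN")],
    [("a", "DET"), ("dog", "NOUN"), ("frolics", "VERB")],
]

tag_to_words = defaultdict(list)  # e.g. 'VERB' -> ['licks', 'frolics']
rules = []                        # every tag sequence any sentence contains
for sent in tagged_sentences:
    rules.append([tag for _, tag in sent])
    for word, tag in sent:
        tag_to_words[tag].append(word)

def generate_sentence():
    """Pick a random tag sequence, then fill each slot with a random word of that tag."""
    return " ".join(random.choice(tag_to_words[tag]) for tag in random.choice(rules))

print(generate_sentence())
```

A possible output here is "a cat frolics", but equally "the milk licks" — which is exactly the grammaticality problem noted above.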
Possible Solutions/To Do
- Create a set of rules by hand.
- Percentage-based text generation with a large corpus (e.g., google). (The author has forgotten what this means.)
- Match rarities (capitalization, slang) and content (e.g., the kitchen).
- Work backwards from writer identification (removing headers and identifiers).
- Use Nomlex (or Celex or NomBank or PropBank or...).
- Start by generating NPs, PPs, etc.
- Determine user-specific collocations (using relative frequency ratios).
- Microsoft(!)'s AMALGAM.
- Statistical analyses are being performed on Lyric Doshi's infamous and poopy writings. The goal, good sirs, is unknown. O:-) (To do: vowel elongation (yaaay), likey?, stock phrases (oh boy!, goodness, sir))
- Work on Smoke Alarm.