Cruft Alarm Notes
Revision as of 13:32, 25 June 2006
Cruft Alarm is a sophisticated computer program that screens several mailing lists for desirable items. It is written in the Ruby programming language.
- 1 Classes
- 2 MySQL
- 3 Natural Language Processing
- 4 Miscellaneous
- 5 Website
- 6 To Do (General)
- 7 See Also
Classes
Two classes, crufter.rb and post.rb, form the basis of Cruft Alarm. Other classes include db_reader.rb, froogle_item.rb, and scavenger.rb.
The crufter class connects to the cruftster-at-gmail.com email account. If there are any new messages, it gives them to the post class, which manages the information in each message.
The post class manages the information in each cruft email: it cleans up the message, then extracts a location and a list of items from it.
The cruftlists_db_manager class processes information from a MySQL database.
The cruftlist class manages the information in each cruft list (id, title, description, items, and subscribers).
The subscriber class manages the information for each subscriber (id, athena username, email address, and phone number).
Given an item, the froogle_item class uses froogle.com to find the first and average prices for the item. Specifically, the first price is the first price listed when sorting by relevance, and the average price is the average of the ten prices listed on the first page (again when sorting by relevance).
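The price summary described above can be sketched as follows. This is a minimal illustration, not the actual froogle_item code: the method name summarize_prices is hypothetical, and it assumes the prices from the first results page have already been fetched and parsed into an array of numbers.

```ruby
# Hypothetical helper: summarize prices that were already scraped from the
# first results page (sorted by relevance). Fetching and parsing are not shown.
def summarize_prices(prices)
  return { first: nil, average: nil } if prices.empty?

  {
    first: prices.first,                      # first price listed
    average: prices.sum.to_f / prices.size    # mean of the listed prices
  }
end

summary = summarize_prices([10.0, 20.0, 30.0])
puts "first: #{summary[:first]}, average: #{summary[:average]}"
```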
The usefulness of this class is quite debatable, given the vague nature of reuse postings and the wide variety of items on froogle; that is, it is unlikely that the search results will match the reuse item very well.
It may be desirable in the future to use froogle_item to also find the category of an item.
The scavenger class supports automatic claiming. Given an array of items, it can also send emails notifying others that these items have been taken.
Text generation using Markov Chains is currently in somewhat top-secret development; see Fojiba-Jabba Notes for more information.
MySQL
Cruft Alarm uses a "cruftlists" MySQL database to store cruftlists and a "subscribers" MySQL database to store subscribers.
"cruftlists" contains a single "cruftlist" table, with the fields 'id', 'title', 'description', 'items', and 'subscribers'.
"subscribers" contains a single "subscriber" table, with the fields 'id', 'athena', 'email', and 'phonenumber'.
Natural Language Processing
Cruft Alarm is rumored to have passed the Turing Test.
Cruft Alarm requires dictionaries in order to work. Cruft Alarm's current dictionaries are the following:
- Cruft: this is a list of desirable cruft items (e.g., Pentium IV).
- Food: this is a list of food items (e.g., Bertucci's pizza).
- Location: this is a list of locations at MIT (e.g., Walcott).
- Next: this is a list of words X such that if the phrase X Y appears in the email, where Y is any other word, then the whole phrase X Y should be returned (e.g., if 'outside 10-250' appears in an email, and 'outside' is a word in Next, then it is desirable to return 'outside 10-250').
- Prev: this is analogous to the Next dictionary.
- Remove: this is a list of words to remove from an email (e.g., 'of'). The reasons for removing these words may be discussed later.
The necessity of the Next, Prev, and Remove dictionaries is currently in question.
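As an illustration of how the Next dictionary is meant to behave, here is a rough sketch. The names NEXT_WORDS and next_phrases are made up for this example, and the sample words are placeholders rather than the real dictionary contents.

```ruby
require 'set'

# Hypothetical sample; the real Next dictionary's contents are not listed here.
NEXT_WORDS = Set.new(%w[outside near]).freeze

# Return every two-word phrase "X Y" in which X is in the Next dictionary.
def next_phrases(text, next_words = NEXT_WORDS)
  text.split
      .each_cons(2)                                       # adjacent word pairs
      .select { |x, _y| next_words.include?(x.downcase) }
      .map { |x, y| "#{x} #{y}" }
end

puts next_phrases('Free monitor outside 10-250').inspect
```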
(cf. the 'get_items' method in the post class) In order to find the cruft items in a message, the post class checks each of the words in the message against the words in the cruft dictionary.
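A minimal sketch of this kind of dictionary lookup is shown below. It is not the real get_items method: find_items and CRUFT_WORDS are hypothetical names, the sample entries are invented, and only single-word dictionary entries are handled (so an entry like "Pentium IV" would need extra work).

```ruby
require 'set'

# Hypothetical sample entries; the real cruft dictionary is not reproduced here.
CRUFT_WORDS = Set.new(%w[monitor keyboard lamp]).freeze

# Return the message words found in the cruft dictionary, in order of first
# appearance and without duplicates. Matching is case-insensitive.
def find_items(message, dictionary = CRUFT_WORDS)
  message.downcase
         .scan(/[a-z0-9]+/)                        # split into bare words
         .select { |word| dictionary.include?(word) }
         .uniq
end

puts find_items('Old monitor and a Lamp, plus another monitor').inspect
```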
(cf. the 'get_location' method in the post class) In order to find where the items in a message have been posted, the post class does the following:
- It looks for any words beginning with 'NE', 'E', or 'W' (e.g., NE42, E50, and W20).
- It looks for any words containing a hyphen (e.g., 26-100), but which are not telephone numbers.
- It checks the words in the message against the words in the location dictionary.
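The three checks above could be combined roughly as follows. This is a sketch under stated assumptions, not the real get_location method: the names are hypothetical, the building pattern additionally assumes that digits follow the 'NE', 'E', or 'W' prefix, and the telephone pattern (123-4567 or 123-456-7890) is a guess at what counts as a phone number.

```ruby
require 'set'

# Hypothetical sample; the real location dictionary is not reproduced here.
LOCATION_WORDS = Set.new(%w[walcott lobby]).freeze

BUILDING = /\A(?:NE|E|W)\d+\z/           # assumes digits follow the prefix
PHONE    = /\A(?:\d{3}-)?\d{3}-\d{4}\z/  # assumed phone-number shapes

def find_locations(message, dictionary = LOCATION_WORDS)
  # Strip leading and trailing punctuation from each word, keeping inner hyphens.
  words = message.split.map { |w| w.gsub(/\A\W+|\W+\z/, '') }

  hits = words.select do |w|
    w.match?(BUILDING) ||
      (w.include?('-') && !w.match?(PHONE)) ||
      dictionary.include?(w.downcase)
  end
  hits.uniq
end

puts find_locations('Monitor in E50, also 26-100. Call 617-253-1000.').inspect
```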
Cruft Alarm also contains a thesaurus.
- Cruft items are often posted in list form. For example, a reuse email may go as follows:
    I have accumulated too much crap. Following items will be left in EC, in the Wood stairwell (on the West parallel, closest to the triangle building). In a box...look for it!
    - lamp with a missing foot. Comes with light bulb!
    - 50 cent bulletproof DVD with over 50 songs and 12 music videos!
    - mechanical panda bear, does all sorts of tricks. comes with a bottle
    - Maxwell House Hazelnut coffee + godiva chocolate box filled with splenda packets
    - antique hourglass ...actually an hour!
    - Blueberry shower gel
    - book: Flaubert - "Three Tales"
    - book: Martin Page - "How I Became Stupid"
    - Red Gel toothpaste
Find a way to extract items from such lists.
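One possible starting point, offered only as a sketch: if the line breaks of the original email are still available, each line that starts with a dash can be treated as one item. The method name extract_list_items is hypothetical, and this approach is not something Cruft Alarm currently does.

```ruby
# Collect the text of each line that starts with a dash bullet.
# Assumes the email body still has its original line breaks.
def extract_list_items(body)
  body.each_line
      .map(&:strip)
      .select { |line| line.start_with?('- ') }
      .map { |line| line.sub(/\A-\s+/, '') }   # drop the leading dash only
end

body = <<~EMAIL
  Following items will be left in EC:
  - lamp with a missing foot
  - book: Flaubert - "Three Tales"
  - Red Gel toothpaste
EMAIL

puts extract_list_items(body).inspect
```

Dashes inside an item (as in the book title) are left alone because only the dash at the start of a line is treated as a bullet.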
Ideas to Improve
- Froogler can determine the category of a word. So one way to find all the items in a post may be to search froogle for every word in the post (or every two/three/four/etc. words in a post), and see which ones are in a Computer/Electronics/etc. category.
- Use WordNet.
- Implement relations such as (10 inches, height, monitor).
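For the first idea, the candidate phrases to look up (every word, every pair of words, and so on) could be generated as in the sketch below. The method name candidate_phrases is hypothetical, and the category lookup itself is not shown.

```ruby
# Every run of 1..max_len consecutive words, as candidate search phrases.
def candidate_phrases(text, max_len = 4)
  words = text.split
  (1..max_len).flat_map do |n|
    words.each_cons(n).map { |group| group.join(' ') }
  end
end

puts candidate_phrases('dell flat monitor', 2).inspect
```

Note that the number of searches grows quickly with the length of the post, which may matter if each phrase requires a separate web request.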
The post class removes HTML tags, punctuation, and signatures. The cheap hack it uses to remove signatures is to remove everything after 5 or more spaces. (After removing HTML tags and punctuation, it looks like signatures, and nothing else, follow 5 or more spaces.)
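The cleanup steps can be sketched as follows. This is an approximation rather than the actual post class code: clean_body is a hypothetical name, and exactly which characters count as punctuation is an assumption (here, anything matched by the POSIX punct class).

```ruby
# Strip HTML tags and punctuation, then cut the signature by removing
# everything from the first run of five or more spaces onward.
def clean_body(raw)
  text = raw.gsub(/<[^>]*>/, '')        # drop HTML tags
  text = text.gsub(/[[:punct:]]/, '')   # drop punctuation (assumed definition)
  text.sub(/ {5,}.*/m, '').strip        # drop everything after 5+ spaces
end

puts clean_body('<p>Free lamp, in E50!</p>     John Doe 555-1234')
```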
Cruft Alarm ignores all replies and all messages from Helen Ray.
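A filter along these lines might look like the sketch below. The names are hypothetical, and two things are assumed: that a reply can be recognized by a subject starting with "Re:", and that the sender is available as a plain display name.

```ruby
IGNORED_SENDERS = ['Helen Ray'].freeze

# True if the message should be skipped: either it comes from an ignored
# sender or its subject marks it as a reply (assumed "Re:" prefix).
def ignore_message?(sender, subject)
  IGNORED_SENDERS.include?(sender) || subject.strip.match?(/\Are:/i)
end

puts ignore_message?('Someone Else', 'Re: Free chairs')
```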
Gmail is used as the email account of choice for several reasons:
- Using an Athena account would require hardcoding in someone's username and password.
- Yahoo! Mail and Hotmail do not support POP3 (or something like that).
By using Gmail to check for new messages, Cruft Alarm in fact receives emails considerably faster than normal Athena accounts do (contrary to what was initially believed). According to Ruth Shewmon, this is because Gmail updates its email servers more often than Athena does (or something like that).
Cruft Alarm connects to its Gmail account through a Ruby Gmail library.
Cruft Alarm receives e-mails from the Gmail account cruftster-at-gmail.com. cruftster-at-gmail.com is subscribed to the Athena mailing list, cruftalarm. cruftalarm, in turn, is subscribed to the Athena mailing lists reuse and free-food.
Attempts were made to add the-companion (a previous Athena mailing list) to freefood, but they ended in disaster. (It does not seem possible to blanche onto freefood, and it is not a mailman mailing list, so the only way to subscribe to it is through moira.) One consequence of these attempts is that cruftalarm-at-gmail.com is now banned from reuse.
Current mailing lists used by Cruft Alarm are cruftster, colonbrander, fojiba, and luigicasanueva. Each of these comes in a (gmail, Athena mailing list) pair.
cruftster is used to receive reuse and free-food emails; colonbrander (alias, Colin Brander), fojiba (alias, Ryo Fojiba), and luigicasanueva (alias, Luigi Casanueva) are used by the scavenger class to claim and take items.
Website
The Cruft Alarm website is located at cruftalarm.mit.edu. It allows visitors to manage their subscriptions to cruftlists, edit cruftlists, and create their own cruftlists.
A cruftlist is a list of items associated with a (non-MIT) mailing list. For example, there may be a 'food' cruftlist, with items such as "pizza", "crackers", and "vegetables", linked to a list of its subscribers.
The goal of a cruftlist is to allow one to obtain desirable reuse items, without a daily deluge of reuse posts. More specifically, Cruft Alarm checks each reuse post to see if an item from a cruftlist is contained within it; if so, it emails the cruftlist, so that each subscriber can be notified of the item. In the future, Cruft Alarm may also make a phone call to each subscriber of the list.
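The matching step can be sketched as below. This is an illustration only: the Cruftlist struct is a hypothetical in-memory stand-in for rows of the cruftlist table, and a case-insensitive substring test is assumed as the meaning of "contained within" a post.

```ruby
# Hypothetical in-memory stand-in for rows of the cruftlist table.
Cruftlist = Struct.new(:title, :items)

# Return the cruftlists that have at least one item mentioned in the post.
def matching_cruftlists(post_text, cruftlists)
  text = post_text.downcase
  cruftlists.select do |list|
    list.items.any? { |item| text.include?(item.downcase) }
  end
end

food = Cruftlist.new('food', %w[pizza crackers vegetables])
tech = Cruftlist.new('tech', ['monitor'])
puts matching_cruftlists('Leftover Pizza in 10-250', [food, tech]).map(&:title).inspect
```

Each cruftlist returned here would then be emailed, so that its subscribers are notified of the item.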
It is unknown who should be able to edit which cruftlists.
To Do (General)
- Use databases to keep track of previous reuse statistics.
- Do the stupid lists thing.
- Add telephone capabilities.
- Implement stemming (using Snowball?).
- Write user_subscribed.php and user_unsubscribed.php pages.
- Add more to thesaurus.
- Find something better than the stupid gmailer library.