Peter Norvig (Google), High Order Bit at Web 2.0 (by Jeremy Zawodny)

Statistical machine translation: looking at text in one language and producing the same information in another. Doing that the traditional way, you need to grok the syntax and semantics of both languages, have a big dictionary, etc. Google has access to lots of CPU and lots of text, so they took a statistical approach using word pairs, phrases, etc.
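
A minimal sketch of the word-pair idea, to make it concrete. This is not what Google runs: the talk gave no implementation details, so the toy corpus, the function names, and the choice of a Dice association score are all assumptions made for illustration. It counts how often a source word and a target word show up in the same aligned sentence pair, then picks the target word most strongly associated with each source word.

```python
from collections import Counter

# Toy parallel corpus (invented sentence pairs, illustration only).
PARALLEL = [
    ("the house", "la maison"),
    ("the car", "la voiture"),
    ("a house", "une maison"),
]

def train(pairs):
    """Count words and (source, target) co-occurrences across aligned sentences."""
    src_counts, tgt_counts, pair_counts = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_counts.update(src_words)
        tgt_counts.update(tgt_words)
        pair_counts.update((s, t) for s in src_words for t in tgt_words)
    return src_counts, tgt_counts, pair_counts

def best_translation(word, model):
    """Return the target word with the highest Dice score, or None if unseen."""
    src_counts, tgt_counts, pair_counts = model
    scores = {
        t: 2 * c / (src_counts[s] + tgt_counts[t])
        for (s, t), c in pair_counts.items()
        if s == word
    }
    return max(scores, key=scores.get) if scores else None

MODEL = train(PARALLEL)
print(best_translation("house", MODEL))
```

With more text the counts get more reliable, which is the point of having lots of CPU and lots of text. Real systems also model phrases and word order, which this sketch ignores.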

Example of a news story translated from Arabic to English.

Named entity extraction (people, companies, products, etc.). There are lots of relationships to find in the text they've got. They started with simple patterns in "easy" sentences. If the text contains a phrase like "such as", they use it. It helps them extract facts like "HP is a computer manufacturer."
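
Here's roughly what a "such as" pattern could look like in code. It's a sketch under assumptions: the regular expression, the naive strip-the-trailing-s singularization, and the `extract_facts` name are all hypothetical, and the talk didn't say how Google's patterns are actually written. It only handles "easy" sentences where capitalized names follow "such as".

```python
import re

# "<category> such as <Name>[, <Name>][ and <Name>]": one or two category words,
# then a list of capitalized names.
SEPARATOR = r"(?:, and |, | and | or )"
PATTERN = re.compile(
    r"\b(\w+(?: \w+)?) such as ([A-Z]\w*(?:" + SEPARATOR + r"[A-Z]\w*)*)"
)
SPLIT = re.compile(SEPARATOR)

def extract_facts(text):
    """Return (entity, category) pairs found via the "such as" pattern."""
    facts = []
    for match in PATTERN.finditer(text):
        category = match.group(1).lower()
        if category.endswith("s"):
            category = category[:-1]  # naive singular form, an assumption
        for entity in SPLIT.split(match.group(2)):
            facts.append((entity, category))
    return facts

sentence = "Computer manufacturers such as HP and Dell cut prices."
for entity, category in extract_facts(sentence):
    print(f"{entity} is a {category}")
```

A pattern this simple misses most sentences and gets some wrong, but run over enough text it can still collect plenty of facts.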

Word clusters are next. They build a Bayesian network of words and word clusters.
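
The talk didn't spell out the network's structure, so here is only a toy version, assuming the simplest possible shape: one hidden cluster that generates the observed words, with words independent given the cluster. The cluster names and every probability below are invented. Given some words, Bayes' rule tells you which cluster most likely produced them.

```python
# Hypothetical toy model: P(cluster) and P(word | cluster). All numbers invented.
PRIOR = {"politics": 0.5, "sports": 0.5}
WORD_GIVEN_CLUSTER = {
    "politics": {"election": 0.4, "vote": 0.4, "score": 0.2},
    "sports": {"election": 0.1, "vote": 0.1, "score": 0.8},
}
UNSEEN = 1e-6  # small probability for words a cluster has never produced

def cluster_posterior(words):
    """Return P(cluster | words), assuming words are independent given the cluster."""
    joint = {}
    for cluster, prior in PRIOR.items():
        p = prior
        for word in words:
            p *= WORD_GIVEN_CLUSTER[cluster].get(word, UNSEEN)
        joint[cluster] = p
    total = sum(joint.values())
    return {cluster: p / total for cluster, p in joint.items()}

posterior = cluster_posterior(["election", "vote"])
print(max(posterior, key=posterior.get))
```

In a real system the probabilities would be learned from text rather than typed in, and a word could belong to many clusters at once.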

On-line demo time. Interactive use of word clusters. Using "george bush" and "john kerry". Amusing results. "That's what the web says."

See Also: My Web 2.0 post archive for coverage of all the other sessions I attended.

Posted by jzawodn at October 07, 2004 11:13 AM

Reader Comments

# Ole Aamot said:

Efficient clustering of words has already been done.

McCallum, Andrew Kachites. "Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering." http://www.cs.cmu.edu/~mccallum/bow. 1996.

on May 18, 2005 04:33 AM
