A few minutes ago, I needed to send a note to Russell about Yahoo Desktop Search. Specifically, I had to find a URL for an internal site that he wanted to see. But I couldn't remember what the URL was or who sent it to me. All I knew was that it was in my e-mail inbox. Somewhere.

So I ran a quick grep (command-line search) for "http:" and got a big list of URLs and URL-like things from my inbox. I was able to further refine the search using the word "desktop" and found the URL in no time.
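
Something like this little Python sketch does the job the grep pipeline did; the mbox path and the "desktop" filter are placeholders, and the URL regex is deliberately crude:

import mailbox
import re

URL_RE = re.compile(r'https?://[^\s<>"]+')

def urls_in_mbox(path, keyword=None):
    # Walk every text/plain part of every message, yielding each URL once.
    seen = set()
    for msg in mailbox.mbox(path):
        for part in msg.walk():
            if part.get_content_type() != "text/plain":
                continue
            raw = part.get_payload(decode=True) or b""
            text = raw.decode(part.get_content_charset() or "utf-8",
                              errors="replace")
            for url in URL_RE.findall(text):
                if keyword and keyword.lower() not in url.lower():
                    continue
                if url not in seen:
                    seen.add(url)
                    yield url

for url in urls_in_mbox("Mail/inbox", keyword="desktop"):
    print(url)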

A moment later, a realization struck me:

I do this a lot!

In a sense, URLs are just another type of e-mail attachment. Someone can either send you the content directly or send you a URL that points to it.

What I really need is a tool that acts like a personal del.icio.us, automatically fed by both the URLs embedded in my e-mail and those in my browser history. It could keep a database of those URLs, counting how often I visit them and how often they appear in e-mail that I send or receive. And if it provided the ability to tag and annotate the URLs, all the better.
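
The heart of such a tool is just one table. Here's a minimal sketch using SQLite; every name in it is invented for illustration, not taken from any real tool:

import sqlite3

db = sqlite3.connect("urldb.sqlite")
db.executescript("""
CREATE TABLE IF NOT EXISTS urls (
    url       TEXT PRIMARY KEY,
    visits    INTEGER NOT NULL DEFAULT 0, -- from browser history
    mail_hits INTEGER NOT NULL DEFAULT 0, -- times seen in sent/received mail
    tags      TEXT NOT NULL DEFAULT '',   -- space-separated, del.icio.us style
    note      TEXT NOT NULL DEFAULT ''    -- free-form annotation
);
""")

def record_mail_hit(url):
    # Insert the URL if it's new, then bump its mail counter.
    db.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))
    db.execute("UPDATE urls SET mail_hits = mail_hits + 1 WHERE url = ?",
               (url,))

def annotate(url, tags, note=""):
    db.execute("UPDATE urls SET tags = ?, note = ? WHERE url = ?",
               (tags, note, url))

record_mail_hit("http://example.com/")
annotate("http://example.com/", "desktop search", "the URL Russell wanted")
db.commit()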

In fact, if it were a private "satellite" version of del.icio.us that could check in with the larger, public del.icio.us, that'd be even better. The idea is that for public URLs which end up in my local (private) database, I could still benefit from the collective tagging and annotation efforts of those in the outside world.
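
Building on the sketch above, the satellite behavior might look something like this; fetch_public_tags is a stub standing in for whatever the public del.icio.us API would return:

def tags_for(url, fetch_public_tags):
    # Merge locally assigned tags with whatever the outside world says.
    row = db.execute("SELECT tags FROM urls WHERE url = ?", (url,)).fetchone()
    local = set(row[0].split()) if row and row[0] else set()
    public = set(fetch_public_tags(url))  # stub for a del.icio.us API call
    merged = local | public
    if row:
        db.execute("UPDATE urls SET tags = ? WHERE url = ?",
                   (" ".join(sorted(merged)), url))
        db.commit()
    return merged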

I can imagine a second generation of this system that goes a step further: fetching the web content that each of the URLs points to, storing a cached copy locally, and indexing it just like a traditional web search engine might. Bonus points for integration with something like the Slogger extension for Firefox, so that it doesn't have to store duplicate data.
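
Here's a crude sketch of that second generation, again in Python: no HTML parsing, no ranking, just enough to show the shape of it:

import hashlib
import pathlib
import re
import urllib.request
from collections import defaultdict

CACHE = pathlib.Path("urlcache")
CACHE.mkdir(exist_ok=True)
index = defaultdict(set)  # word -> set of URLs containing it

def fetch_and_index(url):
    # Cache the raw page under a hash of its URL, then index its words.
    body = urllib.request.urlopen(url, timeout=10).read()
    (CACHE / hashlib.sha1(url.encode()).hexdigest()).write_bytes(body)
    text = re.sub(rb"<[^>]+>", b" ", body).decode("utf-8", errors="replace")
    for word in set(re.findall(r"[a-z0-9]{3,}", text.lower())):
        index[word].add(url)

def search(word):
    return sorted(index.get(word.lower(), ()))

Since Slogger is already saving copies of pages as I browse, the caching step could just point at Slogger's archive instead of fetching everything again.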

If I had a copy of the source code for del.icio.us handy, I could probably get the first cut of this going in a day's time. That might be a day well spent.

Hmm. Between Firefox (with Slogger) and Thunderbird, it might even be possible to do this in a cross-platform way.

Posted by jzawodn at December 10, 2004 12:24 AM

Reader Comments
# Philip Tellis said:

Here's a start (not tested):

.procmailrc:

# append every URL found in the message body to a running list
:0 Bc
* http://
| grep -o 'http://[^ "<>]*' | sort | uniq >> .emailed_urls

Put the rest in cron.

on December 10, 2004 03:01 AM
# MrChucho said:

Coincidentally, I was just thinking about this very same thing yesterday. These are the kinds of things I wanted the moment I started using del.icio.us.

RMC

on December 10, 2004 04:18 AM
# Benjamin Reitzammer said:

That's exactly what I'm trying to implement with my open source project roosster (http://roosster.org/dev). It's still at quite an early stage of development, but I'm planning del.icio.us and Simpy integration, as well as Slogger integration. So you could submit URLs to del.icio.us and they would be synced to your private list of URLs some time later.
Some months ago I was thinking along exactly the same lines as you do in your post, Jeremy, and that led me to implement roosster.

on December 10, 2004 05:44 AM
# rich said:

The tricky part is that at least half of the time you're looking for a URL by context instead of content -- "the URL from Bob's message about SVN", even though the URL itself doesn't contain /svn/i.
I suspect at that point it just reduces to making email archives quickly searchable.

on December 10, 2004 06:24 AM
# dan said:

Have you looked at Zoe? It may be close to what you're looking for: http://guests.evectors.it/zoe/itstories/home.php?data=stories

on December 10, 2004 07:44 AM
# Larry said:

I started doing something similar a few weeks ago: I have a procmail rule that greps URLs out of emails from people I know (to avoid the spam) and turns them into posts in a private blog. I also grab links that I send or receive through AIM or ICQ or IRC (using aimsniff and dircproxy) and turn those into linkblog posts as well, but in separate categories. I should probably condense all that into a nice package and release it sometime... :)
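
In Python terms, the mail half of that boils down to something like this sketch; the whitelist is invented, and the real version would create a blog post rather than print:

import email
import re
import sys
from email.utils import parseaddr

FRIENDS = {"bob@example.com", "alice@example.com"}  # people I actually know
URL_RE = re.compile(r'https?://[^\s<>"]+')

# procmail pipes one message to stdin.
msg = email.message_from_file(sys.stdin)
sender = parseaddr(msg.get("From", ""))[1].lower()
if sender in FRIENDS:
    body = msg.get_payload(decode=True) or b""
    for url in URL_RE.findall(body.decode("utf-8", errors="replace")):
        print(url)  # the real script turns these into linkblog posts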

on December 10, 2004 09:46 AM
# Mark Eichin said:

That's another use for a private del.icio.us lookaside that I hadn't thought of. I've been working on a del.icio.us API proxy, so that I can intercept requests from the Mac "Cocoalicious" client, cache the requests and responses -- and then augment/filter (i.e. lie about :-) them on the way through. The first obvious example was to add a "private" tag which would divert posts only to the local stash and not upstream, and then add those back in when search requests are made. The next is to add actions to tags, though you could already do that by having a robot read the feed for the tag... after that, an "offline" mode seems at least a little bit useful.

Right now it's about 150 lines of python that clone-and-pass the API, enough to do a "lookaside" that just logs actions. The actual local store comes next.
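
The pass-through stage is shaped roughly like this stripped-down sketch; it's not the actual proxy, and the upstream URL is an assumption:

import http.server
import urllib.error
import urllib.request

UPSTREAM = "https://api.del.icio.us"  # assumed upstream, for illustration

class Lookaside(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        print("lookaside:", self.path)  # the "just logs actions" stage
        headers = {}
        if self.headers.get("Authorization"):
            headers["Authorization"] = self.headers["Authorization"]
        req = urllib.request.Request(UPSTREAM + self.path, headers=headers)
        try:
            with urllib.request.urlopen(req) as resp:
                status, body = resp.status, resp.read()
        except urllib.error.HTTPError as err:
            status, body = err.code, err.read()
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

http.server.HTTPServer(("localhost", 8080), Lookaside).serve_forever()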

This doesn't help someone who only uses the web interface, and I don't want to clone that. (I'd rather say "see, this is useful" with the API case and then see what Joshua [and everyone else] can do to help. :-)

on December 11, 2004 12:42 AM
# Christopher said:

rich: That's what Beagle is for... ;)

on December 11, 2004 02:35 PM
# Tantek said:

"It could keep a database of those URLs, count the frequency with which I visit them..."

Jeremy,

Imagine doing that across all your applications that let you read/browse URLs (browsers, feed readers, etc.).

The first thing you would need for that is a format (preferably an open standard) for exchanging/syncing that information among them.

This is exactly what Attention.XML does for you. It's an open standard for automatically letting you keep track of what sites/feeds/blogs you visit, and within them, which pages/items/posts you visit, how often, and for how long.


"...as well as how often they appear in e-mail that I send or receive."

This is an excellent suggestion for how email programs could add to your Attention.XML file, or for how a post-processing script (like the one you have written) could extract this info from your email and add it.


"And if it provided the ability to tag and annotate the URLs, all the better."

Attention.XML lets you tag a blog URL with your social networking relationship to the owner (using XFN). You can also tag any URL with whether you liked it or not (via Vote Links).

Your suggestion of a "general" tagging/annotation field (a la Flickr/del.icio.us) for any page or post is an excellent one, and I'll be sure to incorporate it into the next draft of Attention.XML.

Tantek

on December 12, 2004 08:09 AM
# Otis said:

Email -> Simpy ( http://www.simpy.com ) has been on my TODO list for a loooong time. Now that Simpy has the REST API, this should be doable even from the outside (e.g. from Thunderbird, procmail, etc.)

on December 12, 2004 10:52 AM
# Bryan Pietrzak said:

I wonder if desktop search is going to open us up to some serious trouble.

on December 13, 2004 08:19 AM
# Aristus said:

There is an archiver for doing research on the internet called Dowser Web Search (I help develop it) -- it's basically a metasearch engine, caching proxy, and database rolled together. It would be easy to make a plugin that feeds Dowser links from whatever source, and perhaps also checks whether the URL is already posted to del.icio.us or StumbleUpon.

on December 13, 2004 09:10 AM
# Scott Rubin said:

This is very cool. I've been working on something similar recently that I call glues (Gaim URL Extractor Syndication). Right now it's a short Ruby script I run from cron: it parses all my Gaim log files for the current day and creates an RSS feed of all the links it can find. This way I can keep track of what links people send me and which ones I send. Eventually I'll add ranking for popularity and such. I'm also going to make it a CGI that can generate the feed on the fly; static RSS feeds are silly.
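
Sketched in Python rather than Ruby, and with the log location and file naming assumed, the core looks something like:

import datetime
import pathlib
import re
from xml.sax.saxutils import escape

LOG_DIR = pathlib.Path.home() / ".gaim" / "logs"  # location assumed
URL_RE = re.compile(r'https?://[^\s<>"]+')
today = datetime.date.today().isoformat()

# Collect every URL from today's logs, in order, without duplicates.
links = []
for log in LOG_DIR.rglob(today + "*"):
    if log.is_file():
        links.extend(URL_RE.findall(log.read_text(errors="replace")))

items = "\n".join(
    "<item><title>%s</title><link>%s</link></item>" % (escape(u), escape(u))
    for u in dict.fromkeys(links)
)
print(f"""<?xml version="1.0"?>
<rss version="2.0"><channel>
<title>glues</title><link>http://example.com/</link>
<description>links from today's IM logs</description>
{items}
</channel></rss>""")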

on December 16, 2004 05:32 AM
# Danny said:

Check out Connotea, a del.icio.us-alike for science material. I suspect the source is only an email away:

http://www.connotea.org/about

Tantek: they use dc:subject for arbitrary tags. Any neat way for that to go into the Attention.XML profile?

on December 17, 2004 04:43 AM
# duncan said:

I had a similar idea, a personal proxy server (written in Python) that indexes all of your http traffic:

http://www.suttree.com/code/pps/

on December 19, 2004 05:05 AM
# Buzz Andersen said:

FWIW, we are moving Cocoalicious in the direction of having its own local database, rather than just being essentially a frontend for the del.icio.us web service. Once this happens, we can provide the option for private links.

Some of the other things Jeremy is talking about (visit counts and so forth) are also things I've contemplated adding once we're less dependent on del.icio.us.

on January 5, 2005 12:08 AM
# Trevor said:

Tim Martin just GPL'd the code to his URL repository, linkmonger, which could easily be the base for just such a thing:

http://sourceforge.net/projects/linkmonger


on August 22, 2005 02:41 PM
# tucex said:

Thanks, Trev, for the link!

on January 23, 2006 07:41 PM
# Wesley said:

Thanks for the backlink.

on March 25, 2009 03:44 PM