Marc's The Video Library of Alexandria post on O'Reilly Radar connected a set of dots for me that I can't believe I never connected on my own.

In that, it certainly seems like an appropriate purchase for Google, much like DejaNews before it.

That one sentence made me realize that Google has been buying up a lot of digital information archives and repositories of various types: DejaNews (Usenet News), Keyhole / Google Earth (Satellite data), and YouTube (Video). When you combine that with their archive(s) of the web, the growing mountain of email stored in Gmail's perpetually expanding mailboxes, and book scanning, it's quite impressive.

Thinking about it casually, I can come up with only a few on-line information repositories that I care about that Google doesn't own.

  1. Yahoo! Groups archives (currently not indexed by any search engine, which is a tragedy in my mind)
  2. Wikipedia
  3. The Internet Archive
  4. The Library of Congress (but they may just scan the whole damn thing)
  5. The National Weather Service

Notice that #1 is owned by a competitor, #2 and #3 can't really be bought, and the last two are paid for by US taxpayers.

A few years from now I might be convinced that Flickr belongs on that list too.

What other big sources of data would you like to see outlive the organizations that currently control (or own) them?

Posted by jzawodn at February 13, 2007 09:01 PM

Reader Comments
# Srinath said:

How about Yahoo! Answers?

on February 13, 2007 09:49 PM
# Jeremy Zawodny said:

I thought about that, but I really don't find a need for it myself. Sure, it's incredibly popular, but not really my ball of wax.

on February 13, 2007 09:57 PM
# Charles said:

There are vast online resources that are privately owned, like Bill Gates' Corbis. BillG owns everything. BillG even bought exclusive publication rights to the entire Smithsonian Museum. He has acquired the rights to dozens of museums' collections, most of which were in the public domain but now are only available for a fee through Corbis.
I'd say more but I'm in the middle of writing a big article about this.

on February 13, 2007 10:20 PM
# James Day said:

I'm wondering when someone will notice all this and decide to spawn non-profits (with restrictive compelled access and distribution covenants) to get the stuff neutral and keep it that way so it can't be bought out from under them. We really don't need a competitive war for original data sources but that is what we're in.

The authors who own the Wikipedia work can write a lot and can photograph and draw many things, but sometimes you do need the actual work and have to license some things. It's one of the current limitations of the work, particularly for worldwide distribution. A budget expressly for buying BSD-style and/or GFDL/GPL-style worldwide (and space-wide) licenses to original works would be interesting.

As Bill Gates has demonstrated, a Foundation without restrictive covenants is not actually safely public, and the Wikimedia Foundation does face financial pressure to do favored access deals and other things that restrict or differentiate access to the works it's hosting and distributing. It would be nice to have one or two well-financed copies held by different foundations dedicated solely to keeping the material available.

on February 13, 2007 11:15 PM
# Aaron Wormus said:

the best source for online guitar tabs since forever... now down thanks to a C&D from the MAFIAA. Music is a HUGE part of our "cultural archive"; if Google (or Yahoo! - hint hint) would bring OLGA back, a LOT of people would be eternally grateful.

on February 14, 2007 02:08 AM
# Nah Na Nah said:

Owns? Nah, I don't think so.

If the information is a pile of farmer's manure to be dug through, Google is the spade, not the farmer.

Google Earth is a rendering of maps from a bunch of other companies, and its search index is really the data on everyone else's site. DejaNews is a snapshot of Usenet, which can't be copyrighted outside of Europe. The book scanning is being done for the universities, with copies handed back to them. A lot of YouTube is stuff they don't own the copyright to.

There's nothing there that couldn't be separately bought, or separately scanned.

I view Google as the ultimate 'middleman deception'. The intelligence is in the websites, yet Google gets the credit.

For example:

'Message from: Ann

"I just wanted to let you know that Google may well have saved my life. My sons and I were walking home from having eaten out. A half block from my house, I felt this pressure building in my chest. Immediately, I thought, 'heart attack' and ran through how I'd been feeling that day (I had been nauseated). My first thought was, 'confirm suspicions,' and immediately, upon arriving home, I went to Google and typed in 'heart attack.' I kept thinking, 'you only have minutes...' I found a site that listed symptoms. Indeed, I was having a heart attack. I was at the Albany fire station within minutes. Five baby aspirin later, and a few squirts of nitro and I was in the ambulance on my way to the hospital. The good news is, I have no residual damage. My heart is back to normal. Thank you for providing the Google search engine. I'm sure my recovery was complete because of the speed within which I was able to get help."'


Notice she doesn't thank the doctor or nurse who carefully explained the symptoms of a heart attack attached to a page flagged with the words 'heart attack'.

on February 14, 2007 04:31 AM
# Alex said:

OFFTOPIC, but no other way...

The image placeholder under the Yahoo! St. Val flash logo returns a 404. So for everybody without Flash, the Yahoo! home page is broken. I can hardly believe it.

on February 14, 2007 06:25 AM
# Corey Thompson said:

on February 14, 2007 09:14 AM
# Eric said:

Flickr definitely belongs on the list. Honestly, I hope that's still around in 20 or 30 years... especially if every photo gets to be geotagged in addition to timestamped, enabling you to look at the evolution of almost any place on the globe over that same period.

And the other big one is blogs. Google indexes them, but a lot of them are on sites like LiveJournal and are otherwise private and inaccessible to the Googlebot.

How about music and movies? Currently, no complete digital archive of that stuff exists, mostly because the assholes at the RIAA and MPAA would never allow it; the closest thing we have to an archive is the p2p networks - but I dread to think about how much of it exists solely on decaying physical media.

Pretty much everything else I can think of, Google is already indexing.

on February 14, 2007 10:28 AM
# Cody Simms said:

Even though I now spend my days designing new services for the internet, I was a historian as a student. I've written two sizable historical research projects in my life: my master's thesis and my undergraduate thesis.

My master's thesis traced the BBC's changing attitudes toward jazz programming from the 1920s through the 1930s. While working on the project, I was able to spend a full week in the BBC archive, poring over 75-year-old handwritten memos passed back and forth between programming executives, many of which were very colorful.

My undergrad thesis looked at American journalists in China in the 1930s and 1940s. I was blessed to live within 45 minutes of the Edgar Snow archives at the University of Missouri-Kansas City. Edgar Snow was the first American to interview Mao Zedong, and he lived among the CCP for some time while they were still a band of guerillas in the Chinese wilderness. His archives contained wonderfully rich handwritten letters sent to his family over the course of a few tumultuous decades in China.

The analog world had its problems. In order for Snow's letters or the BBC memos to be valuable today, someone had to decide to save them. But what will happen to historians in 30 years when they try to sort through personal communication that has become dominated by digital formats? Will the Yahoo!s, Hotmails and Gmails of the world open up their mail archives in the future? What about companies? Will companies' massive caches of internal email all someday be opened up to the public domain?

I sometimes wonder what the prominent historians of today think about these problems. Are they concerned about the future of their field of study? If anyone knows of any prominent works on this subject that have been written by current historians, please do share.

on February 14, 2007 12:25 PM
# zmarties said:

Three major sets of info came instantly to mind, none of which are currently available to search through:

1) Old eBay auctions
2) Old real estate listings
3) Old news

All of these have immense research value, but all appear on the web, generally locked into site-specific search systems, and then expire, never to be accessible again.

(Yes, I know that some news outlets are keeping more and more older articles - but they all seem to redesign their site every few years, meaning that article URLs stop working, or they lock them up behind a pay wall - which is fine if you get all your old news from one source, but not if you need to read one article from each source, and thus have to take out memberships from every source you come across).

on February 14, 2007 12:46 PM
# Helen said:

This appears to be in line with a slowly moving trend, as Google gains access to more and more information. It will be interesting to see what this growth will mean over the long term, as people begin to analyze and understand the level of access that search engines have.

There is a round-up of this topic here, which you may also find useful:

on February 14, 2007 12:52 PM
# Doug Cutting said:

Domain registration history is not archived anywhere, and is an important part of the internet.

on February 14, 2007 01:50 PM
# Marlo said:

The Internet Movie Database (IMDb)

I always use that to find out which other movies a favorite actor was in.

on February 14, 2007 03:11 PM
# girish said:

Microsoft bug repository and patch archives :)

on February 14, 2007 03:46 PM
# sourabh niyogi said:

The Wayback Machine.

You can see how people experimented with their web sites. The crawls are quite shallow, the retrieval is slow, it doesn't have "web archeologists" commenting, it definitely needs better history viewing (à la MediaWiki, at least), and due to robots.txt restrictions it's not comprehensive enough (e.g. MySpace etc). But this web archeology is important to do, and it would be bad to see it disappear.

on February 15, 2007 02:09 AM
# Soyapi said:

with online archives.

Google's probably working on their own free IRC server version with "cool" features, with targeted ads and integrated with your Google Account, of course :)

on February 15, 2007 04:23 AM
# said:

YouTube. I keep finding rare videos and music performances from the 50's (and other time periods) that I'd never have seen otherwise, or even known where to begin looking for. Also, music producers like Just Blaze are posting "behind the scenes" videos of them working with orchestras and vocalists. It's very fascinating stuff.

I had the same feelings when Napster came on the horizon, but too bad it's gone now (the real Napster that is). Hopefully YouTube doesn't suffer the same fate.

on February 15, 2007 06:47 AM
# Yogish Baliga said:

First of all, the information available in newspaper archives is much more than any library can offer. Imagine some company owning the archives of all the major newspapers in the world. Maybe AP and Reuters have more information. The only thing is that they have not been able to monetize it.

Another thing is that people in the US are so narrow-minded that they think the world is the USA and other countries are insignificant. The information available in the USA may cover the whole world, but most of it reflects the views of Western people. There are a lot of archives in other countries that are not yet digitized, covering ancient civilizations, governments, freedom struggles, the British Raj, etc. If someone gets all of these digitized and makes them searchable, it would be 100 times more than what Google has indexed. Maybe Google could not do that search within 0.000001 seconds.

on February 16, 2007 11:40 PM
