A few days ago, I noted that I was being stupid. In that post, I made a comment about using Devel::Size to figure out what Perl was doing when it kept eating up all my memory. I sort of hinted at the fact that I didn't really believe what Devel::Size was telling me.
As it happens, the author of Devel::Size, Perl Hacker Dan Sugalski read my comment and asked what my problem with Devel::Size was. After I got over my surprise, I sent him the script I was using and explained how it was eating memory in a hurry.
More specifically, I wrote a really, really simple script that read in file of queries (the ones that are typed into the search box on www.yahoo.com every day). It wasn't much more complicated than this:
while (<>)
{
chomp;
$query_count{$_}++;
}
And when it was done, it'd spit the counts and queries out so they could be processed by some other code.
The problem was that it never finished. It always ran out of memory around 10 million queries. But I needed to do roughly 40 million or so. I instrumented the code with some calls to Devel::Size to see if the hash was really as big as it seemed.
Anyway, back to Dan. He tinkered around a bit and was able to reproduce the problem. It was two-fold: (1) Devel::Size itself used up more memory than expected, and (2) Perl's just not all that efficient with hashing.
He explained his findings via e-mail and I thought to myself, "Self: you should blog this stuff." Luckily, I I was lazy. Dan has summarized much of it on his blog so that I don't have to try and parphrase him.
The moral of the story? There are several. First, blogging is good. Second, Perl's hashes are inefficient. You need A LOT of memory if you intend to hash tens of millions of keys. And finally, Dan may have been inspired to make Perl 6's hashes a little lighter.
I re-implemented my code to loop over the file 36 times. Once for each digit and letter of the alphabet (the queries were already lower-cased). It's slow and crude, but it works.
Posted by jzawodn at February 23, 2003 10:29 PM
"It's slow and crude, but it works"
But enough about me...
Why not just use a single-column mysql table that stores each logged query in its own row?
I'd approach it like this:
while (<>)
{
chomp;
# begin pseudo-code
INSERT INTO queries VALUES ($_);
# end pseudo-code
}
Then, when it's done:
SELECT query_value, COUNT(*) FROM queries GROUP BY query_value;
or, if you don't care about the count,
SELECT DISTINCT query_value FROM queries;
and pipe that into a text file.
I'm always amused when folks suggest using tools like unix sort or an SQL database to do the types of data crunching you're talking about.
Sure, if you've got 20 thousand records, it'll all work fine. But when you've got terabytes of data to crunch, nothing but C/C++ will do the trick.
I remember writing some code a couple of years ago (when I was in the Ads group working with Data Mining, doing similar things to what you're doing) and realizing that I could get a 10% speedup by ditching scanf() and writing a parser that used strsep(). Then I ended up ditching that and writing a parser that read data in blocks, iterating over it character by character. Why? So I could get the thing to run in 3 hours instead of 4.
Sure, it's 300 line of C code instead of 3 lines of Perl. But it's way faster. And in some cases, speed matters.
If the list of keys (queries) have no locality of reference, in the naive approach, the random seeks to update the counters is the bottleneck. Hence, the importance of keeping all updates in-memory, and only writing them out to disk periodically. The choice of language, within reason, is really irrelevant here.
