Perl Hashes and stuff... (by Jeremy Zawodny)

A few days ago, I noted that I was being stupid. In that post, I made a comment about using Devel::Size to figure out what Perl was doing when it kept eating up all my memory. I sort of hinted at the fact that I didn't really believe what Devel::Size was telling me.

As it happens, the author of Devel::Size, Perl Hacker Dan Sugalski read my comment and asked what my problem with Devel::Size was. After I got over my surprise, I sent him the script I was using and explained how it was eating memory in a hurry.

More specifically, I wrote a really, really simple script that read in file of queries (the ones that are typed into the search box on www.yahoo.com every day). It wasn't much more complicated than this:

while (<>)
{
    chomp;
    $query_count{$_}++;
}

And when it was done, it'd spit the counts and queries out so they could be processed by some other code.

The problem was that it never finished. It always ran out of memory around 10 million queries. But I needed to do roughly 40 million or so. I instrumented the code with some calls to Devel::Size to see if the hash was really as big as it seemed.

Anyway, back to Dan. He tinkered around a bit and was able to reproduce the problem. It was two-fold: (1) Devel::Size itself used up more memory than expected, and (2) Perl's just not all that efficient with hashing.

He explained his findings via e-mail and I thought to myself, "Self: you should blog this stuff." Luckily, I I was lazy. Dan has summarized much of it on his blog so that I don't have to try and parphrase him.

The moral of the story? There are several. First, blogging is good. Second, Perl's hashes are inefficient. You need A LOT of memory if you intend to hash tens of millions of keys. And finally, Dan may have been inspired to make Perl 6's hashes a little lighter.

I re-implemented my code to loop over the file 36 times. Once for each digit and letter of the alphabet (the queries were already lower-cased). It's slow and crude, but it works.

Posted by jzawodn at February 23, 2003 10:29 PM | edit

Reader Comments

# Harmen said:

what`s wrong with
sort the_query_log|uniq -c

on February 24, 2003 02:38 AM

# Dan Isaacs said:

"It's slow and crude, but it works"

But enough about me...

on February 24, 2003 05:47 AM

# Joe Grossberg said:

Why not just use a single-column mysql table that stores each logged query in its own row?

I'd approach it like this:

while (<>)
{
chomp;
# begin pseudo-code
INSERT INTO queries VALUES ($_);
# end pseudo-code
}

Then, when it's done:

SELECT query_value, COUNT(*) FROM queries GROUP BY query_value;

or, if you don't care about the count,

SELECT DISTINCT query_value FROM queries;

and pipe that into a text file.

on February 24, 2003 08:41 AM

# Michael J. Radwin said:

I'm always amused when folks suggest using tools like unix sort or an SQL database to do the types of data crunching you're talking about.

Sure, if you've got 20 thousand records, it'll all work fine. But when you've got terabytes of data to crunch, nothing but C/C++ will do the trick.

I remember writing some code a couple of years ago (when I was in the Ads group working with Data Mining, doing similar things to what you're doing) and realizing that I could get a 10% speedup by ditching scanf() and writing a parser that used strsep(). Then I ended up ditching that and writing a parser that read data in blocks, iterating over it character by character. Why? So I could get the thing to run in 3 hours instead of 4.

Sure, it's 300 line of C code instead of 3 lines of Perl. But it's way faster. And in some cases, speed matters.

on February 24, 2003 06:02 PM

# Jeff Chan said:

If the list of keys (queries) have no locality of reference, in the naive approach, the random seeks to update the counters is the bottleneck. Hence, the importance of keeping all updates in-memory, and only writing them out to disk periodically. The choice of language, within reason, is really irrelevant here.

on February 24, 2003 07:36 PM

# Sridhar Anandraj said:

Hi,

I have a big problem in running my perl code which invloves many perl modules and many configuration files for cross references for validations.

Here my problem is, which loading the values from the files to the hash it is quitting out due to Out of memory problem. Previosly we were using perl 5.6 which is a 32 bit compiler, and now we have installed 5.8 version which is a 64 bit compiler. Now the out of memory issue is sorted out. but the processing performance got reduced.

Here only thing which needs to look is, for validating each record file, the hash is loaded with a huge size of reference file. Actually the reference file size is big. So loading the reference file into hash for each time for validating a flat file reduces the performance.

To tell this clear, for validating a flat file say 10 lines, the reference file is of 100MB size is loaded 10 times into the hash for a sigle file.

Like this the flat file needs to be validated for more than 10,000 records.

Is there any way to fix this problem?

Please help in this regard.

Sridhar Anandaraj

on January 22, 2009 06:35 AM

Disclaimer: The opinions expressed here are mine and mine alone. My current, past, or previous employers are not responsible for what I write here, the comments left by others, or the photos I may share. If you have questions, please contact me. Also, I am not a journalist or reporter. Don't "pitch" me.

Privacy: I do not share or publish the email addresses or IP addresses of anyone posting a comment here without consent. However, I do reserve the right to remove comments that are spammy, off-topic, or otherwise unsuitable based on my comment policy. In a few cases, I may leave spammy comments but remove any URLs they contain.