February 23, 2003

barrapunto

I noticed a lot of users referred to my site from barrapunto.com recently. Apparently it's like the Spanish Slashdot.

Whoda thunk?

Anyway, they picked up my 10 habits post and a few folks have commented so far. They even translated my annoyances. But I can't read them. I took German in high school and college. A lot of good that's done me.

[Mental note: take Spanish next time. It's waaaaayyy more useful when living in California.]

Anyway, I pumped the pages through the fish and it made some sense. Any of the linguistically inclined want to tell me what they're saying? :-)

Posted by jzawodn at 11:33 PM

Perl Hashes and stuff...

A few days ago, I noted that I was being stupid. In that post, I made a comment about using Devel::Size to figure out what Perl was doing when it kept eating up all my memory. I sort of hinted at the fact that I didn't really believe what Devel::Size was telling me.

As it happens, the author of Devel::Size, Perl Hacker Dan Sugalski, read my comment and asked what my problem with Devel::Size was. After I got over my surprise, I sent him the script I was using and explained how it was eating memory in a hurry.

More specifically, I wrote a really, really simple script that read in a file of queries (the ones that are typed into the search box on www.yahoo.com every day). It wasn't much more complicated than this:

my %query_count;

# Count how often each distinct query appears in the input.
while (<>)
{
    chomp;
    $query_count{$_}++;
}

And when it was done, it'd spit the counts and queries out so they could be processed by some other code.

The problem was that it never finished. It always ran out of memory around 10 million queries. But I needed to do roughly 40 million or so. I instrumented the code with some calls to Devel::Size to see if the hash was really as big as it seemed.
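For the curious, the instrumentation looked roughly like the sketch below. This is not the original script: the count_queries wrapper, the $every interval, and the report format are made up for illustration. The only Devel::Size call it uses is total_size(), and since Devel::Size comes from CPAN rather than core Perl, the sketch loads it at run time and simply skips the report if it isn't installed.

```perl
use strict;
use warnings;

# Devel::Size is a CPAN module, not core Perl, so load it at run time.
my $have_devel_size = eval { require Devel::Size; 1 };

# Count each distinct line read from $fh. Every $every lines, report how
# big the hash has become. ($every and the report format are illustrative
# assumptions, not taken from the original script.)
sub count_queries {
    my ($fh, $every) = @_;
    my %query_count;
    my $lines = 0;

    while (my $query = <$fh>) {
        chomp $query;
        $query_count{$query}++;

        next if ++$lines % $every;      # only report every $every lines
        next unless $have_devel_size;   # nothing to report without the module

        # total_size() follows the reference and includes keys and values.
        printf STDERR "%d lines, %d keys, %d bytes\n",
            $lines, scalar(keys %query_count),
            Devel::Size::total_size(\%query_count);
    }

    return \%query_count;
}
```

Calling it as count_queries(\*STDIN, 1_000_000) would print one report per million queries. Measuring that rarely is deliberate, since walking a huge hash to size it is not free either.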

Anyway, back to Dan. He tinkered around a bit and was able to reproduce the problem. It was two-fold: (1) Devel::Size itself used up more memory than expected, and (2) Perl's just not all that efficient with hashing.

He explained his findings via e-mail and I thought to myself, "Self: you should blog this stuff." Luckily, I was lazy. Dan has summarized much of it on his blog so that I don't have to try to paraphrase him.

The moral of the story? There are several. First, blogging is good. Second, Perl's hashes are not memory-efficient. You need A LOT of memory if you intend to hash tens of millions of keys. And finally, Dan may have been inspired to make Perl 6's hashes a little lighter.

I re-implemented my code to loop over the file 36 times. Once for each digit and letter of the alphabet (the queries were already lower-cased). It's slow and crude, but it works.
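Here's a rough sketch of that multi-pass idea, not the actual code. The count_in_passes name and its $path and $emit arguments are invented for illustration. It also assumes every query starts with a digit or a lowercase letter; anything starting with punctuation or whitespace is silently skipped here, which may or may not match what the real script did.

```perl
use strict;
use warnings;

# Re-read the file once per leading character, so only the keys that start
# with that character are in memory at any one time.
# $path is the query file; $emit is a callback that receives (query, count).
# Both names are hypothetical.
sub count_in_passes {
    my ($path, $emit) = @_;

    for my $first (0 .. 9, 'a' .. 'z') {    # 36 passes in total
        my %query_count;

        open my $fh, '<', $path or die "Cannot open $path: $!";
        while (my $query = <$fh>) {
            chomp $query;
            # Skip queries that belong to a different pass.
            next unless substr($query, 0, 1) eq $first;
            $query_count{$query}++;
        }
        close $fh;

        # Hand this pass's results to the caller. %query_count goes out of
        # scope at the end of the loop body, before the next pass starts.
        while (my ($query, $count) = each %query_count) {
            $emit->($query, $count);
        }
    }
}
```

Something like count_in_passes('queries.txt', sub { print "$_[1]\t$_[0]\n" }) would print a count and query per line, with queries.txt standing in for the real file. The trade is 36 reads of the file in exchange for a hash that only ever holds one slice of the keys.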

Posted by jzawodn at 10:29 PM