September 23, 2009

Pass the Duct Tape

Reading The Duct Tape Programmer really hit a nerve with me. If you code for a living (or part of your living), go read it. I sometimes get annoyed by Joel Spolsky's writing, but this time he hit the nail right on the head.

I am by no means putting myself in the same league as Jamie Zawinski, but I definitely subscribe to the idea that keeping my code as simple as possible is going to pay big, big dividends in the long run.

Needless complexity really does kill you.

In my work it has manifested itself in a few ways. But one that seems to pop up over and over in recent times (and has bitten a few of my coworkers as well) is socket programming. For whatever reason, the freely available client libraries for interacting with some services will inevitably fail a small but non-zero percentage of the time. And when you're making millions and millions of calls or connections per day, even a 0.01% failure rate is enough to make for a really bad day.

That's especially true when the failure mode involves leaking file descriptors or needlessly long timeouts (or timeouts that fail to work properly) that result in a full or partial cascading failure. It's the ultimate in frustration.

I'm not going to name names here. Maybe at an upcoming conference I'll talk about what we've seen and how network oddities add a whole new set of failures to the mix.

Anyway, the solution ends up being one of two options: (1) forking the module and re-writing the code to use the low-level system calls instead of the multitude of abstraction layers that were supposed to make the task easier (die abstraction layer! die!), or (2) scrapping the module entirely and writing your own. The first option sucks but at least you're not having to learn all the quirks of those abstraction layers. The second option sucks even more but at least you know exactly what you've got when it's all said and done.

For me it's more satisfying and productive to get to know the low-level stuff well enough that you can just drop it in when you need it. It's less frustrating than trying to reproduce all the conditions and variables that cause some abstraction layer to fail.

Pass the duct tape. I've got work to do!

Posted by jzawodn at 08:40 PM

November 21, 2008

Bash Trick: Watching Multiple Background Jobs

I recently had a need to add some error checking to a bash script that runs multiple copies of a Perl script in parallel to better utilize a multi-core server. I wanted a way to run these four processes in the background and gather up their exit values. Then, if any of them failed, I'd prematurely exit the bash script and report the error.

After a bit of reading bash docs, I came across some built-ins that I hadn't previously used or even seen. First, I'll show you the code:

wait.sh

This is the bash script that runs the parallel processes and gathers up the exit values.
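The gist embed hasn't survived in this copy, so here's a minimal reconstruction of the idiom (my sketch, not the original script; the original launched four copies of the Perl sleeper script rather than these stand-in commands):

```shell
#!/bin/bash
failures=0

# Launch the workers in the background (stand-ins for ./sleeper).
sh -c 'exit 0' &
sh -c 'exit 1' &   # one deliberately fails
sh -c 'exit 0' &

# "jobs -p" lists the pids of our background jobs; wait on each
# one and use "let" to count up the non-zero exit values.
for pid in $(jobs -p); do
    wait "$pid" || let "failures += 1"
done

if [ "$failures" -gt 0 ]; then
    echo "failures: $failures"
else
    echo "all jobs succeeded"
fi
```

Bailing out early is then just a matter of checking $failures (or checking the status right after each wait).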

sleeper

And here's the Perl script that I wrote in order to test the functioning of wait.sh. It accepts two arguments. The first is the number of seconds to sleep (to simulate the delay associated with doing work) and the second is the exit value it should use (any non-zero value indicates a failure).
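That gist is missing here as well; a minimal stand-in matching the description (my reconstruction, not the original) would be:

```perl
#!/usr/bin/perl
# sleeper: sleep for N seconds, then exit with the given status.
# Usage: sleeper <seconds> <exit_value>
use strict;
use warnings;

my $secs   = shift || 0;   # seconds to sleep (simulated work)
my $status = shift || 0;   # exit value (non-zero means failure)

sleep $secs;
exit $status;
```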

Discussion

New to me was the use of let to do math on a variable so that I can count up the number of failures. Is there a better way? Similarly, using jobs to get a list of pids to wait on proved to be a very useful idiom.

The code is straightforward and works for my purposes. But since 99% of my time is spent in Perl rather than bash, I wonder what I could have done differently and/or better. Feedback welcome.

And, if this is at all useful to you, feel free to take it and run...

Finally, I'm starting to really dig gist.github for showing off bits of code. It's good stuff.

Posted by jzawodn at 07:21 AM

November 14, 2008

Asynchronous MySQL Client in Perl

I recently found myself wishing for an async library for MySQL. My goal is to be able to fire off queries to a group of federated servers in parallel and aggregate the results in my code.

With the standard client (DBD::mysql), I'd have to query the servers one at a time. If there are 10 servers and each query takes 0.5 seconds, my code would stall for 5 seconds. But by using an async library, I could fire off all the queries and fetch the results as they become available. The overall wait time should not be much more than 0.5 seconds.
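That back-of-the-envelope math is easy to demonstrate without MySQL at all. This sketch (mine, not from the post) uses fork to simulate ten half-second "queries" issued in parallel:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(time sleep);

my $start = time();
my @pids;

# Fire off ten simulated 0.5-second "queries" in parallel.
for my $server (1 .. 10) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        sleep 0.5;   # stand-in for one query's round trip
        exit 0;
    }
    push @pids, $pid;
}

# Gather the "results" as the children finish.
waitpid($_, 0) for @pids;

my $elapsed = time() - $start;
printf "10 queries, total wall time: %.2fs\n", $elapsed;
```

Run serially, the same work takes about 5 seconds; in parallel the wall time stays near the cost of the single slowest query, which is the whole appeal of an async client.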

While I found little evidence of anyone doing this in practice, my search led me to the perl-mysql-async project on Google Code. It's a pure-Perl implementation of the MySQL 4.1 protocol and an asynchronous client that uses Event::Lib (and libevent) under the hood.

The code contains little in the way of documentation or examples, aside from the simple bundled test script. After a bit of mucking around with it, I managed to cobble together a working example. It looks like this:

Sure enough, that code runs in just a bit more time than the longest query it executes, rather than the sum of all the query times.

What still surprises me is that this code doesn't appear to get a lot of use (or at least discussion) in the real world. In the PHP world, the mysqlnd driver offers async queries.

So count this as my contribution to demonstrating that Perl can do async MySQL queries too.

Posted by jzawodn at 07:47 AM

October 01, 2008

Programming Annoyance: Libraries that Exit on Me

This is something that's been bugging me for a long time now. Over the years, I've come to realize that programming time is 10% about writing the code to do the work, 70% about figuring out where failures might occur and dealing with them, 10% about documentation, and 10% about documentation. (That last 10% may be substituted with Desktop Tower Defense or something equally time wasting.)

Or something like that. The point is that writing the code to do what I want isn't hard. It's dealing with everything else that can go wrong--especially error conditions. There are so many weird corner cases to consider. And when you're working on code for a high volume web site that has its servers under load 24 hours a day, it doesn't take long to encounter those odd situations.

Murphy is always watching.

Years ago, after battling similar problems at Yahoo, I began to develop certain ideas about how errors should be detected, handled, and reported. An important idea here is that the developer should always be in control of when the script/program/process dies. Aside from something truly fatal (like a segfault), library routines should detect errors and report them back to their caller in the form of a known-to-be-bad return value.

The problem is that I keep running into code I want to use that breaks that rule in multiple places. In Perl terms, that means that I'll be happily testing my code and suddenly something goes wrong and my script dies in a place I didn't expect. Upon digging into it, I find that the CPAN library I'm using has something like this lurking in it:

if (not $good) {
    Carp::croak("bad stuff happened!");
}

Or...

if (not $good) {
    die "badness here!";
}

Sigh!

This means I have to read the code a bit more and see if I can discern why the developer wants my script to die in some cases, but in others he's content to just do this:

if (not $good) {
    $@ = "bad things happened";
    return undef;
}

What is it about some errors that makes them fatal while others aren't so bad that I'm deemed able to deal with them? Why has this developer taken that decision away from me? It makes no sense at all.

What this means is that I then need to litter my code with ugly crap like this:

eval {
    $object->methodThatMayDie;
};
if ($@) {
    # handle error here
}

The problem with that, aside from the fact that I'm dealing with another developer's inconsistent coding, is that it pollutes my code and forces me to make yet another frustrating decision.

Do I use a small number of big eval blocks and give up knowing exactly where the code died? Or do I pollute my code with a larger number of smaller eval blocks so that I can react to specific problems with a more specific solution? That means the module developer would have had to document which methods or functions may die on me. Otherwise I have to go trudging through their code and waste my time figuring that out. Guess which is more frequent.
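For what it's worth, one way to contain the damage (my sketch, not something from the module in question) is a tiny helper that turns any die-happy call back into the return-undef-and-set-$@ convention:

```perl
use strict;
use warnings;

# Run a chunk of code that may die; return its result, or undef
# with the error message left in $@ for the caller to inspect.
sub try_call {
    my ($code) = @_;
    my $result = eval { $code->() };
    return undef if $@;
    return $result;
}

my $ok  = try_call(sub { 42 });
my $bad = try_call(sub { die "badness here!\n" });

print defined $ok  ? "first call returned $ok\n"  : "first call failed: $@";
print defined $bad ? "second call returned $bad\n" : "second call failed: $@";
```

It doesn't tell you which methods may die--you still have to read the module for that--but at least the eval noise lives in one place instead of being littered everywhere.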

Or do I override the module's use of die or Carp or whatever? I can do it, but that has other side effects I probably don't want to deal with either.

Why do I even need to deal with this in the first place? Can't people provide consistent interfaces? Is there something so bad about returning an error code and leaving it up to the user of your code to decide how to handle error conditions?

Maybe they do want to exit() or die(). Maybe they want to retry the logic after waiting a bit. Maybe they want to page someone and log the failure. Maybe...

You get the idea.

This whole concept of "fatal" exceptions seems wrong to me. Unless things are so bad that the kernel is going to kill my process, I should be the one in charge of deciding when my code will blow up. And I shouldn't have to do extra work to assert that authority. Should I?

I know that in the Java world, it's common to do a bunch of stuff in a big try block and then try to figure out what, if anything, blew up later. But I'm a firm believer in dealing with specific problems at the exact place they occur.

I really wish more people thought that way. It'd make my life easier.

Posted by jzawodn at 07:45 AM

September 02, 2008

The Perl UTF-8 and utf8 Encoding Mess

I've been hacking on some Perl code that extracts data that comes from web users around the world and been stored into MySQL (with no real encoding information, of course). My goal is to generate well-formed, valid XML that can be read by another tool.

Now I'll be the first to admit that I never really took the time to like, understand, or pay much attention to all the changes in Perl's character and byte handling over the years. I'm one of those developers that, I suspect, is representative of the majority (at least in this self-centered country). I think it's all stupid and complicated and should Just Work... somehow.

But at the same time I know it's not.

Anyway, after importing lots of data I came across my first bug. Well, okay... not my first bug. My first bug related to this encoding stuff. The XML parser on the receiving end raised hell about some weird characters coming in.

Oh, crap. That's right. This is the big bad Internet and I forgot to do anything to scrub the data so that it'd look like the sort of thing you can cram into XML and expect to maybe work.

A little searching around managed to jog my memory and I updated my code to include something like this:

  use Encode;

  ...

  my $data = Encode::decode('utf8', $row->{'Stuff'});

And all was well for quite some time. I got a lot farther with that until this weekend when Perl itself began to blow up on me, throwing fatal exceptions like this:

  Malformed UTF-8 character (fatal) ...

My first reaction, like yours probably, was WTF?!?! How on god's green earth is this a FATAL error?

After much swearing, a Twitter plea, and some reading (thanks Twitter world!), I came across a section of the Encode manual page from Perl.

I'm going to quote from it a fair amount here because I know you're as lazy as I am and won't go read it if I just link to it. The relevant section is at the very end (just before SEE ALSO) and titled UTF-8 vs. utf8.

    ....We now view strings not as sequences of bytes, but as
    sequences of numbers in the range 0 .. 2**32-1 (or in the case of
    64-bit computers, 0 .. 2**64-1) -- Programming Perl, 3rd ed.

  That has been the perl's notion of UTF-8 but official UTF-8 is more
  strict; Its ranges is much narrower (0 .. 10FFFF), some sequences
  are not allowed (i.e. Those used in the surrogate pair, 0xFFFE, et
  al).

  Now that is overruled by Larry Wall himself.

    From: Larry Wall
    Date: December 04, 2004 11:51:58 JST
    To: perl-unicode@perl.org
    Subject: Re: Make Encode.pm support the real UTF-8
    Message-Id: <20041204025158.GA28754@wall.org>

    On Fri, Dec 03, 2004 at 10:12:12PM +0000, Tim Bunce wrote:
    : I've no problem with 'utf8' being perl's unrestricted utf8 encoding,
    : but "UTF-8" is the name of the standard and should give the
    : corresponding behaviour.

    For what it's worth, that's how I've always kept them straight in my
    head.

    Also for what it's worth, Perl 6 will mostly default to strict but
    make it easy to switch back to lax.

    Larry

  Do you copy?  As of Perl 5.8.7, UTF-8 means strict, official UTF-8
  while utf8 means liberal, lax, version thereof.  And Encode version
  2.10 or later thus groks the difference between "UTF-8" and "utf8".

    encode("utf8",  "\x{FFFF_FFFF}", 1); # okay
    encode("UTF-8", "\x{FFFF_FFFF}", 1); # croaks

  "UTF-8" in Encode is actually a canonical name for "utf-8-strict".
  Yes, the hyphen between "UTF" and "8" is important.  Without it
  Encode goes "liberal"

    find_encoding("UTF-8")->name # is 'utf-8-strict'
    find_encoding("utf-8")->name # ditto. names are case insensitive
    find_encoding("utf8")->name  # ditto. "_" are treated as "-"
    find_encoding("UTF8")->name  # is 'utf8'.
Got all that?

The sound you heard last night was me banging my head on a desk. Repeatedly.

I mean, how could I have possibly noticed the massive difference between utf8 and UTF-8? Really. I must have been on some serious crack.

Sigh!

Needless to say my code now looks more like this:

  use Encode;

  ...

  my $data = Encode::decode('UTF-8', $row->{'Stuff'}); ## fuck!

Actually, I was kidding about the "fuck!" I wouldn't swear in code.
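If you want to see the strict/lax split for yourself, a one-off script that just echoes what the manual says above does the trick:

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# The two names really do resolve to different encodings:
# the strict, official one versus Perl's liberal internal one.
my $strict = find_encoding('UTF-8')->name;
my $lax    = find_encoding('utf8')->name;

print "UTF-8 resolves to: $strict\n";   # utf-8-strict
print "utf8  resolves to: $lax\n";      # utf8
```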

Posted by jzawodn at 02:10 PM

August 07, 2008

Fun with Network Programming, race conditions, and recv() flags

Last week I had the opportunity to do a bit of protocol hacking and found myself stymied by what seemed like a race condition. As with most race conditions, it didn't happen often--anywhere from 1 in 300 to 1 in 5,000 runs. But it did happen and I couldn't really ignore it.

So I did what I often do when faced with code that's doing seemingly odd things: insert lots of debugging (otherwise known as "print statements"). Since I didn't know if the bug was in the client (Perl) or server (C++), I had to instrument both of them. I'd changed both of them a bit, so they were equally suspect in my mind.

Well, to make a long, boring, and potentially embarrassing story short, I soon figured out that the server was not at fault. The changes I made to the client were the real problem.

I had forgotten about how the recv() system call really works. I had code that looked something like this (in Perl):

recv($socket, $buffer, $length, 0);
...
if (length($buffer) != $length) {
    # complain here
}

The value of $length was provided by the server as part of its response. So the idea was that the client would read exactly $length bytes and then move on. If it read fewer, we'd be stuck checking again for more data. And if we did something like this:

while (my $chunk = <$socket>) {
    $buffer .= $chunk;
}

There's a good chance it could block forever and end up in a sort of deadlock, each side waiting for the other to do something. The server would be waiting for the next request and the client would be waiting for the server to be "done."

Unfortunately for me, the default behavior of recv() is not to wait for the full amount requested. That means the code can't get stuck there--it simply does a best-effort read. If you ask for 2048 bytes but only 1536 are currently available, you'll end up with 1536 bytes. And that's exactly the sort of thing that'd happen every once in a while.

The MSG_WAITALL flag turned out to be the solution. You can probably guess what it does...

This flag requests that the operation block until the full request is satisfied. However, the call may still return less data than requested if a signal is caught, an error or disconnect occurs, or the next data to be received is of a different type than that returned.

That's pretty much exactly what I wanted in this situation. I'm willing to handle the signal, disconnect, and error cases. Once I made that change, the client and server never missed a beat. All the weird debugging code and attempts to "detect and fix" the problem were promptly ripped out and the code started to look correct again.
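Here's a self-contained toy (not the actual client) that shows the flag's effect over a socketpair: the sender dribbles out a 12-byte payload in two chunks, and MSG_WAITALL makes a single recv() wait for all of it:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Socket qw(AF_UNIX SOCK_STREAM PF_UNSPEC MSG_WAITALL);

# A connected pair of sockets: one end for the "client",
# one for the "server".
socketpair(my $client, my $server, AF_UNIX, SOCK_STREAM, PF_UNSPEC)
    or die "socketpair: $!";

my $length = 12;
my $buffer;

my $pid = fork();
die "fork: $!" unless defined $pid;

if ($pid) {
    # Client: one recv() that blocks until all $length bytes arrive.
    recv($client, $buffer, $length, MSG_WAITALL);
    waitpid($pid, 0);
    print "got all $length bytes: $buffer\n";
} else {
    # Server: send the payload in two chunks with a delay between.
    syswrite($server, "hello, ");   # first chunk: 7 bytes
    sleep 1;                        # simulate a slow server
    syswrite($server, "world");     # second chunk: 5 bytes
    exit 0;
}
```

Without the flag, that recv() would typically return after the first 7-byte chunk--which is exactly the intermittent failure described above.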

The moral of this story is that you should never assume that the default behavior is what you want. Check those flags.

Now don't get me started about quoting and database queries...

Posted by jzawodn at 08:42 PM

January 25, 2008

Perl as a Web Scripting Language

Had I not been at work, I might have nearly fallen off my seat laughing as I read item #3 in You Used Perl to Write WHAT?! on page 2. You see, the author is talking about where Perl is good and bad, and he says "there are some uses that just aren't right." That's followed by a list of items. Here's number three.

As a Web scripting language: One of the earliest usages of perl, as the Web evolved, was for CGI programming webpages. As a result, perl has some pretty strong packages for dealing with Web forms. There is also support for embedding perl into HTML in the same way the Java is embedded into JSP pages.
However, I would argue that more modern Web scripting languages, such as PHP and Ruby on Rails, offer more out-of-the-box Web support and a cleaner integration into the webpage experience. You should especially avoid using perl for traditional CGI-style form processing; this code tends to be hard to read and maintain because the HTML ends up inlined inside the perl code.

Uhm, WHAT?!

Now maybe it's the fact that I started coding in Perl about 12 years ago. Or maybe it's the fact that I've used it quite successfully in various web projects over the years. But suggesting that Perl isn't a good web scripting language is rather laughable.

Seriously.

Now, if you're starting from scratch and looking to learn a new language primarily for building web applications, that's one thing. Sure, look at PHP, Ruby, Python, and whatever else can do the job. But the author seems to be saying that it's worth learning yet another language simply for building web applications.

That's messed up.

Have a look around CPAN and at the web frameworks that exist around Perl and mod_perl. Heck, some of them have been around a long time and are quite mature. Saying that Perl "has some pretty strong packages for dealing with Web forms" doesn't really capture the true state of things.

I think programming language battles are as dumb as the next person, so I generally try to avoid them. But when I see stuff like this, it's hard to keep my mouth shut.

As for the claim that traditional CGI-style form processing is hard to read and maintain... WHAT?! There's nothing hard about separating the HTML from the code.

Okay, enough of my griping...

Thanks to the folks at TechCzar for translating my tech blog posts and including them in their blog network.

Posted by jzawodn at 01:02 PM

December 04, 2007

My Flickr Badge Perl Script

A couple times a year, someone asks how I created the Flickr "badge" that appears on the right side of my blog front page. It's not using JavaScript (what overkill that'd be) or anything fancy. It's simply an include file built up by a Perl script from a cron job every 20 minutes.

Basically it uses the Flickr API to fetch a list of my most recent photos, grabs the thumbnails, caches them on disk, and then re-writes the include file.

It's pretty simple, really. Here is the code: flickr_badge.pl.

It's a highly modified and pruned version of someone else's script that I ran across years ago. I have since lost the attribution, so ping me if you know where it came from and I'll gladly give credit.

This code is free for the taking, as in Public Domain. Use it for whatever you want. Feel free to give me credit or not. It's really not rocket surgery.

Posted by jzawodn at 11:43 AM

March 01, 2007

spam_lash.pl

I just found an old (circa 2002) Perl script that I wrote very early one morning after being spammed repeatedly on my pager/cell phone.

It looks like this...

#!/usr/local/bin/perl -w

$|=1;

use strict;
use Net::SMTP;

my $lines     = 100;
my $mail_from = "dev\@null.com";
my $smtp_host = "mail.XXXXXX-online.com";
my $mail_to   = "postmaster\@XXXXXX-online.com";
my $subject   = "Why did you spam my pager?!";

my $line = "Q" x 72;
$line  .= "\n";

my $smtp = Net::SMTP->new($smtp_host, Debug => 0) or die "$!";

while (1)
{
    $smtp->mail($mail_from)                  or die "$!";
    $smtp->to($mail_to)                      or die "$!";
    $smtp->data()                            or die "$!" ;
    $smtp->datasend("To: $mail_to\n")        or die "$!";
    $smtp->datasend("Subject: $subject\n")   or die "$!";
    $smtp->datasend("\n")                    or die "$!";

    for (1 .. $lines)
    {
        $smtp->datasend($line)               or die "$!";
    }

    $smtp->dataend()                         or die "$!";
    #$smtp->quit;

    print ".";
}

print "\n";

Heh.

Do not code when you're angry, kids...

If nothing else, it's interesting to see how my coding style has evolved in some ways but not in others--at least my "coding while pissed off" style.

Posted by jzawodn at 10:34 PM

July 31, 2006

Should I Learn Python or Ruby next?

I've been programming (when I do program) mainly in Perl for the last 10 years or so. But I've been itching to learn a new language for a while now, and the two near the top of the list are Ruby and Python.

I figure that Ruby would be easy to learn because of its similarity to Perl (I'm told). But I also figure that Python would be easy to learn because of its simplicity. And when it comes to webby stuff, I can use Rails with Ruby and Django with Python.

I'm currently leaning toward Python and began doing so last week. I started with Mark Pilgrim's excellent Dive Into Python and made it thru the first 3 chapters pretty quickly. So far it feels pretty good.

Before I really dive in, though, I'm curious to hear what others think about the choice between these two languages.

(On a related note, you might also read Tim Bray's On Ruby post, since he just started learning Ruby.)

Posted by jzawodn at 10:29 AM

October 05, 2005

JDBC Module on CPAN

This is kind of amusing. Tim Bunce just announced that there's a JDBC module available now on CPAN.

If this seems like a crack inspired coding exercise, the docs are a bit more revealing:

Why did I create this module?

Because it will help the design of DBI v2.

How will it help the design of DBI v2?

Well, "the plan" is to clearly separate the driver interface from the Perl DBI. The driver interface will be defined at the Parrot level and so, it's hoped, that a single set of drivers can be shared by all languages targeting Parrot.

Each language would then have their own thin 'adaptor' layered over the Parrot drivers. For Perl that'll be the Perl DBIv2.

So before getting very far designing DBI v2 there's a need to design the underlying driver interface. Java JDBC can serve as a useful role model. (Many of the annoyances of Java JDBC are actually annoyances of Java and so cease to be relevant for Parrot.)

As part of the DBI v2 work I'll probably write a "PDBC" module as a layer over this JDBC module. Then DBI v2 will target the PDBC module and the PDBC module will capture the differences between plain JDBC API and the Parrot driver API.

Anyway, if you're a Java geek who's "stuck" in Perl land, give it a whirl.

Posted by jzawodn at 11:26 AM

January 27, 2005

Use a Modern Perl with Kwiki

To anyone else considering Kwiki (an excellent minimal yet extensible Perl Wiki), make sure you're running a recent Perl. I spent far too long attempting to set it up under Perl 5.6.x recently only to discover that it was utterly painless under Perl 5.8.x.

I wish I had thought to try that hours earlier.

That concludes this public service announcement.

Posted by jzawodn at 05:28 PM

August 10, 2004

Yahoo! Job Opening: Software Engineer

Yup, another job opening. If you're interested or know someone who'd kick ass in this job, let me know.

The job is on-site in Sunnyvale, California.

<job_posting>

Enjoy solving hard problems creatively? Know all the GOF patterns? Can you put database schemas into 3rd normal form? Do you know the difference between REST, SOAP, and MOM?

We are looking for an engineer to architect new services, build shared libraries, and refactor existing systems. You will work with Yahoo! News, Sports, Weather, Finance, Health and other groups to build exciting systems. You will deliver complex projects on demanding deadlines while helping other engineers design and implement their systems.

If you've had experience building high-throughput systems, can design a class hierarchy in your sleep, and know all about web services, then we're looking for you.

Qualifications

  • 5-7 years experience designing modern systems
  • BS or MS in Computer Science
  • Knowledge of practical application of design patterns
  • Good written/verbal communication skills and strong investigation, research and evaluation skills
  • Clear prioritization skills in a chaotic, fast-paced environment
  • Web services experience a plus
  • C/C++, Perl, Java
  • Apache
  • XML/XSL
  • MySQL/Oracle

</job_posting>

I know some (many?) of the folks you'd be working with in this job. They're smart folks who love building great technology.

Oh, and don't ask me what "exciting systems" are. We all know that job listings are partly sales pitches, so that's what you get I guess.

Posted by jzawodn at 09:43 PM

June 21, 2004

The Perl Community: Broken Again

Why is it that every few years someone in the Perl community has to stand up and explain that things are, once again, screwed up and in need of repair?

This time around it's Nat Torkington suggesting an enema. A few years back it was Jon Orwant throwing mugs at the wall and ultimately arguing for the creation of Perl 6.

Do the Python and PHP communities have this sort of stuff going on? Or is it something that's uniquely Perl? I don't really get involved in the Perl world the way I used to, so this is genuinely puzzling to me.

Posted by jzawodn at 08:02 AM

January 18, 2004

File::Tail is damned useful

In the last week or so, I've developed a renewed appreciation for the File::Tail Perl module. If you haven't guessed from the name, this module provides a native Perl implementation of something akin to tail -f somefile and--better yet--it can do this on multiple files at the same time.

In case you're wondering, the reason I find it so helpful is that I've been building various tools that need to perform real-time scanning of log files. Specifically, I'm dealing with logs from a mail server (Exim) and a RADIUS server. By putting the two together, I can determine, in real time, which WCNet users may have infected machines that are using our designated mail relays for spamming.
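The multi-file pattern follows the File::Tail SYNOPSIS; adapted to this use case it looks roughly like the following (the log paths are made up, and the real tools obviously do more than print):

```perl
use strict;
use warnings;
use File::Tail;

# Watch both logs at once; File::Tail::select() blocks until one
# of them has a new line (or the timeout expires).
my @logs = map { File::Tail->new(name => $_, maxinterval => 5) }
           qw(/var/log/exim/mainlog /var/log/radius/radius.log);

while (1) {
    my ($nfound, $timeleft, @pending) =
        File::Tail::select(undef, undef, undef, 60, @logs);
    for my $log (@pending) {
        print $log->{input}, ": ", $log->read;
    }
}
```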

Posted by jzawodn at 08:19 PM

January 17, 2004

Craigslist RSS Search Script

I've been idly looking at a few used vehicle models for the purpose of towing a glider (and glider trailer, obviously). Recently I've been checking out Toyota 4runners and Jeep Cherokees. A few weeks ago it was Nissan Pathfinders. Trying to gauge availability and pricing is a tricky business and one that I really didn't want to spend a lot of time on.

Since the used car ads on craigslist are quite active, I figured that was a good place to look. The trouble is that in the South Bay, East Bay, and Peninsula listings, they go by pretty quickly. I don't have time to track all that.

So I wrote a script that uses the RSS feeds. This is both good and bad.

The good points:

  1. I can run it from cron every 20-30 minutes and get e-mail.
  2. It's very lightweight.
  3. It took 5 minutes to write and another 5 to test and tweak.

On the negative side:

  1. Some false positives. People aren't always sure how to represent, say, a Toyota 4runner. Should it be 4-runner, 4runner, 4 runner, four runner, or something else? So I have to be a little more liberal in the regex I use. And sometimes they're selling a 4 runner service manual or something.
  2. There's currently no logic to notice that I've already seen a particular listing before, so I see duplicates sometimes.
  3. Craigslist, for some stupid reason, strips prices from the titles in their RSS feeds. So I have to look at every listing by hand. Or I may have to automate the process--fetch every page that looks interesting and try to find the price.

That last one really pisses me off, but it's a free service, so I can't complain too much I guess. All in all the script has saved me a couple hours of time so far.

What's that? You'd like the code? Oh, okay. I suppose I can share: cl-carfind.pl (921 bytes)

Share and enjoy. But don't mis-use it.

Posted by jzawodn at 07:52 PM

November 18, 2003

CVS Commit + Weblog = Changeblog

In response to my recent post about CVS Commit Notifications via E-mail, Jason Gessner writes to say that he's done something even cooler. He's rigged up a way to get CVS Commit info posted to a weblog using Net::Blogger (which I've played with before (here, here, and here) too and use to post 95% of my entries from Emacs now).

He demonstrated it at Chicago.pm (PDF) and has an example on-line.

Cool stuff.

That reminds me, in a roundabout way, of the RSS feeds of CVS Commits that I setup at work. It's more popular than I expected it'd be. There are people other than me who use it...

Posted by jzawodn at 09:44 PM

July 31, 2003

Tim Blames Perl

But it's his own lack of any comments that confused his co-worker. If the code isn't intuitive, document it.

Come on Tim, don't you know better? Don't blame the language. You can write obscure code and leave it undocumented in any language.

Posted by jzawodn at 05:55 PM

July 10, 2003

OSCON Day #4: Ruby for Perl Programmers

Phil is talking about Ruby. Again, semi-realtime notes on Ruby.

Ruby is roughly 10 years old now. Matz liked Perl's text processing but didn't think that Python was OO enough. It's more of a Perl/SmallTalk blend. Classes, methods, objects, exceptions, message passing, iterators, closures, garbage collection, etc. And it's multi-platform, of course.

Back in 2000, Phil used a lot of Perl but found OO Perl tedious.

Why learn Ruby? It has a similar syntax but is different enough in some places to make you think differently. Strings, hashes, arrays, etc. Ruby can use any object as a key to a hash. Regexes, here-docs, etc.

@ means instance variable inside a class, not an array. The $ denotes global scope variable. @@ denotes a class variable. Semi-colons are optional at the end of line. Parens are optional in method calls.

False and nil are false. But 0 and '0' are true. Everything is an object.

Smaller community for Ruby, but that's okay.

Lots of interesting on-screen examples that I can't reproduce easily, so I'll just watch.

Posted by jzawodn at 12:06 PM

May 26, 2003

Software upgrades suck

I ran an upgrade recently on family.zawodny.com. It was the usual apt-get dist-upgrade to bring things current.

Well, it upgraded a bunch of stuff, including Perl to 5.8.0. That caused Movable Type to die in mysterious ways. So after 2 hours of messing with it, I've migrated my data to MySQL (yes, I was still using Berkeley DB, sue me).

If this posts, I guess it's working again.

What a waste of time. As if I didn't already, but now I have a renewed appreciation for why someone might want to use TypePad.

Posted by jzawodn at 07:48 PM

May 20, 2003

Net::Blogger fix-up

Morbus Iff, the illustrious creator of AmphetaDesk (my first aggregator and soon to be my aggregator of choice for the 3rd time--but that's another topic) just pointed out a bug that afflicted the Perl script I used to post to my blog from Emacs.

After a little back and forth, I learned that when I posted using post.pl, my category pages weren't updated to reflect the new post. It wasn't until a comment or TrackBack came along that it'd happen. So we were left wondering how to trigger the proper re-generation of stuff when posting.

Pretty soon he had it figured out and mailed me a code snippet. I've updated the code if you're curious. Other than the masked-out password, it's exactly what I use to post.

Summary: Morbus rocks!

Posted by jzawodn at 09:56 PM

March 10, 2003

MySQL Full-Text Search Rocks My World

People ask me about MySQL's full-text search from time to time, but I've never actually used it. I understand how it works, so I can generally provide ball-park ideas about performance and suitability for a particular purpose. But until today, I had no first hand experience.

That all changed today. My initial reaction: Wow!

In MySQL 4.0.10 (I haven't bothered to build 4.0.11 yet) it makes my life way easier.

Here's the problem I'm trying to solve, stated generally enough so that it's meaningful and doesn't give away any trade secrets.

I have a Perl script manipulating lots of short multi-word strings. Each string has an associated numeric value. There's anywhere from a few hundred thousand to 5 million of them. For each of those strings, I need to locate all the other strings that contain the first string and then do something interesting with the associated value.

For example, given the string "car rental" I need to find:

  • national car rental
  • avis car rental
  • dollar car rental
  • car rental companies

And so on.

I do not want to match "rental car" or "car rent" or "car rentals" or similar variations. Order matters. Word boundaries matter.

The simple solution is to iterate over the list of strings. For each string, scan all the other strings to look for matches. The problem is that this does not scale well at all. It's an O(n**2) solution. With a few million strings, it takes forever.

What I needed was a way to index the strings. In the "car rental" case, if I could somehow find a list of all the strings that contain the word "rental" and then examine those, it'd be way faster. It'd be even faster if I could find the intersection of the set of strings that contain "car" and those that contain "rental." Then I could just check for ordering to make sure I don't find "rental car." But I didn't want to build that myself. And memory is at a premium here, so I can't attack it sloppily.
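To make the idea concrete, here's a rough sketch of what building that intersection by hand would look like. The strings and the structure are made up for illustration; this is exactly the bookkeeping (and memory) I didn't want to pay for myself:

```perl
#!/usr/bin/perl -w
use strict;

my @strings = ('national car rental', 'rental car', 'car rental companies',
               'car rentals', 'discount car rental');

# Hand-rolled inverted index: word => { string => 1 }.
my %index;
for my $s (@strings) {
    $index{$_}{$s} = 1 for split ' ', $s;
}

# Intersect the posting lists for each word of the query.
sub candidates {
    my @words = split ' ', shift;
    my %seen  = %{ $index{ shift @words } || {} };
    for my $w (@words) {
        %seen = map { $_ => 1 } grep { $index{$w} && $index{$w}{$_} } keys %seen;
    }
    return keys %seen;
}

# The intersection is order-insensitive; the regex enforces word order
# and boundaries, so "rental car" and "car rentals" don't sneak in.
my @hits = grep { /\bcar rental\b/ } candidates('car rental');
print "$_\n" for sort @hits;
```

The index itself eats memory proportional to the total word count, which is precisely the problem when you have millions of strings and little RAM to spare.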

MySQL to the Rescue!

After a bit of thinking, I realized that MySQL's fulltext indexing could probably do the job a lot faster than I could. So I constructed a simple table that can hold these mysterious strings and values.

  CREATE TABLE `stuff`
  (
    secret_num    INTEGER UNSIGNED NOT NULL,
    secret_string VARCHAR(250)     NOT NULL
  )

Then I load all the data into the table, either directly in Perl or all at once using mysqlimport. Once it's there, I add a fulltext index to the secret_string column.

ALTER TABLE `stuff` ADD FULLTEXT (secret_string)

Then I can find the data I want much, much faster.

mysql> select * from stuff
    -> where match (secret_string) against ('+"car rental"'
    -> in boolean mode) order by secret_num asc;
+------------+-----------------------+
| secret_num | secret_string         |
+------------+-----------------------+
|         48 | discount car rental   |
|         56 | car rental companies  |
|         81 | advantage car rental  |
|        106 | payless car rental    |
|        204 | avis car rental       |
|        206 | hertz car rental      |
|        231 | dollar car rental     |
|        267 | alamo car rental      |
|        329 | thrifty car rental    |
|        495 | budget car rental     |
|        523 | enterprise car rental |
|        960 | national car rental   |
|       1750 | car rental            |
+------------+-----------------------+
13 rows in set (0.00 sec)

Not bad.

Of course, it's not perfect. There are three issues.

  1. MySQL has a slightly different notion of what a "word" is than my code does. But I can account for that by doing a sanity check on the records that come back.
  2. MySQL doesn't index small words (length 3 or less) by default. I haven't addressed that yet. I can either rebuild MySQL to also index smaller words, or handle it in a different way. I'll worry about it on Wednesday.
  3. The original record ("car rental") appears in the results. So I have to filter it out. No big deal.
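Issues 1 and 3 boil down to a small post-filter in Perl. A sketch; the rows here are hypothetical stand-ins for what a DBI fetch loop would hand back:

```perl
#!/usr/bin/perl -w
use strict;

# Post-filter for the MATCH ... AGAINST results: re-check each row with
# Perl's own idea of a word boundary (issue 1) and drop the original
# string itself (issue 3).
sub filter_matches {
    my ($original, @rows) = @_;
    return grep {
        $_ ne $original             # issue 3: toss the record itself
        && /\b\Q$original\E\b/      # issue 1: sanity-check order and boundaries
    } @rows;
}

my @rows = ('car rental', 'rental car', 'avis car rental',
            'car rental companies', 'car rentals');
print "$_\n" for filter_matches('car rental', @rows);
```

Since the fulltext index has already shrunk the candidate list from millions down to a handful, the per-row regex check costs next to nothing.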

All in all, this is a lot easier and faster than having to come up with my own solution.

Oh. I should point out that this data was destined to be stored in MySQL anyway, so it's not like I have an unusual dependency on MySQL just to solve this problem.

Go forth and make good use of MySQL's full-text search engine.

Posted by jzawodn at 04:31 PM

March 05, 2003

Posting from Perl is Way Cool

I just had to try one more post to make sure my new posting script works as intended. I've updated it quite a bit to, in the Perl spirit, Do What I Want. And it does. And I'm quite happy.

Next step: emacs integration. Then e-mail. Then I take over the world.

Posted by jzawodn at 09:15 PM

The Code

Okay, Net::Blogger officially rocks.

I posted the previous entry using a little script that took me maybe 7 minutes to write (including the time to install Net::Blogger). If you're curious, have a look at post.pl to see how I did it.

There is so much cool stuff I can do with this.

Posted by jzawodn at 08:54 PM

Posting to MT from Perl

If this works, Net::Blogger rocks! :-)

Posted by jzawodn at 08:33 PM

February 23, 2003

Perl Hashes and stuff...

A few days ago, I noted that I was being stupid. In that post, I made a comment about using Devel::Size to figure out what Perl was doing when it kept eating up all my memory. I sort of hinted at the fact that I didn't really believe what Devel::Size was telling me.

As it happens, the author of Devel::Size, Perl Hacker Dan Sugalski read my comment and asked what my problem with Devel::Size was. After I got over my surprise, I sent him the script I was using and explained how it was eating memory in a hurry.

More specifically, I wrote a really, really simple script that read in a file of queries (the ones that are typed into the search box on www.yahoo.com every day). It wasn't much more complicated than this:

while (<>)
{
    chomp;
    $query_count{$_}++;
}

And when it was done, it'd spit the counts and queries out so they could be processed by some other code.

The problem was that it never finished. It always ran out of memory around 10 million queries. But I needed to do roughly 40 million or so. I instrumented the code with some calls to Devel::Size to see if the hash was really as big as it seemed.

Anyway, back to Dan. He tinkered around a bit and was able to reproduce the problem. It was two-fold: (1) Devel::Size itself used up more memory than expected, and (2) Perl's just not all that efficient with hashing.

He explained his findings via e-mail and I thought to myself, "Self: you should blog this stuff." Luckily, I was lazy. Dan has summarized much of it on his blog so that I don't have to try and paraphrase him.

The moral of the story? There are several. First, blogging is good. Second, Perl's hashes are inefficient. You need A LOT of memory if you intend to hash tens of millions of keys. And finally, Dan may have been inspired to make Perl 6's hashes a little lighter.

I re-implemented my code to loop over the file 36 times. Once for each digit and letter of the alphabet (the queries were already lower-cased). It's slow and crude, but it works.
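The multi-pass trick looks roughly like this (sketched with an in-memory list instead of the real 40-million-line file, so it's easy to demo):

```perl
#!/usr/bin/perl -w
use strict;

# One pass per leading character: only the queries starting with $prefix
# get hashed, so only a fraction of the keys are in memory at once.
# The real script re-read the query file on each pass.
sub count_by_prefix {
    my ($prefix, @queries) = @_;
    my %count;
    for (@queries) {
        $count{$_}++ if /^\Q$prefix\E/;
    }
    return %count;
}

my @queries = ('mysql', 'perl hashes', 'mysql', 'zawodny');
for my $prefix ('a' .. 'z', 0 .. 9) {
    my %count = count_by_prefix($prefix, @queries);
    print "$_ $count{$_}\n" for sort keys %count;
}
```

You pay for 36 reads of the file, but each pass's hash stays small enough to fit in memory, which beats never finishing at all.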

Posted by jzawodn at 10:29 PM

February 20, 2003

Binary Search

Programmers can be so damned stupid sometimes.

Take me for example.

I've been working to optimize and adjust some code at work. I can't tell you what it does, but I can tell you that it's too slow and uses too much memory. It's Perl. I know Perl. I'd like to think I know it pretty well, having used it for around nine years now.

In tracking down this memory problem, I've learned a lot about what a memory pig Perl can be. But that's a topic for another blog entry. The real issue is how I've been tracking the problem. I'd get a hunch that the %foo hash was way too big and causing the process to die. So I'd convert it to a tied hash backed by Berkeley DB. And I'd run it again. It would die again.

Of course, this never happens with my small, quick-to-test data. It only happens with the full load (between 6 and 17 million, uhm, phrases). And it takes anywhere from 35 to 60 minutes to die. So you can guess how productive an average 45-minute test cycle makes me.

Ugh.

I've finally decided to just resort to a classic debugging technique: the binary search. Well, with a twist. Thanks to Ray, I'm using Devel::Size to periodically dump the memory use (or some approximation of it--that's another story) out to a log.
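The instrumentation amounts to a counter and a periodic dump. A runnable sketch; the real version logged Devel::Size's total_size() of the hash, but here the key count stands in as a crude proxy so the example runs anywhere:

```perl
#!/usr/bin/perl -w
use strict;

# Count lines as before, but every $interval lines record how big the
# hash has grown. Plotting (or just eyeballing) the log shows where
# memory use takes off -- a binary search in time instead of in code.
sub count_with_log {
    my ($interval, @lines) = @_;
    my (%count, @log);
    my $n = 0;
    for (@lines) {
        $count{$_}++;
        push @log, [ $n, scalar keys %count ] if ++$n % $interval == 0;
    }
    return (\%count, \@log);
}

my ($count, $log) = count_with_log(2, qw(foo bar foo baz));
printf "after %d lines: %d keys\n", @$_ for @$log;
```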

Why I didn't start this a few days ago is beyond me.

No, wait. It's not. It's because every time I tried something new, followed a new hunch, I was convinced that it was the solution.

Grr.

Someone slap me next time I do this.

Posted by jzawodn at 09:29 PM

February 11, 2003

DBD::google

This is too cool. A DBD::google module on CPAN:

DBD::google allows you to use Google as a datasource; google can be queried using SQL SELECT statements, and iterated over using standard DBI conventions.

I've been wanting an SQL interface to Google for so long. :-)

Posted by jzawodn at 08:45 PM

January 27, 2003

From the WTF?! Department

Every once in a while, I run across a Perl module that scares me a little. The most recent one is X11::Protocol and friends.

That's right boys and girls! Protocol-level X11 programming in Perl.

/me smacks forehead

Posted by jzawodn at 09:09 PM

November 05, 2002

Bricolage

I attended the Silicon Valley Perl Mongers (sv.pm) meeting tonight. The speaker was David Wheeler, one of the main developers of the Bricolage content management / publishing system. Bricolage was recently reviewed by eWeek and they loved it.

I'd heard a lot of good things about it, but never spent the time to look at it. After the presentation, I'm very impressed. It makes good use of existing technologies (mod_perl, HTML::Mason, PostgreSQL, etc) to build something very impressive.

The sick part is that I found myself wondering if it could be contorted into a sort of MovableType on steroids. I've convinced myself that it could be done with maybe a couple weeks of hacking. So I'll put that on the list of "stuff to mess around with after the book is done".

Posted by jzawodn at 10:36 PM