September 25, 2008

Ubuntu Kung Fu: Best Book Cover Ever!

I just ran across news that Ubuntu Kung Fu is shipping and happened to look at the cover. As a cat lover and technical book author myself, I felt a little slighted.

[Image: the Ubuntu Kung Fu book cover, featuring a kitten]

That's right. Keir Thomas got a kitten on his book.

That kicks ass.

But even better, Ubuntu Kung Fu (PDF and printed) sounds like a real winner for day-to-day Ubuntu users. As the marketing blurb says:

Award-winning Linux author Keir Thomas gets down and dirty with Ubuntu to provide over 300 concise tips that enhance productivity, avoid annoyances, and simply get the most from Ubuntu. You'll find many unique tips here that can't be found anywhere else. You'll also get a crash course in Ubuntu's flavor of system administration. Whether you're new to Linux or an old hand, you'll find tips to make your day easier.

In other words, it's a book that nearly everyone using Ubuntu could benefit from. I'm hoping to grab a copy shortly. Have a listen to Keir Thomas on Ubuntu Kung Fu in this week's Pragmatic Podcast.

Also available on Amazon.com.

Posted by jzawodn at 06:44 AM

September 24, 2008

Mustard Lime Beef Steaks Recipe

A few days ago I made a new grill recipe that turned out even better than we expected, so I've reproduced it here for your grilling and eating enjoyment.

Ingredients

4 sirloin beef steaks (roughly 1" thick)
1/4 cup of dry mustard (Colman's works well)
1/4 cup Worcestershire Sauce (Lea and Perrins works well)
Lime Juice or 1 large lime
Coarse salt (sea salt is what I use)
Freshly ground white pepper

Preparation

Cover the steaks on one side with 2 tablespoons of dry mustard. Pat it down and spread evenly with the back side of a fork. Sprinkle two tablespoons of Worcestershire sauce over the steaks, allowing it to soak into the mustard--patting the steaks with the fork if necessary. Dribble a bit of lime juice over the steaks.

Season the steaks with a good amount of salt and pepper. Then flip the steaks and repeat on the other side. Let them marinate for 20-30 minutes while pre-heating the grill.

Cooking

Clean and oil the grill. Cook the steaks on high heat for roughly 4 to 6 minutes per side, aiming to keep them pink on the inside. Do not rotate the steaks to make those nice cross-hatched grill marks. Doing so may knock off some of the mustard and seasonings.

Let the steaks sit for a few minutes. Slice and enjoy. :-)

Unfortunately, I have no pictures to show. But they're most excellent to eat. Trust me.

Posted by jzawodn at 07:41 AM

September 22, 2008

HTPC Wireless Keyboard and Mouse Recommendations?

Dear Lazyweb,

As of a few weeks ago, we have a computer hooked up to the 66 inch TV full-time. However, it currently has a wired keyboard and mouse, both of which are less than optimal when you'd prefer to keep your ass on the couch and pick a movie from the server upstairs.

So I'm soliciting recommendations for a wireless keyboard and mouse (or keyboard/mouse combo) that has decent range (20-30 feet, ideally) and doesn't take up too much space. The keyboard doesn't need a numeric keypad or even full-sized keys. It's only going to be used to type a small amount: occasional hits to IMDB and renaming a folder or two.

The mouse should be a reliable two or three button optical that can tolerate occasional attacks by our cats and possibly even a spilled drink.

One option that is highly rated but also highly priced is the Logitech diNovo Edge 2-Tone 84 Normal Keys 9 Function Keys USB Bluetooth Wireless Mini Keyboard. Reviews claim it has excellent size and range. But the touchpad mouse seems a bit funky. However, I do like the idea of it being built-in so there's only one object to deal with.

Thoughts?

Thanks in advance,
Jeremy

Posted by jzawodn at 02:28 PM

September 10, 2008

Never Buy a House from a Realtor

A couple weekends ago we embarked on a seemingly simple painting project at home. We wanted to finally paint over the wall that was torn up when I had plumbing problems a few years ago (see: The Leak, Day #2, The Leak, Day #3: Leak Found, Pictures, Showering with a 90 Foot Hose, and other Fun Tidbits, The Leak, Day #7: Still Showering with a Hose, etc.).

There were numerous cans of paint in the garage that the previous owners had left behind. And since the house had mostly white walls, it seemed like a pretty trivial task. We got out the paint, spread the plastic and sheets, stirred, poured, and started putting paint on the walls.

After a bit of painting it became apparent that we were not using the right color. Apparently there was more than one white used in the house. This wouldn't normally be a problem. But as part of the painting we decided to touch up a few other walls in other rooms of the house. It looked fine while the paint was wet. But as the paint dried, we realized that there were actually three or more different flavors of "white" in use around the house.

Grr.

Realizing what a pain in the ass this could turn into, we opted to chip a bit of paint off the affected walls, take the chips over to our neighborhood Orchard Supply Hardware, and get them to match the color.

They did an excellent job. The touched-up spots look fine. And the colored paint we got for the previously repaired wall looks great. (Oh, we decided to use a non-white color after we realized the "white" was all wrong.)

So you're probably wondering what this has to do with realtors.

Realtors know what it takes to sell a house. They know where they can cut corners and get away with it. After thinking about it a bit, I realized what the previous owners of our house must have done. I suspect that they hired some cheap painters and asked them to bring along any leftover white paint from previous jobs.

They did. And they used one white for one room, a slightly different white for the next, and so on--thereby using up the extra paint and not having to spend a whopping $12/gallon to repaint the house before selling it.

I can't think of any other reason why someone would paint different rooms using shades of white that are just different enough to be noticeable. It just doesn't make sense.

But to make matters worse, they didn't bother to label the spare cans so we'd know which room the colors applied to. At least the spare paint cans I put away after we were done have things like "living room" or "bedroom" written on them in black marker.

Damned cheap-ass realtors.

Anyone need three or four cans of partially used off-white paint?

Posted by jzawodn at 09:32 AM

September 08, 2008

Long Term Data Archiving Formats, Storage, and Architecture

I'm thinking about ways to store archival data for the long term and wanted to solicit anyone who's been down this road for some input, advice, warnings, etc.

Background

Essentially I'm dealing with a system where "live" and "recently live" data is stored in a set of replicated MySQL servers and queried in real-time. As time goes on, however, older "expired" data is moved to a smaller set of replicated "archive" servers that also happen to be MySQL.

This is problematic for a few reasons, but rather than be all negative, I'll express what I'm trying to do in the form of some goals.

Goals

There are a few high-level things I'd like this archive to handle based on current and future needs:

  1. Be able to store data for the foreseeable future. That means hundreds of millions of records and, ultimately, billions.
  2. Fast access to a small set of records. In other words, I like having MySQL and indexes that'll get me what I want in a hurry without having to write a lot of code. The archive needs to be able to handle real-time queries quickly. It does this today and needs to continue to work (see the sketch after this list).
  3. Future-proof file/data format(s). One problem with simply using MySQL is that there will be schema changes over time. A column may be added, dropped, or renamed. That change can't easily be applied retroactively to a large data set in a big table. And if you don't apply it retroactively, the code needs to be willing to deal with those changes, NULLs appearing, etc.
  4. Fault tolerance. In other words, the data has to live in more than one place.
  5. Support for large scans on the data. This can be for full-text style searches, looking for patterns that can't easily be indexed, computing statistics, etc.
  6. It's worth noting that data is added to the archive on a constant basis and it is queried regularly in a variety of ways. But there are no deletes or updates occurring. It's a write-heavy system most of the time.
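
For goal #2, the day-to-day access pattern stays plain indexed SQL. Here's a minimal sketch using DBI against one of the archive slaves; the table and column names are hypothetical, just to show the shape of the queries:

  use strict;
  use warnings;
  use DBI;

  # Connect to a replicated archive slave.
  my $dbh = DBI->connect('DBI:mysql:database=archive;host=archive-db',
                         'reader', 'secret', { RaiseError => 1 });

  # A point lookup like this hits an index and comes back fast, even
  # with hundreds of millions of rows in the table.
  my $rows = $dbh->selectall_arrayref(
      'SELECT * FROM messages_2008_09 WHERE user_id = ? LIMIT 10',
      { Slice => {} },   # return each row as a hashref
      12345,
  );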

Pieces of a Possible Solution

I'm not sure that a single tool or piece of infrastructure will ever solve all the needs, but I'm thinking there may be several open source solutions that can be put to use.

You'll notice that this involves duplicating data, but storage is fairly cheap. And each piece is especially good at solving one or more of our needs.

  1. MySQL. I still believe there's a need for having a copy of the data either denormalized or in a star schema in a set of replicated MySQL instances using MyISAM. The transactional overhead of InnoDB isn't really needed here. To keep things manageable one might create tables per month or quarter or year. Down the road maybe Drizzle makes sense?
  2. Sphinx. I've been experimenting with Sphinx for indexing large amounts of textual data (with some numeric attributes) and it works remarkably well. This would be very useful instead of building MySQL full-text indexes or doing painful LIKE queries.
  3. Hadoop/HDFS and Flat Files or a simple record structure. To facilitate fast batch processing of large chunks of data, it'd be nice to have everything stored in HDFS as part of a small Hadoop cluster where one can use Hadoop Streaming to run jobs over the entire data set. But what's a good future-proof file format that's efficient? We could use something like XML (duh), JSON, or even Protocol Buffers. And it may make sense to compress the data with gzip too. Maybe put a month's worth of data per file and compress (see the sketch after this list)? Even Pig could be handy down the road.
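
As a sketch of the flat-file idea from item #3: one JSON object per line, one month of data per gzipped file. The record fields here are hypothetical; the point is that the format is trivially readable later, and a Hadoop Streaming job can consume the same lines straight from stdin:

  use strict;
  use warnings;
  use JSON;
  use IO::Compress::Gzip qw($GzipError);

  # Hypothetical records; the real fields would mirror the MySQL schema.
  my @records = (
      { id => 1, user_id => 12345, created => '2008-09-01 12:00:00', body => 'hello' },
      { id => 2, user_id => 67890, created => '2008-09-02 08:15:00', body => 'world' },
  );

  # A month's worth of data per file, one JSON object per line, gzipped.
  my $z = IO::Compress::Gzip->new('archive-2008-09.json.gz')
      or die "gzip failed: $GzipError";
  $z->print(encode_json($_) . "\n") for @records;
  $z->close;

A reader that treats any missing key as NULL gets the schema-drift tolerance from goal #3 for free: adding a column later just means newer files carry a key that older files lack.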

While it involves some data duplication, I believe these pieces could do a good job of handling a wide variety of use cases: real-time simple queries, full-text searching, and more intense searching or statistical processing that can't be pre-indexed.

So what else is there to consider here? Other tools or considerations when dealing with a growing archive of data whose structure may grow and change over time?

I'm mostly keeping discussion of storage hardware out of this, since it's not the piece I really deal with (a big disk is a big disk for most of my purposes), but if you have thoughts on that, feel free to say so. So far I'm only thinking 64-bit Linux boxes with RAID for MySQL and non-RAID for HDFS and Sphinx.

Posted by jzawodn at 04:24 PM

September 02, 2008

The Perl UTF-8 and utf8 Encoding Mess

I've been hacking on some Perl code that extracts data that comes from web users around the world and has been stored in MySQL (with no real encoding information, of course). My goal is to generate well-formed, valid XML that can be read by another tool.

Now I'll be the first to admit that I never really took the time to like, understand, or pay much attention to all the changes in Perl's character and byte handling over the years. I'm one of those developers who, I suspect, is representative of the majority (at least in this self-centered country). I think it's all stupid and complicated and should Just Work... somehow.

But at the same time I know it's not.

Anyway, after importing lots of data I came across my first bug. Well, okay... not my first bug. My first bug related to this encoding stuff. The XML parser on the receiving end raised hell about some weird characters coming in.

Oh, crap. That's right. This is the big bad Internet and I forgot to do anything to scrub the data so that it'd look like the sort of thing you can cram into XML and expect to maybe work.
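
(Worth noting: even perfectly valid UTF-8 can still contain characters that XML 1.0 flat-out forbids, such as most of the control characters. A one-line scrub along these lines keeps the parser on the receiving end happy; the $data variable is just illustrative:)

  # Strip everything outside the XML 1.0 "Char" production. Tab, LF,
  # CR, and the normal printable ranges survive; stray control
  # characters do not.
  $data =~ s/[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g;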

A little searching around managed to jog my memory and I updated my code to include something like this:

  use Encode;

  ...

  my $data = Encode::decode('utf8', $row->{'Stuff'});

And all was well for quite some time. I got a lot farther with that until this weekend when Perl itself began to blow up on me, throwing fatal exceptions like this:

  Malformed UTF-8 character (fatal) ...

My first reaction, like yours probably, was WTF?!?! How on god's green earth is this a FATAL error?

After much swearing, a Twitter plea, and some reading (thanks Twitter world!), I came across a section of the Encode manual page from Perl.

I'm going to quote from it a fair amount here because I know you're as lazy as I am and won't go read it if I just link to it. The relevant section is at the very end (just before SEE ALSO) and titled UTF-8 vs. utf8.

    ....We now view strings not as sequences of bytes, but as
    sequences of numbers in the range 0 .. 2**32-1 (or in the case of
    64-bit computers, 0 .. 2**64-1) -- Programming Perl, 3rd ed.

  That has been the perl's notion of UTF-8 but official UTF-8 is more
  strict; Its ranges is much narrower (0 .. 10FFFF), some sequences
  are not allowed (i.e. Those used in the surrogate pair, 0xFFFE, et
  al).

  Now that is overruled by Larry Wall himself.

    From: Larry Wall
    Date: December 04, 2004 11:51:58 JST
    To: perl-unicode@perl.org
    Subject: Re: Make Encode.pm support the real UTF-8
    Message-Id: <20041204025158.GA28754@wall.org>

    On Fri, Dec 03, 2004 at 10:12:12PM +0000, Tim Bunce wrote:
    : I've no problem with 'utf8' being perl's unrestricted uft8 encoding,
    : but "UTF-8" is the name of the standard and should give the
    : corresponding behaviour.

    For what it's worth, that's how I've always kept them straight in my
    head.

    Also for what it's worth, Perl 6 will mostly default to strict but
    make it easy to switch back to lax.

    Larry

  Do you copy?  As of Perl 5.8.7, UTF-8 means strict, official UTF-8
  while utf8 means liberal, lax, version thereof.  And Encode version
  2.10 or later thus groks the difference between "UTF-8" and "utf8".

    encode("utf8",  "\x{FFFF_FFFF}", 1); # okay
    encode("UTF-8", "\x{FFFF_FFFF}", 1); # croaks

  "UTF-8" in Encode is actually a canonical name for "utf-8-strict".
  Yes, the hyphen between "UTF" and "8" is important.  Without it
  Encode goes "liberal"

    find_encoding("UTF-8")->name # is 'utf-8-strict'
    find_encoding("utf-8")->name # ditto. names are case insensitive
    find_encoding("utf8")->name  # ditto. "_" are treated as "-"
    find_encoding("UTF8")->name  # is 'utf8'.

Got all that?

The sound you heard last night was me banging my head on a desk. Repeatedly.

I mean, how could I have possibly noticed the massive difference between utf8 and UTF-8? Really. I must have been on some serious crack.

Sigh!

Needless to say my code now looks more like this:

  use Encode;

  ...

  my $data = Encode::decode('UTF-8', $row->{'Stuff'}); ## fuck!

Actually, I was kidding about the "fuck!" I wouldn't swear in code.
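
One more trick I picked up along the way: decode takes a third CHECK argument, so you can make bad bytes fail fast (or get replaced) right at the boundary instead of blowing up somewhere deep inside Perl later. A quick sketch, assuming $bytes holds the raw octets from MySQL:

  use Encode qw(decode);

  # Croak immediately on malformed input; eval turns that into a
  # catchable error instead of a program-killing one.
  my $strict = eval { decode('UTF-8', $bytes, Encode::FB_CROAK) };
  warn "bad bytes: $@" unless defined $strict;

  # Or quietly substitute U+FFFD for anything malformed (the default).
  my $lenient = decode('UTF-8', $bytes, Encode::FB_DEFAULT);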

Posted by jzawodn at 02:10 PM