The Perl UTF-8 and utf8 Encoding Mess (by Jeremy Zawodny)

I've been hacking on some Perl code that extracts data that comes from web users around the world and has been stored in MySQL (with no real encoding information, of course). My goal is to generate well-formed, valid XML that can be read by another tool.

Now I'll be the first to admit that I never really took the time to like, understand, or pay much attention to all the changes in Perl's character and byte handling over the years. I'm one of those developers that, I suspect, is representative of the majority (at least in this self-centered country). I think it's all stupid and complicated and should Just Work... somehow.

But at the same time I know it's not.

Anyway, after importing lots of data I came across my first bug. Well, okay... not my first bug. My first bug related to this encoding stuff. The XML parser on the receiving end raised hell about some weird characters coming in.

Oh, crap. That's right. This is the big bad Internet and I forgot to do anything to scrub the data so that it'd look like the sort of thing you can cram into XML and expect to maybe work.

A little searching around managed to jog my memory and I updated my code to include something like this:

  use Encode;

  ...

  my $data = Encode::decode('utf8', $row->{'Stuff'});

And all was well for quite some time. I got a lot farther with that until this weekend when Perl itself began to blow up on me, throwing fatal exceptions like this:

  Malformed UTF-8 character (fatal) ...

My first reaction, like yours probably, was WTF?!?! How on god's green earth is this a FATAL error?

After much swearing, a Twitter plea, and some reading (thanks Twitter world!), I came across a section of the Encode manual page from Perl.

I'm going to quote from it a fair amount here because I know you're as lazy as I am and won't go read it if I just link here. The relevant section is at the very end (just before SEE ALSO) and titled UTF-8 vs. utf8.

    ....We now view strings not as sequences of bytes, but as
    sequences of numbers in the range 0 .. 2**32-1 (or in the case of
    64-bit computers, 0 .. 2**64-1) -- Programming Perl, 3rd ed.

  That has been the perl's notion of UTF-8 but official UTF-8 is more
  strict; Its ranges is much narrower (0 .. 10FFFF), some sequences
  are not allowed (i.e. Those used in the surrogate pair, 0xFFFE, et
  al).

  Now that is overruled by Larry Wall himself.

    From: Larry Wall 
    Date: December 04, 2004 11:51:58 JST
    To: perl-unicode@perl.org
    Subject: Re: Make Encode.pm support the real UTF-8
    Message-Id: <20041204025158.GA28754@wall.org>

    On Fri, Dec 03, 2004 at 10:12:12PM +0000, Tim Bunce wrote:
    : I've no problem with 'utf8' being perl's unrestricted uft8 encoding,
    : but "UTF-8" is the name of the standard and should give the
    : corresponding behaviour.

    For what it's worth, that's how I've always kept them straight in my
    head.

    Also for what it's worth, Perl 6 will mostly default to strict but
    make it easy to switch back to lax.

    Larry

  Do you copy?  As of Perl 5.8.7, UTF-8 means strict, official UTF-8
  while utf8 means liberal, lax, version thereof.  And Encode version
  2.10 or later thus groks the difference between "UTF-8" and "utf8".

    encode("utf8",  "\x{FFFF_FFFF}", 1); # okay
    encode("UTF-8", "\x{FFFF_FFFF}", 1); # croaks

  "UTF-8" in Encode is actually a canonical name for "utf-8-strict".
  Yes, the hyphen between "UTF" and "8" is important.  Without it
  Encode goes "liberal"

    find_encoding("UTF-8")->name # is 'utf-8-strict'
    find_encoding("utf-8")->name # ditto. names are case insensitive
    find_encoding("utf8")->name  # ditto. "_" are treated as "-"
    find_encoding("UTF8")->name  # is 'utf8'.

Got all that?
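
If you'd rather watch it happen than take the manual's word for it, here is a minimal sketch that asks Encode what each spelling resolves to and then tries to encode a lone surrogate (U+D800) both ways. The variable names are mine, the "utf_8" spelling is the one the underscore remark in the manual is presumably about, and I'm assuming a reasonably recent Encode (2.10 or later, per the quote above).

```perl
use strict;
use warnings;
no warnings 'utf8';    # older perls warn when a surrogate is created
use Encode qw(find_encoding encode FB_CROAK LEAVE_SRC);

# Which canonical encoding does each spelling resolve to?
my %canonical = map { $_ => find_encoding($_)->name }
                qw(UTF-8 utf-8 utf_8 UTF8 utf8);
printf "%-6s => %s\n", $_, $canonical{$_} for sort keys %canonical;

# A lone surrogate is a legal Perl character but not legal official UTF-8.
my $surrogate = chr(0xD800);

# LEAVE_SRC matters: without it, a CHECK value lets encode() modify
# $surrogate in place, and the second call would see a different string.
my $check = FB_CROAK | LEAVE_SRC;

# Lax "utf8" should encode it without complaint.
my $lax_ok = defined eval { encode('utf8', $surrogate, $check) };

# Strict "UTF-8" should croak; eval turns the exception into undef.
my $strict_ok = defined eval { encode('UTF-8', $surrogate, $check) };

print "lax accepted surrogate:    ", ($lax_ok    ? "yes" : "no"), "\n";
print "strict accepted surrogate: ", ($strict_ok ? "yes" : "no"), "\n";
```

The point of the second half is that the two names differ in what they accept, not just in how they are spelled: anything the lax encoding lets through can still blow up later, somewhere far from the call that let it in.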

The sound you heard last night was me banging my head on a desk. Repeatedly.

I mean, how could I have possibly noticed the massive difference between utf8 and UTF-8? Really. I must have been on some serious crack.

Sigh!

Needless to say my code now looks more like this:

  use Encode;

  ...

  my $data = Encode::decode('UTF-8', $row->{'Stuff'}); ## fuck!

Actually, I was kidding about the "fuck!" I wouldn't swear in code.
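
For what it's worth, a slightly more defensive version might look like the sketch below. This is not the code I actually run: decode_strict is a hypothetical helper name, and the sample byte strings are made up for illustration. The idea is to make strict decoding fail loudly at the point where the bytes come in, and to catch that failure instead of letting it kill the whole import.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: decode bytes as strict UTF-8 and return undef,
# rather than dying, when they are not valid.
sub decode_strict {
    my ($bytes) = @_;
    return unless defined $bytes;

    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps it
    # from modifying $bytes in place. eval yields undef on failure.
    return eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
}

my $good = decode_strict("caf\xC3\xA9 au lait");   # valid UTF-8 bytes
my $bad  = decode_strict("caf\xE9 au lait");       # Latin-1 byte, invalid as UTF-8

print defined $good ? "good: decoded\n" : "good: rejected\n";
print defined $bad  ? "bad: decoded\n"  : "bad: rejected\n";
```

If you'd rather keep going than reject a row, leaving out the third argument makes decode() substitute U+FFFD (the replacement character) for malformed bytes instead of dying.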

Posted by jzawodn at September 02, 2008 02:10 PM

Reader Comments

# Phil Windley said:

I saw your tweet and while I had nothing to add, I felt your pain. Wow. What a treat to waste several days on that. It's easy to say "we're going to be international from the get go and get this right" but harder to do.

on September 2, 2008 03:06 PM

# said:

This probably explains some of the crap characters I get in Firefox 3.0.1 from sites like segelflug.de despite trying to tame it.

on September 2, 2008 08:11 PM

# Yousef Ourabi said:

Some similar UTF8 funkiness bit me a few weeks ago.

When concatenating UTF8 and non-UTF8 strings, Perl double UTF8 encodes, leading to pain and misery.

What Perl needs is something like the encoding notation Python has - as in u"My UTF8 string" -- which would be one way around this.

on September 2, 2008 09:41 PM

# Harry Fuecks said:

These days I won't touch foreign XML until it's been through:

$ iconv -fUTF-8 -tUTF-8 file.xml
$ xmllint --recover file.xml

The second is particularly upsetting to some but such is the way of the world.

on September 3, 2008 02:55 AM

# Sheeri said:

Wow, that's just awful. Especially the case *insensitivity* of the names, on TOP of the similarity. "utf8" should just be called "Larry-TF-8".

on September 3, 2008 04:37 AM

# İsmail Dönmez said:

use utf8;

if (utf8::valid($somestring))
{
utf8::decode($somestring);
}

this should work better.

on September 3, 2008 11:57 AM

# said:

find_encoding("utf8")->name # ditto. "_" are treated as "-"
should be
find_encoding("utf_8")->name # ditto. "_" are treated as "-"

on September 3, 2008 12:06 PM

# said:

use diagnostics or splain or google

http://perldoc.perl.org/perldiag.html#Malformed-UTF-8-character-(%25s)
http://perldoc.perl.org/Encode.html#Handling-Malformed-Data
http://perldoc.perl.org/Encode.html#UTF-8-vs.-utf8-vs.-UTF8

Yes, I know it's not fun to RTFM (search), but there you go.

on September 3, 2008 11:41 PM

# Joshua Schachter said:

Speaking of utf-8, you've got a bunch of mojibake in your post -- I see a ton of ï¿½

on September 7, 2008 02:31 AM

# Chis said:

Off on a tangent... Those little question marks (ï¿½) always remind me of Shoemoney's logo :)

on September 8, 2008 06:16 AM

# Obvio Capitao said:

Wow!

This should be called "WTF-8"!

my $data = Encode::decode('WTF-8', $row->{'Stuff'});

on February 12, 2009 05:34 AM

# Tom Printy said:

Thanks I just used this today!

-T

on April 4, 2009 03:59 PM

# Vasundhar said:

Jeremy,
You are a life saver.
I was banging my head over some apparently simple parsing issue.
It turned out, after trying various parsers and analysis ...
it's UTF-8 the killer.

Thanks for the post.

on March 2, 2010 09:08 AM
