I've been hacking on some Perl code that extracts data that came from web users around the world and was stored in MySQL (with no real encoding information, of course). My goal is to generate well-formed, valid XML that can be read by another tool.
Now I'll be the first to admit that I never really took the time to like, understand, or pay much attention to all the changes in Perl's character and byte handling over the years. I'm one of those developers that, I suspect, is representative of the majority (at least in this self-centered country). I think it's all stupid and complicated and should Just Work... somehow.
But at the same time I know it's not.
Anyway, after importing lots of data I came across my first bug. Well, okay... not my first bug. My first bug related to this encoding stuff. The XML parser on the receiving end raised hell about some weird characters coming in.
Oh, crap. That's right. This is the big bad Internet and I forgot to do anything to scrub the data so that it'd look like the sort of thing you can cram into XML and expect to maybe work.
A little searching around managed to jog my memory and I updated my code to include something like this:
use Encode;
...
my $data = Encode::decode('utf8', $row->{'Stuff'});
And all was well for quite some time. I got a lot farther with that until this weekend when Perl itself began to blow up on me, throwing fatal exceptions like this:
Malformed UTF-8 character (fatal) ...
My first reaction, like yours probably, was WTF?!?! How on god's green earth is this a FATAL error?
After much swearing, a Twitter plea, and some reading (thanks Twitter world!), I came across a section of the Encode manual page from Perl.
I'm going to quote from it a fair amount here because I know you're as lazy as I am and won't go read it if I just link here. The relevant section is at the very end (just before SEE ALSO) and titled UTF-8 vs. utf8.
....We now view strings not as sequences of bytes, but as sequences of numbers in the range 0 .. 2**32-1 (or in the case of 64-bit computers, 0 .. 2**64-1) -- Programming Perl, 3rd ed.
That has been the perl's notion of UTF-8 but official UTF-8 is more strict; Its ranges is much narrower (0 .. 10FFFF), some sequences are not allowed (i.e. Those used in the surrogate pair, 0xFFFE, et al).
Now that is overruled by Larry Wall himself.
  From: Larry Wall
  Date: December 04, 2004 11:51:58 JST
  To: perl-unicode@perl.org
  Subject: Re: Make Encode.pm support the real UTF-8
  Message-Id: <20041204025158.GA28754@wall.org>
  On Fri, Dec 03, 2004 at 10:12:12PM +0000, Tim Bunce wrote:
  : I've no problem with 'utf8' being perl's unrestricted uft8 encoding,
  : but "UTF-8" is the name of the standard and should give the
  : corresponding behaviour.
  For what it's worth, that's how I've always kept them straight in my head.
  Also for what it's worth, Perl 6 will mostly default to strict but make it easy to switch back to lax.
  Larry
Do you copy? As of Perl 5.8.7, UTF-8 means strict, official UTF-8 while utf8 means liberal, lax, version thereof. And Encode version 2.10 or later thus groks the difference between "UTF-8" and "utf8".
  encode("utf8", "\x{FFFF_FFFF}", 1); # okay
  encode("UTF-8", "\x{FFFF_FFFF}", 1); # croaks
"UTF-8" in Encode is actually a canonical name for "utf-8-strict". Yes, the hyphen between "UTF" and "8" is important. Without it Encode goes "liberal"
  find_encoding("UTF-8")->name # is 'utf-8-strict'
  find_encoding("utf-8")->name # ditto. names are case insensitive
  find_encoding("utf8")->name # ditto. "_" are treated as "-"
  find_encoding("UTF8")->name # is 'utf8'.
Got all that?
The sound you heard last night was me banging my head on a desk. Repeatedly.
I mean, how could I have possibly noticed the massive difference between utf8 and UTF-8? Really. I must have been on some serious crack.
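If you'd rather see the difference than take the manual's word for it, here is a minimal sketch that asks Encode what each spelling resolves to. The canonical_name helper is a made-up name for illustration, not part of Encode; find_encoding and ->name are the calls from the manual excerpt. Going by that excerpt, the hyphenated spellings should come back as utf-8-strict and the unhyphenated ones as utf8.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of Encode): map an encoding label to the
# canonical name Encode resolves it to, or undef if Encode doesn't know it.
sub canonical_name {
    my ($label) = @_;
    my $enc = Encode::find_encoding($label);
    return defined $enc ? $enc->name : undef;
}

# Print what each spelling really means to Encode.
for my $label ('UTF-8', 'utf-8', 'utf_8', 'UTF8', 'utf8') {
    printf "%-5s => %s\n", $label, canonical_name($label) // '(unknown)';
}
```

Running something like this once, before picking a spelling, would have saved me a weekend.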
Sigh!
Needless to say my code now looks more like this:
use Encode;
...
my $data = Encode::decode('UTF-8', $row->{'Stuff'}); ## fuck!
Actually, I was kidding about the "fuck!" I wouldn't swear in code.
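For anyone who wants bad input to be caught at decode time instead of surfacing later, here is a rough sketch of one way to do it. It assumes you pass Encode's FB_CROAK check value so that decode dies on malformed bytes, and it wraps that in eval. The decode_or_undef helper and the sample row are made up for illustration; this is not the code I actually run.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode octets as strict UTF-8 and return undef
# (instead of dying) when the input is malformed.
sub decode_or_undef {
    my ($octets) = @_;
    return undef unless defined $octets;

    # decode() may modify its argument when a CHECK value is given,
    # so work on a copy and leave the caller's data alone.
    my $copy = $octets;
    my $text = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK()) };
    return $text;    # undef if decode croaked
}

# Stand-in for a database row; the real one comes from MySQL.
my $row  = { Stuff => "caf\xC3\xA9" };
my $data = decode_or_undef($row->{'Stuff'});
print defined $data ? "decoded ok\n" : "skipping row with malformed UTF-8\n";
```

Whether you skip, log, or repair the bad rows is a separate decision; the point is only that the failure happens where you can see it.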
Posted by jzawodn at September 02, 2008 02:10 PM
I saw your tweet and while I had nothing to add, I felt your pain. Wow. What a treat to waste several days on that. It's easy to say "we're going to be international from the get go and get this right" but harder to do.
This probably explains some of the crap characters I get in Firefox 3.0.1 from sites like segelflug.de despite trying to tame it.
Some similar UTF8 funkiness bit me a few weeks ago.
When concatenating UTF8 and non-UTF8 strings, Perl double-encodes the UTF8, leading to pain and misery.
What Perl needs is something like the encoding notation Python has - as in u"My UTF8 string" -- which would be one way around this.
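The double encoding described above can be reproduced with a few lines. This is a minimal sketch under one assumption: the "non-UTF8 string" is really a string of undecoded UTF-8 bytes, which Perl treats as Latin-1 characters when it is joined to a decoded string. The variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode ();

my $bytes = "caf\xC3\xA9";                    # raw UTF-8 octets, never decoded
my $text  = Encode::decode('UTF-8', $bytes);  # character string ending in U+00E9

# Joining the two: each raw octet in $bytes is treated as one Latin-1 character.
my $mixed = $text . ' ' . $bytes;

# Encoding the mix writes \xC3 and \xA9 out as two bytes each,
# so the second "cafe" ends in C3 83 C2 A9 instead of C3 A9.
my $out = Encode::encode('UTF-8', $mixed);
print unpack('H*', $out), "\n";
```

Decoding everything once on the way in and encoding once on the way out is the usual way to avoid the mix in the first place.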
These days I won't touch foreign XML until it's been through:
$ iconv -fUTF-8 -tUTF-8 file.xml
$ xmllint --recover file.xml
The second is particularly upsetting to some but such is the way of the world.
Wow, that's just awful. Especially the case *insensitivity* of the names, on TOP of the similarity. "utf8" should just be called "Larry-TF-8".
use utf8;
if (utf8::valid($somestring))
{
    utf8::decode($somestring);
}
This should work better.
find_encoding("utf8")->name # ditto. "_" are treated as "-"
should be
find_encoding("utf_8")->name # ditto. "_" are treated as "-"
use diagnostics or splain or google
http://perldoc.perl.org/perldiag.html#Malformed-UTF-8-character-(%25s)
http://perldoc.perl.org/Encode.html#Handling-Malformed-Data
http://perldoc.perl.org/Encode.html#UTF-8-vs.-utf8-vs.-UTF8
Yes, I know it's not fun to RTFM (search), but there you go.
Speaking of utf-8, you've got a bunch of mojibake in your post -- I see a ton of �
Off on a tangent... Those little question marks (�) always remind me of Shoemoney's logo :)
Wow!
This should be called "WTF-8"!
my $data = Encode::decode('WTF-8', $row->{'Stuff'});
Jeremy,
You are a life saver.
I was banging my head over an apparently simple parsing issue.
It turned out, after trying various parsers and analysis ...
UTF-8 was the killer.
Thanks for the post.