UTF-8 A-Go-Go

Recently I finished (more or less) converting a Perl/CGI/MySQL website application to use UTF-8 throughout.

The CGI module and the DBI module currently have lousy character encoding support, so I created Perl packages to fix them (relatively) transparently.

Here’s how, and here’s my code:

UPDATE: I’ve just updated the code based on others’ feedback, for which, many thanks. See comments below. (Jan 2007)

Abstract

I here present slap-on fixes for two really common modules in Perl-5.8-based web development—CGI.pm and DBI.pm—to allow them to properly support character encoding within an all-UTF-8 application. Both fixes can be installed with an application, without requiring admin access to the server (so ideal for a web hosted app); both fixes operate as a thin filter layer, requiring minimal changes to your app. The CGI fix is designed to deal correctly with various web browser character encodings, and the DBI fix is designed to allow you to use pure UTF-8 in your database, even when your driver and version of MySQL don’t properly support it.

I wanted to Do The Right Thing with character encoding, and these are the cheap hacks which allowed me to do it. Please feel free to reuse in your project.

Background

ASCII no longer cuts the mustard. Your software needs to deal with

Accented characters and funny foreign symbols: 4€, Strauß, mañana, Československo.
Professional-level punctuation, like ‘smart quotes’—dashes—and →arrows, daggers†‡ and whatever★.
Greek letters: 3.5Ω, α-β, 2μV
Really foreign languages, like Japanese or Welsh.

There are other approaches, but the UTF-8 character encoding has a host of advantages (not least ASCII-compatibility). Most text-processing systems, including Perl are moving towards it. Joel Spolsky has an excellent overview of Unicode and character encodings that every programmer should read.

Perl 5.8 provides decent UTF-8 support and very good support for converting between other character encodings. (See perldoc topics perluniintro and perlunicode.)

However, support for UTF-8 in libraries is patchy. Common libraries are ‘CGI’, for parsing CGI input* to server-side script, and ‘DBI’, for database access, neither of which (at least the versions I’ve seen installed anywhere), have any character encoding support, and both of which can easily mangle your carefully-wrought UTF-8.

The libraries described below, ‘UTF8CGI’ and ‘UTF8DBI’ are designed to be post facto fixes for these two common packages.

* Some people use CGI package to write HTML. I don’t. My modifications below don’t address this. However, it will probably work fairly seamlessly.

CGI.pm

Download UTF8CGI.pm

The CGI library has been around since the dawn of time. It’s the cannonical way for a Perl script to parse CGI parameter and form input. Some people even use it to generate HTML. (In my opinion it should be replaced with something less monolithic and with better standards support, but that’s just my opinion.)

Issues with the existing module

CGI.pm completely ignores character encodings from client.
Many (especially older) web browsers don’t tell you what encoding their form input is in.
The HTTP standard isn’t much help and conveniently ignore the issues. Frankly it’s all a rather sorry mess. See a rather good article on the subject

Solutions

My solution was to create a subclass of CGI which magically determines the input character encoding (or at least makes a decent stab in the dark), and converts it all into UTF-8, so that your application really doesn’t need to worry about it.

This subclass, UTF8CGI should be imported into wherever you “use CGI”. Instead of creating a “new CGI” object, just create a “new UTF8CGI” object instead. Your code might look like:

#!/usr/bin/perl -wT

use utf8; # If the script itself is written in UTF-8.
use strict; # Be strict. Note we gave ‘-wT’ flags to Perl too, for extra protection.
use CGI qw/:cgi -private_tempfiles/; # …or whatever parameters you like.
use UTF8CGI; # Import the fix.

my $query = new UTF8CGI; # Create a magically-auto-detect-and-convert-to-UTF-8-ed CGI object

Notes:

Should work with Perl5.8 or later, and pretty much any version of CGI. (Tested with Perl 5.8.0 and 5.8.4 and CGI 2.89 and 3.04.)
I’ve only tested this with object-syntax for CGI. It may work with functional syntax, but I don’t know. (Object syntax is cleaner anyway.)
Yes, you need to do this every time you invoke CGI. But why not do what I do, and hide it behind a further abstraction, that all of your CGI Perl scripts can invoke?

DBI.pm

Download UTF8DBI.pm

Issues with the existing module

Although most modern versions of database engines have character-encoding support, the current versions of Perl DBI don’t really.

Queries return strings as raw bytes, even when the database columns contain correctly-formed UTF-8. To correctly use such strings, the application must ‘decode’ every string from the database, which is error-prone and tedious. I found a good summary of the issues.

Solutions

My approach was to create a subclass of DBI, which puts a lightweight wrapper around every database query, and UTF-8-decodes every string from every query result.

Note that, because of the simple(minded) approach, the wrapper processes every single field of every row of every query, even if it contains numeric data. Potentially, with large data sets, this could become a significant overhead. Suck it and see.

Also, this requires that every string returned from any queries is either ASCII or valid UTF-8. If your database contains strings encoded as Latin-1, or any other character set, they will come back mangled. (However, if your database contains a mixture of character encodings, and you’re using DBI, you already have problems.) So before using, ensure all your database string data is UTF-8.

To use:

Ensure that UTF8DBI is in your library path
Replace ‘use DBI;’ with ‘use UTF8DBI;’
Replace ‘DBI->connect’ with ‘UTF8DBI->connect’

For example:

#!/usr/bin/perl -wT

use utf8; # The script itself is written in UTF-8
use strict; # Be strict. Note we gave ‘-wT’ flags to Perl too, for extra protection.
use UTF8DBI; # Import the fix.

my $dbh = UTF8DBI->connect( # Create a magically-UTF-8-decoding database handle
“DBI:mysql:$dbname:$host”,
$username, $password
)
or die $UTF8DBI->errstr;

Notes:

Should work with Perl5.8 or later, and DBI 1.21 and later. (Tested with Perl 5.8.0 and 5.8.4 and DBI 1.32 and 1.46.)
If we find a string with malformed UTF-8, that string is replaced by a string “malformed-UTF8”. You could edit the code and replace this with a ‘die’, so that the code fails if the database is found to contain badly formed UTF-8, but I found the replaced string to be quite useful while debugging.

Summary

I use them, and they work.

They’re small and lightweight.

However, caveat emptor. You may need to tweak them to get them to work. I’m not providing any technical support. You have the source code, so if they blow up your computer and crash your grandmother, you’re on your own.

However, I hope they’re of use to you. If I save you from pulling your hair out as a result of pesky character-encoding issues, I’ll be very glad.

I don’t have much time to maintain these modules, but please feed back any improvements or bug fixes you make.

18 thoughts on “UTF-8 A-Go-Go”

Delboy 6 February 2006 at 1:17 pm

meh

;-)

You actually do stuff like this in your /spare time/ Andrew? And on a Sunday too?.

Well … as Sheryl Crow sang – “if it makes you happy, it can’t be that baaa-a-a-ad”. (apart from drugs, of course. And marzipan)
Andrew Post author6 February 2006 at 3:02 pm

Well, I had done it already. Oddly, it probably took just as long again to write it up properly and post it on my blog, with links to more information.

Ho hum.
Delboy 7 February 2006 at 1:21 pm

Are you saying you didn’t do it properly first time round? A shocking admission for a BSc. Comp. Sci. graduate! ;-)
Andrew Post author12 February 2006 at 6:49 pm

No, no! By “write it up properly”, I mean “write a blog entry about it, (which kinda doubles as documentation.”)
Pubesy 17 February 2006 at 9:10 pm

I take issue with the marzipan reference!

I see that Dysphoria is keeping up the tradition of hard core geekdom to the highest level. I was lost after UTF-8…

All in all very impressive Mr. Forrest but not quite at the level of the Panda’s VAMP forum I fear. Unless the hit count goes through the roof as all the UTF-8 inhibited zombies come a calling!

Good Luck. And remember…be careful out there.
Pavel Kudinov 2 April 2006 at 5:03 pm

Hi!

I think, UTF8DBI.pm need a little bug fix:

line 14:
(ref $row eq ‘ARRAY’ ? \&encode_array : \&encode_hash)->($row);

for this: $dbh->selectall_arrayref(‘…’, {‘Slice’=>{}}, …);
(this unpopular form will return a very poplular array of hashes ;)
Pavel Kudinov 2 April 2006 at 7:03 pm

Hi again!

;)
You should wrap DBI::st::fetch* rather than DBI::db::select* because DBI::db::select* based on DBI::st::fetch*!

This is my imho more pretty re-implementaion of your module:

http://perl7.ru/lib/UTF8DBI.pm
Andrew Scheller 5 June 2006 at 10:42 am

This is just brilliant!
Thanks very much Andrew and Pavel :)
Andrew Scheller 5 June 2006 at 1:35 pm

Ooops, spoke too soon… ;)
After doing some proper testing, I believe I’ve got a bugfix for UTF8CGI.pm :-
For ActivePerl 5.8.8 on Windows 2003 (dunno about other versions/platforms) I needed to change line 53 from:
foreach my $val ($query->param($key)) {
to instead say:
foreach my $val (@{$query->param_fetch($key)}) {
otherwise the param values remain unchanged after the loop.

I also think it’s safe to get rid of “use utf8;” as the only non-ASCII characters are the sloped-apostrophes within comments on lines 49 and 51.
Andrew Scheller 5 June 2006 at 4:46 pm

One more addendum – bet you can’t guess what I’ve spent all day doing! ;)

I think you need to add:
next if $query->upload($key);
after line 50 of UTF8CGI.pm “foreach my $key ($query->param) {”
Otherwise the filehandle (when grabbed with a later query->upload) seems to be invalid…?
Andrew Post author6 June 2006 at 11:18 am

Thanks for all the feedback.

I haven’t been ignoring you, I’ve just been really busy. I plan to update my code with Pavel and Andrew S’s comments in a couple of weeks (as soon as this show is out of the way).

Best regards,

–A
Andrew Scheller 27 June 2006 at 10:48 am

Another update for you :-)
perl -wc UTF8DBI.pm complains with:
UTF8DBI::st::_utf8_() called too early to check prototype at UTF8DBI.pm line 10.

UTF8DBI::st::_utf8_() called too early to check prototype at UTF8DBI.pm line 11.

Apparently (according to google ;) ) the fix is to either forward-declare the _utf8_ sub, or to not use prototypes. I chose the former option (because I’m not familiar with prototypes) and managed to get rid of the warning by inserting:
sub _utf8_();
on the line before the “real” sub.
Pavel Kudinov 24 August 2006 at 4:46 pm

Andrew, i just prefer “no warnings”, but “use strict”. i’m just think that perl warnings concept wrong for pretty code writing style.
Andrew Post author30 August 2006 at 10:54 pm

I’m afraid I’m a mere weekend (well, and evenings) Perl programmer. I need all the help I can get, so usually switch on strictness, warnings, and taint mode, too, if the situation seems to benefit from it.
Andrew Post author7 January 2007 at 2:54 pm

Yup.

I’ve applied Andrew Scheller’s two fixes to UTF8CGI. Not quite sure how I’d missed them.

I’ve also changed UTF8DBI for Pavel Kudinov’s version, but for a couple of changes:
* Removed the prototype on the _utf8_ function.
* Added back in the check for duff UTF8. Belt and braces.
* Removed all but one of the fetch… functions, since, it seems, all of the other statement-handle fetch functions go through that one.
* Now that should be all that’s required, in theory, since, as Pavel says, the DBI::db select… functions should just be short ways of getting at the st functions, but that doesn’t seem to be the case. It wouldn’t work without me reimplementing the select… functions too. (At least, this is the case in my version of the DBI module, which, as you can imagine, is the one that I care about.)

Have fun,

–A
Allan Engelhardt 21 January 2009 at 12:16 pm

For some database drivers the UTF8DBI modules is not longer needed:

DBD::Pg (PostgreSQL) : Use { pg_enable_utf8 => 1 } in the attributes to DBI->connect (e.g. my $dbh=DBI->connect($dsn, $user, $pass, { AutoCommit => 1, RaiseError => 0, pg_enable_utf8 => 1 };) [Tested.]

DBD::mysql (MySQL) : Similarly, use { mysql_enable_utf8 => 1 }. [Not tested and marked “experimental” at http://search.cpan.org/~capttofu/DBD-mysql-4.010/lib/DBD/mysql.pm#mysql_enable_utf8 so try to see if it works for you]

DBD::SQLite (SQLite) : Setting “$dbh->{unicode} = 1” will turn on the utf8 flag on all strings. [Not tested, see http://search.cpan.org/~msergeant/DBD-SQLite-1.14/lib/DBD/SQLite.pm#unicode%5D

Hope this note will help others….
Andrew Post author21 January 2009 at 10:02 pm

Thanks Allan. It was certainly written as a band-aid, not a proper fix… it’s nice that people can start to take the band-aid off. Though it’s certainly taken a while for the wound underneath to heal.

(This painfully extended metaphor is brought to you by CPAN and Johnson & Johnson.)
Oleg Kuznetsov 28 December 2009 at 1:25 am

DBD::mysql v4.011 supports {mysql_enable_utf8 => 1} key, but it does not setup utf8 flag, unfortunately =(

perl 5.10
dbi 1.609