Dave Schweisguth in a Bottle

How many meanings of that can you think of?

latin1, utf8, Rails 2.3.9 and mysql2


GWW’s database.yml (which originated under Rails 1.something) has never specified an encoding. We seem to have gotten away with it because MySQL defaults to latin1, our database is latin1 (probably because MySQL defaults to it) and the mysql gem was happy with that default. It was clear that non-ASCII characters were being handled correctly, because when using GWW one often sees Flickr usernames that contain them.

Discussions of encodings often include a statement like “start using UTF-8 now, because more and more software will assume it in the future”. They were right, and the future is now. Working towards upgrading to Rails 3, I recently updated GWW to Rails 2.3.10 and replaced the mysql gem with the mysql2 gem (which Rails 2.3.9 and later require). I’m pretty sure this is when GWW began dropping the ball on non-ASCII characters, displaying them all as question marks. I didn’t actually roll back the upgrade to prove it, and I don’t know whether the cause is Rails 2.3.9/10 or mysql2. I was more interested in fixing the problem quickly, which, fortunately, turned out to be possible.

A little research reminded me that pretty much everyone else in the world has “encoding: something” in their database.yml, so I added “encoding: latin1” to GWW’s. That fixed the problem for text that had been stored in the database before the upgrade. Uh-oh: text with non-ASCII characters that had been stored in the database after the upgrade had been mangled and still displayed incorrectly. Fortunately, most GWW data is refreshed from Flickr daily, and refreshing again fixed the mangled data. 2πrad, ɲℓιŦεɲđ1 and 猫娘/ nekomusume are once again displayed in all their multi-byte glory.
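For anyone wondering where the question marks could come from, here is a minimal sketch in plain Ruby (1.9 or later). It is an illustration under assumptions, not a diagnosis: String#encode and force_encoding stand in for whatever conversion the database connection performs, and nothing here is what mysql2 or MySQL actually runs. It shows that converting the kanji to latin1 is lossy, while merely relabeling the UTF-8 bytes as latin1 is not.

```ruby
# encoding: utf-8
# Sketch only: Ruby string conversions standing in for a connection-level
# character set conversion. This is an assumption for illustration.

name = '猫娘/ nekomusume'

# latin1 (ISO-8859-1) has no kanji, so each unrepresentable character
# is replaced by a question mark during conversion.
lossy = name.encode('ISO-8859-1', :undef => :replace, :replace => '?')

# Relabeling the UTF-8 bytes as latin1 changes no bytes: every byte is
# simply treated as one latin1 character.
relabeled = name.dup.force_encoding('ISO-8859-1')

# Relabeling the same bytes as UTF-8 again restores the original text.
round_trip = relabeled.dup.force_encoding('UTF-8')

puts lossy.encode('UTF-8')
puts relabeled.length
puts round_trip == name
```

The second half is presumably why a latin1 database could hold multi-byte usernames at all: as long as both ends agree to pass the bytes through untouched, nothing is lost.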

Here’s the unit test:

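describe Person do
  describe '#username' do
    it 'should handle non-ASCII characters' do
      non_ascii_username = '猫娘/ nekomusume'
      # Save a Person with the non-ASCII username
      Person.make! :username => non_ascii_username
      # Read it back so the value makes a round trip through the database
      Person.all[0].username.should == non_ascii_username
    end
  end
end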

This test passes when database.yml contains “encoding: latin1” and fails when it does not. Wish I’d thought to write it first.

To sum up: be sure to specify an encoding: in database.yml before upgrading to Rails 2.3.9 and mysql2, or risk corrupted text.

Note that GWW’s database.yml did have

  charset:   latin1
  collation: latin1_swedish_ci

This made ‘rake db:create’ create a database with the correct character set and collation, and it’s still necessary for that purpose with Rails 2.3.10 and mysql2. However, judging from the above it has no effect on the rest of Rails, and if your database character set is latin1 you need all three parameters in database.yml:

  encoding:  latin1
  charset:   latin1
  collation: latin1_swedish_ci
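If you want to catch a missing parameter before it bites, a small check along these lines might help. This is a sketch, not part of GWW: the helper name missing_encoding_keys, the adapter line and the database name gww_production are all made up for illustration, and the YAML is inlined so the example runs on its own instead of reading a real config/database.yml.

```ruby
require 'yaml'

# Hypothetical database.yml contents. The adapter and database name are
# invented; only the last three keys come from the post.
config = YAML.load(<<-YML)
production:
  adapter:   mysql2
  database:  gww_production
  encoding:  latin1
  charset:   latin1
  collation: latin1_swedish_ci
YML

# The three parameters a latin1 database needs, per the post.
REQUIRED_KEYS = %w[encoding charset collation]

# Hypothetical helper: returns the required keys absent from one
# environment's settings hash.
def missing_encoding_keys(settings)
  REQUIRED_KEYS.reject { |key| settings.key?(key) }
end

missing = missing_encoding_keys(config['production'])
puts missing.empty? ? 'all three parameters present' : "missing: #{missing.join(', ')}"
```

In a real application the same check could read the file with YAML.load_file and loop over every environment.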

Mind you, my next step will be to migrate the database from latin1 to utf8 and join the future. With any luck that will be uneventful enough to not require another post.

Update: No, this recipe worked just fine.


Written by dschweisguth

February 16, 2011 at 11:02

Posted in Programming, Rails, Ruby

One Response


  1. Thanks Dave, it was really helpful..


    February 13, 2013 at 22:54
