latin1, utf8, Rails 2.3.9 and mysql2
GWW’s database.yml (which originated under Rails 1.something) has never specified an encoding. We seem to have gotten away with it because MySQL defaults to latin1, our database is latin1 (probably because MySQL defaults to it) and the mysql gem was happy with that default. Non-ASCII characters were clearly being handled correctly, because when using GWW one often sees Flickr usernames that contain them.
Discussions of encodings often include a statement like “start using UTF-8 now, because more and more software will assume it in the future”. They were right, and the future is now. Working towards upgrading to Rails 3, I recently updated GWW to Rails 2.3.10 and replaced the mysql gem with the mysql2 gem (which Rails 2.3.9 and later require). I’m pretty sure this is when GWW began dropping the ball on non-ASCII characters, displaying them all as question marks. I didn’t actually roll back the upgrade to prove it, and I don’t know whether the culprit is Rails 2.3.9/10 or mysql2. I was more interested in fixing the problem quickly, and fortunately a quick fix was possible.
A little research reminded me that pretty much everyone else in the world has “encoding: something” in their database.yml, so I added “encoding: latin1” to GWW’s. That fixed the problem for text that had been stored in the database before the upgrade. Uh-oh: text with non-ASCII characters that had been stored in the database after the upgrade had been mangled and still displayed incorrectly. Fortunately, most GWW data is refreshed from Flickr daily, and refreshing again fixed the mangled data. 2πrad, ɲℓιŦεɲđ1 and 猫娘/ nekomusume are once again displayed in all their multi-byte glory.
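Why the connection encoding could matter this much can be sketched in plain Ruby (1.9 or later, for String#encode). This is only a simulation of one plausible mechanism, not something verified against GWW or MySQL: it assumes that a latin1 connection passes the application’s UTF-8 bytes through untouched, that a utf8 connection makes the server transcode into the latin1 column, and that Ruby’s ISO-8859-1 is a close enough stand-in for MySQL’s latin1. The variable names are made up for illustration.

```ruby
# encoding: utf-8
# Sketch only: simulates, without any database, what a latin1 connection
# and a utf8 connection might each do to the same UTF-8 username.

username = '猫娘/ nekomusume'

# Assumed case 1: connection declared as latin1.
# The server treats each incoming byte as one latin1 character and stores
# the bytes unchanged. No transcoding happens, so nothing is lost.
stored = username.dup.force_encoding('ISO-8859-1')

# Reading back over the same kind of connection returns the same bytes,
# which the application interprets as UTF-8 again.
read_back = stored.dup.force_encoding('UTF-8')

# Assumed case 2: connection declared as utf8, column still latin1.
# The server has to transcode, and characters that latin1 cannot
# represent are replaced with "?". The original text is gone for good.
transcoded = username.encode('ISO-8859-1', :undef => :replace, :replace => '?')

puts read_back == username   # true
puts transcoded              # ??/ nekomusume
```

If this is roughly what happened, it would explain why rows written after the upgrade could not be repaired by changing database.yml alone and had to be refreshed from Flickr.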
Here’s the unit test:
describe Person do
  describe '#username' do
    it 'should handle non-ASCII characters' do
      non_ascii_username = '猫娘/ nekomusume'
      Person.make! :username => non_ascii_username
      Person.all[0].username.should == non_ascii_username
    end
  end
end
This test passes when database.yml contains “encoding: latin1” and fails when it does not. Wish I’d thought to write it first.
To sum up: be sure to specify an encoding: in database.yml before upgrading to Rails 2.3.9 and mysql2, or risk corrupted text.
Note that GWW’s database.yml did have
charset: latin1
collation: latin1_swedish_ci
This made ‘rake db:create’ create a database with the correct character set and collation, and it’s still necessary for that purpose with Rails 2.3.10 and mysql2. However, judging from the above it has no effect on the rest of Rails, and if your database character set is latin1 you need all three parameters in database.yml:
encoding: latin1
charset: latin1
collation: latin1_swedish_ci
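A quick way to guard against forgetting one of these is a small check that parses the configuration and reports missing keys. The following is a minimal sketch, not part of GWW: the adapter line, the database name gww_development, the constant REQUIRED_KEYS and the method missing_charset_keys are all hypothetical, and a real check would read config/database.yml from disk rather than an inline string.

```ruby
require 'yaml'

# Hypothetical database.yml fragment; the adapter and database name are
# invented for illustration. Only the last three keys come from the post.
config = YAML.load(<<-YML)
development:
  adapter: mysql2
  database: gww_development
  encoding: latin1
  charset: latin1
  collation: latin1_swedish_ci
YML

# The three parameters the post says a latin1 database needs.
REQUIRED_KEYS = %w[encoding charset collation]

# Returns the required keys that are absent from one environment's settings.
def missing_charset_keys(env_config)
  REQUIRED_KEYS.reject { |key| env_config.key?(key) }
end

missing = missing_charset_keys(config['development'])
puts missing.empty? ? 'ok' : "missing: #{missing.join(', ')}"
```

With the fragment above the script prints ok; removing the encoding line would make it print the missing key instead.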
Mind you, my next step will be to migrate the database from latin1 to utf8 and join the future. With any luck that will be uneventful enough to not require another post.
Update: No, this recipe worked just fine.
Thanks Dave, it was really helpful..
syed
February 13, 2013 at 22:54