I need some file encoding help...

Post by Azhrei »

I have a German translation file (hello aPown!) and I can't seem to get its file encoding right for Java.

I've tried to phrase this question simply, but the topic can get a little complicated so if you don't deal with translations very much you probably want to just avoid this thread. ;)

(Before you comment on the following paragraphs: Yes, I know it's really called "UTF-8" but I'm typing "UTF8" because it's easier and I think it looks better. ;))

The file is UTF8 encoded, but it uses \u2028 as the end of line character. Here's a hex dump of the file contents:

Code:

0000000: efbb bf23 4765 726d 616e 2076 6572 7369  ...#German versi
0000010: 6f6e 2030 2e36 3620 2830 372e 3035 2e32  on 0.66 (07.05.2
0000020: 3031 3029 2062 7920 6150 6f77 6e2c 2070  010) by aPown, p
0000030: 726f 6f66 2d72 6561 6420 6279 2049 6d70  roof-read by Imp
0000040: 2ee2 80a8 2357 656e 6e20 6574 7761 7320  ....#Wenn etwas 
0000050: 5369 6e6e 6765 6dc3 a4c3 9f20 6661 6c73  Sinngem.... fals
0000060: 6368 20c3 bc62 6572 7365 747a 7420 7775  ch ..bersetzt wu

Every # character you see on the right marks a comment in the translation file, and it should appear in column zero of the file. The file starts with ef bb bf, which is the UTF8 byte order mark. After that first line, the hex bytes immediately preceding each "#" are always e2 80 a8, which is the UTF8 encoding of \u2028, the Unicode LINE SEPARATOR character.
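
As a quick check of that reading of the dump, here is a minimal Python sketch that decodes the two byte sequences. The byte values are copied from the hex dump; the variable names are only illustrative.

```python
import unicodedata

# Bytes copied from the hex dump: the first three bytes of the file and
# the three bytes that precede the second "#".
bom = bytes.fromhex("efbbbf")
separator = bytes.fromhex("e280a8")

# The leading bytes decode to U+FEFF, the byte order mark.
print(bom.decode("utf-8") == "\ufeff")

# The bytes before "#" decode to a single character.
char = separator.decode("utf-8")
print(f"U+{ord(char):04X}", unicodedata.name(char))  # U+2028 LINE SEPARATOR
```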

I need some way to convert this to Java's \u#### escape format. If I run Java's native2ascii -encoding UTF-8 on it, each separator is converted to \u2028, the escape for the Unicode line separator, but Java apparently only accepts LF (\u000a) or CR (\u000d), i.e. the ASCII line-ending characters, as the end of a line.

My plan right now is to use native2ascii and convert the file, then do a global search and replace and change \u2028 into LF. But I was hoping for a cleaner solution that I could incorporate into the MapTool build process and automate.
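
One way to fold both steps into a single pass, sketched in Python rather than with native2ascii itself. This is an assumption about what a build step could look like, not part of the MapTool build: the function name to_java_escapes is hypothetical, and it only approximates native2ascii by escaping every non-ASCII character.

```python
def to_java_escapes(raw: bytes) -> str:
    """Decode UTF-8 bytes and return ASCII text using \\uXXXX escapes."""
    # "utf-8-sig" drops a leading byte order mark (EF BB BF) if present.
    text = raw.decode("utf-8-sig")
    # Turn each U+2028 LINE SEPARATOR into an ordinary LF.
    text = text.replace("\u2028", "\n")
    out = []
    for ch in text:
        code = ord(ch)
        if code < 0x80:
            out.append(ch)
        elif code <= 0xFFFF:
            out.append(f"\\u{code:04x}")
        else:
            # Characters outside the BMP become a UTF-16 surrogate pair.
            code -= 0x10000
            high = 0xD800 + (code >> 10)
            low = 0xDC00 + (code & 0x3FF)
            out.append(f"\\u{high:04x}\\u{low:04x}")
    return "".join(out)


# A BOM, a comment, a LINE SEPARATOR, and a comment with German characters.
sample = bytes.fromhex("efbbbf") + "#Imp.\u2028#Sinngemäß".encode("utf-8")
print(to_java_escapes(sample))
# #Imp.
# #Sinngem\u00e4\u00df
```

Dropping the byte order mark here is a deliberate choice in the sketch, so that the first "#" really does land in column zero.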

For example, if I take a translation file and run it through native2ascii and it comes out the same, then no conversion is needed. If there is a difference, then I can save the output as the new translation file. I'm looking for something along those lines.
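
The "did anything change?" test can be approximated without running the converter at all, since the conversion only touches non-ASCII bytes. A rough Python sketch, with needs_conversion as a hypothetical helper name; it also flags a literal \u2028 escape left behind by an earlier native2ascii run, and it does not try to handle an escaped backslash in front of u2028.

```python
def needs_conversion(raw: bytes) -> bool:
    """Return True if the file bytes would change under conversion."""
    # Any byte >= 0x80 means the file is not plain ASCII yet.
    has_non_ascii = any(byte >= 0x80 for byte in raw)
    # A literal backslash-u2028 escape still has to become a real LF.
    has_escaped_separator = b"\\u2028" in raw
    return has_non_ascii or has_escaped_separator


print(needs_conversion(b"#German version\nkey=value\n"))    # False
print(needs_conversion("key=Sinngemäß\n".encode("utf-8")))  # True
print(needs_conversion(b"#first\\u2028#second\n"))          # True
```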

Side question: I've opened the file in both OSX's TextEdit and Vim and I can't get either one to write the file correctly! If I tell TextEdit to write it as ASCII (or anything other than UTF8) I get an error that it can't convert. And if I try to paste the text into Vim it seems to ignore the line endings completely. :(

I'm going to use my search-and-replace technique for right now because I want to put out RC5 tonight, but I'd like any input on this that folks might have.

Post by Craig »

You should be able to edit the file with vim using
gvim "+set encoding=utf-8" filename
or mvim if you're using MacVim. Use the GUI version; that way you don't need to make sure that your terminal
is set up correctly to display UTF-8.

If this still doesn't work then you can open it as a binary (on a non Windows machine).
vim -b filename
Then
:%!xxd
The xxd command comes with most installations of vim
Edit the resulting hex (just a search and replace)
Then
:%!xxd -r
To turn it back into binary and save
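
The same hex-level search and replace can be sketched in Python for anyone without vim and xxd to hand. It works on the raw bytes without decoding them, which is safe for UTF-8 because the three-byte sequence cannot appear inside the encoding of any other character. The function name is hypothetical.

```python
SEPARATOR_BYTES = bytes.fromhex("e280a8")  # U+2028 encoded as UTF-8


def replace_separator_bytes(raw: bytes) -> bytes:
    """Replace every UTF-8 encoded U+2028 with a single LF byte."""
    return raw.replace(SEPARATOR_BYTES, b"\n")


# "Imp." + separator + "#Wenn", taken from the hex dump in the first post.
sample = bytes.fromhex("496d702ee280a82357656e6e")
print(replace_separator_bytes(sample))  # b'Imp.\n#Wenn'
```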

Post by Azhrei »

Craig wrote:
You should be able to edit the file with vim using
gvim "+set encoding=utf-8" filename
or mvim if using mac vim. Use the GUI version that way you don't need to make sure that your terminal
is set up correctly to display utf-8.

Using mvim under the GUI, the encoding already shows up as UTF8. It appears that "fileencoding" (shortcut "fenc") is the option that actually controls conversion when reading/writing the file, but it didn't seem to do anything when I tried it. Maybe a bug in vim? :?

Craig wrote:
If this still doesn't work then you can open it as a binary (on a non Windows machine).
vim -b filename
Then
:%!xxd
The xxd command comes with most installations of vim
Edit the resulting hex (just a search and replace)
Then
:%!xxd -r
To turn it back into binary and save

Good point. I wrote a one-line script to do the same thing, but I'd like to have a "platform solution" so that I can tell submitters, "do such-and-such and it'll be alright" if I get one that's gobbledygook. ;)

Code:

native2ascii i18n_xx.properties | perl -pe 's/\\u2028/\n/g;' > i18n.txt

And now i18n.txt has the right contents. That's just sort of an ugly kludge to add to my build, though. 8)

Thanks for the ideas. I have a working translation file now and it's part of b82 (just released) so the immediate crush is off.
