The anatomy of a file



I like files. Not all of them, but generally speaking, most of them. I don't like PDF files, though. I have developed a kind of love-hate relationship with Microsoft Word's doc(x) file format – kind of the same as with my daily newspaper, Helsingin Sanomat: sometimes they make me angry, but I can't really live without them. Excel files are interesting and I would like to understand them better. Nowadays, I am indifferent towards PowerPoint files (I used to be interested in them as well). The world would do well without them.

Text files are my favourites. Firstly, they come in many different forms; they keep surprising me and only rarely irritate me. Most importantly, if a text file has been written by a human being, its content is usually relevant, not too long, and meant to be read in its entirety. In addition, viewing text files does not require large and complex software.

It is true, though, that text files are often produced automatically or as the result of some kind of process, like the following example. It was a database dump of a certain customer's content, but it was meant to be read by human beings – and it was sent to us for translation. I had to figure out how to convert it into a format that could be imported into the translation tool. To cut a long story short, this is how the file was prepared for translation.


The first step was to check the file contents and see if there was a filter or parser for the file format available. Filters and parsers (different terms are used in different tools) are rule sets that are used for dividing the text into segments, displaying the translatable content in the translation tool and hiding external content that is not meant to be translated.

These rule sets exist for MS Office files (and many other file formats), but things get more complicated with text files. The file in this example was a .json file:

When I opened the file in EmEditor, the content looked as follows:


All the text was on one line, although the file was relatively large – over 600 kB. I decided to see whether the translation tool recognised the file format, so I imported the file into our standard translation tool.

Then I tried the proper localisation tool, which seemed to recognise the file format:


The next question was whether I could get any content to show. First, I created a text list (this is terminology used in the localisation tool...)


Ok, then I tried to open the file for editing:


It was obvious to me that the file could not be opened even with the localisation tool, which meant that I had to specify the text structure and the elements that preceded and followed the translatable text segments. Armed with this information, I then tried to create a rule set that would allow the file to be imported into the translation tool for translation.

I returned to EmEditor to take another look at the contents. First, I enabled text wrapping.


This made the text readable (although the escaped Unicode characters, such as \u00e4, make reading difficult), and the language was correct as well – this text was to be translated from Finnish into Swedish. It was still not possible to figure out the structure of the text at this stage, which is why I had to perform some “pretty-printing” and disable text wrapping:


Now things started to look better! From this structure I could define where the text began and ended and make the necessary specifications in the translation tool rule sets.
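EmEditor did the pretty-printing for me, but the same step can be reproduced with a few lines of Python. This is only a sketch: the keys "otsikko" and "teksti" in the sample are invented stand-ins for the real content, and the one-line string stands in for the 600 kB dump.

```python
import json

# A tiny stand-in for the one-line dump (the real file was over 600 kB);
# the keys "otsikko" and "teksti" are invented for illustration.
raw = '{"otsikko":"M\\u00e4\\u00e4r\\u00e4osa","teksti":"Ks. m\\u00e4\\u00e4r\\u00e4osan luovutus"}'

data = json.loads(raw)

# Re-serialise with indentation and without ASCII escaping,
# so \u00e4 becomes ä and the structure becomes visible.
pretty = json.dumps(data, indent=2, ensure_ascii=False)
print(pretty)
```

With `ensure_ascii=False`, the escaped characters come out as readable letters, and `indent=2` reveals the structure that was hidden on the single line.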

Going through all the details of working with this file format would take a long time, but to give a short account of the process, the following steps must be performed before the file can be imported into the translation tool:

  • converting escaped characters into normal characters (e.g. \u00e4 -> ä)
  • saving the content in UTF-8 format (with BOM). If the term BOM means nothing to you, it is short for byte order mark, also called a signature.
  • specifying a new text filter (rule set) in the translation tool
  • specifying the tags embedded in the text as either internal or external (e.g. an external tag pair or an internal tag pair; the text also included internal parameters with translatable text between them, such as ${link:Ks. määräosan luovutus kaupalla|real_estate_portion_transfer})
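The first two steps, and the detection of the internal parameters, can be sketched in Python. The sample string and the output file name are assumptions for illustration; the real work was done with the translation tool's own rule sets.

```python
import re

# A stand-in for one value from the dump; the real file was much larger.
text = ('Ks. m\\u00e4\\u00e4r\\u00e4osan luovutus '
        '${link:Ks. m\\u00e4\\u00e4r\\u00e4osan luovutus kaupalla|real_estate_portion_transfer}')

# 1) Convert escaped characters into normal characters (\u00e4 -> ä).
unescaped = re.sub(r'\\u([0-9a-fA-F]{4})',
                   lambda m: chr(int(m.group(1), 16)), text)

# 2) Save in UTF-8 with BOM; the "utf-8-sig" codec writes the signature
#    at the start of the file. (The file name is hypothetical.)
with open("prepared.txt", "w", encoding="utf-8-sig") as f:
    f.write(unescaped)

# 3) Locate internal parameters of the form ${link:visible text|identifier},
#    whose visible text must be translated but whose markup must survive.
placeholders = re.findall(r'\$\{link:([^|}]+)\|([^}]+)\}', unescaped)
print(placeholders)
```

Writing with `utf-8-sig` is what produces the "UTF-8 with BOM" file that the tool expects, and the regular expression is one way to pin down where the translatable text inside the internal parameters begins and ends.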

When all the aforementioned actions had been performed, I was ready to import the text into the translation tool, to check that everything worked correctly and to ensure that all translatable text would be translated. The text looked fine in the translation tool's editor and the internal tags were displayed correctly:



Finally, I checked that everything would be translated by making a pseudo-translation of the text and comparing it with the original file. Once I was satisfied with the result, I sent the files off for translation.
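The idea behind pseudo-translation can be illustrated with a simple character substitution – this is not the tool's actual algorithm, just a sketch of the principle: every translatable letter is visibly altered, while internal parameters pass through untouched, so anything that comes back unchanged was missed by the filter.

```python
import re

# Accented stand-ins make it obvious which text passed through "translation".
# The mapping is an illustrative assumption, not the tool's real one.
PSEUDO = str.maketrans("aeiouAEIOU", "åéîöûÅÉÎÖÛ")

def pseudo_translate(segment: str) -> str:
    # Leave internal parameters such as ${link:...|...} untouched,
    # since they must survive translation verbatim.
    parts = re.split(r'(\$\{[^}]*\})', segment)
    return ''.join(p if p.startswith('${') else p.translate(PSEUDO)
                   for p in parts)

demo = pseudo_translate("Ks. luovutus kaupalla ${link:teksti|id}")
print(demo)
```

Comparing the pseudo-translated file with the original then shows at a glance whether any translatable text slipped through the rule set unmodified.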

All in all, I spent a couple of days on these preparations (the JSON file was not the only file format included in this work order), and if I remember correctly, we still encountered some problems during the finalisation stage of the project. But that is another story.

Oh yeah, how many translatable words did the file contain? There were over 23,000 words, and originally it was all on one line!