Suppose you have an aligned corpus in Excel (or any other delimited format) and you wish to reuse that content in your favorite CAT tool. It’s actually very easy to convert a bilingual text format to TMX with a very handy, free open-source tool called Oliphant.
- Ensure you have an aligned corpus in Excel, with the leftmost column containing the source text and the next column containing the target. If your corpus is not perfectly aligned, you may want to check out my earlier post about a down-and-dirty alignment tool.
- Paste the bilingual table in Notepad and save the file, ensuring the encoding is set to UTF-8.
- Download, install and launch Oliphant
- Press Ctrl+N to create a new TM and add your language code to the target field.
- Then go to File>Import and choose Tab-delimited files (.txt) from the dropdown menu. Locate the file you created in step 2 and hit Open.
- In the Destination Field, set the Field Type of Column 1 to Text and its Language to EN-US (or whatever source language you’re working with). For Column 2, set the Field Type to Text again and enter your target language code in the Language field.
- Press OK and hit Save. Your bilingual corpus has been converted to a TMX file!
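If you would rather script the same conversion, the steps above can be approximated in a few lines of Python. This is only a minimal sketch, not what Oliphant does internally: the function name `tsv_to_tmx`, the default language codes and the header values are assumptions chosen for illustration, and the input is assumed to be tab-delimited text with the source in the first column and the target in the second.

```python
import csv
import io
import xml.etree.ElementTree as ET

# ElementTree serializes this qualified name as the standard xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tsv_to_tmx(tsv_text, src_lang="EN-US", tgt_lang="IT-IT"):
    """Build a TMX 1.4 document (as a string) from tab-delimited text.

    Column 1 is treated as the source and column 2 as the target.
    The language codes are placeholders; replace them with your own.
    """
    tmx = ET.Element("tmx", version="1.4")
    # Header attributes required by TMX; the values here are illustrative.
    ET.SubElement(tmx, "header", {
        "creationtool": "tsv_to_tmx sketch",
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "tab-delimited",
        "adminlang": "en-US",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")

    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines and rows with an empty source or target cell.
        if len(row) < 2 or not row[0].strip() or not row[1].strip():
            continue
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, row[0]), (tgt_lang, row[1])):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text.strip()

    return ET.tostring(tmx, encoding="unicode")


if __name__ == "__main__":
    sample = "Hello\tCiao\nGood morning\tBuongiorno\n"
    print(tsv_to_tmx(sample))
```

To use it on a real file, read the UTF-8 text file saved from Notepad, pass its contents to `tsv_to_tmx`, and write the result to a `.tmx` file. It is still worth opening the output in your CAT tool to check that the language codes are recognized.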
Oliphant is fitted with powerful editing tools including advanced Find/Replace and also gives you the ability to delete, add, merge and edit segments on the spot.
SDLXliff2Tmx can be a lifesaver on those occasions when your Studio TM gets corrupted while you’re working on a project (it’s not as infrequent as it may seem). Since Studio saves a bilingual file along with the project (you can access it by going to your target language sub-folder within the main project folder), you can use the tool to convert the .sdlxliff bilingual file to a TMX or TXT and rebuild your TM that way.
SDLXliff2Tmx can be downloaded for free from the SDL Marketplace website. You simply load the .sdlxliff file, choose the statuses you wish to exclude (if any), and decide whether to remove internal tags and whether to save the output as a .txt file instead of a TMX.
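For the curious, the core of such a conversion can be sketched in Python. This is not how SDLXliff2Tmx is implemented; it is a rough illustration that assumes the file follows the XLIFF 1.2 layout, with each `trans-unit` holding a `seg-source` and a `target` whose segments are wrapped in `mrk mtype="seg"` elements matched by their `mid` attribute. The function name `extract_pairs` is hypothetical, inline tags are simply dropped, and status filtering is not covered.

```python
import xml.etree.ElementTree as ET

# Namespace of XLIFF 1.2, which .sdlxliff files are assumed to use here.
XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"


def _q(tag):
    """Return the namespace-qualified form of an XLIFF tag name."""
    return f"{{{XLIFF_NS}}}{tag}"


def _segments(parent):
    """Map segment id -> plain text for each <mrk mtype="seg"> under parent."""
    segs = {}
    if parent is None:
        return segs
    for mrk in parent.iter(_q("mrk")):
        if mrk.get("mtype") == "seg":
            # itertext() keeps the text and discards inline tag markup.
            segs[mrk.get("mid")] = "".join(mrk.itertext()).strip()
    return segs


def extract_pairs(xliff_data):
    """Return a list of (source, target) pairs with a non-empty target."""
    root = ET.fromstring(xliff_data)
    pairs = []
    for unit in root.iter(_q("trans-unit")):
        source = _segments(unit.find(_q("seg-source")))
        target = _segments(unit.find(_q("target")))
        for mid, src in source.items():
            tgt = target.get(mid, "")
            if src and tgt:
                pairs.append((src, tgt))
    return pairs
```

The resulting pairs can be written out as tab-delimited text, one `source<TAB>target` pair per line, and imported into a new TM.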
Term frequency may be a good starting point if you have to create terminology databases ex novo, especially when time and resources are limited. TermoStat, a free online tool developed by the Université de Montréal, is a term extractor that uses a statistical and linguistic method to identify candidate terms. It takes into account not only the structure of potential term candidates but also their relative frequencies in the text being processed.
You start by saving your English source text as a .txt file with ANSI encoding (this is the only format accepted). Then go to the TermoStat website and create an account. Hit Browse and select the .txt file you have just created. Under single-word terms, you can choose whether you want to include nouns, verbs, adjectives and adverbs as part of your candidate list. Hit Analyze.
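If your source text is currently in UTF-8, the re-encoding can also be done with a couple of lines of Python. This is a small sketch that assumes "ANSI" means the Western European Windows code page (cp1252), which is what Notepad uses on English-language systems; the function name `to_ansi_bytes` is hypothetical.

```python
def to_ansi_bytes(text):
    """Encode text as Windows-1252 ("ANSI" on Western European Windows).

    Characters that cp1252 cannot represent are replaced with "?".
    """
    return text.encode("cp1252", errors="replace")


if __name__ == "__main__":
    # File names are placeholders; adjust them to your own files.
    with open("source_utf8.txt", encoding="utf-8") as infile:
        content = infile.read()
    with open("source_ansi.txt", "wb") as outfile:
        outfile.write(to_ansi_bytes(content))
```

Check the output for stray question marks before uploading it, since they indicate characters that were lost in the conversion.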
The tool is relatively fast (it processed a PhD thesis on macroeconomics of over 150k words in about 5 minutes) and the results are displayed in a tabular format which you can export to an Excel-compatible tabbed format.
You can sort the candidate terms by alphabetical order, frequency, specificity and pattern. Click on any candidate to access all the contexts in which the term occurs.
TermoStat provides 5 different data views: List of Terms (default), Cloud, Stat, Structuration and Bigrams. I’ve found Structuration to be particularly helpful since it provides a list of combinations for most candidates.
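To get a feel for what ranking by frequency and specificity means, here is a toy sketch in Python. It is not TermoStat's actual method, which also relies on linguistic analysis. It simply compares how often each word occurs in your text with how often it occurs in a general reference text, using add-one smoothing. The function names, the `min_freq` threshold and the scoring formula are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its words (letters, inner hyphens/apostrophes)."""
    return re.findall(r"[^\W\d_]+(?:[-'][^\W\d_]+)*", text.lower())


def candidate_terms(domain_text, reference_text, min_freq=2):
    """Rank words by how much more frequent they are in domain_text.

    Returns (word, frequency, specificity) tuples, most specific first.
    Specificity is the ratio of smoothed relative frequencies, so values
    above 1 mean the word is over-represented in the domain text.
    """
    domain = Counter(tokenize(domain_text))
    reference = Counter(tokenize(reference_text))
    domain_total = sum(domain.values())
    reference_total = sum(reference.values())
    vocab_size = len(set(domain) | set(reference))

    scored = []
    for word, freq in domain.items():
        if freq < min_freq:
            continue
        # Add-one smoothing avoids division by zero for unseen words.
        domain_rel = (freq + 1) / (domain_total + vocab_size)
        reference_rel = (reference[word] + 1) / (reference_total + vocab_size)
        scored.append((word, freq, domain_rel / reference_rel))

    # Sort by specificity (descending), then alphabetically for stable output.
    scored.sort(key=lambda item: (-item[2], item[0]))
    return scored
```

With a real project you would pass in the full source text and a large general-language sample as the reference. Function words such as "the" end up with a low score, while domain vocabulary rises to the top.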
If you deal with legal translations from or into any of the EU languages, you may want to check out the EurLex website. EurLex is a repository of documents about European Union law including treaties, legislation, case-law and legislative proposals, which are indexed according to several categorization schemes to allow for multiple search facilities. It contains more than 2.8 million documents, some dating back to 1951.
You can search for terminology in any EU source language and then check multiple translations in context via a simple tabular corpus. This is really helpful when you’re struggling with intricate or obscure terms.
If you have ever been asked to localize Photoshop (.psd) graphics, you may have wondered whether it is possible to extract the text in there and localize it with your favourite CAT tool. It actually is, and the process is very straightforward.
1. Download CopyText from here
2. Unzip it to any folder you wish. The program is a self-contained .exe, so there is no need to install it.
3. Drag and drop your .psd file to the CopyText window.
4. The program will extract the text contained in all layers and generate a .txt file.
Now what if you want to import your translation back to the .psd file? CopyText cannot handle that, but Bramus Text Convert can (also free of charge). This is essentially a script you install in Photoshop which will allow you to create a localized version of the artwork almost instantly and, in most cases, with minimal DTP work. However, the software is only compatible with Photoshop CS4 or earlier. Alternatives are available for the most recent versions (e.g. Sysfilter from ECM Engineering) but they are rather expensive.
Ever wanted to rename multiple files in one fell swoop? It may not be the most aptly named piece of software out there, but 1-4a Rename can save you precious time if you ever need to rename many files in one go, for example by adding your language code to the end of localized files or giving your holiday photos more memorable names than the nondescript ones that come straight off your camera. Last but not least, it’s completely free.
No installation is required: just download it from this location and launch the executable file. Press F2 to toggle between the basic and expert interfaces. You can easily append information to the end of the file name, or replace the file name altogether by leaving the top box under Replace empty and adding the replacement text in the box underneath.
You get a live preview of your changes and even if you make a mistake you can always Undo All. Very convenient.
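If you prefer a script, the language-code scenario can be sketched in Python as follows. This is unrelated to how 1-4a Rename works internally; the function names, the `_` separator and the `dry_run` flag are assumptions made for illustration. As in the tool, you get a preview first, because nothing is renamed until you pass `dry_run=False`.

```python
from pathlib import Path


def localized_name(filename, lang_code, separator="_"):
    """Insert the language code before the extension: guide.docx -> guide_IT.docx."""
    path = Path(filename)
    return f"{path.stem}{separator}{lang_code}{path.suffix}"


def rename_all(folder, lang_code, dry_run=True):
    """Append lang_code to every file in folder; return the (old, new) name pairs.

    With dry_run=True (the default) the pairs are only returned as a preview.
    """
    suffix = f"_{lang_code}"
    plan = []
    for path in sorted(Path(folder).iterdir()):
        # Skip sub-folders and files that already carry the language code.
        if not path.is_file() or path.stem.endswith(suffix):
            continue
        plan.append((path, path.with_name(localized_name(path.name, lang_code))))

    if not dry_run:
        for old, new in plan:
            if new.exists():
                raise FileExistsError(f"{new} already exists")
            old.rename(new)

    return [(old.name, new.name) for old, new in plan]
```

Run it once with the default `dry_run=True` and inspect the returned pairs, then run it again with `dry_run=False` once the preview looks right.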