Translation Experiences ...: The DGT Multilingual Translation Memory of the Acquis Communautaire: DGT-TM

Sunday, 10 April 2011

1. Introduction
2. DGT's Translation Memory
3. Description of the Data - Pre-processing
4. Statistics for the DGT Translation Memory
5. Conditions for Use
6. Difference between the DGT Translation Memory and the JRC-Acquis
7. Download the DGT Translation Memory
8. How to Produce Bilingual Extractions (Java bytecode is now also available!)
9. Acknowledgement and Contact

1) Introduction

In November 2007, the European Commission's Directorate-General for Translation (DGT) made publicly accessible its multilingual Translation Memory for the Acquis Communautaire (the body of EU law) - a collection of parallel texts (texts and their translations, also referred to as bi-texts) in 22 languages. This page is intended for technical users: it gives a summary of this unique resource and instructions on where to download it and how to produce bilingual aligned corpora for any of the 231 language pairs (462 language pair directions). For an example of one sentence translated into all 22 languages, click here. Please note that DGT-TM is not machine translation software.

If you are a non-technical user, you may be more interested in our freely accessible news analysis applications, which you can find at http://emm.jrc.it/overview.html.

The release of this linguistic resource follows the public release - in May 2006 - of the JRC-Acquis multilingual parallel corpus with sentence alignment for 231 language pairs. Version 3.0 of the JRC-Acquis, which now also contains Bulgarian as a 22nd language and which comprises a total of over 1 billion words, was made available in April 2007. The data releases of DGT and JRC are in line with the general effort of the European Commission to support multilingualism, language diversity and the re-use of Commission information.

The Acquis Communautaire is the entire body of European legislation, including all the treaties, regulations and directives adopted by the European Union (EU) and the rulings of the European Court of Justice (see the Wikipedia entry). Since each new country joining the EU is required to accept the whole Acquis Communautaire, this body of legislation is translated into 22 official languages. As a result, the Acquis now exists as parallel texts in the following 22 languages: Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish, French, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovene, Spanish and Swedish. For the 23rd official EU language, Irish, the Acquis is not translated on a regular basis.
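The figures of 231 language pairs and 462 language pair directions follow directly from the 22 languages listed above: n languages give n(n-1)/2 unordered pairs and n(n-1) ordered ones. The short Python check below is only an illustration; the variable names are ours and are not part of any DGT tool.

```python
from itertools import combinations, permutations

# The 22 languages listed above, in the same order.
LANGUAGES = [
    "Bulgarian", "Czech", "Danish", "Dutch", "English", "Estonian",
    "German", "Greek", "Finnish", "French", "Hungarian", "Italian",
    "Latvian", "Lithuanian", "Maltese", "Polish", "Portuguese",
    "Romanian", "Slovak", "Slovene", "Spanish", "Swedish",
]

# Unordered pairs: n * (n - 1) / 2
language_pairs = list(combinations(LANGUAGES, 2))
# Ordered pairs (translation directions): n * (n - 1)
pair_directions = list(permutations(LANGUAGES, 2))

print(len(LANGUAGES), len(language_pairs), len(pair_directions))
```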

A translation memory is a collection of small text segments and their translation. These segments can be sentences or sentence parts. Translation memories are used to support translators by ensuring that pieces of text that have already been translated do not need to be translated again.

Both translation memories and parallel texts are important linguistic resources that can be used for a variety of purposes, including:

training automatic systems for statistical machine translation (SMT);
producing monolingual or multilingual lexical and semantic resources such as dictionaries and ontologies;
training and testing multilingual information extraction software;
checking translation consistency automatically;
testing and benchmarking alignment software (for sentences, words, etc.).

Generally speaking, parallel corpora are useful for all types of cross-lingual research. The value of a parallel corpus grows with its size and with the number of languages for which translations exist. While parallel corpora are abundant for some language pairs, few or none exist for most others. To our knowledge, the Acquis Communautaire is the biggest parallel corpus in existence, if we take into consideration both its size and the large number of languages involved. The most outstanding advantage of the Acquis Communautaire - apart from being freely available - is the number of rare language pairs (e.g. Maltese-Estonian, Slovene-Finnish, etc.).

2) DGT's Translation Memory

This extraction of aligned sentences can be used to produce a parallel multilingual corpus of the legislative documents (Acquis Communautaire) of the European Union in 22 EU languages. The aligned sentences ("translation units") have been provided by the Directorate-General for Translation of the European Commission by extraction from one of its large shared translation memories in Euramis (European advanced multilingual information system). This memory contains most, although not all, of the documents of the Acquis Communautaire, as well as some other documents which are not part of the Acquis.

In order to cut down the size, the extraction takes English as the source language. The sequence in the extracted files is not necessarily the same as in the underlying documents, and redundancies of text segments like "Article 1" are inevitable. The documents in the files are identified by the document number (Numdoc) of the original legislative document in the EUR-Lex database, but it should be noted that these documents have been modified (see section on pre-processing below). The documents are in TMX format, a widely used format provided by LISA: in order to be backwards compatible, the header mentions TMX format 1.1, but the files are also compliant with TMX 1.4b. The texts are encoded in UTF-16 Little Endian. The source language of the documents and sentences is not known, but many of the documents were originally written in English and then translated into the other languages.
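To make the file format more concrete, here is a minimal Python sketch of how translation units in a UTF-16 TMX file could be read and reduced to one language pair. It is not the TMXtract implementation. The element names tu, tuv and seg come from the TMX format; the language codes "EN-GB" and "FR-FR", the function names and the tiny sample document (which omits the header attributes a complete TMX file would carry) are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# TMX 1.4 uses xml:lang; older files use a plain "lang" attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_translation_units(tmx_bytes):
    """Yield one {language_code: segment_text} dict per <tu> element."""
    # The XML parser detects UTF-16 from the byte-order mark.
    root = ET.fromstring(tmx_bytes)
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.iter("tuv"):
            lang = tuv.get("lang") or tuv.get(XML_LANG)
            seg = tuv.find("seg")
            if lang and seg is not None:
                unit[lang.upper()] = "".join(seg.itertext()).strip()
        yield unit


def bilingual_pairs(units, source, target):
    """Keep only the units that contain both requested languages."""
    for unit in units:
        if source in unit and target in unit:
            yield unit[source], unit[target]


# Hypothetical sample, encoded as UTF-16 Little Endian with a byte-order mark.
SAMPLE = (
    '<?xml version="1.0" encoding="UTF-16"?>'
    '<tmx version="1.1"><header/><body>'
    '<tu><tuv lang="EN-GB"><seg>Article 1</seg></tuv>'
    '<tuv lang="FR-FR"><seg>Article premier</seg></tuv></tu>'
    '<tu><tuv lang="EN-GB"><seg>Article 2</seg></tuv></tu>'
    '</body></tmx>'
)
sample_bytes = b"\xff\xfe" + SAMPLE.encode("utf-16-le")

pairs = list(bilingual_pairs(read_translation_units(sample_bytes), "EN-GB", "FR-FR"))
print(pairs)
```

The second sample unit has no French segment and is therefore dropped, which is the behaviour one would expect when a document is not available in every language.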

DGT cannot assume any responsibility for the quality and the content.

3) Description of the Data - Pre-processing

Before the documents were aligned and corrected, they were pre-processed to remove certain differences between the source and target language versions (further details). This means that the contents of the documents might have changed. The documents were aligned in accordance with the segmentation rules used in the Directorate-General for Translation of the European Commission. The extraction keeps only the EUR-Lex document number (Numdoc), from which other information (e.g. year and document type) can be derived. For further information on the Numdoc structure, see the information provided by EUR-Lex.
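As a rough illustration of how year and document type might be derived from a document number, the Python sketch below assumes a CELEX-style layout: one sector character, a four-digit year, one or two type letters, then a running number. This layout is our assumption and should be checked against the EUR-Lex documentation; the pattern, the function name parse_numdoc and the example identifier are illustrative only.

```python
import re

# Assumed layout: sector, 4-digit year, type letter(s), running number.
NUMDOC_PATTERN = re.compile(
    r"^(?P<sector>[0-9A-Z])"
    r"(?P<year>\d{4})"
    r"(?P<doc_type>[A-Z]{1,2})"
    r"(?P<number>\d+.*)$"
)


def parse_numdoc(numdoc):
    """Split a document number into its assumed parts, or return None."""
    match = NUMDOC_PATTERN.match(numdoc.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["year"] = int(fields["year"])
    return fields


# Illustrative identifier only; not taken from the data files.
print(parse_numdoc("32006D0291"))
```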

4) Statistics for the DGT Translation Memory

The DGT Translation Memory is currently available in 22 languages.

5) Conditions for Use

Under Commission Decision 2006/291/EC, Euratom of 7 April 2006 on the re-use of Commission information (Official Journal L 107, 20.4.2006, pp. 38-41), this data may be disseminated, but only within the limits set by the Decision. In particular, the Commission is not liable for any consequence stemming from the re-use. Moreover, the Commission is not liable for the quality of the alignment or for the correctness of the data provided.

By agreement with the European Commission's Office for Official Publications (OPOCE), the Acquis can be used and distributed for research purposes, but the following conditions for use must be observed:

The European Communities consider legislative and quasi-legislative documents published in the Official Journal of the European Union to be in the public domain. Prior written permission is not required for their reproduction/translation, and they may be reproduced freely without restriction, including for the purpose of further non-commercial dissemination to final users, subject to the condition that appropriate acknowledgement is given to the European Communities and to the source, and provided that - whenever a document is reproduced verbatim from a source other than the printed version of the Official Journal of the European Union - a prominently positioned disclaimer should read: "Only European Community legislation printed in the paper edition of the Official Journal of the European Union is deemed authentic."

6) Difference between the DGT Translation Memory and the JRC-Acquis

The two resources are rather similar in nature as they are both based on the Acquis Communautaire, but they are not identical and can serve different purposes. The main differences are the following:

The document collections of the two resources should mostly be the same, but they are not identical because the resources were collected in different ways. Neither resource is exactly equivalent to the Acquis Communautaire. The criteria for the collection of the JRC-Acquis were rather loose (all documents available in at least ten languages, at least three of them 'new' EU languages, were collected), so the JRC-Acquis is bigger.
The DGT Translation Memory is a collection of translation units, from which the full text cannot be reproduced. The JRC-Acquis is mostly a collection of full texts with additional information on which sentences are aligned with which others.
Most parts of the DGT Translation Memory have been corrected manually using the Euramis alignment editor, while the alignment of the JRC-Acquis documents was done using the alignment software tools Vanilla (Versions 2.2 and 3) and HunAlign (Version 2.2), without manual correction.
For the cleaning and pre-processing of the texts, different methods and tools were used.
Most JRC-Acquis documents are accompanied by information on the manually assigned Eurovoc subject domain classes so that the JRC-Acquis can also be used to train automatic multi-label classification software.

7) Download the DGT Translation Memory

The distribution consists of 12 zip files (Volume_1.zip, ... Volume_12.zip), each of approximately 100 MB. Each zip file contains dozens of TMX files, identified by the EUR-Lex number of the underlying documents of the Acquis, and a plain-text file list specifying the languages in which the documents are available.

You can download the data files from the site http://optima.jrc.it/Acquis/DGT_TU_1.0/data/. There is no need to unzip the files as the extraction program will access the data in the zip files directly. The texts for the different languages are spread over the various zip files so that you will need to download all files if you want the full parallel corpus. Downloading only a subset of the zip files is possible, but it will result in producing only a subset of the parallel corpus.
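For readers who want to inspect the archives themselves, the following Python sketch shows one way to read TMX members straight from the zip files without unpacking them, much as the extraction program is described as doing. It is not part of the official tooling. The function name is hypothetical, and the ".tmx" extension is an assumption; only the Volume_N.zip naming comes from the description above.

```python
import zipfile
from pathlib import Path


def iter_tmx_members(data_dir):
    """Yield (zip_name, member_name, raw_bytes) for every .tmx member found."""
    # Note: sorted() orders names lexically, so Volume_10 precedes Volume_2.
    for zip_path in sorted(Path(data_dir).glob("Volume_*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            for name in archive.namelist():
                if name.lower().endswith(".tmx"):
                    # Read the member directly; nothing is written to disk.
                    yield zip_path.name, name, archive.read(name)


if __name__ == "__main__":
    # Lists whatever volumes are present in the current directory.
    for zip_name, member, raw in iter_tmx_members("."):
        print(zip_name, member, len(raw), "bytes")
```

If only some of the volumes are present, the loop simply covers that subset, which matches the note above that a partial download yields a partial corpus.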

You also need to download the extraction program and copy it into the same directory as the zip files with the data. The program is distributed in two versions (NEW!): a version with a graphical user interface for the Windows operating system, consisting of two files (the program file and the library), and a machine-independent command-line version in Java bytecode that can be run on any machine with a Java runtime of version 1.4 or newer.

8) How to produce bilingual extractions

The multilingual extraction has English as the source language. Users can extract any language pair as follows, using the extraction tool TMXtract:

For the Windows Operating System:

download the zip files, the extraction tool TMXtract (.exe file) and the file swt-win32-3218.dll onto your PC. The files must be in the same directory;
open TMXtract;
select Input files (Volume_1.zip, etc.; multiple selection is possible);
specify Output file (the result is always 1 file);
choose Source and Target language;
click on Start.

For other Operating Systems: (NEW!)

download the zip files and the extraction tool TMXtract (jar file) onto your computer. The files should be in the same directory;
start a command shell;
invoke the program with the command java -jar TMXtract.jar [ ...];
The progress of the extraction will be displayed on the console. Example on Solaris:

9) Acknowledgement and Contact

Following multiple requests, the European Commission's Directorate-General for Translation (DGT) has decided to make its translation memory for Europe-wide legal texts - the Acquis Communautaire - available to the wider public. For that purpose, DGT planned and produced a database extraction and developed programs that allow the efficient extraction of bilingual corpora for all 231 language pairs for 22 official EU languages. DGT is the largest translation service world-wide. Its translation memory serves several European Union institutions.

The Joint Research Centre (JRC) is a Directorate-General of the European Commission. The JRC has for many years worked on highly multilingual text analysis applications and has also contributed to the dissemination of the DGT Translation Memory. In addition to developing various reliable high-usage in-house tools, the JRC made three news aggregation and analysis applications of the Europe Media Monitor (EMM) family publicly accessible. EMM aggregates news from about 1,200 news portals world-wide in 42 languages. The news portals are visited around the clock and EMM updates its pages every ten minutes. The non-public, Commission-internal EMM applications additionally ingest news from about 20 different newswires. EMM's sites receive up to 1.2 million hits per day. Much information is available via RSS feeds.

NewsBrief: Breaking News detection and display of the very latest thematically organised news from around the world; Grouping of related news; RSS feeds and automatic email alerting; 42 languages.
MedISys: EMM's Medical Information System selects the health-related EMM news and additionally gathers documents from about 150 medical web sites. MedISys displays the medical news according to diseases, symptoms, organisations and themes and has statistics-based early warning functions for each category. A second, restricted site offers more functionality to EU public health organisations. 42 languages.
NewsExplorer: Summary of the news in 19 languages for each 24-hour window; grouping of related news into clusters; linking of daily clusters over time and across languages (multilingual and cross-lingual topic tracking); visualisation of time lines and of geographical news coverage; information extraction to detect and disambiguate persons, organisations and locations; individual, daily-updated pages for 700,000 names; detection of quotations by and about people; automatic calculation of social networks.

For more information, you can contact the following persons:

Directorate-General for Translation (DGT)
Patrick Schluter (Email address format: Firstname.Lastname@ec.europa.eu)
Unit DGT.R.3 Informatics
Jean-Monnet Building A2/137
L-2920 Luxembourg
More information on DGT.

Joint Research Centre (JRC)
Ralf Steinberger (Email address format: Firstname.Lastname@jrc.it)
IPSC - SeS
Via E. Fermi 2749, T.P. 267
I-21027 Ispra (VA)
More information on the JRC and its Language Technology activity.