Since November 2007 the European Commission's Directorate-General for Translation has made its multilingual Translation Memory for the Acquis Communautaire, DGT-TM, publicly accessible in order to foster the European Commission’s general effort to support multilingualism, language diversity and the re-use of Commission information.
The Acquis Communautaire is the entire body of European legislation, comprising all the treaties, regulations and directives adopted by the European Union (EU). Since each new country joining the EU is required to accept the whole Acquis Communautaire, this body of legislation has been translated into 23 official languages. As a result, the Acquis now exists as parallel texts in the following 23 languages: Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish, French, Irish, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovene, Spanish and Swedish. For Irish, there is very little data, since the Acquis is not translated into Irish on a regular basis.
Parallel texts are texts and their manually produced translations. They are also referred to as bi-texts. A translation memory is a collection of small text segments and their translations (referred to as translation units, or TUs). These TUs can be sentences or parts of sentences. Translation memories are used to support translators by ensuring that pieces of text that have already been translated do not need to be translated again.
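To make the idea of a translation unit concrete, here is a minimal sketch of a translation memory as an in-memory data structure. The class names, the `lookup` method and the sample segments are hypothetical and chosen for illustration only; they do not describe the DGT-TM file format or the software distributed with it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TranslationUnit:
    """One text segment and its translations, keyed by language code."""
    segments: Dict[str, str] = field(default_factory=dict)


class TranslationMemory:
    """A plain list of translation units with exact-match lookup."""

    def __init__(self) -> None:
        self.units: List[TranslationUnit] = []

    def add(self, unit: TranslationUnit) -> None:
        self.units.append(unit)

    def lookup(self, text: str, source_lang: str, target_lang: str) -> Optional[str]:
        """Return the stored translation of an already-translated segment, or None."""
        for unit in self.units:
            if unit.segments.get(source_lang) == text:
                return unit.segments.get(target_lang)
        return None


# Illustrative sample data (not taken from DGT-TM).
tm = TranslationMemory()
tm.add(TranslationUnit({"EN": "Article 1", "FR": "Article premier", "DE": "Artikel 1"}))

reused = tm.lookup("Article 1", "EN", "FR")    # segment already in the memory
missing = tm.lookup("Article 2", "EN", "FR")   # segment not yet translated
```

The sketch only reuses a translation when the segment matches exactly. Translation tools typically also offer approximate ("fuzzy") matching, which is left out here to keep the example short.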
Both translation memories and parallel texts are important linguistic resources that can be used for a variety of purposes, including:
training automatic systems for statistical machine translation (SMT);
producing monolingual or multilingual lexical and semantic resources such as dictionaries and ontologies;
training and testing multilingual information extraction software;
checking translation consistency automatically;
testing and benchmarking alignment software (for sentences, words, etc.).
The value of a parallel corpus grows with its size and with the number of languages for which translations exist. While parallel corpora for some languages are abundant, there are few or no parallel corpora for most language pairs. To our knowledge, the Acquis Communautaire is the biggest parallel corpus in existence, taking into consideration both its size and the number of languages covered. The most outstanding advantage of the Acquis Communautaire - apart from it being freely available - is the number of rare language pairs (e.g. Maltese-Estonian, Slovene-Finnish, etc.).
The first version of DGT-TM was released in 2007 and included documents published up to the year 2006. In April 2012, DGT-TM-2011 was released, which contains data from 2007 until 2010. Since then, data has been released annually (e.g. 2011 data was released in 2012 under the name DGT-TM-2012). While the alignments between TUs and their translations were verified manually for DGT-TM-2007, the TUs in DGT-TM-2011 and later releases were aligned automatically. The data format is the same for all releases.
This page, which is meant for technical users, provides a description of this unique linguistic resource as well as instructions on where to download it and how to produce bilingual aligned corpora for any of the 253 language pairs or 506 language pair directions. Here is an example of one sentence translated into 22 languages (http://optima.jrc.it/Resources/Documents/0801_DGT-TM_Sample-translation-unit_all-languages.doc).
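The figures of 253 language pairs and 506 pair directions follow directly from the 23 languages listed above. The sketch below checks that arithmetic and shows, under simplifying assumptions, how a bilingual aligned corpus could be derived from multilingual translation units. The two-letter language codes, the dictionary representation of a unit, the `bilingual_pairs` helper and the placeholder segments are all assumptions made for this example; this is not the extraction software distributed with DGT-TM.

```python
from itertools import combinations, permutations

# Two-letter codes for the 23 languages, in the order listed above (assumed codes).
LANGUAGES = [
    "BG", "CS", "DA", "NL", "EN", "ET", "DE", "EL", "FI", "FR", "GA", "HU",
    "IT", "LV", "LT", "MT", "PL", "PT", "RO", "SK", "SL", "ES", "SV",
]

pairs = list(combinations(LANGUAGES, 2))        # unordered pairs: 23 * 22 / 2 = 253
directions = list(permutations(LANGUAGES, 2))   # ordered pairs:   23 * 22     = 506


def bilingual_pairs(units, source_lang, target_lang):
    """Yield (source, target) segments for every unit that has both languages."""
    for unit in units:
        if source_lang in unit and target_lang in unit:
            yield unit[source_lang], unit[target_lang]


# Placeholder units; each maps a language code to a segment of text.
units = [
    {"EN": "<English segment 1>", "MT": "<Maltese segment 1>", "ET": "<Estonian segment 1>"},
    {"EN": "<English segment 2>", "MT": "<Maltese segment 2>"},  # no Estonian version
    {"EN": "<English segment 3>", "MT": "<Maltese segment 3>", "ET": "<Estonian segment 3>"},
]

maltese_estonian = list(bilingual_pairs(units, "MT", "ET"))
```

Units that lack one of the two requested languages are skipped, which is why the Maltese-Estonian corpus in this example contains two segment pairs rather than three.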
Conditions for Use
I. Intellectual property and conditions of use of databases
The DGT-TM database is the exclusive property of the European Commission. The Commission cedes its non-exclusive rights free of charge and world-wide for the entire duration of the protection of those rights to the re-user, for all kinds of use which comply with the conditions laid down in the Commission Decision of 12 December 2011 on the re-use of Commission documents, published in Official Journal of the European Union L330 of 14 December 2011, pages 39 to 42.
Any re-use of the database or of the structured elements contained in it is required to be identified by the re-user, who is under an obligation to state the source of the documents used: the website address, the date of the latest update and the fact that the European Commission retains ownership of the data.
II. Conditions for use of software
The DGT-TM database is distributed with the software necessary for its exploitation/extraction. Use of such software must be carried out in accordance with the conditions laid down in the EUPL licence.
The database and the accompanying software are made available, without any guarantee, explicit or tacit. The Commission cannot be held responsible for any loss, injury or damage the re-user may suffer due to the re-use. The Commission does not however guarantee the absence of any irregularities which may be present in the databases, within the structured data they contain or the software itself. The Commission does not guarantee the on-going distribution of said databases and software.
The Commission cannot be held responsible for any loss, injury or damage caused to third parties as a result of the re-use. The re-user shall bear sole responsibility for the re-use of the data collection, the structured elements it contains and the software. Re-use must not mislead third parties in respect of the contents of the database and the structured elements it contains, the source of the contents or the date of the last update thereto.
This disclaimer is not intended to limit the liability of the Commission in violation of any requirements laid down in applicable national law or to exclude its liability in cases where this is not permitted by the applicable law.
Definitions of terms used by the Commission Decision of 12 December 2011 on the re-use of Commission documents, published in Official Journal of the European Union L330 of 14 December 2011, pages 39 to 42, are supplemented by the following definitions:
Re-user: Any natural or legal person who re-uses the documents, in accordance with the conditions laid down in the Commission Decision of 12 December 2011 on the re-use of Commission documents, published in Official Journal of the European Union L330 of 14 December 2011, pages 39 to 42.
Databases: A collection of independent works, data or other materials arranged in a systematic or methodical way and individually accessible by electronic means or in any other way.