PatTR: Patent Translation Resource subcorpus DE-EN
PatTR is a sentence-parallel corpus extracted from the MAREC patent collection. The current version contains more than 22 million German-English and 18 million French-English parallel sentences collected from all patent text sections as well as 5 million German-French sentence pairs from patent titles, abstracts and claims.
Like the MAREC data, PatTR is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Please cite Wäschle & Riezler (2012b) if you use the corpus in your work.
The corpus is organized by language pair and by patent document section: title, abstract, claims and description. Parallel data from the title, abstract and claims sections were extracted from documents belonging to the European Patent Office (EPO) and World Intellectual Property Organization (WIPO) corpora in MAREC. Both resources feature multilingual documents that contain, for example, both an English and a German abstract.
Since there are no multilingual descriptions, data from this section were collected by exploiting patent families to align German and French documents from the EPO corpus to English documents from the United States Patent and Trademark Office (USPTO) corpus, following Utiyama and Isahara (2007).
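The document-level pairing by patent family described above can be illustrated with a minimal sketch. The record layout and the field names `doc_id`, `family_id` and `lang` are hypothetical and do not reflect the actual MAREC schema; the sketch covers only the pairing of documents that share a family, not the sentence alignment or the full method of Utiyama and Isahara (2007).

```python
from collections import defaultdict


def pair_by_family(epo_docs, uspto_docs):
    """Pair EPO documents with USPTO documents that share a patent family ID.

    Both arguments are lists of dicts with the hypothetical keys
    "doc_id" and "family_id".
    """
    # Index the English USPTO documents by their family ID.
    uspto_by_family = defaultdict(list)
    for doc in uspto_docs:
        uspto_by_family[doc["family_id"]].append(doc)

    # Every EPO document is paired with each USPTO document in the same family.
    pairs = []
    for doc in epo_docs:
        for en_doc in uspto_by_family.get(doc["family_id"], []):
            pairs.append((doc["doc_id"], en_doc["doc_id"]))
    return pairs


# Toy records, invented purely for illustration.
epo_docs = [
    {"doc_id": "EP-0001", "family_id": "F1", "lang": "de"},
    {"doc_id": "EP-0002", "family_id": "F2", "lang": "fr"},
    {"doc_id": "EP-0003", "family_id": "F3", "lang": "de"},  # no US family member
]
uspto_docs = [
    {"doc_id": "US-1001", "family_id": "F1", "lang": "en"},
    {"doc_id": "US-1002", "family_id": "F2", "lang": "en"},
]

pairs = pair_by_family(epo_docs, uspto_docs)
```

In this toy input, `EP-0003` has no USPTO document in its family and therefore yields no pair. The description sections of each resulting document pair would then be passed on to sentence alignment.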
All sections were sentence-aligned using the Gargantua aligner. Preprocessing was done automatically. Sentence boundaries were detected using the Europarl processing tools.
For a detailed description of the corpus construction process, please see the publications.