This is a parallel corpus built from PDF documents published by the European Medicines Agency. All files were converted automatically from PDF to plain text using pdftotext with the command-line arguments -layout -nopgbrk -eol unix. There are some known problems with tables and multi-column layouts; some of these are fixed in the current version.
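The conversion step can be reproduced with a short shell loop. Only the pdftotext flags come from the description above; the input/output naming convention is illustrative.

```shell
#!/bin/sh
# Batch-convert PDFs in the current directory to plain text, using the
# same pdftotext flags the corpus was built with:
#   -layout    preserve the physical page layout
#   -nopgbrk   do not insert page-break characters between pages
#   -eol unix  force Unix line endings
# The foo.pdf -> foo.txt naming is an assumption for illustration.
for pdf in *.pdf; do
    [ -e "$pdf" ] || continue   # skip if no PDFs match the glob
    pdftotext -layout -nopgbrk -eol unix "$pdf" "${pdf%.pdf}.txt"
done
```

Note that even with -layout, pdftotext tends to flatten tables and interleave multi-column text, which is the source of the known problems mentioned above.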
22 languages, 231 bitexts
total number of files: 41,957
total number of tokens: 311.65M
total number of sentence fragments: 26.51M
Please cite the following article if you use any part of the corpus in your own work:
Jörg Tiedemann, 2009, News from OPUS - A Collection of Multilingual Parallel Corpora with Tools and Interfaces. In N. Nicolov and K. Bontcheva and G. Angelova and R. Mitkov (eds.) Recent Advances in Natural Language Processing (vol V), pages 237-248, John Benjamins, Amsterdam/Philadelphia
European Medicines Agency copyright and limited reproduction notices.
The contents of these webpages are © EMA [1995-2014]. In particular, unless otherwise stated, the Agency, according to current European Union and international legislation, is the owner of copyright and other intellectual property rights for documents and other content published on this website.

Information and documents made available on the Agency's webpages are public and may be reproduced and/or distributed, in whole or in part, irrespective of the means and/or the formats used, for non-commercial and commercial purposes, provided that the Agency is always acknowledged as the source of the material. Such acknowledgement must be included in each copy of the material.

Citations may be made from such material without prior permission, provided the source is always acknowledged.

The above-mentioned permissions do not apply to content supplied by third parties. Therefore, for documents where the copyright vests in a third party, permission for reproduction must be obtained from this copyright holder.