European Parliament Proceedings Parallel Corpus 1996-2011, parallel corpus Latvian-English
The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 21 European languages: Romance (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavic (Bulgarian, Czech, Polish, Slovak, Slovene), Finno-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek.
The goal of the extraction and processing was to generate sentence-aligned text for statistical machine translation systems. For this purpose, we extracted matching items and labeled them with corresponding document IDs. We identified sentence boundaries using a preprocessor, then sentence-aligned the data using a tool based on the Gale and Church algorithm.
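The sentence alignment step can be illustrated with a minimal sketch of length-based alignment in the spirit of Gale and Church (1993). This is not the tool used to build the corpus. The function names `match_cost` and `align` are hypothetical, the priors and variance constant are the values reported in the Gale and Church paper, and the example sentences are made up for illustration.

```python
import math

# Prior probabilities of each alignment type (source count, target count),
# as reported by Gale and Church (1993). Assumed here, not taken from Europarl.
PRIORS = {
    (1, 1): 0.89,
    (1, 0): 0.0099, (0, 1): 0.0099,
    (2, 1): 0.089, (1, 2): 0.089,
    (2, 2): 0.011,
}
C = 1.0    # expected number of target characters per source character
S2 = 6.8   # variance of the length difference per character


def match_cost(len_src, len_tgt, prior):
    """Negative log-probability of aligning spans with these character lengths."""
    mean = (len_src + len_tgt / C) / 2.0
    if mean == 0:
        return -math.log(prior)
    delta = (len_tgt - len_src * C) / math.sqrt(mean * S2)
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|delta|)).
    tail = math.erfc(abs(delta) / math.sqrt(2))
    return -math.log(prior) - math.log(max(tail, 1e-300))


def align(src, tgt):
    """Align two lists of sentences; return a list of (src_group, tgt_group)."""
    n, m = len(src), len(tgt)
    src_len = [len(s) for s in src]
    tgt_len = [len(t) for t in tgt]
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    # Dynamic programming over prefixes of both sentence lists.
    for i in range(n + 1):
        for j in range(m + 1):
            for (di, dj), prior in PRIORS.items():
                pi, pj = i - di, j - dj
                if pi < 0 or pj < 0 or cost[pi][pj] == inf:
                    continue
                c = cost[pi][pj] + match_cost(
                    sum(src_len[pi:i]), sum(tgt_len[pj:j]), prior
                )
                if c < cost[i][j]:
                    cost[i][j] = c
                    back[i][j] = (di, dj)

    # Follow back-pointers from the end to recover the alignment.
    beads = []
    i, j = n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        beads.append((src[i - di:i], tgt[j - dj:j]))
        i, j = i - di, j - dj
    beads.reverse()
    return beads


if __name__ == "__main__":
    english = ["Hello.", "How are you today?", "Fine."]
    latvian = ["Sveiki.", "Kā jums klājas šodien?", "Labi."]
    for en_group, lv_group in align(english, latvian):
        print(en_group, "<->", lv_group)
```

The sketch relies only on character lengths, so it needs no dictionary or language-specific resources. Within a document pair identified by matching document IDs, it would be applied to the two sentence lists produced by the sentence splitter.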