OpenNLP Tokenizer (Portuguese)
The Apache OpenNLP library is a machine-learning-based toolkit for processing natural language text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. These tasks are usually required to build more advanced text processing services. OpenNLP also includes maximum entropy and perceptron-based machine learning.
The goal of the OpenNLP project is to create a mature toolkit for the above-mentioned tasks. An additional goal is to provide a large number of pre-built models for a variety of languages, as well as the annotated text resources that those models are derived from.
General Library Structure
The Apache OpenNLP library contains several components that together make it possible to build a full natural language processing pipeline. These components include a sentence detector, tokenizer, name finder, document categorizer, part-of-speech tagger, chunker, parser, and coreference resolver. Each component contains parts for executing the respective natural language processing task, for training a model, and often also for evaluating a model. Each of these facilities is accessible via the component's application programming interface (API). In addition, a command line interface (CLI) is provided for convenient experimentation and training.
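To make the idea of chaining components into a pipeline concrete, the following is a minimal sketch in plain Java that passes text through a sentence detection step and then a tokenization step. It is not OpenNLP's API and uses no trained model: the class name ToyPipeline and the methods detectSentences and tokenize are hypothetical stand-ins based on simple regular expressions, whereas the actual OpenNLP components are statistical models loaded from model files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: rule-based stand-ins for two pipeline stages.
// None of these names come from OpenNLP.
class ToyPipeline {

    // A token is a run of letters/digits, or one non-space symbol.
    private static final Pattern TOKEN =
            Pattern.compile("[\\p{L}\\p{N}]+|[^\\s\\p{L}\\p{N}]");

    // Stage 1 (assumed rule): a sentence ends at '.', '!' or '?'
    // when that character is followed by whitespace.
    static List<String> detectSentences(String text) {
        List<String> sentences = new ArrayList<>();
        for (String s : text.trim().split("(?<=[.!?])\\s+")) {
            if (!s.isEmpty()) {
                sentences.add(s);
            }
        }
        return sentences;
    }

    // Stage 2: split one sentence into word and punctuation tokens.
    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(sentence);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "O gato dorme. Onde fica a casa?";
        // The output of the sentence stage feeds the tokenizer stage.
        for (String sentence : detectSentences(text)) {
            System.out.println(tokenize(sentence));
        }
    }
}
```

Running the sketch prints one token list per sentence. A rule-based splitter like this one mishandles abbreviations and clitics, which is one reason a trained, language-specific model such as the Portuguese tokenizer described here is preferable in practice.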