Posted to users@opennlp.apache.org by "Paterson, Norman [CRI]" <np...@bsd.uchicago.edu> on 2015/12/02 18:04:58 UTC

Jira OpenNLP-216 and OpenNLP-217

I suggest adding the following to the documentation (source: https://www.mail-archive.com/search?l=issues@opennlp.apache.org&q=subject:%22%5C%5Bjira%5C%5D+%5C%5BComment+Edited%5C%5D+%5C(OPENNLP%5C-216%5C)+Add+Detokenizer+API+section%22&o=newest&f=1):

Create an instance of SimpleTokenizer.

String sentence = "He said \"This is a test\".";
SimpleTokenizer instance = SimpleTokenizer.INSTANCE;

Tokenize the sentence using the tokenize(String str) method of SimpleTokenizer.

String tokens[] = instance.tokenize(sentence);
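To see why the example sentence yields nine tokens, the following is a rough, self-contained model of character-class tokenization. It is only a sketch and assumes that a token ends whenever the character class changes and that punctuation characters are split from each other unless they repeat. The class and method names are hypothetical and are not part of OpenNLP; the real SimpleTokenizer may differ in detail.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for a character-class tokenizer, for illustration only.
// The names used here are hypothetical and are not OpenNLP API.
final class CharClassTokenizerSketch {

    private enum CharClass { WHITESPACE, LETTER, DIGIT, OTHER }

    private static CharClass classify(char c) {
        if (Character.isWhitespace(c)) return CharClass.WHITESPACE;
        if (Character.isLetter(c)) return CharClass.LETTER;
        if (Character.isDigit(c)) return CharClass.DIGIT;
        return CharClass.OTHER;
    }

    static String[] tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = -1; // start index of the token being read, -1 before the first one
        CharClass previous = CharClass.WHITESPACE;
        char previousChar = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            CharClass current = classify(c);
            if (previous == CharClass.WHITESPACE) {
                // A token can only begin on a non-whitespace character.
                if (current != CharClass.WHITESPACE) start = i;
            } else if (current != previous
                    || (current == CharClass.OTHER && c != previousChar)) {
                // The class changed, or two different punctuation characters met.
                tokens.add(text.substring(start, i));
                start = i;
            }
            previous = current;
            previousChar = c;
        }
        // Flush the final token when the text does not end in whitespace.
        if (previous != CharClass.WHITESPACE) tokens.add(text.substring(start));
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String sentence = "He said \"This is a test\".";
        String[] tokens = tokenize(sentence);
        System.out.println(tokens.length + " tokens: " + String.join(" | ", tokens));
    }
}
```

Under these assumptions the quotation marks and the final period each become a token of their own, so the operations array built in the next step needs nine entries.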

The operations array must contain exactly one operation name per token, so
the operations array and the tokens array must have the same length.
Store the operation name tokens.length times in the operations array.

// Operation is the nested enum DetokenizationDictionary.Operation
Operation operations[] = new Operation[tokens.length];
String oper = "MOVE_RIGHT"; // please refer to the list of operations above
for (int i = 0; i < tokens.length; i++) {
    operations[i] = Operation.parse(oper);
}
System.out.println(operations.length);

The length printed here equals the length of the tokens array.

Now create an instance of DetokenizationDictionary by passing the tokens and
operations arrays to the constructor.

DetokenizationDictionary detokenizeDict = new DetokenizationDictionary(tokens, operations);

Pass the DetokenizationDictionary instance to the DictionaryDetokenizer
constructor to obtain a detokenizer for the tokens.

DictionaryDetokenizer dictDetokenize = new DictionaryDetokenizer(detokenizeDict);

DictionaryDetokenizer.detokenize requires two parameters: (a) the tokens array
and (b) the split marker.

String st = dictDetokenize.detokenize(tokens, " ");


Output:

He said " This is a test " .
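The spacing in the output above follows from how the split marker is applied. The sketch below is a simplified, self-contained model of that joining step, not the library's code: it assumes that a token marked MOVE_RIGHT (or a following token marked MOVE_LEFT) is joined to its neighbour with the split marker instead of a plain space, and it leaves out RIGHT_LEFT_MATCHING. The class, enum and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of dictionary-based detokenization, for illustration only.
// The names are hypothetical; RIGHT_LEFT_MATCHING is deliberately not modelled.
final class DetokenizeSketch {

    enum Op { MOVE_LEFT, MOVE_RIGHT, MOVE_BOTH }

    static String detokenize(String[] tokens, Map<String, Op> dict, String splitMarker) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            out.append(tokens[i]);
            if (i + 1 == tokens.length) break; // nothing follows the last token
            Op current = dict.get(tokens[i]);  // null when the token is not listed
            Op next = dict.get(tokens[i + 1]);
            boolean merge = next == Op.MOVE_LEFT || next == Op.MOVE_BOTH
                    || current == Op.MOVE_RIGHT || current == Op.MOVE_BOTH;
            if (merge) {
                // Merged neighbours are separated by the split marker only.
                if (splitMarker != null) out.append(splitMarker);
            } else {
                out.append(' ');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"He", "said", "\"", "This", "is", "a", "test", "\"", "."};

        // Same setup as in the text: every token is mapped to MOVE_RIGHT.
        Map<String, Op> allRight = new HashMap<>();
        for (String token : tokens) allRight.put(token, Op.MOVE_RIGHT);
        System.out.println(detokenize(tokens, allRight, " "));

        // A more selective dictionary: only the period attaches to the left.
        Map<String, Op> periodOnly = new HashMap<>();
        periodOnly.put(".", Op.MOVE_LEFT);
        System.out.println(detokenize(tokens, periodOnly, ""));
    }
}
```

In this model, mapping every token to MOVE_RIGHT places the split marker between all tokens, which is why a single-space marker reproduces the spaced-out sentence. A dictionary that lists only punctuation, as in the second call, is probably closer to what one would use in practice.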


(Original Jira comment author: prakash111...@gmail.com)

-----------------------------------------------------
Norman Paterson - Software Engineer
University of Chicago Div. of Biological Sciences
Ph. 773-834-4809               Cel. 312-350-9838
-----------------------------------------------------