Posted to users@opennlp.apache.org by Adam Goodkind <a....@gmail.com> on 2012/09/13 04:26:51 UTC

Extracting Indices When Tokenizing

Hi,

When tokenizing a string of text, is there also a way to track the index (of the original text) where the token begins?

For example:
"Mary didn't kiss John"
[(Mary, 0), (did, 5), (n't, 8), (kiss, 12), (John, 17)]

If there is a way to extract the 0, 5, 8, 12 and 17 from somewhere, that would be great. I cannot rely on whitespace, since the tokenizer sometimes breaks up words.

Thanks,
Adam

Re: Extracting Indices When Tokenizing

Posted by Jörn Kottmann <ko...@gmail.com>.
Hello,

You need to use OpenNLP via its API. The tokenizer has a tokenizePos method,
which returns the spans (start and end character offsets) of the detected tokens.

Have a look at our documentation:
http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.tokenizer.api
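
A minimal sketch of how tokenizePos might be used to recover the start offsets, assuming the OpenNLP tools jar is on the classpath. It uses the rule-based SimpleTokenizer only because that needs no model file; it splits on character classes, so it will not produce the "did" / "n't" split from the question. The class name TokenOffsets and the helper tokenStarts are hypothetical names chosen for illustration.

```java
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.util.Span;

public class TokenOffsets {

    // Hypothetical helper: returns the start offset (into the original
    // text) of every token found by the tokenizer.
    static int[] tokenStarts(String text) {
        // Assumption: the model-free SimpleTokenizer is good enough for a demo.
        Tokenizer tokenizer = SimpleTokenizer.INSTANCE;

        // tokenizePos returns one Span per token instead of the token strings.
        Span[] spans = tokenizer.tokenizePos(text);

        int[] starts = new int[spans.length];
        for (int i = 0; i < spans.length; i++) {
            starts[i] = spans[i].getStart();
        }
        return starts;
    }

    public static void main(String[] args) {
        String text = "Mary didn't kiss John";
        Span[] spans = SimpleTokenizer.INSTANCE.tokenizePos(text);

        // Each span carries the start (inclusive) and end (exclusive)
        // offsets; getCoveredText cuts the token out of the original string.
        for (Span span : spans) {
            System.out.println(span.getCoveredText(text)
                    + " start=" + span.getStart()
                    + " end=" + span.getEnd());
        }
    }
}
```

To get the contraction split shown in the question, a trained tokenizer (TokenizerME loaded from a TokenizerModel) would probably be needed in place of SimpleTokenizer. tokenizePos is declared on the Tokenizer interface, so the rest of the sketch should carry over unchanged.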

We do not support this in the command line interface.

Hope that helps,
Jörn


