Posted to dev@opennlp.apache.org by Tri Nguyen <yt...@gmail.com> on 2011/11/05 04:48:55 UTC

Name finder position

Hi,

Could somebody show me how to get the positions of names in a document,
like the positions of keywords in Lucene?

Thanks,
Tri.

Re: Name finder position

Posted by Tri Nguyen <yt...@gmail.com>.

The text is English.
I am using Lucene's StandardAnalyzer, which indexes the words at the correct
positions. Can we map the token positions from OpenNLP to Lucene?

Tri.

On Sun, Nov 6, 2011 at 7:28 AM, James Kosin <ja...@gmail.com> wrote:

> Tri,
>
> Unfortunately, it depends on the input language.  Only thing I've found
> is it may be better to find the tokens that are punctuation.  A hint is
> most tokens that are punctuation are a single character wide.  But,
> again that may not be the case depending on the encoding and the
> punctuation.  Words are usually a bit longer.
>
> James

Re: Name finder position

Posted by James Kosin <ja...@gmail.com>.

Tri,

Unfortunately, it depends on the input language.  The only thing I've found
is that it may be better to find the tokens that are punctuation.  One hint
is that most punctuation tokens are a single character wide, although that
may not hold depending on the encoding and the punctuation.  Words are
usually a bit longer.

James
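
A minimal sketch of the heuristic James describes, set next to a stricter
check. It is not from the thread: the function names are hypothetical, and
treating Unicode symbol characters (category S) as punctuation is an
assumption. It illustrates why token width alone can mislead: "4" and "a"
are one character wide but are words, while "..." is punctuation that is
three characters wide.

```python
import unicodedata


def is_single_char(token: str) -> bool:
    """Width heuristic: punctuation tokens are usually one character wide."""
    return len(token) == 1


def is_punctuation(token: str) -> bool:
    """Stricter check: every character is Unicode punctuation (P*) or symbol (S*)."""
    return bool(token) and all(
        unicodedata.category(ch)[0] in ("P", "S") for ch in token
    )


if __name__ == "__main__":
    for tok in ["US", ",", "he", "4", "...", "a", "."]:
        print(f"{tok!r}: single_char={is_single_char(tok)} "
              f"punctuation={is_punctuation(tok)}")
```

The category check does not depend on token length, so it may be a safer
basis for deciding which tokens to skip when counting words.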

On 11/5/2011 2:14 PM, Tri Nguyen wrote:
> Thank you James,
> I don't count the token having pattern ".*[A-Za-z0-9]+.*" and check some
> cases it works.
> The token is not satisfied that pattern can be a punctuation. Is that
> pattern enough to cover a keyword?
> Can we incorporate Lucene and OpenNLP so that the keyword position and
> Named Entity position are compatible?

Re: Name finder position

Posted by Tri Nguyen <yt...@gmail.com>.

Thank you James,
When counting, I skip every token that does not match the pattern
".*[A-Za-z0-9]+.*", and in the cases I checked it works.
A token that does not match that pattern is treated as punctuation. Is that
pattern enough to cover every keyword?
Can we integrate Lucene and OpenNLP so that keyword positions and
named-entity positions are compatible?
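
As a rough way to probe whether the pattern covers every keyword, the sketch
below applies it as a whole-token match (as Java's String.matches would) and
compares it with a Unicode-aware check. The helper names are hypothetical and
the sample tokens are illustrative only. The ASCII pattern accepts any token
containing at least one ASCII letter or digit, so it handles the English
sample, but it would classify a word written entirely in another script as
punctuation.

```python
import re

# The pattern from the message, applied to the whole token.
ASCII_WORD = re.compile(r".*[A-Za-z0-9]+.*")


def matches_ascii_pattern(token: str) -> bool:
    """True when the token contains at least one ASCII letter or digit."""
    return ASCII_WORD.fullmatch(token) is not None


def has_letter_or_digit(token: str) -> bool:
    """Unicode-aware alternative: any character that is a letter or digit."""
    return any(ch.isalnum() for ch in token)


if __name__ == "__main__":
    for tok in ["Bill", "1961", ",", "U.S.", "Jörn", "東京"]:
        print(f"{tok!r}: ascii={matches_ascii_pattern(tok)} "
              f"unicode={has_letter_or_digit(tok)}")
```

In Java, a Unicode-aware pattern such as ".*[\p{L}\p{N}].*" would play the
same role as the second function.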

On Sun, Nov 6, 2011 at 12:22 AM, James Kosin <ja...@gmail.com> wrote:

> Tri,
>
> You could just subtract the number of punctuation tokens from the
> offsets you get.

Re: Name finder position

Posted by James Kosin <ja...@gmail.com>.

Tri,

You could just subtract the number of punctuation tokens that precede a
token from the offset you get for it.

On 11/5/2011 1:08 PM, Tri Nguyen wrote:
> On Sat, Nov 5, 2011 at 11:30 PM, Jörn Kottmann <ko...@gmail.com> wrote:
>
>> On 11/5/11 4:53 PM, Tri Nguyen wrote:
>>
>>> Obama is correct, but Bill Gates. Since the NameFinderME return the token
>>> index (position in the token array) not the keyword position (the keyword
>>> position in the text). I want to cooperate with keyword position in
>>> Lucene.
>>>
>> What is a keyword position?
>>
> It is the order of the word in the text.
> Ex:
> Barack: 0
> Obama: 1
> president: 3
> US: 5
> he: 6
> 1961: 11
> Bill: 12
>
>> Jörn
>>
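
James's suggestion to subtract the punctuation tokens can be sketched as a
running count. This is an illustration under assumptions, not OpenNLP code:
token spans are modelled as plain (start, end) tuples with an exclusive end,
the function names are hypothetical, and "punctuation" means any token
without an ASCII letter or digit, as in Tri's pattern.

```python
import re
from itertools import accumulate

WORD = re.compile(r".*[A-Za-z0-9]+.*")

TOKENS = ["Barack", "Obama", "is", "president", "of", "US", ",", "he",
          "was", "born", "August", "4", ",", "1961", ".", "Bill",
          "Gates", "found", "Microsoft", "on", "April", "4", ",", "1975", "."]


def punctuation_before(tokens):
    """result[i] is the number of punctuation tokens among tokens[:i]."""
    flags = [0 if WORD.fullmatch(tok) else 1 for tok in tokens]
    return [0] + list(accumulate(flags))


def to_word_span(token_span, tokens):
    """Convert a (start, end) token span into word positions (end exclusive)."""
    start, end = token_span
    before = punctuation_before(tokens)
    return start - before[start], end - before[end]


if __name__ == "__main__":
    # "Bill Gates" covers tokens 15 and 16; three punctuation tokens precede it.
    print(to_word_span((15, 17), TOKENS))  # (12, 14)
```

Note that this counts every word, including ones an analyzer might drop as
stop words, which matches the positions listed in the quoted example.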

Re: Name finder position

Posted by Tri Nguyen <yt...@gmail.com>.

On Sat, Nov 5, 2011 at 11:30 PM, Jörn Kottmann <ko...@gmail.com> wrote:

> On 11/5/11 4:53 PM, Tri Nguyen wrote:
>
>> Obama is correct, but Bill Gates. Since the NameFinderME return the token
>> index (position in the token array) not the keyword position (the keyword
>> position in the text). I want to cooperate with keyword position in
>> Lucene.
>>
>
> What is a keyword position?
>
It is the index of the word in the text, counting words only (punctuation
is skipped).
Example:
Barack: 0
Obama: 1
president: 3
US: 5
he: 6
1961: 11
Bill: 12

>
> Jörn
>
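
The positions in this example are consistent with an index that counts every
word, drops stop words, and leaves a gap where each stop word was. The sketch
below reproduces them without calling Lucene: the regex word splitter is a
plain stand-in for an analyzer, and the stop-word set is an assumption chosen
for this sentence. Whether a real analyzer removes stop words or preserves
the gaps depends on the Lucene version and configuration.

```python
import re

# Assumed stop words for this illustration only.
STOP_WORDS = {"is", "of", "was", "on"}

TEXT = ("Barack Obama is president of US, he was born August 4, 1961. "
        "Bill Gates found Microsoft on April 4, 1975.")


def indexed_positions(text):
    """Return (word, position) pairs; stop words are dropped but still counted."""
    pairs = []
    for position, word in enumerate(re.findall(r"[A-Za-z0-9]+", text)):
        if word.lower() not in STOP_WORDS:
            pairs.append((word, position))
    return pairs


if __name__ == "__main__":
    for word, position in indexed_positions(TEXT):
        print(f"{word}: {position}")
```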

Re: Name finder position

Posted by Jörn Kottmann <ko...@gmail.com>.

On 11/5/11 4:53 PM, Tri Nguyen wrote:
> Obama is correct, but Bill Gates. Since the NameFinderME return the token
> index (position in the token array) not the keyword position (the keyword
> position in the text). I want to cooperate with keyword position in Lucene.

What is a keyword position?

Jörn

Re: Name finder position

Posted by Tri Nguyen <yt...@gmail.com>.

Hi Jörn,
Obama is correct, but Bill Gates is not, since NameFinderME returns the
token index (the position in the token array), not the keyword position (the
position of the word in the text). I want the result to be compatible with
keyword positions in Lucene.

Best Regards,

Tri.

On Sat, Nov 5, 2011 at 10:38 PM, Jörn Kottmann <ko...@gmail.com> wrote:

> On 11/5/11 4:33 PM, Tri Nguyen wrote:
>
>> Can I have the word position of a token?
>> ex: word position of token [Bill] should be 12. word position of token [he]
>> should be 6.
>>
>
> Well, looks like I am missing something, but isn't that what
> NameFinderME.find returns?
> It gives back an array of Spans where each span has a start and end offset
> in your input token array.
>
> Your sample:
>
> [Barack]    [Obama]    [is]    [president]    [of]    [US]    [,]    [he]
> [was]    [born]    [August]    [4]    [,]    [1961]    [.]    [Bill]
> [Gates]    [found]    [Microsoft]    [on]    [April]    [4]    [,]    [1975]
> [.]
>
> Here the Span for Obama would be start=0 and end=2.
>
> Jörn

Re: Name finder position

Posted by Jörn Kottmann <ko...@gmail.com>.

On 11/5/11 4:33 PM, Tri Nguyen wrote:
> Can I have the word position of a token?
> ex: word position of token [Bill] should be 12. word position of token [he]
> should be 6.

Well, it looks like I am missing something, but isn't that what
NameFinderME.find returns?
It gives back an array of Spans, where each Span has a start and end
offset into your input token array.

Your sample:

[Barack]    [Obama]    [is]    [president]    [of]    [US]    [,]    [he]
[was]    [born]    [August]    [4]    [,]    [1961]    [.]    [Bill]
[Gates]    [found]    [Microsoft]    [on]    [April]    [4]    [,]    [1975]
[.]

Here the Span for Barack Obama would be start=0 and end=2 (the end index is
exclusive).

Jörn
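
To make the start/end convention concrete, here is a small sketch that slices
the token array with a span. The Span class below is a hypothetical stand-in
written for this illustration, not OpenNLP's own class; it only mirrors the
idea that start is inclusive and end is exclusive.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    """Stand-in for a token span: start is inclusive, end is exclusive."""
    start: int
    end: int


TOKENS = ["Barack", "Obama", "is", "president", "of", "US", ",", "he",
          "was", "born", "August", "4", ",", "1961", ".", "Bill",
          "Gates", "found", "Microsoft", "on", "April", "4", ",", "1975", "."]


def covered_tokens(span: Span, tokens: List[str]) -> List[str]:
    """Return the tokens a span covers."""
    return tokens[span.start:span.end]


if __name__ == "__main__":
    print(covered_tokens(Span(0, 2), TOKENS))    # ['Barack', 'Obama']
    print(covered_tokens(Span(15, 17), TOKENS))  # ['Bill', 'Gates']
```

With this convention, end minus start is the number of tokens in the name.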

Re: Name finder position

Posted by Tri Nguyen <yt...@gmail.com>.

Can I get the word position of a token?
For example, the word position of token [Bill] should be 12, and the word
position of token [he] should be 6.

Regards,
Tri.

On Sat, Nov 5, 2011 at 10:05 PM, Tri Nguyen <yt...@gmail.com> wrote:

> Hi Jörn,
>
> I understand your idea, but I want the word position:
>
> Barack Obama is president of US, he was born August 4, 1961. Bill Gates
> found Microsoft on April 4, 1975.
>
> Barack Obama: position 0
>
> Bill Gates: position 12
>
> While token Barack has position 0 and token Bill has position 15.
>
> [Barack]    [Obama]    [is]    [president]    [of]    [US]    [,]    [he]
> [was]    [born]    [August]    [4]    [,]    [1961]    [.]    [Bill]
> [Gates]    [found]    [Microsoft]    [on]    [April]    [4]    [,]
> [1975]    [.]
>
> Best Regards,
>
> Tri.

Re: Name finder position

Posted by Tri Nguyen <yt...@gmail.com>.

Hi Jörn,

I understand your idea, but I want the word position:

Barack Obama is president of US, he was born August 4, 1961. Bill Gates
found Microsoft on April 4, 1975.

Barack Obama: position 0

Bill Gates: position 12

By contrast, in the token array below, token Barack has index 0 and token
Bill has index 15:

[Barack]    [Obama]    [is]    [president]    [of]    [US]    [,]    [he]
[was]    [born]    [August]    [4]    [,]    [1961]    [.]    [Bill]
[Gates]    [found]    [Microsoft]    [on]    [April]    [4]    [,]    [1975]
[.]

Best Regards,

Tri.

On Sat, Nov 5, 2011 at 8:40 PM, Jörn Kottmann <ko...@gmail.com> wrote:

> On 11/5/11 4:48 AM, Tri Nguyen wrote:
>
>> Hi,
>>
>> Could somebody guide me how to get positions of names in the document like
>> the positions of keywords in Lucene?
>>
>>
> The name finder expects tokenized input, the names can only be mapped to
> tokens, but your tokens can be mapped back to character offsets.
>
> Jörn
>

Re: Name finder position

Posted by Jörn Kottmann <ko...@gmail.com>.

On 11/5/11 4:48 AM, Tri Nguyen wrote:
> Hi,
>
> Could somebody guide me how to get positions of names in the document like
> the positions of keywords in Lucene?
>

The name finder expects tokenized input, so the names can only be mapped to
tokens, but your tokens can be mapped back to character offsets.

Jörn
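
One way to map tokens back to character offsets is to scan the original text
for each token in order. The sketch below is an illustration under
assumptions, not OpenNLP code: the function names are hypothetical, and it
assumes each token appears verbatim in the text in the same order. OpenNLP's
Tokenizer also offers tokenizePos, which returns character spans directly and
would avoid the scan.

```python
TEXT = ("Barack Obama is president of US, he was born August 4, 1961. "
        "Bill Gates found Microsoft on April 4, 1975.")

TOKENS = ["Barack", "Obama", "is", "president", "of", "US", ",", "he",
          "was", "born", "August", "4", ",", "1961", ".", "Bill",
          "Gates", "found", "Microsoft", "on", "April", "4", ",", "1975", "."]


def token_char_offsets(text, tokens):
    """Return (start, end) character offsets for each token, end exclusive."""
    offsets = []
    cursor = 0
    for token in tokens:
        start = text.find(token, cursor)
        if start < 0:
            raise ValueError(f"token {token!r} not found after offset {cursor}")
        end = start + len(token)
        offsets.append((start, end))
        cursor = end  # continue after this token so repeated tokens resolve in order
    return offsets


def name_char_span(token_span, offsets):
    """Convert a (start, end) token span into a character span."""
    start_tok, end_tok = token_span
    return offsets[start_tok][0], offsets[end_tok - 1][1]


if __name__ == "__main__":
    offsets = token_char_offsets(TEXT, TOKENS)
    start, end = name_char_span((15, 17), offsets)
    print(TEXT[start:end])  # Bill Gates
```

Character offsets are independent of how either library tokenizes, so they
may be a more robust common reference than word or token indexes.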