Posted to dev@opennlp.apache.org by Tri Nguyen <yt...@gmail.com> on 2011/11/05 04:48:55 UTC
Name finder position
Hi,
Could somebody explain how to get the positions of names in a document,
like the positions of keywords in Lucene?
Thanks,
Tri.
Re: Name finder position
Posted by Tri Nguyen <yt...@gmail.com>.
It is English.
I am using Lucene's StandardAnalyzer, which indexes the words at the correct
positions. Can we map the token positions from OpenNLP to Lucene's?
Tri.
On Sun, Nov 6, 2011 at 7:28 AM, James Kosin <ja...@gmail.com> wrote:
> Tri,
>
> Unfortunately, it depends on the input language. Only thing I've found
> is it may be better to find the tokens that are punctuation. A hint is
> most tokens that are punctuation are a single character wide. But,
> again that may not be the case depending on the encoding and the
> punctuation. Words are usually a bit longer.
>
> James
>
> On 11/5/2011 2:14 PM, Tri Nguyen wrote:
> > Thank you James,
> > I don't count the token having pattern ".*[A-Za-z0-9]+.*" and check some
> > cases it works.
> > The token is not satisfied that pattern can be a punctuation. Is that
> > pattern enough to cover a keyword?
> > Can we incorporate Lucene and OpenNLP so that the keyword position and
> > Named Entity position are compatible?
> >
> >
> > On Sun, Nov 6, 2011 at 12:22 AM, James Kosin <ja...@gmail.com>
> wrote:
> >
> >> Tri,
> >>
> >> You could just subtract the number of punctuation tokens from the
> >> offsets you get.
> >> On 11/5/2011 1:08 PM, Tri Nguyen wrote:
> >>> On Sat, Nov 5, 2011 at 11:30 PM, Jörn Kottmann <ko...@gmail.com>
> >> wrote:
> >>>> On 11/5/11 4:53 PM, Tri Nguyen wrote:
> >>>>
> >>>>> Obama is correct, but Bill Gates. Since the NameFinderME return the
> >> token
> >>>>> index (position in the token array) not the keyword position (the
> >> keyword
> >>>>> position in the text). I want to cooperate with keyword position in
> >>>>> Lucene.
> >>>>>
> >>>> What is a keyword position?
> >>>>
> >>> It is the order of the word in the text.
> >>> Ex:
> >>> Barack: 0
> >>> Obama: 1
> >>> president: 3
> >>> US: 5
> >>> he: 6
> >>> 1961: 11
> >>> Bill: 12
> >>>
> >>>> Jörn
> >>>>
> >>
>
>
Re: Name finder position
Posted by James Kosin <ja...@gmail.com>.
Tri,
Unfortunately, it depends on the input language. The only thing I've found
is that it may be better to identify the tokens that are punctuation. One
hint is that most punctuation tokens are a single character wide, while
words are usually a bit longer. But again, that may not hold, depending on
the encoding and the punctuation.
James
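To illustrate why the one-character hint is only a hint, here is a minimal sketch in plain Java. The class and method names are made up for illustration and are not part of OpenNLP or Lucene. It lists the tokens that the length rule would call punctuation even though they are a letter or a digit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not an OpenNLP or Lucene class.
final class LengthHeuristic {
    /** The hint from the message: a one-character token is taken to be punctuation. */
    static boolean looksLikePunctuation(String token) {
        return token.length() == 1;
    }

    /** Tokens the hint calls punctuation although they are a single letter or digit. */
    static List<String> falsePositives(String[] tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (looksLikePunctuation(t) && Character.isLetterOrDigit(t.charAt(0))) {
                out.add(t);
            }
        }
        return out;
    }
}
```

On the sample sentence from this thread, the two [4] tokens come back as false positives, so the length check alone would shift the word positions after "August 4". Multi-character punctuation such as "..." would be missed in the other direction.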
On 11/5/2011 2:14 PM, Tri Nguyen wrote:
> Thank you James,
> I don't count the token having pattern ".*[A-Za-z0-9]+.*" and check some
> cases it works.
> The token is not satisfied that pattern can be a punctuation. Is that
> pattern enough to cover a keyword?
> Can we incorporate Lucene and OpenNLP so that the keyword position and
> Named Entity position are compatible?
>
>
> On Sun, Nov 6, 2011 at 12:22 AM, James Kosin <ja...@gmail.com> wrote:
>
>> Tri,
>>
>> You could just subtract the number of punctuation tokens from the
>> offsets you get.
>> On 11/5/2011 1:08 PM, Tri Nguyen wrote:
>>> On Sat, Nov 5, 2011 at 11:30 PM, Jörn Kottmann <ko...@gmail.com>
>> wrote:
>>>> On 11/5/11 4:53 PM, Tri Nguyen wrote:
>>>>
>>>>> Obama is correct, but Bill Gates. Since the NameFinderME return the
>> token
>>>>> index (position in the token array) not the keyword position (the
>> keyword
>>>>> position in the text). I want to cooperate with keyword position in
>>>>> Lucene.
>>>>>
>>>> What is a keyword position?
>>>>
>>> It is the order of the word in the text.
>>> Ex:
>>> Barack: 0
>>> Obama: 1
>>> president: 3
>>> US: 5
>>> he: 6
>>> 1961: 11
>>> Bill: 12
>>>
>>>> Jörn
>>>>
>>
Re: Name finder position
Posted by Tri Nguyen <yt...@gmail.com>.
Thank you James,
I now count only the tokens that match the pattern ".*[A-Za-z0-9]+.*", and
it works in the cases I checked.
A token that does not match that pattern is treated as punctuation. Is that
pattern enough to cover every keyword?
Can we integrate Lucene and OpenNLP so that the keyword positions and the
named entity positions are compatible?
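As a rough check of how far that pattern goes, the sketch below compiles it with java.util.regex next to a Unicode-aware variant. The variant and all class and method names are assumptions added for illustration, not something proposed in the thread.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not an OpenNLP or Lucene class.
final class KeywordPattern {
    // The pattern from the message: at least one ASCII letter or digit.
    static final Pattern ASCII = Pattern.compile(".*[A-Za-z0-9]+.*");

    // Assumed alternative: at least one letter or digit in any script.
    static final Pattern UNICODE = Pattern.compile(".*[\\p{L}\\p{N}].*");

    static boolean isAsciiKeyword(String token) {
        return ASCII.matcher(token).matches();
    }

    static boolean isUnicodeKeyword(String token) {
        return UNICODE.matcher(token).matches();
    }
}
```

For English text like the sample, both patterns accept [Bill] and [4] and reject [,] and [.]. The ASCII pattern would reject a word written entirely in a non-Latin script, which is where the Unicode variant might be the safer choice.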
On Sun, Nov 6, 2011 at 12:22 AM, James Kosin <ja...@gmail.com> wrote:
> Tri,
>
> You could just subtract the number of punctuation tokens from the
> offsets you get.
> On 11/5/2011 1:08 PM, Tri Nguyen wrote:
> > On Sat, Nov 5, 2011 at 11:30 PM, Jörn Kottmann <ko...@gmail.com>
> wrote:
> >
> >> On 11/5/11 4:53 PM, Tri Nguyen wrote:
> >>
> >>> Obama is correct, but Bill Gates. Since the NameFinderME return the
> token
> >>> index (position in the token array) not the keyword position (the
> keyword
> >>> position in the text). I want to cooperate with keyword position in
> >>> Lucene.
> >>>
> >> What is a keyword position?
> >>
> > It is the order of the word in the text.
> > Ex:
> > Barack: 0
> > Obama: 1
> > president: 3
> > US: 5
> > he: 6
> > 1961: 11
> > Bill: 12
> >
> >> Jörn
> >>
>
>
Re: Name finder position
Posted by James Kosin <ja...@gmail.com>.
Tri,
You could just subtract the number of punctuation tokens that come before
a name from the token index you get.
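A minimal sketch of that subtraction in plain Java follows. The names are hypothetical, and the punctuation test (no letter or digit anywhere in the token) is an assumption; whether the result lines up with Lucene depends on the analyzer splitting words the same way as the tokenizer.

```java
// Hypothetical helper, not an OpenNLP or Lucene class.
final class WordPositions {
    /** True if the token contains at least one letter or digit (any script). */
    static boolean isWord(String token) {
        return token.codePoints().anyMatch(Character::isLetterOrDigit);
    }

    /** Token index minus the number of punctuation-only tokens before it. */
    static int wordPosition(String[] tokens, int tokenIndex) {
        int punctuationBefore = 0;
        for (int i = 0; i < tokenIndex; i++) {
            if (!isWord(tokens[i])) {
                punctuationBefore++;
            }
        }
        return tokenIndex - punctuationBefore;
    }
}
```

With the sample tokens, [Bill] sits at token index 15 with three punctuation tokens before it, so its word position is 12. [he] at token index 7 maps to 6.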
On 11/5/2011 1:08 PM, Tri Nguyen wrote:
> On Sat, Nov 5, 2011 at 11:30 PM, Jörn Kottmann <ko...@gmail.com> wrote:
>
>> On 11/5/11 4:53 PM, Tri Nguyen wrote:
>>
>>> Obama is correct, but Bill Gates. Since the NameFinderME return the token
>>> index (position in the token array) not the keyword position (the keyword
>>> position in the text). I want to cooperate with keyword position in
>>> Lucene.
>>>
>> What is a keyword position?
>>
> It is the order of the word in the text.
> Ex:
> Barack: 0
> Obama: 1
> president: 3
> US: 5
> he: 6
> 1961: 11
> Bill: 12
>
>> Jörn
>>
Re: Name finder position
Posted by Tri Nguyen <yt...@gmail.com>.
On Sat, Nov 5, 2011 at 11:30 PM, Jörn Kottmann <ko...@gmail.com> wrote:
> On 11/5/11 4:53 PM, Tri Nguyen wrote:
>
>> Obama is correct, but Bill Gates. Since the NameFinderME return the token
>> index (position in the token array) not the keyword position (the keyword
>> position in the text). I want to cooperate with keyword position in
>> Lucene.
>>
>
> What is a keyword position?
>
It is the ordinal position of the word in the text, not counting punctuation.
Example:
Barack: 0
Obama: 1
president: 3
US: 5
he: 6
1961: 11
Bill: 12
>
> Jörn
>
Re: Name finder position
Posted by Jörn Kottmann <ko...@gmail.com>.
On 11/5/11 4:53 PM, Tri Nguyen wrote:
> Obama is correct, but Bill Gates. Since the NameFinderME return the token
> index (position in the token array) not the keyword position (the keyword
> position in the text). I want to cooperate with keyword position in Lucene.
What is a keyword position?
Jörn
Re: Name finder position
Posted by Tri Nguyen <yt...@gmail.com>.
Hi Jörn,
The position for Obama is correct, but the one for Bill Gates is not, because
NameFinderME returns the token index (the position in the token array), not
the keyword position (the position of the word in the text). I want the
result to line up with the keyword positions in Lucene.
Best Regards,
Tri.
On Sat, Nov 5, 2011 at 10:38 PM, Jörn Kottmann <ko...@gmail.com> wrote:
> On 11/5/11 4:33 PM, Tri Nguyen wrote:
>
>> Can I have the word position of a token?
>> ex: word position of token [Bill] should be 12. word position of token
>> [he]
>> should be 6.
>>
>
> Well, looks like I am missing something, but isn't that what
> NameFinderME.find returns?
> It gives back an array of Spans where each span has a start and end offset
> in your input
> token array.
>
> Your sample:
>
>
> [Barack] [Obama] [is] [president] [of] [US] [,] [he]
> [was] [born] [August] [4] [,] [1961] [.] [Bill]
> [Gates] [found] [Microsoft] [on] [April] [4] [,]
> [1975]
> [.]
>
> Here the Span for Obama would be start=0 and end=2.
>
> Jörn
>
>
>
Re: Name finder position
Posted by Jörn Kottmann <ko...@gmail.com>.
On 11/5/11 4:33 PM, Tri Nguyen wrote:
> Can I have the word position of a token?
> ex: word position of token [Bill] should be 12. word position of token [he]
> should be 6.
Well, it looks like I am missing something, but isn't that what
NameFinderME.find returns?
It gives back an array of Spans, where each Span has a start and an end
index into your input token array.
Your sample:
[Barack] [Obama] [is] [president] [of] [US] [,] [he]
[was] [born] [August] [4] [,] [1961] [.] [Bill]
[Gates] [found] [Microsoft] [on] [April] [4] [,] [1975] [.]
Here the Span for Barack Obama would be start=0 and end=2 (the end index
is exclusive).
Jörn
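To make the start/end convention concrete, here is a small sketch in plain Java. It does not use the OpenNLP Span class itself. It only mirrors the convention that start is inclusive and end is exclusive, and the helper name is made up for illustration.

```java
import java.util.Arrays;

// Hypothetical helper that mirrors the start/end convention of a Span.
final class SpanText {
    /** Joins tokens[start..end) with single spaces; end is exclusive. */
    static String cover(String[] tokens, int start, int end) {
        return String.join(" ", Arrays.copyOfRange(tokens, start, end));
    }
}
```

With the sample tokens, cover(tokens, 0, 2) gives "Barack Obama", and a span with start=15 and end=17 gives "Bill Gates".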
Re: Name finder position
Posted by Tri Nguyen <yt...@gmail.com>.
Can I get the word position of a token?
Example: the word position of token [Bill] should be 12, and the word
position of token [he] should be 6.
Regards,
Tri.
On Sat, Nov 5, 2011 at 10:05 PM, Tri Nguyen <yt...@gmail.com> wrote:
> Hi Jörn,
>
> I understand your idea, but I want the word position:
>
> Barack Obama is president of US, he was born August 4, 1961. Bill Gates
> found Microsoft on April 4, 1975.
>
> Barack Obama: position 0
>
> Bill Gates: position 12
>
> While token Barack has position 0 and token Bill has position 15.
>
> [Barack] [Obama] [is] [president] [of] [US] [,] [he]
> [was] [born] [August] [4] [,] [1961] [.] [Bill]
> [Gates] [found] [Microsoft] [on] [April] [4] [,]
> [1975] [.]
>
> Best Regards,
>
> Tri.
>
>
> On Sat, Nov 5, 2011 at 8:40 PM, Jörn Kottmann <ko...@gmail.com> wrote:
>
>> On 11/5/11 4:48 AM, Tri Nguyen wrote:
>>
>>> Hi,
>>>
>>> Could somebody guide me how to get positions of names in the document
>>> like
>>> the positions of keywords in Lucene?
>>>
>>>
>> The name finder expects tokenized input, the names can only be mapped to
>> tokens,
>> but your tokens can be mapped back to character offsets.
>>
>> Jörn
>>
>
>
Re: Name finder position
Posted by Tri Nguyen <yt...@gmail.com>.
Hi Jörn,
I understand your idea, but I want the word position:
Barack Obama is president of US, he was born August 4, 1961. Bill Gates
found Microsoft on April 4, 1975.
Barack Obama: position 0
Bill Gates: position 12
The token Barack has index 0, but the token Bill has index 15:
[Barack] [Obama] [is] [president] [of] [US] [,] [he]
[was] [born] [August] [4] [,] [1961] [.] [Bill]
[Gates] [found] [Microsoft] [on] [April] [4] [,] [1975] [.]
Best Regards,
Tri.
On Sat, Nov 5, 2011 at 8:40 PM, Jörn Kottmann <ko...@gmail.com> wrote:
> On 11/5/11 4:48 AM, Tri Nguyen wrote:
>
>> Hi,
>>
>> Could somebody guide me how to get positions of names in the document like
>> the positions of keywords in Lucene?
>>
>>
> The name finder expects tokenized input, the names can only be mapped to
> tokens,
> but your tokens can be mapped back to character offsets.
>
> Jörn
>
Re: Name finder position
Posted by Jörn Kottmann <ko...@gmail.com>.
On 11/5/11 4:48 AM, Tri Nguyen wrote:
> Hi,
>
> Could somebody guide me how to get positions of names in the document like
> the positions of keywords in Lucene?
>
The name finder expects tokenized input, so the names can only be mapped to
tokens, but your tokens can be mapped back to character offsets.
Jörn
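One way to do that mapping is sketched below in plain Java, under the assumption that every token appears verbatim in the text and in order. The class and method names are hypothetical. In OpenNLP itself, the Tokenizer interface has tokenizePos(String), which returns Spans with character offsets directly.

```java
// Hypothetical helper, not an OpenNLP class.
final class TokenOffsets {
    /**
     * Returns {start, end} character offsets (end exclusive) for each token,
     * found by scanning the text from left to right.
     */
    static int[][] locate(String text, String[] tokens) {
        int[][] offsets = new int[tokens.length][];
        int cursor = 0;
        for (int i = 0; i < tokens.length; i++) {
            int start = text.indexOf(tokens[i], cursor);
            if (start < 0) {
                throw new IllegalArgumentException("Token not found in text: " + tokens[i]);
            }
            int end = start + tokens[i].length();
            offsets[i] = new int[] {start, end};
            cursor = end; // continue after this token so repeated tokens resolve in order
        }
        return offsets;
    }
}
```

A name Span over tokens start..end can then be turned into characters by taking the start offset of token start and the end offset of token end - 1.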