
Posted to dev@opennlp.apache.org by Jörn Kottmann <ko...@gmail.com> on 2013/05/23 15:04:31 UTC

OPENNLP-579

Hi all,

please have a look at
https://issues.apache.org/jira/browse/OPENNLP-579

It's about a contribution that links location entities to a geo name database.
The component could later be extended to link other entity types to
a database or dictionary as well.

Thanks,
Jörn

Re: OPENNLP-579

Posted by Jörn Kottmann <ko...@gmail.com>.

On 05/30/2013 10:19 PM, William Colen wrote:
> I could not understand what you mean by using token offsets for the
> sentences.

With the current approach in OpenNLP, sentence detection is done before
tokenization, and both components output Spans which refer to character
offsets.

But if you do tokenization first, the sentence detector could output
Spans which mark the tokens in a sentence (like the name finder does
with name Spans). This makes it possible to use a sentence Span directly
to access the tokens of a sentence. Anyway, that's also easy with the
current approach.

If you now go one step further, a DocumentNameFinder's find method could be:
Span[] find(String text, Span tokens[], Span sentences[])
or
Span[] find(String tokens[], Span sentences[])

In both cases sentences would contain Spans with token offsets.
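
A minimal sketch of what token offsets for sentences would buy, under stated assumptions: the nested Span class is a hypothetical stand-in for OpenNLP's Span (half-open range, end exclusive), and tokensOf is an illustrative helper, not an existing OpenNLP method. With token offsets, a sentence Span can slice the token array directly.

```java
import java.util.Arrays;

public class TokenOffsetSentences {

    // Hypothetical stand-in for OpenNLP's Span: half-open range [start, end).
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Illustrative helper: the sentence Span holds token indices,
    // so the tokens of a sentence are a plain array slice.
    static String[] tokensOf(String[] tokens, Span sentence) {
        return Arrays.copyOfRange(tokens, sentence.start, sentence.end);
    }

    public static void main(String[] args) {
        String[] tokens = {"Berlin", "is", "big", ".", "I", "like", "it", "."};
        // Two sentences, expressed as token offsets rather than character offsets.
        Span[] sentences = {new Span(0, 4), new Span(4, 8)};
        for (Span sentence : sentences) {
            System.out.println(String.join(" ", tokensOf(tokens, sentence)));
        }
    }
}
```

A name Span that uses the same token indices can then be compared against sentence Spans without any character-offset lookup.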

Jörn

Re: OPENNLP-579

Posted by William Colen <wi...@gmail.com>.

I like the second approach:

Span[] find(String text, Span sentences[], Span tokens[])

It looks like it would be easier to use. Maybe we could add a new tokenize
method to Tokenizer which takes the sentence offset and outputs spans with
this offset included.
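
One way such a tokenize method might look, as a rough sketch only: the class, the nested Span type and the tokenize(String, Span) signature are all assumptions made for illustration, and whitespace splitting stands in for a real tokenizer. The point is that the returned token Spans are already document-level character offsets, so no shifting is needed afterwards.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetTokenizer {

    // Hypothetical stand-in for OpenNLP's Span: half-open range [start, end).
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Whitespace-tokenizes only the characters covered by the sentence Span
    // and returns token Spans as character offsets into the whole text.
    static List<Span> tokenize(String text, Span sentence) {
        List<Span> spans = new ArrayList<>();
        int tokenStart = -1; // -1 means "not inside a token"
        for (int i = sentence.start; i < sentence.end; i++) {
            boolean whitespace = Character.isWhitespace(text.charAt(i));
            if (!whitespace && tokenStart < 0) {
                tokenStart = i;
            } else if (whitespace && tokenStart >= 0) {
                spans.add(new Span(tokenStart, i));
                tokenStart = -1;
            }
        }
        if (tokenStart >= 0) {
            // The last token runs up to the end of the sentence.
            spans.add(new Span(tokenStart, sentence.end));
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "Hi there. Bye now.";
        Span secondSentence = new Span(10, 18);
        for (Span token : tokenize(text, secondSentence)) {
            System.out.println(token.start + ".." + token.end + " "
                    + text.substring(token.start, token.end));
        }
    }
}
```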

I could not understand what you mean by using token offsets for the
sentences.


On Thu, May 30, 2013 at 12:46 PM, Jörn Kottmann <ko...@gmail.com> wrote:

> We are now one iteration further. In this new version it is
> possible to pass in a whole document at once, which leads
> to the question of how we should handle this in OpenNLP generally.
>
> To pass in a document the following information needs to be handed over:
> - Sentences
> - Tokens
> - Names
>
> And maybe also the text, depending on whether the tokens are Spans or
> Strings.
>
> If the component is stateless, all of this needs to be handed over in one
> method call; otherwise it could be handed over on a per-sentence basis
> (that's how coref does it).
>
> In the DocumentNameFinder (never implemented, but the interface is
> defined) it's done like this:
> Span[][] find(String tokens[][])
>
> In my opinion that's not a nice solution. First, it requires that the
> input text gets split into Strings. Second, it's hard to use the returned
> Spans, because they are only meaningful within the context given by the
> returned array. Names which cross sentences are not possible.
>
> Another approach could be:
> Span[] find(String text, Span sentences[], Span tokens[])
>
> Here the sentence and token offsets in the Spans are character offsets,
> and the returned Spans are token offsets.
>
> It would probably be nicer to use token offsets for the sentences as well,
> but that's currently incompatible with the sentence detector interface.
>
> Any opinions on how we should solve this?
>
> Jörn
>
>
> On 05/23/2013 03:04 PM, Jörn Kottmann wrote:
>
>> Hi all,
>>
>> please have a look at
>> https://issues.apache.org/jira/browse/OPENNLP-579
>>
>> It's about a contribution that links location entities to a geo name
>> database. The component could later be extended to link other entity
>> types to a database or dictionary as well.
>>
>> Thanks,
>> Jörn
>>
>
>

Re: OPENNLP-579

Posted by Jörn Kottmann <ko...@gmail.com>.

We are now one iteration further. In this new version it is
possible to pass in a whole document at once, which leads
to the question of how we should handle this in OpenNLP generally.

To pass in a document the following information needs to be handed over:
- Sentences
- Tokens
- Names

And maybe also the text, depending on whether the tokens are Spans or
Strings.

If the component is stateless, all of this needs to be handed over in one
method call; otherwise it could be handed over on a per-sentence basis
(that's how coref does it).

In the DocumentNameFinder (never implemented, but the interface is
defined) it's done like this:
Span[][] find(String tokens[][])

In my opinion that's not a nice solution. First, it requires that the
input text gets split into Strings. Second, it's hard to use the returned
Spans, because they are only meaningful within the context given by the
returned array. Names which cross sentences are not possible.

Another approach could be:
Span[] find(String text, Span sentences[], Span tokens[])

Here the sentence and token offsets in the Spans are character offsets,
and the returned Spans are token offsets.

It would probably be nicer to use token offsets for the sentences as
well, but that's currently incompatible with the sentence detector
interface.
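
To illustrate the gap between the two offset conventions, here is a small sketch that converts character-offset sentence Spans into token-offset sentence Spans. Everything in it is an assumption for illustration: the nested Span class stands in for OpenNLP's Span (end exclusive), toTokenOffsets is a hypothetical helper, and it assumes both arrays are sorted and that no token crosses a sentence boundary.

```java
public class SentenceOffsetConverter {

    // Hypothetical stand-in for OpenNLP's Span: half-open range [start, end).
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Maps each character-offset sentence Span to the range of token indices
    // it contains. Assumes sorted input and tokens that never cross a sentence.
    static Span[] toTokenOffsets(Span[] sentences, Span[] tokens) {
        Span[] result = new Span[sentences.length];
        int t = 0;
        for (int s = 0; s < sentences.length; s++) {
            // Skip any tokens that start before this sentence begins.
            while (t < tokens.length && tokens[t].start < sentences[s].start) {
                t++;
            }
            int first = t;
            // Consume every token that ends inside this sentence.
            while (t < tokens.length && tokens[t].end <= sentences[s].end) {
                t++;
            }
            result[s] = new Span(first, t);
        }
        return result;
    }

    public static void main(String[] args) {
        // Character offsets for the text "Hi there. Bye now."
        Span[] sentences = {new Span(0, 9), new Span(10, 18)};
        Span[] tokens = {new Span(0, 2), new Span(3, 9), new Span(10, 13), new Span(14, 18)};
        for (Span sentence : toTokenOffsets(sentences, tokens)) {
            System.out.println("tokens " + sentence.start + ".." + sentence.end);
        }
    }
}
```

A find method that takes character offsets could run a conversion like this internally, so that the name Spans it returns and the sentence boundaries it uses share one index space.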

Any opinions on how we should solve this?

Jörn

On 05/23/2013 03:04 PM, Jörn Kottmann wrote:
> Hi all,
>
> please have a look at
> https://issues.apache.org/jira/browse/OPENNLP-579
>
> It's about a contribution that links location entities to a geo name
> database. The component could later be extended to link other entity
> types to a database or dictionary as well.
>
> Thanks,
> Jörn

Re: OPENNLP-579

Posted by Aliaksandr Autayeu <al...@autayeu.com>.

Very interesting!

Jörn, I'd agree more with your advice in the JIRA comments about making it
more generic: Entity rather than just Geo. After all, we have 4 fundamental
entity types (person, location, organization, event) and all but the last
have various relatively simple to use gazetteers. One can even think of
Wikipedia as a generic gazetteer.

I didn't get exactly why one would need PostGIS for this compared to just a
database, but that's an implementation issue. It's more important to have
the interface designed properly.

Re: OPENNLP-579

Posted by William Colen <wi...@gmail.com>.

It is a very nice contribution!
I am looking forward to helping extend it to other entity types.

On Thu, May 23, 2013 at 10:04 AM, Jörn Kottmann <ko...@gmail.com> wrote:

> Hi all,
>
> please have a look at
> https://issues.apache.org/jira/browse/OPENNLP-579
>
> It's about a contribution that links location entities to a geo name
> database. The component could later be extended to link other entity
> types to a database or dictionary as well.
>
> Thanks,
> Jörn
>