Posted to dev@opennlp.apache.org by Mark G <ma...@apache.org> on 2013/10/05 23:58:51 UTC

request for Input or ideas.... EntityLinker tickets

All,
Before I plug some tickets into Jira, I wanted to get some feedback from
the team on some changes I would like to make to the EntityLinker
GeoEntityLinkerImpl
Below are what I consider improvement tickets

1. Only the first start and end are populated in the CountryContext object
returned from CountryContext.find; it should return all instances of each
country mention in a map so that the proximity of other toponyms to the found
country indicators can be included as a factor in the scoring.

Currently the user only gets the first indexOf for each country mention.
The country mentions are an attempt to better gauge ambiguous names (Paris,
Texas rather than Paris, France). Because of this, I am not able to do a
thorough proximity analysis to assist in scoring. Basically I need every
mention of every country indicator in the doc, which I will correlate with
every Named Entity span to produce a score. I am also not passing the list
of country codes into the database query as a where predicate, which would
improve performance tremendously (I will index the column).
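
To make the proximity factor concrete, here is a rough sketch of the kind of
score I have in mind, assuming the country mentions come back as a
Map<String, Set<Integer>> of country code to character offsets (the method
name and the normalization are illustrative only, not the actual implementation):

  import java.util.Map;
  import java.util.Set;

  public class ProximityScoreSketch {

    // Score a candidate toponym by how close the nearest mention of its
    // country is in the document text; closer mentions score higher.
    public static double proximityScore(int nameOffset, String countryCode,
        Map<String, Set<Integer>> countryMentions, int docLength) {
      Set<Integer> offsets = countryMentions.get(countryCode);
      if (offsets == null || offsets.isEmpty()) {
        return 0d; // no supporting country indicator found in the doc
      }
      int minDistance = Integer.MAX_VALUE;
      for (Integer offset : offsets) {
        minDistance = Math.min(minDistance, Math.abs(nameOffset - offset));
      }
      // normalize by doc length so the score stays in [0,1]
      return 1d - ((double) minDistance / docLength);
    }
  }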

2. Discovery of indicators for "country context" should be regex based, in
order to provide a more robust ability to discover context

Currently I use String.indexOf(term) to discover the country hit list.
Regex would allow users to configure interesting ways to indicate
countries. Regex will also provide the start/end offsets I need for issue
1 via Matcher.find.
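
As a sketch of what the regex-based discovery could look like (the pattern
configuration here is just a plain map for illustration; the real thing would
pull the patterns from EntityLinkerProperties):

  import java.util.HashMap;
  import java.util.Map;
  import java.util.Set;
  import java.util.TreeSet;
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  public class CountryContextSketch {

    // Collect every start offset of every configured country-indicator
    // pattern, keyed by country code (not just the first indexOf).
    public Map<String, Set<Integer>> regexfind(String docText,
        Map<String, String> countryPatterns) {
      Map<String, Set<Integer>> mentions = new HashMap<String, Set<Integer>>();
      for (Map.Entry<String, String> entry : countryPatterns.entrySet()) {
        Pattern pattern = Pattern.compile(entry.getValue(), Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(docText);
        while (matcher.find()) {
          Set<Integer> offsets = mentions.get(entry.getKey());
          if (offsets == null) {
            offsets = new TreeSet<Integer>();
            mentions.put(entry.getKey(), offsets);
          }
          offsets.add(matcher.start());
        }
      }
      return mentions;
    }
  }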

3. fuzzy string matching should be part of the scoring; this would allow the
MySQL fuzzy search to return more candidate toponyms.

Currently, the search into the MySQL gazetteers uses "boolean mode" and
each NER result is passed in as a literal string. If I implement a fuzzy
string matching based score (do we have one?), the user could turn on
"natural language" mode in MySQL; then we can generate a score and threshold
to allow for more recall on transliterated names etc.
I would also like to use proximity to the majority of points in the
document as a disambiguation criterion.
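
If we don't already have a string-similarity utility in opennlp-tools, a first
cut at the score could be something as simple as a normalized edit distance
(purely a sketch; not tied to any existing OpenNLP class):

  public class FuzzyMatchSketch {

    // Normalized Levenshtein similarity in [0,1]; 1 means identical strings.
    public static double similarity(String a, String b) {
      a = a.toLowerCase();
      b = b.toLowerCase();
      int[][] d = new int[a.length() + 1][b.length() + 1];
      for (int i = 0; i <= a.length(); i++) d[i][0] = i;
      for (int j = 0; j <= b.length(); j++) d[0][j] = j;
      for (int i = 1; i <= a.length(); i++) {
        for (int j = 1; j <= b.length(); j++) {
          int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
          d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
              d[i - 1][j - 1] + cost);
        }
      }
      int maxLen = Math.max(a.length(), b.length());
      return maxLen == 0 ? 1d : 1d - ((double) d[a.length()][b.length()] / maxLen);
    }
  }

A candidate returned by the "natural language" mode query would then be kept
only if similarity(nerResult, gazetteerName) clears a configurable threshold.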

4. provide a "solution wrapper" for the Geotagging capability

In order to make the GeoTagging a bit more "out of the box" functional, I
was thinking of creating a class that one calls find(MaxentModel, doc,
sentencedetector, EntityLinkerProperties) to abstract the current impl. I
know this is not standard practice; I just want to see what you all think.
This would make it "easier" to get this thing running.
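
Roughly what I am picturing, as a sketch only (I used TokenNameFinderModel
rather than MaxentModel since that is what NameFinderME actually takes, and the
final linking step is left as a placeholder since the class itself doesn't
exist yet):

  import opennlp.tools.namefind.NameFinderME;
  import opennlp.tools.namefind.TokenNameFinderModel;
  import opennlp.tools.sentdetect.SentenceDetector;
  import opennlp.tools.tokenize.SimpleTokenizer;
  import opennlp.tools.util.Span;

  public class GeoTaggerSketch {

    public void find(TokenNameFinderModel nerModel, String docText,
        SentenceDetector sentenceDetector) {
      Span[] sentenceSpans = sentenceDetector.sentPosDetect(docText);
      NameFinderME nameFinder = new NameFinderME(nerModel);

      String[][] tokens = new String[sentenceSpans.length][];
      Span[][] names = new Span[sentenceSpans.length][];
      for (int s = 0; s < sentenceSpans.length; s++) {
        String sentence = sentenceSpans[s].getCoveredText(docText).toString();
        tokens[s] = SimpleTokenizer.INSTANCE.tokenize(sentence);
        names[s] = nameFinder.find(tokens[s]);
      }
      nameFinder.clearAdaptiveData();
      // hand docText, sentenceSpans, tokens and names to the GeoEntityLinker
      // (plus the EntityLinkerProperties) and collect the LinkedSpans here
    }
  }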

thanks!
MG

Re: request for Input or ideas.... EntityLinker tickets

Posted by Mark G <ma...@apache.org>.
sounds good, I will get it in the sandbox...



On Tue, Nov 5, 2013 at 8:43 AM, Jörn Kottmann <ko...@gmail.com> wrote:

> On 11/05/2013 01:23 PM, Mark G wrote:
>
>> Joern, I am using Lucene inside the GeoEntityLinker impl, so do you think
>> I
>> should move the entire GeoEntityLinker impl and all its classes to a new
>> module in the sandbox and leave only the entitylinker framework in opennlp
>> tools?
>>
>
> +1, it might be nice to provide a simple dictionary based implementation
> as a sample for simple
> tasks, e.g. just map country names.
> If  the user wants to use the lucene based solution he should just depend
> on the addon module, which
> we should release together with the core components.
>
>
>  A second thought/option is to make the lucene pom entries optional in the
>> opennlptools pom, so users will have to add lucene to their pom to run the
>> geoentitylinker and the jars will not be included in the tools build
>>
>
> I really prefer the other solution because then a user needs to once
> explicitly deal with it
> to make his project work, if you do this people probably start using the
> clases and then discover
> only by try and error that there must be something missing on their
> classpath.
>
> The lemmatizer will also be part of the addon solution, if there are no
> concerns I suggest we get our
> addons started.
>
> Jörn
>

Re: request for Input or ideas.... EntityLinker tickets

Posted by Jörn Kottmann <ko...@gmail.com>.
On 11/05/2013 01:23 PM, Mark G wrote:
> Joern, I am using Lucene inside the GeoEntityLinker impl, so do you think I
> should move the entire GeoEntityLinker impl and all its classes to a new
> module in the sandbox and leave only the entitylinker framework in opennlp
> tools?

+1, it might be nice to provide a simple dictionary-based implementation
as a sample for simple tasks, e.g. just map country names.
If the user wants to use the Lucene-based solution he should just
depend on the addon module, which we should release together with the core
components.

> A second thought/option is to make the lucene pom entries optional in the
> opennlptools pom, so users will have to add lucene to their pom to run the
> geoentitylinker and the jars will not be included in the tools build

I really prefer the other solution because then a user needs to
explicitly deal with it once to make his project work. If you do this, people
will probably start using the classes and then discover only by trial and
error that there must be something missing on their classpath.

The lemmatizer will also be part of the addon solution; if there are no
concerns, I suggest we get our addons started.

Jörn

Re: request for Input or ideas.... EntityLinker tickets

Posted by Mark G <ma...@apache.org>.
Joern, I am using Lucene inside the GeoEntityLinker impl, so do you think I
should move the entire GeoEntityLinker impl and all its classes to a new
module in the sandbox and leave only the entitylinker framework in
opennlp-tools?
A second thought/option is to make the Lucene pom entries optional in the
opennlp-tools pom, so users will have to add Lucene to their pom to run the
GeoEntityLinker and the jars will not be included in the tools build.


On Tue, Nov 5, 2013 at 3:39 AM, Jörn Kottmann <ko...@gmail.com> wrote:

> On 11/03/2013 02:22 AM, Mark G wrote:
>
>> I finished with the Lucene indexing of the Gazateers, just need to get
>> them
>> tied into the gaz lookups, which is fairly simple. Do you all think I
>> should disregard all the MySQL dependency and just have Lucene? The lucene
>> index files are only about 2.5 gigs total, so very manageable to
>> distribute
>> the files across a cluster. I could keep the MySQL classes as an option,
>> but at this point the Lucene based approach is really growing on me.
>> If I don't here from anyone I am going to remove the MySQL implementation.
>>
>
> +1 I believe a Lucene based solution is easier to handle for most people,
> because it can
> be fully integrated via API (no need to install anything) and therefor
> hides most of
> the complexity.
>
> Please avoid adding a dependincy for lucene to the opennlp-tools project,
> I suggest that we
> add this code to the sandbox, or a new addon area. If people want to use a
> Lucene based dictionary
> they can depend on that module explicitly.
>
> Jörn
>

Re: request for Input or ideas.... EntityLinker tickets

Posted by Jörn Kottmann <ko...@gmail.com>.
On 11/03/2013 02:22 AM, Mark G wrote:
> I finished with the Lucene indexing of the Gazateers, just need to get them
> tied into the gaz lookups, which is fairly simple. Do you all think I
> should disregard all the MySQL dependency and just have Lucene? The lucene
> index files are only about 2.5 gigs total, so very manageable to distribute
> the files across a cluster. I could keep the MySQL classes as an option,
> but at this point the Lucene based approach is really growing on me.
> If I don't here from anyone I am going to remove the MySQL implementation.

+1 I believe a Lucene-based solution is easier to handle for most people,
because it can be fully integrated via API (no need to install anything)
and therefore hides most of the complexity.

Please avoid adding a dependency on Lucene to the opennlp-tools project;
I suggest that we add this code to the sandbox, or a new addon area.
If people want to use a Lucene-based dictionary they can depend on that
module explicitly.

Jörn

Re: request for Input or ideas.... EntityLinker tickets

Posted by Mark G <ma...@apache.org>.
I finished with the Lucene indexing of the gazetteers; I just need to get them
tied into the gaz lookups, which is fairly simple. Do you all think I
should drop the MySQL dependency entirely and just have Lucene? The Lucene
index files are only about 2.5 gigs total, so it is very manageable to
distribute the files across a cluster. I could keep the MySQL classes as an
option, but at this point the Lucene-based approach is really growing on me.
If I don't hear from anyone I am going to remove the MySQL implementation.
Thanks
MG


On Wed, Oct 30, 2013 at 7:34 PM, Lance Norskog <go...@gmail.com> wrote:

> Just to elaborate- The RAMDirectory storage is in Java GC. This makes Java
> GC work very very hard. A memory-mapped file is a write-through cache for
> file contents. The memory in the cache is outside of Java garbage
> collection. A memory-mapped index will take a little less time to create at
> these volumes. Loading a pre-built memory-mapped index will be under 5
> seconds.
>
>
> On 10/29/2013 03:43 PM, Mark G wrote:
>
>> thanks, that was my next option with lucene. Build the indexes from the
>> gaz
>> files and keep them up to date in one place, and make sure something like
>> puppet will distribute them to each node in a cluster on some interval,
>> then each task (map reduce or whatever) can use that file resource. I'll
>> let everyone know how it goes
>> MG
>>
>>
>> On Tue, Oct 29, 2013 at 6:06 PM, Lance Norskog <go...@gmail.com> wrote:
>>
>>  This is what memory-mapped file indexes are for! RAMDirectory is for very
>>> small projects.
>>>
>>>
>>> On 10/29/2013 04:00 AM, Mark G wrote:
>>>
>>>  FYI, I implemented an in mem lucene index of the NGA Geonames. It was
>>>> almost 7 GB ram and took about 40 minutes to load.
>>>> Still looking at other DBs/Indexes. So one would need at least 10G ram
>>>> to
>>>> hold the USGS and NGA gazateers.
>>>>
>>>>
>>>> On Fri, Oct 25, 2013 at 6:21 AM, Mark G <gi...@gmail.com> wrote:
>>>>
>>>>   I wrote a quick lucene RAMDirectory in memory index, it looks like a
>>>>
>>>>> valid
>>>>> option to hold the gazateers and it provides good text search of
>>>>> course.
>>>>> The idea is that at runtime the geoentitylinker would pull three files
>>>>> off
>>>>> disk, the NGAGeonames file, the USGS FIle, and the CountryContext
>>>>> indicator
>>>>> file and lucene index them in memory,. initially this will take a
>>>>> while.
>>>>> So, deployment wise, you would have to use your tool of choice (ie
>>>>> Puppet)
>>>>> to distribute the files to each node, or mount a share to each node. My
>>>>> concern with this approach is that each MR Task runs in it's own JVM,
>>>>> so
>>>>> each task on each node will consume this much memory unless you do
>>>>> something interesting with memory mapping. The EntityLinkerProperties
>>>>> file
>>>>> will support the config of the file locations and whether to use DB or
>>>>> in
>>>>> mem Lucene...
>>>>>
>>>>> I am also working on a Postgres version of the gazateer structures and
>>>>> stored procs.
>>>>>
>>>>> Thoughts?
>>>>>
>>>>>
>>>>> On Wed, Oct 23, 2013 at 8:46 AM, Jörn Kottmann <ko...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>   On 10/23/2013 01:14 PM, Mark G wrote:
>>>>>
>>>>>>   All that being said, it is totally possible to run an in memory
>>>>>> version
>>>>>>
>>>>>>> of
>>>>>>> the gazateer. Personally, I like the DB approach, it provides a lot
>>>>>>> of
>>>>>>> flexibility and power.
>>>>>>>
>>>>>>>   Yes, and you can even use a DB to run in-memory which works with
>>>>>>> the
>>>>>>>
>>>>>> current implementation,
>>>>>> I think I will experiment with that.
>>>>>>
>>>>>> I don't really mind using 3 GB memory for it, since my Hadoop servers
>>>>>> have more than enough anyway,
>>>>>> and it makes the deployment easier (don't have to deal with installing
>>>>>> MySQL
>>>>>> databases and keeping them in sync).
>>>>>>
>>>>>> Jörn
>>>>>>
>>>>>>
>>>>>>
>

Re: request for Input or ideas.... EntityLinker tickets

Posted by Lance Norskog <go...@gmail.com>.
Just to elaborate: the RAMDirectory storage lives on the Java heap, which makes
Java GC work very, very hard. A memory-mapped file is a write-through
cache for file contents. The memory in the cache is outside of Java
garbage collection. A memory-mapped index will take a little less time
to create at these volumes. Loading a pre-built memory-mapped index will
be under 5 seconds.
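
For reference, pointing the gazetteer index at a memory-mapped directory is
essentially a one-line change on the Lucene side; roughly, with Lucene 4.x-era
APIs (the index path is made up):

  import java.io.File;
  import java.io.IOException;
  import org.apache.lucene.index.DirectoryReader;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.MMapDirectory;

  public class MMapGazetteerSketch {

    // Open a pre-built index via mmap instead of loading it onto the Java
    // heap with RAMDirectory; the OS page cache then holds the index data.
    public static IndexSearcher open(String indexPath) throws IOException {
      Directory dir = new MMapDirectory(new File(indexPath));
      return new IndexSearcher(DirectoryReader.open(dir));
    }
  }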

On 10/29/2013 03:43 PM, Mark G wrote:
> thanks, that was my next option with lucene. Build the indexes from the gaz
> files and keep them up to date in one place, and make sure something like
> puppet will distribute them to each node in a cluster on some interval,
> then each task (map reduce or whatever) can use that file resource. I'll
> let everyone know how it goes
> MG
>
>
> On Tue, Oct 29, 2013 at 6:06 PM, Lance Norskog <go...@gmail.com> wrote:
>
>> This is what memory-mapped file indexes are for! RAMDirectory is for very
>> small projects.
>>
>>
>> On 10/29/2013 04:00 AM, Mark G wrote:
>>
>>> FYI, I implemented an in mem lucene index of the NGA Geonames. It was
>>> almost 7 GB ram and took about 40 minutes to load.
>>> Still looking at other DBs/Indexes. So one would need at least 10G ram to
>>> hold the USGS and NGA gazateers.
>>>
>>>
>>> On Fri, Oct 25, 2013 at 6:21 AM, Mark G <gi...@gmail.com> wrote:
>>>
>>>   I wrote a quick lucene RAMDirectory in memory index, it looks like a
>>>> valid
>>>> option to hold the gazateers and it provides good text search of course.
>>>> The idea is that at runtime the geoentitylinker would pull three files
>>>> off
>>>> disk, the NGAGeonames file, the USGS FIle, and the CountryContext
>>>> indicator
>>>> file and lucene index them in memory,. initially this will take a while.
>>>> So, deployment wise, you would have to use your tool of choice (ie
>>>> Puppet)
>>>> to distribute the files to each node, or mount a share to each node. My
>>>> concern with this approach is that each MR Task runs in it's own JVM, so
>>>> each task on each node will consume this much memory unless you do
>>>> something interesting with memory mapping. The EntityLinkerProperties
>>>> file
>>>> will support the config of the file locations and whether to use DB or in
>>>> mem Lucene...
>>>>
>>>> I am also working on a Postgres version of the gazateer structures and
>>>> stored procs.
>>>>
>>>> Thoughts?
>>>>
>>>>
>>>> On Wed, Oct 23, 2013 at 8:46 AM, Jörn Kottmann <ko...@gmail.com>
>>>> wrote:
>>>>
>>>>   On 10/23/2013 01:14 PM, Mark G wrote:
>>>>>   All that being said, it is totally possible to run an in memory version
>>>>>> of
>>>>>> the gazateer. Personally, I like the DB approach, it provides a lot of
>>>>>> flexibility and power.
>>>>>>
>>>>>>   Yes, and you can even use a DB to run in-memory which works with the
>>>>> current implementation,
>>>>> I think I will experiment with that.
>>>>>
>>>>> I don't really mind using 3 GB memory for it, since my Hadoop servers
>>>>> have more than enough anyway,
>>>>> and it makes the deployment easier (don't have to deal with installing
>>>>> MySQL
>>>>> databases and keeping them in sync).
>>>>>
>>>>> Jörn
>>>>>
>>>>>


Re: request for Input or ideas.... EntityLinker tickets

Posted by Mark G <gi...@gmail.com>.
thanks, that was my next option with Lucene: build the indexes from the gaz
files and keep them up to date in one place, and make sure something like
Puppet will distribute them to each node in a cluster on some interval;
then each task (MapReduce or whatever) can use that file resource. I'll
let everyone know how it goes.
MG


On Tue, Oct 29, 2013 at 6:06 PM, Lance Norskog <go...@gmail.com> wrote:

> This is what memory-mapped file indexes are for! RAMDirectory is for very
> small projects.
>
>
> On 10/29/2013 04:00 AM, Mark G wrote:
>
>> FYI, I implemented an in mem lucene index of the NGA Geonames. It was
>> almost 7 GB ram and took about 40 minutes to load.
>> Still looking at other DBs/Indexes. So one would need at least 10G ram to
>> hold the USGS and NGA gazateers.
>>
>>
>> On Fri, Oct 25, 2013 at 6:21 AM, Mark G <gi...@gmail.com> wrote:
>>
>>  I wrote a quick lucene RAMDirectory in memory index, it looks like a
>>> valid
>>> option to hold the gazateers and it provides good text search of course.
>>> The idea is that at runtime the geoentitylinker would pull three files
>>> off
>>> disk, the NGAGeonames file, the USGS FIle, and the CountryContext
>>> indicator
>>> file and lucene index them in memory,. initially this will take a while.
>>> So, deployment wise, you would have to use your tool of choice (ie
>>> Puppet)
>>> to distribute the files to each node, or mount a share to each node. My
>>> concern with this approach is that each MR Task runs in it's own JVM, so
>>> each task on each node will consume this much memory unless you do
>>> something interesting with memory mapping. The EntityLinkerProperties
>>> file
>>> will support the config of the file locations and whether to use DB or in
>>> mem Lucene...
>>>
>>> I am also working on a Postgres version of the gazateer structures and
>>> stored procs.
>>>
>>> Thoughts?
>>>
>>>
>>> On Wed, Oct 23, 2013 at 8:46 AM, Jörn Kottmann <ko...@gmail.com>
>>> wrote:
>>>
>>>  On 10/23/2013 01:14 PM, Mark G wrote:
>>>>
>>>>  All that being said, it is totally possible to run an in memory version
>>>>> of
>>>>> the gazateer. Personally, I like the DB approach, it provides a lot of
>>>>> flexibility and power.
>>>>>
>>>>>  Yes, and you can even use a DB to run in-memory which works with the
>>>> current implementation,
>>>> I think I will experiment with that.
>>>>
>>>> I don't really mind using 3 GB memory for it, since my Hadoop servers
>>>> have more than enough anyway,
>>>> and it makes the deployment easier (don't have to deal with installing
>>>> MySQL
>>>> databases and keeping them in sync).
>>>>
>>>> Jörn
>>>>
>>>>
>>>
>

Re: request for Input or ideas.... EntityLinker tickets

Posted by Lance Norskog <go...@gmail.com>.
This is what memory-mapped file indexes are for! RAMDirectory is for 
very small projects.

On 10/29/2013 04:00 AM, Mark G wrote:
> FYI, I implemented an in mem lucene index of the NGA Geonames. It was
> almost 7 GB ram and took about 40 minutes to load.
> Still looking at other DBs/Indexes. So one would need at least 10G ram to
> hold the USGS and NGA gazateers.
>
>
> On Fri, Oct 25, 2013 at 6:21 AM, Mark G <gi...@gmail.com> wrote:
>
>> I wrote a quick lucene RAMDirectory in memory index, it looks like a valid
>> option to hold the gazateers and it provides good text search of course.
>> The idea is that at runtime the geoentitylinker would pull three files off
>> disk, the NGAGeonames file, the USGS FIle, and the CountryContext indicator
>> file and lucene index them in memory,. initially this will take a while.
>> So, deployment wise, you would have to use your tool of choice (ie Puppet)
>> to distribute the files to each node, or mount a share to each node. My
>> concern with this approach is that each MR Task runs in it's own JVM, so
>> each task on each node will consume this much memory unless you do
>> something interesting with memory mapping. The EntityLinkerProperties file
>> will support the config of the file locations and whether to use DB or in
>> mem Lucene...
>>
>> I am also working on a Postgres version of the gazateer structures and
>> stored procs.
>>
>> Thoughts?
>>
>>
>> On Wed, Oct 23, 2013 at 8:46 AM, Jörn Kottmann <ko...@gmail.com> wrote:
>>
>>> On 10/23/2013 01:14 PM, Mark G wrote:
>>>
>>>> All that being said, it is totally possible to run an in memory version
>>>> of
>>>> the gazateer. Personally, I like the DB approach, it provides a lot of
>>>> flexibility and power.
>>>>
>>> Yes, and you can even use a DB to run in-memory which works with the
>>> current implementation,
>>> I think I will experiment with that.
>>>
>>> I don't really mind using 3 GB memory for it, since my Hadoop servers
>>> have more than enough anyway,
>>> and it makes the deployment easier (don't have to deal with installing
>>> MySQL
>>> databases and keeping them in sync).
>>>
>>> Jörn
>>>
>>


Re: request for Input or ideas.... EntityLinker tickets

Posted by Mark G <gi...@gmail.com>.
FYI, I implemented an in-memory Lucene index of the NGA Geonames. It used
almost 7 GB of RAM and took about 40 minutes to load.
Still looking at other DBs/indexes. So one would need at least 10 GB of RAM to
hold the USGS and NGA gazetteers.


On Fri, Oct 25, 2013 at 6:21 AM, Mark G <gi...@gmail.com> wrote:

> I wrote a quick lucene RAMDirectory in memory index, it looks like a valid
> option to hold the gazateers and it provides good text search of course.
> The idea is that at runtime the geoentitylinker would pull three files off
> disk, the NGAGeonames file, the USGS FIle, and the CountryContext indicator
> file and lucene index them in memory,. initially this will take a while.
> So, deployment wise, you would have to use your tool of choice (ie Puppet)
> to distribute the files to each node, or mount a share to each node. My
> concern with this approach is that each MR Task runs in it's own JVM, so
> each task on each node will consume this much memory unless you do
> something interesting with memory mapping. The EntityLinkerProperties file
> will support the config of the file locations and whether to use DB or in
> mem Lucene...
>
> I am also working on a Postgres version of the gazateer structures and
> stored procs.
>
> Thoughts?
>
>
> On Wed, Oct 23, 2013 at 8:46 AM, Jörn Kottmann <ko...@gmail.com> wrote:
>
>> On 10/23/2013 01:14 PM, Mark G wrote:
>>
>>> All that being said, it is totally possible to run an in memory version
>>> of
>>> the gazateer. Personally, I like the DB approach, it provides a lot of
>>> flexibility and power.
>>>
>>
>> Yes, and you can even use a DB to run in-memory which works with the
>> current implementation,
>> I think I will experiment with that.
>>
>> I don't really mind using 3 GB memory for it, since my Hadoop servers
>> have more than enough anyway,
>> and it makes the deployment easier (don't have to deal with installing
>> MySQL
>> databases and keeping them in sync).
>>
>> Jörn
>>
>
>

Re: request for Input or ideas.... EntityLinker tickets

Posted by Mark G <gi...@gmail.com>.
I wrote a quick Lucene RAMDirectory in-memory index; it looks like a valid
option to hold the gazetteers and it provides good text search of course.
The idea is that at runtime the GeoEntityLinker would pull three files off
disk, the NGAGeonames file, the USGS file, and the CountryContext indicator
file, and Lucene-index them in memory; initially this will take a while.
So, deployment-wise, you would have to use your tool of choice (e.g. Puppet)
to distribute the files to each node, or mount a share to each node. My
concern with this approach is that each MR Task runs in its own JVM, so
each task on each node will consume this much memory unless you do
something interesting with memory mapping. The EntityLinkerProperties file
will support the config of the file locations and whether to use DB or
in-memory Lucene...

I am also working on a Postgres version of the gazetteer structures and
stored procs.

Thoughts?


On Wed, Oct 23, 2013 at 8:46 AM, Jörn Kottmann <ko...@gmail.com> wrote:

> On 10/23/2013 01:14 PM, Mark G wrote:
>
>> All that being said, it is totally possible to run an in memory version of
>> the gazateer. Personally, I like the DB approach, it provides a lot of
>> flexibility and power.
>>
>
> Yes, and you can even use a DB to run in-memory which works with the
> current implementation,
> I think I will experiment with that.
>
> I don't really mind using 3 GB memory for it, since my Hadoop servers have
> more than enough anyway,
> and it makes the deployment easier (don't have to deal with installing
> MySQL
> databases and keeping them in sync).
>
> Jörn
>

Re: request for Input or ideas.... EntityLinker tickets

Posted by Mark G <ma...@apache.org>.
I am looking at the EntityLinker interface, and I would like to add this
method (one which I think was proposed very early on). This allows an
entire doc's worth of NEs to be processed. Currently, if a scoring routine
needs all the results from the entire document, the scorer cannot be called
from within the EntityLinker impl. The method below allows a user to
perform all NER as normal for an entire doc and then pass all that info into
this method. I realized this when writing the scoring algorithms for the
GeoEntityLinker... some require all the hits for the doc, some don't, so I
was using some scorers internally and some afterwards, and it got messy and
confusing. This would also allow for better pipeline integration, so no
scorers would have to be chained after the entity linking; it would all
happen within.

Thoughts?

like this:

  public List<LinkedSpan> find(String doctext, Span[] sentences,
      String[][] tokens, Span[][] names) {
    List<LinkedSpan> spans = new ArrayList<LinkedSpan>();
    for (int s = 0; s < sentences.length; s++) {
      for (String name : Span.spansToStrings(names[s], tokens[s])) {
        // look the name up in the gazetteer, score it against the whole doc,
        // and add a LinkedSpan for it
      }
    }
    return spans;
  }


On Wed, Oct 23, 2013 at 11:36 AM, Mark G <gi...@gmail.com> wrote:

> not sure if the in mem approach will provide the equivalent to full text
> indexing....but worth a try. Another design pattern is to just install one
> DB and have all the nodes connect. I have done this with Postgres on a
> 40ish node hadoop cluster. The queries against the db's full text index are
> not that expensive for mysql, it's not a complex query, just a seek on the
> full text index.  But, of course, it depends on how much concurrency it
> will get, which depends on how much data, nodes, and tasks you have....
> Generically I think the right answer is to be able to configure the
> connection behind the GeoEntityLinker... in mem || remote db || locahost db
>
>
>
> On Wed, Oct 23, 2013 at 8:46 AM, Jörn Kottmann <ko...@gmail.com> wrote:
>
>> On 10/23/2013 01:14 PM, Mark G wrote:
>>
>>> All that being said, it is totally possible to run an in memory version
>>> of
>>> the gazateer. Personally, I like the DB approach, it provides a lot of
>>> flexibility and power.
>>>
>>
>> Yes, and you can even use a DB to run in-memory which works with the
>> current implementation,
>> I think I will experiment with that.
>>
>> I don't really mind using 3 GB memory for it, since my Hadoop servers
>> have more than enough anyway,
>> and it makes the deployment easier (don't have to deal with installing
>> MySQL
>> databases and keeping them in sync).
>>
>> Jörn
>>
>
>

Re: request for Input or ideas.... EntityLinker tickets

Posted by Mark G <gi...@gmail.com>.
not sure if the in-memory approach will provide the equivalent of full-text
indexing... but worth a try. Another design pattern is to just install one
DB and have all the nodes connect. I have done this with Postgres on a
40ish-node Hadoop cluster. The queries against the db's full-text index are
not that expensive for MySQL; it's not a complex query, just a seek on the
full-text index. But, of course, it depends on how much concurrency it
will get, which depends on how much data, nodes, and tasks you have...
Generically I think the right answer is to be able to configure the
connection behind the GeoEntityLinker... in mem || remote db || localhost db



On Wed, Oct 23, 2013 at 8:46 AM, Jörn Kottmann <ko...@gmail.com> wrote:

> On 10/23/2013 01:14 PM, Mark G wrote:
>
>> All that being said, it is totally possible to run an in memory version of
>> the gazateer. Personally, I like the DB approach, it provides a lot of
>> flexibility and power.
>>
>
> Yes, and you can even use a DB to run in-memory which works with the
> current implementation,
> I think I will experiment with that.
>
> I don't really mind using 3 GB memory for it, since my Hadoop servers have
> more than enough anyway,
> and it makes the deployment easier (don't have to deal with installing
> MySQL
> databases and keeping them in sync).
>
> Jörn
>

Re: request for Input or ideas.... EntityLinker tickets

Posted by Jörn Kottmann <ko...@gmail.com>.
On 10/23/2013 01:14 PM, Mark G wrote:
> All that being said, it is totally possible to run an in memory version of
> the gazateer. Personally, I like the DB approach, it provides a lot of
> flexibility and power.

Yes, and you can even use a DB to run in-memory, which works with the
current implementation; I think I will experiment with that.

I don't really mind using 3 GB of memory for it, since my Hadoop servers
have more than enough anyway, and it makes the deployment easier (don't have
to deal with installing MySQL databases and keeping them in sync).

Jörn

Re: request for Input or ideas.... EntityLinker tickets

Posted by Mark G <gi...@gmail.com>.
The database is only about 3 GB of storage right now. Since I used pure JDBC
and JDBC-style stored proc calls, it can run with any JDBC driver, and all
the connection props are in the EntityLinkerProperties file, so it can run
on other database engines. Currently it is optional to use the MySQL fuzzy
string matching; all one has to do is change the stored proc to boolean
mode rather than natural language mode. If you really mean, do we have to
use MySQL FULL TEXT *INDEXING*, then no, but with around 10 million
toponyms it provides super fast lookups without consuming a lot of memory.
If I were running the OpenNLP GeoEntityLinker in, say, MapReduce, and I am
running multiple tasks on each node, I would not want to pull 3 GB into
memory for each task. The way it is now, one could distribute MySQL to each
node via something like Puppet and it would serve requests from the tasks
on that node. Or if they have a beefy server they could make one large
instance of MySQL and have each node connect from the cluster.
All that being said, it is totally possible to run an in-memory version of
the gazetteer. Personally, I like the DB approach; it provides a lot of
flexibility and power.
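
For anyone curious, switching between the two modes is just the MATCH ...
AGAINST clause in the lookup query; roughly, via JDBC (the table and column
names here are placeholders, not the actual schema):

  import java.sql.Connection;
  import java.sql.PreparedStatement;
  import java.sql.ResultSet;
  import java.sql.SQLException;

  public class GazetteerLookupSketch {

    // Look a NER result up in the gazetteer's full-text index; natural
    // language mode ranks by relevance for broader recall, boolean mode
    // matches the literal terms.
    public ResultSet lookup(Connection conn, String nerResult, boolean fuzzy)
        throws SQLException {
      String mode = fuzzy ? "IN NATURAL LANGUAGE MODE" : "IN BOOLEAN MODE";
      PreparedStatement stmt = conn.prepareStatement(
          "SELECT * FROM gazetteer WHERE MATCH(fulltextname) AGAINST(? " + mode + ")");
      stmt.setString(1, nerResult);
      return stmt.executeQuery();
    }
  }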


On Tue, Oct 22, 2013 at 2:39 PM, Jörn Kottmann <ko...@gmail.com> wrote:

> On 10/05/2013 11:58 PM, Mark G wrote:
>
>> 3. fuzzy string matching should be part of the scoring, this would allow
>> mysql fuzzy search to return more candidate toponyms.
>>
>> Currently, the search into the MySQL gazateers is using "boolean mode" and
>> each NER result is passed in as a literal string. If I implement a fuzzy
>> string matching based score (do we have one?) the user could turn on
>> "natural language" mode in MySQL then we can generate a score and thresh
>> to
>> allow for more recall on transliterated names etc....
>> I would also like to use proximity to the majority of points in the
>> document as a disambiguation criteria as well.
>>
>
> It would probably be nice if this would work with other databases too,
> e.g. Apache Derby,
> or some in-memory database, maybe even Lucene.
>
> Would it be possible to not use the MySQL fuzzy string matching feature
> for this?
>
> I would like to run your code, but its difficult to scale the MySQL
> database in my scenario,
> but I have lots of RAM and believe the geonames dataset could fit into it
> to provide
> super fast lookups for me on my worker servers.
>
> Jörn
>

Re: request for Input or ideas.... EntityLinker tickets

Posted by Jörn Kottmann <ko...@gmail.com>.
On 10/05/2013 11:58 PM, Mark G wrote:
> 3. fuzzy string matching should be part of the scoring, this would allow
> mysql fuzzy search to return more candidate toponyms.
>
> Currently, the search into the MySQL gazateers is using "boolean mode" and
> each NER result is passed in as a literal string. If I implement a fuzzy
> string matching based score (do we have one?) the user could turn on
> "natural language" mode in MySQL then we can generate a score and thresh to
> allow for more recall on transliterated names etc....
> I would also like to use proximity to the majority of points in the
> document as a disambiguation criteria as well.

It would probably be nice if this would work with other databases too,
e.g. Apache Derby, or some in-memory database, maybe even Lucene.

Would it be possible to not use the MySQL fuzzy string matching feature
for this?

I would like to run your code, but it's difficult to scale the MySQL
database in my scenario; however, I have lots of RAM and believe the geonames
dataset could fit into it to provide super fast lookups for me on my worker
servers.

Jörn

Re: request for Input or ideas.... EntityLinker tickets

Posted by Jörn Kottmann <ko...@gmail.com>.
On 10/23/2013 01:02 PM, Mark G wrote:
> I have never used UIMA, but I have heard good things. All the analytics
> processes I run are in Hadoop Mapreduce and there are cascading jobs that
> do many different things. However, this sounds like a good idea for a
> "solution wrapper," and I understand and agree with your concern about
> creating classes which combine components.
> I would like to try it in UIMA, sounds great, where in the UIMA project do
> I start?

We are running UIMA pipelines inside our MapReduce jobs to do the NLP
for us; you can find information about how to do that here:
https://cwiki.apache.org/confluence/display/UIMA/Running+UIMA+Apps+on+Hadoop

There are a couple of good getting-started guides for UIMA on the
website; have a look there.

Jörn

Re: request for Input or ideas.... EntityLinker tickets

Posted by Mark G <gi...@gmail.com>.
I have never used UIMA, but I have heard good things. All the analytics
processes I run are in Hadoop MapReduce and there are cascading jobs that
do many different things. However, this sounds like a good idea for a
"solution wrapper," and I understand and agree with your concern about
creating classes which combine components.
I would like to try it in UIMA; sounds great. Where in the UIMA project do
I start?


On Tue, Oct 22, 2013 at 2:29 PM, Jörn Kottmann <ko...@gmail.com> wrote:

> On 10/05/2013 11:58 PM, Mark G wrote:
>
>> 4. provide a "solution wrapper" for the Geotagging capability
>>
>> In order to make the GeoTagging a bit more "out of the box" functional, I
>> was thinking of creating a class that one calls find(MaxentModel, doc,
>> sentencedetector, EntityLinkerProperties) to abstract the current impl. I
>> know this is not standard practice, just want to see what you all think.
>> This would make it "easier" to get this thing running.
>>
>
>
> What do you think about using a solution like UIMA to do this? I am not
> sure how you
> are intending to run your NLP pipelines but in my experiences that has
> worked out
> really well. UIMA can help to solve some production problems like
> scalability, error handling,
> etc.
>
> If you are interested in this you could write an Analysis Engine for the
> Entity Linker and add
> it to opennlp-uima.
>
> I still believe it is not a good idea to make classes which combine
> components to use them out of
> the box, because that never really suits all of our users, and it is easy
> to implement inside a user project.
>
> Anyway we should add command line support and implement a class which can
> demonstrate how the entity linker
> works in a similar fashion as our other command line tools.
>
> Jörn
>

Re: request for Input or ideas.... EntityLinker tickets

Posted by Jörn Kottmann <ko...@gmail.com>.
On 10/05/2013 11:58 PM, Mark G wrote:
> 4. provide a "solution wrapper" for the Geotagging capability
>
> In order to make the GeoTagging a bit more "out of the box" functional, I
> was thinking of creating a class that one calls find(MaxentModel, doc,
> sentencedetector, EntityLinkerProperties) to abstract the current impl. I
> know this is not standard practice, just want to see what you all think.
> This would make it "easier" to get this thing running.


What do you think about using a solution like UIMA to do this? I am not
sure how you are intending to run your NLP pipelines, but in my experience
that has worked out really well. UIMA can help to solve some production
problems like scalability, error handling, etc.

If you are interested in this, you could write an Analysis Engine for the
Entity Linker and add it to opennlp-uima.
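
A bare-bones Analysis Engine for it would just extend JCasAnnotator_ImplBase,
something like the sketch below (the annotation types and the linker setup
depend on the pipeline's type system and are only hinted at in comments):

  import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
  import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
  import org.apache.uima.jcas.JCas;

  public class EntityLinkerAnnotator extends JCasAnnotator_ImplBase {

    @Override
    public void process(JCas cas) throws AnalysisEngineProcessException {
      String docText = cas.getDocumentText();
      // iterate over the sentence, token and name annotations already in the
      // CAS, call the EntityLinker on docText and those spans, and write the
      // resulting link annotations back into the CAS
    }
  }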

I still believe it is not a good idea to make classes which combine 
components to use them out of
the box, because that never really suits all of our users, and it is 
easy to implement inside a user project.

Anyway, we should add command-line support and implement a class which
can demonstrate how the entity linker works in a similar fashion to our
other command-line tools.

Jörn

Re: request for Input or ideas.... EntityLinker tickets

Posted by Mark G <gi...@gmail.com>.
Currently this regex finding of country context is done in a CountryContext
class which sits behind the GeoEntityLinker impl itself. This class's
regexfind method takes the full doc text as a param and returns a hashmap
of each country code to a set of mentions in the doc:

  public Map<String, Set<Integer>> regexfind(String docText,
      EntityLinkerProperties properties)

This could be done as a NameFinder impl extension, but since it was
specific to the GeoEntityLinker impl I didn't bother, though initially I did
think of this.



On Tue, Oct 22, 2013 at 2:45 PM, Jörn Kottmann <ko...@gmail.com> wrote:

> On 10/05/2013 11:58 PM, Mark G wrote:
>
>> 2. Discovery of indicators for "country context" should be regex based, in
>> order to provide a more robust ability to discover context
>>
>> Currenty I use a String.indexOf(term) to discover the country hit list.
>> Regex would allow users to configure interesting ways to indicate
>> countries. Regex will also provide the array of start/end I need for issue
>> 1 from its Matcher.find
>>
>
> Can we reuse the name finder for this? The user could simply provide a
> name finder which
> can do this depending on what is possible for him, e.g. trained on his
> data, regex based,
> dictionary based, etc.
>
> Jörn
>

Re: request for Input or ideas.... EntityLinker tickets

Posted by Jörn Kottmann <ko...@gmail.com>.
On 10/05/2013 11:58 PM, Mark G wrote:
> 2. Discovery of indicators for "country context" should be regex based, in
> order to provide a more robust ability to discover context
>
> Currenty I use a String.indexOf(term) to discover the country hit list.
> Regex would allow users to configure interesting ways to indicate
> countries. Regex will also provide the array of start/end I need for issue
> 1 from its Matcher.find

Can we reuse the name finder for this? The user could simply provide a
name finder which can do this depending on what is possible for him, e.g.
trained on his data, regex-based, dictionary-based, etc.

Jörn

Re: request for Input or ideas.... EntityLinker tickets

Posted by Mark G <gi...@gmail.com>.
I'll take a look at the Leipzig project; I'm not familiar with it. But the
idea is to allow users to wire up whatever data they have and not tie it to
any particular format; the tool now just produces OpenNLP format... however,
I can write a LeipzigSentenceProvider or LeipzigKnownEntityProvider impl and
it would work with the framework as-is.
thanks


On Fri, Oct 11, 2013 at 6:13 AM, Jörn Kottmann <ko...@gmail.com> wrote:

> On 10/11/2013 11:51 AM, Mark G wrote:
>
>> Thanks Joern. Good question about license.... I wrote a web crawler and it
>> polls a bunch of RSS news feeds (google news and BBC mainly) as well as
>> wikipedia and then recursively scrapes to N depth on them. So.... It's
>> hard
>> to say what the license would be, I will look deeper, and maybe only use
>> the wiki data.
>>
>
> The Leipzig project is doing something similar for many languages, maybe
> it would be good
> solution to just make it work with their data format.
>
> What do you think?
>
> Jörn
>

Re: request for Input or ideas.... EntityLinker tickets

Posted by Jörn Kottmann <ko...@gmail.com>.
On 10/11/2013 11:51 AM, Mark G wrote:
> Thanks Joern. Good question about license.... I wrote a web crawler and it
> polls a bunch of RSS news feeds (google news and BBC mainly) as well as
> wikipedia and then recursively scrapes to N depth on them. So.... It's hard
> to say what the license would be, I will look deeper, and maybe only use
> the wiki data.

The Leipzig project is doing something similar for many languages; maybe
it would be a good solution to just make it work with their data format.

What do you think?

Jörn

Re: request for Input or ideas.... EntityLinker tickets

Posted by Mark G <gi...@gmail.com>.
Thanks Joern. Good question about the license... I wrote a web crawler and it
polls a bunch of RSS news feeds (Google News and BBC mainly) as well as
Wikipedia and then recursively scrapes to N depth on them. So... it's hard
to say what the license would be; I will look deeper, and maybe only use
the wiki data.
thanks


On Fri, Oct 11, 2013 at 3:17 AM, Jörn Kottmann <ko...@gmail.com> wrote:

> On 10/10/2013 06:54 PM, Mark G wrote:
>
>> thanks, I am also working on a rapid model builder framework that I would
>> like you to look at. I posted a description earlier but no feedback yet, I
>> was thinking I could check it into the sandbox so everyone can run it,
>> along with a filebased implementation that includes a file of ~200K
>> sentences.
>> This tool would allow users to specify a file of sentences from their
>> data,
>> a file (dictionary) of known named entities, and a blacklist file (for
>> false positive reduction) in order to build a model for a specific entity
>> type.
>>
>
> +1 I posted feedback to this on the user list.
>
> Just go ahead and open a Jira issue for it, and then add it to the sandbox.
>
> What is the license of the sentence file?
>
> Jörn
>

Re: request for Input or ideas.... EntityLinker tickets

Posted by Jörn Kottmann <ko...@gmail.com>.
On 10/10/2013 06:54 PM, Mark G wrote:
> thanks, I am also working on a rapid model builder framework that I would
> like you to look at. I posted a description earlier but no feedback yet, I
> was thinking I could check it into the sandbox so everyone can run it,
> along with a filebased implementation that includes a file of ~200K
> sentences.
> This tool would allow users to specify a file of sentences from their data,
> a file (dictionary) of known named entities, and a blacklist file (for
> false positive reduction) in order to build a model for a specific entity
> type.

+1 I posted feedback to this on the user list.

Just go ahead and open a Jira issue for it, and then add it to the sandbox.

What is the license of the sentence file?

Jörn

Re: request for Input or ideas.... EntityLinker tickets

Posted by Mark G <gi...@gmail.com>.
thanks, I am also working on a rapid model-builder framework that I would
like you to look at. I posted a description earlier but got no feedback yet; I
was thinking I could check it into the sandbox so everyone can run it,
along with a file-based implementation that includes a file of ~200K
sentences.
This tool would allow users to specify a file of sentences from their data,
a file (dictionary) of known named entities, and a blacklist file (for
false-positive reduction) in order to build a model for a specific entity
type.
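
To make that concrete, the pluggable pieces could be as simple as a couple of
interfaces along these lines (names and methods are just a strawman at this
point, not the actual code):

  import java.util.List;
  import java.util.Set;

  public interface SentenceProvider {
    // raw sentences pulled from the user's data (e.g. from a flat file)
    List<String> getSentences();
  }

  interface KnownEntityProvider {
    // dictionary of known named entities for the target entity type
    Set<String> getKnownEntities();

    // terms that generate false positives and should never be annotated
    Set<String> getBlacklist();
  }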


On Thu, Oct 10, 2013 at 12:00 PM, Jörn Kottmann <ko...@gmail.com> wrote:

> I will have a look at it tomorrow, we are planning on using the
> entitylinker in on of
> our systems.
>
> Jörn
>
>
> On 10/05/2013 11:58 PM, Mark G wrote:
>
>> All,
>> Before I plug some tickets into Jira, I wanted to get some feedback from
>> the team on some changes I would like to make to the EntityLinker
>> GeoEntityLinkerImpl
>> Below are what I consider improvement tickets
>>
>> 1. Only the first start and end are populated in CountryContext object
>> when
>> returned from CountryContext.find, it should return all instances of each
>> country mention in a map so the proximity of other toponyms to the found
>> country indicators can be included as a factor in the scoring
>>
>> Currently the user only gets the first indexOf for each country mention.
>> The country mentions are an attempt to better gauge ambiguous names( Paris
>> Texas rather than Paris France). Because of this, I am not able to do a
>> proximity analysis thoroughly to assist in scoring. Basically I need every
>> mention of every country indicator in the doc, which I will correlate with
>> every Named Entity span to produce a score. I am also not passing the list
>> of country codes into the database query as a where predicate, which would
>> improve performance tremendously (I will index the column).
>>
>> 2. Discovery of indicators for "country context" should be regex based, in
>> order to provide a more robust ability to discover context
>>
>> Currenty I use a String.indexOf(term) to discover the country hit list.
>> Regex would allow users to configure interesting ways to indicate
>> countries. Regex will also provide the array of start/end I need for issue
>> 1 from its Matcher.find
>>
>> 3. fuzzy string matching should be part of the scoring, this would allow
>> mysql fuzzy search to return more candidate toponyms.
>>
>> Currently, the search into the MySQL gazateers is using "boolean mode" and
>> each NER result is passed in as a literal string. If I implement a fuzzy
>> string matching based score (do we have one?) the user could turn on
>> "natural language" mode in MySQL then we can generate a score and thresh
>> to
>> allow for more recall on transliterated names etc....
>> I would also like to use proximity to the majority of points in the
>> document as a disambiguation criteria as well.
>>
>> 4. provide a "solution wrapper" for the Geotagging capability
>>
>> In order to make the GeoTagging a bit more "out of the box" functional, I
>> was thinking of creating a class that one calls find(MaxentModel, doc,
>> sentencedetector, EntityLinkerProperties) to abstract the current impl. I
>> know this is not standard practice, just want to see what you all think.
>> This would make it "easier" to get this thing running.
>>
>> thanks!
>> MG
>>
>>
>

Re: request for Input or ideas.... EntityLinker tickets

Posted by Jörn Kottmann <ko...@gmail.com>.
I will have a look at it tomorrow; we are planning on using the
EntityLinker in one of our systems.

Jörn

On 10/05/2013 11:58 PM, Mark G wrote:
> All,
> Before I plug some tickets into Jira, I wanted to get some feedback from
> the team on some changes I would like to make to the EntityLinker
> GeoEntityLinkerImpl
> Below are what I consider improvement tickets
>
> 1. Only the first start and end are populated in CountryContext object when
> returned from CountryContext.find, it should return all instances of each
> country mention in a map so the proximity of other toponyms to the found
> country indicators can be included as a factor in the scoring
>
> Currently the user only gets the first indexOf for each country mention.
> The country mentions are an attempt to better gauge ambiguous names( Paris
> Texas rather than Paris France). Because of this, I am not able to do a
> proximity analysis thoroughly to assist in scoring. Basically I need every
> mention of every country indicator in the doc, which I will correlate with
> every Named Entity span to produce a score. I am also not passing the list
> of country codes into the database query as a where predicate, which would
> improve performance tremendously (I will index the column).
>
> 2. Discovery of indicators for "country context" should be regex based, in
> order to provide a more robust ability to discover context
>
> Currenty I use a String.indexOf(term) to discover the country hit list.
> Regex would allow users to configure interesting ways to indicate
> countries. Regex will also provide the array of start/end I need for issue
> 1 from its Matcher.find
>
> 3. fuzzy string matching should be part of the scoring, this would allow
> mysql fuzzy search to return more candidate toponyms.
>
> Currently, the search into the MySQL gazateers is using "boolean mode" and
> each NER result is passed in as a literal string. If I implement a fuzzy
> string matching based score (do we have one?) the user could turn on
> "natural language" mode in MySQL then we can generate a score and thresh to
> allow for more recall on transliterated names etc....
> I would also like to use proximity to the majority of points in the
> document as a disambiguation criteria as well.
>
> 4. provide a "solution wrapper" for the Geotagging capability
>
> In order to make the GeoTagging a bit more "out of the box" functional, I
> was thinking of creating a class that one calls find(MaxentModel, doc,
> sentencedetector, EntityLinkerProperties) to abstract the current impl. I
> know this is not standard practice, just want to see what you all think.
> This would make it "easier" to get this thing running.
>
> thanks!
> MG
>