Posted to dev@opennlp.apache.org by Mark G <ma...@apache.org> on 2013/11/03 02:22:41 UTC

Re: request for Input or ideas.... EntityLinker tickets

I finished the Lucene indexing of the gazetteers; I just need to tie the
indexes into the gaz lookups, which is fairly simple. Do you all think I
should drop the MySQL dependency entirely and just use Lucene? The Lucene
index files are only about 2.5 GB total, so it is very manageable to
distribute them across a cluster. I could keep the MySQL classes as an option,
but at this point the Lucene-based approach is really growing on me.
If I don't hear from anyone, I am going to remove the MySQL implementation.
Thanks
MG


On Wed, Oct 30, 2013 at 7:34 PM, Lance Norskog <go...@gmail.com> wrote:

> Just to elaborate: RAMDirectory storage lives on the Java heap, under garbage
> collection, which makes the Java GC work very, very hard. A memory-mapped file
> is a write-through cache for file contents. The memory in the cache is outside
> of Java garbage collection. A memory-mapped index will take a little less time
> to create at these volumes. Loading a pre-built memory-mapped index will take
> under 5 seconds.
>
>
> On 10/29/2013 03:43 PM, Mark G wrote:
>
>> Thanks, that was my next option with Lucene: build the indexes from the gaz
>> files and keep them up to date in one place, and make sure something like
>> Puppet distributes them to each node in a cluster on some interval;
>> then each task (MapReduce or whatever) can use that file resource. I'll
>> let everyone know how it goes.
>> MG
>>
>>
>> On Tue, Oct 29, 2013 at 6:06 PM, Lance Norskog <go...@gmail.com> wrote:
>>
>>> This is what memory-mapped file indexes are for! RAMDirectory is for very
>>> small projects.
>>>
>>>
>>> On 10/29/2013 04:00 AM, Mark G wrote:
>>>
>>>> FYI, I implemented an in-memory Lucene index of the NGA Geonames. It took
>>>> almost 7 GB of RAM and about 40 minutes to load.
>>>> Still looking at other DBs/indexes. So one would need at least 10 GB of RAM
>>>> to hold the USGS and NGA gazetteers.
>>>>
>>>>
>>>> On Fri, Oct 25, 2013 at 6:21 AM, Mark G <gi...@gmail.com> wrote:
>>>>
>>>>> I wrote a quick Lucene RAMDirectory in-memory index. It looks like a valid
>>>>> option to hold the gazetteers, and it provides good text search, of course.
>>>>> The idea is that at runtime the GeoEntityLinker would pull three files off
>>>>> disk (the NGA Geonames file, the USGS file, and the CountryContext indicator
>>>>> file) and index them in memory with Lucene. Initially this will take a while.
>>>>> So, deployment-wise, you would have to use your tool of choice (e.g. Puppet)
>>>>> to distribute the files to each node, or mount a share on each node. My
>>>>> concern with this approach is that each MR task runs in its own JVM, so
>>>>> each task on each node will consume this much memory unless you do
>>>>> something interesting with memory mapping. The EntityLinkerProperties file
>>>>> will support the config of the file locations and whether to use DB or
>>>>> in-memory Lucene...
>>>>>
>>>>> I am also working on a Postgres version of the gazetteer structures and
>>>>> stored procs.
>>>>>
>>>>> Thoughts?
>>>>>
>>>>>
>>>>> On Wed, Oct 23, 2013 at 8:46 AM, Jörn Kottmann <ko...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> On 10/23/2013 01:14 PM, Mark G wrote:
>>>>>>
>>>>>>> All that being said, it is totally possible to run an in-memory version
>>>>>>> of the gazetteer. Personally, I like the DB approach; it provides a lot
>>>>>>> of flexibility and power.
>>>>>>
>>>>>> Yes, and you can even use a DB that runs in memory, which works with the
>>>>>> current implementation.
>>>>>> I think I will experiment with that.
>>>>>>
>>>>>> I don't really mind using 3 GB of memory for it, since my Hadoop servers
>>>>>> have more than enough anyway,
>>>>>> and it makes the deployment easier (no need to deal with installing MySQL
>>>>>> databases and keeping them in sync).
>>>>>>
>>>>>> Jörn
>>>>>>
>>>>>>
>>>>>>
>
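
For illustration, Lance's point in the quoted thread above (memory-mapped contents live outside the Java heap) can be sketched with plain java.nio. This is a minimal sketch, not the GeoEntityLinker code: the class name MappedFileSketch, the helper firstLine, and the sample file contents are hypothetical. Lucene ships its own MMapDirectory for real indexes; note that a single FileChannel.map call is limited to Integer.MAX_VALUE bytes, so a multi-gigabyte file would need several mappings.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration class; not part of OpenNLP or Lucene.
class MappedFileSketch {

    /** Maps a whole file read-only. The mapped bytes are not stored on the Java heap. */
    static MappedByteBuffer mapReadOnly(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping remains valid after the channel is closed.
            // A single mapping cannot exceed Integer.MAX_VALUE bytes.
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }

    /** Returns the bytes before the first newline, decoded as UTF-8. */
    static String firstLine(ByteBuffer buffer) {
        ByteBuffer view = buffer.duplicate(); // independent position, same content
        view.position(0);
        int end = 0;
        while (end < view.limit() && view.get(end) != '\n') {
            end++;
        }
        byte[] bytes = new byte[end];
        view.get(bytes); // relative read starting at position 0
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample data standing in for a gazetteer file.
        Path file = Files.createTempFile("gaz-sample", ".txt");
        file.toFile().deleteOnExit();
        Files.write(file, "Berlin\tDE\nParis\tFR\n".getBytes(StandardCharsets.UTF_8));

        MappedByteBuffer buffer = mapReadOnly(file);
        System.out.println("direct (off-heap) buffer: " + buffer.isDirect());
        System.out.println("first line: " + firstLine(buffer));
    }
}
```

Because the operating system backs the mapping with its page cache, several JVMs on one node that map the same file can share those pages instead of each holding a private heap copy.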

Re: request for Input or ideas.... EntityLinker tickets

Posted by Mark G <ma...@apache.org>.

sounds good, I will get it in the sandbox...



On Tue, Nov 5, 2013 at 8:43 AM, Jörn Kottmann <ko...@gmail.com> wrote:

> On 11/05/2013 01:23 PM, Mark G wrote:
>
>> Joern, I am using Lucene inside the GeoEntityLinker impl, so do you think I
>> should move the entire GeoEntityLinker impl and all its classes to a new
>> module in the sandbox and leave only the entitylinker framework in opennlp
>> tools?
>>
>
> +1, it might be nice to provide a simple dictionary-based implementation
> as a sample for simple tasks, e.g. just mapping country names.
> If the user wants to use the Lucene-based solution, he should just depend
> on the addon module, which we should release together with the core
> components.
>
>
>> A second thought/option is to make the Lucene pom entries optional in the
>> opennlp-tools pom, so users will have to add Lucene to their pom to run the
>> GeoEntityLinker and the jars will not be included in the tools build.
>>
>
> I really prefer the other solution, because then a user has to deal with it
> explicitly, once, to make his project work. If you do this, people will
> probably start using the classes and then discover only by trial and error
> that something must be missing from their classpath.
>
> The lemmatizer will also be part of the addon solution. If there are no
> concerns, I suggest we get our addons started.
>
> Jörn
>

Re: request for Input or ideas.... EntityLinker tickets

Posted by Jörn Kottmann <ko...@gmail.com>.

On 11/05/2013 01:23 PM, Mark G wrote:
> Joern, I am using Lucene inside the GeoEntityLinker impl, so do you think I
> should move the entire GeoEntityLinker impl and all its classes to a new
> module in the sandbox and leave only the entitylinker framework in opennlp
> tools?

+1, it might be nice to provide a simple dictionary-based implementation
as a sample for simple tasks, e.g. just mapping country names.
If the user wants to use the Lucene-based solution, he should just
depend on the addon module, which we should release together with the
core components.
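
For illustration, a dictionary-based country-name mapper of the kind suggested here could be as small as the sketch below. It is plain Java and deliberately does not implement the EntityLinker interface; the class name CountryNameDictionary, its methods, and the sample entries are all hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical sample class; not part of the OpenNLP API.
class CountryNameDictionary {

    // Normalized country name -> country code.
    private final Map<String, String> codesByName = new HashMap<>();

    /** Registers a name (or alias) for a country code. */
    void add(String name, String countryCode) {
        codesByName.put(normalize(name), countryCode);
    }

    /** Looks up a mention; empty if the name is not in the dictionary. */
    Optional<String> lookup(String mention) {
        return Optional.ofNullable(codesByName.get(normalize(mention)));
    }

    // Case-insensitive, whitespace-tolerant matching.
    private static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        CountryNameDictionary dict = new CountryNameDictionary();
        dict.add("Germany", "DE");
        dict.add("Deutschland", "DE"); // alias for the same country
        dict.add("France", "FR");

        System.out.println(dict.lookup(" germany ")); // Optional[DE]
        System.out.println(dict.lookup("Atlantis"));  // Optional.empty
    }
}
```

Exact-match lookup like this needs no external dependency, which is what would let such a sample stay in the core module while fuzzy text search lives in the Lucene-based addon.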

> A second thought/option is to make the Lucene pom entries optional in the
> opennlp-tools pom, so users will have to add Lucene to their pom to run the
> GeoEntityLinker and the jars will not be included in the tools build.

I really prefer the other solution, because then a user has to deal with
it explicitly, once, to make his project work. If you do this, people
will probably start using the classes and then discover only by trial
and error that something must be missing from their classpath.
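
For illustration, with the optional-dependency route a caller would presumably need a guard along these lines to avoid a late NoClassDefFoundError. This is only a sketch: the class OptionalDependencyCheck and its methods are hypothetical, and using org.apache.lucene.index.IndexReader as the probe class is an assumption.

```java
// Hypothetical helper; not part of OpenNLP.
class OptionalDependencyCheck {

    /** Returns true if the named class can be found, without initializing it. */
    static boolean isClassPresent(String className) {
        try {
            Class.forName(className, false, OptionalDependencyCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Fails fast with a readable message when Lucene is not on the classpath. */
    static void requireLucene() {
        // Assumption: IndexReader is a reasonable class to probe for.
        if (!isClassPresent("org.apache.lucene.index.IndexReader")) {
            throw new IllegalStateException(
                "Lucene is not on the classpath; add it to your pom to use the Lucene-based linker.");
        }
    }

    public static void main(String[] args) {
        System.out.println("Lucene present: "
            + isClassPresent("org.apache.lucene.index.IndexReader"));
    }
}
```

A separate addon module makes this kind of check unnecessary, since the build tool pulls in Lucene transitively once the user depends on the module.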

The lemmatizer will also be part of the addon solution. If there are no
concerns, I suggest we get our addons started.

Jörn

Re: request for Input or ideas.... EntityLinker tickets

Posted by Mark G <ma...@apache.org>.

Joern, I am using Lucene inside the GeoEntityLinker impl, so do you think I
should move the entire GeoEntityLinker impl and all its classes to a new
module in the sandbox and leave only the entitylinker framework in opennlp
tools?
A second thought/option is to make the Lucene pom entries optional in the
opennlp-tools pom, so users will have to add Lucene to their pom to run the
GeoEntityLinker and the jars will not be included in the tools build.

On Tue, Nov 5, 2013 at 3:39 AM, Jörn Kottmann <ko...@gmail.com> wrote:

> On 11/03/2013 02:22 AM, Mark G wrote:
>
>> I finished the Lucene indexing of the gazetteers; I just need to tie the
>> indexes into the gaz lookups, which is fairly simple. Do you all think I
>> should drop the MySQL dependency entirely and just use Lucene? The Lucene
>> index files are only about 2.5 GB total, so it is very manageable to
>> distribute them across a cluster. I could keep the MySQL classes as an option,
>> but at this point the Lucene-based approach is really growing on me.
>> If I don't hear from anyone, I am going to remove the MySQL implementation.
>>
>
> +1 I believe a Lucene-based solution is easier to handle for most people,
> because it can be fully integrated via API (no need to install anything) and
> therefore hides most of the complexity.
>
> Please avoid adding a dependency on Lucene to the opennlp-tools project.
> I suggest that we add this code to the sandbox, or a new addon area. If people
> want to use a Lucene-based dictionary, they can depend on that module
> explicitly.
>
> Jörn
>

Re: request for Input or ideas.... EntityLinker tickets

Posted by Jörn Kottmann <ko...@gmail.com>.

On 11/03/2013 02:22 AM, Mark G wrote:
> I finished the Lucene indexing of the gazetteers; I just need to tie the
> indexes into the gaz lookups, which is fairly simple. Do you all think I
> should drop the MySQL dependency entirely and just use Lucene? The Lucene
> index files are only about 2.5 GB total, so it is very manageable to
> distribute them across a cluster. I could keep the MySQL classes as an option,
> but at this point the Lucene-based approach is really growing on me.
> If I don't hear from anyone, I am going to remove the MySQL implementation.

+1 I believe a Lucene-based solution is easier to handle for most
people, because it can be fully integrated via API (no need to install
anything) and therefore hides most of the complexity.

Please avoid adding a dependency on Lucene to the opennlp-tools
project. I suggest that we add this code to the sandbox, or a new addon
area. If people want to use a Lucene-based dictionary, they can depend
on that module explicitly.

Jörn