Posted to users@opennlp.apache.org by giri <gi...@gmail.com> on 2012/06/06 19:38:31 UTC

openNLP with Hadoop MapReduce Programming

Hi Friends,

I want to use OpenNLP in MapReduce programming, but I haven't been able
to load the OpenNLP model.

If anyone has an idea or example code, please help me.

Regards,
Giri.

RE: openNLP with Hadoop MapReduce Programming

Posted by "Roeder, Chris" <CH...@UCDENVER.EDU>.
I've gotten some use out of putting models in JARs so that I could
use Maven to deploy them out to the cluster.

In either case, JAR files or HDFS, if the code is written to
open a java.io.File, some modification will be necessary.
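The change is usually small, because the OpenNLP model constructors take an
InputStream rather than a java.io.File. A minimal sketch (the class, model
names and paths below are only placeholders):

    // Model packaged inside the JAR, read from the classpath:
    InputStream jarIn = MyMapper.class.getResourceAsStream("/models/en-token.bin");
    TokenizerModel jarModel = new TokenizerModel(jarIn);
    jarIn.close();

    // The same model kept on HDFS, read through Hadoop's FileSystem API:
    FileSystem fs = FileSystem.get(new Configuration());
    InputStream hdfsIn = fs.open(new Path("/models/en-token.bin"));
    TokenizerModel hdfsModel = new TokenizerModel(hdfsIn);
    hdfsIn.close();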

-Chris
________________________________________
From: James Kosin [james.kosin@gmail.com]
Sent: Thursday, June 07, 2012 6:17 PM
To: users@opennlp.apache.org
Subject: Re: openNLP with Hadoop MapReduce Programming

Hadoop seems to be a large scale project; so, the work would be spread
across many servers / clients to perform the work.  The map reduce would
allow all the processes across many servers to be done and then
synchronized to provide the final results.  So, each process would have
to load its own model.  The file system using HDFS should allow sharing
of the models and large data collection between them all.

On 6/7/2012 3:45 AM, Jörn Kottmann wrote:
> On 06/07/2012 05:39 AM, James Kosin wrote:
>> Hmm, good idea.  I'll have to try that soon... I do create models for my
>> project and have them included in the JAR... but, haven't gotten around
>> to testing with them embedded in the JAR file.  I know there will be
>> issues with this and it is usually best to keep them in either windows
>> or linux file system.
>> Jorn has the start of supporting the web-server side; but, I know it is
>> far from complete... he still has this marked as a TODO for the
>> interface.  Unless I'm a bit behind now.
>
> I usually load my models from an http server, because
> they are getting updated much more frequently than
> my jars, but if you use map reduce you will need to do
> the loading yourself (very easy in java).
>
> Just including a model in a jar works great and many
> people actually do that.
>
> If you have many threads you want to share the models
> between them I am not sure how this is done in map reduce.
>
> Jörn



Re: openNLP with Hadoop MapReduce Programming

Posted by Julien Nioche <li...@gmail.com>.
That's what's done in Behemoth (https://github.com/DigitalPebble/behemoth),
e.g. for sharing GATE or UIMA resources. The code can be used as an example
of how to do this.

Julien


> I think distributed cache is a good way to do this.
> I did some similar work about stanford parser model loading in Hadoop using
> distributed cache.
> I think that will solve the problem. But we should be careful because the
> Hadoop system is normally data-intensive, and NLP handling there may cause
> high-CPU usage and problem to other jobs.
>
> Sheng
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: openNLP with Hadoop MapReduce Programming

Posted by Matt Lehman <ma...@fluentconsulting.com>.
On Jun 7, 2012, at 8:37 PM, Sheng Guo wrote:

> I think distributed cache is a good way to do this.
> I did some similar work about stanford parser model loading in Hadoop using distributed cache.
> I think that will solve the problem. But we should be careful because the Hadoop system is normally data-intensive, and NLP handling there may cause high-CPU usage and problem to other jobs.

I have used the distributed cache with Hadoop on Elastic MapReduce, which worked really well with OpenNLP models.  Additionally, with EMR, models can be stored on S3 and added to the distributed cache using their s3:// paths.
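On the driver side that is roughly the following (the bucket and file names
are made up, and the s3:// scheme assumes it is configured as a Hadoop file
system, as it is on EMR):

    JobConf conf = new JobConf(MyJob.class);
    // Ship the model stored on S3 to every task via the distributed cache.
    DistributedCache.addCacheFile(new URI("s3://my-bucket/models/en-sent.bin"), conf);
    // Tasks later pick up the local copy with DistributedCache.getLocalCacheFiles(conf).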


RE: openNLP with Hadoop MapReduce Programming

Posted by Sheng Guo <en...@hotmail.com>.
Hi Carlos,

Sorry, I wrote that one and a half years ago and it was in a company codebase,
but the basic procedure is simple.
First you upload your model to HDFS. Then, before you run the job, you do this:
DistributedCache.addCacheFile(new URI(YourModelFilePath), jobConf);
Then, in the configure method of your mapper, you write something like this:

    public void configure(JobConf job) {
        super.configure(job);

        try {
            // Files added with DistributedCache.addCacheFile() show up here as local paths.
            Path[] localFiles = DistributedCache.getLocalCacheFiles(job);

            if (localFiles != null) {
                String metadataFileName = "";

                // Find the cached file whose name matches the model path set in the job conf.
                for (int i = 0; i < localFiles.length; i++) {
                    String strFileName = localFiles[i].toString();
                    if (strFileName.contains(job.get("modelPath"))) {
                        metadataFileName = strFileName;
                        break;
                    }
                }

                // Load the model from the local copy (readHashSet is our own helper).
                if (metadataFileName.length() > 0) {
                    json_model = new JSONObject();
                    readHashSet("file:///" + metadataFileName, json_model);
                }
                System.out.println("*********" + metadataFileName);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

json_model should be a static variable inside your mapper; be aware that this is just converted from some pseudocode.
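For an OpenNLP model the same lookup applies; once you have the local path
from the cache, loading it is straightforward (a rough sketch using the
sentence model as an example; the JSON/readHashSet part above is specific to
my own code):

    // metadataFileName is the local path resolved from the distributed cache above.
    InputStream modelIn = new FileInputStream(metadataFileName);
    SentenceModel sentenceModel = new SentenceModel(modelIn);
    modelIn.close();
    SentenceDetectorME sentenceDetector = new SentenceDetectorME(sentenceModel);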

Sheng

> Date: Tue, 12 Jun 2012 17:52:34 -0600
> Subject: Re: openNLP with Hadoop MapReduce Programming
> From: nando.nlp@gmail.com
> To: users@opennlp.apache.org
> 
> Sheng,
> 
> Can you provide an example?
> 
> Thanks,
> 
> Carlos.
> 
> On Thu, Jun 7, 2012 at 6:37 PM, Sheng Guo <en...@hotmail.com> wrote:
> 
> >
> > I think distributed cache is a good way to do this.
> > I did some similar work about stanford parser model loading in Hadoop
> > using distributed cache.
> > I think that will solve the problem. But we should be careful because the
> > Hadoop system is normally data-intensive, and NLP handling there may cause
> > high-CPU usage and problem to other jobs.
> >
> > Sheng
> >

Re: openNLP with Hadoop MapReduce Programming

Posted by Carlos Scheidecker <na...@gmail.com>.
Sheng,

Can you provide an example?

Thanks,

Carlos.

On Thu, Jun 7, 2012 at 6:37 PM, Sheng Guo <en...@hotmail.com> wrote:

>
> I think distributed cache is a good way to do this.
> I did some similar work about stanford parser model loading in Hadoop
> using distributed cache.
> I think that will solve the problem. But we should be careful because the
> Hadoop system is normally data-intensive, and NLP handling there may cause
> high-CPU usage and problem to other jobs.
>
> Sheng
>

RE: openNLP with Hadoop MapReduce Programming

Posted by Sheng Guo <en...@hotmail.com>.
I think the distributed cache is a good way to do this.
I did some similar work on loading Stanford parser models in Hadoop using the distributed cache, and I think that will solve the problem.
But we should be careful: Hadoop jobs are normally data-intensive, and NLP processing there may cause high CPU usage and problems for other jobs.

Sheng

> Date: Thu, 7 Jun 2012 20:17:26 -0400
> From: james.kosin@gmail.com
> To: users@opennlp.apache.org
> Subject: Re: openNLP with Hadoop MapReduce Programming
> 
> Hadoop seems to be a large scale project; so, the work would be spread
> across many servers / clients to perform the work.  The map reduce would
> allow all the processes across many servers to be done and then
> synchronized to provide the final results.  So, each process would have
> to load its own model.  The file system using HDFS should allow sharing
> of the models and large data collection between them all.
> 

Re: openNLP with Hadoop MapReduce Programming

Posted by James Kosin <ja...@gmail.com>.
Hadoop is a large-scale framework, so the work gets spread across many
servers/clients.  MapReduce allows the processing to run on many servers
and then be synchronized to produce the final results, so each process
has to load its own model.  HDFS should allow the models and the large
data collection to be shared between them all.
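In practice that just means loading the model once per task rather than once
per record, e.g. lazily in the mapper's configure() method; a sketch assuming
the old mapred API (mapper extends MapReduceBase), a hypothetical name-finder
model and a "modelPath" job property:

    private static TokenNameFinderModel nameFinderModel;

    public void configure(JobConf job) {
        super.configure(job);
        if (nameFinderModel == null) {   // load once per task JVM, reuse for every record
            try {
                FileSystem fs = FileSystem.get(job);
                InputStream in = fs.open(new Path(job.get("modelPath")));
                nameFinderModel = new TokenNameFinderModel(in);
                in.close();
            } catch (IOException e) {
                throw new RuntimeException("Could not load OpenNLP model", e);
            }
        }
    }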

On 6/7/2012 3:45 AM, Jörn Kottmann wrote:
> On 06/07/2012 05:39 AM, James Kosin wrote:
>> Hmm, good idea.  I'll have to try that soon... I do create models for my
>> project and have them included in the JAR... but, haven't gotten around
>> to testing with them embedded in the JAR file.  I know there will be
>> issues with this and it is usually best to keep them in either windows
>> or linux file system.
>> Jorn has the start of supporting the web-server side; but, I know it is
>> far from complete... he still has this marked as a TODO for the
>> interface.  Unless I'm a bit behind now.
>
> I usually load my models from an http server, because
> they are getting updated much more frequently than
> my jars, but if you use map reduce you will need to do
> the loading yourself (very easy in java).
>
> Just including a model in a jar works great and many
> people actually do that.
>
> If you have many threads you want to share the models
> between them I am not sure how this is done in map reduce.
>
> Jörn



Re: openNLP with Hadoop MapReduce Programming

Posted by Jörn Kottmann <ko...@gmail.com>.
On 06/07/2012 05:39 AM, James Kosin wrote:
> Hmm, good idea.  I'll have to try that soon... I do create models for my
> project and have them included in the JAR... but, haven't gotten around
> to testing with them embedded in the JAR file.  I know there will be
> issues with this and it is usually best to keep them in either windows
> or linux file system.
> Jorn has the start of supporting the web-server side; but, I know it is
> far from complete... he still has this marked as a TODO for the
> interface.  Unless I'm a bit behind now.

I usually load my models from an HTTP server, because
they are updated much more frequently than my jars,
but if you use MapReduce you will need to do
the loading yourself (very easy in Java).
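For example (a minimal sketch; the URL is made up):

    InputStream in = new URL("http://models.example.com/en-sent.bin").openStream();
    SentenceModel model = new SentenceModel(in);
    in.close();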

Just including a model in a jar works great, and many
people actually do that.

If you have many threads and want to share the models
between them, I am not sure how this is done in MapReduce.

Jörn

Re: openNLP with Hadoop MapReduce Programming

Posted by James Kosin <ja...@gmail.com>.
William,

Hmm, good idea.  I'll have to try that soon... I do create models for my
project and have them included in the JAR, but I haven't gotten around
to testing with them embedded in the JAR file.  I know there will be
issues with this, and it is usually best to keep them in either the
Windows or Linux file system.
Jörn has the start of supporting the web-server side, but I know it is
far from complete... he still has this marked as a TODO for the
interface.  Unless I'm a bit behind now.

Still, what errors are you getting?  It may be as simple as writing an
InputStream wrapper for the Hadoop file system to get this working.
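It may even turn out that no wrapper is needed, since Hadoop's
FSDataInputStream already is a java.io.InputStream (untested sketch, the
path is made up):

    FileSystem fs = FileSystem.get(new Configuration());
    InputStream modelIn = fs.open(new Path("hdfs:///models/sentence.model"));
    SentenceModel model = new SentenceModel(modelIn);
    modelIn.close();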

James



On 6/6/2012 10:55 PM, William Colen wrote:
> I am not sure, I never tried Hadoop, but maybe your issue is that Java
> can't access Hadoop file system, isn't it?
> Maybe you should simple add the models to a jar so the models are in the
> classpath and read it as a resource:
>
> InputStream modelIn = this.getClass().getResourceAsStream("sentence.model");
> SentenceModel model = new SentenceModel(modelIn);
> SentenceDetectorME sentenceDetector = new SentenceDetectorME(model);
> modelIn.close();
>
>
> On Wed, Jun 6, 2012 at 11:36 PM, giri <gi...@gmail.com> wrote:
>
>> i'm not able to load ModelIN either from HDFS or local file system.
>>
>> On Thu, Jun 7, 2012 at 5:04 AM, James Kosin <ja...@gmail.com> wrote:
>>
>>> What problems are you having?
>>>
>>> James
>>>
>>> On 6/6/2012 1:38 PM, giri wrote:
>>>> Hi Friends,
>>>>
>>>> I want to use OPenNLP in Mapreduce programming, but i couldn't able to
>>> load
>>>> the OpenNLP model.
>>>>
>>>> if any one have the idea or code please help me.
>>>>
>>>> Regards,
>>>> Giri.
>>>>
>>>
>>>



Re: openNLP with Hadoop MapReduce Programming

Posted by William Colen <wi...@gmail.com>.
I am not sure, I have never tried Hadoop, but maybe your issue is that Java
can't access the Hadoop file system, isn't it?
Maybe you should simply add the models to a jar so the models are on the
classpath and read them as a resource:

InputStream modelIn = this.getClass().getResourceAsStream("sentence.model");
SentenceModel model = new SentenceModel(modelIn);
SentenceDetectorME sentenceDetector = new SentenceDetectorME(model);
modelIn.close();


On Wed, Jun 6, 2012 at 11:36 PM, giri <gi...@gmail.com> wrote:

> i'm not able to load ModelIN either from HDFS or local file system.
>
> On Thu, Jun 7, 2012 at 5:04 AM, James Kosin <ja...@gmail.com> wrote:
>
> > What problems are you having?
> >
> > James
> >
> > On 6/6/2012 1:38 PM, giri wrote:
> > > Hi Friends,
> > >
> > > I want to use OPenNLP in Mapreduce programming, but i couldn't able to
> > load
> > > the OpenNLP model.
> > >
> > > if any one have the idea or code please help me.
> > >
> > > Regards,
> > > Giri.
> > >
> >
> >
> >
>

Re: openNLP with Hadoop MapReduce Programming

Posted by giri <gi...@gmail.com>.
I'm not able to load modelIn either from HDFS or from the local file system.

On Thu, Jun 7, 2012 at 5:04 AM, James Kosin <ja...@gmail.com> wrote:

> What problems are you having?
>
> James
>
> On 6/6/2012 1:38 PM, giri wrote:
> > Hi Friends,
> >
> > I want to use OPenNLP in Mapreduce programming, but i couldn't able to
> load
> > the OpenNLP model.
> >
> > if any one have the idea or code please help me.
> >
> > Regards,
> > Giri.
> >
>
>
>

Re: openNLP with Hadoop MapReduce Programming

Posted by James Kosin <ja...@gmail.com>.
What problems are you having?

James

On 6/6/2012 1:38 PM, giri wrote:
> Hi Friends,
>
> I want to use OPenNLP in Mapreduce programming, but i couldn't able to load
> the OpenNLP model.
>
> if any one have the idea or code please help me.
>
> Regards,
> Giri.
>