Posted to user@nutch.apache.org by chethan <ch...@gmail.com> on 2014/05/04 07:52:15 UTC

Nutch + GATE on Amazon EMR

I have set up Nutch to crawl on Amazon EMR, and I have a plugin that uses
GATE <https://gate.ac.uk/> for text processing in the indexing filters. GATE
requires certain static resources (some XML and text files) to be loaded
before it can be initialized. I tried to bundle these resources in the job
jar and load them from the classpath, but that didn't work. I also tried
copying them to HDFS and loading them from there, but that failed too.

What is the best way to bundle such static resources and reference them in
the Indexing filters? I am working on copying the file to the distributed
cache and loading it from there but I wanted to know how others are
handling this. Thanks.

Regards,

--
Chethan Prasad

Re: Nutch + GATE on Amazon EMR

Posted by feng lu <am...@gmail.com>.

Maybe you can build these resources into the plugin's jar, as the
"language-identifier" plugin does, and load them at run time.

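As a rough illustration of the suggestion above, here is a minimal sketch,
assuming the resources are packaged inside the plugin's own jar and that GATE
needs them as ordinary files on local disk. The class name PluginResources and
the method extract are hypothetical; they are not part of Nutch or GATE. The
lookup goes through the class loader of a class that lives in the plugin. As
far as I know, Nutch loads each plugin through its own class loader, so a
lookup through the system class loader may not see files that were bundled
with the plugin, which could be one reason a plain classpath lookup fails.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: copies bundled classpath resources to local disk. */
final class PluginResources {

    private PluginResources() {}

    /**
     * Copies each named resource, found via the class loader of 'owner',
     * into a fresh temporary directory and returns that directory.
     * The relative layout of the resource names is preserved.
     */
    static Path extract(Class<?> owner, String... resourceNames) throws IOException {
        Path dir = Files.createTempDirectory("gate-resources");
        for (String name : resourceNames) {
            // Use the owner's class loader, not the system one (assumption:
            // 'owner' is a class packaged in the same jar as the resources).
            try (InputStream in = owner.getClassLoader().getResourceAsStream(name)) {
                if (in == null) {
                    throw new IOException("Resource not found on plugin classpath: " + name);
                }
                Path target = dir.resolve(name).normalize();
                if (!target.startsWith(dir)) {
                    throw new IOException("Resource name escapes target directory: " + name);
                }
                Files.createDirectories(target.getParent());
                Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            }
        }
        return dir;
    }
}
```

An indexing filter could call something like
PluginResources.extract(MyIndexingFilter.class, "gate/plugins.xml") once during
setup and point GATE at the returned directory. Both names in that call are
placeholders, not real files or classes.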

-- 
Don't Grow Old, Grow Up... :-)

Re: Nutch + GATE on Amazon EMR

Posted by feng lu <am...@gmail.com>.

If you are running Nutch on a Hadoop cluster, the logs are written separately
for each mapper and reducer of each phase, so look in the task logs.


On Mon, May 5, 2014 at 7:33 PM, chethan <ch...@gmail.com> wrote:

> Also, I'm not able to see any logs generated by the plugin or Nutch base
> classes. There are lots of Hadoop logs, but none from Nutch. Any idea what
> could be the case?

Re: Nutch + GATE on Amazon EMR

Posted by chethan <ch...@gmail.com>.

Also, I'm not able to see any logs generated by the plugin or Nutch base
classes. There are lots of Hadoop logs, but none from Nutch. Any idea what
could be the case?

Regards,

--
Chethan Prasad


On Mon, May 5, 2014 at 12:14 PM, chethan <ch...@gmail.com> wrote:

> Thanks Feng and Julien for your replies. I will take a look at both
> options and update what worked.

Re: Nutch + GATE on Amazon EMR

Posted by chethan <ch...@gmail.com>.

Thanks Feng and Julien for your replies. I will take a look at both options
and update what worked.

Regards,

--
Chethan Prasad


On Mon, May 5, 2014 at 12:10 AM, Julien Nioche <
lists.digitalpebble@gmail.com> wrote:

> Chethan
>
> Have a look at Behemoth [https://github.com/DigitalPebble/behemoth] if you
> haven't already done so. Porting the code from the GATE module into an
> IndexingFilter should not be too difficult. What we do there is that the
> GATE pipeline is stored on HDFS and loaded by the slaves via the
> distributed cache.
>
> Alternatively you could use the Nutch just for crawling then use the Nutch
> and GATE modules of Behemoth as well as the SOLR or ElasticSearch ones if
> that's what you want to do.
>
> HTH
>
> Julien

Re: Nutch + GATE on Amazon EMR

Posted by Julien Nioche <li...@gmail.com>.

Chethan

Have a look at Behemoth [https://github.com/DigitalPebble/behemoth] if you
haven't already done so. Porting the code from the GATE module into an
IndexingFilter should not be too difficult. What we do there is that the
GATE pipeline is stored on HDFS and loaded by the slaves via the
distributed cache.

Alternatively, you could use Nutch just for crawling, then use the Nutch
and GATE modules of Behemoth, as well as the SOLR or ElasticSearch ones if
that's what you want to do.

HTH

Julien

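The distributed-cache approach described above can be sketched on the task
side roughly as follows. This is only an illustration under stated
assumptions, not Behemoth's actual code: it assumes the GATE pipeline was
zipped, put on HDFS, and registered as a cache archive when the job was set
up, and that the unpacked directory keeps the archive's file name. The Hadoop
calls are left out so that the block runs with the JDK alone. In Hadoop
releases of that era the DistributedCache class offers addCacheArchive and
getLocalCacheArchives for those two steps, but check the API of the Hadoop
version on your cluster. The names CachedPipelineLocator, pipeline.zip and
application.xgapp are hypothetical.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper: finds a localized GATE pipeline among cache entries. */
final class CachedPipelineLocator {

    private CachedPipelineLocator() {}

    /**
     * Returns the first localized path whose last name element equals
     * archiveName, or an empty Optional if none matches. In a real task the
     * list would come from the distributed cache (assumption, see above).
     */
    static Optional<Path> findLocalized(List<Path> localizedPaths, String archiveName) {
        for (Path p : localizedPaths) {
            Path fileName = p.getFileName();
            if (fileName != null && fileName.toString().equals(archiveName)) {
                return Optional.of(p);
            }
        }
        return Optional.empty();
    }

    /**
     * Resolves the pipeline descriptor inside the unpacked archive and fails
     * early, with a clear message, if it is not a regular file.
     */
    static Path descriptor(Path unpackedArchive, String descriptorName) throws IOException {
        Path candidate = unpackedArchive.resolve(descriptorName);
        if (!Files.isRegularFile(candidate)) {
            throw new FileNotFoundException("No " + descriptorName + " under " + unpackedArchive);
        }
        return candidate;
    }
}
```

Failing early on a missing descriptor makes the cause visible in the task log
of the mapper or reducer that hit it, instead of surfacing later as an opaque
GATE initialization error.
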

-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble