Posted to user@nutch.apache.org by Roger Marin <rs...@gmail.com> on 2010/08/06 21:01:24 UTC

Embed the Crawl API in my application

Hello,

I am new to Nutch and I have a requirement to embed the crawler into my
application. However, I have been running into some issues that I hope
someone can help me with.

First of all, I understand that Nutch requires a Unix-like environment to
run, but what can I do if I need to embed Nutch in an app that can run on
both Windows and Linux without the guarantee that Cygwin is installed?

Basically I need to create a crawler class that just uses the Nutch crawl
API underneath. I used the Crawl class as a starting point. So far I have
been running into some issues trying to get it to work, mostly because of
code that tries to run "chmod" or some other Unix command on a Windows
machine without Cygwin. Is there a way to bypass this?
I have been trying to single out some of these classes, write my own
subclasses of them, and just swallow the exceptions when running on
Windows, but this is very ugly and I'm not sure it will work at all, so I
need to figure out if there's a better way to embed the Nutch crawler API.

The other thing I need to figure out is whether it's possible to
programmatically set some of the parameters needed to use the crawler. For
instance, I need to programmatically set the URLs instead of having a URL
file or a crawl-urlfilter file, as well as the properties in nutch-site.xml,
because these are configured dynamically by the application and are
relative to the application itself, so I cannot hardcode them.

Any help you can give me or documentation you can point me to will be
greatly appreciated.


Thank you.

Re: Message queueing system (in nutch-1.0) ?

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Patricio,

There were a bunch of Fetcher changes in Nutch 1.1 and beyond, so it's quite possible the patch is out of date and needs to be brought up to date with the release you want to apply it to. You could try Nutch 1.1 [1] and see if it applies there, or you may have to look at the patch, see what it's doing, and then apply those changes by hand.

If you do that, we'd happily accept your updated patch.

Cheers,
Chris

[1] http://svn.apache.org/repos/asf/nutch/tags/release-1.1


On 8/7/10 6:30 PM, "Patricio Galeas" <pg...@yahoo.de> wrote:

Hello,

I'm trying to apply the PATCH from NUTCH-368 (in nutch-1.0), but I get the
following response:


patch -p0 < Fetcher-ctrl.patch
patching file src/java/org/apache/nutch/fetcher/Fetcher.java
Hunk #1 FAILED at 17.
Hunk #2 FAILED at 36.
Hunk #3 FAILED at 77.
Hunk #4 FAILED at 85.
Hunk #5 FAILED at 110.
Hunk #6 FAILED at 338.
Hunk #7 FAILED at 346.
Hunk #8 FAILED at 364.
Hunk #9 FAILED at 389.
Hunk #10 succeeded at 781 with fuzz 2 (offset 370 lines).
9 out of 10 hunks FAILED -- saving rejects to file
src/java/org/apache/nutch/fetcher/Fetcher.java.rej

Does anyone have any idea?

Thanks
pgaleas





++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Message queueing system (in nutch-1.0) ?

Posted by Patricio Galeas <pg...@yahoo.de>.
Hello,

I'm trying to apply the PATCH from NUTCH-368 (in nutch-1.0), but I get the 
following response:


patch -p0 < Fetcher-ctrl.patch
patching file src/java/org/apache/nutch/fetcher/Fetcher.java
Hunk #1 FAILED at 17.
Hunk #2 FAILED at 36.
Hunk #3 FAILED at 77.
Hunk #4 FAILED at 85.
Hunk #5 FAILED at 110.
Hunk #6 FAILED at 338.
Hunk #7 FAILED at 346.
Hunk #8 FAILED at 364.
Hunk #9 FAILED at 389.
Hunk #10 succeeded at 781 with fuzz 2 (offset 370 lines).
9 out of 10 hunks FAILED -- saving rejects to file 
src/java/org/apache/nutch/fetcher/Fetcher.java.rej

Does anyone have any idea?

Thanks
pgaleas



Re: Embed the Crawl API in my application

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-08-06 21:01, Roger Marin wrote:
> Hello,
>
> I am new to nutch and I have a requirement to embed the crawler into my
> application, however I have been running into some issues that I hope
> someone can help me with.
>
> First of all, I understand that nutch requires a unix like environment to
> run, but what can I do if I need to embed nutch in an app that can run in
> both windows and linux without
> the guarantee that cygwin might be installed?.

This is currently difficult. The dependency on POSIX utilities can be
cut out of Hadoop, but not easily: the Java API doesn't give access to
some of the information that Hadoop needs. At one time I used AspectJ to
replace calls to these utilities with Java classes that return real data
(where it can be obtained, e.g. using the Java 1.6 File API) or fake
data. While this worked for my application, I wouldn't recommend it in
the general case.
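
As a rough illustration of the "Java 1.6 File API" route mentioned above, the sketch below sets owner permissions through java.io.File instead of shelling out to "chmod". The class and method names (PortablePermissions, setOwnerPermissions) are hypothetical and are not part of Hadoop or Nutch, and the sketch does not show how such a class would be wired into Hadoop's own calls.

```java
import java.io.File;
import java.io.IOException;

public class PortablePermissions {

    /**
     * Sets the owner's read/write/execute flags using java.io.File
     * (available since Java 1.6) rather than running "chmod".
     * Only the owner's bits are changed; group/other bits are left as-is.
     * Returns true only if every requested change was accepted.
     */
    public static boolean setOwnerPermissions(File f, boolean read,
                                              boolean write, boolean exec) {
        // The second argument "true" restricts each change to the owner.
        boolean ok = f.setReadable(read, true);
        ok &= f.setWritable(write, true);
        ok &= f.setExecutable(exec, true);
        return ok;
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("perm-demo", ".txt");
        tmp.deleteOnExit();
        boolean applied = setOwnerPermissions(tmp, true, true, false);
        System.out.println("applied=" + applied
                + " canRead=" + tmp.canRead()
                + " canWrite=" + tmp.canWrite());
    }
}
```

Each File setter returns false when the underlying platform cannot make the change, so the combined result may be false on some file systems even though the file remains usable. This also covers only part of what "chmod" does, which is consistent with the caveat above about returning fake data where real data is not available.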

Another option is to include a small subset of cygwin utils and libs 
that are needed by Hadoop, and provide a "private" cygwin install with 
your application.


> The other stuff I need to figure out is if it's possible to programmatically
> set some of the parameters needed to use the crawler, for instance I need to
> programmatically set the values of the urls instead of having a url file, or
> a crawl-urlfilter file as well as the properties in the nutch-site.xml,
> because these can be configured dynamically by the application and are
> relative to the application itself
> so I cannot hardcode these properties.

Most, if not all, Nutch tools implement the Tool interface, which means
you can execute them through run(String[] args). Most of them also
provide specialized methods that accept typed arguments.

Also, tools are configured with an instance of Configuration - before 
you execute run() you can tweak this Configuration object to your liking 
by setting properties programmatically.
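
Setting properties directly on the Configuration object is the approach described above. Where an application would rather generate a file and add it as a configuration resource, the same name/value pairs can be rendered in the <configuration>/<property> layout that nutch-site.xml uses. The following is only a sketch: the class name SiteXmlBuilder and the agent name value are hypothetical, and it uses no Hadoop or Nutch classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SiteXmlBuilder {

    /** Escapes the characters that are special in XML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Renders name/value pairs in the Hadoop-style configuration layout. */
    public static String toSiteXml(Map<String, String> props) {
        StringBuilder sb =
                new StringBuilder("<?xml version=\"1.0\"?>\n<configuration>\n");
        for (Map.Entry<String, String> e : props.entrySet()) {
            sb.append("  <property>\n")
              .append("    <name>").append(escape(e.getKey())).append("</name>\n")
              .append("    <value>").append(escape(e.getValue())).append("</value>\n")
              .append("  </property>\n");
        }
        return sb.append("</configuration>\n").toString();
    }

    public static void main(String[] args) {
        // Values would come from the embedding application at runtime.
        Map<String, String> props = new LinkedHashMap<String, String>();
        props.put("http.agent.name", "my-embedded-crawler"); // illustrative value
        System.out.print(toSiteXml(props));
    }
}
```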

Re: seed file - this actually needs to be a file so that Hadoop
FileInputFormat works. If you absolutely can't create a seed list in a temp
file, then you will need to change Injector to use a different
InputFormat implementation that reads these values from some other source...
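
If a temp file is acceptable, the seed list can be produced from in-memory URLs just before the crawl starts. The sketch below assumes Java 7 or later (java.nio.file) and assumes the seed input is a directory of plain text files with one URL per line. The class name SeedListWriter and the file name seed.txt are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class SeedListWriter {

    /**
     * Writes the given URLs, one per line, to "seed.txt" inside a fresh
     * temporary directory and returns that directory.
     */
    public static Path writeSeedDir(List<String> urls) throws IOException {
        Path dir = Files.createTempDirectory("nutch-seeds");
        Path seedFile = dir.resolve("seed.txt");
        Files.write(seedFile, urls, StandardCharsets.UTF_8);
        return dir;
    }

    public static void main(String[] args) throws IOException {
        // URLs would normally be supplied by the embedding application.
        Path dir = writeSeedDir(
                Arrays.asList("http://example.com/", "http://example.org/"));
        System.out.println("Seed directory: " + dir);
    }
}
```

The returned directory path is what would then be handed to the injection step as the URL directory. Cleaning up the temporary directory after the crawl is left to the application.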

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Embed the Crawl API in my application

Posted by Roger Marin <rs...@gmail.com>.
Thanks everyone, I managed to get it working using a "private" Cygwin
install. Now I'm going through the configuration issues.

Again, thank you all!

On 9 August 2010 07:03, Hannes Carl Meyer <ha...@googlemail.com> wrote:

> Hi,
> I used the same example for the integration of nutch inside an EAR.
> Take a look at: au.csiro.cass.arch.utils.Starter (thanks again to Arkadi!)
> Regards
> Hannes
> On Mon, Aug 9, 2010 at 1:53 AM, <Ar...@csiro.au> wrote:
>
> > Hi,
> >
> > >-----Original Message-----
> > >From: Roger Marin [mailto:rsmaniak@gmail.com]
> > >Sent: Saturday, August 07, 2010 5:01 AM
> > >To: user@nutch.apache.org
> > >Subject: Embed the Crawl API in my application
> > >
> >
> > ...
> >
> > >
> > >The other stuff I need to figure out is if it's possible to
> > >programmatically
> > >set some of the parameters needed to use the crawler, for instance I
> > >need to
> > >programmatically set the values of the urls instead of having a url
> > >file, or
> > >a crawl-urlfilter file as well as the properties in the nutch-site.xml,
> > >because these can be configured dynamically by the application and are
> > >relative to the application itself
> > >so I cannot hardcode these properties.
> >
> > This is done in Arch. You can get source here
> >
> > http://www.atnf.csiro.au/computing/software/arch/
> >
> > and use it as an example.
> >
> > >
> > >Any help you can give me or documentation you can point me to, will be
> > >greatly appreciated.
> > >
> > >
> > >Thank you.
> >
>

Re: Embed the Crawl API in my application

Posted by Hannes Carl Meyer <ha...@googlemail.com>.
Hi,
I used the same example for the integration of nutch inside an EAR.
Take a look at: au.csiro.cass.arch.utils.Starter (thanks again to Arkadi!)
Regards
Hannes
On Mon, Aug 9, 2010 at 1:53 AM, <Ar...@csiro.au> wrote:

> Hi,
>
> >-----Original Message-----
> >From: Roger Marin [mailto:rsmaniak@gmail.com]
> >Sent: Saturday, August 07, 2010 5:01 AM
> >To: user@nutch.apache.org
> >Subject: Embed the Crawl API in my application
> >
>
> ...
>
> >
> >The other stuff I need to figure out is if it's possible to
> >programmatically
> >set some of the parameters needed to use the crawler, for instance I
> >need to
> >programmatically set the values of the urls instead of having a url
> >file, or
> >a crawl-urlfilter file as well as the properties in the nutch-site.xml,
> >because these can be configured dynamically by the application and are
> >relative to the application itself
> >so I cannot hardcode these properties.
>
> This is done in Arch. You can get source here
>
> http://www.atnf.csiro.au/computing/software/arch/
>
> and use it as an example.
>
> >
> >Any help you can give me or documentation you can point me to, will be
> >greatly appreciated.
> >
> >
> >Thank you.
>

RE: Embed the Crawl API in my application

Posted by Ar...@csiro.au.
Hi,

>-----Original Message-----
>From: Roger Marin [mailto:rsmaniak@gmail.com]
>Sent: Saturday, August 07, 2010 5:01 AM
>To: user@nutch.apache.org
>Subject: Embed the Crawl API in my application
>

...

>
>The other stuff I need to figure out is if it's possible to
>programmatically
>set some of the parameters needed to use the crawler, for instance I
>need to
>programmatically set the values of the urls instead of having a url
>file, or
>a crawl-urlfilter file as well as the properties in the nutch-site.xml,
>because these can be configured dynamically by the application and are
>relative to the application itself
>so I cannot hardcode these properties.

This is done in Arch. You can get the source here

http://www.atnf.csiro.au/computing/software/arch/

and use it as an example.

>
>Any help you can give me or documentation you can point me to, will be
>greatly appreciated.
>
>
>Thank you.