Posted to user@nutch.apache.org by co...@complexityintelligence.com on 2012/01/18 10:15:13 UTC

Embedded Nutch API

Hello,

  We've finished plug-in development (a filter by lang-id; it will be
proposed on JIRA shortly). Now we want to embed Nutch control in a
Java Intelligent Agent, so what I'm looking for is the best strategy
for using Nutch directly in a Java program, to avoid relying on
shell scripts and so on.

  Is there an API designed for this purpose? What do you recommend?

  I need to control all steps, from crawling to indexing (Solr).

Best,
Alessio

Re: Embedded Nutch API

Posted by Ferdy Galema <fe...@kalooga.com>.

Ah yes, spawning a new process is also possible, instead of directly
calling Nutch code from within your own code. My previous mail
described the latter. Both options have their benefits: spawning a new
process that uses high-level Nutch commands is easier to set up,
because you can prepare the Nutch installation separately. But
integrating Nutch into your project itself gives you finer-grained
control and lets you use much more specific Nutch code/components.
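
As a rough sketch of the process-spawning option described above, the following builds a command line for the Nutch 1.x one-shot "crawl" command and runs it with java.lang.ProcessBuilder. The class name, the helper names, and the example install path are hypothetical; the options mirror the ones quoted elsewhere in this thread (-depth, -topN, -solr). It assumes bin/nutch can be executed directly, which on Windows would typically need a Unix-like shell.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class NutchProcessLauncher {

    // Builds the argument list for the one-shot "crawl" command.
    // nutchHome is assumed to be the directory that contains bin/nutch.
    static List<String> buildCrawlCommand(String nutchHome, String urlDir,
                                          int depth, int topN, String solrUrl) {
        List<String> cmd = new ArrayList<>();
        cmd.add(new File(nutchHome, "bin/nutch").getPath());
        cmd.add("crawl");
        cmd.add(urlDir);
        cmd.add("-depth");
        cmd.add(Integer.toString(depth));
        cmd.add("-topN");
        cmd.add(Integer.toString(topN));
        cmd.add("-solr");
        cmd.add(solrUrl);
        return cmd;
    }

    // Starts the command in the Nutch directory, forwards its output to this
    // JVM's console, and returns the exit code once the process finishes.
    static int run(String nutchHome, List<String> cmd)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(cmd);
        pb.directory(new File(nutchHome));
        pb.inheritIO();
        return pb.start().waitFor();
    }

    public static void main(String[] args) {
        // Hypothetical install path; this only prints the command.
        List<String> cmd = buildCrawlCommand("/opt/nutch/runtime/local", "urls",
                3, 250, "http://localhost:8081/solr");
        System.out.println(String.join(" ", cmd));
    }
}
```

Calling run(nutchHome, cmd) actually launches the crawl; a non-zero return value would normally indicate a failure, so the calling agent can decide whether to retry.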

By the way, I think the permission check error you mention is completely
unrelated to the above. (It seems specific to Windows, though.)

On 01/19/2012 07:51 AM, shlomi java wrote:
> I used to use org.apache.nutch.crawl.Crawl.main(String[]).
>
> I used to run it with the following program arguments:
> "C:\urls" -depth 3 -topN 250 -threads 100 -solr "http://localhost:8081/solr"
> and with the following VM arguments:
> -Xmx1024m -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
> -Dhadoop.root.logger=DEBUG,DRFA -Dhadoop.tmp.dir="D:\tmp"
> -Dfs.default.name="file:///D:/fs"
> -Dplugin.folders="D:\Nutch-1.4\runtime\local\plugins"
> -Ddfs.permissions=false
>
> The Crawl class simply submits the different jobs in the Nutch task to
> Hadoop, which runs locally (not on a Hadoop cluster).
>
> I write "used to" because a change in the Hadoop JAR now prevents
> this, due to a permission check at the file system level.
> See
> http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201201.mbox/%3CCAH%3DrPUFTs%3DkXNKx3Oh9jVOVMe7x11RQw20Tg7_MN8cTu%3DO-0nA%40mail.gmail.com%3E
>
> ShlomiJ

Re: Embedded Nutch API

Posted by shlomi java <sh...@gmail.com>.

I used to use org.apache.nutch.crawl.Crawl.main(String[]).

I used to run it with the following program arguments:
"C:\urls" -depth 3 -topN 250 -threads 100 -solr "http://localhost:8081/solr"
and with the following VM arguments:
-Xmx1024m -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
-Dhadoop.root.logger=DEBUG,DRFA -Dhadoop.tmp.dir="D:\tmp"
-Dfs.default.name="file:///D:/fs"
-Dplugin.folders="D:\Nutch-1.4\runtime\local\plugins"
-Ddfs.permissions=false

The Crawl class simply submits the different jobs in the Nutch task to
Hadoop, which runs locally (not on a Hadoop cluster).
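
A minimal sketch of that approach, for illustration only: the helper below assembles the same program arguments and calls org.apache.nutch.crawl.Crawl.main(String[]) by reflection, so the snippet compiles even when the Nutch jar is not on the compile-time classpath. The class name EmbeddedCrawl and its method names are hypothetical. It assumes the Nutch jar, its dependencies, the configuration, and the plugins are available at run time, and that the VM arguments listed above are still passed on the JVM command line.

```java
import java.lang.reflect.Method;

public class EmbeddedCrawl {

    // Assembles the program arguments in the order used in this message.
    static String[] buildCrawlArgs(String urlDir, int depth, int topN,
                                   int threads, String solrUrl) {
        return new String[] {
            urlDir,
            "-depth", Integer.toString(depth),
            "-topN", Integer.toString(topN),
            "-threads", Integer.toString(threads),
            "-solr", solrUrl
        };
    }

    // Looks up Crawl.main(String[]) at run time and invokes it.
    // Throws ClassNotFoundException if Nutch is not on the classpath.
    static void runCrawl(String[] crawlArgs) throws Exception {
        Class<?> crawl = Class.forName("org.apache.nutch.crawl.Crawl");
        Method main = crawl.getMethod("main", String[].class);
        // The cast passes the array as a single argument rather than as varargs.
        main.invoke(null, (Object) crawlArgs);
    }

    public static void main(String[] args) {
        String[] crawlArgs = buildCrawlArgs("C:\\urls", 3, 250, 100,
                "http://localhost:8081/solr");
        // Only prints the arguments; call runCrawl(crawlArgs) to start a crawl.
        System.out.println(String.join(" ", crawlArgs));
    }
}
```

With Nutch on the compile-time classpath, a direct call to Crawl.main(crawlArgs) would do the same thing. It is worth checking whether main calls System.exit in the Nutch version in use, since that would also terminate the embedding program.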

I write "used to" because a change in the Hadoop JAR now prevents
this, due to a permission check at the file system level.
See
http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201201.mbox/%3CCAH%3DrPUFTs%3DkXNKx3Oh9jVOVMe7x11RQw20Tg7_MN8cTu%3DO-0nA%40mail.gmail.com%3E

ShlomiJ




On Wed, Jan 18, 2012 at 3:15 PM, Ferdy Galema <fe...@kalooga.com> wrote:

> Hi,
>
> There is not really an API for this purpose, but it is possible to
> integrate Nutch in a Java project. While it is possible to simply drop the
> Nutch core jar and its jar dependencies into your project, this won't get
> you very far. You need to add the enabled plugins and configuration too.
> Plugins especially can be a pain in this regard: they need to be put onto
> the classpath (the 'plugins' folder itself), and a current limitation of the
> plugin framework is that they actually need to be available at the file
> level. (So embedding them into a single jar won't work yet.)
>
> As for the specific commands, you can look them up in Crawl.java (or
> Crawler.java when using Nutchgora) to see how to use the individual Nutch
> commands in a Java environment.
>
> Ferdy.

Re: Embedded Nutch API

Posted by Ferdy Galema <fe...@kalooga.com>.

Hi,

There is not really an API for this purpose, but it is possible to
integrate Nutch in a Java project. While it is possible to simply drop
the Nutch core jar and its jar dependencies into your project, this won't
get you very far. You need to add the enabled plugins and configuration
too. Plugins especially can be a pain in this regard: they need to be
put onto the classpath (the 'plugins' folder itself), and a current
limitation of the plugin framework is that they actually need to be
available at the file level. (So embedding them into a single jar won't
work yet.)
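
To illustrate the file-level constraint, here is a small pre-flight check an embedding program could run before starting Nutch. It is only a sketch: the class and method names are hypothetical, and it does not call any Nutch code. It simply tests whether a 'plugins' folder found on the classpath resolves to a real directory on disk rather than to an entry inside a jar.

```java
import java.io.File;
import java.net.URL;

public class PluginFolderCheck {

    // Returns true only if the URL points at a real directory on disk.
    // A resource packed inside a jar has the "jar" protocol and is rejected.
    static boolean isPlainDirectory(URL url) {
        if (url == null || !"file".equals(url.getProtocol())) {
            return false;
        }
        try {
            return new File(url.toURI()).isDirectory();
        } catch (Exception e) {
            // Malformed or non-hierarchical URI: treat as not usable.
            return false;
        }
    }

    // Looks the folder up on the classpath; returns null when it is absent.
    static URL locate(String folderName) {
        return PluginFolderCheck.class.getClassLoader().getResource(folderName);
    }

    public static void main(String[] args) {
        URL plugins = locate("plugins");
        System.out.println("plugins folder usable: " + isPlainDirectory(plugins));
    }
}
```

If the check fails, the plugins were probably bundled into a jar or are missing from the classpath, which matches the limitation described above.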

As for the specific commands, you can look them up in Crawl.java (or 
Crawler.java when using Nutchgora) to see how to use the individual 
Nutch commands in a Java environment.

Ferdy.

On 01/18/2012 10:15 AM, contacts@complexityintelligence.com wrote:
> Hello,
>
>    We've finished plug-in development (a filter by lang-id; it will be
> proposed on JIRA shortly). Now we want to embed Nutch control in a
> Java Intelligent Agent, so what I'm looking for is the best strategy
> for using Nutch directly in a Java program, to avoid relying on
> shell scripts and so on.
>
>    Is there an API designed for this purpose? What do you recommend?
>
>    I need to control all steps, from crawling to indexing (Solr).
>
> Best,
> Alessio
>