Posted to user@nutch.apache.org by jasimop <st...@gmail.com> on 2012/08/01 15:00:50 UTC

Re: Integrating Nutch

> Resources such as the URL filter and normalizer rule files 
> are usually defined as pure files without path and are located 
> on the classpath. So it should work if 
>  C:/server/nutch/conf/ 
> is in the classpath and the resources are simply named "regex-urlfilter.txt" 
> and "regex-normalize.xml", respectively. 

Thanks for the information. It works now that I've put the files on the classpath
and refer to them by filename alone.
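Concretely, that just means starting the JVM with the conf directory on the
classpath, e.g. something like this (the jar name and main class are placeholders):

  java -cp "C:/server/nutch/conf;myapp.jar" com.example.CrawlerApp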
Everything is in place and I can start a crawl cycle from my Java application.
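For context, the cycle I drive looks roughly like the sketch below. It is only a
sketch: the paths are placeholders, I'm assuming the Nutch 1.x job classes take
the same arguments as their bin/nutch counterparts, and I've left out the parse
and index steps:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.util.ToolRunner;
  import org.apache.nutch.crawl.CrawlDb;
  import org.apache.nutch.crawl.Generator;
  import org.apache.nutch.crawl.Injector;
  import org.apache.nutch.fetcher.Fetcher;
  import org.apache.nutch.util.NutchConfiguration;

  public class CrawlCycle {
    public static void main(String[] args) throws Exception {
      Configuration conf = NutchConfiguration.create();
      // 1. inject seed URLs (from the "urls" dir) into the crawldb
      ToolRunner.run(conf, new Injector(),
          new String[] { "crawl/crawldb", "urls" });
      // 2. generate a fetch list; this creates a new timestamped
      //    segment directory under crawl/segments
      ToolRunner.run(conf, new Generator(),
          new String[] { "crawl/crawldb", "crawl/segments" });
      // 3. fetch the generated segment; real code should list
      //    crawl/segments and pick the newest dir -- hardcoded here
      String segment = "crawl/segments/20120801150000";
      ToolRunner.run(conf, new Fetcher(), new String[] { segment });
      // 4. update the crawldb with the fetch results
      ToolRunner.run(conf, new CrawlDb(),
          new String[] { "crawl/crawldb", segment });
    }
  }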
One question though: is there a way to get more verbose information out of the
crawl process than just the logging output? I mean something like the URLs already
crawled, the ones still waiting to be crawled, the current status, etc.
Programmatically I can only infer which stage the process is in (injecting,
fetching, etc.), but no details. The Injector, Generator, and Fetcher classes
don't seem to offer any useful methods for that purpose.
Any hints?

Regards,

Max

Re: Integrating Nutch

Posted by Sebastian Nagel <wa...@googlemail.com>.
> One question though: Is there a way to get some more verbose
> information out of the crawl process than just the logging information?
> I intend something like the urls crawled, the ones waiting to be crawled, current status etc?
> Programmatically I can only infer at what stage the process is (injecting, fetching etc.),
> but no details. Injector Generator and Fetcher classes seem not to contain any useful
> methods for that purpose.

Many Nutch classes make use of Hadoop job counters (look for
org.apache.hadoop.mapred.Reporter). But I actually don't know how to access
these counters for running jobs from inside a Java application.
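One untested idea: ask a JobClient for the jobs currently in flight and dump
their counters. This is only a sketch and assumes the jobs go through a
JobTracker the client can reach; with the purely local job runner it probably
won't show anything:

  import org.apache.hadoop.mapred.Counters;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.JobStatus;
  import org.apache.hadoop.mapred.RunningJob;

  public class JobCounterDump {
    public static void dump(JobConf conf) throws Exception {
      JobClient client = new JobClient(conf);
      // jobsToComplete() lists jobs that are still queued or running
      for (JobStatus status : client.jobsToComplete()) {
        RunningJob job = client.getJob(status.getJobID());
        Counters counters = job.getCounters();
        for (Counters.Group group : counters) {
          for (Counters.Counter counter : group) {
            System.out.println(group.getDisplayName() + " / "
                + counter.getDisplayName() + " = " + counter.getCounter());
          }
        }
      }
    }
  }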

Another possibility is to run
 nutch readdb -stats   (CrawlDbReader#processStatJob)
after each cycle, which reports the number of fetched, unfetched, failed, etc.
documents.
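From Java that would be something like the sketch below. I'm assuming the
Nutch 1.x signature processStatJob(String, Configuration, boolean); the crawldb
path is a placeholder, and note that the numbers are written to the log rather
than returned:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.crawl.CrawlDbReader;
  import org.apache.nutch.util.NutchConfiguration;

  public class CrawlDbStats {
    public static void main(String[] args) throws Exception {
      Configuration conf = NutchConfiguration.create();
      CrawlDbReader dbReader = new CrawlDbReader();
      // logs TOTAL urls, db_fetched, db_unfetched, db_gone, etc.;
      // the boolean corresponds to the command line -sort flag
      dbReader.processStatJob("crawl/crawldb", conf, false);
      dbReader.close();
    }
  }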
