Posted to user@nutch.apache.org by Jeff Pettenski <jp...@gmail.com> on 2005/10/03 20:43:53 UTC

Example of segslice using -filterUrlBy

Here is the example. It works.

./nutch segslice -filterUrlBy "-(.*dba.test.com/ftpinput/.*|.*home.in.test.com/sunrp/south/.*)" \
    -o /apps/nutch/baseROOT/data/nutch/searchC/segments/20051003000141B \
    -logLevel FINE /data/nutch/searchC/segments/20051003000141 \
    > ./logs/slice.log 2>&1


A log level of FINE will log the url entries it copies and the ones it skips.
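If I read the leading "-" in the -filterUrlBy argument as "exclude URLs matching this regex", the slice keeps everything except the two directory trees in the alternation. A rough sketch of that behavior, using grep -Ev in place of segslice and a made-up URL list (note the dots in the original pattern are unescaped, so they also match any single character):

```shell
#!/bin/sh
# Hypothetical illustration only: grep -Ev stands in for segslice's
# exclusion filter, and the URLs below are invented for the demo.
cat > /tmp/urls.txt <<'EOF'
http://dba.test.com/ftpinput/file1.dat
http://home.in.test.com/sunrp/south/page.html
http://www.test.com/ok/page.html
EOF

# Print only the URLs the slice would copy, i.e. drop the excluded trees.
grep -Ev '.*dba.test.com/ftpinput/.*|.*home.in.test.com/sunrp/south/.*' /tmp/urls.txt
```

Here only the third URL survives; the other two match the pattern and would be skipped.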

Why use this? I have written a perl script that does pretty much what the java
crawl does, but in discrete steps. I also split the fetch step into two steps:
fetch and parse.

In my case the fetch worked great, but the parse hung. So I used the slice to
take out what I think are the offending urls, and I will re-run the parse and
updatedb on the last segment, then go back to the standard crawl loop.

-j.p.