Posted to user@nutch.apache.org by Jeff Pettenski <jp...@gmail.com> on 2005/10/03 20:43:53 UTC
Example of segslice using -filterUrlBy
Here is the example. It works.
./nutch segslice -filterUrlBy "-(.*dba.test.com/ftpinput/.*|.*home.in.test.com/sunrp/south/.*)" \
  -o /apps/nutch/baseROOT/data/nutch/searchC/segments/20051003000141B \
  -logLevel FINE /data/nutch/searchC/segments/20051003000141 \
  > ./logs/slice.log 2>&1
A log level of FINE will log each URL entry as it is copied or skipped.
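As a rough illustration of what the exclusion pattern above does (the leading "-" tells the filter to drop URLs matching the regex), here is a hedged sketch that emulates the same keep/drop behavior with grep -vE. The URLs are hypothetical examples, and grep is only a stand-in for the actual filter:

```shell
# Hypothetical URLs; grep -vE drops lines matching the pattern,
# mimicking the "-" (exclude) semantics of -filterUrlBy.
pattern='.*dba.test.com/ftpinput/.*|.*home.in.test.com/sunrp/south/.*'
printf '%s\n' \
  'http://dba.test.com/ftpinput/file1.txt' \
  'http://home.in.test.com/sunrp/south/page.html' \
  'http://www.test.com/ok.html' \
  | grep -vE "$pattern"
# Only http://www.test.com/ok.html survives the filter
```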
Why use this? I have written a Perl script that does pretty much what the Java
crawl does, but in discrete steps. I also split the fetch step into separate
fetch and parse steps.
In my case the fetch worked great, but the parse hung. So I used the slice to
remove what I think are the offending URLs; I will re-run the parse and
updatedb on the last segment, then return to the standard crawl loop.
-j.p.