Posted to user@nutch.apache.org by Lyndon Maydwell <ma...@gmail.com> on 2007/12/14 08:08:40 UTC
filter / normalize from command line on existing db
I'm attempting to run some new regex-normalize and regex-urlfilter
rules on my existing crawl directory.
For example:
<regex>
<pattern>(https?://)www\.(.*)</pattern>
<substitution>$1$2</substitution>
</regex>
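To check the rule on its own, outside of any crawl data, here is a minimal sketch in Java. It assumes the normalizer applies the pattern with java.util.regex replaceAll semantics, which I have not confirmed against the Nutch source. The class name NormalizeRuleCheck and the method normalize are hypothetical and not part of Nutch.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone check of the rule above; not Nutch code.
public class NormalizeRuleCheck {
    // Same pattern and substitution as in the <regex> rule.
    private static final Pattern WWW_PREFIX =
            Pattern.compile("(https?://)www\\.(.*)");
    private static final String SUBSTITUTION = "$1$2";

    // Assumption: the rule is applied as a plain replaceAll on the URL string.
    public static String normalize(String url) {
        return WWW_PREFIX.matcher(url).replaceAll(SUBSTITUTION);
    }

    public static void main(String[] args) {
        // The "www." prefix is dropped, scheme and path are kept.
        System.out.println(normalize("http://www.example.com/index.html"));
        // A URL without "www." does not match and is returned unchanged.
        System.out.println(normalize("https://example.org/"));
    }
}
```

If the first line printed is http://example.com/index.html, the rule itself behaves as intended, and the question is only whether it is being applied to the existing data.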
I tried the updatedb command and the mergedb command, but neither
seems to change what the web application returns:
./nutch updatedb ../TEST1/crawl/crawldb/ ../TEST1/crawl/segments/20071125053435/ -normalize -filter
./nutch mergedb ../TEST1/crawl/crawldb/ ../TEST1/crawl/crawldb/ -normalize -filter
Am I on the right track?