Posted to user@nutch.apache.org by Lyndon Maydwell <ma...@gmail.com> on 2007/12/14 08:08:40 UTC

filter / normalize from command line on existing db

I'm attempting to run some new regex-normalize and regex-urlfilter
rules on my existing crawl directory.

For example, this normalize rule is meant to strip the leading "www." from URLs:

<regex>
        <pattern>(https?://)www\.(.*)</pattern>
        <substitution>$1$2</substitution>
</regex>
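
To show what I expect that rule to do, here is a minimal Java sketch. It only assumes the normalizer applies the pattern and substitution the way java.util.regex does; the class name NormalizeSketch and the example.com URLs are made up for illustration and are not part of Nutch.

```java
import java.util.regex.Pattern;

public class NormalizeSketch {
    // Pattern and substitution copied from the rule above.
    private static final Pattern WWW = Pattern.compile("(https?://)www\\.(.*)");
    private static final String SUBSTITUTION = "$1$2";

    // Assumption: the rule behaves like a plain java.util.regex replaceAll.
    static String normalize(String url) {
        return WWW.matcher(url).replaceAll(SUBSTITUTION);
    }

    public static void main(String[] args) {
        // URL with a leading "www." that the rule should rewrite.
        System.out.println(normalize("http://www.example.com/page"));
        // URL without "www." that the rule should leave unchanged.
        System.out.println(normalize("https://example.com/page"));
    }
}
```

So a URL already stored as http://www.example.com/page should end up as http://example.com/page once the rule has been applied to the existing db.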

I tried the updatedb command and the mergedb command, but neither
seems to change what the web application returns:

./nutch updatedb ../TEST1/crawl/crawldb/ ../TEST1/crawl/segments/20071125053435/ -normalize -filter
./nutch mergedb ../TEST1/crawl/crawldb/ ../TEST1/crawl/crawldb/ -normalize -filter

Am I on the right track?