Posted to user@nutch.apache.org by devang pandey <de...@gmail.com> on 2013/07/11 07:59:44 UTC

questions regarding nutch url normalizer

Hello, I am using Nutch 1.2 to crawl a site. Some of the URLs look like
www.example/(sndjnc22e3r3r))/abc.com. I want to strip out the part in
brackets to normalize my URLs. To do this I wrote a regex with a
substitution in my regex normalizer. I have crawled again, but I still do
not get the expected results.

Please guide me in solving this issue.

Re: questions regarding nutch url normalizer

Posted by Sebastian Nagel <wa...@googlemail.com>.
I would strongly recommend testing the normalizer(s) before crawling.
There are two handy tools for seeing what you get after normalization:

echo "http://www.example/(sndjnc22e3r3r))/abc.com" \
  | $NUTCH_HOME/bin/nutch org.apache.nutch.net.URLNormalizerChecker

$NUTCH_HOME/bin/nutch plugin urlnormalizer-regex \
  org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer <url>

And yes, you can combine this with the URL filter checker:

cat urls.txt \
  | $NUTCH_HOME/bin/nutch org.apache.nutch.net.URLNormalizerChecker \
  | $NUTCH_HOME/bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
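
For a rough first check of the pattern itself, without Nutch in the loop,
something like the following sed sketch may help. The pattern and the
function name strip_paren_segment are assumptions for illustration, not
taken from the original poster's configuration, and sed's extended regex
syntax differs from Java's in places, so the rule should still be confirmed
with the checker tools above.

```shell
#!/bin/sh
# Hypothetical helper: remove one path segment wrapped in parentheses,
# e.g. "/(sndjnc22e3r3r))/" becomes "/".
strip_paren_segment() {
  # -E selects extended regexes; '#' is used as the delimiter so that
  # '/' does not need escaping. Only the first match per line is replaced.
  sed -E 's#/\([^/]*\)/#/#'
}

# Try it on the URL from the question.
echo "http://www.example/(sndjnc22e3r3r))/abc.com" | strip_paren_segment
```

URLs without a parenthesized segment pass through unchanged, which makes it
easy to pipe a whole seed list through the function and compare the output
with the input.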
