Posted to user@nutch.apache.org by Niclas Rothman <ni...@lechill.com> on 2005/10/04 11:24:20 UTC

NewbieNutcher.....

Hi all Nutch users!!!

I'm new to the Nutch crawl system and have been trying for some time to
crawl a site successfully, without much luck.

I have written a shell script to do the work (all taken from the tutorial),
but the round trips of the generate, fetch and updatedb commands seem to
fail: when I run the second or third round, no new URLs are found to fetch,
and my site only gets partly indexed.

 

My script looks like this:

 

*****************************************************************************************************

#Remove any directories left over from the last test.
rm -r db
rm -r segments

#Create directories
mkdir db
mkdir segments

./nutch admin db -create

./nutch inject db -urlfile ../conf/root-urls.txt   # this file contains just one URL, http://www.rivieradvd.com/home.htm

./nutch generate db segments
s1=`ls -d segments/2* | tail -1`
./nutch fetch $s1
./nutch updatedb db $s1

for (( i = 0; i <= 5; i++ ))
do
    ./nutch generate db segments -topN 1000
    s1=`ls -d segments/2* | tail -1`
    ./nutch fetch $s1
    ./nutch updatedb db $s1
done

./nutch dedup segments dedup.tmp

*****************************************************************************************************
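In case it helps with the diagnosis: as I understand it, the db can be
inspected between rounds to check whether updatedb is actually adding new
pages and links, along the lines of the snippet below (I'm assuming the
readdb tool and its -stats option are available in this version):

# Dump WebDB statistics (page/link counts) after an updatedb pass; if the
# counts stop growing, the later generate rounds have nothing left to fetch.
./nutch readdb db -stats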

 

My crawl-urlfilter file looks like this (it should also fetch pages with
query strings, right?):

 

*****************************************************************************************************

# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG|js)$

# skip URLs containing certain characters as probable queries, etc.
#-[*!@=]

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*rivieradvd.com/

# skip everything else
-.

*****************************************************************************************************
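As a quick sanity check outside of Nutch, a candidate URL can be tested
against the accept pattern with grep (the URL below is only a made-up
example, not necessarily a real page on the site):

# Test whether a query-string URL passes the accept pattern above;
# grep -E prints the URL if it matches, and nothing if it doesn't.
echo 'http://www.rivieradvd.com/catalog.jsp?id=42' | grep -E '^http://([a-z0-9]*\.)*rivieradvd.com/'

As far as I can tell, with the '-[*!@=]' line commented out nothing in this
file rejects query-string URLs, so a URL like the one above should be
accepted.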

 

 

I hope I have given you enough information to guide me onto the right track.

Best regards,
Niclas

 

 

 


Re: NewbieNutcher.....

Posted by Jeff Pettenski <jp...@gmail.com>.
Niclas,

Add -showThreadID -logLevel FINE to the fetch.

The way you are calling fetch (without the -noParsing switch) tells it to
parse within the fetch step.

You could instead use the -noParsing switch and add:
./nutch parse $segDir -showThreadID -logLevel FINE

This splits the fetch and the parse into separate steps, so you can see more
clearly whether it is the actual fetch or the parsing that is giving you
trouble.
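Applied to the loop in your script (where the segment variable is $s1 rather
than $segDir), that would look roughly like the two lines below; I'm going
from memory on the flag placement, so check the usage output bin/nutch prints
if it complains:

./nutch fetch $s1 -noParsing -showThreadID -logLevel FINE
./nutch parse $s1 -showThreadID -logLevel FINE

The updatedb step would then run after the parse, since it needs the parsed
segment data.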

Also, you may want to widen your exclude list to cover some other file types.
I have:
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|MPG|xls|gz|rpm|tgz|mov|MOV|exe|EXE|png|PNG|js|msi|bin|BIN|jar|vsd|xml|avi|AVI|jpeg|JPEG|lnk|LNK|wav|WAV|bmp|BMP|reg|REG|sit|SIT|hqs|HQS|mp4|MP4|texi|TEXI|pps|PPS|svg|SVG|tiff|TIFF|tif|TIF|mov|MOV|mso|MSO|asf|ASF|jpe|JPE|raw|RAW)$

Go figure!

(Small rant.) My latest issue has been a large (8 MB) page. It seems to send
the parser into fits. The page has an .html suffix and is recognized as
text/html, but it is really just plain text, all 8 MB of it, and it seems to
choke the NekoHTML parser. O.K., maybe I'm just not patient enough to wait
hours for it to parse ... nothing. In some cases I end up getting
out-of-memory errors, and after 6 or more of those I think the JVM gets hosed.

I isolated the 8 MB file and tried TagSoup on it instead; it gets through the
file in about half an hour, versus the out-of-memory errors.
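For reference, TagSoup ships with a simple command-line driver, so an
isolated page can be run through it outside of Nutch with something like the
line below (the jar and file names are placeholders for whatever you have
locally):

# Run TagSoup's CLI over the isolated page and discard the output; the point
# is just to see whether it finishes and roughly how long it takes.
java -jar tagsoup.jar big-page.html > /dev/null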

In any case, turn up the logging; that should point you in a better
direction.

-j.p.
