Posted to dev@nutch.apache.org by Bin Wang <bi...@gmail.com> on 2013/12/27 19:49:52 UTC

Nutch Crawl a Specific List Of URLs (150K)

Hi,

I have a very specific list of URLs, about 140K of them.

I switched off `db.update.additions.allowed` so that Nutch will not update
the crawldb, and I assumed I could feed all the URLs to Nutch and that,
after one round of fetching, it would finish and leave all the raw HTML
files in the segment folder.
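
For reference, here is the property as I set it in conf/nutch-site.xml
(the name and default come from nutch-default.xml):

<property>
  <name>db.update.additions.allowed</name>
  <value>false</value>
  <description>If true, updatedb adds newly discovered URLs to the
  crawldb; false restricts the crawldb to the injected seeds.</description>
</property>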

However, after I ran this command:
nohup bin/nutch crawl urls -dir result -depth 1 -topN 200000 &

It ended up with only a small number of URLs:
TOTAL urls: 872
retry 0: 872
min score: 1.0
avg score: 1.0
max score: 1.0

I double-checked the log to make sure that every URL passed the filters
and normalization. Here is the log:

2013-12-27 17:55:25,068 INFO  crawl.Injector - Injector: total number of
urls rejected by filters: 0
2013-12-27 17:55:25,069 INFO  crawl.Injector - Injector: total number of
urls injected after normalization and filtering: 139058
2013-12-27 17:55:25,069 INFO  crawl.Injector - Injector: Merging injected
urls into crawl db.

I don't know how 140K URLs ended up being 872 in the end...

/usr/bin

----------------------
AWS ubuntu instance
Nutch 1.7
java version "1.6.0_27"
OpenJDK Runtime Environment (IcedTea6 1.12.6) (6b27-1.12.6-1ubuntu0.12.04.4)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

Re: Nutch Crawl a Specific List Of URLs (150K)

Posted by Bin Wang <bi...@gmail.com>.
Thanks for all the responses; they are very helpful, and digging into the
logs is a great way to learn Nutch.

The fact is that I used Python's BeautifulSoup to parse the sitemap of my
target website, which produced those 150K URLs. It turned out, however,
that the list contained a great many duplicates and boiled down to only
about 900 distinct URLs.

Nutch was smart enough to filter out those duplicates and reduce the list
to ~900 before ever hitting the website.
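
In hindsight, a quick sanity check on the seed file would have caught this
before injecting (the seed path here is just an example):

sort -u urls/seed.txt | wc -l    # count distinct URLs in the seed list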



On Mon, Dec 30, 2013 at 4:13 AM, Markus Jelsma
<ma...@openindex.io> wrote:

> Hi,
>
> You ran one crawl cycle. Depending on the generator and fetcher settings,
> you are not guaranteed to fetch 200,000 URLs with only topN specified.
> Check the logs: the generator will tell you if there are too many URLs for
> a host or domain. Also check all fetcher logs; they will tell you how many
> URLs were crawled and why fetching likely stopped when it did.
>
> Cheers

RE: Nutch Crawl a Specific List Of URLs (150K)

Posted by Markus Jelsma <ma...@openindex.io>.
Hi, 

You ran one crawl cycle. Depending on the generator and fetcher settings, you are not guaranteed to fetch 200,000 URLs with only topN specified. Check the logs: the generator will tell you if there are too many URLs for a host or domain. Also check all fetcher logs; they will tell you how many URLs were crawled and why fetching likely stopped when it did.
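
For example, the per-host/domain cap is controlled by these properties in nutch-site.xml (the values shown are the defaults as far as I recall; -1 means no limit):

<property>
  <name>generate.max.count</name>
  <value>-1</value>
  <description>Maximum number of URLs per fetchlist for a single host or domain, counted according to generate.count.mode; -1 is unlimited.</description>
</property>
<property>
  <name>generate.count.mode</name>
  <value>host</value>
  <description>Whether generate.max.count is applied per host or per domain.</description>
</property>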

Cheers


Re: Nutch Crawl a Specific List Of URLs (150K)

Posted by Tejas Patil <te...@gmail.com>.
Hi Bin Wang,

>> nohup bin/nutch crawl urls -dir result -depth 1 -topN 200000 &
Were you creating a new crawldb or reusing an old one?

Were you running this on a cluster or in local mode?
Was there any failure that caused the fetch round to abort? (See the logs
for this.)

I would like to reproduce this issue. Would it be possible for you to share
your config files and a subset of the URLs?

Thanks,
Tejas


On Sat, Dec 28, 2013 at 2:10 AM, Talat Uyarer <ta...@uyarer.com> wrote:

> Hi Bin,
>
> You have an interesting error. I don't use 1.7, but you could try running
> the crawl inside a screen session instead of nohup; I believe you will not
> get the same error.
>
> Talat

Re: Nutch Crawl a Specific List Of URLs (150K)

Posted by Talat Uyarer <ta...@uyarer.com>.
Hi Bin,

You have an interesting error. I don't use 1.7, but you could try running
the crawl inside a screen session instead of nohup; I believe you will not
get the same error.
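
For example (the session name is arbitrary):

screen -S nutch          # start a named session
bin/nutch crawl urls -dir result -depth 1 -topN 200000
# detach with Ctrl-a d, reattach later with: screen -r nutch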

Talat





-- 
Talat UYARER
Websitesi: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304

RE: Nutch Crawl a Specific List Of URLs (150K)

Posted by Markus Jelsma <ma...@openindex.io>.
Hi - Are they exact duplicates? If you inject http://nutch.apache.org/ a thousand times, it is added only once, and crawled only once, until it is scheduled to crawl again.
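
You can check how many distinct URLs actually made it into the crawldb with readdb (your -dir was result, so the crawldb should be at result/crawldb):

bin/nutch readdb result/crawldb -stats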
