Posted to user@nutch.apache.org by POIRIER David <DP...@cross-systems.com> on 2008/06/04 14:23:59 UTC

Can I parse fetched segments more than once?

Hello,

Can I parse fetched segments more than once without having to fetch
everything again?

When I first tried to run the "./bin/nutch parse
./path/to/an/already/parsed/segment" command, I got a Java exception
explaining that the segment in question had already been parsed. Indeed,
the following subdirectories could be found under the segment directory:

segment/content
segment/crawl_fetch
segment/crawl_generate
segment/crawl_parse
segment/parse_data
segment/parse_text

To try to force the parsing process, I renamed the last 3 subdirectories
to something else and re-launched the "./bin/nutch parse" command. It has
been running for more than 24 hours... and it is still not over.

My plan is to then recreate an index from the newly parsed segment.

Is this the way to do it? Isn't there a simpler, and maybe quicker, way
to reparse segments?

Thank you,

David

Re: Can I parse fetched segments more than once?

Posted by Dennis Kubes <ku...@apache.org>.
Two things to be aware of with fetching:

1) The number of URLs from a specific host.  I think this is controlled
by the generate.max.per.host configuration variable.  If it is low (say
10-50), your fetches will be much faster and will not hang on single
large sites.

2) The fetcher.max.crawl.delay variable sets the threshold above which a
site will be ignored if its robots.txt specifies a crawl delay greater
than the variable.  By default this is 30 seconds and is usually not that
big of a problem, especially when the number of pages per site is capped
as in point 1.  The problem comes when you have 1000 pages from a single
site with a crawl delay of 20 seconds per page, or something similar.  A
sketch of both settings follows below.
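
Here is a minimal sketch of how both knobs could be set.  This assumes a
stock Nutch install: the property elements go inside <configuration> in
conf/nutch-site.xml, the values are illustrative only, and the exact
names are worth double-checking against your conf/nutch-default.xml:

  # Sketch only: add these inside <configuration> in conf/nutch-site.xml.
  #
  #   <property>
  #     <name>generate.max.per.host</name>
  #     <value>50</value>    <!-- cap on URLs generated per host -->
  #   </property>
  #   <property>
  #     <name>fetcher.max.crawl.delay</name>
  #     <value>30</value>    <!-- seconds; skip sites asking for more -->
  #   </property>
  #
  # A new fetch list must be generated for the caps to take effect
  # (paths follow the usual crawl/ layout):
  bin/nutch generate crawl/crawldb crawl/segments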

Dennis


RE: Can I parse fetched segments more than once?

Posted by POIRIER David <DP...@cross-systems.com>.
Thank you, Dennis.

I just re-launched a complete crawl of one source (one host), raising the fetcher.threads.per.host limit from 1 to 4, expecting this to speed up the fetch process... and hoping not to get blacklisted.
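
For the record, the change itself is small; roughly this (a sketch, with the segment path as a placeholder):

  # The property lives inside <configuration> in conf/nutch-site.xml:
  #
  #   <property>
  #     <name>fetcher.threads.per.host</name>
  #     <value>4</value>    <!-- was 1 -->
  #   </property>
  #
  # It only takes effect on the next fetch, e.g.:
  bin/nutch fetch crawl/segments/20080604142359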

I did check the CPU and memory status before killing the prior fetch of the same source, and everything was pretty OK. CPU load was topping out at 45%-50% and I still had 1 GB of RAM free. The Java process itself was not eating more than 150 MB of RAM. Disk access was also fine.

I just finished the fetch/parse/index of a filesystem using the protocol-file plugin and all my usual homemade plugins. The index was ready in about an hour and a half. I fetched 55,000 URIs.

I guess the slowdown on my regular internet sources is due to network latency.

Anyway, I'll keep you posted about the performance difference when using 4 threads instead of 1 per host.

David

Re: Can I parse fetched segments more than once?

Posted by Dennis Kubes <ku...@apache.org>.
Usually it is not logging (80 MB wouldn't really be that much; we have
some logs that grow by about a gigabyte a day).  Not having enough memory
set for either the Hadoop servers or the child JVM opts can cause lots of
swapping, which will slow things down.  Having too many active tasks at
once eating up CPU can also do it.  Since you added custom parsing
plugins, it may be that their processing takes a lot of CPU, or that they
eat up memory and eventually cause swapping.

From what you describe, my first thought would be a memory leak or a URL
with a lot of content.
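
If you want to rule out the memory settings, these are the two places to
look (a sketch; the property name below is the Hadoop default of that
era, and the values are only examples):

  # Heap for the Hadoop daemons, in conf/hadoop-env.sh:
  #   export HADOOP_HEAPSIZE=1000    # in MB
  #
  # Heap for the per-task child JVMs (where parsing actually runs),
  # inside <configuration> in conf/hadoop-site.xml:
  #   <property>
  #     <name>mapred.child.java.opts</name>
  #     <value>-Xmx512m</value>
  #   </property>
  #
  # Watching swap while a parse runs tells you quickly whether this
  # is the problem:
  vmstat 5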

Dennis

POIRIER David wrote:
> Dennis,
> 
> Thank you. The parse was over the second I received your mail. More than 24 hours... I wonder if this is because I added two more parser plugins, plugins writing a lot to the hadoop.log file. Actually this file get usually bigger then 80MO every day. Can that cause performance problems?
> 
> I also have performance problems when crawling a fairly big source (+30 000 urls). The fetching of the first 10 000 urls goes fairly rapidly, but then it takes forever for the last 20 000 urls. Can it be my parser plugins? The log file? Not enough fetching threads?
> 
> If you have any idea.
> 
> Thank you,
> 
> David
> 
> 
> 
> -----------------------------------------
> David Poirier
> E-business Consultant - Software Engineer
>  
> Direct: +41 (0)22 596 10 35
>  
> Cross Systems - Groupe Micropole Univers
> Route des Acacias 45 B
> 1227 Carouge / Genève
> Tél: +41 (0)22 308 48 60
> Fax: +41 (0)22 308 48 68
>  
> 
> -----Original Message-----
> From: Dennis Kubes [mailto:kubes@apache.org] 
> Sent: mercredi, 4. juin 2008 16:27
> To: nutch-user@lucene.apache.org
> Subject: Re: Can I parse more than once fetched segments?
> 
> You can if you remove the crawl_parse, parse_text, and parse_data 
> directories and then run the parse command.  Don't know why it would be 
> taking so long.
> 
> Dennis
> 
> POIRIER David wrote:
>> Hello,
>>
>> Can I parse more than once fetched segments without having to fetch
>> everything again?
>>
>> When I first tried to use the "./bin nutch parse
>> ./path/to/an/already/parsed/segment" command I got a java exception
>> explaining that the segment involved had already be parsed. Indeed the
>> following subdirectories could be found under the segment directory:
>>
>> segment/content
>> segment/crawl_fetch
>> segment/crawl_generate
>> segment/crawl_parse
>> segment/parse_data
>> segment/parse_text
>>
>> To try and force the parsing process I renamed the last 3 subdirectories
>> to something else and re-lunched the "./bin nutch parse" command. It has
>> been running for more than 24 hours... and it is still not over.
>>
>> My idea is to afterward recreate an index with the newly parsed segment.
>>
>> Is this the way to do it? Isn't there a simpler, and maybe quicker, way
>> to reparsed segments?
>>
>> Thank you,
>>
>> David

RE: Can I parse fetched segments more than once?

Posted by POIRIER David <DP...@cross-systems.com>.
Dennis,

Thank you. The parse finished the second I received your mail. More than 24 hours... I wonder if this is because I added two more parser plugins, which write a lot to the hadoop.log file. This file usually grows beyond 80 MB every day. Can that cause performance problems?
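
If the logging does turn out to matter, I suppose I could quiet the plugins down through the standard conf/log4j.properties (a sketch; the package name is just an example, not my real one):

  # Raise the log threshold for a chatty custom plugin in
  # conf/log4j.properties (this package name is hypothetical):
  #
  #   log4j.logger.com.example.nutch.parse=WARN
  #
  # and keep an eye on how fast the log actually grows:
  ls -lh logs/hadoop.log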

I also have performance problems when crawling a fairly big source (30,000+ URLs). Fetching the first 10,000 URLs goes fairly quickly, but then it takes forever for the last 20,000. Could it be my parser plugins? The log file? Not enough fetching threads?

If you have any ideas, I'd be glad to hear them.

Thank you,

David



-----------------------------------------
David Poirier
E-business Consultant - Software Engineer

Direct: +41 (0)22 596 10 35

Cross Systems - Groupe Micropole Univers
Route des Acacias 45 B
1227 Carouge / Genève
Tel: +41 (0)22 308 48 60
Fax: +41 (0)22 308 48 68


Re: Can I parse fetched segments more than once?

Posted by Dennis Kubes <ku...@apache.org>.
You can, if you remove the crawl_parse, parse_text, and parse_data
directories and then run the parse command.  I don't know why it would be
taking so long.
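
Something like this, assuming segments on the local filesystem and the
usual crawl/ layout (the segment name below is a placeholder):

  # Sketch of the reparse cycle; mv keeps an escape hatch in case the
  # new parse goes wrong (use rm -r once you are confident):
  s=crawl/segments/20080604142359
  mv "$s/crawl_parse" "$s/crawl_parse.bak"
  mv "$s/parse_data"  "$s/parse_data.bak"
  mv "$s/parse_text"  "$s/parse_text.bak"
  bin/nutch parse "$s"
  # Then rebuild the index from the reparsed segment.  This assumes the
  # crawldb and linkdb already exist; remove any old index dir first:
  rm -rf crawl/indexes
  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb "$s"

Renaming instead of deleting means the old parse output can be restored
if the new parse misbehaves.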

Dennis
