Posted to user@nutch.apache.org by Iain Lopata <il...@hotmail.com> on 2015/01/29 23:20:07 UTC

InvertLinks Performance Nutch 1.6

I am running the invertlinks step in my Nutch 1.6 based crawl process on a
single node.  I run invertlinks only because I need the Inlinks in the
indexer step so as to store them with the document.  I do not need the
anchor text and I am not scoring.  I am finding that invertlinks (and more
specifically the merge of the linkdb) takes a long time - about 30 minutes
for a crawl of around 150K documents.  I am looking for ways that I might
shorten this processing time.  Any suggestions? 

I actually only need the Inlinks for a subset of my documents, which could
be identified either by a URL regex pattern match or by MIME type.  This
would be a case where a scoped filter for the invertlinks step might be
helpful, but I understand that scoping is only available for normalizers.
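
[Editor's note: since the poster mentions normalizer scoping, here is a hedged sketch of what that could look like. Nutch's URL normalizers can be configured per scope via a urlnormalizer.scope.<scope_name> property, and link inversion is believed to use the "linkdb" scope -- both the property name and the idea that this helps here are assumptions to verify against your Nutch 1.6 install, not a confirmed fix.]

```xml
<!-- Sketch for conf/nutch-site.xml. Assumes the per-scope normalizer
     property (urlnormalizer.scope.<scope_name>) and that link
     inversion uses the "linkdb" scope; check against your version.
     This limits inversion to the basic normalizer only. -->
<property>
  <name>urlnormalizer.scope.linkdb</name>
  <value>org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer</value>
</property>
```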

Thanks


Re: InvertLinks Performance Nutch 1.6

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
WOW friggin awesome

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++








Re: InvertLinks Performance Nutch 1.6

Posted by Sebastian Nagel <wa...@googlemail.com>.
Yeah, impressive.

The defaults are not really optimal
for production crawls where it's unlikely
that URL filter / normalization rules get
changed somewhere in between the steps
of a running crawl.

Ideally, URLs should be filtered / normalized
only when new URLs are added to the CrawlDb:
- seeds
- outlinks
- redirects
But there may be other opinions with better
arguments; are there any?

Regarding the outlinks I'm not sure whether
it's better to do normalization and filtering
during the parse job or when updating CrawlDb.

Feel free to continue the discussion or open
a Jira to improve the default configuration.

Thanks,
Sebastian




RE: InvertLinks Performance Nutch 1.6

Posted by Iain Lopata <il...@hotmail.com>.
Reduced processing time from 40 minutes down to 30 seconds!  Thank you!




RE: InvertLinks Performance Nutch 1.6

Posted by Iain Lopata <il...@hotmail.com>.
Thanks Sebastian -- I had not turned off filtering/normalization and did not appreciate that they could be a significant factor. I will give that a try.



Re: InvertLinks Performance Nutch 1.6

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Iain,

Is the link inversion done with URL normalization/filtering?
That could take a long time if there are many links,
possibly in combination with complex filters or long URLs
(which make the regex filter slow).

Filtering/normalization is on by default.
You have to disable it explicitly via:
% nutch invertlinks ... -noNormalize -noFilter
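
[Editor's note: for reference, a sketch of the two invocations. NUTCH_HOME and the crawl directory names are placeholders, not paths from the thread; adjust to your own layout.]

```shell
# Sketch only: NUTCH_HOME and the crawl directories are placeholders.
NUTCH_HOME=/opt/nutch
CRAWL_DIR=crawl

# Default run: every link URL passes through normalizers and filters.
#   "$NUTCH_HOME/bin/nutch" invertlinks "$CRAWL_DIR/linkdb" -dir "$CRAWL_DIR/segments"

# Skip both steps when the rules have not changed since the URLs
# entered the CrawlDb:
CMD="$NUTCH_HOME/bin/nutch invertlinks $CRAWL_DIR/linkdb -dir $CRAWL_DIR/segments -noNormalize -noFilter"
echo "$CMD"
```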

Best,
Sebastian


