Posted to user@nutch.apache.org by Markus Jelsma <ma...@openindex.io> on 2011/07/18 17:10:50 UTC

OutlinkExtractor, configure schema in regex

Hi,

The reducer of a huge parse takes forever! It trips over numerous URL filter 
exceptions, mostly stuff like:

2011-07-18 15:07:15,360 ERROR 
org.apache.nutch.urlfilter.domain.DomainURLFilter: Could not apply filter on 
url: Anlagen:AdresseAvans
java.net.MalformedURLException: unknown protocol: anlagen
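
For reference, the exception comes from java.net.URL, which the domain filter 
apparently constructs for each candidate outlink; any token whose "scheme" has 
no registered protocol handler blows up. A minimal illustration (not Nutch 
code, just the JDK behaviour):

import java.net.URL;

public class UnknownProtocolDemo {
  public static void main(String[] args) throws Exception {
    // Throws java.net.MalformedURLException: unknown protocol: anlagen,
    // because "anlagen" is not a scheme the JDK has a handler for.
    new URL("Anlagen:AdresseAvans");
  }
}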

I suspect the issue is the OutlinkExtractor, being a bit too eager. How about 
making it a bit more configurable? This is now a real waste of CPU-cycles.

Thanks

Re: OutlinkExtractor, configure schema in regex

Posted by Markus Jelsma <ma...@openindex.io>.
Hi Julien,

On Tuesday 19 July 2011 11:20:30 Julien Nioche wrote:
> Hi Markus
> 
> On 18 July 2011 23:46, Markus Jelsma <ma...@openindex.io> wrote:
> > I've modified the regular expression in OutlinkExtractor not to allow URI
> > schemes other than http:// and I can confirm a significant increase in
> > throughput.
> 
> Can't remember how the OutlinkExtractor works but are relative URLs already
> normalised into full form at that stage?
> Bear in mind that we also handle other protocols such as file://, ftp://
> and https://, so it is not only about http://.

Correct. The prefix URL filter's settings could be used. For example, the 
leading (scheme) part of the current extractor's URL regex is:

([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/]

This can be greatly simplified by tying in the settings from the prefix URL 
filter. If one filters for https | file | http, you would get the following 
partial regex for the scheme:

(^|[ \t\r\n])((http|https|file):

This would mean no outlinks are extracted at all besides the schemes we want, 
which greatly reduces the total number of extracted outlinks. If you don't do 
this, the extractor comes up with countless 'URLs' from plain text (or parsed 
PDFs etc.) such as:

id:12
says:how

...and other parts of normal text.
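
As a rough sketch of the idea in Java (the tail character class below is a 
simplification, not the extractor's real pattern, and the scheme list would 
come from the prefix URL filter's entries), restricting the scheme alternation 
keeps such tokens from ever matching:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SchemeRestrictedOutlinks {
  // Scheme alternation as it could be derived from the prefix URL filter
  // (here hard-coded to http, https and file); the tail is simplified.
  private static final Pattern OUTLINK = Pattern.compile(
      "(^|[ \\t\\r\\n])((http|https|file):[^ \\t\\r\\n<>\"]+)");

  public static void main(String[] args) {
    String text = "See http://example.com/page and noise like id:12 says:how";
    Matcher m = OUTLINK.matcher(text);
    while (m.find()) {
      System.out.println(m.group(2)); // prints only http://example.com/page
    }
  }
}

With the restricted alternation, tokens like id:12 and says:how never reach 
the URL filters at all, which is where the cycles are saved.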

> 
> > The previous parse/reduce took ages and had only ~600,000 random internet
> > documents to process. Another parse/reduce did it in less than half the
> > time and had 33% more documents. Instead of countless exceptions this
> > produces fewer than 10 for all documents.
> 
> JIRA + patch? Am sure the outlink extractor could be improved indeed

Yes, I shall open an issue.

> 
> > Wouldn't it be a good idea to connect the various URL filters to Nutch's
> > own outlink extractor? It shouldn't be hard to create a partial regex
> > from some simple URL filters. Since URLs extracted by the regex are
> > still processed by filters and/or normalizers, there would be a huge gain
> > in throughput when we 1) simplify the regex and 2) stop unwanted URLs
> > and 'URLs' at the gate.
> 
> Unlike URLNormalisers, URLFilters don't have a realm, so when you apply the
> filtering ALL the filters are used; you can't have a specific set of
> filters for that particular stage.

I know. I shouldn't have mentioned normalizers at all; it only adds confusion.

> 
> Why don't you simply specify the regex-based URLFilters to be applied
> BEFORE the domain one? It would simply be a matter of setting something
> like http://.+ or whatever protocol you are using. This way you won't get
> any issues with the DomainFilter.

I can indeed put filters in a specific order, but my goal is to reduce the 
number of URLs produced by the extractor in the first place. If the extractor 
produces fewer unwanted URLs (that are filtered away anyway), we save a lot of 
cycles.

> 
> > And how could crawler-commons be fitted into Nutch's outlink extractor or
> > even Tika for HTML documents?
> 
> What for? We don't do URL filtering in CC yet.

Indeed, not yet. I'll try to come up with a patch, starting by folding the 
prefix URL filter's settings into the regex of OutlinkExtractor.

> 
> The Tika parser has a ContentHandler which extracts the links; we could
> use this instead of OutlinkExtractor and see how it fares.
> 
> Julien


Thank you for your comments.
Markus

> 
> > > Hi,
> > > 
> > > The reducer of a huge parse takes forever! It trips over numerous URL
> > > filter exceptions, mostly stuff like:
> > > 
> > > 2011-07-18 15:07:15,360 ERROR
> > > org.apache.nutch.urlfilter.domain.DomainURLFilter: Could not apply
> > > filter on url: Anlagen:AdresseAvans
> > > java.net.MalformedURLException: unknown protocol: anlagen
> > > 
> > > I suspect the issue is the OutlinkExtractor, being a bit too eager. How
> > > about making it a bit more configurable? This is now a real waste of
> > > CPU-cycles.
> > > 
> > > Thanks


Re: OutlinkExtractor, configure schema in regex

Posted by Julien Nioche <li...@gmail.com>.
Hi Markus

On 18 July 2011 23:46, Markus Jelsma <ma...@openindex.io> wrote:

> I've modified the regular expression in OutlinkExtractor not to allow URI
> schemes other than http:// and I can confirm a significant increase in
> throughput.
>

Can't remember how the OutlinkExtractor works but are relative URLs already
normalised into full form at that stage?
Bear in mind that we also handle other protocols such as file://, ftp://
and https://, so it is not only about http://.


> The previous parse/reduce took ages and had only ~600,000 random internet
> documents to process. Another parse/reduce did it in less than half the time
> and had 33% more documents. Instead of countless exceptions this produces
> fewer than 10 for all documents.
>

JIRA + patch? Am sure the outlink extractor could be improved indeed


>
> Wouldn't it be a good idea to connect the various URL filters to Nutch's own
> outlink extractor? It shouldn't be hard to create a partial regex from some
> simple URL filters. Since URLs extracted by the regex are still processed by
> filters and/or normalizers, there would be a huge gain in throughput when we
> 1) simplify the regex and 2) stop unwanted URLs and 'URLs' at the gate.
>

Unlike URLNormalisers, URLFilters don't have a realm, so when you apply the
filtering ALL the filters are used; you can't have a specific set of filters
for that particular stage.

Why don't you simply specify the regex-based URLFilters to be applied BEFORE
the domain one? It would simply be a matter of setting something like
http://.+ or whatever protocol you are using. This way you won't get any
issues with the DomainFilter.
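
A minimal sketch of that ordering, assuming the stock plugin class names 
(adjust to whatever filters you actually have enabled): in nutch-site.xml 
force the regex filter to run before the domain filter,

<property>
  <name>urlfilter.order</name>
  <value>org.apache.nutch.urlfilter.regex.RegexURLFilter org.apache.nutch.urlfilter.domain.DomainURLFilter</value>
</property>

and in regex-urlfilter.txt accept only the protocol(s) you crawl, e.g.

# keep http(s), drop everything else before it reaches the domain filter
+^https?://
-.

Since filtering stops at the first filter that rejects a URL, tokens like 
Anlagen:AdresseAvans would be discarded by the regex filter and never hit the 
DomainURLFilter.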


> And how could crawler-commons be fitted into Nutch's outlink extractor, or
> even Tika for HTML documents?

What for? We don't do URL filtering in CC yet.

The Tika parser has a ContentHandler which extracts the links; we could use
this instead of OutlinkExtractor and see how it fares.
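
A rough sketch of what that could look like, assuming Tika's LinkContentHandler 
(the exact wiring into the Nutch parse plugins is left out here):

import java.io.ByteArrayInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.Link;
import org.apache.tika.sax.LinkContentHandler;

public class TikaOutlinkSketch {
  public static void main(String[] args) throws Exception {
    byte[] html = "<html><body><a href=\"http://example.com/\">example</a></body></html>"
        .getBytes("UTF-8");
    // LinkContentHandler collects the outgoing links while the HTML is parsed.
    LinkContentHandler links = new LinkContentHandler();
    InputStream in = new ByteArrayInputStream(html);
    new HtmlParser().parse(in, links, new Metadata(), new ParseContext());
    for (Link link : links.getLinks()) {
      // getUri() is the raw href; it would still need resolving against the
      // base URL and a pass through the URL filters before becoming an Outlink.
      System.out.println(link.getUri());
    }
  }
}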

Julien




>
> > Hi,
> >
> > The reducer of a huge parse takes forever! It trips over numerous URL
> > filter exceptions, mostly stuff like:
> >
> > 2011-07-18 15:07:15,360 ERROR
> > org.apache.nutch.urlfilter.domain.DomainURLFilter: Could not apply filter
> > on url: Anlagen:AdresseAvans
> > java.net.MalformedURLException: unknown protocol: anlagen
> >
> > I suspect the issue is the OutlinkExtractor, being a bit too eager. How
> > about making it a bit more configurable? This is now a real waste of
> > CPU-cycles.
> >
> > Thanks
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: OutlinkExtractor, configure schema in regex

Posted by Markus Jelsma <ma...@openindex.io>.
I've modified the regular expression in OutlinkExtractor not to allow URI 
schemes other than http://, and I can confirm a significant increase in 
throughput.

The previous parse/reduce took ages and had only ~600,000 random internet 
documents to process. Another parse/reduce did it in less than half the time 
and had 33% more documents. Instead of countless exceptions this produces 
fewer than 10 for all documents.

Wouldn't it be a good idea to connect the various URL filters to Nutch's own 
outlink extractor? It shouldn't be hard to create a partial regex from some 
simple URL filters. Since URLs extracted by the regex are still processed by 
filters and/or normalizers, there would be a huge gain in throughput when we 1) 
simplify the regex and 2) stop unwanted URLs and 'URLs' at the gate.
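
A tiny sketch of how such a partial regex could be derived (the scheme list is 
hard-coded here; reading it from the prefix filter's configuration is left out):

import java.util.Arrays;
import java.util.List;

public class SchemeRegexBuilder {
  public static void main(String[] args) {
    // Hypothetically taken from the prefix URL filter's entries.
    List<String> schemes = Arrays.asList("http", "https", "file");
    StringBuilder alternation = new StringBuilder();
    for (String s : schemes) {
      if (alternation.length() > 0) alternation.append('|');
      alternation.append(s);
    }
    // Partial scheme regex in the shape discussed earlier in the thread.
    String schemePart = "(^|[ \\t\\r\\n])((" + alternation + "):";
    System.out.println(schemePart); // (^|[ \t\r\n])((http|https|file):
  }
}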

And how could crawler-commons be fitted into Nutch's outlink extractor, or 
even Tika for HTML documents?

> Hi,
> 
> The reducer of a huge parse takes forever! It trips over numerous URL
> filter exceptions, mostly stuff like:
> 
> 2011-07-18 15:07:15,360 ERROR
> org.apache.nutch.urlfilter.domain.DomainURLFilter: Could not apply filter
> on url: Anlagen:AdresseAvans
> java.net.MalformedURLException: unknown protocol: anlagen
> 
> I suspect the issue is the OutlinkExtractor, being a bit too eager. How
> about making it a bit more configurable? This is now a real waste of
> CPU-cycles.
> 
> Thanks