Posted to user@nutch.apache.org by co...@complexityintelligence.com on 2012/01/02 13:14:15 UTC

RE: Filter by content language ID

Hello everyone,

   I've just finished testing my plug-in 'language-id-filter' that
is used to filter the indexing of documents by language id.

   I've two questions:

   1) The plug-in works like a charm; it is an indexing filter. BUT
      I guess that even after indexing, the content of filtered documents
      remains in the crawler segments, wasting a lot of disk space.

      How can I optimize this behaviour? I mean: I have to crawl and index
      only documents in languages X, Y and Z. Of course, I don't know
      the language of a document in advance, so I have to fetch it, check
      the language, and if it is OK, store the content (and index it
      later); otherwise I only want to store minimum information about
      skipped documents, or none at all. I'm new to Nutch so I don't know
      how to do that.

   2) I would like to make the language-id-filter plug-in available in
      the standard Nutch distribution. Is that possible?

Best Regards,
Alessio


-------- Original Message --------
Subject: Re: Filter by content language ID
From: Markus Jelsma <ma...@openindex.io>
Date: Tue, December 13, 2011 3:15 am
To: user@nutch.apache.org

The indexer part of this plugin will help you on your way.

http://wiki.apache.org/nutch/WritingPluginExample-1.2

> Like I said, create an indexing filter. The example on the wiki is very
> simple and clear. Just check the field created by the langid plugin and
> decide what to do with it. The field, when the plugin is present, is
> automatically added to the NutchDocument objects, which are passed through
> indexing filters and later transformed into SolrDocument objects.
> 
> > Hello,
> > 
> > After a lot of searching, I was unable to find up-to-date (Nutch 1.4)
> > info about how to use the language id for filtering. Some info is very
> > outdated and doesn't work at all with Nutch 1.4.
> > 
> > Basically we're testing Nutch for crawling 10M+ web pages, but we want
> > to deal only with pages that are in EN, IT, DE or FR, and skip the
> > others. In addition, when indexing with Solr, we need to store the field
> > holding the language id, to use it as a query filter (e.g.: "Only
> > pages in XX language that contain Y").
> > 
> > We're new to Nutch, and this seems to be a very common pattern, but as
> > stated, I was unable to find any up-to-date documentation. I think the
> > solution may be useful to many.
> > 
> > Please point me to a related resource or a hint to solve this task. I'd
> > be very happy to add this solution to the Wiki if possible.
> > 
> > Thanks,
> > Alessio
> > 
> > -------- Original Message --------
> > Subject: Re: Filter by content language ID
> > From: Markus Jelsma <ma...@openindex.io>
> > Date: Fri, December 02, 2011 8:49 am
> > To: user@nutch.apache.org
> > 
> > On Friday 02 December 2011 16:23:42 contacts@complexityintelligence.com wrote:
> > > Hello everyone,
> > > 
> > > 
> > > We've a set of urls to crawl, but we're interested in crawling only
> > > pages whose language is in our white list (e.g.: English, Italian,
> > > French), and reject all the others.
> > > 
> > > 
> > > I don't know if Nutch has built-in support for this; the
> > > language-detector seems to be dedicated to another task only.
> > 
> > You can use the field value added by the language detector to reject the
> > page from being indexed. Create a custom indexing filter, skipping all
> > documents you don't need.
> > 
> > > Which is the best way to achieve this with Nutch? Some configuration
> > > options, or is it necessary to write a new plug-in? (One that, for
> > > example, downloads the page, detects the content language, and if the
> > > language is OK, proceeds; otherwise the page is skipped.)
> > > 
> > > 
> > > Thanks,
> > > Alessio


Re: Filter by content language ID

Posted by Markus Jelsma <ma...@openindex.io>.

On Monday 02 January 2012 13:14:15 contacts@complexityintelligence.com wrote:
> Hello everyone,
> 
>    I've just finished testing my plug-in 'language-id-filter' that
> is used to filter the indexing of documents by language id.
> 
>    I've two questions:
> 
>    1) The plug-in works like a charm; it is an indexing filter. BUT
>       I guess that even after indexing, the content of filtered documents
>       remains in the crawler segments, wasting a lot of disk space.

Not possible. Deleting the whole segment is the only way to go. Rebuilding the
segment is a waste of resources.

> 
>       How can I optimize this behaviour? I mean: I have to crawl and index
>       only documents in languages X, Y and Z. Of course, I don't know
>       the language of a document in advance, so I have to fetch it, check
>       the language, and if it is OK, store the content (and index it
>       later); otherwise I only want to store minimum information about
>       skipped documents, or none at all. I'm new to Nutch so I don't know
>       how to do that.

One possibility is to enable parsing during fetch time and use a parse filter. 
When the document comes back you can get rid of the document by not storing 
it. It won't end up in a segment.
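
As a rough sketch of the keep-or-skip check such a filter would perform:
this is plain Java with no Nutch classes, so it only shows the decision,
not the plugin wiring. The class name LanguageAllowList and its methods are
hypothetical, and dropping documents with no detected language is an
assumption, not something Nutch prescribes. In a real plugin the detected
code would come from the field set by the language identifier plugin.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/**
 * Hypothetical helper, not part of Nutch: holds the wanted language codes
 * and answers whether a detected code is one of them.
 */
final class LanguageAllowList {
    private final Set<String> allowed = new HashSet<>();

    LanguageAllowList(String... codes) {
        for (String code : codes) {
            allowed.add(normalize(code));
        }
    }

    /** Lower-cases the code and drops any region suffix ("en-US" becomes "en"). */
    static String normalize(String code) {
        String c = code.trim().toLowerCase(Locale.ROOT);
        for (int i = 0; i < c.length(); i++) {
            char ch = c.charAt(i);
            if (ch == '-' || ch == '_') {
                return c.substring(0, i);
            }
        }
        return c;
    }

    /**
     * Returns true if the document should be kept. Assumption: documents
     * whose language could not be detected (null or blank) are skipped.
     */
    boolean accepts(String detected) {
        if (detected == null || detected.trim().isEmpty()) {
            return false;
        }
        return allowed.contains(normalize(detected));
    }

    public static void main(String[] args) {
        LanguageAllowList wanted = new LanguageAllowList("en", "it", "de", "fr");
        for (String lang : new String[] {"en", "IT", "fr-CA", "ja", ""}) {
            String verdict = wanted.accepts(lang) ? "keep" : "skip";
            System.out.println("'" + lang + "' -> " + verdict);
        }
    }
}
```

How the result is hooked into a parse filter or an indexing filter depends
on the Nutch version in use and is not shown here.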

> 
>    2) I would like to make the language-id-filter plug-in available in
>       the standard Nutch distribution. Is that possible?

Open a ticket at our Nutch Jira. 
https://issues.apache.org/jira/browse/NUTCH

> 
> Best Regards,
> Alessio
> 
> 

-- 
Markus Jelsma - CTO - Openindex