Posted to dev@nutch.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/08/28 12:52:00 UTC

[jira] [Commented] (NUTCH-2414) Allow LanguageIndexingFilter to actually filter documents by language.

    [ https://issues.apache.org/jira/browse/NUTCH-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16143737#comment-16143737 ] 

ASF GitHub Bot commented on NUTCH-2414:
---------------------------------------

pipldev opened a new pull request #217: NUTCH-2414 - Allow LanguageIndexingFilter to actually filter documents by language
URL: https://github.com/apache/nutch/pull/217
 
 
   Added the property lang.index.languages. If it exists and is not empty, it is treated as a comma-separated list of languages to index; a document in any other language will not be indexed.
   "unknown" is a valid language code.
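
   The behaviour described above could be sketched roughly as follows. This is a standalone illustration, not the code from the pull request: the class name LanguageAllowList and the method shouldIndex are hypothetical, and trimming whitespace, ignoring case, and mapping a missing language to "unknown" are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper; not part of Nutch or of the pull request.
final class LanguageAllowList {
    // Empty set means "no restriction configured".
    private final Set<String> allowed;

    // propertyValue stands in for the raw value of lang.index.languages.
    LanguageAllowList(String propertyValue) {
        if (propertyValue == null || propertyValue.trim().isEmpty()) {
            allowed = Collections.emptySet();
        } else {
            // Assumption: entries are trimmed and compared case-insensitively.
            allowed = Arrays.stream(propertyValue.split(","))
                    .map(s -> s.trim().toLowerCase(Locale.ROOT))
                    .filter(s -> !s.isEmpty())
                    .collect(Collectors.toSet());
        }
    }

    // Returns true if a document with the detected language should be indexed.
    boolean shouldIndex(String detectedLanguage) {
        if (allowed.isEmpty()) {
            return true; // property missing or empty: index everything
        }
        // Assumption: a missing language is treated as the code "unknown".
        String lang = detectedLanguage == null
                ? "unknown"
                : detectedLanguage.trim().toLowerCase(Locale.ROOT);
        return allowed.contains(lang);
    }

    public static void main(String[] args) {
        LanguageAllowList list = new LanguageAllowList("en, de,unknown");
        System.out.println(list.shouldIndex("en"));  // true
        System.out.println(list.shouldIndex("fr"));  // false
        System.out.println(list.shouldIndex(null));  // true, via "unknown"
    }
}
```

   In an indexing filter, a check like this would presumably run after language detection, with the document dropped when shouldIndex returns false.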
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Allow LanguageIndexingFilter to actually filter documents by language.
> ----------------------------------------------------------------------
>
>                 Key: NUTCH-2414
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2414
>             Project: Nutch
>          Issue Type: Improvement
>          Components: plugin
>    Affects Versions: 1.13
>            Reporter: Yossi Tamari
>            Priority: Minor
>
> It is often useful to only index pages in select languages (e.g. only those languages that we intend to search in). At first glance it seems that this is done by LanguageIndexingFilter, but currently all the filter does is add the language as a field to the index.
> We can add a configuration property to LanguageIndexingFilter that will allow it to only index languages specified in this property.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)