Posted to dev@nutch.apache.org by "İlhami KALKAN (JIRA)" <ji...@apache.org> on 2013/11/05 10:32:18 UTC

[jira] [Updated] (NUTCH-1663) Crawl page with specified language

     [ https://issues.apache.org/jira/browse/NUTCH-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

İlhami KALKAN updated NUTCH-1663:
---------------------------------

    Attachment: README.txt
                language-filter.patch

I added a language-filter plugin that filters pages by language while crawling. To use it, the language-identifier plugin must run before this plugin. The language-filter plugin reads each page's language from its metadata and keeps or removes that page's outlinks according to the "language.filter.type" property in nutch-site.xml. If this property is set to "accept", the plugin allows only the languages listed in "language.filter.languages" (entries must be ISO 639 language codes) and removes the outlinks of pages in any other language. If it is set to "filter", the plugin removes the outlinks of pages whose language matches an entry in "language.filter.languages".
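
The accept/filter behaviour described above can be illustrated with a small standalone sketch. This is not the code from language-filter.patch: the class name LanguageFilterSketch, the method keepOutlinks, the comma-separated format of the language list, and the handling of pages with no detected language are all assumptions made for illustration only.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Illustrative sketch only; names and details are hypothetical, not taken from the patch. */
final class LanguageFilterSketch {

    /** Mirrors the two values described for "language.filter.type". */
    enum Mode { ACCEPT, FILTER }

    private final Mode mode;
    private final Set<String> languages = new HashSet<>();

    /**
     * @param type         value of "language.filter.type" ("accept" or "filter")
     * @param languageList value of "language.filter.languages"; a comma-separated
     *                     list of ISO 639 codes is assumed here
     */
    LanguageFilterSketch(String type, String languageList) {
        this.mode = Mode.valueOf(type.trim().toUpperCase(Locale.ROOT));
        for (String code : languageList.split(",")) {
            String normalized = code.trim().toLowerCase(Locale.ROOT);
            if (!normalized.isEmpty()) {
                languages.add(normalized);
            }
        }
    }

    /** Returns true if the outlinks of a page in the given language should be kept. */
    boolean keepOutlinks(String pageLanguage) {
        if (pageLanguage == null) {
            // Assumption: a page with no detected language is treated as "not listed".
            return mode == Mode.FILTER;
        }
        boolean listed = languages.contains(pageLanguage.trim().toLowerCase(Locale.ROOT));
        // accept: keep only listed languages; filter: drop listed languages.
        return mode == Mode.ACCEPT ? listed : !listed;
    }
}
```

For the Turkish-only crawl from the issue description, this sketch would be constructed with type "accept" and language list "tr", so that only outlinks of pages identified as Turkish are kept.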

> Crawl page with specified language
> ----------------------------------
>
>                 Key: NUTCH-1663
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1663
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 2.2.1
>            Reporter: İlhami KALKAN
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: README.txt, language-filter.patch
>
>
> Users can crawl only pages in a specified language. For example, we may want to crawl only pages whose language is Turkish.



--
This message was sent by Atlassian JIRA
(v6.1#6144)