Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2011/04/01 16:41:06 UTC

[jira] [Closed] (NUTCH-389) a url tokenizer implementation for tokenizing index fields : url and host

     [ https://issues.apache.org/jira/browse/NUTCH-389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-389.
-------------------------------

    Resolution: Won't Fix

> a url tokenizer implementation for tokenizing index fields : url and host
> -------------------------------------------------------------------------
>
>                 Key: NUTCH-389
>                 URL: https://issues.apache.org/jira/browse/NUTCH-389
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>            Priority: Minor
>         Attachments: urlTokenizer-improved.diff, urlTokenizer.diff
>
>
> NutchAnalysis.jj tokenizes the input by treating & and _ as non-separator characters, which is not appropriate for URLs. So I have written a URL tokenizer that emits the tokens matching the regular expression [a-zA-Z0-9]. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html, which describes the grammar for URIs, URLs can be tokenized with the above expression.
> NutchDocumentAnalyzer code is modified to use the UrlTokenizer with the "url", "site" and "host" fields.
> See: http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html
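
For illustration only, the tokenization rule described above might be sketched as follows. This is not the attached urlTokenizer.diff and does not use the Nutch or Lucene analysis APIs; the class name UrlTokenizerSketch and the method tokenize are hypothetical, and the "+" quantifier (one token per run of alphanumerics) is an assumption about how the character class [a-zA-Z0-9] is meant to be applied.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: splits a URL into runs of ASCII letters and digits.
public class UrlTokenizerSketch {
    // Assumption: a token is a maximal run of characters in [a-zA-Z0-9].
    private static final Pattern TOKEN = Pattern.compile("[a-zA-Z0-9]+");

    public static List<String> tokenize(String url) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(url);
        // Every other character (including & and _) acts as a separator.
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [http, example, com, a, b, x, 1, y, 2]
        System.out.println(tokenize("http://example.com/a_b?x=1&y=2"));
    }
}
```

With this rule, "a_b" and "x=1&y=2" are split at the _ and & characters, which is the behaviour the issue asks for on the "url", "site" and "host" fields.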

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira