Posted to dev@nutch.apache.org by "Enis Soztutar (JIRA)" <ji...@apache.org> on 2006/10/20 10:45:35 UTC

[jira] Created: (NUTCH-389) a url tokenizer implementation for tokenizing index fields : url and host

a url tokenizer implementation for tokenizing index fields : url and host 
--------------------------------------------------------------------------

                 Key: NUTCH-389
                 URL: http://issues.apache.org/jira/browse/NUTCH-389
             Project: Nutch
          Issue Type: Improvement
          Components: indexer
    Affects Versions: 0.9.0
            Reporter: Enis Soztutar
            Priority: Minor


NutchAnalysis.jj tokenizes the input by treating & and _ as non-separator characters, which is not appropriate for URLs. So I have written a URL tokenizer which emits the tokens that match the regular expression [a-zA-Z0-9]. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html, which describes the grammar for URIs, URLs can be tokenized with the above expression.


see : http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html
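
The tokenization rule described above can be sketched in a few lines. This is only an illustration in Python, not the Java code from the patch: the helper name tokenize_url, the example URL, and the "+" quantifier on the character class are assumptions made for the sketch.

```python
import re

# Assumption: the character class from the description is applied with a
# "+" quantifier, so each maximal run of ASCII letters/digits is one token.
URL_TOKEN = re.compile(r"[a-zA-Z0-9]+")


def tokenize_url(url):
    """Return the alphanumeric runs of url, in order (hypothetical helper)."""
    return URL_TOKEN.findall(url)


# "_" and "&" act as separators here, like every other non-alphanumeric character.
print(tokenize_url("http://example.org/some_path?a&b_c"))
```

With this rule every character outside [a-zA-Z0-9] ends the current token, so no URL-specific delimiter list is needed.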

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (NUTCH-389) a url tokenizer implementation for tokenizing index fields : url and host

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-389?page=comments#action_12444510 ] 
            
Otis Gospodnetic commented on NUTCH-389:
----------------------------------------

Enis:
Can you give us some examples of how URLs were tokenized before, and how they are tokenized with your patch?

For example:

http://www.foo_bar.com/baz_bar?car&dar_mar

How is this tokenized with your patch, and how was it done before?

Thanks.


[jira] Updated: (NUTCH-389) a url tokenizer implementation for tokenizing index fields : url and host

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-389?page=all ]

Enis Soztutar updated NUTCH-389:
--------------------------------

    Attachment: urlTokenizer.diff

patch for url tokenization


[jira] Updated: (NUTCH-389) a url tokenizer implementation for tokenizing index fields : url and host

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-389?page=all ]

Enis Soztutar updated NUTCH-389:
--------------------------------

    Description: 
NutchAnalysis.jj tokenizes the input by treating & and _ as non-separator characters, which is not appropriate for URLs. So I have written a URL tokenizer which emits the tokens that match the regular expression [a-zA-Z0-9]. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html, which describes the grammar for URIs, URLs can be tokenized with the above expression.

NutchDocumentAnalyzer code is modified to use the UrlTokenizer with the "url", "site" and "host" fields.


see : http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html

  was:
NutchAnalysis.jj tokenizes the input by treating & and _ as non-separator characters, which is not appropriate for URLs. So I have written a URL tokenizer which emits the tokens that match the regular expression [a-zA-Z0-9]. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html, which describes the grammar for URIs, URLs can be tokenized with the above expression.


see : http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html



[jira] Commented: (NUTCH-389) a url tokenizer implementation for tokenizing index fields : url and host

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-389?page=comments#action_12445512 ] 
            
Enis Soztutar commented on NUTCH-389:
-------------------------------------

Otis, you can test the tokenizer using the TestUrlTokenizer JUnit test case. And you can test the NutchDocumentTokenizer by running the NutchDocumentTokenizer's main method.

NutchDocumentTokenizer tokenizes http://www.foo_bar.com/baz_bar?car&dar_mar as

    http www foo_bar com baz_bar car&dar_mar


whereas urlTokenizer tokenizes the above URL as

    http www foo bar com baz bar car dar mar

so it will hit the queries "baz", "bar", "car", "dar" and "mar" as well.

for the url http://www.google.com.tr/firefox?client=firefox-a&rls=org.mozilla:en-US:official

NutchDocumentTokenizer gives tokens : http www google com tr firefox client firefox a&rls org mozilla en us official
urlTokenizer gives tokens : http www google com tr firefox client firefox a rls org mozilla en US official 
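
The difference between the two outputs can be reproduced with a rough Python sketch. It is not the code of either tokenizer: OLD_STYLE only approximates the existing behavior by letting "_" and "&" stay inside a token (the real NutchAnalysis.jj grammar has more rules, including any case handling), and the names OLD_STYLE, NEW_STYLE and tokens are made up for the illustration.

```python
import re

# Assumption: the existing behavior is approximated by also allowing "_" and
# "&" inside a token. This is not the actual NutchAnalysis.jj grammar.
OLD_STYLE = re.compile(r"[a-zA-Z0-9_&]+")
# Rule from the issue description: only ASCII letters and digits form tokens.
NEW_STYLE = re.compile(r"[a-zA-Z0-9]+")


def tokens(pattern, url):
    """Join the matches of pattern in url with single spaces (hypothetical helper)."""
    return " ".join(pattern.findall(url))


url = "http://www.foo_bar.com/baz_bar?car&dar_mar"
print(tokens(OLD_STYLE, url))  # foo_bar, baz_bar and car&dar_mar stay whole
print(tokens(NEW_STYLE, url))  # every "_" and "&" now splits a token
```

A query for "bar" can only match the second form, because "bar" never appears as a token of its own in the first.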




[jira] Updated: (NUTCH-389) a url tokenizer implementation for tokenizing index fields : url and host

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-389?page=all ]

Enis Soztutar updated NUTCH-389:
--------------------------------

    Attachment: urlTokenizer-improved.diff

This is an improvement and a minor bug fix over the previous URL tokenizer. This version first decodes characters that are represented in hexadecimal (percent-encoded) form in the URLs.

For example, the URL "file:///tmp/foo%20baz%20bar/foo/baz~bar/index.html" will first be converted to "file:///tmp/foo baz bar/foo/baz~bar/index.html" by replacing each %20 sequence with a space.

A NullPointerException is fixed for the case where the input reader returns null for the URL.
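
The decode-then-tokenize order, together with the null guard, can be sketched as follows. This is a Python illustration rather than the patch itself: tokenize_url is a hypothetical helper, urllib.parse.unquote stands in for whatever decoding the patch performs, and returning an empty list for a missing URL is an assumption about the intended behavior.

```python
import re
from urllib.parse import unquote

URL_TOKEN = re.compile(r"[a-zA-Z0-9]+")


def tokenize_url(url):
    """Percent-decode url, then split it into alphanumeric tokens."""
    if url is None:
        # Assumption: a missing URL yields no tokens instead of an error.
        return []
    # Decode first, so that "%20" becomes a space rather than the digits "20".
    return URL_TOKEN.findall(unquote(url))


print(tokenize_url("file:///tmp/foo%20baz%20bar/foo/baz~bar/index.html"))
# Without decoding, the hex digits leak into the following token:
print(URL_TOKEN.findall("foo%20baz%20bar"))
```

Decoding before tokenizing matters because the hex digits of an escape sequence are themselves alphanumeric and would otherwise be glued to the next token.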

Further improvements on the url tokenization can be discussed here. 

