Posted to user@nutch.apache.org by Vishal Shah <vi...@rediff.co.in> on 2006/09/27 09:53:17 UTC

Problem in URL tokenization

Hi,
 
   If I understand correctly, there is a common tokenizer for all fields
(URL, content, meta etc.). This tokenizer does not use the underscore
character as a separator. Since a lot of URLs use underscore to separate
different words, it would be better if the URLs are tokenized slightly
differently from the other fields. I tried looking at the
NutchDocumentAnalyzer and related files, but can't figure out a clear
way to implement a new tokenizer for URLs only. Any ideas as to how to
go about doing this?
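For example (a standalone plain-Java sketch of the behavior I mean, not actual Nutch code):

```java
import java.util.Arrays;

public class UnderscoreSplitDemo {
    public static void main(String[] args) {
        // A typical URL path segment that uses underscores as word separators.
        String segment = "annual_report_2006";

        // The current tokenizer keeps this as a single token; treating "_"
        // as a separator would instead produce the individual words:
        String[] words = segment.split("_");
        System.out.println(Arrays.toString(words)); // [annual, report, 2006]
    }
}
```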
 
Thanks,
 
-vishal.

RE: Problem in URL tokenization

Posted by Vishal Shah <vi...@rediff.co.in>.
Hi Enis,

   Thanks a lot for the reply. I wasn't too sure about the .jj files,
I'll try it out next week.

Regards,

-v.

-----Original Message-----
From: Enis Soztutar [mailto:enis.soz.nutch@gmail.com] 
Sent: Wednesday, September 27, 2006 7:39 PM
To: nutch-user@lucene.apache.org
Subject: Re: Problem in URL tokenization

Vishal Shah wrote:
> Hi,
>  
>    If I understand correctly, there is a common tokenizer for all fields
> (URL, content, meta etc.). This tokenizer does not use the underscore
> character as a separator. Since a lot of URLs use underscore to separate
> different words, it would be better if the URLs are tokenized slightly
> differently from the other fields. I tried looking at the
> NutchDocumentAnalyzer and related files, but can't figure out a clear
> way to implement a new tokenizer for URLs only. Any ideas as to how to
> go about doing this?
>  
> Thanks,
>  
> -vishal.
>
>   
Hi, it is not straightforward to implement this without modifying the
default tokenizing behavior.

First, copy NutchAnalysis.jj to URLAnalysis.jj (or another name you
like) and change
| <#WORD_PUNCT: ("_"|"&")>
to:
| <#WORD_PUNCT: ("&")>
and recompile with JavaCC.

Then copy NutchDocumentTokenizer to URLTokenizer, and change the
NutchAnalysisTokenManager references to URLAnalysisTokenManager.

Then write an Analyzer along these lines:

private static class URLAnalyzer extends Analyzer {

    public TokenStream tokenStream(String field, Reader reader) {
        return new URLTokenizer(reader);
    }
}

And finally, in NutchDocumentAnalyzer, change

    if ("anchor".equals(fieldName))
      analyzer = ANCHOR_ANALYZER;
    else
      analyzer = CONTENT_ANALYZER;

to

    if ("anchor".equals(fieldName))
      analyzer = ANCHOR_ANALYZER;
    else if ("url".equals(fieldName))
      analyzer = URL_ANALYZER;
    else
      analyzer = CONTENT_ANALYZER;

assuming URL_ANALYZER is an instance of URLAnalyzer.

I have not tested this but it should work as expected.
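To make the field-based dispatch concrete, here is a self-contained sketch of the pattern (using simplified stand-in types, since the real code depends on the Lucene/Nutch Analyzer classes):

```java
public class FieldDispatchDemo {

    // Stand-in for the Lucene Analyzer type, so this sketch compiles alone.
    interface Analyzer { String name(); }

    static final Analyzer ANCHOR_ANALYZER  = () -> "anchor";
    static final Analyzer URL_ANALYZER     = () -> "url";
    static final Analyzer CONTENT_ANALYZER = () -> "content";

    // Mirrors the if/else chain in NutchDocumentAnalyzer: pick an analyzer
    // based on the field name, falling back to the content analyzer.
    static Analyzer analyzerFor(String fieldName) {
        Analyzer analyzer;
        if ("anchor".equals(fieldName))
            analyzer = ANCHOR_ANALYZER;
        else if ("url".equals(fieldName))
            analyzer = URL_ANALYZER;
        else
            analyzer = CONTENT_ANALYZER;
        return analyzer;
    }

    public static void main(String[] args) {
        System.out.println(analyzerFor("url").name());   // url
        System.out.println(analyzerFor("title").name()); // content
    }
}
```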

