Posted to pylucene-dev@lucene.apache.org by Martin <ma...@webscio.net> on 2010/07/17 22:23:01 UTC

Building a custom Tokenizer

Hi there,

I'm trying to extend the PythonTokenizer class to build my own custom
tokenizer, but I get stuck pretty much right after that. I know that
I'm supposed to override the incrementToken() method, but what exactly
am I dealing with in there and what should it return? My goal is to
construct a tokenizer that returns pretty large tokens, maybe sentences
or even the whole content. The reason I need this is that
NGramTokenFilter needs a TokenStream to run on, but every tokenizer
I've tried removes whitespace from the text... and I need ngrams that
span across spaces :(

Thanks in advance for any hints!

Regards,
Martin
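
For reference, incrementToken() is pull-based: each call should advance
the stream by one token, write that token's text and offsets into the
stream's attributes, and return True, returning False once the input is
exhausted. Below is a minimal sketch of a tokenizer that emits the
whole input as a single token, assuming PyLucene 3.0.x (flat lucene
namespace, Python 2, TermAttribute API); the class name is made up for
illustration:

    import lucene
    from lucene import PythonTokenizer, TermAttribute, OffsetAttribute

    class WholeContentTokenizer(PythonTokenizer):
        # illustrative tokenizer: emits the entire input as one token

        def __init__(self, reader):
            super(WholeContentTokenizer, self).__init__(reader)
            # slurp the whole Reader up front; fine for small fields
            chars = []
            while True:
                c = reader.read()
                if c == -1:
                    break
                chars.append(unichr(c))
            self.text = u''.join(chars)
            self.done = False
            self.termAtt = self.addAttribute(TermAttribute.class_)
            self.offsetAtt = self.addAttribute(OffsetAttribute.class_)

        def incrementToken(self):
            if self.done:
                return False                 # stream exhausted
            self.clearAttributes()
            self.termAtt.setTermBuffer(self.text)
            self.offsetAtt.setOffset(0, len(self.text))
            self.done = True
            return True                      # produced one token

As with any PyLucene code, lucene.initVM() must have been called before
the tokenizer is instantiated.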

Re: Building a custom Tokenizer

Posted by Martin <ma...@webscio.net>.
Hey,

Thanks for the tips. I was pointed towards the KeywordTokenizer by the
Java people, which returns the full content as a single token (not a
very intuitive name in my opinion, but anyway). I might still need to
extend it to do some customizations, so I'll look into the
PythonAnalyzer samples.

Thanks again,
Martin
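
The resulting chain is short. A minimal sketch, again assuming PyLucene
3.0.x; KeywordTokenizer emits the whole input as one token, so the
n-grams built from it are free to cross word boundaries:

    import lucene
    from lucene import (initVM, StringReader, KeywordTokenizer,
                        NGramTokenFilter, TermAttribute)

    initVM()

    # bigrams over 'new york', including the two spanning the space
    stream = NGramTokenFilter(
        KeywordTokenizer(StringReader('new york')), 2, 2)
    termAtt = stream.addAttribute(TermAttribute.class_)
    while stream.incrementToken():
        print termAtt.term()    # ne, ew, 'w ', ' y', yo, or, rk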


Re: Building a custom Tokenizer

Posted by Andi Vajda <va...@apache.org>.

I forgot to mention that there are a number of PyLucene tests and
samples doing this by extending PythonAnalyzer. Look for these under
the tests and samples/LuceneInAction directories.

Andi..
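
The pattern in those tests and samples is to subclass PythonAnalyzer
and override tokenStream(). A minimal sketch that wires in the
KeywordTokenizer/NGramTokenFilter chain discussed in this thread,
assuming PyLucene 3.0.x; the analyzer name is illustrative:

    import lucene
    from lucene import (PythonAnalyzer, KeywordTokenizer,
                        NGramTokenFilter)

    class NGramAnalyzer(PythonAnalyzer):
        # illustrative analyzer: character trigrams that may span spaces

        def tokenStream(self, fieldName, reader):
            # one token for the whole field, then 3-grams over it
            return NGramTokenFilter(KeywordTokenizer(reader), 3, 3)

An instance can then be passed to IndexWriter or QueryParser like any
built-in analyzer.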


Re: Building a custom Tokenizer

Posted by Andi Vajda <va...@apache.org>.

Check out the Java Lucene javadocs and ask again on
java-user@lucene.apache.org, where many more Lucene experts hang out.
Subscribe first by sending mail to java-user-subscribe and following
the instructions in the response.

Andi..
