Posted to pylucene-dev@lucene.apache.org by Martin <ma...@webscio.net> on 2010/07/17 22:23:01 UTC
Building a custom Tokenizer
Hi there,
I'm trying to extend the PythonTokenizer class to build my own custom
tokenizer, but I get stuck pretty soon after that. I know that I'm
supposed to override the incrementToken() method, but what exactly am I
dealing with in there, and what should it return? My goal is to
construct a tokenizer that returns fairly large tokens, maybe sentences
or even the whole content. The reason I need this is that
NGramTokenFilter needs a TokenStream to run on, but every other
tokenizer removes whitespace from the text.. and I need ngrams that
span spaces :(
Thanks in advance for any hints!
Regards,
Martin
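[Archive note: the incrementToken() contract asked about here -- fill in
the token's attributes and return True while another token remains,
return False once the stream is exhausted -- can be sketched with a
plain-Python stand-in. This is not the real PyLucene API; the class and
attribute names below are made up purely to illustrate the control flow.]

```python
# Plain-Python stand-in for the Tokenizer contract (illustrative only,
# not the actual PyLucene class hierarchy or attribute API).

class WholeTextTokenizer:
    """Emits the entire input as a single token, KeywordTokenizer-style."""

    def __init__(self, text):
        self.text = text
        self.done = False
        self.term = None  # stand-in for the term attribute

    def incrementToken(self):
        # Contract: set the token attributes and return True while there
        # is another token; return False when the stream is exhausted.
        if self.done:
            return False
        self.term = self.text
        self.done = True
        return True

stream = WholeTextTokenizer("spanning spaces")
tokens = []
while stream.incrementToken():
    tokens.append(stream.term)
# tokens == ["spanning spaces"] -- one token, whitespace preserved
```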
Re: Building a custom Tokenizer
Posted by Martin <ma...@webscio.net>.
Hey,
Thanks for the tips. I was pointed towards the KeywordTokenizer on the
Java list; it returns the full content as one token (not a very
intuitive name in my opinion, but anyway). I might still need to extend
it for some customizations, so I'll look into the PythonAnalyzer
samples.
Thanks again,
Martin
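[Archive note: why the single-token approach solves the original
problem can be shown with a short plain-Python sketch: pass the whole
text through as one token, then slice character n-grams from it, so the
grams can cross word boundaries. This only mimics what KeywordTokenizer
plus NGramTokenFilter do together; the function names are invented for
the sketch and no Lucene API is used.]

```python
def keyword_tokenize(text):
    # Like KeywordTokenizer: the whole input becomes a single token.
    return [text]

def ngram_filter(tokens, n=3):
    # Like NGramTokenFilter: character n-grams of each incoming token.
    # Because the token still contains its spaces, grams span them.
    for token in tokens:
        for i in range(len(token) - n + 1):
            yield token[i:i + n]

grams = list(ngram_filter(keyword_tokenize("to be"), n=3))
# grams == ['to ', 'o b', ' be'] -- note the grams spanning the space
```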
Re: Building a custom Tokenizer
Posted by Andi Vajda <va...@apache.org>.
I forgot to mention that there are a number of PyLucene tests and
samples doing this by extending PythonAnalyzer. Look for these under
the tests and samples/LuceneInAction directories.
Andi..
Re: Building a custom Tokenizer
Posted by Andi Vajda <va...@apache.org>.
Check out the Java Lucene javadocs and ask again on
java-user@lucene.apache.org, where many more Lucene experts hang out.
Subscribe first by sending mail to java-user-subscribe and following
the instructions in the response.
Andi..