Posted to java-user@lucene.apache.org by Steve Rowe <sa...@gmail.com> on 2014/10/01 07:01:54 UTC

Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

Hi Paul,

StandardTokenizer implements the Word Boundaries rules in the Unicode Text Segmentation Standard Annex UAX#29 - here’s the relevant section for Unicode 6.1.0, which is the version supported by Lucene 4.1.0: <http://www.unicode.org/reports/tr29/tr29-19.html#Word_Boundaries>.

Only those sequences between boundaries that contain letters and/or digits are returned as tokens; all other sequences between boundaries are skipped over and not returned as tokens.
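To make that concrete, here is a minimal, untested sketch against the Lucene 4.1 API (Version.LUCENE_41 and the sample strings are just illustrative choices, not taken from your test):

    import java.io.StringReader;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class StandardTokenizerDemo {
      public static void main(String[] args) throws Exception {
        // "!!!" contains no letters or digits, so it produces no tokens at all;
        // "foo!!!bar" produces the tokens [foo] and [bar], with the punctuation skipped.
        for (String text : new String[] {"!!!", "foo!!!bar"}) {
          StandardTokenizer tokenizer = new StandardTokenizer(Version.LUCENE_41, new StringReader(text));
          CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
          tokenizer.reset();
          System.out.print(text + " ->");
          while (tokenizer.incrementToken()) {
            System.out.print(" [" + term.toString() + "]");
          }
          tokenizer.end();
          tokenizer.close();
          System.out.println();
        }
      }
    }

So the assertTrue(tokenizer.incrementToken()) in your excerpt fails simply because there is no token to return for "!!!".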

Steve

On Sep 30, 2014, at 3:54 PM, Paul Taylor <pa...@fastmail.fm> wrote:

> Does StandardTokenizer remove punctuation (in Lucene 4.1)
> 
> I'm just trying to move back to StandardTokenizer from my own old custom implementation, because the newer version seems to have much better support for Asian languages.
> 
> However, this code excerpt fails on incrementToken(), implying that the "!!!" is removed from the output, yet looking at the JFlex classes I can't see anything to indicate that punctuation is removed. Is it removed, and if so, can I change that?
> 
> Tokenizer tokenizer = new StandardTokenizer(LuceneVersion.LUCENE_VERSION, new StringReader("!!!"));
> assertNotNull(tokenizer);
> tokenizer.reset();
> assertTrue(tokenizer.incrementToken());
> 




Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

Posted by Steve Rowe <sa...@gmail.com>.
Paul,

You should also check out ICUTokenizer/DefaultICUTokenizerConfig, which adds better handling for some languages on top of UAX#29 Word Break rules conformance, and also finds token boundaries when the writing system (a.k.a. script) changes.  This is intended to be extensible per script.

The root break iterator used by DefaultICUTokenizerConfig also ignores punctuation.  You can find its grammar at:

    lucene/analysis/icu/src/data/uax29/Default.rbbi
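If it helps, here is a minimal, untested sketch of driving ICUTokenizer with that default config (this assumes the lucene-analyzers-icu module is on the classpath; the input string is just an example):

    import java.io.StringReader;
    import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

    public class IcuTokenizerDemo {
      public static void main(String[] args) throws Exception {
        // The single-argument constructor uses DefaultICUTokenizerConfig, whose
        // root break iterator (Default.rbbi) skips punctuation just like UAX#29.
        ICUTokenizer tokenizer = new ICUTokenizer(new StringReader("testing, 中文 text!"));
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);
        tokenizer.reset();
        while (tokenizer.incrementToken()) {
          // A token boundary should also be found where the script changes from Latin to Han.
          System.out.println(term.toString() + " (" + type.type() + ")");
        }
        tokenizer.end();
        tokenizer.close();
      }
    }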

Steve

On Oct 1, 2014, at 4:22 PM, Paul Taylor <pa...@fastmail.fm> wrote:

> On 01/10/2014 18:42, Steve Rowe wrote:
>> Paul,
>> 
>> Boilerplate upgrade recommendation: consider using the most recent Lucene release (4.10.1) - it’s the most stable, performant, and featureful release available, and many bugs have been fixed since the 4.1 release.
> Yeah, sure. I did try this and hit a load of errors, but I certainly will do so.
>> FYI, StandardTokenizer doesn’t find word boundaries for Chinese, Japanese, Korean, Thai, and other languages that don’t use whitespace to denote word boundaries, except those around punctuation.  Note that Lucene 4.1 does have specialized tokenizers for Simplified Chinese and Japanese: the smartcn and kuromoji analysis modules, respectively.
> So for Chinese, Japanese, Korean, Thai, etc. it's just identifying that the chars are from said language, and then we can do something clever with subsequent filters such as CJKBigramFilter, right?
> My big problem is that my code is meant to deal with any language, and I don't know what language it is in except by looking at the characters themselves. I also have to deal with input that contains symbols, unusual punctuation, etc.
>> It is possible to construct a tokenizer just based on pure java code - there are several examples of this in Lucene 4.1, see e.g. PatternTokenizer, and CharTokenizer and its subclasses WhitespaceTokenizer and LetterTokenizer.
>> 
> Ah yes, I discovered this today. What I would really like is a version of the JFlex-generated StandardTokenizer written in pure Java, making it easier to tweak, but I'm a little concerned that if I naively write it from scratch I may create something that doesn't perform very well.
> 
> Paul
> 




Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

Posted by Paul Taylor <pa...@fastmail.fm>.
On 01/10/2014 18:42, Steve Rowe wrote:
> Paul,
>
> Boilerplate upgrade recommendation: consider using the most recent Lucene release (4.10.1) - it’s the most stable, performant, and featureful release available, and many bugs have been fixed since the 4.1 release.
Yeah, sure. I did try this and hit a load of errors, but I certainly will do so.
> FYI, StandardTokenizer doesn’t find word boundaries for Chinese, Japanese, Korean, Thai, and other languages that don’t use whitespace to denote word boundaries, except those around punctuation.  Note that Lucene 4.1 does have specialized tokenizers for Simplified Chinese and Japanese: the smartcn and kuromoji analysis modules, respectively.
So for Chinese, Japanese, Korean, Thai, etc. it's just identifying that the chars are from said language, and then we can do something clever with subsequent filters such as CJKBigramFilter, right?
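Something like the following chain is what I have in mind - an untested sketch, assuming the CJKBigramFilter from the analyzers-common module and a Lucene 4.1 Version constant:

    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.cjk.CJKBigramFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class CjkBigramSketch {
      public static void main(String[] args) throws Exception {
        // StandardTokenizer emits each Han character as its own token;
        // CJKBigramFilter then joins adjacent ones into overlapping bigrams.
        StandardTokenizer tokenizer = new StandardTokenizer(Version.LUCENE_41, new StringReader("我购买了道具"));
        TokenStream stream = new CJKBigramFilter(tokenizer);
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
          System.out.println(term.toString());
        }
        stream.end();
        stream.close();
      }
    }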
My big problem is that my code is meant to deal with any language, and I don't know what language it is in except by looking at the characters themselves. I also have to deal with input that contains symbols, unusual punctuation, etc.
> It is possible to construct a tokenizer just based on pure java code - there are several examples of this in Lucene 4.1, see e.g. PatternTokenizer, and CharTokenizer and its subclasses WhitespaceTokenizer and LetterTokenizer.
>
Ah yes, I discovered this today. What I would really like is a version of the JFlex-generated StandardTokenizer written in pure Java, making it easier to tweak, but I'm a little concerned that if I naively write it from scratch I may create something that doesn't perform very well.

Paul



Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

Posted by Steve Rowe <sa...@gmail.com>.
Paul,

Boilerplate upgrade recommendation: consider using the most recent Lucene release (4.10.1) - it’s the most stable, performant, and featureful release available, and many bugs have been fixed since the 4.1 release.

FYI, StandardTokenizer doesn’t find word boundaries for Chinese, Japanese, Korean, Thai, and other languages that don’t use whitespace to denote word boundaries, except those around punctuation.  Note that Lucene 4.1 does have specialized tokenizers for Simplified Chinese and Japanese: the smartcn and kuromoji analysis modules, respectively.

It is possible to construct a tokenizer just based on pure java code - there are several examples of this in Lucene 4.1, see e.g. PatternTokenizer, and CharTokenizer and its subclasses WhitespaceTokenizer and LetterTokenizer.
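For example, a pure-Java tokenizer that splits only on whitespace (and therefore keeps punctuation) is just a tiny CharTokenizer subclass - an untested sketch:

    import java.io.Reader;
    import org.apache.lucene.analysis.util.CharTokenizer;
    import org.apache.lucene.util.Version;

    // Keeps every character that is not whitespace, so punctuation like "!!!"
    // survives as (part of) a token.
    public final class NonWhitespaceTokenizer extends CharTokenizer {
      public NonWhitespaceTokenizer(Version matchVersion, Reader input) {
        super(matchVersion, input);
      }

      @Override
      protected boolean isTokenChar(int c) {
        return !Character.isWhitespace(c);
      }
    }

(This is essentially what WhitespaceTokenizer does; the point is that isTokenChar() is the only thing you have to write.)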

Steve
www.lucidworks.com

On Oct 1, 2014, at 4:04 AM, Paul Taylor <pa...@fastmail.fm> wrote:

> On 01/10/2014 08:08, Dawid Weiss wrote:
>> Hi Steve,
>> 
>> I have to admit I also find it frequently useful to include
>> punctuation as tokens (even if it's filtered out by subsequent token
>> filters for indexing, it's a useful to-have for other NLP tasks). Do
>> you think it'd be possible (read: relatively easy) to create an
>> analyzer (or a modification of the standard one's lexer) so that
>> punctuation is returned as a separate token type?
>> 
>> Dawid
>> 
>> 
>> On Wed, Oct 1, 2014 at 7:01 AM, Steve Rowe <sa...@gmail.com> wrote:
>>> Hi Paul,
>>> 
>>> StandardTokenizer implements the Word Boundaries rules in the Unicode Text Segmentation Standard Annex UAX#29 - here’s the relevant section for Unicode 6.1.0, which is the version supported by Lucene 4.1.0: <http://www.unicode.org/reports/tr29/tr29-19.html#Word_Boundaries>.
>>> 
>>> Only those sequences between boundaries that contain letters and/or digits are returned as tokens; all other sequences between boundaries are skipped over and not returned as tokens.
>>> 
>>> Steve
> Yep, I need punctuation; in fact the only thing I usually want removed is whitespace, yet I would like to take advantage of the fact that the new tokenizer can recognise some word boundaries that are not based on whitespace (in the case of some non-Western languages). I have modified the tokenizer before but found it very difficult to understand. Is it possible/advisable to construct a tokenizer based on pure Java code rather than derived from a JFlex definition?
> 




Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

Posted by Paul Taylor <pa...@fastmail.fm>.
On 01/10/2014 08:08, Dawid Weiss wrote:
> Hi Steve,
>
> I have to admit I also find it frequently useful to include
> punctuation as tokens (even if it's filtered out by subsequent token
> filters for indexing, it's a useful to-have for other NLP tasks). Do
> you think it'd be possible (read: relatively easy) to create an
> analyzer (or a modification of the standard one's lexer) so that
> punctuation is returned as a separate token type?
>
> Dawid
>
>
> On Wed, Oct 1, 2014 at 7:01 AM, Steve Rowe <sa...@gmail.com> wrote:
>> Hi Paul,
>>
>> StandardTokenizer implements the Word Boundaries rules in the Unicode Text Segmentation Standard Annex UAX#29 - here’s the relevant section for Unicode 6.1.0, which is the version supported by Lucene 4.1.0: <http://www.unicode.org/reports/tr29/tr29-19.html#Word_Boundaries>.
>>
>> Only those sequences between boundaries that contain letters and/or digits are returned as tokens; all other sequences between boundaries are skipped over and not returned as tokens.
>>
>> Steve
Yep, I need punctuation; in fact the only thing I usually want removed is whitespace, yet I would like to take advantage of the fact that the new tokenizer can recognise some word boundaries that are not based on whitespace (in the case of some non-Western languages). I have modified the tokenizer before but found it very difficult to understand. Is it possible/advisable to construct a tokenizer based on pure Java code rather than derived from a JFlex definition?
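(For what it's worth, WhitespaceTokenizer already gives me the keep-everything-but-whitespace part, though not the extra non-whitespace word boundaries - an untested sketch:)

    import java.io.StringReader;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class WhitespaceOnlyDemo {
      public static void main(String[] args) throws Exception {
        // Splits only on whitespace, so punctuation such as "!!!" is kept intact;
        // prints: hello,  world  !!!  (one token per line)
        WhitespaceTokenizer tokenizer = new WhitespaceTokenizer(Version.LUCENE_41, new StringReader("hello, world !!!"));
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        tokenizer.reset();
        while (tokenizer.incrementToken()) {
          System.out.println(term.toString());
        }
        tokenizer.end();
        tokenizer.close();
      }
    }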



Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

Posted by Michael McCandless <lu...@mikemccandless.com>.
I played with this possibility on the extremely experimental
https://issues.apache.org/jira/browse/LUCENE-5012 which I haven't
gotten back to for a long time...

The changes on that branch add the idea of a "deleted token", by just
setting a new DeletedAttribute marking whether the token is deleted or
not.  Otherwise all other token attributes are visible like normal.
I.e., tokens are deleted the way documents are deleted in Lucene
(marked with a bit but not actually deleted until "later").  E.g.
StopFilter (on that branch) just sets that attribute to true, instead
of removing the token and leaving a hole.

The branch also had an InsertDeletedPunctuationTokenStage that would
detect when the tokenizer had dropped punctuation and then insert
[deleted] punctuation tokens.

This way IndexWriter could still ignore such tokens (since they are
marked as deleted), but other token filters would still see the
deleted tokens and be able to make decisions based on them...

Anyway, the branch is far far away from committing, but maybe we could
just pull off of it the idea of a "deleted bit" that we mark on a
given Token to tell IndexWriter not to index it, but subsequent token
filters would be able to see it ...
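Just to make the idea concrete, here is a hypothetical sketch of such an attribute using Lucene's standard Attribute/AttributeImpl pattern - these names are made up for illustration and are not the actual code on the LUCENE-5012 branch:

    // DeletedAttribute.java (hypothetical)
    import org.apache.lucene.util.Attribute;

    public interface DeletedAttribute extends Attribute {
      void setDeleted(boolean deleted);
      boolean isDeleted();
    }

    // DeletedAttributeImpl.java (hypothetical); AttributeSource finds the
    // implementation class via the "<interface name> + Impl" naming convention.
    import org.apache.lucene.util.AttributeImpl;

    public class DeletedAttributeImpl extends AttributeImpl implements DeletedAttribute {
      private boolean deleted;

      @Override public void setDeleted(boolean deleted) { this.deleted = deleted; }
      @Override public boolean isDeleted() { return deleted; }

      // A freshly cleared token is "not deleted" by default.
      @Override public void clear() { deleted = false; }

      @Override public void copyTo(AttributeImpl target) {
        ((DeletedAttribute) target).setDeleted(deleted);
      }
    }

A StopFilter-style filter would then just set the bit instead of dropping the token, and the indexing chain would check isDeleted() before adding a token to the index.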

Mike McCandless

http://blog.mikemccandless.com


On Wed, Oct 1, 2014 at 3:08 AM, Dawid Weiss <da...@gmail.com> wrote:
> Hi Steve,
>
> I have to admit I also find it frequently useful to include
> punctuation as tokens (even if it's filtered out by subsequent token
> filters for indexing, it's a useful to-have for other NLP tasks). Do
> you think it'd be possible (read: relatively easy) to create an
> analyzer (or a modification of the standard one's lexer) so that
> punctuation is returned as a separate token type?
>
> Dawid
>
>
> On Wed, Oct 1, 2014 at 7:01 AM, Steve Rowe <sa...@gmail.com> wrote:
>> Hi Paul,
>>
>> StandardTokenizer implements the Word Boundaries rules in the Unicode Text Segmentation Standard Annex UAX#29 - here’s the relevant section for Unicode 6.1.0, which is the version supported by Lucene 4.1.0: <http://www.unicode.org/reports/tr29/tr29-19.html#Word_Boundaries>.
>>
>> Only those sequences between boundaries that contain letters and/or digits are returned as tokens; all other sequences between boundaries are skipped over and not returned as tokens.
>>
>> Steve
>>
>> On Sep 30, 2014, at 3:54 PM, Paul Taylor <pa...@fastmail.fm> wrote:
>>
>>> Does StandardTokenizer remove punctuation (in Lucene 4.1)
>>>
>>> I'm just trying to move back to StandardTokenizer from my own old custom implementation, because the newer version seems to have much better support for Asian languages.
>>>
>>> However, this code excerpt fails on incrementToken(), implying that the "!!!" is removed from the output, yet looking at the JFlex classes I can't see anything to indicate that punctuation is removed. Is it removed, and if so, can I change that?
>>>
>>> Tokenizer tokenizer = new StandardTokenizer(LuceneVersion.LUCENE_VERSION, new StringReader("!!!"));
>>> assertNotNull(tokenizer);
>>> tokenizer.reset();
>>> assertTrue(tokenizer.incrementToken());
>>>



Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

Posted by Dawid Weiss <da...@gmail.com>.
Hi Steve,

I have to admit I also find it frequently useful to include
punctuation as tokens (even if it's filtered out by subsequent token
filters for indexing, it's a useful to-have for other NLP tasks). Do
you think it'd be possible (read: relatively easy) to create an
analyzer (or a modification of the standard one's lexer) so that
punctuation is returned as a separate token type?

Dawid


On Wed, Oct 1, 2014 at 7:01 AM, Steve Rowe <sa...@gmail.com> wrote:
> Hi Paul,
>
> StandardTokenizer implements the Word Boundaries rules in the Unicode Text Segmentation Standard Annex UAX#29 - here’s the relevant section for Unicode 6.1.0, which is the version supported by Lucene 4.1.0: <http://www.unicode.org/reports/tr29/tr29-19.html#Word_Boundaries>.
>
> Only those sequences between boundaries that contain letters and/or digits are returned as tokens; all other sequences between boundaries are skipped over and not returned as tokens.
>
> Steve
>
> On Sep 30, 2014, at 3:54 PM, Paul Taylor <pa...@fastmail.fm> wrote:
>
>> Does StandardTokenizer remove punctuation (in Lucene 4.1)
>>
>> I'm just trying to move back to StandardTokenizer from my own old custom implementation, because the newer version seems to have much better support for Asian languages.
>>
>> However, this code excerpt fails on incrementToken(), implying that the "!!!" is removed from the output, yet looking at the JFlex classes I can't see anything to indicate that punctuation is removed. Is it removed, and if so, can I change that?
>>
>> Tokenizer tokenizer = new StandardTokenizer(LuceneVersion.LUCENE_VERSION, new StringReader("!!!"));
>> assertNotNull(tokenizer);
>> tokenizer.reset();
>> assertTrue(tokenizer.incrementToken());
>>
