Posted to java-user@lucene.apache.org by Paul Taylor <pa...@fastmail.fm> on 2014/09/30 21:54:42 UTC
Does StandardTokenizer remove punctuation (in Lucene 4.1)
I'm just trying to move back to StandardTokenizer from my own old custom
implementation, because the newer version seems to have much better
support for Asian languages.
However, this code excerpt fails on incrementToken(), implying that the !!!
is removed from the output. Yet looking at the jflex classes I can't see
anything to indicate that punctuation is removed. Is it removed, and if so,
can I prevent it?
Tokenizer tokenizer = new StandardTokenizer(LuceneVersion.LUCENE_VERSION, new StringReader("!!!"));
assertNotNull(tokenizer);
tokenizer.reset();
assertTrue(tokenizer.incrementToken());
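The behaviour in question can be reproduced without Lucene at all: the JDK's java.text.BreakIterator follows essentially the same UAX#29-style word-break rules that StandardTokenizer's jflex grammar is generated from. The following is a minimal sketch using only the JDK; the class and method names (WordBreakDemo, tokenize) are invented for illustration and are not part of Lucene's API:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

public class WordBreakDemo {
    // Split text at UAX#29-style word boundaries, then keep only segments
    // containing at least one letter or digit -- the same filtering rule
    // StandardTokenizer applies, which is why punctuation-only input
    // produces no tokens at all.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance();
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(segment);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("!!!"));        // [] -- no letter/digit, so no tokens
        System.out.println(tokenize("foo, bar!"));  // [foo, bar]
    }
}
```

Run against "!!!", the list is empty, which is exactly why incrementToken() returns false in the snippet above.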
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)
Posted by Steve Rowe <sa...@gmail.com>.
Paul,
You should also check out ICUTokenizer/DefaultICUTokenizerConfig, which adds better handling for some languages to UAX#29 Word Break rules conformance, and also finds token boundaries when the writing system (aka script) changes. This is intended to be extensible per script.
The root break iterator used by DefaultICUTokenizerConfig also ignores punctuation. You can find its grammar at:
lucene/analysis/icu/src/data/uax29/Default.rbbi
Steve
On Oct 1, 2014, at 4:22 PM, Paul Taylor <pa...@fastmail.fm> wrote:
> On 01/10/2014 18:42, Steve Rowe wrote:
>> Paul,
>>
>> Boilerplate upgrade recommendation: consider using the most recent Lucene release (4.10.1) - it’s the most stable, performant, and featureful release available, and many bugs have been fixed since the 4.1 release.
> Yeah, sure. I did try this and hit a load of errors, but I certainly will do so.
>> FYI, StandardTokenizer doesn’t find word boundaries for Chinese, Japanese, Korean, Thai, and other languages that don’t use whitespace to denote word boundaries, except those around punctuation. Note that Lucene 4.1 does have specialized tokenizers for Simplified Chinese and Japanese: the smartcn and kuromoji analysis modules, respectively.
> So for Chinese, Japanese, Korean, Thai, etc., it's just identifying that the chars are from said language, and then we can do something clever with it with subsequent filters such as CJKBigramFilter, right?
> My big trouble is that my code is meant to deal with any language, and I don't know what language the text is in except by looking at the characters themselves, AND I also have to deal with stuff that contains symbols, funny punctuation, etc.
>> It is possible to construct a tokenizer just based on pure java code - there are several examples of this in Lucene 4.1, see e.g. PatternTokenizer, and CharTokenizer and its subclasses WhitespaceTokenizer and LetterTokenizer.
>>
> Ah yes, I discovered this today. What I would really like is a version of the jflex StandardTokenizer but written in pure Java, making it easier to tweak, but I'm a little concerned that if I naively write it from scratch I may create something that doesn't perform very well.
>
> Paul
>
Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)
Posted by Paul Taylor <pa...@fastmail.fm>.
On 01/10/2014 18:42, Steve Rowe wrote:
> Paul,
>
> Boilerplate upgrade recommendation: consider using the most recent Lucene release (4.10.1) - it’s the most stable, performant, and featureful release available, and many bugs have been fixed since the 4.1 release.
Yeah, sure. I did try this and hit a load of errors, but I certainly will
do so.
> FYI, StandardTokenizer doesn’t find word boundaries for Chinese, Japanese, Korean, Thai, and other languages that don’t use whitespace to denote word boundaries, except those around punctuation. Note that Lucene 4.1 does have specialized tokenizers for Simplified Chinese and Japanese: the smartcn and kuromoji analysis modules, respectively.
So for Chinese, Japanese, Korean, Thai, etc., it's just identifying that the
chars are from said language, and then we can do something clever with
it with subsequent filters such as CJKBigramFilter, right?
My big trouble is that my code is meant to deal with any language, and I
don't know what language the text is in except by looking at the characters
themselves, AND I also have to deal with stuff that contains symbols, funny
punctuation, etc.
> It is possible to construct a tokenizer just based on pure java code - there are several examples of this in Lucene 4.1, see e.g. PatternTokenizer, and CharTokenizer and its subclasses WhitespaceTokenizer and LetterTokenizer.
>
Ah yes, I discovered this today. What I would really like is a version of
the jflex StandardTokenizer but written in pure Java, making it easier to
tweak, but I'm a little concerned that if I naively write it from scratch
I may create something that doesn't perform very well.
Paul
Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)
Posted by Steve Rowe <sa...@gmail.com>.
Paul,
Boilerplate upgrade recommendation: consider using the most recent Lucene release (4.10.1) - it’s the most stable, performant, and featureful release available, and many bugs have been fixed since the 4.1 release.
FYI, StandardTokenizer doesn’t find word boundaries for Chinese, Japanese, Korean, Thai, and other languages that don’t use whitespace to denote word boundaries, except those around punctuation. Note that Lucene 4.1 does have specialized tokenizers for Simplified Chinese and Japanese: the smartcn and kuromoji analysis modules, respectively.
It is possible to construct a tokenizer just based on pure java code - there are several examples of this in Lucene 4.1, see e.g. PatternTokenizer, and CharTokenizer and its subclasses WhitespaceTokenizer and LetterTokenizer.
Steve
www.lucidworks.com
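As a rough illustration of that suggestion, here is a hypothetical pure-Java tokenizer in the spirit of CharTokenizer, except that instead of discarding punctuation it emits runs of punctuation as separate tokens. This is a sketch against plain JDK types, not Lucene's Tokenizer API (a real implementation would extend Tokenizer and use the attribute machinery); the class name KeepPunctuationTokenizer is invented:

```java
import java.util.ArrayList;
import java.util.List;

public class KeepPunctuationTokenizer {
    // Split on whitespace, but also break between letter/digit runs and
    // punctuation runs, so punctuation survives as its own tokens instead
    // of being dropped.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Boolean currentIsWord = null; // type of the run being built, null when empty
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isWhitespace(cp)) {
                flush(tokens, current);          // whitespace ends any run, emits no token
                currentIsWord = null;
            } else {
                boolean isWord = Character.isLetterOrDigit(cp);
                if (currentIsWord != null && currentIsWord != isWord) {
                    flush(tokens, current);      // boundary between word chars and punctuation
                }
                current.appendCodePoint(cp);
                currentIsWord = isWord;
            }
            i += Character.charCount(cp);
        }
        flush(tokens, current);
        return tokens;
    }

    private static void flush(List<String> tokens, StringBuilder sb) {
        if (sb.length() > 0) {
            tokens.add(sb.toString());
            sb.setLength(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, world!!!")); // [Hello, ,, world, !!!]
    }
}
```

Note this does nothing clever for languages without whitespace between words; for those, a downstream filter (or the ICU/smartcn/kuromoji modules mentioned elsewhere in this thread) would still be needed.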
On Oct 1, 2014, at 4:04 AM, Paul Taylor <pa...@fastmail.fm> wrote:
> On 01/10/2014 08:08, Dawid Weiss wrote:
>> Hi Steve,
>>
>> I have to admit I also find it frequently useful to include
>> punctuation as tokens (even if it's filtered out by subsequent token
>> filters for indexing, it's a useful to-have for other NLP tasks). Do
>> you think it'd be possible (read: relatively easy) to create an
>> analyzer (or a modification of the standard one's lexer) so that
>> punctuation is returned as a separate token type?
>>
>> Dawid
>>
>>
>> On Wed, Oct 1, 2014 at 7:01 AM, Steve Rowe <sa...@gmail.com> wrote:
>>> Hi Paul,
>>>
>>> StandardTokenizer implements the Word Boundaries rules in the Unicode Text Segmentation Standard Annex UAX#29 - here’s the relevant section for Unicode 6.1.0, which is the version supported by Lucene 4.1.0: <http://www.unicode.org/reports/tr29/tr29-19.html#Word_Boundaries>.
>>>
>>> Only those sequences between boundaries that contain letters and/or digits are returned as tokens; all other sequences between boundaries are skipped over and not returned as tokens.
>>>
>>> Steve
> Yep, I need punctuation; in fact, the only thing I usually want removed is whitespace. Yet I would like to take advantage of the fact that the new tokenizer can recognise some word boundaries that are not based on whitespace, in the case of some non-Western languages. I have modified the tokenizer before but found it very difficult to understand. Is it possible/advisable to construct a tokenizer based on pure Java code rather than derived from a jflex definition?
Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)
Posted by Paul Taylor <pa...@fastmail.fm>.
On 01/10/2014 08:08, Dawid Weiss wrote:
> Hi Steve,
>
> I have to admit I also find it frequently useful to include
> punctuation as tokens (even if it's filtered out by subsequent token
> filters for indexing, it's a useful to-have for other NLP tasks). Do
> you think it'd be possible (read: relatively easy) to create an
> analyzer (or a modification of the standard one's lexer) so that
> punctuation is returned as a separate token type?
>
> Dawid
>
>
> On Wed, Oct 1, 2014 at 7:01 AM, Steve Rowe <sa...@gmail.com> wrote:
>> Hi Paul,
>>
>> StandardTokenizer implements the Word Boundaries rules in the Unicode Text Segmentation Standard Annex UAX#29 - here’s the relevant section for Unicode 6.1.0, which is the version supported by Lucene 4.1.0: <http://www.unicode.org/reports/tr29/tr29-19.html#Word_Boundaries>.
>>
>> Only those sequences between boundaries that contain letters and/or digits are returned as tokens; all other sequences between boundaries are skipped over and not returned as tokens.
>>
>> Steve
Yep, I need punctuation; in fact, the only thing I usually want removed is
whitespace. Yet I would like to take advantage of the fact that the new
tokenizer can recognise some word boundaries that are not based on
whitespace, in the case of some non-Western languages. I have modified
the tokenizer before but found it very difficult to understand. Is it
possible/advisable to construct a tokenizer based on pure Java code
rather than derived from a jflex definition?
Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)
Posted by Michael McCandless <lu...@mikemccandless.com>.
I played with this possibility on the extremely experimental
https://issues.apache.org/jira/browse/LUCENE-5012 which I haven't
gotten back to for a long time...
The changes on that branch add the idea of a "deleted token": a new
DeletedAttribute marks whether the token is deleted or not, while all
other token attributes remain visible as normal.
I.e., tokens are deleted the way documents are deleted in Lucene
(marked with a bit but not actually removed until "later"). E.g.,
StopFilter (on that branch) just sets that attribute to true, instead
of removing the token and leaving a hole.
The branch also had an InsertDeletedPunctuationTokenStage that would
detect when the tokenizer had dropped punctuation and then insert
[deleted] punctuation tokens.
This way IndexWriter could still ignore such tokens (since they are
marked as deleted), but other token filters would still see the
deleted tokens and be able to make decisions based on them...
Anyway, the branch is far far away from committing, but maybe we could
just pull off of it the idea of a "deleted bit" that we mark on a
given Token to tell IndexWriter not to index it, but subsequent token
filters would be able to see it ...
Mike McCandless
http://blog.mikemccandless.com
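The "deleted bit" idea can be sketched in plain Java, without Lucene's attribute API. The names here (Token, markStopWords, indexedTerms) are invented for illustration and do not correspond to the actual LUCENE-5012 branch; the point is only the shape of the idea: a filter flags tokens rather than removing them, and only the final consumer (standing in for IndexWriter) skips the flagged ones.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class DeletedTokenDemo {
    public static final class Token {
        public final String text;
        public boolean deleted;          // the "deleted bit"
        public Token(String text) { this.text = text; }
    }

    // Mark stop words as deleted rather than dropping them and leaving a
    // hole; downstream stages still see them and can act on them.
    public static void markStopWords(List<Token> tokens, Set<String> stopWords) {
        for (Token t : tokens) {
            if (stopWords.contains(t.text)) {
                t.deleted = true;
            }
        }
    }

    // The indexing stage ignores deleted tokens, as IndexWriter would.
    public static List<String> indexedTerms(List<Token> tokens) {
        List<String> terms = new ArrayList<>();
        for (Token t : tokens) {
            if (!t.deleted) terms.add(t.text);
        }
        return terms;
    }

    public static void main(String[] args) {
        List<Token> stream = new ArrayList<>();
        for (String s : new String[] {"the", "quick", "fox"}) stream.add(new Token(s));
        markStopWords(stream, Set.of("the"));
        System.out.println(indexedTerms(stream)); // [quick, fox] -- "the" stays in the stream, just flagged
    }
}
```

A tokenizer could use the same trick for punctuation: emit punctuation tokens with the flag already set, so filters see them but the index does not.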
On Wed, Oct 1, 2014 at 3:08 AM, Dawid Weiss <da...@gmail.com> wrote:
> Hi Steve,
>
> I have to admit I also find it frequently useful to include
> punctuation as tokens (even if it's filtered out by subsequent token
> filters for indexing, it's a useful to-have for other NLP tasks). Do
> you think it'd be possible (read: relatively easy) to create an
> analyzer (or a modification of the standard one's lexer) so that
> punctuation is returned as a separate token type?
>
> Dawid
>
>
> On Wed, Oct 1, 2014 at 7:01 AM, Steve Rowe <sa...@gmail.com> wrote:
>> Hi Paul,
>>
>> StandardTokenizer implements the Word Boundaries rules in the Unicode Text Segmentation Standard Annex UAX#29 - here’s the relevant section for Unicode 6.1.0, which is the version supported by Lucene 4.1.0: <http://www.unicode.org/reports/tr29/tr29-19.html#Word_Boundaries>.
>>
>> Only those sequences between boundaries that contain letters and/or digits are returned as tokens; all other sequences between boundaries are skipped over and not returned as tokens.
>>
>> Steve
>>
>> On Sep 30, 2014, at 3:54 PM, Paul Taylor <pa...@fastmail.fm> wrote:
>>
>>> Does StandardTokenizer remove punctuation (in Lucene 4.1)
>>>
>>> I'm just trying to move back to StandardTokenizer from my own old custom implementation, because the newer version seems to have much better support for Asian languages.
>>>
>>> However, this code excerpt fails on incrementToken(), implying that the !!! is removed from the output. Yet looking at the jflex classes I can't see anything to indicate that punctuation is removed. Is it removed, and if so, can I prevent it?
>>>
>>> Tokenizer tokenizer = new StandardTokenizer(LuceneVersion.LUCENE_VERSION, new StringReader("!!!"));
>>> assertNotNull(tokenizer);
>>> tokenizer.reset();
>>> assertTrue(tokenizer.incrementToken());
>>>
Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)
Posted by Dawid Weiss <da...@gmail.com>.
Hi Steve,
I have to admit I also find it frequently useful to include
punctuation as tokens (even if it's filtered out by subsequent token
filters for indexing, it's a useful to-have for other NLP tasks). Do
you think it'd be possible (read: relatively easy) to create an
analyzer (or a modification of the standard one's lexer) so that
punctuation is returned as a separate token type?
Dawid
On Wed, Oct 1, 2014 at 7:01 AM, Steve Rowe <sa...@gmail.com> wrote:
> Hi Paul,
>
> StandardTokenizer implements the Word Boundaries rules in the Unicode Text Segmentation Standard Annex UAX#29 - here’s the relevant section for Unicode 6.1.0, which is the version supported by Lucene 4.1.0: <http://www.unicode.org/reports/tr29/tr29-19.html#Word_Boundaries>.
>
> Only those sequences between boundaries that contain letters and/or digits are returned as tokens; all other sequences between boundaries are skipped over and not returned as tokens.
>
> Steve
>
> On Sep 30, 2014, at 3:54 PM, Paul Taylor <pa...@fastmail.fm> wrote:
>
>> Does StandardTokenizer remove punctuation (in Lucene 4.1)
>>
>> I'm just trying to move back to StandardTokenizer from my own old custom implementation, because the newer version seems to have much better support for Asian languages.
>>
>> However, this code excerpt fails on incrementToken(), implying that the !!! is removed from the output. Yet looking at the jflex classes I can't see anything to indicate that punctuation is removed. Is it removed, and if so, can I prevent it?
>>
>> Tokenizer tokenizer = new StandardTokenizer(LuceneVersion.LUCENE_VERSION, new StringReader("!!!"));
>> assertNotNull(tokenizer);
>> tokenizer.reset();
>> assertTrue(tokenizer.incrementToken());
>>
Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)
Posted by Steve Rowe <sa...@gmail.com>.
Hi Paul,
StandardTokenizer implements the Word Boundaries rules in the Unicode Text Segmentation Standard Annex UAX#29 - here’s the relevant section for Unicode 6.1.0, which is the version supported by Lucene 4.1.0: <http://www.unicode.org/reports/tr29/tr29-19.html#Word_Boundaries>.
Only those sequences between boundaries that contain letters and/or digits are returned as tokens; all other sequences between boundaries are skipped over and not returned as tokens.
Steve
On Sep 30, 2014, at 3:54 PM, Paul Taylor <pa...@fastmail.fm> wrote:
> Does StandardTokenizer remove punctuation (in Lucene 4.1)
>
> I'm just trying to move back to StandardTokenizer from my own old custom implementation, because the newer version seems to have much better support for Asian languages.
>
> However, this code excerpt fails on incrementToken(), implying that the !!! is removed from the output. Yet looking at the jflex classes I can't see anything to indicate that punctuation is removed. Is it removed, and if so, can I prevent it?
>
> Tokenizer tokenizer = new StandardTokenizer(LuceneVersion.LUCENE_VERSION, new StringReader("!!!"));
> assertNotNull(tokenizer);
> tokenizer.reset();
> assertTrue(tokenizer.incrementToken());
>
Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)
Posted by Jack Krupansky <ja...@basetechnology.com>.
Yes, most special characters are treated as term delimiters, except that
underscores, dots, and commas have some special rules.
See the details under Standard Tokenizer in my Solr e-book:
http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html
That doesn't give you Java details for Lucene, but the tokenizer rules are
the same.
-- Jack Krupansky
-----Original Message-----
From: Paul Taylor
Sent: Tuesday, September 30, 2014 3:54 PM
To: java-user@lucene.apache.org
Subject: Does StandardTokenizer remove punctuation (in Lucene 4.1)
I'm just trying to move back to StandardTokenizer from my own old custom
implementation, because the newer version seems to have much better
support for Asian languages.
However, this code excerpt fails on incrementToken(), implying that the !!!
is removed from the output. Yet looking at the jflex classes I can't see
anything to indicate that punctuation is removed. Is it removed, and if so,
can I prevent it?
Tokenizer tokenizer = new StandardTokenizer(LuceneVersion.LUCENE_VERSION, new StringReader("!!!"));
assertNotNull(tokenizer);
tokenizer.reset();
assertTrue(tokenizer.incrementToken());