Posted to solr-user@lucene.apache.org by Shawn Heisey <so...@elyograg.org> on 2012/11/16 20:30:29 UTC

Solr/Lucene Tokenizers - cannot get the behavior I need

I cannot seem to get the behavior I want from any tokenizer/filter 
combination in Solr.

Right now I am using WhitespaceTokenizer.  This does not split on 
punctuation, which is the behavior I want, because I do this myself 
later.  I use WordDelimiterFilter with preserveOriginal so that 
documents with text in the format "Word1-Word2" can be located by a 
search for word1word2 as well as the two words individually.
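
For reference, the relevant piece of the analysis chain looks roughly 
like this (attribute values are illustrative, not copied from my actual 
schema):

    <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <!-- splits only on whitespace, so punctuation stays inside tokens -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- "Word1-Word2" becomes Word1, Word2, Word1Word2, plus the
             original hyphenated token -->
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1"
                catenateWords="1"
                preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>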

I am extremely interested in the Unicode behavior of ICUTokenizer, but I 
cannot disable its punctuation splitting and let WDF handle the 
punctuation properly, and that costs me recall.  There is no filter I 
can run after tokenization to undo the damage, either.  Looking at 
ICUTokenizer.java, I do not see any way to write my own tokenizer that 
does what I need.

I have this problem with pretty much all of the tokenizers other than 
Whitespace.  There are situations where I would like to use some of the 
others, but the punctuation-splitting behavior is a major problem for me.

Do I have any options?  I have never looked at the ICU code from IBM, so 
I don't know if it would require major surgery there.

Thanks,
Shawn


Re: Solr/Lucene Tokenizers - cannot get the behavior I need

Posted by Shawn Heisey <el...@elyograg.org>.
On 11/16/2012 12:30 PM, Shawn Heisey wrote:
> I am extremely interested in the Unicode behavior of ICUTokenizer, but 
> I cannot disable the punctuation-splitting behavior and let WDF handle 
> it properly, which causes recall problems.  There is no filter that I 
> can run after tokenization, either.  Looking at ICUTokenizer.java, I 
> do not see any way to write my own tokenizer that does what I need.
>
> I have this problem with pretty much all of the tokenizers other than 
> Whitespace.  There are situations where I would like to use some of 
> the others, but the punctuation-splitting behavior is a major problem 
> for me.
>
> Do I have any options?  I have never looked at the ICU code from IBM, 
> so I don't know if it would require major surgery there.

Related problem: The entire reason I started down this path is that I'd 
like to handle CJK better with CJKBigramFilter.  It appears that unless 
you use StandardTokenizer, ClassicTokenizer, or ICUTokenizer, 
CJKBigramFilter doesn't work ... but none of these tokenizers will 
handle punctuation the way I need.
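
To illustrate with a sketch (not my actual schema), CJKBigramFilter 
depends on the token types those tokenizers emit, so a chain like this 
does produce CJK bigrams; the trouble is that the tokenizer has already 
split "Word1-Word2" on the hyphen before WDF could see it:

    <analyzer>
      <!-- emits token types such as <IDEOGRAPHIC> and <HIRAGANA>,
           which CJKBigramFilter relies on -->
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <!-- pairs adjacent CJK tokens into bigrams -->
      <filter class="solr.CJKBigramFilterFactory"
              han="true" hiragana="true" katakana="true" hangul="true"/>
    </analyzer>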

I seem to remember a discussion about this some time ago, saying that a 
future version of CJKBigramFilter would drop the requirement that each 
token be tagged with a token type by the tokenizer.

Do I need to file an issue about this, and/or start a new discussion thread?

Thanks,
Shawn


Re: Solr/Lucene Tokenizers - cannot get the behavior I need

Posted by Shawn Heisey <so...@elyograg.org>.
On 11/16/2012 12:52 PM, Shawn Heisey wrote:
> On 11/16/2012 12:36 PM, Jack Krupansky wrote:
>> Generally, you don't need the preserveOriginal attribute for WDF. 
>> Generate both the word parts and the concatenated terms, and queries 
>> should work fine without the original. The separated terms will be 
>> indexed as a sequence, and the split/separated terms will generate a 
>> phrase query that matches the indexed sequence. And if you index the 
>> concatenated terms, that can be queried as well.
>>
>> With that issue out of the way, is there a remaining issue here?
>
> You're right, that's handled by catenateWords.  I do need 
> preserveOriginal for other things, though.  I think it's unimportant 
> for this discussion.  I may consider removing it at a later stage, but 
> right now our assessment is that we need it.
>
> The immediate problem is that when ICUTokenizer is done with an input 
> of "Word1-Word2" I am left with two tokens, Word1 and Word2.  The 
> punctuation in the middle is gone.  Even if WDF is the very next thing 
> in the analysis chain, there's nothing for it to do - the fact that 
> Word1 and Word2 were connected by punctuation is entirely lost.

Ideally I would like to see a "splitOnPunctuation" option on most of the 
available tokenizers, but if a filter were available that implemented one 
subset of ICUTokenizer's functionality - splitting tokens on script 
changes - I would have a solution in combination with WhitespaceTokenizer.

I have been looking at the source code related to ICUTokenizer, trying 
to get a handle on how it works.  Based on what I've learned so far, I'm 
not sure that punctuation can be ignored in the way that I need.  If 
someone knows it well enough to comment, I would love to know for sure.

Thanks,
Shawn


Re: Solr/Lucene Tokenizers - cannot get the behavior I need

Posted by Shawn Heisey <so...@elyograg.org>.
On 11/16/2012 12:36 PM, Jack Krupansky wrote:
> Generally, you don't need the preserveOriginal attribute for WDF. 
> Generate both the word parts and the concatenated terms, and queries 
> should work fine without the original. The separated terms will be 
> indexed as a sequence, and the split/separated terms will generate a 
> phrase query that matches the indexed sequence. And if you index the 
> concatenated terms, that can be queried as well.
>
> With that issue out of the way, is there a remaining issue here?

You're right, that's handled by catenateWords.  I do need 
preserveOriginal for other things, though.  I think it's unimportant for 
this discussion.  I may consider removing it at a later stage, but right 
now our assessment is that we need it.

The immediate problem is that when ICUTokenizer is done with an input of 
"Word1-Word2" I am left with two tokens, Word1 and Word2.  The 
punctuation in the middle is gone.  Even if WDF is the very next thing 
in the analysis chain, there's nothing for it to do - the fact that 
Word1 and Word2 were connected by punctuation is entirely lost.
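
To make the ordering problem concrete, here is a sketch (not a working 
configuration for my case) of the chain in question:

    <analyzer>
      <!-- "Word1-Word2" leaves this tokenizer as two tokens, Word1 and Word2 -->
      <tokenizer class="solr.ICUTokenizerFactory"/>
      <!-- no token contains punctuation any more, so this never produces
           the catenated form Word1Word2 -->
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1"
              catenateWords="1"
              preserveOriginal="1"/>
    </analyzer>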

Thanks,
Shawn


Re: Solr/Lucene Tokenizers - cannot get the behavior I need

Posted by Jack Krupansky <ja...@basetechnology.com>.
Generally, you don't need the preserveOriginal attribute for WDF. Generate 
both the word parts and the concatenated terms, and queries should work fine 
without the original. The separated terms will be indexed as a sequence, and 
the split/separated terms will generate a phrase query that matches the 
indexed sequence. And if you index the concatenated terms, that can be 
queried as well.
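
A minimal sketch of the WDF settings I have in mind (values illustrative):

    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            catenateWords="1"
            preserveOriginal="0"/>

With that, "Word1-Word2" is indexed as the sequence Word1, Word2 plus 
the catenated term Word1Word2; the same splitting at query time turns 
"Word1-Word2" into a phrase query that matches the indexed sequence, and 
the run-together form can match the catenated term.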

With that issue out of the way, is there a remaining issue here?

-- Jack Krupansky
