Posted to solr-user@lucene.apache.org by Matthew Hall <mh...@informatics.jax.org> on 2010/12/03 19:14:21 UTC
Question about Solr Fieldtypes, Chaining of Tokenizers
Hey folks, I'm working with a fairly specific set of requirements for
our corpus that needs a somewhat tricky text type for both indexing and
searching.
The chain currently looks like this:
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory"
        pattern="(.*?)(\p{Punct}*)$"
        replacement="$1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory"
        ignoreCase="true"
        words="stopwords.txt"
        enablePositionIncrements="true"/>
<filter class="solr.SnowballPorterFilterFactory"
        language="English"
        protected="protwords.txt"/>
<filter class="solr.PatternReplaceFilterFactory"
        pattern="\p{Punct}"
        replacement=" "/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
Now you will notice that I'm trying to add a second tokenizer at the very
end of this chain. This is because the final filter replaces punctuation
with whitespace, and at that point I'd like to break the resulting tokens
up into smaller tokens.
The reason for this is that our corpus mixes ordinary English prose with
scientific nomenclature. For example, you could expect a string like "The
symposium of Tg<The>(RX3fg+and) gene studies" to be added to the index,
and parts of those phrases to be searched on.
We want to be able to remove the stopwords in the mostly-English parts
of these statements, which the whitespace tokenizer, followed by
trailing-punctuation removal, followed by the stop filter, takes care
of. We do not want to remove references to genetic information
contained in allele symbols and the like.
Sadly, as far as I can tell you cannot chain tokenizers in schema.xml,
so does anyone have suggestions on how this could be accomplished?
Oh, and let me add that the WordDelimiterFilter comes really close to
what I want, but since we are unwilling to move our Solr version to
trunk (we are on 1.4.x at the moment), the inability to turn off the
automatic phrase queries makes it a no-go. We need searches on
"left/right" to match "right/left".
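For the record, what I'd want, if I understand the newer releases correctly, is something like the sketch below. The autoGeneratePhraseQueries attribute is post-1.4 as far as I know, so this is a sketch of the upgrade path rather than anything we can run today, and the field type name is just illustrative:

```xml
<fieldType name="text_sci" class="solr.TextField"
           positionIncrementGap="100"
           autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            splitOnCaseChange="0"
            preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```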
My searches through the old material on this subject aren't really
showing me much, except some advice on using copyField. But my
understanding is that copyField simply takes the original input to the
field and then analyzes it in two different ways, depending on the
field definitions. It would be very nice if it copied the
already-analyzed version of the text... but that's not what it's doing, right?
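To illustrate what I mean, a declaration like this (field and type names are just made up for the example) hands the same raw source text to both fields, and each field then runs its own analysis chain from scratch:

```xml
<!-- both fields receive the same raw input text;
     each applies its own analyzer independently -->
<field name="description"     type="text_en"  indexed="true" stored="true"/>
<field name="description_sci" type="text_sci" indexed="true" stored="false"/>
<copyField source="description" dest="description_sci"/>
```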
Thanks for any advice on this matter.
Matt
Re: Question about Solr Fieldtypes, Chaining of Tokenizers
Posted by Robert Muir <rc...@gmail.com>.
On Fri, Dec 3, 2010 at 1:14 PM, Matthew Hall <mh...@informatics.jax.org> wrote:
> Oh, and let me add that the WordDelimiterFilter comes really close to what I
> want, but since we are unwilling to promote our solr version to the trunk
> (we are on the 1.4x) version atm, the inability to turn off the automatic
> phrase queries makes it a no go. We need to be able to make searches on
> "left/right" match "right/left."
>
if this is the case, it doesn't matter what your analysis does, it won't work.
your only workaround if you cannot upgrade, is to use PositionFilter
at query time... but then you cannot use phrase queries at all.
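in schema.xml terms that query-time workaround would look roughly like this (a sketch only; check that solr.PositionFilterFactory is present in your build):

```xml
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1"
          generateNumberParts="1"/>
  <!-- collapse all position increments to zero so the query parser
       builds a flat boolean query instead of a phrase query -->
  <filter class="solr.PositionFilterFactory"/>
</analyzer>
```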
Re: Question about Solr Fieldtypes, Chaining of Tokenizers
Posted by Matthew Hall <mh...@informatics.jax.org>.
Yes, that's my conclusion as well, Grant.
As for the example output:
The symposium of Tg<The>(RX3fg+and) gene studies
Should end up tokenizing to:
symposium tg the rx3fg and gene studi
Assuming I guessed right on the stemming.
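For clarity, here is a rough plain-Java sketch of the pipeline I have in mind. The "stemming" step is a crude stand-in that only covers this one example, and the stopword list is a toy subset; the real chain would of course use the actual Solr factories:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class AnalyzerSketch {
    // toy stopword list, standing in for stopwords.txt
    static final Set<String> STOPWORDS =
            new HashSet<String>(Arrays.asList("the", "of", "a", "an"));

    static List<String> analyze(String text) {
        List<String> out = new ArrayList<String>();
        for (String tok : text.split("\\s+")) {                    // WhitespaceTokenizer
            tok = tok.replaceAll("\\p{Punct}+$", "");              // strip trailing punctuation
            tok = tok.toLowerCase(Locale.ROOT);                    // LowerCaseFilter
            if (tok.isEmpty() || STOPWORDS.contains(tok)) {        // StopFilter
                continue;
            }
            if (tok.endsWith("es")) {                              // crude stand-in for Snowball
                tok = tok.substring(0, tok.length() - 2);
            }
            // remaining punctuation -> space, then the second whitespace split I'm after
            for (String piece : tok.replaceAll("\\p{Punct}", " ").trim().split("\\s+")) {
                if (!piece.isEmpty()) {
                    out.add(piece);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The symposium of Tg<The>(RX3fg+and) gene studies"));
        // prints [symposium, tg, the, rx3fg, and, gene, studi]
    }
}
```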
Anyhow, thanks for the confirmation guys.
Matt
On 12/4/2010 8:18 PM, Grant Ingersoll wrote:
> Could you expand on your example and show the output you want? FWIW, you could simply write a token filter that does the same thing as the WhitespaceTokenizer.
>
> -Grant
>
> On Dec 3, 2010, at 1:14 PM, Matthew Hall wrote:
>
>> [original message snipped]
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
>
Re: Question about Solr Fieldtypes, Chaining of Tokenizers
Posted by Grant Ingersoll <gs...@apache.org>.
Could you expand on your example and show the output you want? FWIW, you could simply write a token filter that does the same thing as the WhitespaceTokenizer.
-Grant
On Dec 3, 2010, at 1:14 PM, Matthew Hall wrote:
> [original message snipped]
--------------------------
Grant Ingersoll
http://www.lucidimagination.com