Posted to solr-user@lucene.apache.org by Shawn Heisey <so...@elyograg.org> on 2010/08/26 17:45:45 UTC

Multiple passes with WordDelimiterFilterFactory

  Can I pass my data through WordDelimiterFilterFactory more than once?  
It occurs to me that I might get better results if I can do some of the 
filters separately and use preserveOriginal on some of them but not others.

Currently I am using the following definition on both indexing and 
querying.  Would it make sense to do the two differently?

<filter class="solr.WordDelimiterFilterFactory"
   splitOnCaseChange="1"
   splitOnNumerics="1"
   stemEnglishPossessive="1"
   generateWordParts="1"
   generateNumberParts="1"
   catenateWords="1"
   catenateNumbers="1"
   catenateAll="0"
   preserveOriginal="1"
/>

Thanks,
Shawn


Re: Multiple passes with WordDelimiterFilterFactory

Posted by Shawn Heisey <so...@elyograg.org>.
  On 8/30/2010 9:01 AM, Shawn Heisey wrote:
>  On 8/29/2010 2:17 PM, Erick Erickson wrote:
>> <<<charFilters are applied even before the tokenizer>>>
>> Try putting this after any instances of, say, WhitespaceTokenizerFactory
>> in your analyzer definition, and I believe you'll see that this is not
>> true.
>> At least looking at this in the analysis page from SOLR admin sure
>> doesn't seem to support that assertion.
>
> It was the analysis page (branch_3x revision 990461) that told me that 
> my charFilter was applied first.  I had not actually tried it for 
> real.  I was in the process of trying it for real today with a new 
> regex, but I am running into trouble with it.  The regex with a
> custom range in brackets (even run through an XML encoder) won't allow 
> Solr to initialize.  I also tried [[:punc:]] and \p.
>
> If anyone has a regex that matches all punctuation and works with 
> Solr, please share it.

I finally figured out how to match punctuation in java regex, but it's 
not helping.  All the fields of that type are completely empty after a 
test index, and analysis shows the charFilter going first and eating 
everything:

http://www.elyograg.org/punct_analysis.png

That's not the order in which I have it defined:

<fieldType name="text" class="solr.TextField" sortMissingLast="true" 
positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<charFilter class="solr.PatternReplaceCharFilterFactory"
           pattern="(\p{Punct}*)(.*)(\p{Punct}*)"
           replaceWith="$2"
         />
<filter class="solr.WordDelimiterFilterFactory"
... [snip] ...
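For reference, Solr applies charFilter elements to the raw field value before the tokenizer regardless of where they appear in the definition, so a declaration that mirrors the actual execution order would list the charFilter first. This is a sketch only, reusing the pattern above unchanged; reordering the declaration does not by itself fix the term-eating behavior:

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- charFilters always run first, over the raw field value -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(\p{Punct}*)(.*)(\p{Punct}*)"
                replaceWith="$2"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <!-- ... remaining filters ... -->
  </analyzer>
</fieldType>
```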



Re: Multiple passes with WordDelimiterFilterFactory

Posted by Shawn Heisey <el...@elyograg.org>.
  On 8/29/2010 2:17 PM, Erick Erickson wrote:
> <<<charFilters are applied even before the tokenizer>>>
> Try putting this after any instances of, say, WhitespaceTokenizerFactory
> in your analyzer definition, and I believe you'll see that this is not
> true.
> At least looking at this in the analysis page from SOLR admin sure doesn't
> seem to support that assertion.

It was the analysis page (branch_3x revision 990461) that told me that 
my charFilter was applied first.  I had not actually tried it for real.  
I was in the process of trying it for real today with a new regex, but I 
am running into trouble with it.  The regex with a custom range in
brackets (even run through an XML encoder) won't allow Solr to 
initialize.  I also tried [[:punc:]] and \p.

If anyone has a regex that matches all punctuation and works with Solr, 
please share it.
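For the record, Java's regex flavor spells the POSIX punctuation class as \p{Punct} rather than [[:punct:]]. A minimal standalone sketch; the class and method names here are illustrative, not part of Solr:

```java
// Demonstrates Java's \p{Punct} class, which matches ASCII punctuation.
public class PunctDemo {
    // Strip runs of punctuation from the start and end of a token,
    // leaving interior punctuation (such as hyphens) intact.
    static String stripPunct(String token) {
        return token.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
    }

    public static void main(String[] args) {
        System.out.println(stripPunct("championship."));   // championship
        System.out.println(stripPunct("'04"));             // 04
        System.out.println(stripPunct("wolf-biederman"));  // wolf-biederman
    }
}
```

Note that in schema.xml the pattern attribute needs only single backslashes (\p{Punct}), since XML does not treat the backslash specially.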

Thanks,
Shawn


Re: Multiple passes with WordDelimiterFilterFactory

Posted by Erick Erickson <er...@gmail.com>.
There's nothing built into SOLR that I know of that'll deal with
auto-detecting multiple languages and "doing the right thing". I know
there's been discussion of that; searching the users' list might help...
You may have to write your own analyzer that tries to do this, but I
have no clue how you'd go about it.

<<<charFilters are applied even before the tokenizer>>>
Try putting this after any instances of, say, WhitespaceTokenizerFactory
in your analyzer definition, and I believe you'll see that this is not
true.
At least looking at this in the analysis page from SOLR admin sure doesn't
seem to support that assertion.

This last doesn't help much with the different character sets, though...

I'll have to leave any other insights to wiser heads than mine, though...

Best
Erick

On Sun, Aug 29, 2010 at 12:44 PM, Shawn Heisey <so...@elyograg.org> wrote:

>  Thank you for taking the time to help.  The way I've got the word
> delimiter index filter set up with only one pass, "wolf-biederman" will
> result in wolf, biederman, wolfbiederman, and wolf-biederman.  With two
> passes, the last one is not present.  One pass changes "gremlin's" to
> gremlin and gremlin's.  Two passes results in gremlin and gremlins.
>
> I was trying to use the PatternReplaceCharFilterFactory to strip leading
> and trailing punctuation, but it didn't work.  It seems that charFilters are
> applied even before the tokenizer, which will not produce the results I
> want, and the filter I'd come up with was eating everything, producing no
> results.  I later realized that it would not work with radically different
> character sets like Arabic and Cyrillic, even if I solved those problems.
>  Is there a regular filter that could strip leading/trailing punctuation?
>
> As for stemming, we have no effective way to separate the languages.  Most
> of the content is English, but we also have Spanish, Arabic, Russian,
> German, French, and possibly a few others.  For that reason, I'm not using
> stemming.  I've been thinking that I might want to use an English stemmer
> anyway to improve results on most of the content, but I haven't done any
> testing yet.
>
> Thanks,
> Shawn
>
>
>
> On 8/29/2010 12:28 PM, Erick Erickson wrote:
>
>> Look at the tokenizer/filter chain that makes up your analyzers, and see:
>>
>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
>>
>> for other tokenizer/analyzer/filter options.
>>
>> You're on the right track looking at the various choices provided, and
>> I suspect you'll find what you need...
>>
>> Be a little cautious about preserving things. Your users will often be
>> more
>> confused than helped if you require hyphens for a match. Ditto with
>> possessives, plurals, etc. You might want to look at stemmers....
>>
>
>

Re: Multiple passes with WordDelimiterFilterFactory

Posted by Shawn Heisey <so...@elyograg.org>.
  Thank you for taking the time to help.  The way I've got the word 
delimiter index filter set up with only one pass, "wolf-biederman" will 
result in wolf, biederman, wolfbiederman, and wolf-biederman.  With two 
passes, the last one is not present.  One pass changes "gremlin's" to 
gremlin and gremlin's.  Two passes results in gremlin and gremlins.

I was trying to use the PatternReplaceCharFilterFactory to strip leading 
and trailing punctuation, but it didn't work.  It seems that charFilters 
are applied even before the tokenizer, which will not produce the 
results I want, and the filter I'd come up with was eating everything, 
producing no results.  I later realized that it would not work with 
radically different character sets like Arabic and Cyrillic, even if I 
solved those problems.  Is there a regular filter that could strip 
leading/trailing punctuation?

As for stemming, we have no effective way to separate the languages.  
Most of the content is English, but we also have Spanish, Arabic, 
Russian, German, French, and possibly a few others.  For that reason, 
I'm not using stemming.  I've been thinking that I might want to use an 
English stemmer anyway to improve results on most of the content, but I 
haven't done any testing yet.

Thanks,
Shawn


On 8/29/2010 12:28 PM, Erick Erickson wrote:
> Look at the tokenizer/filter chain that makes up your analyzers, and see:
>
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
>
> for other tokenizer/analyzer/filter options.
>
> You're on the right track looking at the various choices provided, and
> I suspect you'll find what you need...
>
> Be a little cautious about preserving things. Your users will often be more
> confused than helped if you require hyphens for a match. Ditto with
> possessives, plurals, etc. You might want to look at stemmers....


Re: Multiple passes with WordDelimiterFilterFactory

Posted by Erick Erickson <er...@gmail.com>.
Look at the tokenizer/filter chain that makes up your analyzers, and see:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

for other tokenizer/analyzer/filter options.

You're on the right track looking at the various choices provided, and
I suspect you'll find what you need...

Be a little cautious about preserving things. Your users will often be more
confused than helped if you require hyphens for a match. Ditto with
possessives, plurals, etc. You might want to look at stemmers....

Best
Erick

On Sat, Aug 28, 2010 at 6:20 PM, Shawn Heisey <el...@elyograg.org> wrote:

>  It's metadata for a collection of 45 million documents that is mostly
> photos, with some videos and text.  The data is imported from a MySQL
> database and split among six large shards (each nearly 13GB) and a small
> shard with data added in the last week.  That works out to between 300,000
> and 500,000 documents.
>
> I am mostly trying to think of ways to drastically reduce the index size
> without reducing the functionality.  Using copyField would just make it
> larger.
>
> I would like to make it so that I don't have two terms when there's a
> punctuation character at the beginning or end of a word.  For instance, one
> field value that I just analyzed ends up with terms like the following,
> which are unneeded duplicates:
>
>
> championship.
> championship
> '04
> 04
> wisconsin.
> wisconsin
>
> Since I was already toying around, I just tested the whole notion.  I ran
> it through once with just generateWordParts and catenateWords enabled, then
> again with all the options including preserveOriginal enabled.  A test
> analysis of input with 59 whitespace separated words showed 93 terms with
> the single filter and 77 with two.  The only drop in term quality that I
> noticed was that possessive words (apostrophe-s) no longer have the original
> preserved.  I haven't yet decided whether that's a problem.
>
>
> Shawn
>
>
> On 8/27/2010 11:00 AM, Erick Erickson wrote:
>
>> I agree with Marcus, the usefulness of passing through WDF twice
>> is suspect. You can always do a copyfield to a completely different
>> field and do whatever you want there, copyfield forks the raw input
>> to the second field, not the analyzed stream...
>>
>> What is it you're really trying to accomplish? Your use-case would
>> help us help you.
>>
>> About defining things differently in index and analysis. Sure, it can
>> make sense. But, especially with WDF it's tricky. Spend some
>> significant time in the admin analysis page looking at the effects
>> of various configurations before you decide.
>>
>> Best
>> Erick
>>
>
>

Re: Multiple passes with WordDelimiterFilterFactory

Posted by Shawn Heisey <el...@elyograg.org>.
  It's metadata for a collection of 45 million documents that is mostly 
photos, with some videos and text.  The data is imported from a MySQL 
database and split among six large shards (each nearly 13GB) and a small 
shard with data added in the last week.  That works out to between 
300,000 and 500,000 documents.

I am mostly trying to think of ways to drastically reduce the index size 
without reducing the functionality.  Using copyField would just make it 
larger.

I would like to make it so that I don't have two terms when there's a 
punctuation character at the beginning or end of a word.  For instance, 
one field value that I just analyzed ends up with terms like the 
following, which are unneeded duplicates:

championship.
championship
'04
04
wisconsin.
wisconsin

Since I was already toying around, I just tested the whole notion.  I 
ran it through once with just generateWordParts and catenateWords 
enabled, then again with all the options including preserveOriginal 
enabled.  A test analysis of input with 59 whitespace separated words 
showed 93 terms with the single filter and 77 with two.  The only drop 
in term quality that I noticed was that possessive words (apostrophe-s) 
no longer have the original preserved.  I haven't yet decided whether 
that's a problem.

Shawn


On 8/27/2010 11:00 AM, Erick Erickson wrote:
> I agree with Marcus, the usefulness of passing through WDF twice
> is suspect. You can always do a copyfield to a completely different
> field and do whatever you want there, copyfield forks the raw input
> to the second field, not the analyzed stream...
>
> What is it you're really trying to accomplish? Your use-case would
> help us help you.
>
> About defining things differently in index and analysis. Sure, it can
> make sense. But, especially with WDF it's tricky. Spend some
> significant time in the admin analysis page looking at the effects
> of various configurations before you decide.
>
> Best
> Erick


Re: Multiple passes with WordDelimiterFilterFactory

Posted by Shawn Heisey <so...@elyograg.org>.
  On 8/28/2010 7:59 PM, Shawn Heisey wrote:
> The only drop in term quality that I noticed was that possessive words 
> (apostrophe-s) no longer have the original preserved.  I haven't yet 
> decided whether that's a problem.

I finally did notice another drop in term quality from the dual pass - 
words with punctuation in the middle (like wolf-biederman) are not 
preserved with that punctuation intact.  I need a different filter to 
strip non-alphanumerics from the beginning and end of terms, one that 
runs after the tokenizer and the ASCII folding filter but before the word 
delimiter filter.  Does such a thing already exist, or do I just need to 
use something that does regex? Are there any recommended regex patterns 
out there for this?
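One candidate worth trying is solr.PatternReplaceFilterFactory: unlike a charFilter, it is an ordinary token filter, so it runs exactly where it sits in the chain. A hedged sketch of the placement described above, with an untested regex:

```xml
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <!-- Strip leading/trailing punctuation from each token -->
  <filter class="solr.PatternReplaceFilterFactory"
          pattern="^\p{Punct}+|\p{Punct}+$"
          replacement=""
          replace="all"/>
  <!-- word delimiter filter and the rest of the chain follow -->
</analyzer>
```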

Thanks,
Shawn


Re: Multiple passes with WordDelimiterFilterFactory

Posted by Shawn Heisey <so...@elyograg.org>.
It's metadata for a collection of 45 million documents that is mostly 
photos, with some videos and text.  The data is imported from a MySQL 
database and split among six large shards (each nearly 13GB) and a small 
shard with data added in the last week, which usually works out to 
between 300,000 and 500,000 documents.

My goal is to reduce the index size without reducing the functionality. 
  Using copyField would just make it larger.

The biggest issue to solve is making sure that I don't have two terms 
when there's a punctuation character at the beginning or end of a word. 
  For instance, one chunk of text that I just analyzed ends up with 
terms like the following, which are unneeded duplicates:

championship.
championship
'04
04
wisconsin.
wisconsin

Since I was already toying around, I just tested the whole notion with 
the analysis tool.  I configured two filter steps - the first with just 
generateWordParts and catenateWords enabled, the second with all the 
options including preserveOriginal enabled.  A test analysis of input 
with 59 whitespace separated words showed 93 terms with the single 
filter and 77 with two.  The only drop in term quality that I noticed 
was that possessive words (apostrophe-s) no longer have the original 
preserved.  I haven't yet decided whether that's a problem.

Shawn


On 8/27/2010 11:00 AM, Erick Erickson wrote:
> I agree with Marcus, the usefulness of passing through WDF twice
> is suspect. You can always do a copyfield to a completely different
> field and do whatever you want there, copyfield forks the raw input
> to the second field, not the analyzed stream...
>
> What is it you're really trying to accomplish? Your use-case would
> help us help you.
>
> About defining things differently in index and analysis. Sure, it can
> make sense. But, especially with WDF it's tricky. Spend some
> significant time in the admin analysis page looking at the effects
> of various configurations before you decide.
>
> Best
> Erick


Re: Multiple passes with WordDelimiterFilterFactory

Posted by Erick Erickson <er...@gmail.com>.
I agree with Marcus, the usefulness of passing through WDF twice
is suspect. You can always do a copyfield to a completely different
field and do whatever you want there, copyfield forks the raw input
to the second field, not the analyzed stream...

What is it you're really trying to accomplish? Your use-case would
help us help you.

About defining things differently in index and analysis. Sure, it can
make sense. But, especially with WDF it's tricky. Spend some
significant time in the admin analysis page looking at the effects
of various configurations before you decide.

Best
Erick

On Fri, Aug 27, 2010 at 4:26 AM, Markus Jelsma <ma...@buyways.nl> wrote:

> It's just a configured filter, so you should be able to define it twice.
> Have
> you tried it? But it might be tricky: the output from the first will be the
> input of the second, so I doubt the usefulness of this approach.
>
>
> On Thursday 26 August 2010 17:45:45 Shawn Heisey wrote:
> >   Can I pass my data through WordDelimiterFilterFactory more than once?
> > It occurs to me that I might get better results if I can do some of the
> > filters separately and use preserveOriginal on some of them but not
> others.
> >
> > Currently I am using the following definition on both indexing and
> > querying.  Would it make sense to do the two differently?
> >
> > <filter class="solr.WordDelimiterFilterFactory"
> >    splitOnCaseChange="1"
> >    splitOnNumerics="1"
> >    stemEnglishPossessive="1"
> >    generateWordParts="1"
> >    generateNumberParts="1"
> >    catenateWords="1"
> >    catenateNumbers="1"
> >    catenateAll="0"
> >    preserveOriginal="1"
> > />
> >
> > Thanks,
> > Shawn
> >
>
> Markus Jelsma - Technisch Architect - Buyways BV
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>
>

Re: Multiple passes with WordDelimiterFilterFactory

Posted by Markus Jelsma <ma...@buyways.nl>.
It's just a configured filter, so you should be able to define it twice. Have 
you tried it? But it might be tricky: the output from the first will be the 
input of the second, so I doubt the usefulness of this approach.
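Defining it twice would just mean listing the filter back-to-back in the analyzer chain, each pass with its own options. A sketch only, with the option split chosen purely for illustration:

```xml
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- First pass: split and catenate words, keep the original token -->
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1"
          catenateWords="1"
          preserveOriginal="1"/>
  <!-- Second pass: handle numerics, without preserving originals -->
  <filter class="solr.WordDelimiterFilterFactory"
          splitOnNumerics="1"
          generateNumberParts="1"
          catenateNumbers="1"
          preserveOriginal="0"/>
</analyzer>
```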


On Thursday 26 August 2010 17:45:45 Shawn Heisey wrote:
>   Can I pass my data through WordDelimiterFilterFactory more than once?
> It occurs to me that I might get better results if I can do some of the
> filters separately and use preserveOriginal on some of them but not others.
> 
> Currently I am using the following definition on both indexing and
> querying.  Would it make sense to do the two differently?
> 
> <filter class="solr.WordDelimiterFilterFactory"
>    splitOnCaseChange="1"
>    splitOnNumerics="1"
>    stemEnglishPossessive="1"
>    generateWordParts="1"
>    generateNumberParts="1"
>    catenateWords="1"
>    catenateNumbers="1"
>    catenateAll="0"
>    preserveOriginal="1"
> />
> 
> Thanks,
> Shawn
> 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350