Posted to solr-user@lucene.apache.org by Ellery Leung <el...@be-o.com> on 2011/11/23 03:54:36 UTC

If search matches index in the middle of filter chain, will result return?

Hi all,

I am using Solr 3.4 on Windows 7 with Jetty.

When I search on a field, the "Analysis" page in Solr shows the search
string matching the indexed tokens partway through the filter chain, not at
the end of it.  Here is the schema:

<fieldType name="substring_search" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="../../filters/filter-mappings.txt"/>
    <charFilter class="solr.HTMLStripCharFilterFactory" />
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.TrimFilterFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.CommonGramsFilterFactory" words="../../filters/stopwords.txt" ignoreCase="true"/>
    <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="20"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="../../filters/filter-mappings.txt"/>
    <charFilter class="solr.HTMLStripCharFilterFactory" />
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.TrimFilterFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
  </analyzer>
</fieldType>

 

I am searching for an email address: office@officeofficeoffice.com.  If I
search any text under 20 characters, results are returned.  But when I
search the whole string, office@officeofficeoffice.com, no results are
returned.

As the "index" part of the schema shows, the whole string matches the
output of the index chain before NGramFilterFactory.  But after
NGramFilterFactory, there is no match.

Here are my questions:

- Is this behavior normal?

- To match "office@officeofficeoffice.com", do I have to make
maxGramSize larger (like 70)?

 

Thank you in advance for all your support.  This is a great community.

RE: If search matches index in the middle of filter chain, will result return?

Posted by Ellery Leung <el...@be-o.com>.

Thanks Shawn.  So to recap:

- A "match" must be found after the entire chain has run, not in the
middle of the chain.
- Suggested: the index and query analyzer chains should be the same.

In my situation, if I make both of them the same, the results may be
misleading, because the query will also match other records that share the
same partial string.

But your suggestion is wonderful.  Thank you very much.

-----Original Message-----
From: Shawn Heisey [mailto:solr@elyograg.org] 
Sent: November 23, 2011 12:04 PM
To: solr-user@lucene.apache.org
Subject: Re: If search matches index in the middle of filter chain, will
result return?

On 11/22/2011 7:54 PM, Ellery Leung wrote:
> I am searching for an email address: office@officeofficeoffice.com.  If I
> search any text under 20 characters, results are returned.  But when I
> search the whole string, office@officeofficeoffice.com, no results are
> returned.
>
> As the "index" part of the schema shows, the whole string matches the
> output of the index chain before NGramFilterFactory.  But after
> NGramFilterFactory, there is no match.
>
> Here are my questions:
> - Is this behavior normal?

I'm pretty sure that your query must match after the entire analyzer 
chain is done.  I would expect that behavior to be normal.

> - To match "office@officeofficeoffice.com", do I have to make
> maxGramSize larger (like 70)?

If you were to increase the maxGramSize to 70, you would get a match in 
this case, but your index might get a lot larger, depending on what's in 
your source data.  That's probably not the right approach, though.

In general, you want to have your index and query analyzer chains 
exactly the same.  There are some exceptions, but I don't think the 
NGram filter is one of them.  The synonym filter and WordDelimiterFilter 
are examples where it is expected that your index and query analyzer 
chains will be different.
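
As a rough illustration of that kind of intentional asymmetry, here is a sketch loosely modelled on the general-purpose text field type in the stock Solr example schema.  The field type name text_asymmetric and the synonyms.txt file name are assumptions made for illustration, and the attribute values should be checked against your own Solr version.

```xml
<!-- Hypothetical field type: the name and synonyms.txt are illustrative assumptions -->
<fieldType name="text_asymmetric" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Index side: also emit catenated forms of split words -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Query side only: synonyms are expanded at search time -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <!-- Query side: split into parts, but do not catenate -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Here the two chains differ on purpose, but they are designed so that the tokens they emit still line up.  That is different from the substring_search case, where the query chain simply lacks the filter that produced the indexed tokens.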

Add the NGram and CommonGrams filters to the query chain, and everything 
should start working.  The simplest way to do that is a single analyzer 
for both indexing and querying, like the following.  You wouldn't even 
need to reindex, since you wouldn't be changing the index analyzer.

<fieldType name="substring_search" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="../../filters/filter-mappings.txt"/>
    <charFilter class="solr.HTMLStripCharFilterFactory" />
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.TrimFilterFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.CommonGramsFilterFactory" words="../../filters/stopwords.txt" ignoreCase="true"/>
    <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="20"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
  </analyzer>
</fieldType>

Regarding your NGram filter, I would actually increase the minGramSize 
to at least 2 and decrease the maxGramSize to something like 10 or 15, 
then reindex.
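
A minimal sketch of that change, assuming 2 and 15 as the concrete values (the advice above only gives "at least 2" and "10 or 15" as rough guidance).  Only the NGram line of the fieldType would differ.  Because the grams are generated at index time, documents indexed earlier keep their old grams until they are reindexed.

```xml
<!-- Assumed values: minGramSize=2 and maxGramSize=15 are one choice within the suggested ranges -->
<filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
```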

An additional note: CommonGrams may not be all that useful unless you 
are indexing large numbers of huge documents, like entire books.  This 
particular fieldType is not suitable for full text anyway, since it uses 
KeywordTokenizer.  Consider removing CommonGrams from this fieldType and 
reindexing.  Unless you are dealing with large amounts of text, consider 
removing it from the entire schema.  If you do remove it, it's usually 
not a good idea to replace it with a StopFilter.  The index size 
reduction found in stopword removal is not usually worth the potential 
loss of recall.
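
For reference, a sketch of what this fieldType could look like with CommonGrams removed and no StopFilter added in its place.  It assumes every other setting stays exactly as posted, including the original NGram sizes, and it should be tested against your own data before you adopt it.

```xml
<fieldType name="substring_search" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="../../filters/filter-mappings.txt"/>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- CommonGramsFilterFactory removed here; deliberately not replaced with a StopFilter -->
    <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="20"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

Since this changes the index-time chain, it only takes full effect after a reindex.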

Be prepared to test all reasonable analyzer combinations, rather than 
taking my word for it.

After reading the Hathi Trust blog, I tried CommonGrams on my own 
index.  It actually made things slower, not faster.  My typical document 
is only a few thousand bytes of metadata.  The Hathi Trust is indexing 
millions of full-length books.

Thanks,
Shawn
