Posted to solr-user@lucene.apache.org by octopus <oc...@gmail.com> on 2015/06/27 12:27:33 UTC

Solr Wildcard Search for large amount of text

Hi, I'm looking at Solr's features for wildcard search used for a large
amount of text. I read on the net that solr.EdgeNGramFilterFactory is used
to generate tokens for wildcard searching. 

For Nigerian => "ni", "nig", "nige", "niger", "nigeri", "nigeria", "nigerian"

However, I have a large amount of text that requires wildcard search, and
it's not viable to use EdgeNGramFilterFactory, as the amount of processing
would be too great. Do you have any suggestions/advice, please?

Thank you so much for your time! 



--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-Wildcard-Search-for-large-amount-of-text-tp4214392.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr Wildcard Search for large amount of text

Posted by Upayavira <uv...@odoko.co.uk>.
That is one way to implement wildcards, but it isn't the most efficient.

Just index normally, tokenized, and search with an asterisk suffix, e.g.
foo*

Under the hood, Lucene compiles the wildcard into a finite state automaton,
which makes this kind of wildcard handling efficient.
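
For example, a request along these lines (the core and field names are
illustrative):

  http://localhost:8983/solr/mycore/select?q=body:foo*

No index-time n-gram analysis is needed for this; the prefix matching is
done against the existing term dictionary at query time.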

Upayavira

On Jun 27, 2015, at 11:27 AM, octopus wrote:
> Hi, I'm looking at Solr's features for wildcard search used for a large
> amount of text. I read on the net that solr.EdgeNGramFilterFactory is
> used
> to generate tokens for wildcard searching. 
> 
> For Nigerian => "ni", "nig", "nige", "niger", "nigeri", "nigeria", "nigerian"
> 
> However, I have a large amount of text that requires wildcard search, and
> it's not viable to use EdgeNGramFilterFactory, as the amount of processing
> would be too great. Do you have any suggestions/advice, please?
> 
> Thank you so much for your time! 

Re: Solr Wildcard Search for large amount of text

Posted by Jack Krupansky <ja...@gmail.com>.
What do you want actual user queries to look like? I mean, having to
explicitly write asterisks after every term is a real pain.

Indexing ngrams has the advantage that phrase queries and edismax phrase
boosting work automatically. Phrases don't work with explicit wildcard
queries.
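
For instance, with edismax, a request along these lines keeps phrase
boosting in play while the ngram field handles the prefix matching (all
field names here are hypothetical):

  defType=edismax&q=nigerian oil&qf=body_ngram&pf=body_ngram&ps=1

The query terms stay plain strings, so qf matching and pf phrase boosting
both apply without any wildcard syntax in the query itself.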

The only real downside to ngrams is that they explode the size of the
index. But memory is supposed to be cheap these days. I mean, compare the
cost of the extra RAM (to keep the full index in memory) to the cost to
users in lost productivity constructing queries, and the cost of expensive
staff helping them figure out why various queries don't work as expected.

How big is your corpus - number of documents and average document size?

-- Jack Krupansky

On Sat, Jun 27, 2015 at 6:27 AM, octopus <oc...@gmail.com> wrote:

> Hi, I'm looking at Solr's features for wildcard search used for a large
> amount of text. I read on the net that solr.EdgeNGramFilterFactory is used
> to generate tokens for wildcard searching.
>
> For Nigerian => "ni", "nig", "nige", "niger", "nigeri", "nigeria", "nigerian"
>
> However, I have a large amount of text that requires wildcard search, and
> it's not viable to use EdgeNGramFilterFactory, as the amount of processing
> would be too great. Do you have any suggestions/advice, please?
>
> Thank you so much for your time!

Re: Solr Wildcard Search for large amount of text

Posted by Erick Erickson <er...@gmail.com>.
Try it and see ;).

My experience is that wildcards work fine (although what counts as
"fine" is up to you to decide) _if_ you restrict them to at least two
leading "real" characters, and I actually prefer three, i.e.
ab* or abc*. Note that if you require leading
wildcards, use the ReversedWildcardFilterFactory.
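
Something along the lines of the reversed-wildcard field type that ships
in the Solr example schema (a sketch; the tuning parameters shown are the
example-schema defaults):

  <fieldType name="text_rev" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- index a reversed copy of each term so leading wildcards become prefix queries -->
      <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
              maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>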

I will vociferously argue that single-letter wildcards are
not useful anyway. I mean every single document in your
corpus will probably match every single-letter wildcard
(a*, b*, whatever), providing no benefit to the user.

And the need for wildcards can often be reduced or
eliminated if you can use autosuggest or autocomplete.
Of course, if you're trying to satisfy more complex use
cases where the user is composing their own complex
clauses, that may not apply.
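
If a suggester fits your use case, one way to wire it up in solrconfig.xml
is something like the following (a sketch only; the component name, source
field, and analyzer field type are placeholders):

  <searchComponent name="suggest" class="solr.SuggestComponent">
    <lst name="suggester">
      <str name="name">mySuggester</str>
      <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
      <str name="dictionaryImpl">DocumentDictionaryFactory</str>
      <str name="field">title</str>
      <str name="suggestAnalyzerFieldType">text_general</str>
    </lst>
  </searchComponent>

  <requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
    <lst name="defaults">
      <str name="suggest">true</str>
      <str name="suggest.dictionary">mySuggester</str>
      <str name="suggest.count">10</str>
    </lst>
    <arr name="components">
      <str>suggest</str>
    </arr>
  </requestHandler>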

FWIW,
Erick

On Sat, Jun 27, 2015 at 10:06 AM, Shawn Heisey <ap...@elyograg.org> wrote:
> On 6/27/2015 4:27 AM, octopus wrote:
>> Hi, I'm looking at Solr's features for wildcard search used for a large
>> amount of text. I read on the net that solr.EdgeNGramFilterFactory is used
>> to generate tokens for wildcard searching.
>>
>> For Nigerian => "ni", "nig", "nige", "niger", "nigeri", "nigeria", "nigerian"
>>
>> However, I have a large amount of text that requires wildcard search, and
>> it's not viable to use EdgeNGramFilterFactory, as the amount of processing
>> would be too great. Do you have any suggestions/advice, please?
>
> Both edgengrams and wildcards are ways to do this.  There are advantages
> and disadvantages to both ways.
>
> To do a wildcard search, Solr (Lucene really) must look up all the
> matching terms in the index and substitute them into the query so that
> it becomes a large number of simple string matches.  If you have a large
> number of terms in your index, that can be slow.  The expensive work
> (expanding the terms) is done for every single query.
>
> The edgengram filter does similar work, but it does it at *index* time,
> rather than query time.  At query time, you are doing a simple string
> match with one term, although the index contains many more terms,
> because the very expensive work was done at index time.
>
> It's difficult to know which approach will be more efficient on *your*
> index without experimentation, but there is a general rule when it comes
> to Solr performance: As much as possible, do the expensive work at index
> time.
>
> Thanks,
> Shawn
>

Re: Solr Wildcard Search for large amount of text

Posted by Shawn Heisey <ap...@elyograg.org>.
On 6/27/2015 4:27 AM, octopus wrote:
> Hi, I'm looking at Solr's features for wildcard search used for a large
> amount of text. I read on the net that solr.EdgeNGramFilterFactory is used
> to generate tokens for wildcard searching. 
> 
> For Nigerian => "ni", "nig", "nige", "niger", "nigeri", "nigeria", "nigerian"
> 
> However, I have a large amount of text that requires wildcard search, and
> it's not viable to use EdgeNGramFilterFactory, as the amount of processing
> would be too great. Do you have any suggestions/advice, please?

Both edgengrams and wildcards are ways to do this.  There are advantages
and disadvantages to both ways.

To do a wildcard search, Solr (Lucene really) must look up all the
matching terms in the index and substitute them into the query so that
it becomes a large number of simple string matches.  If you have a large
number of terms in your index, that can be slow.  The expensive work
(expanding the terms) is done for every single query.

The edgengram filter does similar work, but it does it at *index* time,
rather than query time.  At query time, you are doing a simple string
match with one term, although the index contains many more terms,
because the very expensive work was done at index time.
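
Conceptually (field names are illustrative, and the actual wildcard
rewrite is automaton-based rather than a literal boolean expansion):

  q=body:nig*      -> expanded at query time against the term dictionary,
                      roughly (nigel OR nigeria OR nigerian OR night OR ...)
  q=body_edge:nig  -> a single term lookup, because the "nig" gram was
                      already written to the index for each qualifying word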

It's difficult to know which approach will be more efficient on *your*
index without experimentation, but there is a general rule when it comes
to Solr performance: As much as possible, do the expensive work at index
time.

Thanks,
Shawn