Posted to solr-user@lucene.apache.org by Brian Lamb <br...@journalexperts.com> on 2011/05/25 22:53:20 UTC

Edgengram

Hi all,

I'm running into some confusion with the way edgengram works. I have the
field set up as:

<fieldType name="edgengram" class="solr.TextField" positionIncrementGap="1000">
  <analyzer>
    <tokenizer class="solr.LowerCaseTokenizerFactory" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="100" side="front" />
  </analyzer>
</fieldType>

I've also set up my own similarity class that returns 1 as the idf score.
What I've found is that if I match the string "abcdefg" against a field
containing "abcdefghijklmnop", the idf for that match comes out as 7:

7.0 = idf(myfield: a=51 ab=23 abc=2 abcd=2 abcde=2 abcdef=2 abcdefg=2)

I get why that's happening, but is there a way to avoid that? Do I need to
create a new field type to achieve the desired effect?

Thanks,

Brian Lamb
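
For context, a minimal sketch of why the query expands: in Solr, an <analyzer> element with no type attribute is used at both index time and query time, so the field type in this post should behave like the explicit form below. The classes and parameters are the ones from the post; only the type attributes are spelled out. The query "abcdefg" therefore also passes through the edge n-gram filter and yields seven terms, each contributing an idf of 1.

```xml
<fieldType name="edgengram" class="solr.TextField" positionIncrementGap="1000">
  <!-- Index time: "abcdefghijklmnop" is indexed as a, ab, abc, and so on -->
  <analyzer type="index">
    <tokenizer class="solr.LowerCaseTokenizerFactory" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="100" side="front" />
  </analyzer>
  <!-- Query time: the same chain turns "abcdefg" into a, ab, abc, ... abcdefg -->
  <analyzer type="query">
    <tokenizer class="solr.LowerCaseTokenizerFactory" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="100" side="front" />
  </analyzer>
</fieldType>
```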

Re: Edgengram

Posted by Brian Lamb <br...@journalexperts.com>.

I think in my case LowerCaseTokenizerFactory will be sufficient because
there will never be spaces in this particular field. But thank you for the
useful link!

Thanks,

Brian Lamb

On Wed, Jun 1, 2011 at 11:44 AM, Erick Erickson <er...@gmail.com> wrote:

> Be a little careful here. LowerCaseTokenizerFactory is different from
> KeywordTokenizerFactory.
>
> LowerCaseTokenizerFactory will give you more than one term. For example,
> the string "Intelligence can't be MeaSurEd" will give you 5 terms,
> any of which may match:
> "intelligence", "can", "t", "be", "measured".
> KeywordTokenizerFactory followed by, say, LowerCaseFilterFactory
> would instead give you exactly one token:
> "intelligence can't be measured".
>
> So searching for "measured" would get a hit in the first case but
> not in the second. Searching for "intellig*" would hit both.
>
> Neither is better; just make sure they do what you want!
>
> This page will help a lot:
>
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LowerCaseTokenizerFactory
> as will the admin/analysis page.
>
> Best
> Erick
>

Re: Edgengram

Posted by Erick Erickson <er...@gmail.com>.

Be a little careful here. LowerCaseTokenizerFactory is different from
KeywordTokenizerFactory.

LowerCaseTokenizerFactory will give you more than one term. For example,
the string "Intelligence can't be MeaSurEd" will give you 5 terms,
any of which may match:
"intelligence", "can", "t", "be", "measured".
KeywordTokenizerFactory followed by, say, LowerCaseFilterFactory
would instead give you exactly one token:
"intelligence can't be measured".

So searching for "measured" would get a hit in the first case but
not in the second. Searching for "intellig*" would hit both.

Neither is better; just make sure they do what you want!

This page will help a lot:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LowerCaseTokenizerFactory
as will the admin/analysis page.

Best
Erick
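
A minimal sketch of the second option described in this message, assuming it is used as the query-side analyzer of the edgengram field type: KeywordTokenizerFactory keeps the whole input as one token, and LowerCaseFilterFactory then lowercases it.

```xml
<analyzer type="query">
  <!-- Emit the entire query string as a single token -->
  <tokenizer class="solr.KeywordTokenizerFactory" />
  <!-- Lowercase that token so it lines up with the lowercased index terms -->
  <filter class="solr.LowerCaseFilterFactory" />
</analyzer>
```

With this chain, "Intelligence can't be MeaSurEd" becomes the single term "intelligence can't be measured", as in the example in this message.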

On Wed, Jun 1, 2011 at 10:43 AM, Brian Lamb
<br...@journalexperts.com> wrote:
> Hi Tomás,
>
> Thank you very much for your suggestion. I took another crack at it using
> your recommendation and it worked ideally. The only thing I had to change
> was
>
> <analyzer type="query">
>  <tokenizer class="solr.KeywordTokenizerFactory" />
> </analyzer>
>
> to
>
> <analyzer type="query">
>  <tokenizer class="solr.LowerCaseTokenizerFactory" />
> </analyzer>
>
> The first did not produce any results but the second worked beautifully.
>
> Thanks!
>
> Brian Lamb
>

Re: Edgengram

Posted by Brian Lamb <br...@journalexperts.com>.

Hi Tomás,

Thank you very much for your suggestion. I took another crack at it using
your recommendation and it worked ideally. The only thing I had to change
was

<analyzer type="query">
  <tokenizer class="solr.KeywordTokenizerFactory" />
</analyzer>

to

<analyzer type="query">
  <tokenizer class="solr.LowerCaseTokenizerFactory" />
</analyzer>

The first did not produce any results but the second worked beautifully.

Thanks!

Brian Lamb
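
Putting the change together, the resulting field type would presumably look like the sketch below. It is assembled from the snippets in this thread rather than copied from a working schema, and the maxGramSize value is an assumption taken from Tomás's example.

```xml
<fieldType name="edgengram" class="solr.TextField" positionIncrementGap="1000">
  <!-- Index time: lowercase, split on non-letters, then emit front edge n-grams -->
  <analyzer type="index">
    <tokenizer class="solr.LowerCaseTokenizerFactory" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" side="front" />
  </analyzer>
  <!-- Query time: same tokenizer, but no n-gram filter, so "abcdefg" stays one term -->
  <analyzer type="query">
    <tokenizer class="solr.LowerCaseTokenizerFactory" />
  </analyzer>
</fieldType>
```

KeywordTokenizerFactory leaves case and non-letter characters untouched, while the index-side LowerCaseTokenizerFactory lowercases and splits on non-letters, so the two sides can emit different terms. The thread does not say which difference, if either, caused the empty result set.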

2011/5/31 Tomás Fernández Löbbe <to...@gmail.com>

> ...or also use the LowerCaseTokenizerFactory at query time for consistency,
> but not the edge ngram filter.
>

Re: Edgengram

Posted by Tomás Fernández Löbbe <to...@gmail.com>.

...or also use the LowerCaseTokenizerFactory at query time for consistency,
but not the edge ngram filter.

2011/5/31 Tomás Fernández Löbbe <to...@gmail.com>

> Hi Brian, I don't know if I understand what you are trying to achieve. You
> want the term query "abcdefg" to have an idf of 1 instead of 7? I think using
> the KeywordTokenizerFactory at query time should work. It would be
> something like:
>
> <fieldType name="edgengram" class="solr.TextField" positionIncrementGap="1000">
>   <analyzer type="index">
>     <tokenizer class="solr.LowerCaseTokenizerFactory" />
>     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" side="front" />
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.KeywordTokenizerFactory" />
>   </analyzer>
> </fieldType>
>
> This way, at query time "abcdefg" won't be turned into "a ab abc abcd abcde
> abcdef abcdefg". At index time it will.
>
> Regards,
> Tomás
>
>

Re: Edgengram

Posted by Tomás Fernández Löbbe <to...@gmail.com>.

Hi Brian, I don't know if I understand what you are trying to achieve. You
want the term query "abcdefg" to have an idf of 1 instead of 7? I think using
the KeywordTokenizerFactory at query time should work. It would be
something like:

<fieldType name="edgengram" class="solr.TextField" positionIncrementGap="1000">
  <analyzer type="index">
    <tokenizer class="solr.LowerCaseTokenizerFactory" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" side="front" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory" />
  </analyzer>
</fieldType>

This way, at query time "abcdefg" won't be turned into "a ab abc abcd abcde
abcdef abcdefg". At index time it will.

Regards,
Tomás


On Tue, May 31, 2011 at 1:07 PM, Brian Lamb
<br...@journalexperts.com> wrote:

> <fieldType name="edgengram" class="solr.TextField" positionIncrementGap="1000">
>   <analyzer>
>     <tokenizer class="solr.LowerCaseTokenizerFactory" />
>     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" side="front" />
>   </analyzer>
> </fieldType>
>
> I believe I used that link when I initially set up the field, and it worked
> great (and I'm still using it in other places). In this particular example,
> however, it does not appear to be practical for me. I mentioned that I have
> a similarity class that returns 1 for the idf, and in the case of an
> edgengram it returns 1 * length of the search string.
>
> Thanks,
>
> Brian Lamb
>

Re: Edgengram

Posted by Brian Lamb <br...@journalexperts.com>.

<fieldType name="edgengram" class="solr.TextField" positionIncrementGap="1000">
  <analyzer>
    <tokenizer class="solr.LowerCaseTokenizerFactory" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" side="front" />
  </analyzer>
</fieldType>

I believe I used that link when I initially set up the field, and it worked
great (and I'm still using it in other places). In this particular example,
however, it does not appear to be practical for me. I mentioned that I have a
similarity class that returns 1 for the idf, and in the case of an edgengram
it returns 1 * length of the search string.

Thanks,

Brian Lamb
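
For readers unfamiliar with the setup being described, a custom similarity is normally registered in schema.xml with a <similarity> element. The sketch below uses a hypothetical class name, com.example.ConstantIdfSimilarity, because the thread does not give the real one.

```xml
<!-- Goes inside the <schema> element of schema.xml -->
<!-- Hypothetical class name; assumed to return 1 from its idf method -->
<similarity class="com.example.ConstantIdfSimilarity" />
```

Because the idf is a constant 1 per term, a query that is analyzed into seven edge n-grams reports an idf of 7, which matches the explain output in the original post.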

On Tue, May 31, 2011 at 11:34 AM, bmdakshinamurthy@gmail.com <bmdakshinamurthy@gmail.com> wrote:

> Can you specify the analyzer you are using for your queries?
>
> Maybe you could use a KeywordAnalyzer for your queries so you don't end up
> matching parts of your query.
>
> http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
> This should help you.
>
> On Tue, May 31, 2011 at 8:24 PM, Brian Lamb
> <br...@journalexperts.com> wrote:
>
> > In this particular case, I will be doing a solr search based on user
> > preferences. So I will not be depending on the user to type "abcdefg".
> > That will be automatically generated based on user selections.
> >
> > The contents of the field do not contain spaces, and since I am creating
> > the search parameters, case isn't important either.
> >
> > Thanks,
> >
> > Brian Lamb
> >
> > On Tue, May 31, 2011 at 9:44 AM, Erick Erickson <erickerickson@gmail.com
> > >wrote:
> >
> > > That'll work for your case, although be aware that string types aren't
> > > analyzed at all,
> > > so case matters, as do spaces etc.....
> > >
> > > What is the use-case here? If you explain it a bit there might be
> > > better answers....
> > >
> > > Best
> > > Erick
> > >
> > > On Fri, May 27, 2011 at 9:17 AM, Brian Lamb
> > > <br...@journalexperts.com> wrote:
> > > > For this, I ended up just changing it to string and using "abcdefg*"
> to
> > > > match. That seems to work so far.
> > > >
> > > > Thanks,
> > > >
> > > > Brian Lamb
> > > >
> > > > On Wed, May 25, 2011 at 4:53 PM, Brian Lamb
> > > > <br...@journalexperts.com>wrote:
> > > >
> > > >> Hi all,
> > > >>
> > > >> I'm running into some confusion with the way edgengram works. I have
> > the
> > > >> field set up as:
> > > >>
> > > >> <fieldType name="edgengram" class="solr.TextField"
> > > >> positionIncrementGap="1000">
> > > >>    <analyzer>
> > > >>      <tokenizer class="solr.LowerCaseTokenizerFactory" />
> > > >>        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
> > > >> maxGramSize="100" side="front" />
> > > >>    </analyzer>
> > > >> </fieldType>
> > > >>
> > > >> I've also set up my own similarity class that returns 1 as the idf
> > > score.
> > > >> What I've found this does is if I match a string "abcdefg" against a
> > > field
> > > >> containing "abcdefghijklmnop", then the idf will score that as a 7:
> > > >>
> > > >> 7.0 = idf(myfield: a=51 ab=23 abc=2 abcd=2 abcde=2 abcdef=2
> abcdefg=2)
> > > >>
> > > >> I get why that's happening, but is there a way to avoid that? Do I
> > need
> > > to
> > > >> do a new field type to achieve the desired affect?
> > > >>
> > > >> Thanks,
> > > >>
> > > >> Brian Lamb
> > > >>
> > > >
> > >
> >
>
>
>
> --
> Thanks and Regards,
> DakshinaMurthy BM
>

Re: Edgengram

Posted by "bmdakshinamurthy@gmail.com" <bm...@gmail.com>.

Can you specify the analyzer you are using for your queries?

Maybe you could use a KeywordAnalyzer for your queries so you don't end up
matching parts of your query.
http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
This should help you.
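
A minimal sketch of that idea, assuming the field holds a single value with no spaces: give the field type separate index and query analyzers, so that only the indexed value is expanded into prefixes and the query "abcdefg" stays one term (one idf entry instead of seven). The type name "edgengram_prefix" is hypothetical, and KeywordTokenizerFactory plus LowerCaseFilterFactory are swapped in here for the LowerCaseTokenizerFactory used in the thread, so check the result on the analysis page of your own Solr version before relying on it.

```xml
<!-- Sketch only: "edgengram_prefix" is a hypothetical name, not from the thread. -->
<fieldType name="edgengram_prefix" class="solr.TextField" positionIncrementGap="1000">
  <!-- Index time: keep the value as one token, lowercase it, emit every leading prefix. -->
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" side="front" />
  </analyzer>
  <!-- Query time: no n-gram filter, so the query is matched as a single lowercased term. -->
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>
```

With maxGramSize="25", a query longer than 25 characters would find no matching indexed term, so the limit should be at least as long as the longest query you expect to generate.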

On Tue, May 31, 2011 at 8:24 PM, Brian Lamb
<br...@journalexperts.com> wrote:

> In this particular case, I will be doing a solr search based on user
> preferences. So I will not be depending on the user to type "abcdefg". That
> will be automatically generated based on user selections.
>
> The contents of the field do not contain spaces, and since I am creating the
> search parameters, case isn't important either.
>
> Thanks,
>
> Brian Lamb
>
>



-- 
Thanks and Regards,
DakshinaMurthy BM

Re: Edgengram

Posted by Brian Lamb <br...@journalexperts.com>.

In this particular case, I will be doing a solr search based on user
preferences. So I will not be depending on the user to type "abcdefg". That
will be automatically generated based on user selections.

The contents of the field do not contain spaces, and since I am creating the
search parameters, case isn't important either.

Thanks,

Brian Lamb

On Tue, May 31, 2011 at 9:44 AM, Erick Erickson <er...@gmail.com> wrote:

> That'll work for your case, although be aware that string types aren't
> analyzed at all, so case matters, as do spaces, etc.
>
> What is the use-case here? If you explain it a bit there might be
> better answers....
>
> Best
> Erick
>
>

Re: Edgengram

Posted by Erick Erickson <er...@gmail.com>.

That'll work for your case, although be aware that string types aren't
analyzed at all, so case matters, as do spaces, etc.
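
If case does become a concern later, one possible middle ground (a sketch, not something tested in this thread) is a TextField that keeps the whole value as one token but lowercases it. The type name "string_lc" is hypothetical.

```xml
<!-- Sketch only: "string_lc" is a hypothetical name. -->
<fieldType name="string_lc" class="solr.TextField" omitNorms="true">
  <analyzer>
    <!-- KeywordTokenizer emits the entire field value as a single token. -->
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <!-- Lowercasing makes exact and prefix matches case-insensitive at index time. -->
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>
```

Depending on the Solr version, wildcard terms such as abcdefg* may bypass the analyzer, so lowercasing the prefix on the client side is the safer assumption.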

What is the use-case here? If you explain it a bit there might be
better answers....

Best
Erick

On Fri, May 27, 2011 at 9:17 AM, Brian Lamb
<br...@journalexperts.com> wrote:
> For this, I ended up just changing it to string and using "abcdefg*" to
> match. That seems to work so far.
>
> Thanks,
>
> Brian Lamb
>
>

Re: Edgengram

Posted by Brian Lamb <br...@journalexperts.com>.

For this, I ended up just changing it to string and using "abcdefg*" to
match. That seems to work so far.
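
For reference, a sketch of what that change might look like in schema.xml. The field name "myfield" comes from the explain output quoted below; the "string" type definition and the indexed/stored attributes are assumptions based on a typical schema, not the poster's actual config.

```xml
<!-- StrField indexes the value verbatim as a single term (no analysis). -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true" />
<!-- Assumed field definition; adjust indexed/stored to your needs. -->
<field name="myfield" type="string" indexed="true" stored="true" />
```

The prefix query is then sent as q=myfield:abcdefg* and matches any value starting with "abcdefg", subject to the case and whitespace sensitivity of an unanalyzed field.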

Thanks,

Brian Lamb

On Wed, May 25, 2011 at 4:53 PM, Brian Lamb
<br...@journalexperts.com> wrote:

> Hi all,
>
> I'm running into some confusion with the way edgengram works. I have the
> field set up as:
>
> <fieldType name="edgengram" class="solr.TextField"
> positionIncrementGap="1000">
>    <analyzer>
>      <tokenizer class="solr.LowerCaseTokenizerFactory" />
>        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
> maxGramSize="100" side="front" />
>    </analyzer>
> </fieldType>
>
> I've also set up my own similarity class that returns 1 as the idf score.
> What I've found this does is if I match a string "abcdefg" against a field
> containing "abcdefghijklmnop", then the idf will score that as a 7:
>
> 7.0 = idf(myfield: a=51 ab=23 abc=2 abcd=2 abcde=2 abcdef=2 abcdefg=2)
>
> I get why that's happening, but is there a way to avoid that? Do I need to
> do a new field type to achieve the desired effect?
>
> Thanks,
>
> Brian Lamb
>