Posted to solr-user@lucene.apache.org by Parvesh Garg <pa...@zettata.com> on 2013/10/28 08:39:59 UTC

Compound words

Hi,

I'm an infant in the Solr/Lucene family, just a couple of months old.

We are trying to find a way to combine words into a single compound word at
index and query time. E.g., if a document has "sea bird" in it, it should
be indexed as seabird, and any query containing sea bird should also look
for seabird, not only in qf but also in the pf, pf2, and pf3 fields. We are
using the edismax query parser.

Our problem is not at index time, where we have achieved this by writing our
own token filter, but at query time. Our token filter reads a dictionary
file with entries of the form "prefix,suffix" and emits regular and compound
tokens as it encounters them.
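
To make the filter's behaviour concrete, here is a minimal sketch in Python
rather than a real Lucene TokenFilter. The names load_pairs and emit_tokens
are hypothetical, and it assumes the compound token is emitted in addition
to the regular tokens, right after the suffix:

```python
from typing import Iterable, Iterator, Set, Tuple


def load_pairs(lines: Iterable[str]) -> Set[Tuple[str, str]]:
    """Parse "prefix,suffix" dictionary lines into (prefix, suffix) pairs."""
    pairs = set()
    for line in lines:
        line = line.strip()
        # Skip blank lines and lines that are not in "prefix,suffix" form.
        if not line or "," not in line:
            continue
        prefix, suffix = (part.strip().lower() for part in line.split(",", 1))
        pairs.add((prefix, suffix))
    return pairs


def emit_tokens(tokens: Iterable[str],
                pairs: Set[Tuple[str, str]]) -> Iterator[str]:
    """Yield every regular token, plus a compound token whenever two
    adjacent tokens form a known (prefix, suffix) pair."""
    previous = None
    for token in tokens:
        yield token
        if previous is not None and (previous, token) in pairs:
            # Assumption: the compound is emitted alongside the originals.
            yield previous + token
        previous = token


pairs = load_pairs(["sea,bird"])
print(list(emit_tokens(["a", "sea", "bird"], pairs)))
```

The sketch only works because it sees "sea" and "bird" one after the other
in the same token stream.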

We configured our filter at query time as well, but found that at query time
individual clauses like field:sea, field:bird, etc. are created first and
only then sent to the analyzer. First of all, can someone please confirm
whether this part of my understanding is correct? As a result, we are forced
to emit sea and bird as individual tokens, because the filter never sees
them in sequence.

Is it possible to achieve this by means other than pre-processing the query
before sending it to Solr? Can a CharFilter be used instead? Are CharFilters
applied before the query clauses are created?
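
For reference, the pre-processing we would like to avoid could look roughly
like the sketch below. It is plain Python run on the client before the query
is sent to Solr, not a Solr API. The function name compound_query is
hypothetical, and the sketch naively splits on whitespace, so it ignores
quotes, field prefixes, and operators in the query string:

```python
from typing import Set, Tuple


def compound_query(query: str, pairs: Set[Tuple[str, str]]) -> str:
    """Join adjacent words that form a known (prefix, suffix) pair."""
    words = query.split()
    joined = []
    i = 0
    while i < len(words):
        is_pair = (
            i + 1 < len(words)
            and (words[i].lower(), words[i + 1].lower()) in pairs
        )
        if is_pair:
            # Replace the two words with the single compound word.
            joined.append(words[i] + words[i + 1])
            i += 2
        else:
            joined.append(words[i])
            i += 1
    return " ".join(joined)


print(compound_query("sea bird nest", {("sea", "bird")}))
```

Here the compound replaces the two original words, which is an assumption;
emitting both forms would need extra query syntax.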

I can keep providing more details as necessary. This mail has already
crossed TL;DR limits for many :)

Parvesh Garg
http://www.zettata.com
+91 963 222 5540

Re: Compound words

Posted by Parvesh Garg <pa...@zettata.com>.
Hi Erick,

I tried with expand=true and got exactly the same tokens, i.e. seabiscuit,
sea, and bird at positions 1, 2, and 3 respectively. As per the Solr
documentation at
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory,
explicit mappings ignore the expand parameter in the schema.

So, the problem of creating compound words at query time remains.


Parvesh Garg
http://www.zettata.com



Re: Compound words

Posted by Parvesh Garg <pa...@zettata.com>.
Hi Roman, thanks for the link, will go through it.

Erick, will try with expand=true once and check out the results. Will
update this thread with the findings. I remember we rejected expand=true
because of some weird spaghetti problem. Will check it out again.

Thanks,

Parvesh Garg
http://www.zettata.com



Re: Compound words

Posted by Roman Chyla <ro...@gmail.com>.
Hi Parvesh,
I think you should check the following jira
https://issues.apache.org/jira/browse/SOLR-5379. You will find links there
to other possible solutions/problems :-)
Roman

Re: Compound words

Posted by Erick Erickson <er...@gmail.com>.
Consider setting expand=true at index time. That
puts all the tokens in your index, and then you
may not need to have any synonym
processing at query time since all the variants will
already be in the index.

As it is, you've replaced the words in the original with
synonyms, essentially collapsed them down to a single
word and then you have to do something at query time
to get matches. If all the variants are in the index, you
shouldn't have to. That's what I meant by "raw".

Best,
Erick



Re: Compound words

Posted by Parvesh Garg <pa...@zettata.com>.
Hi Erick,

Thanks for the suggestion. Like I said, I'm an infant.

We tried synonyms both ways (sea biscuit => seabiscuit and seabiscuit =>
sea biscuit) and didn't understand exactly how they worked. But I just
checked the analysis tool, and it seems to work perfectly fine at index
time. Now I can happily discard my own filter and 4 days of work. I'm happy
I got to know a few ways on how/when not to write a Solr filter :)

I tried the string "sea biscuit sea bird" with expand=false and the tokens
I got were seabiscuit, sea, and bird at positions 1, 2, and 3 respectively.
But at query time, when I enter the same string "sea biscuit sea bird",
using edismax with qf, pf2, and pf3, the parsedQuery looks like this:

"+((text:sea) (text:biscuit) (text:sea) (text:bird)) ((text:\"biscuit sea\")
(text:\"sea bird\")) ((text:\"seabiscuit sea\") (text:\"biscuit sea
bird\"))"

What I wanted instead was this

"+((text:seabiscuit) (text:sea) (text:bird)) ((text:\"seabiscuit sea\")
(text:\"sea bird\")) (text:\"seabiscuit sea bird\")"
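
To show where the phrase clauses in the query I wanted come from, here is a
rough Python illustration. It is not edismax's actual implementation; the
helper name phrase_shingles is hypothetical, and it assumes the query has
already been rewritten to "seabiscuit sea bird":

```python
from typing import List


def phrase_shingles(query: str, size: int) -> List[str]:
    """Return every run of `size` adjacent words in the query."""
    words = query.split()
    return [
        " ".join(words[i:i + size])
        for i in range(len(words) - size + 1)
    ]


rewritten = "seabiscuit sea bird"
print(phrase_shingles(rewritten, 2))  # word pairs, as in the pf2 clauses
print(phrase_shingles(rewritten, 3))  # word triples, as in the pf3 clause
```

The pairs correspond to the two pf2 phrases above and the single triple to
the pf3 phrase.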

Looks like there isn't any other way than to pre-process the query myself
and create the compound word. What do you mean by "just query the raw
string"? Am I still missing something?

Parvesh Garg
http://www.zettata.com
(This time I did remove my phone number :) )


Re: Compound words

Posted by Erick Erickson <er...@gmail.com>.
Why did you reject using synonyms? You can have multi-word
synonyms just fine at index time. At query time, since the
multiple words are already substituted in the index, you don't
need to do the same substitution; just query the raw strings.

I freely acknowledge you may have very good reasons for doing
this yourself; I'm just making sure you know what's already
there.

See:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory

Look particularly at the explanations for "sea biscuit" in that section.

Best,
Erick




Re: Compound words

Posted by Parvesh Garg <pa...@zettata.com>.
One more thing: is there a way to remove my "accidentally sent phone number
in the signature" from the previous mail? aarrrggghhh