Posted to solr-user@lucene.apache.org by Tom Devel <de...@gmail.com> on 2015/03/03 22:51:25 UTC

Search over a multiValued field

Hi,

I am running Solr 5.0.0 and have a question about proximity search and
multiValued fields.

I am indexing XML files of the following form, with foundField being a field
defined as multiValued and of type text_en in my schema.xml.

<?xml version="1.0" encoding="UTF-8"?>
<add><doc>
<field name="id">8</field>
<field name="foundField">"Oranges from South California - ordered"</field>
<field name="foundField">"Green Apples - available"</field>
<field name="foundField">"Black Report Books - ordered"</field>
</doc></add>

There are several such documents, and I would like, for instance, to find
all documents that have both "Oranges" and "ordered" in a single foundField
value. The following proximity query takes care of it:

q=foundField:("oranges AND ordered"~2)

However, a field could have more words, and I cannot know in advance how far
apart the desired query words will be. Setting the proximity value too high
results in false positives: the following query also returns the document
(although "available" was in the entry about Apples):

foundField:("oranges AND available"~200)

I do not think that tweaking a proximity value is the correct approach.

How can I search so that the match must occur within a single value of a
multiValued field, as described above, without running into this problem?

Many thanks for any help

Re: Search over a multiValued field

Posted by Tom Devel <de...@gmail.com>.
Erick,

Thanks a lot for the explanation; it makes sense now.

Tom


Re: Search over a multiValued field

Posted by Erick Erickson <er...@gmail.com>.
bq: Does it mean that words between " symbols, such as "Orange ordered" are
treated as a single term, with (implicitly) AND conjunction between them?

Not at all. When you quote things, you're getting a "phrase query", perhaps
one with slop. So something like "a b" means that 'a' must appear right next
to 'b'. This is something like an AND in the sense that both terms must
appear, but it is far more restrictive since it takes into account the
position of the terms in the field.

"a b"~10 means that both words must appear within 10 transpositions in the
same field. You can think of "transposition" as how many intervening terms
there are, so something like "a b"~2 would match docs with "a x b", but not
"a x y z b".

And this is where positionIncrementGap comes in. By putting 1000 in for it,
you guarantee "a b"~999 won't match 'a' in one value of the field and 'b' in
another.

By contrast, a AND b would match across successive multiValued entries no
matter what the gap is.

HTH,
Erick
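
To make the effect of the gap concrete, here is the document from the
original post annotated with approximate token positions. This is only a
sketch under assumptions: the tokenizer is taken to emit one token per word
and to drop the quotes and the standalone hyphen, and G stands for whatever
positionIncrementGap is configured for the field type. Exact offsets may
differ by one depending on how positions are counted.

```xml
<add><doc>
  <field name="id">8</field>
  <!-- value 1: "oranges" and "ordered" sit only a few positions apart, so
       a small slop is enough to match them inside this value -->
  <field name="foundField">"Oranges from South California - ordered"</field>
  <!-- value 2: its first token starts roughly G positions after the last
       token of value 1, so "oranges" and "available" are more than G
       positions apart -->
  <field name="foundField">"Green Apples - available"</field>
  <!-- value 3: again roughly G positions after the end of value 2 -->
  <field name="foundField">"Black Report Books - ordered"</field>
</doc></add>
```

With a gap well below 200, the phrase "oranges available"~200 can bridge
values 1 and 2, which would explain the false positive in the original post.
With G set to 1000, the same query should no longer match this document,
while a slop large enough to span one value still matches "oranges" and
"ordered".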


Re: Search over a multiValued field

Posted by Tom Devel <de...@gmail.com>.
Jack,

This is exactly what I was looking for, thanks. I found the
positionIncrementGap attribute in the schema.xml for the text_en field type.

I was putting in "AND" because I read in the Solr documentation that "The
OR operator is the default conjunction operator."

Does it mean that words between " symbols, such as "Orange ordered" are
treated as a single term, with (implicitly) AND conjunction between them?

Where could I find more info about this?

I am currently reading
https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser

Thanks again


Re: Search over a multiValued field

Posted by Jack Krupansky <ja...@gmail.com>.
Just set the positionIncrementGap for the multivalued field to a much
higher value, like 1000 or 5000. That's the purpose of this attribute: to
ensure that reasonable proximity matches don't match across multiple values.

Also, leave "AND" out of the query phrases - you're just trying to match
the product name and availability.


-- Jack Krupansky
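
For reference, a minimal sketch of what that schema.xml change might look
like. The analyzer chain shown here is a simplified assumption, not the
stock text_en definition, and the indexed/stored attributes on foundField
are likewise assumed; only the positionIncrementGap value is the point.
Because token positions are written at index time, documents need to be
re-indexed after the change.

```xml
<!-- schema.xml (sketch): widen the position gap between the values of a
     multiValued field. Analyzer chain simplified for illustration. -->
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="1000">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- indexed/stored values are assumptions; multiValued and the type come
     from the original post -->
<field name="foundField" type="text_en" indexed="true" stored="true"
       multiValued="true"/>
```

Any phrase slop kept below the gap value then stays within a single value
of foundField.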
