
Posted to solr-user@lucene.apache.org by Suk-Hyun Cho <ch...@gmail.com> on 2011/08/02 02:08:15 UTC

Matching queries on a per-element basis against a multivalued field

I'm sure someone asked this before, but I couldn't find a previous post
regarding this.


The problem:


Let's say that I have a multivalued field called myFriends that tokenizes on
whitespace. Basically, I'm treating it like a List of Lists (attributes of
friends):


Document A:

myFriends = [
    "isCool=true SOME_JUNK_HERE gender=male bloodType=A"
]

Document B:

myFriends = [
    "isCool=true SOME_JUNK_HERE gender=female bloodType=O",
    "isCool=false SOME_JUNK_HERE gender=male bloodType=AB"
]

Now, let's say that I want to search for all the cool male friends I have.
Naively, I can query q=myFriends:isCool=true+AND+myFriends:gender=male.
However, this returns documents A and B, because the two criteria are tested
against the entire collection, rather than against individual elements. 
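
To make the difference concrete, here is a minimal Python sketch. It is a toy model, not Solr code: it assumes that a multivalued field is effectively indexed as one pool of tokens per document, and the function names are made up for illustration.

```python
# Toy model: a multivalued field behaves like one pool of tokens per document.
docs = {
    "A": ["isCool=true SOME_JUNK_HERE gender=male bloodType=A"],
    "B": [
        "isCool=true SOME_JUNK_HERE gender=female bloodType=O",
        "isCool=false SOME_JUNK_HERE gender=male bloodType=AB",
    ],
}


def field_level_match(values, required):
    """AND over tokens pooled from every value (what the naive query does)."""
    pooled = {tok for value in values for tok in value.split()}
    return all(term in pooled for term in required)


def element_level_match(values, required):
    """AND evaluated inside each value separately (what the question asks for)."""
    return any(all(term in value.split() for term in required) for value in values)


wanted = ["isCool=true", "gender=male"]
field_hits = sorted(d for d, v in docs.items() if field_level_match(v, wanted))
element_hits = sorted(d for d, v in docs.items() if element_level_match(v, wanted))
print(field_hits)    # ['A', 'B']
print(element_hits)  # ['A']
```

Document B passes the field-level test only because its two criteria are satisfied by two different friends.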


I could work around this by not tokenizing on whitespace and using
wildcards:


q=myFriends:isCool=true\ *\ gender=male


but this becomes painful when the query becomes more complex. What if I
wanted to find cool friends who are either type A or type O? I could do
q=myFriends:(isCool=true\ *\ bloodType=A+OR+isCool=true\ *\ bloodType=O).
And you can see that the number of criteria will just explode as queries get
more complex.
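
The growth can be sketched in a few lines of Python. This only builds query strings in the style shown above; the helper name is hypothetical, and it assumes the attributes are listed in the order they appear inside a value.

```python
from itertools import product


def wildcard_clauses(criteria):
    """Expand per-attribute alternatives into one wildcard clause per combination.

    criteria: a list of lists, one inner list of alternatives per attribute,
    in the order the attributes appear inside a field value.
    """
    return ["\\ *\\ ".join(combo) for combo in product(*criteria)]


clauses = wildcard_clauses([["isCool=true"], ["bloodType=A", "bloodType=O"]])
query = "myFriends:(" + "+OR+".join(clauses) + ")"
print(query)

# Three attributes with 2, 2 and 3 alternatives already need 12 clauses.
print(len(wildcard_clauses([["a", "b"], ["c", "d"], ["e", "f", "g"]])))
```

The number of clauses is the product of the number of alternatives per attribute, which is why the query text grows so quickly.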


There are other methods that I've considered, such as duplicating documents
for every friend, like so:


Document A1:

myFriend = [
    "isCool=true",
    "gender=male",
    "bloodType=A"
]

Document B1:

myFriend = [
    "isCool=true",
    "gender=female",
    "bloodType=O"
]

Document B2:

myFriend = [
    "isCool=false",
    "gender=male",
    "bloodType=AB"
]

but this would be less than desirable.

I would like to hear any other ideas around solving this problem, but going
back to the original question, is there a way to match multiple criteria on
a per-item basis rather than against the entire multivalued field?

--
View this message in context: http://lucene.472066.n3.nabble.com/Matching-queries-on-a-per-element-basis-against-a-multivalued-field-tp3217432p3217432.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Matching queries on a per-element basis against a multivalued field

Posted by "Smiley, David W." <ds...@mitre.org>.

On Aug 2, 2011, at 5:47 PM, eks dev wrote:

> Sure, I know...,
> the point I was trying to make: "if someone serious like Lucid is
> using Solr 4.x as a core technology for its own customers, the trunk could
> not be all that bad" => release date not as far off as 2012 :)

Oh, the current trunk is most definitely *not* "all that bad", as you say; that wasn't a point of discussion.  Code coverage is excellent, testing is rather extensive, and many folks like me use it in production.  But after nearly 3 years of waiting, I wouldn't hold my breath on it getting released within 6 months (before 2012).

~ David

Re: Matching queries on a per-element basis against a multivalued field

Posted by eks dev <ek...@googlemail.com>.

Sure, I know...,
the point I was trying to make: "if someone serious like Lucid is
using Solr 4.x as a core technology for its own customers, the trunk could
not be all that bad" => release date not as far off as 2012 :)



Re: Matching queries on a per-element basis against a multivalued field

Posted by "Smiley, David W." <ds...@mitre.org>.

"LucidWorks Enterprise" (which is more than Solr, and a modified Solr at that) isn't free; so you can't extract the Solr part of that package and use it unless you are willing to pay them.

Lucid's "Certified Solr", on the other hand, is free.  But they have yet to bump that to trunk/4.x; it was only recently updated to 3.2.


Re: Matching queries on a per-element basis against a multivalued field

Posted by eks dev <ek...@yahoo.co.uk>.

Well, Lucid released "LucidWorks Enterprise"
with a "Complete Apache Solr 4.x Release Integrated and tested with
powerful enhancements".

Whatever that means for Solr 4.0.




Re: Matching queries on a per-element basis against a multivalued field

Posted by "David Smiley (@MITRE.org)" <DS...@mitre.org>.

My best guess (and it is just a guess) is between December and March.

The root of Solr 4, the change that triggered the major version bump, is
known as "flexible indexing" (or just "flex" for short amongst developers).
Its genesis was posted to JIRA as a patch on 18 November 2008, as
LUCENE-1458 (almost 3 years ago!). About a year later it was committed into
a special flex branch that is probably gone now, and then around
April/early-May 2010 it went into trunk, whereas the pre-flex code on trunk
went to a newly formed 3x branch. That is ancient history now, and there are
some amazing performance improvements tied to flex that haven't seen the
light of day in an official release. It's a shame, really. It's been so
long that, once it dawns on everyone that the code is 3 friggin years old
without a release, it's time to get on with the show.

~ David Smiley

-----
 Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
--
View this message in context: http://lucene.472066.n3.nabble.com/Matching-queries-on-a-per-element-basis-against-a-multivalued-field-tp3217432p3220242.html

Re: Matching queries on a per-element basis against a multivalued field

Posted by Suk-Hyun Cho <ch...@gmail.com>.

Thanks. I saw the related JIRA issue but didn't follow it closely enough to
see the cross-core join being added later. Any idea/hint on when I can expect
Solr 4 to be released?

--
View this message in context: http://lucene.472066.n3.nabble.com/Matching-queries-on-a-per-element-basis-against-a-multivalued-field-tp3217432p3220091.html

Re: Matching queries on a per-element basis against a multivalued field

Posted by "David Smiley (@MITRE.org)" <DS...@mitre.org>.

On Aug 2, 2011, at 1:09 PM, Suk-Hyun Cho [via Lucene] wrote:

> I appreciate your replies and ideas. 
> 
> SpanQuery would work, and I'll look into this further. However, what about the original question? Is there no way to match documents on a per-element basis against a multivalued field?

Correct; there is no way.  Aside from Solr 4's "Join" feature, everything else suggested is a hack / work-around for a fundamental limitation.  

> If not, would it perhaps make sense to create a feature request? 

You could, but I wouldn't bother because it's unlikely to get any traction: it's a fundamental issue with Lucene, and at the Solr level there is a solution on the horizon.

> Also, regarding the join support you guys have mentioned: is it only on a field within the same core, or is it across cores (as if cores are tables in a database)? Joining on cores would eliminate most of the issues I'm having. The examples I gave are simplified, but actually I have an entity A that has entity B that has entity C, and I'm flattening out queriable fields of B and C into the schema for A. This way, I can search for documents for the core A that match some criteria for A, B, and/or C. 

The Join support works across cores.  See the wiki and associated JIRA issue for it.

~ David Smiley



-----
 Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
--
View this message in context: http://lucene.472066.n3.nabble.com/Matching-queries-on-a-per-element-basis-against-a-multivalued-field-tp3217432p3219638.html

Re: Matching queries on a per-element basis against a multivalued field

Posted by Suk-Hyun Cho <ch...@gmail.com>.

I appreciate your replies and ideas.

SpanQuery would work, and I'll look into this further. However, what about
the original question? Is there no way to match documents on a per-element
basis against a multivalued field? If not, would it perhaps make sense to
create a feature request?

Also, regarding the join support you guys have mentioned: is it only on a
field within the same core, or is it across cores (as if cores are tables in
a database)? Joining on cores would eliminate most of the issues I'm having.
The examples I gave are simplified, but actually I have an entity A that has
entity B that has entity C, and I'm flattening out queriable fields of B and
C into the schema for A. This way, I can search for documents for the core A
that match some criteria for A, B, and/or C.

--
View this message in context: http://lucene.472066.n3.nabble.com/Matching-queries-on-a-per-element-basis-against-a-multivalued-field-tp3217432p3219565.html

Re: Matching queries on a per-element basis against a multivalued field

Posted by Suk-Hyun Cho <ch...@gmail.com>.

Thanks for the history and the current state of trunk, guys. It sounds like
it's rather stable for serious use... in which case it's probably ready for
a release, but let's not go around in circles. :) I'll give it a shot
sometime.

Thanks, again!

--
View this message in context: http://lucene.472066.n3.nabble.com/Matching-queries-on-a-per-element-basis-against-a-multivalued-field-tp3217432p3220449.html

Re: Matching queries on a per-element basis against a multivalued field

Posted by "David Smiley (@MITRE.org)" <DS...@mitre.org>.

Suk,

You're hitting on a well-known limitation of Lucene, and the "solutions"
are work-arounds that may be unacceptable depending on the specifics of your
case.

Solr 4.0 (trunk)'s support for Joins is definitely an up and coming option,
as Mike pointed out.

Karsten's suggestion of using an index just for friends is very good,
although depending on the specifics of your actual needs it may not work or
may not scale.

Mike also pointed out phrase queries, which will work, but remember to add a
proximity (slop), e.g. "isCool=true gender=male"~50. You'll also want to set
the position increment gap in your schema to a value larger than the slop, so
that a phrase cannot match across two values of the field. A limitation here
is that your text analysis options are limited since all the data is in the
same field. You're also limited to simple term search; no range queries.
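
As a rough illustration of why the position increment gap matters, here is a small Python model. It is a simplification and an assumption on my part, not Lucene's actual sloppy-phrase scoring: tokens get consecutive positions within a value, the gap is added between values, and two terms "match" if their positions are close enough for the given slop.

```python
def index_positions(values, gap):
    """Assign token positions: consecutive within a value, plus `gap`
    extra positions between one value and the next."""
    positions = {}
    pos = 0
    for i, value in enumerate(values):
        if i > 0:
            pos += gap
        for tok in value.split():
            positions.setdefault(tok, []).append(pos)
            pos += 1
    return positions


def within(positions, a, b, slop):
    """Simplified proximity test: some occurrences of a and b lie at most
    slop + 1 positions apart (adjacent terms match with slop 0)."""
    return any(
        abs(pa - pb) <= slop + 1
        for pa in positions.get(a, [])
        for pb in positions.get(b, [])
    )


doc_b = [
    "isCool=true SOME_JUNK_HERE gender=female bloodType=O",
    "isCool=false SOME_JUNK_HERE gender=male bloodType=AB",
]

no_gap = within(index_positions(doc_b, gap=0), "isCool=true", "gender=male", slop=50)
big_gap = within(index_positions(doc_b, gap=100), "isCool=true", "gender=male", slop=50)
print(no_gap)   # True: the phrase bleeds across the two friends
print(big_gap)  # False: the gap keeps the two values apart
```

With no gap, document B is a false positive; with a gap larger than the slop, only terms inside the same value can be close enough to match.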

I took a different approach for an app I built. I indexed into separate
fields (i.e. isCool, gender, bloodType) so that I could analyze each of them
appropriately. But I did have to add a filter that basically collapsed all
position offsets within a value to zero, effectively nullifying my ability
to do a phrase query for a particular value. That was acceptable to me and
it can be ameliorated with shingling. Then at search time I used Span
queries and their unique ability to positionally query over more than one
field. There were some edge conditions that were tricky to debug when I had
a null value, but it was at least fixable with a sentinel value kludge.
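
One possible reading of the idea above, sketched in Python. This is an illustrative model only and not the code from that app: it assumes each friend occupies exactly one position in every per-attribute field, so that "same position in all fields" means "same friend". The names SENTINEL, build_columns and same_position_match are hypothetical.

```python
SENTINEL = "_NULL_"  # hypothetical placeholder for a missing attribute


def build_columns(friends, fields, pad_missing=True):
    """Index each attribute into its own list; the list index stands in
    for the token position of that friend within the field."""
    columns = {f: [] for f in fields}
    for friend in friends:
        for f in fields:
            if f in friend:
                columns[f].append(friend[f])
            elif pad_missing:
                columns[f].append(SENTINEL)  # keeps positions aligned
    return columns


def same_position_match(columns, criteria):
    """True if one position satisfies every field=value criterion at once."""
    length = min(len(columns[f]) for f in criteria)
    return any(
        all(columns[f][i] == v for f, v in criteria.items())
        for i in range(length)
    )


friends = [
    {"gender": "female"},                  # isCool is missing (null)
    {"isCool": "true", "gender": "male"},
]
fields = ["isCool", "gender"]
cool_female = {"isCool": "true", "gender": "female"}

unpadded = same_position_match(build_columns(friends, fields, pad_missing=False), cool_female)
padded = same_position_match(build_columns(friends, fields), cool_female)
print(unpadded)  # True: the missing value shifted isCool out of alignment
print(padded)    # False: the sentinel keeps each friend at its own position
```

Without the sentinel, a null value shifts every later entry in that field by one position, which is the kind of edge condition that is hard to spot.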

~ David Smiley

-----
Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
--
View this message in context: http://lucene.472066.n3.nabble.com/Matching-queries-on-a-per-element-basis-against-a-multivalued-field-tp3217432p3219352.html

Re: Matching queries on a per-element basis against a multivalued field

Posted by ka...@gmx.de.

Hi Suk-Hyun Cho,

If "myFriend" is the unit of retrieval, you should index each friend as its own Lucene document with the fields "isCool", "gender", "bloodType", ...

If you really want to put all "myFriends" in one field, as in your
myFriends = [
    "isCool=true SOME_JUNK_HERE gender=female bloodType=O",
    "isCool=false SOME_JUNK_HERE gender=male bloodType=AB"
]
example, you can use SpanQueries:

http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/

With SpanNotQuery you can search for all spans containing "isCool=true" and "gender=male" where no other "isCool" term lies between the two.
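
The logic of that exclusion can be sketched in plain Python. This is not the Lucene SpanNotQuery API, just a model of the idea; it assumes every value starts with an "isCool=" token (so that token marks the start of a new friend) and that the first term precedes the second. The function name is made up.

```python
def span_not_match(values, start_term, end_term, boundary_prefix):
    """Find start_term followed by end_term with no boundary token between."""
    tokens = [tok for value in values for tok in value.split()]
    for i, tok in enumerate(tokens):
        if tok != start_term:
            continue
        for j in range(i + 1, len(tokens)):
            if tokens[j].startswith(boundary_prefix):
                break  # another friend began before end_term was found
            if tokens[j] == end_term:
                return True
    return False


doc_a = ["isCool=true SOME_JUNK_HERE gender=male bloodType=A"]
doc_b = [
    "isCool=true SOME_JUNK_HERE gender=female bloodType=O",
    "isCool=false SOME_JUNK_HERE gender=male bloodType=AB",
]

hit_a = span_not_match(doc_a, "isCool=true", "gender=male", "isCool=")
hit_b = span_not_match(doc_b, "isCool=true", "gender=male", "isCool=")
print(hit_a)  # True
print(hit_b)  # False: "isCool=false" sits between the two terms
```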

Best regards
  Karsten


P.S. see in context
http://lucene.472066.n3.nabble.com/Matching-queries-on-a-per-element-basis-against-a-multivalued-field-td3217432.html

Re: Matching queries on a per-element basis against a multivalued field

Posted by Mike Sokolov <so...@ifactory.com>.

You have a few choices:

1) flatten your field structure - like your "undesirable" example, but 
wouldn't you want to have the document identifier as a field value also?

2) use phrase queries to make sure the key/value pairs are adjacent

3) use a join query

That's all I can think of
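
For option 1, a minimal Python sketch of flattening with the parent identifier kept on each friend document. The field name parentId and both helper names are assumptions for illustration, not part of any Solr schema.

```python
def flatten(parent_id, friends):
    """Emit one document per friend, each carrying its parent's identifier."""
    return [
        {"id": f"{parent_id}{n}", "parentId": parent_id, **friend}
        for n, friend in enumerate(friends, start=1)
    ]


def parents_matching(index, **criteria):
    """Return the parents that have at least one friend meeting all criteria."""
    return sorted(
        {doc["parentId"] for doc in index
         if all(doc.get(k) == v for k, v in criteria.items())}
    )


index = flatten("A", [
    {"isCool": "true", "gender": "male", "bloodType": "A"},
]) + flatten("B", [
    {"isCool": "true", "gender": "female", "bloodType": "O"},
    {"isCool": "false", "gender": "male", "bloodType": "AB"},
])

cool_males = parents_matching(index, isCool="true", gender="male")
print([doc["id"] for doc in index])  # ['A1', 'B1', 'B2']
print(cool_males)                    # ['A']
```

Because every criterion is tested against a single friend document, the cross-friend false positive for document B disappears; the parent identifier is what lets you get back to the original document.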

-Mike
