Posted to solr-user@lucene.apache.org by Mindaugas Žakšauskas <mi...@gmail.com> on 2012/04/24 16:24:21 UTC

Query parsing VS marshalling/unmarshalling

Hi,

I maintain a distributed system which Solr is part of. The data kept
in Solr is "permissioned", and permissions are currently implemented
by taking the original user query and adding certain bits to it which
make it return less data in the search results. I am now at the point
where I need to revisit this functionality and try to improve it.

Changing this to send the query as two separate parts (q=...&fq=...)
would be the first logical thing to do, however I was thinking of an
extra improvement. Instead of generating the filter query, converting
it into a String and sending it over HTTP just for Solr to parse it
again - would it not be better to take the generated Lucene filter
query, serialize it using Java serialization, convert it to, say,
Base64, and then send and deserialize it on the Solr end? Has anyone
tried doing any performance comparisons on this topic?
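
To make it concrete, here is roughly what I have in mind on the
sending side - only a sketch, assuming a Lucene version where Query
actually implements java.io.Serializable (I haven't verified this)
and Java 8's java.util.Base64:

import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.util.Base64;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class QuerySender {
    // Serialize a Lucene query to a Base64 string that could travel
    // in a custom request parameter instead of fq text.
    static String toBase64(Query q) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(q);
        }
        return Base64.getEncoder().encodeToString(buf.toByteArray());
    }

    public static void main(String[] args) throws Exception {
        // Pre-Builder BooleanQuery API (Lucene 3.x era)
        BooleanQuery acl = new BooleanQuery();
        acl.add(new TermQuery(new Term("acl", "group-a")), BooleanClause.Occur.SHOULD);
        acl.add(new TermQuery(new Term("acl", "group-b")), BooleanClause.Occur.SHOULD);
        System.out.println(toBase64(acl));
    }
}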

I am particularly concerned about this because in extreme cases my
filter queries can be very large (thousands of characters long), and
we have already had to tweak the server as the size of GET requests
would exceed default limits. And yes, we could move to POST, but I
would like to minimize both the amount of data sent over the wire and
the time taken to parse large queries.

Thanks in advance.

m.

Re: Query parsing VS marshalling/unmarshalling

Posted by "balaji.gandhi" <ji...@gmail.com>.
Hi, 

I am trying to do something similar:

E.g.
Input: (name:John AND name:Doe)
Output: ((firstName:John OR lastName:John) AND (firstName:Doe OR
lastName:Doe))

How can I extract the fields, change them and repackage the query?
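
Roughly, I have been attempting something like this (just a sketch
against the old pre-Builder BooleanQuery API; it only handles
TermQuery and BooleanQuery, and the class name is made up):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class FieldExpander {
    // Rewrite every name:X term into (firstName:X OR lastName:X),
    // preserving the boolean structure around it.
    public static Query expand(Query q) {
        if (q instanceof TermQuery) {
            Term t = ((TermQuery) q).getTerm();
            if (!"name".equals(t.field())) return q;
            BooleanQuery or = new BooleanQuery();
            or.add(new TermQuery(new Term("firstName", t.text())), BooleanClause.Occur.SHOULD);
            or.add(new TermQuery(new Term("lastName", t.text())), BooleanClause.Occur.SHOULD);
            return or;
        }
        if (q instanceof BooleanQuery) {
            BooleanQuery rewritten = new BooleanQuery();
            for (BooleanClause c : ((BooleanQuery) q).clauses()) {
                rewritten.add(expand(c.getQuery()), c.getOccur());
            }
            return rewritten;
        }
        return q; // leave other query types untouched
    }
}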

Thanks, 
Balaji




Re: Query parsing VS marshalling/unmarshalling

Posted by Erick Erickson <er...@gmail.com>.
If you're assembling an fq clause, this is all done for you, although
you need to take some care to form the fq clause _exactly_
the same way each time. Think of the filterCache as a key/value
map where the key is the raw fq text and the value is the set of docs
satisfying that query.

So fq=acl:(a OR b) will not, for instance, match
     fq=acl:(b OR a), even though the two are logically equivalent.
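
One way to guarantee identical text is to canonicalize the clause
before sending it. A minimal sketch (the helper is hypothetical):

import java.util.Collection;
import java.util.TreeSet;

public class AclFq {
    // Sort and de-duplicate the groups so the same set always produces
    // byte-identical fq text, and thus a single filterCache entry.
    static String aclFq(Collection<String> groups) {
        return "acl:(" + String.join(" OR ", new TreeSet<>(groups)) + ")";
    }
    // aclFq(asList("b", "a")) and aclFq(asList("a", "b")) both
    // yield "acl:(a OR b)".
}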

FWIW
Erick

2012/4/24 Mindaugas Žakšauskas <mi...@gmail.com>:
> Hi Erick,
>
> Thanks for looking into this and for the tips you've sent.
>
> I am leaning towards a custom query component at the moment; the primary
> reason for it would be to be able to squeeze the amount of data that
> is sent over to Solr. A single round trip within the same datacenter
> costs around 0.5 ms [1], and if the query doesn't fit into a single
> ethernet packet, this number effectively has to double/triple/etc.
>
> Regarding the no-cache filters - I was actually thinking the opposite:
> caching the ACL filter queries would be beneficial, as those tend
> to be the same across multiple search requests.
>
> [1] http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//people/jeff/stanford-295-talk.pdf, slide 13
>
> m.
>
> On Tue, Apr 24, 2012 at 4:43 PM, Erick Erickson <er...@gmail.com> wrote:
>> In general, query parsing is such a small fraction of the total time that,
>> almost no matter how complex, it's not worth worrying about. To see
>> this, attach &debugQuery=on to your query and look at the timings
>> in the "prepare" and "process" portions of the response. I'd be
>> very sure that it was a problem before spending any time trying to make
>> the transmission of the data across the wire more efficient; my first
>> reaction is that this is premature optimization.
>>
>> Second, you could do this on the server side with a custom query
>> component if you chose. You can freely modify the query
>> over there and it may make sense in your situation.
>>
>> Third, consider "no cache filters", which were developed for
>> expensive filter queries, ACL being one of them. See:
>> https://issues.apache.org/jira/browse/SOLR-2429
>>
>> Fourth, I'd ask if there's a way to reduce the size of the fq
>> clause. Is this on a per-user basis or a per-group basis?
>> If you can get this down to a few groups, that would help - although
>> there's often some outlier who is a member of thousands of
>> groups :(.
>>
>> Best
>> Erick
>>
>>
>> 2012/4/24 Mindaugas Žakšauskas <mi...@gmail.com>:
>>> On Tue, Apr 24, 2012 at 3:27 PM, Benson Margulies <bi...@gmail.com> wrote:
>>>> I'm about to try out a contribution for serializing queries as JSON
>>>> using Jackson. I've previously done this by serializing my
>>>> own data structure and putting the JSON into a custom query parameter.
>>>
>>> Thanks for your reply. Appreciate your effort, but I'm not sure if I
>>> fully understand the gain.
>>>
>>> Having the data in JSON would still require it to be converted into a
>>> Lucene Query at the end, which takes space & CPU effort, right? Or are
>>> you saying that having the query serialized into a structured data blob
>>> (JSON in this case) makes it somehow easier to convert into a Lucene
>>> Query?
>>>
>>> I only thought about Java serialization because:
>>> - it's rather close to the in-object format
>>> - the mechanism is rather stable and is an established standard in Java/JVM
>>> - Lucene Queries seem to implement java.io.Serializable (haven't done
>>> a thorough check but looks good on the surface)
>>> - other conversions (e.g. using XStream) are either slow or require
>>> custom annotations, and I personally don't see how Lucene/Solr would
>>> include them in their core classes.
>>>
>>> Anyway, it would still be interesting to hear if anyone could
>>> elaborate on query parsing complexity.
>>>
>>> m.

Re: Query parsing VS marshalling/unmarshalling

Posted by Mindaugas Žakšauskas <mi...@gmail.com>.
Hi Erick,

Thanks for looking into this and for the tips you've sent.

I am leaning towards a custom query component at the moment; the primary
reason for it would be to be able to squeeze the amount of data that
is sent over to Solr. A single round trip within the same datacenter
costs around 0.5 ms [1], and if the query doesn't fit into a single
ethernet packet, this number effectively has to double/triple/etc.

Regarding the no-cache filters - I was actually thinking the opposite:
caching the ACL filter queries would be beneficial, as those tend
to be the same across multiple search requests.

[1] http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//people/jeff/stanford-295-talk.pdf, slide 13

m.

On Tue, Apr 24, 2012 at 4:43 PM, Erick Erickson <er...@gmail.com> wrote:
> In general, query parsing is such a small fraction of the total time that,
> almost no matter how complex, it's not worth worrying about. To see
> this, attach &debugQuery=on to your query and look at the timings
> in the "prepare" and "process" portions of the response. I'd be
> very sure that it was a problem before spending any time trying to make
> the transmission of the data across the wire more efficient; my first
> reaction is that this is premature optimization.
>
> Second, you could do this on the server side with a custom query
> component if you chose. You can freely modify the query
> over there and it may make sense in your situation.
>
> Third, consider "no cache filters", which were developed for
> expensive filter queries, ACL being one of them. See:
> https://issues.apache.org/jira/browse/SOLR-2429
>
> Fourth, I'd ask if there's a way to reduce the size of the fq
> clause. Is this on a per-user basis or a per-group basis?
> If you can get this down to a few groups, that would help - although
> there's often some outlier who is a member of thousands of
> groups :(.
>
> Best
> Erick
>
>
> 2012/4/24 Mindaugas Žakšauskas <mi...@gmail.com>:
>> On Tue, Apr 24, 2012 at 3:27 PM, Benson Margulies <bi...@gmail.com> wrote:
>>> I'm about to try out a contribution for serializing queries as JSON
>>> using Jackson. I've previously done this by serializing my
>>> own data structure and putting the JSON into a custom query parameter.
>>
>> Thanks for your reply. Appreciate your effort, but I'm not sure if I
>> fully understand the gain.
>>
>> Having the data in JSON would still require it to be converted into a
>> Lucene Query at the end, which takes space & CPU effort, right? Or are
>> you saying that having the query serialized into a structured data blob
>> (JSON in this case) makes it somehow easier to convert into a Lucene
>> Query?
>>
>> I only thought about Java serialization because:
>> - it's rather close to the in-object format
>> - the mechanism is rather stable and is an established standard in Java/JVM
>> - Lucene Queries seem to implement java.io.Serializable (haven't done
>> a thorough check but looks good on the surface)
>> - other conversions (e.g. using XStream) are either slow or require
>> custom annotations, and I personally don't see how Lucene/Solr would
>> include them in their core classes.
>>
>> Anyway, it would still be interesting to hear if anyone could
>> elaborate on query parsing complexity.
>>
>> m.

Re: Query parsing VS marshalling/unmarshalling

Posted by Erick Erickson <er...@gmail.com>.
In general, query parsing is such a small fraction of the total time that,
almost no matter how complex, it's not worth worrying about. To see
this, attach &debugQuery=on to your query and look at the timings
in the "prepare" and "process" portions of the response. I'd be
very sure that it was a problem before spending any time trying to make
the transmission of the data across the wire more efficient; my first
reaction is that this is premature optimization.
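
From SolrJ that check looks roughly like this (a sketch against the
older SolrJ API; the exact shape of the debug section varies a bit by
version):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TimingCheck {
    static void printTimings(SolrServer server) throws Exception {
        SolrQuery q = new SolrQuery("name:john");
        q.addFilterQuery("acl:(a OR b)"); // the ACL filter under test
        q.set("debugQuery", "on");        // ask for per-component timings
        QueryResponse rsp = server.query(q);
        // prepare/process times live under the "timing" entry of the debug map
        System.out.println(rsp.getDebugMap().get("timing"));
    }
}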

Second, you could do this on the server side with a custom query
component if you chose. You can freely modify the query
over there and it may make sense in your situation.
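
A bare skeleton of that option (the ACL-building method is a
placeholder, and the exact set of methods you must implement varies
by Solr version):

import java.io.IOException;

import org.apache.lucene.search.Query;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

public class AclQueryComponent extends SearchComponent {
    @Override
    public void prepare(ResponseBuilder rb) throws IOException {
        // Wrap the already-parsed user query with ACL constraints
        Query original = rb.getQuery();
        rb.setQuery(withAcl(original));
    }

    @Override
    public void process(ResponseBuilder rb) throws IOException {
        // nothing extra to do at process time
    }

    private Query withAcl(Query q) {
        return q; // placeholder: combine q with the caller's ACL filter
    }

    @Override
    public String getDescription() {
        return "ACL query rewriting (sketch)";
    }

    @Override
    public String getSource() {
        return "";
    }
}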

Third, consider "no cache filters", which were developed for
expensive filter queries, ACL being one of them. See:
https://issues.apache.org/jira/browse/SOLR-2429

Fourth, I'd ask if there's a way to reduce the size of the fq
clause. Is this on a per-user basis or a per-group basis?
If you can get this down to a few groups, that would help - although
there's often some outlier who is a member of thousands of
groups :(.

Best
Erick


2012/4/24 Mindaugas Žakšauskas <mi...@gmail.com>:
> On Tue, Apr 24, 2012 at 3:27 PM, Benson Margulies <bi...@gmail.com> wrote:
>> I'm about to try out a contribution for serializing queries as JSON
>> using Jackson. I've previously done this by serializing my
>> own data structure and putting the JSON into a custom query parameter.
>
> Thanks for your reply. Appreciate your effort, but I'm not sure if I
> fully understand the gain.
>
> Having the data in JSON would still require it to be converted into a
> Lucene Query at the end, which takes space & CPU effort, right? Or are
> you saying that having the query serialized into a structured data blob
> (JSON in this case) makes it somehow easier to convert into a Lucene
> Query?
>
> I only thought about Java serialization because:
> - it's rather close to the in-object format
> - the mechanism is rather stable and is an established standard in Java/JVM
> - Lucene Queries seem to implement java.io.Serializable (haven't done
> a thorough check but looks good on the surface)
> - other conversions (e.g. using XStream) are either slow or require
> custom annotations, and I personally don't see how Lucene/Solr would
> include them in their core classes.
>
> Anyway, it would still be interesting to hear if anyone could
> elaborate on query parsing complexity.
>
> m.

Re: Query parsing VS marshalling/unmarshalling

Posted by Mindaugas Žakšauskas <mi...@gmail.com>.
On Tue, Apr 24, 2012 at 3:27 PM, Benson Margulies <bi...@gmail.com> wrote:
> I'm about to try out a contribution for serializing queries as JSON
> using Jackson. I've previously done this by serializing my
> own data structure and putting the JSON into a custom query parameter.

Thanks for your reply. Appreciate your effort, but I'm not sure if I
fully understand the gain.

Having the data in JSON would still require it to be converted into a
Lucene Query at the end, which takes space & CPU effort, right? Or are
you saying that having the query serialized into a structured data blob
(JSON in this case) makes it somehow easier to convert into a Lucene
Query?

I only thought about Java serialization because:
- it's rather close to the in-object format
- the mechanism is rather stable and is an established standard in Java/JVM
- Lucene Queries seem to implement java.io.Serializable (haven't done
a thorough check but looks good on the surface)
- other conversions (e.g. using XStream) are either slow or require
custom annotations, and I personally don't see how Lucene/Solr would
include them in their core classes.
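
For completeness, the receiving end I have in mind would look roughly
like this - a sketch only, with the same caveat that it assumes Query
is Serializable, and it would live inside, say, a custom component on
the Solr side:

import java.io.ByteArrayInputStream;
import java.io.ObjectInputStream;
import java.util.Base64;

import org.apache.lucene.search.Query;

public class QueryReceiver {
    // Decode and deserialize a query sent as a Base64 request parameter.
    static Query fromBase64(String encoded) throws Exception {
        byte[] bytes = Base64.getDecoder().decode(encoded);
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (Query) in.readObject();
        }
    }
}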

Anyway, it would still be interesting to hear if anyone could
elaborate on query parsing complexity.

m.

Re: Query parsing VS marshalling/unmarshalling

Posted by Benson Margulies <bi...@gmail.com>.
2012/4/24 Mindaugas Žakšauskas <mi...@gmail.com>:
> Hi,
>
> I maintain a distributed system which Solr is part of. The data kept
> in Solr is "permissioned", and permissions are currently implemented
> by taking the original user query and adding certain bits to it which
> make it return less data in the search results. I am now at the point
> where I need to revisit this functionality and try to improve it.
>
> Changing this to send the query as two separate parts (q=...&fq=...)
> would be the first logical thing to do, however I was thinking of an
> extra improvement. Instead of generating the filter query, converting
> it into a String and sending it over HTTP just for Solr to parse it
> again - would it not be better to take the generated Lucene filter
> query, serialize it using Java serialization, convert it to, say,
> Base64, and then send and deserialize it on the Solr end? Has anyone
> tried doing any performance comparisons on this topic?

I'm about to try out a contribution for serializing queries as JSON
using Jackson. I've previously done this by serializing my
own data structure and putting the JSON into a custom query parameter.
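
For what it's worth, the data-structure approach looks roughly like
this (all names are made up; a minimal sketch using Jackson's
ObjectMapper):

import java.util.Arrays;
import java.util.List;

import com.fasterxml.jackson.databind.ObjectMapper;

public class AclFilterDto {
    public String field = "acl";
    public List<String> groups;

    public static void main(String[] args) throws Exception {
        AclFilterDto dto = new AclFilterDto();
        dto.groups = Arrays.asList("group-a", "group-b");
        // Serialize and pass as a custom parameter, e.g. &aclFilter=<json>;
        // a custom component on the Solr side parses it back.
        String json = new ObjectMapper().writeValueAsString(dto);
        System.out.println(json);
    }
}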


>
> I am particularly concerned about this because in extreme cases my
> filter queries can be very large (thousands of characters long), and
> we have already had to tweak the server as the size of GET requests
> would exceed default limits. And yes, we could move to POST, but I
> would like to minimize both the amount of data sent over the wire and
> the time taken to parse large queries.
>
> Thanks in advance.
>
> m.