Posted to solr-user@lucene.apache.org by Darin Amos <da...@gmail.com> on 2014/12/04 16:49:15 UTC

Anti-Pattern in lucent-join jar?

Hello All,

I have been doing a lot of research in building some custom queries and I have been looking at the Lucene Join library as a reference. I noticed something that I believe could actually have a negative side effect.

Specifically I was looking at the JoinUtil.createJoinQuery(…) method and within that method you see the following code:

    TermsWithScoreCollector termsWithScoreCollector =
        TermsWithScoreCollector.create(fromField, multipleValuesPerDocument, scoreMode);
    fromSearcher.search(fromQuery, termsWithScoreCollector);

As you can see, when the JoinQuery is being built, the code executes the query that it wraps, using its own collector to collect all the scores. If I were to write a query parser using this library (which someone has done here), doesn’t this reduce the benefit of the SOLR query cache? The wrapped query is executed when the JoinQuery is constructed, not when the JoinQuery itself is executed.

Thanks

Darin

Re: Anti-Pattern in lucent-join jar?

Posted by Darin Amos <da...@gmail.com>.

Hi Mikhail,

I was merely posing a thought in an effort to continue to learn and educate myself. Your point about Weight.scorer() being called per segment helps my understanding. I am in the middle of building a POC for a customer of mine that I pointed out in this thread on Dec 5th (shortly after noon). I have spent countless hours over the weekend continuing to try and learn the internals of SOLR and Lucene.

Thanks

Darin


> On Dec 8, 2014, at 4:57 AM, Mikhail Khludnev <mk...@griddynamics.com> wrote:
> 
> On Fri, Dec 5, 2014 at 10:44 PM, Darin Amos <da...@gmail.com> wrote:
> 
>>                        public Scorer scorer(){
>>                                TermsWithScoreCollector collector = new
>> TermsWithScoreCollector();
>>                                JoinQuery.this.s.search(JoinQuery.this.q,
>> collector);
>> 
>>                                //do the rest..
>> 
>>                        }
>> 
> 
> Darin,
> I hardly follow, but this approach either is not efficient or even doesn't
> work. Generally join is O(n^2) operation, which is most impls try to
> reduce. weight.scorer() is invoked per segment, and scorer yields results
> only from a particular segment. However, fromQuery should run across all
> segments. Hence, TermsWithScoreCollector will collect IDs globally again
> and again.
> As you can see, the current JoinUtil design is much more efficient, it
> reuses global IDs hash across all "to" segments searches.
> 
> 
> -- 
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
> 
> <http://www.griddynamics.com>
> <mk...@griddynamics.com>

Re: Anti-Pattern in lucent-join jar?

Posted by Michael Sokolov <ms...@safaribooksonline.com>.

Right - allowing Solr to manage these queries (SOLR-6234) seems like the 
way to go

  ... OP == original poster (I lost track of who started the discussion)


-Mike

On 12/08/2014 10:19 AM, Mikhail Khludnev wrote:
> On Mon, Dec 8, 2014 at 5:38 PM, Michael Sokolov <
> msokolov@safaribooksonline.com> wrote:
>
>> I get the impression there was a concern that the caller could hold on to
>> the query generated by JoinUtil for too long - eg across requests in Solr.
> Michael, if you still bother, SOLR-6234
> <https://issues.apache.org/jira/browse/SOLR-6234> is free from this issue.
> Cache keys (Queries), are fairly small and GC friendly.
>
>
>> I'm not sure why the OP thinks that would happen, though.
>>
> Could you please expand "OP"? I didn't get it.
>
>> -Mike
>>
>>
>> On 12/08/2014 04:57 AM, Mikhail Khludnev wrote:
>>
>>> On Fri, Dec 5, 2014 at 10:44 PM, Darin Amos <da...@gmail.com> wrote:
>>>
>>>                            public Scorer scorer(){
>>>>                                   TermsWithScoreCollector collector = new
>>>> TermsWithScoreCollector();
>>>>                                   JoinQuery.this.s.search(
>>>> JoinQuery.this.q,
>>>> collector);
>>>>
>>>>                                   //do the rest..
>>>>
>>>>                           }
>>>>
>>>>   Darin,
>>> I hardly follow, but this approach either is not efficient or even doesn't
>>> work. Generally join is O(n^2) operation, which is most impls try to
>>> reduce. weight.scorer() is invoked per segment, and scorer yields results
>>> only from a particular segment. However, fromQuery should run across all
>>> segments. Hence, TermsWithScoreCollector will collect IDs globally again
>>> and again.
>>> As you can see, the current JoinUtil design is much more efficient, it
>>> reuses global IDs hash across all "to" segments searches.
>>>
>>>
>>>
>

Re: Anti-Pattern in lucent-join jar?

Posted by Mikhail Khludnev <mk...@griddynamics.com>.

On Mon, Dec 8, 2014 at 5:38 PM, Michael Sokolov <
msokolov@safaribooksonline.com> wrote:

> I get the impression there was a concern that the caller could hold on to
> the query generated by JoinUtil for too long - eg across requests in Solr.

Michael, if this still bothers you, SOLR-6234
<https://issues.apache.org/jira/browse/SOLR-6234> is free from this issue.
Cache keys (queries) are fairly small and GC friendly.


> I'm not sure why the OP thinks that would happen, though.
>
Could you please expand "OP"? I didn't get it.

>
> -Mike
>
>
> On 12/08/2014 04:57 AM, Mikhail Khludnev wrote:
>
>> On Fri, Dec 5, 2014 at 10:44 PM, Darin Amos <da...@gmail.com> wrote:
>>
>>                           public Scorer scorer(){
>>>                                  TermsWithScoreCollector collector = new
>>> TermsWithScoreCollector();
>>>                                  JoinQuery.this.s.search(
>>> JoinQuery.this.q,
>>> collector);
>>>
>>>                                  //do the rest..
>>>
>>>                          }
>>>
>>>  Darin,
>> I hardly follow, but this approach either is not efficient or even doesn't
>> work. Generally join is O(n^2) operation, which is most impls try to
>> reduce. weight.scorer() is invoked per segment, and scorer yields results
>> only from a particular segment. However, fromQuery should run across all
>> segments. Hence, TermsWithScoreCollector will collect IDs globally again
>> and again.
>> As you can see, the current JoinUtil design is much more efficient, it
>> reuses global IDs hash across all "to" segments searches.
>>
>>
>>
>


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<mk...@griddynamics.com>

Re: Anti-Pattern in lucent-join jar?

Posted by Michael Sokolov <ms...@safaribooksonline.com>.

I get the impression there was a concern that the caller could hold on 
to the query generated by JoinUtil for too long - eg across requests in 
Solr. I'm not sure why the OP thinks that would happen, though.

-Mike

On 12/08/2014 04:57 AM, Mikhail Khludnev wrote:
> On Fri, Dec 5, 2014 at 10:44 PM, Darin Amos <da...@gmail.com> wrote:
>
>>                          public Scorer scorer(){
>>                                  TermsWithScoreCollector collector = new
>> TermsWithScoreCollector();
>>                                  JoinQuery.this.s.search(JoinQuery.this.q,
>> collector);
>>
>>                                  //do the rest..
>>
>>                          }
>>
> Darin,
> I hardly follow, but this approach either is not efficient or even doesn't
> work. Generally join is O(n^2) operation, which is most impls try to
> reduce. weight.scorer() is invoked per segment, and scorer yields results
> only from a particular segment. However, fromQuery should run across all
> segments. Hence, TermsWithScoreCollector will collect IDs globally again
> and again.
> As you can see, the current JoinUtil design is much more efficient, it
> reuses global IDs hash across all "to" segments searches.
>
>

Re: Anti-Pattern in lucent-join jar?

Posted by Mikhail Khludnev <mk...@griddynamics.com>.

On Fri, Dec 5, 2014 at 10:44 PM, Darin Amos <da...@gmail.com> wrote:

>                         public Scorer scorer(){
>                                 TermsWithScoreCollector collector = new
> TermsWithScoreCollector();
>                                 JoinQuery.this.s.search(JoinQuery.this.q,
> collector);
>
>                                 //do the rest..
>
>                         }
>

Darin,
I can hardly follow, but this approach is either inefficient or doesn't
work at all. Generally a join is an O(n^2) operation, which most
implementations try to reduce. weight.scorer() is invoked per segment, and
the scorer yields results only from that particular segment. However,
fromQuery has to run across all segments. Hence, TermsWithScoreCollector
would collect IDs globally again and again.
As you can see, the current JoinUtil design is much more efficient: it
reuses the global ID hash across all "to" segment searches.
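
To make the cost argument above concrete, here is a minimal sketch in plain Java. It does not use the Lucene API; FromSide, collectGlobally() and the two join methods are hypothetical stand-ins, and each "to" segment is modelled as a list of ids. It only counts how often the global from-search runs under the two designs.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class PerSegmentCostSketch {

    /** Hypothetical "from" side: counts how often the global from-search runs. */
    static final class FromSide {
        private final Set<String> joinTerms;
        int runs = 0;

        FromSide(String... joinTerms) {
            this.joinTerms = new HashSet<>(Arrays.asList(joinTerms));
        }

        /** Simulates running the from-query across all segments of the index. */
        Set<String> collectGlobally() {
            runs++;
            return joinTerms;
        }
    }

    /** Collecting inside a per-segment scorer: one global from-search per "to" segment. */
    static int joinCollectingPerSegment(FromSide from, List<List<String>> toSegments) {
        int hits = 0;
        for (List<String> segment : toSegments) {
            Set<String> terms = from.collectGlobally(); // repeated for every segment
            hits += countMatches(segment, terms);
        }
        return hits;
    }

    /** Collecting once up front and reusing the same term set for every "to" segment. */
    static int joinCollectingOnce(FromSide from, List<List<String>> toSegments) {
        Set<String> terms = from.collectGlobally(); // single global pass
        int hits = 0;
        for (List<String> segment : toSegments) {
            hits += countMatches(segment, terms);
        }
        return hits;
    }

    private static int countMatches(List<String> segmentIds, Set<String> terms) {
        int n = 0;
        for (String id : segmentIds) {
            if (terms.contains(id)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        List<List<String>> toSegments = Arrays.asList(
            Arrays.asList("p1", "p2"), Arrays.asList("p3"), Arrays.asList("p4", "p5"));

        FromSide perSegment = new FromSide("p1", "p4");
        FromSide once = new FromSide("p1", "p4");

        System.out.println(joinCollectingPerSegment(perSegment, toSegments)
            + " hits, " + perSegment.runs + " from-searches");
        System.out.println(joinCollectingOnce(once, toSegments)
            + " hits, " + once.runs + " from-searches");
    }
}
```

Both variants find the same hits; the difference is only that the first repeats the global collection once per segment, while the second does it a single time.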

-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<mk...@griddynamics.com>

Re: Anti-Pattern in lucent-join jar?

Posted by Darin Amos <da...@gmail.com>.

In this case I was thinking about something like the following.. if you changed the Query implementation or created your own similar query:

If you consider this query: q={!scorejoin from=parent to=id}type:child

public class JoinQuery extends Query {

	private Query q = null;
	private IndexSearcher s = null;

	public JoinQuery(Query q, IndexSearcher s){
		this.q = q;   // This is the term query type:child
		this.s = s;
	}

	.
	.
	.
	public Weight createWeight(…..){
		return new Weight(){
			.
			.
			.
			public Scorer scorer(){
				// Run the wrapped query only now, when a scorer is requested
				TermsWithScoreCollector collector = new TermsWithScoreCollector();
				JoinQuery.this.s.search(JoinQuery.this.q, collector);

				//do the rest..

			}

		};
	}
}

This is what I was thinking in my head…. but I don’t really believe it offers any value above how the scorejoin query works today.



> On Dec 5, 2014, at 2:16 PM, Roman Chyla <ro...@gmail.com> wrote:
> 
> Not sure I understand. It is the searcher which executes the query, how
> would you 'convince' it to pass the query? First the Weight is created,
> weight instance creates scorer - you would have to change the API to do the
> passing (or maybe not...?)
> In my case, the relationships were across index segments, so I had to
> collect them first - but in some other situations, when you look only at
> the data inside one index segments, it _might_ be better to wait
> 
> 
> 
> On Fri, Dec 5, 2014 at 1:25 PM, Darin Amos <da...@gmail.com> wrote:
> 
>> Couldn’t you just keep passing the wrapped query and searcher down to
>> Weight.scorer()?
>> 
>> This would allow you to wait until the query is executed to do term
>> collection. If you want to protect against creating and executing the query
>> with different searchers, you would have to make the query factory (or
>> constructor) only visible to the query parser or parser plugin?
>> 
>> I might not have followed you, this discussing challenges my understanding
>> of Lucene and SOLR.
>> 
>> Darin
>> 
>> 
>> 
>>> On Dec 5, 2014, at 12:47 PM, Roman Chyla <ro...@gmail.com> wrote:
>>> 
>>> Hi Mikhail, I think you are right, it won't be problem for SOLR, but it
>> is
>>> likely an antipattern inside a lucene component. Because custom
>> components
>>> may create join queries, hold to them and then execute much later
>> against a
>>> different searcher. One approach would be to postpone term collection
>> until
>>> the query actually runs, I looked far and wide for appropriate place, but
>>> only found createWeight() - but at least it does give developers NO
>>> opportunity to shoot their feet! ;-)
>>> 
>>> Since it may serve as an inspiration to someone, here is a link:
>>> 
>> https://github.com/romanchyla/montysolr/blob/master-next/contrib/adsabs/src/java/org/apache/lucene/search/SecondOrderQuery.java#L101
>>> 
>>> roman
>>> 
>>> On Fri, Dec 5, 2014 at 4:52 AM, Mikhail Khludnev <
>> mkhludnev@griddynamics.com
>>>> wrote:
>>> 
>>>> Thanks Roman! Let's expand it for the sake of completeness.
>>>> Such issue is not possible in Solr, because caches are associated with
>> the
>>>> searcher. While you follow this design (see Solr userCache), and don't
>>>> update what's cached once, there is no chance to shoot the foot.
>>>> There were few caches inside of Lucene (old FieldCache,
>>>> CachingWrapperFilter, ExternalFileField, etc), but they are properly
>> mapped
>>>> onto segment keys, hence it exclude such leakage across different
>>>> searchers.
>>>> 
>>>> On Fri, Dec 5, 2014 at 6:43 AM, Roman Chyla <ro...@gmail.com>
>> wrote:
>>>> 
>>>>> +1, additionally (as it follows from your observation) the query can
>> get
>>>>> out of sync with the index, if eg it was saved for later use and ran
>>>>> against newly opened searcher
>>>>> 
>>>>> Roman
>>>>> On 4 Dec 2014 10:51, "Darin Amos" <da...@gmail.com> wrote:
>>>>> 
>>>>>> Hello All,
>>>>>> 
>>>>>> I have been doing a lot of research in building some custom queries
>>>> and I
>>>>>> have been looking at the Lucene Join library as a reference. I noticed
>>>>>> something that I believe could actually have a negative side effect.
>>>>>> 
>>>>>> Specifically I was looking at the JoinUtil.createJoinQuery(…) method
>>>> and
>>>>>> within that method you see the following code:
>>>>>> 
>>>>>>       TermsWithScoreCollector termsWithScoreCollector =
>>>>>>           TermsWithScoreCollector.create(fromField,
>>>>>> multipleValuesPerDocument, scoreMode);
>>>>>>       fromSearcher.search(fromQuery, termsWithScoreCollector);
>>>>>> 
>>>>>> As you can see, when the JoinQuery is being built, the code is
>>>> executing
>>>>>> the query that is wraps with it’s own collector to collect all the
>>>>> scores.
>>>>>> If I were to write a query parser using this library (which someone
>> has
>>>>>> done here), doesn’t this reduce the benefit of the SOLR query cache?
>>>> The
>>>>>> wrapped query is being executing when the Join Query is being
>>>>> constructed,
>>>>>> not when it is executed.
>>>>>> 
>>>>>> Thanks
>>>>>> 
>>>>>> Darin
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Sincerely yours
>>>> Mikhail Khludnev
>>>> Principal Engineer,
>>>> Grid Dynamics
>>>> 
>>>> <http://www.griddynamics.com>
>>>> <mk...@griddynamics.com>
>>>> 
>> 
>>

Re: Anti-Pattern in lucent-join jar?

Posted by Roman Chyla <ro...@gmail.com>.

Not sure I understand. It is the searcher which executes the query, how
would you 'convince' it to pass the query? First the Weight is created,
weight instance creates scorer - you would have to change the API to do the
passing (or maybe not...?)
In my case, the relationships were across index segments, so I had to
collect them first - but in some other situations, when you look only at
the data inside one index segment, it _might_ be better to wait



On Fri, Dec 5, 2014 at 1:25 PM, Darin Amos <da...@gmail.com> wrote:

> Couldn’t you just keep passing the wrapped query and searcher down to
> Weight.scorer()?
>
> This would allow you to wait until the query is executed to do term
> collection. If you want to protect against creating and executing the query
> with different searchers, you would have to make the query factory (or
> constructor) only visible to the query parser or parser plugin?
>
> I might not have followed you, this discussing challenges my understanding
> of Lucene and SOLR.
>
> Darin
>
>
>
> > On Dec 5, 2014, at 12:47 PM, Roman Chyla <ro...@gmail.com> wrote:
> >
> > Hi Mikhail, I think you are right, it won't be problem for SOLR, but it
> is
> > likely an antipattern inside a lucene component. Because custom
> components
> > may create join queries, hold to them and then execute much later
> against a
> > different searcher. One approach would be to postpone term collection
> until
> > the query actually runs, I looked far and wide for appropriate place, but
> > only found createWeight() - but at least it does give developers NO
> > opportunity to shoot their feet! ;-)
> >
> > Since it may serve as an inspiration to someone, here is a link:
> >
> https://github.com/romanchyla/montysolr/blob/master-next/contrib/adsabs/src/java/org/apache/lucene/search/SecondOrderQuery.java#L101
> >
> > roman
> >
> > On Fri, Dec 5, 2014 at 4:52 AM, Mikhail Khludnev <
> mkhludnev@griddynamics.com
> >> wrote:
> >
> >> Thanks Roman! Let's expand it for the sake of completeness.
> >> Such issue is not possible in Solr, because caches are associated with
> the
> >> searcher. While you follow this design (see Solr userCache), and don't
> >> update what's cached once, there is no chance to shoot the foot.
> >> There were few caches inside of Lucene (old FieldCache,
> >> CachingWrapperFilter, ExternalFileField, etc), but they are properly
> mapped
> >> onto segment keys, hence it exclude such leakage across different
> >> searchers.
> >>
> >> On Fri, Dec 5, 2014 at 6:43 AM, Roman Chyla <ro...@gmail.com>
> wrote:
> >>
> >>> +1, additionally (as it follows from your observation) the query can
> get
> >>> out of sync with the index, if eg it was saved for later use and ran
> >>> against newly opened searcher
> >>>
> >>> Roman
> >>> On 4 Dec 2014 10:51, "Darin Amos" <da...@gmail.com> wrote:
> >>>
> >>>> Hello All,
> >>>>
> >>>> I have been doing a lot of research in building some custom queries
> >> and I
> >>>> have been looking at the Lucene Join library as a reference. I noticed
> >>>> something that I believe could actually have a negative side effect.
> >>>>
> >>>> Specifically I was looking at the JoinUtil.createJoinQuery(…) method
> >> and
> >>>> within that method you see the following code:
> >>>>
> >>>>        TermsWithScoreCollector termsWithScoreCollector =
> >>>>            TermsWithScoreCollector.create(fromField,
> >>>> multipleValuesPerDocument, scoreMode);
> >>>>        fromSearcher.search(fromQuery, termsWithScoreCollector);
> >>>>
> >>>> As you can see, when the JoinQuery is being built, the code is
> >> executing
> >>>> the query that is wraps with it’s own collector to collect all the
> >>> scores.
> >>>> If I were to write a query parser using this library (which someone
> has
> >>>> done here), doesn’t this reduce the benefit of the SOLR query cache?
> >> The
> >>>> wrapped query is being executing when the Join Query is being
> >>> constructed,
> >>>> not when it is executed.
> >>>>
> >>>> Thanks
> >>>>
> >>>> Darin
> >>>>
> >>>
> >>
> >>
> >>
> >> --
> >> Sincerely yours
> >> Mikhail Khludnev
> >> Principal Engineer,
> >> Grid Dynamics
> >>
> >> <http://www.griddynamics.com>
> >> <mk...@griddynamics.com>
> >>
>
>

Re: Anti-Pattern in lucent-join jar?

Posted by Darin Amos <da...@gmail.com>.

Couldn’t you just keep passing the wrapped query and searcher down to Weight.scorer()?

This would allow you to wait until the query is executed to do term collection. If you want to protect against creating and executing the query with different searchers, you would have to make the query factory (or constructor) only visible to the query parser or parser plugin?

I might not have followed you; this discussion challenges my understanding of Lucene and SOLR.

Darin



> On Dec 5, 2014, at 12:47 PM, Roman Chyla <ro...@gmail.com> wrote:
> 
> Hi Mikhail, I think you are right, it won't be problem for SOLR, but it is
> likely an antipattern inside a lucene component. Because custom components
> may create join queries, hold to them and then execute much later against a
> different searcher. One approach would be to postpone term collection until
> the query actually runs, I looked far and wide for appropriate place, but
> only found createWeight() - but at least it does give developers NO
> opportunity to shoot their feet! ;-)
> 
> Since it may serve as an inspiration to someone, here is a link:
> https://github.com/romanchyla/montysolr/blob/master-next/contrib/adsabs/src/java/org/apache/lucene/search/SecondOrderQuery.java#L101
> 
> roman
> 
> On Fri, Dec 5, 2014 at 4:52 AM, Mikhail Khludnev <mkhludnev@griddynamics.com
>> wrote:
> 
>> Thanks Roman! Let's expand it for the sake of completeness.
>> Such issue is not possible in Solr, because caches are associated with the
>> searcher. While you follow this design (see Solr userCache), and don't
>> update what's cached once, there is no chance to shoot the foot.
>> There were few caches inside of Lucene (old FieldCache,
>> CachingWrapperFilter, ExternalFileField, etc), but they are properly mapped
>> onto segment keys, hence it exclude such leakage across different
>> searchers.
>> 
>> On Fri, Dec 5, 2014 at 6:43 AM, Roman Chyla <ro...@gmail.com> wrote:
>> 
>>> +1, additionally (as it follows from your observation) the query can get
>>> out of sync with the index, if eg it was saved for later use and ran
>>> against newly opened searcher
>>> 
>>> Roman
>>> On 4 Dec 2014 10:51, "Darin Amos" <da...@gmail.com> wrote:
>>> 
>>>> Hello All,
>>>> 
>>>> I have been doing a lot of research in building some custom queries
>> and I
>>>> have been looking at the Lucene Join library as a reference. I noticed
>>>> something that I believe could actually have a negative side effect.
>>>> 
>>>> Specifically I was looking at the JoinUtil.createJoinQuery(…) method
>> and
>>>> within that method you see the following code:
>>>> 
>>>>        TermsWithScoreCollector termsWithScoreCollector =
>>>>            TermsWithScoreCollector.create(fromField,
>>>> multipleValuesPerDocument, scoreMode);
>>>>        fromSearcher.search(fromQuery, termsWithScoreCollector);
>>>> 
>>>> As you can see, when the JoinQuery is being built, the code is
>> executing
>>>> the query that is wraps with it’s own collector to collect all the
>>> scores.
>>>> If I were to write a query parser using this library (which someone has
>>>> done here), doesn’t this reduce the benefit of the SOLR query cache?
>> The
>>>> wrapped query is being executing when the Join Query is being
>>> constructed,
>>>> not when it is executed.
>>>> 
>>>> Thanks
>>>> 
>>>> Darin
>>>> 
>>> 
>> 
>> 
>> 
>> --
>> Sincerely yours
>> Mikhail Khludnev
>> Principal Engineer,
>> Grid Dynamics
>> 
>> <http://www.griddynamics.com>
>> <mk...@griddynamics.com>
>>

Re: Anti-Pattern in lucent-join jar?

Posted by Roman Chyla <ro...@gmail.com>.

Hi Mikhail, I think you are right, it won't be a problem for SOLR, but it is
likely an antipattern inside a lucene component, because custom components
may create join queries, hold on to them and then execute them much later
against a different searcher. One approach would be to postpone term
collection until the query actually runs. I looked far and wide for an
appropriate place, but only found createWeight() - but at least it does give
developers NO opportunity to shoot their feet! ;-)
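
A minimal sketch of the deferral idea described above, in plain Java rather than the Lucene API. ToySearcher, ToyWeight and DeferredJoinQuery are hypothetical stand-ins, and createWeight(searcher) here only mimics the role of the real method: the query object holds no collected terms, and collection happens against whichever searcher asks for a weight.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

final class DeferredJoinSketch {

    /** Hypothetical searcher: a point-in-time view that can run the from-query. */
    static final class ToySearcher {
        private final Set<String> joinTerms;
        int searches = 0; // how many times the from-query ran on this searcher

        ToySearcher(String... joinTerms) {
            this.joinTerms = new TreeSet<>(Arrays.asList(joinTerms));
        }

        Set<String> collectJoinTerms() {
            searches++;
            return joinTerms;
        }
    }

    /** Hypothetical weight: holds the terms collected for exactly one searcher. */
    static final class ToyWeight {
        final Set<String> terms;

        ToyWeight(Set<String> terms) {
            this.terms = terms;
        }
    }

    /** Hypothetical query: constructing it runs nothing and captures no searcher. */
    static final class DeferredJoinQuery {
        /** Stand-in for createWeight(searcher): term collection happens here. */
        ToyWeight createWeight(ToySearcher searcher) {
            return new ToyWeight(searcher.collectJoinTerms());
        }
    }

    public static void main(String[] args) {
        DeferredJoinQuery query = new DeferredJoinQuery(); // no search has run yet
        ToySearcher first = new ToySearcher("p1", "p2");
        ToySearcher reopened = new ToySearcher("p2", "p3");

        System.out.println(query.createWeight(first).terms);    // [p1, p2]
        System.out.println(query.createWeight(reopened).terms); // [p2, p3]
    }
}
```

Because the terms live in the weight rather than in the query, holding on to the query and running it later against a different searcher picks up that searcher's view of the index.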

Since it may serve as an inspiration to someone, here is a link:
https://github.com/romanchyla/montysolr/blob/master-next/contrib/adsabs/src/java/org/apache/lucene/search/SecondOrderQuery.java#L101

roman

On Fri, Dec 5, 2014 at 4:52 AM, Mikhail Khludnev <mkhludnev@griddynamics.com
> wrote:

> Thanks Roman! Let's expand it for the sake of completeness.
> Such issue is not possible in Solr, because caches are associated with the
> searcher. While you follow this design (see Solr userCache), and don't
> update what's cached once, there is no chance to shoot the foot.
> There were few caches inside of Lucene (old FieldCache,
> CachingWrapperFilter, ExternalFileField, etc), but they are properly mapped
> onto segment keys, hence it exclude such leakage across different
> searchers.
>
> On Fri, Dec 5, 2014 at 6:43 AM, Roman Chyla <ro...@gmail.com> wrote:
>
> > +1, additionally (as it follows from your observation) the query can get
> > out of sync with the index, if eg it was saved for later use and ran
> > against newly opened searcher
> >
> > Roman
> > On 4 Dec 2014 10:51, "Darin Amos" <da...@gmail.com> wrote:
> >
> > > Hello All,
> > >
> > > I have been doing a lot of research in building some custom queries
> and I
> > > have been looking at the Lucene Join library as a reference. I noticed
> > > something that I believe could actually have a negative side effect.
> > >
> > > Specifically I was looking at the JoinUtil.createJoinQuery(…) method
> and
> > > within that method you see the following code:
> > >
> > >         TermsWithScoreCollector termsWithScoreCollector =
> > >             TermsWithScoreCollector.create(fromField,
> > > multipleValuesPerDocument, scoreMode);
> > >         fromSearcher.search(fromQuery, termsWithScoreCollector);
> > >
> > > As you can see, when the JoinQuery is being built, the code is
> executing
> > > the query that is wraps with it’s own collector to collect all the
> > scores.
> > > If I were to write a query parser using this library (which someone has
> > > done here), doesn’t this reduce the benefit of the SOLR query cache?
> The
> > > wrapped query is being executing when the Join Query is being
> > constructed,
> > > not when it is executed.
> > >
> > > Thanks
> > >
> > > Darin
> > >
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> <http://www.griddynamics.com>
> <mk...@griddynamics.com>
>

Re: Anti-Pattern in lucent-join jar?

Posted by Darin Amos <da...@gmail.com>.

Thanks for the information!

The reason I ask is that I am doing a POC on building a custom Query + QueryParser + facet component. I have had some trouble finding exactly what I am looking for OOTB, and I believe I need something custom. (It's also a really good learning exercise.)

I work in ecommerce. When you type a search into the website, we execute the search against the available SKUs (e.g. small red shirt, blue 34x34 jeans), then want to roll those items up into their products (shirt, jeans) and return the top-level document. (We are on SOLR 4.3.0 and don’t have parent/child support yet.)

I can’t use grouping because it throws off the pagination and is another can of worms, what I want can be easily done with the join query parser {![score]join from=blah to=blah} but there are two gaps:

1) My customer wants to return facets calculated by the child dataset, not the final parent dataset. Meaning if my search returns a couple shirts, I don’t want to show the “small” facet if none of the shirts have the small size in stock (meaning the small shirt wasn’t in the child docset). The product documents won’t even have the “size” field populated anyway.

2) We want to be able to add filters to the search string that will filter the child documents, not the final result set documents. Example:
q={!join from=parent to=id}name:(*Shirt*)&fq=size:small&fq=color:blue

	- * In this case… if the shirt doesn’t have small or blue in stock, it won’t be returned at all.

Hence, I have been working on a customization. I am looking to build my own custom join query (been calling it a rollup) and base it off of the scorejoin query implementation pointed out to me.

A sample query for small red shirts would look like the following:

q={!rollup from=parent to=id}name:(*Shirt*)&childfq=size:small&childfq=color:red

My code would do the following:
	1) Custom query parser (extends ExtendedDisMaxQParser) would wrap the main query in a BooleanQuery and add the “childfq” queries (Occur.MUST), similar to how the edismax boost queries work today.
	2) Custom query parser would wrap the final BooleanQuery in my custom Rollup query (which would use the score mode for max child score in my case).
	3) Custom rollup query would record and save the child documents' docset and make it available through an accessor method, so that in a custom facet component I can execute:

		Query q = rb.getQuery();
		if(q instanceof RollupQuery){
			RollupQuery rq = (RollupQuery)q;
			DocSet children = rq.getChildren();

			//Build facets from the children docset.

		}
		else{
			//build facets from rb.getResults().docSet….
		}
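
The three steps above can be sketched in plain Java, without any Solr or Lucene classes. Everything here (Sku, RollupResult, rollup(), facet()) is a hypothetical model for illustration only: a predicate plays the role of the childfq clauses, parents get the max score of their matching children, and facets are counted over the retained children rather than over the parents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.function.Predicate;

final class RollupSketch {

    /** Hypothetical child (SKU) document with a pointer to its parent product. */
    static final class Sku {
        final String parent;
        final String size;
        final String color;
        final float score;

        Sku(String parent, String size, String color, float score) {
            this.parent = parent;
            this.size = size;
            this.color = color;
            this.score = score;
        }
    }

    /** Outcome of the rollup: max child score per parent, plus the matching children. */
    static final class RollupResult {
        final Map<String, Float> parentScores = new TreeMap<>();
        final List<Sku> children = new ArrayList<>();
    }

    /** Applies the child filter first, then rolls matching children up to parents. */
    static RollupResult rollup(List<Sku> skus, Predicate<Sku> childFilter) {
        RollupResult result = new RollupResult();
        for (Sku sku : skus) {
            if (!childFilter.test(sku)) {
                continue; // plays the role of a "childfq" clause
            }
            result.children.add(sku);
            result.parentScores.merge(sku.parent, sku.score, Float::max);
        }
        return result;
    }

    /** Counts facet values over the retained children rather than over the parents. */
    static Map<String, Integer> facet(List<Sku> children, Function<Sku, String> field) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Sku sku : children) {
            counts.merge(field.apply(sku), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Sku> skus = Arrays.asList(
            new Sku("shirt-1", "small", "red", 1.0f),
            new Sku("shirt-1", "small", "blue", 2.5f),
            new Sku("shirt-2", "large", "red", 3.0f),
            new Sku("shirt-3", "small", "red", 0.5f));

        RollupResult result = rollup(skus, sku -> sku.size.equals("small"));
        System.out.println(result.parentScores);                      // {shirt-1=2.5, shirt-3=0.5}
        System.out.println(facet(result.children, sku -> sku.color)); // {blue=1, red=2}
    }
}
```

In this model shirt-2 drops out entirely because none of its children pass the filter, which is the behaviour described for the childfq parameters.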

If anyone has taken the time to read this example, I greatly appreciate it and would welcome any feedback. I would be glad to share the final implementation for review.

Thanks

Darin

> On Dec 5, 2014, at 4:52 AM, Mikhail Khludnev <mk...@griddynamics.com> wrote:
> 
> Thanks Roman! Let's expand it for the sake of completeness.
> Such issue is not possible in Solr, because caches are associated with the
> searcher. While you follow this design (see Solr userCache), and don't
> update what's cached once, there is no chance to shoot the foot.
> There were few caches inside of Lucene (old FieldCache,
> CachingWrapperFilter, ExternalFileField, etc), but they are properly mapped
> onto segment keys, hence it exclude such leakage across different
> searchers.
> 
> On Fri, Dec 5, 2014 at 6:43 AM, Roman Chyla <ro...@gmail.com> wrote:
> 
>> +1, additionally (as it follows from your observation) the query can get
>> out of sync with the index, if eg it was saved for later use and ran
>> against newly opened searcher
>> 
>> Roman
>> On 4 Dec 2014 10:51, "Darin Amos" <da...@gmail.com> wrote:
>> 
>>> Hello All,
>>> 
>>> I have been doing a lot of research in building some custom queries and I
>>> have been looking at the Lucene Join library as a reference. I noticed
>>> something that I believe could actually have a negative side effect.
>>> 
>>> Specifically I was looking at the JoinUtil.createJoinQuery(…) method and
>>> within that method you see the following code:
>>> 
>>>        TermsWithScoreCollector termsWithScoreCollector =
>>>            TermsWithScoreCollector.create(fromField,
>>> multipleValuesPerDocument, scoreMode);
>>>        fromSearcher.search(fromQuery, termsWithScoreCollector);
>>> 
>>> As you can see, when the JoinQuery is being built, the code is executing
>>> the query that is wraps with it’s own collector to collect all the
>> scores.
>>> If I were to write a query parser using this library (which someone has
>>> done here), doesn’t this reduce the benefit of the SOLR query cache? The
>>> wrapped query is being executing when the Join Query is being
>> constructed,
>>> not when it is executed.
>>> 
>>> Thanks
>>> 
>>> Darin
>>> 
>> 
> 
> 
> 
> -- 
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
> 
> <http://www.griddynamics.com>
> <mk...@griddynamics.com>

Re: Anti-Pattern in lucent-join jar?

Posted by Mikhail Khludnev <mk...@griddynamics.com>.

Thanks Roman! Let's expand it for the sake of completeness.
Such an issue is not possible in Solr, because caches are associated with the
searcher. As long as you follow this design (see Solr userCache) and don't
update what has been cached, there is no chance to shoot yourself in the foot.
There were a few caches inside of Lucene (old FieldCache,
CachingWrapperFilter, ExternalFileField, etc.), but they are properly mapped
onto segment keys, which excludes such leakage across different
searchers.
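
As a rough illustration of why a cache tied to the searcher cannot serve stale data, here is a plain-Java sketch. It is not Solr's cache implementation; ToySearcher and JoinTermCache are hypothetical names, and the toy searcher ignores the query text and simply returns its fixed term set.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.WeakHashMap;

final class PerSearcherCacheSketch {

    /** Hypothetical searcher: a point-in-time view that can run a from-query. */
    static final class ToySearcher {
        private final Set<String> joinTerms;
        int runs = 0;

        ToySearcher(String... joinTerms) {
            this.joinTerms = new TreeSet<>(Arrays.asList(joinTerms));
        }

        /** Simulates executing the from-query; this toy ignores the query text. */
        Set<String> collect(String fromQuery) {
            runs++;
            return joinTerms;
        }
    }

    /** Cache whose entries belong to one searcher, so a new searcher never sees old data. */
    static final class JoinTermCache {
        private final Map<ToySearcher, Map<String, Set<String>>> bySearcher =
            new WeakHashMap<>();

        Set<String> get(ToySearcher searcher, String fromQuery) {
            return bySearcher
                .computeIfAbsent(searcher, s -> new HashMap<>())
                .computeIfAbsent(fromQuery, searcher::collect);
        }
    }

    public static void main(String[] args) {
        JoinTermCache cache = new JoinTermCache();
        ToySearcher old = new ToySearcher("p1");
        ToySearcher reopened = new ToySearcher("p2");

        cache.get(old, "type:child");
        cache.get(old, "type:child"); // served from the cache
        System.out.println(old.runs);                             // 1
        System.out.println(cache.get(reopened, "type:child"));    // [p2]
    }
}
```

The weak keys mean that once a searcher is no longer referenced, its cached entries become eligible for collection along with it.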

On Fri, Dec 5, 2014 at 6:43 AM, Roman Chyla <ro...@gmail.com> wrote:

> +1, additionally (as it follows from your observation) the query can get
> out of sync with the index, if eg it was saved for later use and ran
> against newly opened searcher
>
> Roman
> On 4 Dec 2014 10:51, "Darin Amos" <da...@gmail.com> wrote:
>
> > Hello All,
> >
> > I have been doing a lot of research in building some custom queries and I
> > have been looking at the Lucene Join library as a reference. I noticed
> > something that I believe could actually have a negative side effect.
> >
> > Specifically I was looking at the JoinUtil.createJoinQuery(…) method and
> > within that method you see the following code:
> >
> >         TermsWithScoreCollector termsWithScoreCollector =
> >             TermsWithScoreCollector.create(fromField,
> > multipleValuesPerDocument, scoreMode);
> >         fromSearcher.search(fromQuery, termsWithScoreCollector);
> >
> > As you can see, when the JoinQuery is being built, the code is executing
> > the query that is wraps with it’s own collector to collect all the
> scores.
> > If I were to write a query parser using this library (which someone has
> > done here), doesn’t this reduce the benefit of the SOLR query cache? The
> > wrapped query is being executing when the Join Query is being
> constructed,
> > not when it is executed.
> >
> > Thanks
> >
> > Darin
> >
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<mk...@griddynamics.com>

Re: Anti-Pattern in lucent-join jar?

Posted by Roman Chyla <ro...@gmail.com>.

+1, additionally (as it follows from your observation) the query can get
out of sync with the index, e.g. if it was saved for later use and run
against a newly opened searcher
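
The out-of-sync case can be sketched in plain Java, with no Lucene classes involved. ToySearcher and EagerJoinQuery are hypothetical; create() mimics a factory that collects join terms immediately, and execute() matches "to" ids against whatever was frozen at that time.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class StaleJoinSketch {

    /** Hypothetical searcher: a point-in-time view of "from" join values and "to" ids. */
    static final class ToySearcher {
        final List<String> fromValues; // join values of docs matching the from-query
        final List<String> toIds;      // ids present on the "to" side

        ToySearcher(List<String> fromValues, List<String> toIds) {
            this.fromValues = fromValues;
            this.toIds = toIds;
        }
    }

    /** Mimics the eager pattern: join terms are collected in the factory method. */
    static final class EagerJoinQuery {
        private final Set<String> terms;

        private EagerJoinQuery(Set<String> terms) {
            this.terms = terms;
        }

        static EagerJoinQuery create(ToySearcher fromSearcher) {
            return new EagerJoinQuery(new TreeSet<>(fromSearcher.fromValues));
        }

        /** Matches "to" documents whose id is among the terms frozen at creation. */
        Set<String> execute(ToySearcher toSearcher) {
            Set<String> hits = new TreeSet<>(toSearcher.toIds);
            hits.retainAll(terms);
            return hits;
        }
    }

    public static void main(String[] args) {
        ToySearcher before = new ToySearcher(Arrays.asList("p1"), Arrays.asList("p1", "p2"));
        EagerJoinQuery saved = EagerJoinQuery.create(before);

        // The index changes: the matching child now points at p2 instead of p1.
        ToySearcher after = new ToySearcher(Arrays.asList("p2"), Arrays.asList("p1", "p2"));

        System.out.println(saved.execute(after));                        // [p1] (stale)
        System.out.println(EagerJoinQuery.create(after).execute(after)); // [p2]
    }
}
```

The saved query keeps answering from the old snapshot, while a query created against the new searcher reflects the current index.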

Roman
On 4 Dec 2014 10:51, "Darin Amos" <da...@gmail.com> wrote:

> Hello All,
>
> I have been doing a lot of research in building some custom queries and I
> have been looking at the Lucene Join library as a reference. I noticed
> something that I believe could actually have a negative side effect.
>
> Specifically I was looking at the JoinUtil.createJoinQuery(…) method and
> within that method you see the following code:
>
>         TermsWithScoreCollector termsWithScoreCollector =
>             TermsWithScoreCollector.create(fromField,
> multipleValuesPerDocument, scoreMode);
>         fromSearcher.search(fromQuery, termsWithScoreCollector);
>
> As you can see, when the JoinQuery is being built, the code is executing
> the query that is wraps with it’s own collector to collect all the scores.
> If I were to write a query parser using this library (which someone has
> done here), doesn’t this reduce the benefit of the SOLR query cache? The
> wrapped query is being executing when the Join Query is being constructed,
> not when it is executed.
>
> Thanks
>
> Darin
>

Re: Anti-Pattern in lucent-join jar?

Posted by Mikhail Khludnev <mk...@griddynamics.com>.

Hello,

I wonder if you have seen https://issues.apache.org/jira/browse/SOLR-6234, which
solves this problem.
The queryResultCache is useless for join, because it carries cropped results.
Potentially you can hit the filter cache by wrapping fromQuery into this monster
bridge:
    new FilteredQuery(new MatchAllDocsQuery(),
        filterCache.get(fromQuery).getTopFilter())
However, you refer to TermsWithScoreCollector, and filterCache doesn't
store scores.
fromQuery is usually not a hotspot for JoinQuery (I spoke about it at the last
Lucene Revolution).
FWIW, it's common to have heavy processing at the Lucene level, e.g. see
RangeQuery. The idea is to cache the result of query execution (but not the
intermediate data) on the levels above, as is done in Solr's filterCache or
queryResultCache.
Hope it helps

On Thu, Dec 4, 2014 at 6:49 PM, Darin Amos <da...@gmail.com> wrote:

> Hello All,
>
> I have been doing a lot of research in building some custom queries and I
> have been looking at the Lucene Join library as a reference. I noticed
> something that I believe could actually have a negative side effect.
>
> Specifically I was looking at the JoinUtil.createJoinQuery(…) method and
> within that method you see the following code:
>
>         TermsWithScoreCollector termsWithScoreCollector =
>             TermsWithScoreCollector.create(fromField,
> multipleValuesPerDocument, scoreMode);
>         fromSearcher.search(fromQuery, termsWithScoreCollector);
>
> As you can see, when the JoinQuery is being built, the code is executing
> the query that is wraps with it’s own collector to collect all the scores.
> If I were to write a query parser using this library (which someone has
> done here), doesn’t this reduce the benefit of the SOLR query cache? The
> wrapped query is being executing when the Join Query is being constructed,
> not when it is executed.
>
> Thanks
>
> Darin
>

-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<mk...@griddynamics.com>