Posted to solr-user@lucene.apache.org by tedsolr <ts...@sciquest.com> on 2015/01/13 22:29:03 UTC

Engage custom hit collector for special search processing

I have a complicated problem to solve, and I don't know enough about
lucene/solr to phrase the question properly. This is kind of a shot in the
dark. My requirement is to return search results always in completely
"collapsed" form, rolling up duplicates with a count. Duplicates are defined
by whatever fields are requested. If the search requests fields A, B, C,
then all matched documents that have identical values for those 3 fields are
"dupes". The field list may change with every new search request. What I do
know is the super set of all fields that may be part of the field list at
index time.

I know this can't be done with configuration alone. It doesn't seem
performant to retrieve all 1M+ docs and post process in Java. A very smart
person told me that a custom hit collector should be able to do the
filtering for me. So, maybe I create a custom search handler that somehow
exposes this custom hit collector that can use FieldCache or DocValues to
examine all the matches and filter the results in the way I've described
above.
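For what it's worth, the roll-up being described can be sketched in plain Java to pin down the contract: group matched rows by the tuple of requested field values and count duplicates. This is only the core logic with made-up names, not an actual Lucene collector; inside Solr the per-document values would come from FieldCache/DocValues rather than a List of Maps.

```java
import java.util.*;

public class CollapseSketch {
    /** Collapse rows to unique value tuples over the requested fields, with counts. */
    public static Map<List<String>, Integer> collapse(List<Map<String, String>> rows,
                                                      List<String> fields) {
        Map<List<String>, Integer> counts = new LinkedHashMap<>();
        for (Map<String, String> row : rows) {
            List<String> key = new ArrayList<>();
            for (String f : fields) {
                key.add(row.get(f)); // missing fields stay null in the key
            }
            counts.merge(key, 1, Integer::sum); // increment the count for this tuple
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map<String, String>> rows = List.of(
            Map.of("A", "x", "B", "1", "C", "p"),
            Map.of("A", "x", "B", "1", "C", "p"), // duplicate of the first row
            Map.of("A", "y", "B", "2", "C", "q"));
        System.out.println(collapse(rows, List.of("A", "B", "C")));
    }
}
```

The key point is that the "duplicate" definition lives entirely in the field list passed in at query time, which is what the custom collector would need to reproduce.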

So assuming this is a viable solution path, can anyone suggest some helpful
posts, code fragments, books for me to review? I admit to being out of my
depth, but this requirement isn't going away. I'm grasping for straws right
now.

thanks
(using Solr 4.9)





--
View this message in context: http://lucene.472066.n3.nabble.com/Engage-custom-hit-collector-for-special-search-processing-tp4179348.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Engage custom hit collector for special search processing

Posted by William Bell <bi...@gmail.com>.
We all need example data, and a sample query to help you.

You can use "group" to group by a field and remove dupes.
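For reference, single-field grouping is engaged with request parameters like the following (standard result-grouping parameters; `fieldA` is a placeholder):

```
q=*:*&group=true&group.field=fieldA&group.limit=1
```

Note that built-in grouping works on a single field, so the multi-field duplicate definition in the original post would still need extra work.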

If you want to remove dupes you can do something like:

q=field1:DOG AND NOT field2:DOG AND NOT field3:DOG

That will match documents that have DOG in field1 but not in field2 or field3.

If you don't care which field it appears in, you can use dismax/edismax with
the qf parameter, or you can just use OR:

q=field1:DOG OR field2:DOG OR field3:DOG

If you want to remove duplicate values at INDEX time, you can do that with
SQL (if the data comes from SQL), or by writing a script transformer in the
DIH:

var x  = row.get("field1");
var x1 = row.get("field2");
var x2 = row.get("field3");

// Blank out field2/field3 when they duplicate field1 (guard against nulls).
if (x != null && x.equals(x1)) {
   row.put("field2", "");
}

if (x != null && x.equals(x2)) {
   row.put("field3", "");
}

That way you eliminate the dupes at index time...

Bill









-- 
Bill Bell
billnbell@gmail.com
cell 720-256-8076

Re: Engage custom hit collector for special search processing

Posted by tedsolr <ts...@sciquest.com>.
Thank you so much Alex and Joel for your ideas. I am poring through the
documentation and code now to try to understand it all. A post filter sounds
promising. As 99% of my doc fields are character based, I should try to
complement the collapsing Q parser with an option that compares string
fields for equality. As long as a multi-field comparison approach is not
prohibited in some way by this architecture, I feel it's a great place to
start.




Re: Engage custom hit collector for special search processing

Posted by Joel Bernstein <jo...@gmail.com>.
You may also want to take a look at how AnalyticsQueries can be plugged in.
This won't show you how to do the implementation, but it will show you how
to plug in a custom collector.

http://heliosearch.org/solrs-new-analyticsquery-api/
http://heliosearch.org/solrs-mergestrategy/

Joel Bernstein
Search Engineer at Heliosearch


Re: Engage custom hit collector for special search processing

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Sounds like:
https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
http://heliosearch.org/the-collapsingqparserplugin-solrs-new-high-performance-field-collapsing-postfilter/

The main issue is your multi-field criteria, so you may need to
extend/override the comparison method. You'd also need to keep the
counts, which you can track yourself since you are doing the filtering.
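For context, the post filter is engaged as a filter query; on a single field it looks like this (`fieldA` is a placeholder):

```
fq={!collapse field=fieldA}
```

Extending the comparison to an arbitrary per-request list of fields is the part that would need custom code, or a precomputed combined "signature" field, since the superset of candidate fields is known at index time.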

Is this the right direction for what you need?

Regards,
   Alex.
----
Sign up for my Solr resources newsletter at http://www.solr-start.com/



Re: Engage custom hit collector for special search processing

Posted by tedsolr <ts...@sciquest.com>.
As insane as it sounds, I need to process all the results. No one document is
more or less important than another. Only a few hundred "unique" docs will
be sent to the client at any one time, but the users expect to page through
them all.

I don't expect sub-second performance for this task. I'm just hoping for
something "reasonable", and I can't define that either.




Re: Engage custom hit collector for special search processing

Posted by Jack Krupansky <ja...@gmail.com>.
Do you have a sense of what your typical queries would look like? I mean,
maybe you wouldn't actually need to fetch more than a tiny fraction of
those million documents. Do you only need to determine the top 10 or 20 or
50 unique field value row sets, or do you need to determine ALL unique row
sets? The latter would never be very performant even as a custom
handler/collector since it would have to scan all rows.

Try a client-side solution that reads 100 (or 50 or 20 or 200) rows at a
time, storing rows by the unique combination of field values, until you hit
the threshold needed for number of unique row sets.
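That client-side loop could be sketched like this, with the page source standing in for a real paged Solr query (all names here are illustrative, not SolrJ API):

```java
import java.util.*;
import java.util.function.IntFunction;

public class PagedRollup {
    /**
     * Fold pages of field-value tuples into a running count per unique tuple,
     * stopping once at least 'neededUnique' distinct tuples have been seen
     * (a page may overshoot) or the pages run out.
     */
    public static Map<List<String>, Integer> rollup(
            IntFunction<List<List<String>>> pageSource, // start offset -> one page of rows
            int pageSize, int neededUnique) {
        Map<List<String>, Integer> counts = new LinkedHashMap<>();
        int start = 0;
        while (counts.size() < neededUnique) {
            List<List<String>> page = pageSource.apply(start);
            if (page.isEmpty()) break; // no more matches
            for (List<String> tuple : page) {
                counts.merge(tuple, 1, Integer::sum);
            }
            start += pageSize; // analogous to the start/rows paging parameters
        }
        return counts;
    }

    public static void main(String[] args) {
        // In-memory stand-in for search results: each row is already reduced
        // to its requested-field value tuple.
        List<List<String>> all = List.of(
            List.of("x", "1"), List.of("x", "1"),
            List.of("y", "2"), List.of("z", "3"));
        int pageSize = 2;
        IntFunction<List<List<String>>> source = start ->
            all.subList(Math.min(start, all.size()),
                        Math.min(start + pageSize, all.size()));
        System.out.println(rollup(source, pageSize, 3));
    }
}
```

As Jack notes, this only pays off when the needed number of unique row sets is small relative to the match count; finding ALL unique row sets still forces a full scan.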

-- Jack Krupansky
