Posted to solr-user@lucene.apache.org by tedsolr <ts...@sciquest.com> on 2015/09/02 23:12:14 UTC

Merging documents from a distributed search

I've read from http://heliosearch.org/solrs-mergestrategy/ that the AnalyticsQuery
component only works for a single instance of Solr. I'm planning to
"migrate" to the SolrCloud soon and I have a custom AnalyticsQuery module
that collapses what I consider to be duplicate documents, keeping stats like
a "count" of the dupes. For my purposes "dupes" are determined at run time
and vary by the search request. Once a collection has multiple shards I will
not be able to prevent "dupes" from appearing across those shards. A custom
merge strategy should allow me to merge my stats, but I don't see how I can
drop duplicate docs at that point.
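
[For reference, a minimal sketch of what such a single-node collapse can look like: an AnalyticsQuery whose DelegatingCollector forwards only the first document seen for each key and counts the rest. The field name "dedupe_key", the class names, and the single-field simplification are assumptions for illustration, not the actual module; this is written against the Solr/Lucene 5.x collector APIs.]

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.SortedDocValues;
import org.apache.lucene.search.IndexSearcher;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.search.AnalyticsQuery;
import org.apache.solr.search.DelegatingCollector;

// Hypothetical single-node collapse: only the first document per key is
// passed down the collector chain; later ones are counted as dupes.
public class CollapseDupesQuery extends AnalyticsQuery {

  @Override
  public DelegatingCollector getAnalyticsCollector(ResponseBuilder rb, IndexSearcher searcher) {
    return new CollapseDupesCollector(rb);
  }
}

class CollapseDupesCollector extends DelegatingCollector {
  private final ResponseBuilder rb;
  private final Set<String> seen = new HashSet<>();
  private SortedDocValues keys;   // per-segment docValues for the dedupe field
  private int dupeCount;

  CollapseDupesCollector(ResponseBuilder rb) {
    this.rb = rb;
  }

  @Override
  protected void doSetNextReader(LeafReaderContext context) throws IOException {
    super.doSetNextReader(context);   // keep the delegate chain wired up
    keys = context.reader().getSortedDocValues("dedupe_key");   // assumed field name
  }

  @Override
  public void collect(int doc) throws IOException {
    if (keys == null) {               // segment has no docValues for the field
      super.collect(doc);
      return;
    }
    String key = keys.get(doc).utf8ToString();   // Lucene 5.x docValues API
    if (seen.add(key)) {
      super.collect(doc);             // first time this key is seen: keep the doc
    } else {
      dupeCount++;                    // otherwise just count it
    }
  }

  @Override
  public void finish() throws IOException {
    rb.rsp.add("dupeCount", dupeCount);   // surface the stat in the response
    if (delegate instanceof DelegatingCollector) {
      ((DelegatingCollector) delegate).finish();
    }
  }
}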

If shard1 returns docs A & B and shard2 returns docs B & C (letters denoting
what I consider to be unique docs), can my implementation of a merge
strategy return only docs A, B, & C, rather than A, B, B, & C?

thanks! 
solr 5.2.1




RE: Merging documents from a distributed search

Posted by Markus Jelsma <ma...@openindex.io>.
It seems so indeed. Please look up the thread titled "Custom merge logic in SolrCloud."       

 
 
-----Original message-----
> From:tedsolr <ts...@sciquest.com>
> Sent: Thursday 3rd September 2015 21:28
> To: solr-user@lucene.apache.org
> Subject: RE: Merging documents from a distributed search
> 
> Markus, did you mistakenly post a link to this same thread?
> 
> 
> 
> 

Re: Merging documents from a distributed search

Posted by tedsolr <ts...@sciquest.com>.
Joel,

It needs to perform. Typically users will have 1 - 5 million rows in a
query, returning 10 - 15 fields. Grouping normally reduces the return by 50%
or more. Responses tend to be less than half a second.

It sounds like the manipulation of docs at the collector level has been left
to single-node Solr implementations, and that your streaming API is the
way forward for cloud implementations, even if it does have some performance
drawbacks. I can bear slower searches as long as they are not seconds
slower.

I could implement some business strategy that forks searching to either the
AnalyticsQuery or the streaming API based on the shard count in the
collection. Most of my customers will have single-shard collections. A goal
of mine is to keep each collection whole as long as possible. If one gets
too big for the pond I'll move it to a bigger pond, until some heap limit is
reached and it has to be split.
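
[As a rough illustration of that forking idea, something like the following could look up the shard count for a collection and choose a path. The dispatcher class and the two path methods are hypothetical; the SolrJ calls are the 5.x-era API.]

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.cloud.ClusterState;

// Hypothetical dispatcher: single-shard collections keep the existing
// AnalyticsQuery path, multi-shard collections go through the Streaming API.
public class SearchDispatcher {

  private final CloudSolrClient client;   // SolrJ 5.x client

  public SearchDispatcher(String zkHost) {
    this.client = new CloudSolrClient(zkHost);
    this.client.connect();
  }

  public void search(String collection, String query) {
    ClusterState state = client.getZkStateReader().getClusterState();
    int shards = state.getCollection(collection).getSlices().size();

    if (shards == 1) {
      runAnalyticsQueryPath(collection, query);   // hypothetical helper
    } else {
      runStreamingPath(collection, query);        // hypothetical helper
    }
  }

  private void runAnalyticsQueryPath(String collection, String query) { /* ... */ }

  private void runStreamingPath(String collection, String query) { /* ... */ }
}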




Re: Merging documents from a distributed search

Posted by Joel Bernstein <jo...@gmail.com>.
It's possible that the ReducerStream's buffer can grow too large if
document groups are very large. But the ReducerStream only needs to hold
one group at a time in memory. The RollupStream, in trunk, has a grouping
implementation that doesn't hang on to all the Tuples from a group. You
could also implement a custom stream that does exactly what you need.

The AnalyticsQuery is much more efficient because the data is left in place.
The Streaming API has streaming overhead which is considerable. But it's
the Stream "shuffling" that gives you the power to do things like fully
distributed grouping.
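
[For the sake of illustration, a sketch in the spirit of the blog post linked below: open a CloudSolrStream over the /export handler, sorted by the grouping fields, and roll up one group at a time, which is roughly what ReducerStream/RollupStream do for you. Field names, the collection name, and the zkHost are placeholders, and the constructor shown is the 5.x-era API; exact signatures may differ by release.]

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.solr.client.solrj.io.Tuple;
import org.apache.solr.client.solrj.io.stream.CloudSolrStream;

public class GroupingStreamExample {

  public static void main(String[] args) throws IOException {
    Map<String, String> props = new HashMap<>();
    props.put("q", "*:*");
    props.put("qt", "/export");                 // stream the full result set
    props.put("fl", "fieldA,fieldC");           // only the fields that define a "dupe"
    props.put("sort", "fieldA asc,fieldC asc"); // sorted so each group arrives contiguously

    CloudSolrStream stream = new CloudSolrStream("zkhost:2181", "collection1", props);
    try {
      stream.open();
      String currentKey = null;
      long count = 0;
      while (true) {
        Tuple tuple = stream.read();
        if (tuple.EOF) {
          break;
        }
        String key = tuple.getString("fieldA") + "|" + tuple.getString("fieldC");
        if (currentKey != null && !key.equals(currentKey)) {
          System.out.println(currentKey + " count=" + count);  // emit the finished group
          count = 0;
        }
        currentKey = key;
        count++;                                // only one group is held in memory
      }
      if (currentKey != null) {
        System.out.println(currentKey + " count=" + count);
      }
    } finally {
      stream.close();
    }
  }
}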

How many records are processed in a typical query and what type of response
time do you need?

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Sep 3, 2015 at 3:25 PM, tedsolr <ts...@sciquest.com> wrote:

> Thanks Joel, that link looks promising. The CloudSolrStream bypasses my
> issue
> of multiple shards. Perhaps the ReducerStream would provide what I need. At
> first glance I worry that the buffer would grow too large - if it's
> really holding the values for all the fields in each document
> (Tuple.getMaps()). I use a Map in my DelegatingCollector to store the
> "unique" docs, but I only keep the docId, my stats, and the ordinals for
> each field. Would you expect the new streams API to perform as well as my
> implementation of an AnalyticsQuery and a DelegatingCollector?
>
>
>
>

Re: Merging documents from a distributed search

Posted by tedsolr <ts...@sciquest.com>.
Thanks Joel, that link looks promising. The CloudSolrStream bypasses my issue
of multiple shards. Perhaps the ReducerStream would provide what I need. At
first glance I worry that the buffer would grow too large - if it's
really holding the values for all the fields in each document
(Tuple.getMaps()). I use a Map in my DelegatingCollector to store the
"unique" docs, but I only keep the docId, my stats, and the ordinals for
each field. Would you expect the new streams API to perform as well as my
implementation of an AnalyticsQuery and a DelegatingCollector?




Re: Merging documents from a distributed search

Posted by Joel Bernstein <jo...@gmail.com>.
The merge strategy probably won't work for the type of distributed collapse
you're describing.

You may want to begin exploring the Streaming API which supports real-time
map/reduce operations,

http://joelsolr.blogspot.com/2015/03/parallel-computing-with-solrcloud.html

Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Sep 2, 2015 at 5:12 PM, tedsolr <ts...@sciquest.com> wrote:

> I've read from http://heliosearch.org/solrs-mergestrategy/ that the AnalyticsQuery
> component only works for a single instance of Solr. I'm planning to
> "migrate" to the SolrCloud soon and I have a custom AnalyticsQuery module
> that collapses what I consider to be duplicate documents, keeping stats
> like
> a "count" of the dupes. For my purposes "dupes" are determined at run time
> and vary by the search request. Once a collection has multiple shards I
> will
> not be able to prevent "dupes" from appearing across those shards. A custom
> merge strategy should allow me to merge my stats, but I don't see how I can
> drop duplicate docs at that point.
>
> If shard1 returns docs A & B and shard2 returns docs B & C (letters
> denoting
> what I consider to be unique docs), can my implementation of a merge
> strategy return only docs A, B, & C, rather than A, B, B, & C?
>
> thanks!
> solr 5.2.1
>
>
>
>

Re: Merging documents from a distributed search

Posted by tedsolr <ts...@sciquest.com>.
Upayavira,

The docs are all unique. In my example the two docs are considered to be
dupes because the requested fields all have the same values.

Field:   A      B   C   D    E
Doc 1:   apple  10  15  bye  yellow
Doc 2:   apple  12  15  by   green

The two docs are certainly unique. Say they are on different shards in the
same collection. If the search request has fl:A,C then the two are dupes and
the user wants to see them collapsed. If the search request has fl:A,B,C
then the two are unique from the user's perspective and display separately.

Each doc typically has a couple hundred fields. When viewed through the lens
of just 3 or 4 fields, lots of docs, sometimes thousands, will be rolled up and
I'll compute some stats on that group. Bringing all those docs back to the
calling app for processing is too slow. The AnalyticsQuery does a great job
of filtering out the dupes, but it looks like I need another solution for
multi-shard collections.
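
[As a toy illustration of that run-time notion of a dupe (the method, the document representation, and the field names are hypothetical), the key that decides whether two docs collapse is just the projection of the requested fields:]

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class DupeKey {

  // Two documents are "dupes" for a given request if this key is equal
  // for both, i.e. every requested field has the same value.
  static String keyFor(Map<String, Object> doc, List<String> requestedFields) {
    return requestedFields.stream()
        .map(f -> String.valueOf(doc.get(f)))
        .collect(Collectors.joining("|"));
  }
}

[With fl=A,C both example docs above map to the key "apple|15" and collapse into one; with fl=A,B,C the keys "apple|10|15" and "apple|12|15" differ and the docs display separately.]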




Re: Merging documents from a distributed search

Posted by Upayavira <uv...@odoko.co.uk>.

On Wed, Sep 2, 2015, at 10:12 PM, tedsolr wrote:
> I've read from http://heliosearch.org/solrs-mergestrategy/ that the AnalyticsQuery
> component only works for a single instance of Solr. I'm planning to
> "migrate" to the SolrCloud soon and I have a custom AnalyticsQuery module
> that collapses what I consider to be duplicate documents, keeping stats
> like
> a "count" of the dupes. For my purposes "dupes" are determined at run
> time
> and vary by the search request. Once a collection has multiple shards I
> will
> not be able to prevent "dupes" from appearing across those shards. A
> custom
> merge strategy should allow me to merge my stats, but I don't see how I
> can
> drop duplicate docs at that point.
> 
> If shard1 returns docs A & B and shard2 returns docs B & C (letters
> denoting
> what I consider to be unique docs), can my implementation of a merge
> strategy return only docs A, B, & C, rather than A, B, B, & C?

How did you end up with document B in both shard1 and shard2? Can't you
prevent that from happening, and thus not have this issue?

Upayavira

RE: Merging documents from a distributed search

Posted by tedsolr <ts...@sciquest.com>.
Markus, did you mistakenly post a link to this same thread?


