Posted to solr-user@lucene.apache.org by Tom Burton-West <tb...@umich.edu> on 2012/08/22 00:39:25 UTC

Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

Hello all,

We are thinking about using Solr Field Collapsing on a rather large scale
and wonder if anyone has experience with its performance when collapsing
millions or billions of documents (details below).  Are there performance
issues with grouping large result sets?

Details:
We have a collection of the full text of 10 million books/journals.  This
is spread across 12 shards with each shard holding about 800,000
documents.  When a query matches a journal article, we would like to group
all the matching articles from the same journal together (there is a
unique id field identifying the journal).  Similarly, when there is a match
in multiple copies of the same book, we would like to group all results for
the same book together (again, we have a unique id field we can group on).
Sometimes a short query against the OCR field will result in over one
million hits.  Are there known performance issues when field collapsing
result sets containing a million hits?

We currently index the entire book as one Solr document.  We would like to
investigate the feasibility of indexing each page as a Solr document with a
field indicating the book id.  We could then offer our users the choice of
a list of the most relevant pages, or a list of the books containing the
most relevant pages.  We have approximately 3 billion pages.   Does anyone
have experience using field collapsing on this sort of scale?

Tom

Tom Burton-West
Information Retrieval Programmer
Digital Library Production Service
University of Michigan Library
http://www.hathitrust.org/blogs/large-scale-search
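
The page-per-document idea in the message above can be sketched as follows. This is only an illustration, not the HathiTrust schema: the field names `book_id`, `page_seq`, and `ocr`, and the id scheme, are assumptions made up for the example.

```python
from typing import Dict, Iterator, List


def pages_to_docs(book_id: str, pages: List[str]) -> Iterator[Dict[str, object]]:
    """Yield one Solr-style document per page, each tagged with its book id."""
    for seq, text in enumerate(pages, start=1):
        yield {
            "id": f"{book_id}_p{seq}",  # unique key per page (hypothetical scheme)
            "book_id": book_id,         # the field a "list of books" view would group on
            "page_seq": seq,
            "ocr": text,
        }


docs = list(pages_to_docs("book42", ["first page text", "second page text"]))
print(len(docs))
```

With documents shaped like this, a "most relevant pages" view would be a plain query, while a "books containing the most relevant pages" view would add `group=true&group.field=book_id` to the same query.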

Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

Posted by ilay <il...@gmail.com>.
Hello all,

  I have a similar grouping situation: I want to group my products into
top-level categories for an e-commerce application. The number of groups
is less than 10 and the total number of docs in the index is 10 million.
Will Solr grouping be an issue here? We have seen OOM errors when we tried
grouping similar editions of books against the same index. However, if we
are grouping by category, where the number of groups is less than 10, will
it still be a problem? Any thoughts on this would be greatly appreciated.




Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
Tom,
Feel free to look at my benchmark results for two alternative joining
approaches.
http://blog.griddynamics.com/2012/08/block-join-query-performs.html

Regards

On Thu, Aug 23, 2012 at 4:40 PM, Erick Erickson <er...@gmail.com> wrote:

> Tom:
>
> I think my comments were that grouping on a field where there was
> a unique value _per document_ chewed up a lot of resources.
> Conceptually, there's a bucket for each unique group value. And
> grouping on a file path is just asking for trouble.
>
> But the memory used for grouping should max out as a function of
> the number of unique values in the grouped field.
>
> Best
> Erick



-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

Posted by Erick Erickson <er...@gmail.com>.
Tom:

I think my comments were that grouping on a field where there was
a unique value _per document_ chewed up a lot of resources.
Conceptually, there's a bucket for each unique group value. And
grouping on a file path is just asking for trouble.

But the memory used for grouping should max out as a function of
the number of unique values in the grouped field.

Best
Erick
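
The "bucket for each unique group value" point can be illustrated with a toy model. This is a conceptual sketch only and does not reproduce Solr's actual grouping collectors; the function name and the sample data are invented for the illustration.

```python
import heapq
from collections import defaultdict


def group_top_docs(hits, limit=2):
    """Keep the `limit` best-scoring docs per group value.

    `hits` is an iterable of (doc_id, group_value, score) tuples.
    The number of buckets, and so the memory held, grows with the
    number of distinct group values, not with the number of hits.
    """
    buckets = defaultdict(list)  # one min-heap per distinct group value
    for doc_id, group_value, score in hits:
        heap = buckets[group_value]
        if len(heap) < limit:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
    return buckets


# 100,000 hits spread over 10 journals versus 100,000 unique file paths.
few_groups = [(i, f"journal{i % 10}", 1.0 / (i + 1)) for i in range(100_000)]
unique_groups = [(i, f"/path/to/file{i}", 1.0 / (i + 1)) for i in range(100_000)]

few_buckets = group_top_docs(few_groups)
unique_buckets = group_top_docs(unique_groups)
print(len(few_buckets))     # 10
print(len(unique_buckets))  # 100000
```

Both inputs contain the same number of hits, but the file-path case holds ten thousand times as many buckets, which is the shape of the problem described for grouping on a full file path.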

On Wed, Aug 22, 2012 at 11:32 PM, Lance Norskog <go...@gmail.com> wrote:
> Yes, distributed grouping works, but grouping takes a lot of
> resources. If you can avoid it in distributed mode, so much the better.

Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

Posted by Lance Norskog <go...@gmail.com>.
Yes, distributed grouping works, but grouping takes a lot of
resources. If you can avoid it in distributed mode, so much the better.

On Wed, Aug 22, 2012 at 3:35 PM, Tom Burton-West <tb...@umich.edu> wrote:
> Thanks Tirthankar,
>
> So the issue is memory use for sorting.  I'm not sure I understand how
> sorting of grouping fields is involved with the defaults and field
> collapsing, since the default sorts by relevance, not by grouping field.  On
> the other hand, I don't know much about how field collapsing is implemented.
>
> So far the few tests I've made haven't revealed any memory problems.  We
> are using very small string fields for grouping and I think that we
> probably only have a couple of cases where we are grouping more than a few
> thousand docs.   I will try to find a query with a lot of docs per group
> and take a look at the memory use using JConsole.
>
> Tom



-- 
Lance Norskog
goksron@gmail.com

Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

Posted by Tom Burton-West <tb...@umich.edu>.
Thanks Tirthankar,

So the issue is memory use for sorting.  I'm not sure I understand how
sorting of grouping fields is involved with the defaults and field
collapsing, since the default sorts by relevance, not by grouping field.  On
the other hand, I don't know much about how field collapsing is implemented.

So far the few tests I've made haven't revealed any memory problems.  We
are using very small string fields for grouping and I think that we
probably only have a couple of cases where we are grouping more than a few
thousand docs.   I will try to find a query with a lot of docs per group
and take a look at the memory use using JConsole.

Tom


On Wed, Aug 22, 2012 at 4:02 PM, Tirthankar Chatterjee <
tchatterjee@commvault.com> wrote:

> Hi Tom,
>
> We had an issue where we were keeping millions of docs in a single node and
> we were trying to group them on a string field which is nothing but the full
> file path… that caused SOLR to go out of memory…
>
> Erick has explained nicely in the thread as to why it won’t work and I had
> to find another way of architecting it.
>
> How do you think this is different in your case? If you want to group by a
> string field with thousands of similar entries I am guessing you will face
> the same issue.
>
> Thanks,
>
> Tirthankar

RE: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

Posted by Tirthankar Chatterjee <tc...@commvault.com>.
Hi Tom,
We had an issue where we were keeping millions of docs in a single node and we were trying to group them on a string field which is nothing but the full file path... that caused SOLR to go out of memory...

Erick has explained nicely in the thread as to why it won't work and I had to find another way of architecting it.

How do you think this is different in your case? If you want to group by a string field with thousands of similar entries I am guessing you will face the same issue.

Thanks,
Tirthankar



Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

Posted by Tom Burton-West <tb...@umich.edu>.
Hi Tirthankar,

Can you give me a quick summary of what won't work and why?
I couldn't figure it out from looking at your thread.  You seem to have a
different issue, but maybe I'm missing something here.

Tom

On Tue, Aug 21, 2012 at 7:10 PM, Tirthankar Chatterjee <
tchatterjee@commvault.com> wrote:

> This won't work, see my thread on Solr3.6 Field collapsing
> Thanks,
> Tirthankar
>
>

Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

Posted by Tom Burton-West <tb...@umich.edu>.
Hi Lance,

I don't understand enough of how the field collapsing is implemented, but I
thought it worked with distributed search.  Are you saying it only works if
everything that needs collapsing is on the same shard?

Tom

On Wed, Aug 22, 2012 at 2:41 AM, Lance Norskog <go...@gmail.com> wrote:

> How do you separate the documents among the shards? Can you set up the
> shards such that each "collapse group" is only on a single shard, so that
> you never have to do distributed grouping?

Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

Posted by Tom Burton-West <tb...@umich.edu>.
Hi Lance and Tirthankar,

We are currently using Solr 3.6.  I tried a search across our current 12
shards grouping by book id (record_no in our schema) and it seems to work
fine (the query with the actual urls for the shards changed is appended
below.)

I then searched for the record_no of the second group in the results to
confirm that the number of records being folded is correct. In both cases
the numFound is 505 so it seems as though the record counts for the group
are correct.  Then I tried the same search but changed the shards parameter
to limit the search to 1/2 of the shards and got numFound = 325.  This
shows that the items in the group are distributed between different shards.

What am I missing here?   What is it that you are saying does not work?

Tom

Field Collapse query (IP address changed, newlines added, and shard
URLs simplified for readability):


http://solr-myhost.edu/serve-9/select?indent=on&version=2.2
&shards=shard1,shard2,shard3,shard4,shard5,shard6,...shard12
&q=title:nature&fq=&start=0&rows=10&fl=id,author,title,volume_enumcron,score
&group=true&group.field=record_no&group.limit=2
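
For readers who want to reproduce the two queries described in this message (the grouped search, then a search on one group's `record_no` to compare `numFound`), here is a minimal sketch that only builds the URLs. The host and handler path are the anonymized ones from the post, the shard names are placeholders (real entries would be host:port/path values), and both helper names and the `12345` value are made up for the example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Anonymized host from the post; shard names are placeholders.
BASE_URL = "http://solr-myhost.edu/serve-9/select"
SHARDS = [f"shard{i}" for i in range(1, 13)]


def grouped_query_url(q, group_field, group_limit=2, rows=10):
    """Build a distributed search URL that collapses results on group_field."""
    params = [
        ("q", q),
        ("start", 0),
        ("rows", rows),
        ("fl", "id,author,title,volume_enumcron,score"),
        ("shards", ",".join(SHARDS)),
        ("group", "true"),
        ("group.field", group_field),
        ("group.limit", group_limit),
    ]
    return BASE_URL + "?" + urlencode(params)


def group_members_url(group_field, value):
    """Build an ungrouped query for one group value; rows=0 asks only for numFound."""
    params = [
        ("q", f"{group_field}:{value}"),
        ("rows", 0),
        ("shards", ",".join(SHARDS)),
    ]
    return BASE_URL + "?" + urlencode(params)


url = grouped_query_url("title:nature", "record_no")
check_url = group_members_url("record_no", "12345")  # made-up record number

# Round-trip the query string to confirm the parameters survived encoding.
parsed = parse_qs(urlsplit(url).query)
print(parsed["group.field"])  # ['record_no']
```

Repeating the second query with a shorter `SHARDS` list is the same check used above to show that members of one group live on different shards.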

Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

Posted by Lance Norskog <go...@gmail.com>.
How do you separate the documents among the shards? Can you set up the
shards such that each "collapse group" is only on a single shard, so that
you never have to do distributed grouping?
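
One way to get each collapse group onto a single shard is to choose the shard at indexing time from a hash of the group key. The sketch below is an assumption about how an indexing client could do this, not a Solr feature described in this thread; `shard_for_group` and the sample records are hypothetical.

```python
import zlib


def shard_for_group(group_key: str, num_shards: int) -> int:
    """Map a group key to a shard index so all docs sharing the key land together.

    crc32 is used because Python's built-in hash() for strings is salted
    per process and would not give a stable assignment across runs.
    """
    return zlib.crc32(group_key.encode("utf-8")) % num_shards


NUM_SHARDS = 12
# (document id, record_no) pairs; two volumes share record "recA".
sample_docs = [("vol1", "recA"), ("vol2", "recA"), ("vol3", "recB")]
placement = {
    doc_id: shard_for_group(record_no, NUM_SHARDS)
    for doc_id, record_no in sample_docs
}
print(placement["vol1"] == placement["vol2"])  # True
```

The trade-off is that shard sizes then depend on how evenly the group key hashes, and changing the number of shards moves groups around.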

On Tue, Aug 21, 2012 at 4:10 PM, Tirthankar Chatterjee
<tc...@commvault.com> wrote:
> This won't work, see my thread on Solr3.6 Field collapsing
> Thanks,
> Tirthankar



-- 
Lance Norskog
goksron@gmail.com

Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

Posted by Tirthankar Chatterjee <tc...@commvault.com>.
This won't work, see my thread on Solr3.6 Field collapsing
Thanks,
Tirthankar

-----Original Message-----
From: Tom Burton-West <tb...@umich.edu>
Date: Tue, 21 Aug 2012 18:39:25 
To: solr-user@lucene.apache.org<so...@lucene.apache.org>
Reply-To: "solr-user@lucene.apache.org" <so...@lucene.apache.org>
Cc: William Dueber<du...@umich.edu>; Phillip Farber<pf...@umich.edu>
Subject: Scalability of Solr Result Grouping/Field Collapsing:
 Millions/Billions of documents?
