Posted to solr-user@lucene.apache.org by Mingfeng Yang <mf...@wisewindow.com> on 2013/11/02 07:01:07 UTC

Problem of facet on 170M documents

I have an index with 170M documents, and two of the fields in each doc are
"source" and "url". I want to know the top 500 most frequent urls from the
Video source.

So I ran a facet query with
"fq=source:Video&facet=true&facet.field=url&facet.limit=500", and about
9 million documents match.
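
For readers who want to reproduce the request, here is a minimal Python
sketch that assembles it. The host, port, and collection name are
hypothetical placeholders, and q=*:* and rows=0 are assumptions added so
that only facet counts come back; only the fq and facet.* parameters come
from the message above.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: host, port and collection name are placeholders.
SOLR_SELECT = "http://localhost:8983/solr/collection1/select"


def build_facet_query(source, field, limit):
    """Build a select URL that facets on `field` for one `source` value."""
    params = [
        ("q", "*:*"),                # assumption: match everything
        ("fq", f"source:{source}"),  # filter from the original message
        ("rows", 0),                 # assumption: facet counts only, no docs
        ("facet", "true"),
        ("facet.field", field),
        ("facet.limit", limit),
    ]
    return SOLR_SELECT + "?" + urlencode(params)


url = build_facet_query("Video", "url", 500)
print(url)
```

Note that urlencode percent-encodes the colon in the filter, so the fq
parameter appears as source%3AVideo in the printed URL.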

The Solr cluster is hosted on two EC2 instances, each with 4 CPUs and 32G
of memory; 16G is allocated to the Java heap. There are 4 master shards on
one machine and 4 replicas on the other, connected via ZooKeeper.

Whenever I run the query above, the response takes so long that the client
times out. Sometimes an impatient end user waits only a few seconds for the
results, kills the connection, and then issues the same query again and
again. The server then has to handle several of these heavy queries
simultaneously and gets so busy that we get a "no server hosting shard"
error, probably due to lost communication between the Solr node and
ZooKeeper.

Is there any way to deal with this problem?

Thanks,
Ming

Re: Problem of facet on 170M documents

Posted by Fudong-gmail <fu...@gmail.com>.
One way to solve the issue may be to create another field that groups the values into ranges, so you have fewer facet values to query.
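
A rough Python sketch of one possible reading of this suggestion: derive a
coarser key from each URL at index time and facet on that instead. Grouping
by host is an assumption, as is the field name url_host; the message does
not say how the values should be grouped. Faceting on such a field would
count hosts rather than full URLs, so it answers a different question.

```python
from urllib.parse import urlsplit


def url_group(url):
    """Return a coarser grouping key for a URL: here, its lowercased host.

    Returns an empty string when no host can be parsed.
    """
    host = urlsplit(url).hostname
    return host or ""


# Hypothetical document; "url_host" is an illustrative extra field name.
doc = {"source": "Video", "url": "http://Example.com/watch?v=1"}
doc["url_host"] = url_group(doc["url"])
print(doc["url_host"])
```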


On Nov 5, 2013, at 4:31 AM, Erick Erickson <er...@gmail.com> wrote:

> You're just going to have to accept it being slow. Think of it this way:
> you have 4M (say) buckets that have to be counted into. Then the top 500
> have to be collected to return. That's just going to take some time unless
> you have very beefy machines.
> 
> I'd _really_ back up and consider whether this is a good thing or whether
> this is one of those ideas that doesn't have much use to the user. If your
> results rarely if ever show a count above, say, 5 for any URL, is it
> really giving your users useful info?
> 
> Best,
> Erick
> 
> 

Re: Problem of facet on 170M documents

Posted by Erick Erickson <er...@gmail.com>.
You're just going to have to accept it being slow. Think of it this way:
you have 4M (say) buckets that have to be counted into. Then the top 500
have to be collected to return. That's just going to take some time unless
you have very beefy machines.
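
As a conceptual illustration of the two steps described above (count into
one bucket per distinct value, then collect the top N), here is a small
Python sketch. It is a model of the work involved, not Solr's actual
implementation; the function name and sample data are made up.

```python
import heapq
from collections import Counter


def top_urls(matching_urls, n):
    """Count every value into its own bucket, then pick the n largest."""
    counts = Counter()
    for u in matching_urls:      # one increment per matching document
        counts[u] += 1
    # Selecting the top n still has to look at every distinct bucket.
    return heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])


sample = ["a", "b", "a", "c", "a", "b"]
print(top_urls(sample, 2))  # [('a', 3), ('b', 2)]
```

With roughly 9M matching documents and millions of distinct values, both
the counting pass and the selection pass touch a lot of data, which is the
cost being described.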

I'd _really_ back up and consider whether this is a good thing or whether
this is one of those ideas that doesn't have much use to the user. If your
results rarely if ever show a count above, say, 5 for any URL, is it
really giving your users useful info?

Best,
Erick


On Mon, Nov 4, 2013 at 6:54 PM, Mingfeng Yang <mf...@wisewindow.com> wrote:

> Erick,
>
> It could have more than 4M distinct values.  The purpose of this facet is
> to display the most frequent, say top 500, urls to users.
>
> Sascha,
>
> Thanks for the info. I will look into this thread thing.
>
> Mingfeng
>
>

Re: Problem of facet on 170M documents

Posted by Mingfeng Yang <mf...@wisewindow.com>.
Erick,

It could have more than 4M distinct values.  The purpose of this facet is
to display the most frequent, say top 500, urls to users.

Sascha,

Thanks for the info. I will look into this thread thing.

Mingfeng


On Mon, Nov 4, 2013 at 4:47 AM, Erick Erickson <er...@gmail.com> wrote:

> How many unique URLs do you have in your 9M
> docs? If your 9M hits have 4M distinct URLs, then
> this is not very valuable to the user.
>
> Sascha:
> Was that speedup on a single field or were you faceting over
> multiple fields? Because as I remember that code spins off
> threads on a per-field basis, and if I'm mis-remembering I need
> to look again!
>
> Best,
> Erick
>
>

Re: Problem of facet on 170M documents

Posted by Erick Erickson <er...@gmail.com>.
How many unique URLs do you have in your 9M
docs? If your 9M hits have 4M distinct URLs, then
this is not very valuable to the user.

Sascha:
Was that speedup on a single field or were you faceting over
multiple fields? Because as I remember that code spins off
threads on a per-field basis, and if I'm mis-remembering I need
to look again!

Best,
Erick


On Sat, Nov 2, 2013 at 5:07 AM, Sascha SZOTT <sz...@gmx.de> wrote:

> Hi Ming,
>
> Which Solr version are you using? If you are on one of the latest
> versions (4.5 or above), try the new parameter facet.threads with a
> reasonable value (4 to 8 gave me a massive performance speedup when
> working with large facets, i.e. nTerms >> 10^7).
>
> -Sascha
>
>

Re: Problem of facet on 170M documents

Posted by Sascha SZOTT <sz...@gmx.de>.
Hi Ming,

Which Solr version are you using? If you are on one of the latest
versions (4.5 or above), try the new parameter facet.threads with a
reasonable value (4 to 8 gave me a massive performance speedup when
working with large facets, i.e. nTerms >> 10^7).
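
A minimal Python sketch of adding that parameter to the query from the
original message. The q and rows values are assumptions, and the helper
name is made up; facet.threads and the other facet parameters are the ones
named in this thread. Whether it helps when faceting on a single field is
questioned elsewhere in the thread, so treat the value 4 as something to
experiment with.

```python
from urllib.parse import urlencode


def with_facet_threads(params, threads):
    """Return a copy of params with facet.threads set; input is untouched."""
    out = dict(params)
    out["facet.threads"] = threads
    return out


base = {
    "q": "*:*",            # assumption: match everything
    "fq": "source:Video",
    "rows": 0,             # assumption: facet counts only
    "facet": "true",
    "facet.field": "url",
    "facet.limit": 500,
}

query_string = urlencode(with_facet_threads(base, 4))
print(query_string)
```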

-Sascha

