Posted to solr-user@lucene.apache.org by Carlos Bonilla <ca...@gmail.com> on 2013/05/17 10:47:29 UTC

Facet pivot 50.000.000 different values

Hi,
To calculate some stats we are using a field "B" with 50.000.000 different
values as facet pivot in a schema that contains 200.000.000 documents. We
only need to calculate how many different "B" values have more than 1
document, but it takes ages. Is there a better way or configuration
to do this?

Configuration:
Solr 4.2.1
JVM Java 7
Max Java Heap size : 12Gb
8 GB RAM
Dual Core

Many thanks.

Re: Facet pivot 50.000.000 different values

Posted by Carlos Bonilla <ca...@gmail.com>.

In case anyone is interested, I solved my problem using the "grouping"
feature:

*query* --> "filter" query (if any)
*field* --> field that you want to count (in my case field "B")

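SolrQuery solrQuery = new SolrQuery(query);
solrQuery.add("group", "true");          // Enable result grouping
solrQuery.add("group.field", "B");       // Group by the field
solrQuery.add("group.ngroups", "true");  // Ask for the number of groups
solrQuery.setRows(0);                    // Only the count is needed, no documents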

In the response, *getNGroups()* (on the GroupCommand returned by
getGroupResponse().getValues()) gives the number of groups, that is, the
total number of distinct "B" values among the documents matching the query.
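
For anyone not going through SolrJ, the same request can be written as plain
HTTP parameters. The sketch below only builds the query string using the
parameter names from the snippet above; the class and method names are
hypothetical, and the host, core and handler path are deliberately left out
because they are not given in this thread.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of SolrJ: builds the request parameters
// that correspond to the grouping query shown in the post.
class GroupCountRequest {

    // Returns a URL-encoded query string for the given query and field.
    static String buildQueryString(String query, String field) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", query);               // same role as new SolrQuery(query)
        params.put("group", "true");          // enable result grouping
        params.put("group.field", field);     // field whose distinct values are counted
        params.put("group.ngroups", "true");  // include the number of groups
        params.put("rows", "0");              // no documents, only the count

        StringJoiner joined = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            joined.add(encode(e.getKey()) + "=" + encode(e.getValue()));
        }
        return joined.toString();
    }

    // URL-encodes one key or value as UTF-8.
    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException("UTF-8 is always supported", ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildQueryString("*:*", "B"));
    }
}
```

The resulting string would be appended to the search handler URL of the core
being queried.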

Cheers,
Carlos.



Re: Facet pivot 50.000.000 different values

Posted by Carlos Bonilla <ca...@gmail.com>.

Hi Mikhail,
yes, the thing is that I need to take different queries into account, and
that's why I can't use the Terms Component.

Cheers.



Re: Facet pivot 50.000.000 different values

Posted by Mikhail Khludnev <mk...@griddynamics.com>.

On Fri, May 17, 2013 at 12:47 PM, Carlos Bonilla
<ca...@gmail.com> wrote:

> We
> only need to calculate how many different "B" values have more than 1
> document but it takes ages
>

Carlos,
It's not clear whether you need to take the results of a query into account
or just gather statistics from the index. If the latter, you can just
enumerate the terms and look at TermsEnum.docFreq(). Am I getting it right?
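
The counting step behind that suggestion can be sketched without Lucene on
the classpath. In the sketch below a plain Map stands in for the field's term
dictionary (term to document frequency); with Lucene, the frequencies would
come from TermsEnum.docFreq() while iterating the terms of field "B". The
class name, method name and sample data are hypothetical. Note that docFreq()
is an index-wide figure that ignores any query and does not account for
deleted documents.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in: the map plays the role of the term dictionary,
// it is not Lucene's API.
class RepeatedTermCounter {

    // Counts the terms that occur in more than one document.
    static int countRepeated(Map<String, Integer> docFreqByTerm) {
        int repeated = 0;
        for (int docFreq : docFreqByTerm.values()) {
            if (docFreq > 1) {
                repeated++;
            }
        }
        return repeated;
    }

    public static void main(String[] args) {
        // Made-up sample data for illustration only.
        Map<String, Integer> docFreqByTerm = new TreeMap<>();
        docFreqByTerm.put("b-001", 1);
        docFreqByTerm.put("b-002", 3);
        docFreqByTerm.put("b-003", 2);
        System.out.println(countRepeated(docFreqByTerm));
    }
}
```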

-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: Facet pivot 50.000.000 different values

Posted by Carlos Bonilla <ca...@gmail.com>.

Sorry, 16 GB RAM (not 8).



Re: Facet pivot 50.000.000 different values

Posted by Shawn Heisey <so...@elyograg.org>.

On 5/17/2013 2:47 AM, Carlos Bonilla wrote:
> To calculate some stats we are using a field "B" with 50.000.000 different
> values as facet pivot in a schema that contains 200.000.000 documents. We
> only need to calculate how many different "B" values have more than 1
> document but it takes ages.... Is there any other better way/configuration
> to do this?
> 
> Configuration:
> Solr 4.2.1
> JVM Java 7
> Max Java Heap size : 12Gb
> 8 GB RAM
> Dual Core

You probably don't have enough RAM.  With 200 million documents, I would
imagine that your index is considerably larger than 4GB in size.  With
the 16GB of RAM that you mentioned in your other message, this
configuration leaves only 4GB of RAM for the operating system to cache
the index once Java has allocated the entire 12GB heap - which it will
do very quickly with a large index.

See the following:

http://wiki.apache.org/solr/SolrPerformanceProblems

I don't know the size of your index.  If it is 100GB, then ideally you
would want to have at least 112GB of RAM, but you could probably make it
work in 64GB.
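
The arithmetic behind those figures, as a small sketch. The 100GB index size
is the hypothetical from the paragraph above, not a measured value, and the
class and method names are made up for illustration.

```java
// Hypothetical helper that reproduces the RAM arithmetic from this message.
class CacheBudget {

    // RAM left for the OS disk cache once the JVM heap is fully allocated.
    static long cacheGb(long totalRamGb, long heapGb) {
        return totalRamGb - heapGb;
    }

    // RAM needed for the whole index to fit in the disk cache next to the heap.
    static long idealRamGb(long indexGb, long heapGb) {
        return indexGb + heapGb;
    }

    public static void main(String[] args) {
        System.out.println(cacheGb(16, 12));      // 16GB machine, 12GB heap
        System.out.println(idealRamGb(100, 12));  // assumed 100GB index, 12GB heap
    }
}
```

With the values from this thread, the first call gives the 4GB available for
caching and the second gives the 112GB ideal.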

Thanks,
Shawn