Posted to solr-user@lucene.apache.org by Mikhail Khludnev <mk...@apache.org> on 2017/08/21 10:01:39 UTC

Huge Facets and Streaming

Hello!

I need to count a really wide facet on a 30-shard index with roughly 100M
docs; the facet response is about 100M values and takes 0.5G as a text file.

So far I have experimented with the old facets. They calculate per-shard facets
fine, but then the node which attempts to merge these 30 responses fails with
OOM, which is understandable.

I suppose I'll get pretty much the same with json.facet, or does it scale
better?

I want to experiment with Streaming Expressions, which I've never tried yet.
I've found the facet() expression and select() with partitionKeys, but they'll
try to merge facet values in FacetComponent/Module anyway.
Is there a way to merge per-shard facet responses with Streaming?
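
For reference, the facet() expression I found has roughly this shape (the
collection and field names here are just placeholders):

  facet(bigCollection,
        q="*:*",
        buckets="wideField",
        bucketSorts="count(*) desc",
        bucketSizeLimit=100,
        count(*))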

-- 
Sincerely yours
Mikhail Khludnev

Re: Huge Facets and Streaming

Posted by Joel Bernstein <jo...@gmail.com>.
The current approach for high-cardinality aggregations is the MapReduce
approach:

parallel(rollup(search()))
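
Spelled out, that might look something like this (collection, field, and
worker count are just placeholders):

  parallel(bigCollection,
           rollup(search(bigCollection,
                         q="*:*",
                         fl="wideField",
                         qt="/export",
                         sort="wideField asc",
                         partitionKeys="wideField"),
                  over="wideField",
                  count(*)),
           workers="4",
           sort="wideField asc")

Each worker exports and rolls up its own partition of the terms, so no single
node has to hold the full set of buckets in memory at once.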

But what Yonik describes would be much more efficient.


Joel Bernstein
http://joelsolr.blogspot.com/

On Mon, Aug 21, 2017 at 3:44 PM, Mikhail Khludnev <mk...@apache.org> wrote:

> Thanks for sharing this idea, Yonik!
> I've raised https://issues.apache.org/jira/browse/SOLR-11271.
>
> On Mon, Aug 21, 2017 at 4:00 PM, Yonik Seeley <ys...@gmail.com> wrote:
>
> > On Mon, Aug 21, 2017 at 6:01 AM, Mikhail Khludnev <mk...@apache.org>
> wrote:
> > > Hello!
> > >
> > > I need to count a really wide facet on a 30-shard index with roughly
> > > 100M docs; the facet response is about 100M values and takes 0.5G as a
> > > text file.
> > >
> > > So far I have experimented with the old facets. They calculate per-shard
> > > facets fine, but then the node which attempts to merge these 30 responses
> > > fails with OOM, which is understandable.
> > >
> > > I suppose I'll get pretty much the same with json.facet, or does it scale
> > > better?
> > >
> > > I want to experiment with Streaming Expressions, which I've never tried
> > > yet. I've found the facet() expression and select() with partitionKeys,
> > > but they'll try to merge facet values in FacetComponent/Module anyway.
> > > Is there a way to merge per-shard facet responses with Streaming?
> >
> > Yeah, I think I've mentioned before that this is the way it should be
> > implemented (per-shard distrib=false facet request merged by streaming
> > expression).
> > The JSON Facet "stream" method does stream (i.e. does not build up the
> > response all in memory first), but only at the shard level and not at
> > the distrib/merge level.  This could then be fed into streaming to get
> > exact facets (and streaming facets).  But I don't think this has been
> > done yet.
> >
> > -Yonik
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>

Re: Huge Facets and Streaming

Posted by Mikhail Khludnev <mk...@apache.org>.
Thanks for sharing this idea, Yonik!
I've raised https://issues.apache.org/jira/browse/SOLR-11271.

On Mon, Aug 21, 2017 at 4:00 PM, Yonik Seeley <ys...@gmail.com> wrote:

> On Mon, Aug 21, 2017 at 6:01 AM, Mikhail Khludnev <mk...@apache.org> wrote:
> > Hello!
> >
> > I need to count a really wide facet on a 30-shard index with roughly
> > 100M docs; the facet response is about 100M values and takes 0.5G as a
> > text file.
> >
> > So far I have experimented with the old facets. They calculate per-shard
> > facets fine, but then the node which attempts to merge these 30 responses
> > fails with OOM, which is understandable.
> >
> > I suppose I'll get pretty much the same with json.facet, or does it scale
> > better?
> >
> > I want to experiment with Streaming Expressions, which I've never tried
> > yet. I've found the facet() expression and select() with partitionKeys,
> > but they'll try to merge facet values in FacetComponent/Module anyway.
> > Is there a way to merge per-shard facet responses with Streaming?
>
> Yeah, I think I've mentioned before that this is the way it should be
> implemented (per-shard distrib=false facet request merged by streaming
> expression).
> The JSON Facet "stream" method does stream (i.e. does not build up the
> response all in memory first), but only at the shard level and not at
> the distrib/merge level.  This could then be fed into streaming to get
> exact facets (and streaming facets).  But I don't think this has been
> done yet.
>
> -Yonik
>



-- 
Sincerely yours
Mikhail Khludnev

Re: Huge Facets and Streaming

Posted by Yonik Seeley <ys...@gmail.com>.
On Mon, Aug 21, 2017 at 6:01 AM, Mikhail Khludnev <mk...@apache.org> wrote:
> Hello!
>
> I need to count a really wide facet on a 30-shard index with roughly
> 100M docs; the facet response is about 100M values and takes 0.5G as a
> text file.
>
> So far I have experimented with the old facets. They calculate per-shard
> facets fine, but then the node which attempts to merge these 30 responses
> fails with OOM, which is understandable.
>
> I suppose I'll get pretty much the same with json.facet, or does it scale
> better?
>
> I want to experiment with Streaming Expressions, which I've never tried
> yet. I've found the facet() expression and select() with partitionKeys,
> but they'll try to merge facet values in FacetComponent/Module anyway.
> Is there a way to merge per-shard facet responses with Streaming?

Yeah, I think I've mentioned before that this is the way it should be
implemented (per-shard distrib=false facet request merged by streaming
expression).
The JSON Facet "stream" method does stream (i.e. does not build up the
response all in memory first), but only at the shard level and not at
the distrib/merge level.  This could then be fed into streaming to get
exact facets (and streaming facets).  But I don't think this has been
done yet.
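
To sketch the per-shard half of that (host, collection, and field names are
made up):

  curl http://shard1host:8983/solr/bigCollection/select \
    -d 'q=*:*&rows=0&distrib=false' \
    --data-urlencode 'json.facet={vals:{type:terms, field:wideField, limit:-1, sort:"index asc", method:"stream"}}'

A streaming expression could then merge each shard's sorted term stream
without materializing the whole response in memory.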

-Yonik