Posted to java-user@lucene.apache.org by Gimantha Bandara <gi...@wso2.com> on 2016/03/17 14:07:52 UTC

Re: GROUP BY in Lucene

Hi Rob,

Thank you for explaining your approach. I still have a few questions. Do I
need to store the values being aggregated as STORED fields at indexing time?
And how does the collector handle a large number of documents when
aggregating? Say I have several million documents in an index and I want the
SUM of a field called "subject_marks". How does the collector handle the
summation efficiently? Does it go through all the segments in parallel, or
something like that?

For now we have a facet field that holds X, Y and Z, so I can fetch the
documents that belong to a specific X/Y/Z group and aggregate over those
records, and I can repeat that for every group. But it is not fast: it is
essentially a plain Java loop that walks all the distinct facet values,
aggregates the document values belonging to each one, and puts the results
into a map. It is slow because we do not store the field values in the
Lucene documents; we fetch the actual data from a DB. We only keep an ID as
a STORED field in the Lucene documents, and once we get those IDs from the
Lucene documents we look them up in the DB and perform the aggregation
there. This becomes really slow as the number of records grows.
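For reference, the slow path described above is essentially the following (a plain-Java sketch with a stubbed DB lookup; the names `lookupInDb`, `sumPerGroup`, and the map shapes are hypothetical):

```java
import java.util.*;

// Sketch of the current slow approach: for each facet group, take the
// stored IDs from the Lucene hits, fetch the real values from the DB,
// and sum them. The per-record DB lookup is what dominates the cost.
public class SlowGroupBy {
    // Stand-in for the DB: maps record ID -> subject_marks value.
    static double lookupInDb(Map<String, Double> db, String id) {
        return db.get(id);
    }

    // idsByGroup: facet value -> IDs of the matching Lucene documents.
    static Map<String, Double> sumPerGroup(Map<String, List<String>> idsByGroup,
                                           Map<String, Double> db) {
        Map<String, Double> sums = new HashMap<>();
        for (Map.Entry<String, List<String>> e : idsByGroup.entrySet()) {
            double sum = 0;
            for (String id : e.getValue()) {
                sum += lookupInDb(db, id); // one DB round-trip per record in reality
            }
            sums.put(e.getKey(), sum);
        }
        return sums;
    }
}
```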

Thanks,
Gimantha

On Mon, Aug 10, 2015 at 6:26 PM, Rob Audenaerde <ro...@gmail.com>
wrote:

> You can write a custom (facet) collector to do this. I have done something
> similar; I'll describe my approach:
>
> For all the values that need grouping or aggregating, I added a
> FacetField (an AssociationFacetField, so I can store the value alongside
> the ordinal). The main search stays the same, in your case for example a
> NumericRangeQuery (if the date is stored in ms).
>
> Then I have a custom facet collector that does the grouping.
>
> Basically, it goes through all the MatchingDocs. For each doc, it creates a
> unique key (composed of X, Y and Z) and computes the aggregates as needed
> (sum of D). These are stored in a map; if a key is already in the map, the
> new value is added to the existing aggregate. The tricky part is making
> your unique key fast and immutable, so you can precompute its hash code.
>
> This is fast enough if the number of unique keys is smallish (< 10,000,
> with an index size of roughly 1M docs).
>
> -Rob
>
>
> On Mon, Aug 10, 2015 at 2:47 PM, Michael McCandless <
> lucene@mikemccandless.com> wrote:
>
> > Lucene has a grouping module that has several approaches for grouping
> > search hits, though it's only by a single field I believe.
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> >
> > On Sun, Aug 9, 2015 at 2:55 PM, Gimantha Bandara <gi...@wso2.com>
> > wrote:
> > > Hi all,
> > >
> > > Is there a way to achieve $subject? For example, consider the following
> > SQL
> > > query.
> > >
> > > SELECT A, B, C, SUM(D) AS E FROM `table`
> > > WHERE time BETWEEN fromDate AND toDate *GROUP BY X, Y, Z*
> > >
> > > In the above query we can group the records by X, Y, Z. Is there a way
> > > to achieve the same in Lucene? (I guess faceting would help, but is it
> > > possible to get all the categoryPaths along with the matching records?)
> > > Is there any other way other than using Facets?
> > >
> > > --
> > > Gimantha Bandara
> > > Software Engineer
> > > WSO2. Inc : http://wso2.com
> > > Mobile : +94714961919
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>



-- 
Gimantha Bandara
Software Engineer
WSO2. Inc : http://wso2.com
Mobile : +94714961919
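Rob's composite-key approach quoted above (one unique key per document, a map of running aggregates) can be sketched in plain Java, independent of Lucene; the `Row` record and its field names are hypothetical, and in the real collector the values would come from doc values rather than a List:

```java
import java.util.*;

// Plain-Java sketch of the composite-key aggregation described in the thread:
// group rows by (x, y, z) and sum d per group, like GROUP BY X,Y,Z with SUM(D).
public class GroupBySketch {
    record Row(String x, String y, String z, double d) {}

    static Map<List<String>, Double> sumByGroup(List<Row> rows) {
        Map<List<String>, Double> sums = new HashMap<>();
        for (Row r : rows) {
            List<String> key = List.of(r.x(), r.y(), r.z()); // immutable composite key
            sums.merge(key, r.d(), Double::sum);             // add to existing aggregate
        }
        return sums;
    }
}
```

In the real collector a dedicated key class with a precomputed hash code replaces the `List<String>` key, since key hashing happens once per matching document.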

Re: GROUP BY in Lucene

Posted by Gimantha Bandara <gi...@wso2.com>.
Hi Rob,

Thanks a lot for the very descriptive answer above. I will give it a try.

On Friday, March 18, 2016, Rob Audenaerde <ro...@gmail.com> wrote:

> [quoted reply trimmed; see Rob's full message below]

-- 
Gimantha Bandara
Software Engineer
WSO2. Inc : http://wso2.com
Mobile : +94714961919

Re: GROUP BY in Lucene

Posted by Rob Audenaerde <ro...@gmail.com>.
Hi Gimantha,

You don't need to store the aggregates, and you don't need to retrieve
Documents. The aggregates are calculated during collection, using the
BinaryDocValues from the facet module. What I do is store the
values in the facets using AssociationFacetFields (for example
FloatAssociationFacetField). I chose facets because then I can use
the facets as well :)

I have an implementation of the `Facets` class that does all the aggregation.
I cannot paste all the code unfortunately, but here is the idea (it is loosely
based on the TaxonomyFacetSumIntAssociations implementation, where you can
look up how the BinaryDocValues are translated to ordinals and to facets).
This aggregation is used in conjunction with a FacetsCollector, which
collects the facets during a search:

        FacetsCollector fc = new FacetsCollector();
        searcher.search(new ConstantScoreQuery(query), fc);


Then, use this FacetsCollector:

     taxoReader = getTaxonomyReaderManager().acquire();
     OnePassTaxonomyFacets facets =
         new OnePassTaxonomyFacets(taxoReader, LuceneIndexConfig.facetConfig);
     Collection<GroupByResultTuple> result =
         facets.aggregateValues(fc.getMatchingDocs(),
             p.getGroupByListWithoutData(), aggregateFields);


The aggregateValues method (I cannot paste it all :( ):


    public final Collection<GroupByResultTuple> aggregateValues(
            List<MatchingDocs> matchingDocs,
            final List<GroupByField> groupByFields,
            final List<String> aggregateFieldNames,
            EmptyValues emptyValues) throws IOException {
        LOG.info("Starting aggregation for pivot.. EmptyValues=" + emptyValues);

        // We want to group a list of ordinals to a list of aggregates. The
        // taxoReader has the ordinals, so a selection like 'Lang=NL,
        // Region=South' will end up as a MultiIntKey of [13,44].
        Map<MultiIntKey, List<TotalFacetAgg>> aggs = Maps.newHashMap();

        List<String> groupByFieldsNames = Lists.newArrayList();
        for (GroupByField gbf : groupByFields) {
            groupByFieldsNames.add(gbf.getField().getName());
        }
        int groupByCount = groupByFieldsNames.size();

        // We need to know which ordinals are the 'group-by' ordinals, so we
        // can check whether an ordinal that is found belongs to one of these
        // fields.
        int[] groupByOrdinals = new int[groupByCount];
        for (int i = 0; i < groupByOrdinals.length; i++) {
            groupByOrdinals[i] = this.getOrdinalForListItem(groupByFieldsNames, i);
        }

        // We need to know which ordinals are the 'aggregate-field' ordinals,
        // so we can check whether an ordinal that is found belongs to one of
        // these fields.
        int[] aggregateOrdinals = new int[aggregateFieldNames.size()];
        for (int i = 0; i < aggregateOrdinals.length; i++) {
            aggregateOrdinals[i] = this.getOrdinalForListItem(aggregateFieldNames, i);
        }

        // Now we go and find all the ordinals in the matching documents.
        // For each ordinal, we check whether it is a group-by ordinal or an
        // aggregate ordinal, and act accordingly.
        for (MatchingDocs hitList : matchingDocs) {
            BinaryDocValues dv = hitList.context.reader()
                    .getBinaryDocValues(this.indexFieldName);

            // Here, find the ordinals of the group-by fields and the
            // aggregate fields. Create a multi-ordinal key (MultiIntKey) from
            // the group-by ordinals and use it to add the current value of
            // the field to the facet aggregates.

            ......


Hope this helps :)
-Rob
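The `MultiIntKey` class used in the snippet above was not posted. A minimal plain-Java sketch of such an immutable key with a precomputed hash code (as Rob suggests earlier in the thread) could look like this; it is an illustration, not the real implementation:

```java
import java.util.Arrays;

// Hypothetical sketch of an immutable multi-ordinal key. Because the key
// never changes after construction, the hash code is computed exactly once,
// which matters when the key is hashed for every matching document.
public final class MultiIntKey {
    private final int[] ordinals;
    private final int hash; // precomputed once; the key is immutable

    public MultiIntKey(int... ordinals) {
        this.ordinals = ordinals.clone(); // defensive copy keeps the key immutable
        this.hash = Arrays.hashCode(this.ordinals);
    }

    @Override
    public int hashCode() {
        return hash;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof MultiIntKey other && Arrays.equals(ordinals, other.ordinals);
    }

    @Override
    public String toString() {
        return Arrays.toString(ordinals);
    }
}
```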