You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@calcite.apache.org by "yeyonghao (Jira)" <ji...@apache.org> on 2020/06/17 19:50:00 UTC

[jira] [Created] (CALCITE-4069) Elasticsearch the result of count(*) will be added to the result list twice when use multiple aggregations without group by

yeyonghao created CALCITE-4069:
----------------------------------

             Summary: Elasticsearch the result of count(*) will be added to the result list twice when use multiple aggregations without group by
                 Key: CALCITE-4069
                 URL: https://issues.apache.org/jira/browse/CALCITE-4069
             Project: Calcite
          Issue Type: Improvement
          Components: elasticsearch-adapter
            Reporter: yeyonghao


when you write a sql:
{code:java}
select count(*) from mytable
{code}
in:
{code:java}
org.apache.calcite.adapter.elasticsearch.ElasticsearchTable#aggregate
{code}
As an optimization, totalHit will be the result of the query for count(*):
{code:java}
// elastic exposes total number of documents matching a query in "/hits/total" path
// this can be used for simple "select count(*) from table"
final long total = res.searchHits().total().value();
{code}
It is then added to the result set:
{code:java}
if (groupBy.isEmpty()) {
  // put totals automatically for count(*) expression(s), unless they contain group by
  for (String expr : countAll) {
    result.forEach(m -> m.put(expr, total));
  }
}
{code}
This is fine if there is only one count(*) aggregate function in SQL.

But when you write multiple aggregate functions in youe SQL:
{code:java}
select count(*),sum(field1) from mytable{code}
You'll notice that the result of count(*) is repeatedly added to the result set (it doesn't affect the final result, but I think it can be optimized).

The result of count(*) is first added to the result set along with other aggregate function result values in the following code:
{code:java}
final List<Map<String, Object>> result = new ArrayList<>();
if (res.aggregations() != null) {
  // collect values
  ElasticsearchJson.visitValueNodes(res.aggregations(), m -> {
    // using 'Collectors.toMap' will trigger Java 8 bug here
    Map<String, Object> newMap = new LinkedHashMap<>();
    for (String key: m.keySet()) {
      newMap.put(fieldMap.getOrDefault(key, key), m.get(key));
    }
    result.add(newMap);
  });
} else {
  // probably no group by. add single result
  result.add(new LinkedHashMap<>());
}
{code}
The second time is added again in the following code:
{code:java}
// elastic exposes total number of documents matching a query in "/hits/total" path
// this can be used for simple "select count(*) from table"
final long total = res.searchHits().total().value();

if (groupBy.isEmpty()) {
  // put totals automatically for count(*) expression(s), unless they contain group by
  for (String expr : countAll) {
    result.forEach(m -> m.put(expr, total));
  }
}
{code}
Although such operations have no effect on the final result when there are multiple aggregate functions, it is obvious that they can be optimized.TotalHit is added to the result set only when groupby. isEmpty() and aggregation. size() == 1 simultaneously satisfy:
{code:java}
if (groupBy.isEmpty() && aggregations.size() == 1) {
   ...
}
{code}
This avoids repeatedly adding count(*) to the result set.

 

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)