You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by Aman Sinha <as...@maprtech.com> on 2014/02/21 21:24:00 UTC

Review Request 18347: Implement hash aggregation operator and a hash table (for Drill-335) and new templates for aggregate functions (Drill-318)

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/18347/
-----------------------------------------------------------

Review request for drill, John Morris, Jacques Nadeau, Jason Altekruse, and Mehant Baid.


Summary (updated)
-----------------

Implement hash aggregation operator and a hash table (for Drill-335) and new templates for aggregate functions (Drill-318)


Bugs: Drill-318, Drill-335 and and
    https://issues.apache.org/jira/browse/Drill-318
    https://issues.apache.org/jira/browse/Drill-335
    https://issues.apache.org/jira/browse/and


Repository: drill-git


Description (updated)
-------

This patch contains the following main components: 
1. Implementation of the hash aggregation execution operator - this has two main parts: the HashAggTemplate and the HashAggBatch.  
2. Implementation of a hash table which is used by the hash aggregation.  The hash table hash two main parts: the HashTableTemplate and the ChainedHashTable. The hash table internally uses the notion of 'BatchHolder' to keep track of all keys that can fit within one batch of 64K values.  New BatcHolder objects are created as needed.  Each BatchHolder has its own vector container.  The HashAggregate also has a similar structure and it keeps track of the workspace variables.  
 (NOTE: An initial design document for the hash aggregation and hash table was already attached with Drill-335.  The document has not yet been updated with the latest implementation ... will try to do that in the near future).
3. Jinfeng's changes to use workspace vectors in the generated code for aggregate functions (previously, for streaming aggregate we only needed to maintain workspace variable for 1 running group; however for hash aggregate we need to maintain it for all groups).  
4. Fix for Drill-318: because of #3 above, the previous fix for Drill-318 is not valid anymore.  I modified the template generation code for the aggregate functions such that they conform to the new infrastructure.  
5. The original AggTemplate, AggBatch and Aggregator classes have been moved to corresponding StreamingAggTemplate, StreamingAggBatch and StreamingAggregator in order to differentiate it from hash aggregation.  These appear as new files but the code there has not changed.  


Diffs (updated)
-----

  common/src/main/java/org/apache/drill/common/expression/fn/AggregationFunctions.java 5b46b78 
  exec/java-exec/src/main/codegen/config.fmpp cd2b2cc 
  exec/java-exec/src/main/codegen/data/AggrTypes1.tdd PRE-CREATION 
  exec/java-exec/src/main/codegen/data/AggrTypes2.tdd PRE-CREATION 
  exec/java-exec/src/main/codegen/templates/AggrTypeFunctions1.java PRE-CREATION 
  exec/java-exec/src/main/codegen/templates/AggrTypeFunctions2.java PRE-CREATION 
  exec/java-exec/src/main/codegen/templates/TypeHelper.java 783f943 
  exec/java-exec/src/main/java/org/apache/drill/exec/compile/sig/GeneratorMapping.java f99150e 
  exec/java-exec/src/main/java/org/apache/drill/exec/compile/sig/MappingSet.java 1cb2380 
  exec/java-exec/src/main/java/org/apache/drill/exec/expr/ClassGenerator.java e674eab 
  exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/DrillAggFuncHolder.java 7622865 
  exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/DrillFuncHolder.java dcdf1ea 
  exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/FunctionConverter.java befa9bf 
  exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/agg/impl/CountFunctions.java b0939f1 
  exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/agg/impl/SumFunctions.java 57905fe 
  exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/BitFunctions.java 41bb7c6 
  exec/java-exec/src/main/java/org/apache/drill/exec/ops/FragmentContext.java fd965a7 
  exec/java-exec/src/main/java/org/apache/drill/exec/opt/BasicOptimizer.java 6e58ec7 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/OperatorCost.java 0bade41 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractPhysicalVisitor.java ec7244a 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/PhysicalVisitor.java 2e2d7fd 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/config/HashAggregate.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggBatch.java 86fea4e 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggBatchCreator.java 36344f9 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggTemplate.java a8f39e3 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/Aggregator.java 2683045 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatch.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatchCreator.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggregator.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatch.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatchCreator.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggTemplate.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggregator.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/ChainedHashTable.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTable.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableConfig.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableTemplate.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/project/ProjectRecordBatch.java 3b8154e 
  exec/java-exec/src/test/java/org/apache/drill/exec/expr/ExpressionTest.java 0de64b4 
  exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/SimpleRootExec.java ccf7758 
  exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/agg/TestHashAggr.java PRE-CREATION 
  exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/common/TestHashTable.java PRE-CREATION 
  exec/java-exec/src/test/resources/common/test_hashtable1.json PRE-CREATION 

Diff: https://reviews.apache.org/r/18347/diff/


Testing (updated)
-------

I have run several tests manually as part of TestHashAggr...these tests use TPC-H data and in particular a relatively large 'Orders' table.  However, I have not yet packaged the tests to run as part of JUnit since the location and size of the parquet files needs to be figured out. I will continue to work on that.  


Thanks,

Aman Sinha


Re: Review Request 18347: Implement hash aggregation operator and a hash table (for Drill-335) and new templates for aggregate functions (Drill-318)

Posted by Aman Sinha <as...@maprtech.com>.

> On March 1, 2014, 4:20 p.m., Jacques Nadeau wrote:
> > exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableTemplate.java, line 96
> > <https://reviews.apache.org/r/18347/diff/1/?file=500195#file500195line96>
> >
> >     Same as above, let's use IntVector instead of primitive array, otherwise memory accounting is challenging.

Agreed...will make these off-heap. 


- Aman


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/18347/#review35902
-----------------------------------------------------------


On Feb. 21, 2014, 8:23 p.m., Aman Sinha wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/18347/
> -----------------------------------------------------------
> 
> (Updated Feb. 21, 2014, 8:23 p.m.)
> 
> 
> Review request for drill, John Morris, Jacques Nadeau, Jason Altekruse, and Mehant Baid.
> 
> 
> Bugs: Drill-318, Drill-335 and and
>     https://issues.apache.org/jira/browse/Drill-318
>     https://issues.apache.org/jira/browse/Drill-335
>     https://issues.apache.org/jira/browse/and
> 
> 
> Repository: drill-git
> 
> 
> Description
> -------
> 
> This patch contains the following main components: 
> 1. Implementation of the hash aggregation execution operator - this has two main parts: the HashAggTemplate and the HashAggBatch.  
> 2. Implementation of a hash table which is used by the hash aggregation.  The hash table hash two main parts: the HashTableTemplate and the ChainedHashTable. The hash table internally uses the notion of 'BatchHolder' to keep track of all keys that can fit within one batch of 64K values.  New BatcHolder objects are created as needed.  Each BatchHolder has its own vector container.  The HashAggregate also has a similar structure and it keeps track of the workspace variables.  
>  (NOTE: An initial design document for the hash aggregation and hash table was already attached with Drill-335.  The document has not yet been updated with the latest implementation ... will try to do that in the near future).
> 3. Jinfeng's changes to use workspace vectors in the generated code for aggregate functions (previously, for streaming aggregate we only needed to maintain workspace variable for 1 running group; however for hash aggregate we need to maintain it for all groups).  
> 4. Fix for Drill-318: because of #3 above, the previous fix for Drill-318 is not valid anymore.  I modified the template generation code for the aggregate functions such that they conform to the new infrastructure.  
> 5. The original AggTemplate, AggBatch and Aggregator classes have been moved to corresponding StreamingAggTemplate, StreamingAggBatch and StreamingAggregator in order to differentiate it from hash aggregation.  These appear as new files but the code there has not changed.  
> 
> 
> Diffs
> -----
> 
>   common/src/main/java/org/apache/drill/common/expression/fn/AggregationFunctions.java 5b46b78 
>   exec/java-exec/src/main/codegen/config.fmpp cd2b2cc 
>   exec/java-exec/src/main/codegen/data/AggrTypes1.tdd PRE-CREATION 
>   exec/java-exec/src/main/codegen/data/AggrTypes2.tdd PRE-CREATION 
>   exec/java-exec/src/main/codegen/templates/AggrTypeFunctions1.java PRE-CREATION 
>   exec/java-exec/src/main/codegen/templates/AggrTypeFunctions2.java PRE-CREATION 
>   exec/java-exec/src/main/codegen/templates/TypeHelper.java 783f943 
>   exec/java-exec/src/main/java/org/apache/drill/exec/compile/sig/GeneratorMapping.java f99150e 
>   exec/java-exec/src/main/java/org/apache/drill/exec/compile/sig/MappingSet.java 1cb2380 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/ClassGenerator.java e674eab 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/DrillAggFuncHolder.java 7622865 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/DrillFuncHolder.java dcdf1ea 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/FunctionConverter.java befa9bf 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/agg/impl/CountFunctions.java b0939f1 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/agg/impl/SumFunctions.java 57905fe 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/BitFunctions.java 41bb7c6 
>   exec/java-exec/src/main/java/org/apache/drill/exec/ops/FragmentContext.java fd965a7 
>   exec/java-exec/src/main/java/org/apache/drill/exec/opt/BasicOptimizer.java 6e58ec7 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/OperatorCost.java 0bade41 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractPhysicalVisitor.java ec7244a 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/PhysicalVisitor.java 2e2d7fd 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/config/HashAggregate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggBatch.java 86fea4e 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggBatchCreator.java 36344f9 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggTemplate.java a8f39e3 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/Aggregator.java 2683045 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatch.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatchCreator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggregator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatch.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatchCreator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggTemplate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggregator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/ChainedHashTable.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTable.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableConfig.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableTemplate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/project/ProjectRecordBatch.java 3b8154e 
>   exec/java-exec/src/test/java/org/apache/drill/exec/expr/ExpressionTest.java 0de64b4 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/SimpleRootExec.java ccf7758 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/agg/TestHashAggr.java PRE-CREATION 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/common/TestHashTable.java PRE-CREATION 
>   exec/java-exec/src/test/resources/common/test_hashtable1.json PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/18347/diff/
> 
> 
> Testing
> -------
> 
> I have run several tests manually as part of TestHashAggr...these tests use TPC-H data and in particular a relatively large 'Orders' table.  However, I have not yet packaged the tests to run as part of JUnit since the location and size of the parquet files needs to be figured out. I will continue to work on that.  
> 
> 
> Thanks,
> 
> Aman Sinha
> 
>


Re: Review Request 18347: Implement hash aggregation operator and a hash table (for Drill-335) and new templates for aggregate functions (Drill-318)

Posted by Aman Sinha <as...@maprtech.com>.

> On March 1, 2014, 4:20 p.m., Jacques Nadeau wrote:
> > exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableTemplate.java, line 96
> > <https://reviews.apache.org/r/18347/diff/1/?file=500195#file500195line96>
> >
> >     Same as above, let's use IntVector instead of primitive array, otherwise memory accounting is challenging.
> 
> Aman Sinha wrote:
>     Agreed...will make these off-heap.

Made links, hashValues and startIndices off-heap. 


- Aman


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/18347/#review35902
-----------------------------------------------------------


On March 6, 2014, 2:10 a.m., Aman Sinha wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/18347/
> -----------------------------------------------------------
> 
> (Updated March 6, 2014, 2:10 a.m.)
> 
> 
> Review request for drill, Jacques Nadeau, Jason Altekruse, Mehant Baid, and Steven Phillips.
> 
> 
> Bugs: Drill-318, Drill-335 and and
>     https://issues.apache.org/jira/browse/Drill-318
>     https://issues.apache.org/jira/browse/Drill-335
>     https://issues.apache.org/jira/browse/and
> 
> 
> Repository: drill-git
> 
> 
> Description
> -------
> 
> This patch contains the following main components: 
> 1. Implementation of the hash aggregation execution operator - this has two main parts: the HashAggTemplate and the HashAggBatch.  
> 2. Implementation of a hash table which is used by the hash aggregation.  The hash table hash two main parts: the HashTableTemplate and the ChainedHashTable. The hash table internally uses the notion of 'BatchHolder' to keep track of all keys that can fit within one batch of 64K values.  New BatcHolder objects are created as needed.  Each BatchHolder has its own vector container.  The HashAggregate also has a similar structure and it keeps track of the workspace variables.  
>  (NOTE: An initial design document for the hash aggregation and hash table was already attached with Drill-335.  The document has not yet been updated with the latest implementation ... will try to do that in the near future).
> 3. Jinfeng's changes to use workspace vectors in the generated code for aggregate functions (previously, for streaming aggregate we only needed to maintain workspace variable for 1 running group; however for hash aggregate we need to maintain it for all groups).  
> 4. Fix for Drill-318: because of #3 above, the previous fix for Drill-318 is not valid anymore.  I modified the template generation code for the aggregate functions such that they conform to the new infrastructure.  
> 5. The original AggTemplate, AggBatch and Aggregator classes have been moved to corresponding StreamingAggTemplate, StreamingAggBatch and StreamingAggregator in order to differentiate it from hash aggregation.  These appear as new files but the code there has not changed.  
> 
> 
> Diffs
> -----
> 
>   common/src/main/java/org/apache/drill/common/expression/fn/AggregationFunctions.java 5b46b78 
>   exec/java-exec/src/main/codegen/config.fmpp 8f1060a 
>   exec/java-exec/src/main/codegen/data/AggrTypes1.tdd PRE-CREATION 
>   exec/java-exec/src/main/codegen/data/AggrTypes2.tdd PRE-CREATION 
>   exec/java-exec/src/main/codegen/templates/AggrTypeFunctions1.java PRE-CREATION 
>   exec/java-exec/src/main/codegen/templates/AggrTypeFunctions2.java PRE-CREATION 
>   exec/java-exec/src/main/codegen/templates/TypeHelper.java 8c72cf7 
>   exec/java-exec/src/main/java/org/apache/drill/exec/compile/sig/GeneratorMapping.java f99150e 
>   exec/java-exec/src/main/java/org/apache/drill/exec/compile/sig/MappingSet.java 1cb2380 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/ClassGenerator.java e674eab 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/DrillAggFuncHolder.java 7622865 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/DrillFuncHolder.java dcdf1ea 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/FunctionConverter.java befa9bf 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/agg/impl/CountFunctions.java b0939f1 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/agg/impl/SumFunctions.java 57905fe 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/BitFunctions.java 41bb7c6 
>   exec/java-exec/src/main/java/org/apache/drill/exec/ops/FragmentContext.java 8462622 
>   exec/java-exec/src/main/java/org/apache/drill/exec/opt/BasicOptimizer.java aed4802 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/OperatorCost.java 0bade41 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractPhysicalVisitor.java ec7244a 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/PhysicalVisitor.java 2e2d7fd 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/config/HashAggregate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggBatch.java 86fea4e 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggBatchCreator.java 36344f9 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggTemplate.java a8f39e3 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/Aggregator.java 2683045 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatch.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatchCreator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggregator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatch.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatchCreator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggTemplate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggregator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/ChainedHashTable.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTable.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableConfig.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableTemplate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/project/ProjectRecordBatch.java 509d13b 
>   exec/java-exec/src/main/java/org/apache/drill/exec/record/VectorContainer.java 6e764a8 
>   exec/java-exec/src/test/java/org/apache/drill/exec/expr/ExpressionTest.java 0de64b4 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/SimpleRootExec.java ccf7758 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/TestSimpleFunctions.java f6a8096 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/agg/TestHashAggr.java PRE-CREATION 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/common/TestHashTable.java PRE-CREATION 
>   exec/java-exec/src/test/resources/common/test_hashtable1.json PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/18347/diff/
> 
> 
> Testing
> -------
> 
> I have run several tests manually as part of TestHashAggr...these tests use TPC-H data and in particular a relatively large 'Orders' table.  However, I have not yet packaged the tests to run as part of JUnit since the location and size of the parquet files needs to be figured out. I will continue to work on that.  
> 
> 
> Thanks,
> 
> Aman Sinha
> 
>


Re: Review Request 18347: Implement hash aggregation operator and a hash table (for Drill-335) and new templates for aggregate functions (Drill-318)

Posted by Jacques Nadeau <ja...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/18347/#review35902
-----------------------------------------------------------


Not a full review but wanted to send these two notes along.


exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableTemplate.java
<https://reviews.apache.org/r/18347/#comment66715>

    Let's pull these off heap and use an IntVector



exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableTemplate.java
<https://reviews.apache.org/r/18347/#comment66716>

    Same as above, let's use IntVector instead of primitive array, otherwise memory accounting is challenging.


- Jacques Nadeau


On Feb. 21, 2014, 8:23 p.m., Aman Sinha wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/18347/
> -----------------------------------------------------------
> 
> (Updated Feb. 21, 2014, 8:23 p.m.)
> 
> 
> Review request for drill, John Morris, Jacques Nadeau, Jason Altekruse, and Mehant Baid.
> 
> 
> Bugs: Drill-318, Drill-335 and and
>     https://issues.apache.org/jira/browse/Drill-318
>     https://issues.apache.org/jira/browse/Drill-335
>     https://issues.apache.org/jira/browse/and
> 
> 
> Repository: drill-git
> 
> 
> Description
> -------
> 
> This patch contains the following main components: 
> 1. Implementation of the hash aggregation execution operator - this has two main parts: the HashAggTemplate and the HashAggBatch.  
> 2. Implementation of a hash table which is used by the hash aggregation.  The hash table hash two main parts: the HashTableTemplate and the ChainedHashTable. The hash table internally uses the notion of 'BatchHolder' to keep track of all keys that can fit within one batch of 64K values.  New BatcHolder objects are created as needed.  Each BatchHolder has its own vector container.  The HashAggregate also has a similar structure and it keeps track of the workspace variables.  
>  (NOTE: An initial design document for the hash aggregation and hash table was already attached with Drill-335.  The document has not yet been updated with the latest implementation ... will try to do that in the near future).
> 3. Jinfeng's changes to use workspace vectors in the generated code for aggregate functions (previously, for streaming aggregate we only needed to maintain workspace variable for 1 running group; however for hash aggregate we need to maintain it for all groups).  
> 4. Fix for Drill-318: because of #3 above, the previous fix for Drill-318 is not valid anymore.  I modified the template generation code for the aggregate functions such that they conform to the new infrastructure.  
> 5. The original AggTemplate, AggBatch and Aggregator classes have been moved to corresponding StreamingAggTemplate, StreamingAggBatch and StreamingAggregator in order to differentiate it from hash aggregation.  These appear as new files but the code there has not changed.  
> 
> 
> Diffs
> -----
> 
>   common/src/main/java/org/apache/drill/common/expression/fn/AggregationFunctions.java 5b46b78 
>   exec/java-exec/src/main/codegen/config.fmpp cd2b2cc 
>   exec/java-exec/src/main/codegen/data/AggrTypes1.tdd PRE-CREATION 
>   exec/java-exec/src/main/codegen/data/AggrTypes2.tdd PRE-CREATION 
>   exec/java-exec/src/main/codegen/templates/AggrTypeFunctions1.java PRE-CREATION 
>   exec/java-exec/src/main/codegen/templates/AggrTypeFunctions2.java PRE-CREATION 
>   exec/java-exec/src/main/codegen/templates/TypeHelper.java 783f943 
>   exec/java-exec/src/main/java/org/apache/drill/exec/compile/sig/GeneratorMapping.java f99150e 
>   exec/java-exec/src/main/java/org/apache/drill/exec/compile/sig/MappingSet.java 1cb2380 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/ClassGenerator.java e674eab 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/DrillAggFuncHolder.java 7622865 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/DrillFuncHolder.java dcdf1ea 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/FunctionConverter.java befa9bf 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/agg/impl/CountFunctions.java b0939f1 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/agg/impl/SumFunctions.java 57905fe 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/BitFunctions.java 41bb7c6 
>   exec/java-exec/src/main/java/org/apache/drill/exec/ops/FragmentContext.java fd965a7 
>   exec/java-exec/src/main/java/org/apache/drill/exec/opt/BasicOptimizer.java 6e58ec7 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/OperatorCost.java 0bade41 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractPhysicalVisitor.java ec7244a 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/PhysicalVisitor.java 2e2d7fd 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/config/HashAggregate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggBatch.java 86fea4e 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggBatchCreator.java 36344f9 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggTemplate.java a8f39e3 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/Aggregator.java 2683045 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatch.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatchCreator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggregator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatch.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatchCreator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggTemplate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggregator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/ChainedHashTable.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTable.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableConfig.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableTemplate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/project/ProjectRecordBatch.java 3b8154e 
>   exec/java-exec/src/test/java/org/apache/drill/exec/expr/ExpressionTest.java 0de64b4 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/SimpleRootExec.java ccf7758 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/agg/TestHashAggr.java PRE-CREATION 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/common/TestHashTable.java PRE-CREATION 
>   exec/java-exec/src/test/resources/common/test_hashtable1.json PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/18347/diff/
> 
> 
> Testing
> -------
> 
> I have run several tests manually as part of TestHashAggr...these tests use TPC-H data and in particular a relatively large 'Orders' table.  However, I have not yet packaged the tests to run as part of JUnit since the location and size of the parquet files needs to be figured out. I will continue to work on that.  
> 
> 
> Thanks,
> 
> Aman Sinha
> 
>


Re: Review Request 18347: Implement hash aggregation operator and a hash table (for Drill-335) and new templates for aggregate functions (Drill-318)

Posted by Aman Sinha <as...@maprtech.com>.

> On March 1, 2014, 8:23 a.m., Timothy Chen wrote:
> > exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/common/TestHashTable.java, line 74
> > <https://reviews.apache.org/r/18347/diff/1/?file=500200#file500200line74>
> >
> >     There is no single test for these changes?

I have about a dozen tests that I run (as mentioned in the Testing section of the review)...but these are semi-automated only.  I am not using mock scan, instead using real data..I will package some of these as part of TestHashTable and TestHashAggr. 


- Aman


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/18347/#review35899
-----------------------------------------------------------


On Feb. 21, 2014, 8:23 p.m., Aman Sinha wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/18347/
> -----------------------------------------------------------
> 
> (Updated Feb. 21, 2014, 8:23 p.m.)
> 
> 
> Review request for drill, John Morris, Jacques Nadeau, Jason Altekruse, and Mehant Baid.
> 
> 
> Bugs: Drill-318, Drill-335 and and
>     https://issues.apache.org/jira/browse/Drill-318
>     https://issues.apache.org/jira/browse/Drill-335
>     https://issues.apache.org/jira/browse/and
> 
> 
> Repository: drill-git
> 
> 
> Description
> -------
> 
> This patch contains the following main components: 
> 1. Implementation of the hash aggregation execution operator - this has two main parts: the HashAggTemplate and the HashAggBatch.  
> 2. Implementation of a hash table which is used by the hash aggregation.  The hash table hash two main parts: the HashTableTemplate and the ChainedHashTable. The hash table internally uses the notion of 'BatchHolder' to keep track of all keys that can fit within one batch of 64K values.  New BatcHolder objects are created as needed.  Each BatchHolder has its own vector container.  The HashAggregate also has a similar structure and it keeps track of the workspace variables.  
>  (NOTE: An initial design document for the hash aggregation and hash table was already attached with Drill-335.  The document has not yet been updated with the latest implementation ... will try to do that in the near future).
> 3. Jinfeng's changes to use workspace vectors in the generated code for aggregate functions (previously, for streaming aggregate we only needed to maintain workspace variable for 1 running group; however for hash aggregate we need to maintain it for all groups).  
> 4. Fix for Drill-318: because of #3 above, the previous fix for Drill-318 is not valid anymore.  I modified the template generation code for the aggregate functions such that they conform to the new infrastructure.  
> 5. The original AggTemplate, AggBatch and Aggregator classes have been moved to corresponding StreamingAggTemplate, StreamingAggBatch and StreamingAggregator in order to differentiate it from hash aggregation.  These appear as new files but the code there has not changed.  
> 
> 
> Diffs
> -----
> 
>   common/src/main/java/org/apache/drill/common/expression/fn/AggregationFunctions.java 5b46b78 
>   exec/java-exec/src/main/codegen/config.fmpp cd2b2cc 
>   exec/java-exec/src/main/codegen/data/AggrTypes1.tdd PRE-CREATION 
>   exec/java-exec/src/main/codegen/data/AggrTypes2.tdd PRE-CREATION 
>   exec/java-exec/src/main/codegen/templates/AggrTypeFunctions1.java PRE-CREATION 
>   exec/java-exec/src/main/codegen/templates/AggrTypeFunctions2.java PRE-CREATION 
>   exec/java-exec/src/main/codegen/templates/TypeHelper.java 783f943 
>   exec/java-exec/src/main/java/org/apache/drill/exec/compile/sig/GeneratorMapping.java f99150e 
>   exec/java-exec/src/main/java/org/apache/drill/exec/compile/sig/MappingSet.java 1cb2380 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/ClassGenerator.java e674eab 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/DrillAggFuncHolder.java 7622865 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/DrillFuncHolder.java dcdf1ea 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/FunctionConverter.java befa9bf 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/agg/impl/CountFunctions.java b0939f1 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/agg/impl/SumFunctions.java 57905fe 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/BitFunctions.java 41bb7c6 
>   exec/java-exec/src/main/java/org/apache/drill/exec/ops/FragmentContext.java fd965a7 
>   exec/java-exec/src/main/java/org/apache/drill/exec/opt/BasicOptimizer.java 6e58ec7 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/OperatorCost.java 0bade41 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractPhysicalVisitor.java ec7244a 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/PhysicalVisitor.java 2e2d7fd 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/config/HashAggregate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggBatch.java 86fea4e 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggBatchCreator.java 36344f9 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggTemplate.java a8f39e3 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/Aggregator.java 2683045 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatch.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatchCreator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggregator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatch.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatchCreator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggTemplate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggregator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/ChainedHashTable.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTable.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableConfig.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableTemplate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/project/ProjectRecordBatch.java 3b8154e 
>   exec/java-exec/src/test/java/org/apache/drill/exec/expr/ExpressionTest.java 0de64b4 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/SimpleRootExec.java ccf7758 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/agg/TestHashAggr.java PRE-CREATION 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/common/TestHashTable.java PRE-CREATION 
>   exec/java-exec/src/test/resources/common/test_hashtable1.json PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/18347/diff/
> 
> 
> Testing
> -------
> 
> I have run several tests manually as part of TestHashAggr...these tests use TPC-H data and in particular a relatively large 'Orders' table.  However, I have not yet packaged the tests to run as part of JUnit since the location and size of the parquet files needs to be figured out. I will continue to work on that.  
> 
> 
> Thanks,
> 
> Aman Sinha
> 
>


Re: Review Request 18347: Implement hash aggregation operator and a hash table (for Drill-335) and new templates for aggregate functions (Drill-318)

Posted by Jacques Nadeau <ja...@gmail.com>.

> On March 1, 2014, 8:23 a.m., Timothy Chen wrote:
> > exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/common/TestHashTable.java, line 74
> > <https://reviews.apache.org/r/18347/diff/1/?file=500200#file500200line74>
> >
> >     There is no single test for these changes?
> 
> Aman Sinha wrote:
>     I have about a dozen tests that I run (as mentioned in the Testing section of the review)...but these are semi-automated only.  I am not using mock scan, instead using real data..I will package some of these as part of TestHashTable and TestHashAggr.

We're in the process of publishing larger example datasets publicly so that we can add tests that leverage them without filling up the source repository


- Jacques


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/18347/#review35899
-----------------------------------------------------------


On Feb. 21, 2014, 8:23 p.m., Aman Sinha wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/18347/
> -----------------------------------------------------------
> 
> (Updated Feb. 21, 2014, 8:23 p.m.)
> 
> 
> Review request for drill, John Morris, Jacques Nadeau, Jason Altekruse, and Mehant Baid.
> 
> 
> Bugs: Drill-318, Drill-335 and and
>     https://issues.apache.org/jira/browse/Drill-318
>     https://issues.apache.org/jira/browse/Drill-335
>     https://issues.apache.org/jira/browse/and
> 
> 
> Repository: drill-git
> 
> 
> Description
> -------
> 
> This patch contains the following main components: 
> 1. Implementation of the hash aggregation execution operator - this has two main parts: the HashAggTemplate and the HashAggBatch.  
> 2. Implementation of a hash table which is used by the hash aggregation.  The hash table hash two main parts: the HashTableTemplate and the ChainedHashTable. The hash table internally uses the notion of 'BatchHolder' to keep track of all keys that can fit within one batch of 64K values.  New BatcHolder objects are created as needed.  Each BatchHolder has its own vector container.  The HashAggregate also has a similar structure and it keeps track of the workspace variables.  
>  (NOTE: An initial design document for the hash aggregation and hash table was already attached with Drill-335.  The document has not yet been updated with the latest implementation ... will try to do that in the near future).
> 3. Jinfeng's changes to use workspace vectors in the generated code for aggregate functions (previously, for streaming aggregate we only needed to maintain workspace variable for 1 running group; however for hash aggregate we need to maintain it for all groups).  
> 4. Fix for Drill-318: because of #3 above, the previous fix for Drill-318 is not valid anymore.  I modified the template generation code for the aggregate functions such that they conform to the new infrastructure.  
> 5. The original AggTemplate, AggBatch and Aggregator classes have been moved to corresponding StreamingAggTemplate, StreamingAggBatch and StreamingAggregator in order to differentiate it from hash aggregation.  These appear as new files but the code there has not changed.  
> 
> 
> Diffs
> -----
> 
>   common/src/main/java/org/apache/drill/common/expression/fn/AggregationFunctions.java 5b46b78 
>   exec/java-exec/src/main/codegen/config.fmpp cd2b2cc 
>   exec/java-exec/src/main/codegen/data/AggrTypes1.tdd PRE-CREATION 
>   exec/java-exec/src/main/codegen/data/AggrTypes2.tdd PRE-CREATION 
>   exec/java-exec/src/main/codegen/templates/AggrTypeFunctions1.java PRE-CREATION 
>   exec/java-exec/src/main/codegen/templates/AggrTypeFunctions2.java PRE-CREATION 
>   exec/java-exec/src/main/codegen/templates/TypeHelper.java 783f943 
>   exec/java-exec/src/main/java/org/apache/drill/exec/compile/sig/GeneratorMapping.java f99150e 
>   exec/java-exec/src/main/java/org/apache/drill/exec/compile/sig/MappingSet.java 1cb2380 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/ClassGenerator.java e674eab 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/DrillAggFuncHolder.java 7622865 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/DrillFuncHolder.java dcdf1ea 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/FunctionConverter.java befa9bf 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/agg/impl/CountFunctions.java b0939f1 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/agg/impl/SumFunctions.java 57905fe 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/BitFunctions.java 41bb7c6 
>   exec/java-exec/src/main/java/org/apache/drill/exec/ops/FragmentContext.java fd965a7 
>   exec/java-exec/src/main/java/org/apache/drill/exec/opt/BasicOptimizer.java 6e58ec7 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/OperatorCost.java 0bade41 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractPhysicalVisitor.java ec7244a 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/PhysicalVisitor.java 2e2d7fd 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/config/HashAggregate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggBatch.java 86fea4e 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggBatchCreator.java 36344f9 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggTemplate.java a8f39e3 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/Aggregator.java 2683045 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatch.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatchCreator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggregator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatch.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatchCreator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggTemplate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggregator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/ChainedHashTable.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTable.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableConfig.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableTemplate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/project/ProjectRecordBatch.java 3b8154e 
>   exec/java-exec/src/test/java/org/apache/drill/exec/expr/ExpressionTest.java 0de64b4 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/SimpleRootExec.java ccf7758 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/agg/TestHashAggr.java PRE-CREATION 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/common/TestHashTable.java PRE-CREATION 
>   exec/java-exec/src/test/resources/common/test_hashtable1.json PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/18347/diff/
> 
> 
> Testing
> -------
> 
> I have run several tests manually as part of TestHashAggr...these tests use TPC-H data and in particular a relatively large 'Orders' table.  However, I have not yet packaged the tests to run as part of JUnit since the location and size of the parquet files needs to be figured out. I will continue to work on that.  
> 
> 
> Thanks,
> 
> Aman Sinha
> 
>


Re: Review Request 18347: Implement hash aggregation operator and a hash table (for Drill-335) and new templates for aggregate functions (Drill-318)

Posted by Timothy Chen <tn...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/18347/#review35899
-----------------------------------------------------------



exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/common/TestHashTable.java
<https://reviews.apache.org/r/18347/#comment66713>

    There is no single test for these changes?


- Timothy Chen


On Feb. 21, 2014, 8:23 p.m., Aman Sinha wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/18347/
> -----------------------------------------------------------
> 
> (Updated Feb. 21, 2014, 8:23 p.m.)
> 
> 
> Review request for drill, John Morris, Jacques Nadeau, Jason Altekruse, and Mehant Baid.
> 
> 
> Bugs: Drill-318, Drill-335 and and
>     https://issues.apache.org/jira/browse/Drill-318
>     https://issues.apache.org/jira/browse/Drill-335
>     https://issues.apache.org/jira/browse/and
> 
> 
> Repository: drill-git
> 
> 
> Description
> -------
> 
> This patch contains the following main components: 
> 1. Implementation of the hash aggregation execution operator - this has two main parts: the HashAggTemplate and the HashAggBatch.  
> 2. Implementation of a hash table which is used by the hash aggregation.  The hash table hash two main parts: the HashTableTemplate and the ChainedHashTable. The hash table internally uses the notion of 'BatchHolder' to keep track of all keys that can fit within one batch of 64K values.  New BatcHolder objects are created as needed.  Each BatchHolder has its own vector container.  The HashAggregate also has a similar structure and it keeps track of the workspace variables.  
>  (NOTE: An initial design document for the hash aggregation and hash table was already attached with Drill-335.  The document has not yet been updated with the latest implementation ... will try to do that in the near future).
> 3. Jinfeng's changes to use workspace vectors in the generated code for aggregate functions (previously, for streaming aggregate we only needed to maintain workspace variable for 1 running group; however for hash aggregate we need to maintain it for all groups).  
> 4. Fix for Drill-318: because of #3 above, the previous fix for Drill-318 is not valid anymore.  I modified the template generation code for the aggregate functions such that they conform to the new infrastructure.  
> 5. The original AggTemplate, AggBatch and Aggregator classes have been moved to corresponding StreamingAggTemplate, StreamingAggBatch and StreamingAggregator in order to differentiate it from hash aggregation.  These appear as new files but the code there has not changed.  
> 
> 
> Diffs
> -----
> 
>   common/src/main/java/org/apache/drill/common/expression/fn/AggregationFunctions.java 5b46b78 
>   exec/java-exec/src/main/codegen/config.fmpp cd2b2cc 
>   exec/java-exec/src/main/codegen/data/AggrTypes1.tdd PRE-CREATION 
>   exec/java-exec/src/main/codegen/data/AggrTypes2.tdd PRE-CREATION 
>   exec/java-exec/src/main/codegen/templates/AggrTypeFunctions1.java PRE-CREATION 
>   exec/java-exec/src/main/codegen/templates/AggrTypeFunctions2.java PRE-CREATION 
>   exec/java-exec/src/main/codegen/templates/TypeHelper.java 783f943 
>   exec/java-exec/src/main/java/org/apache/drill/exec/compile/sig/GeneratorMapping.java f99150e 
>   exec/java-exec/src/main/java/org/apache/drill/exec/compile/sig/MappingSet.java 1cb2380 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/ClassGenerator.java e674eab 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/DrillAggFuncHolder.java 7622865 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/DrillFuncHolder.java dcdf1ea 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/FunctionConverter.java befa9bf 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/agg/impl/CountFunctions.java b0939f1 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/agg/impl/SumFunctions.java 57905fe 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/BitFunctions.java 41bb7c6 
>   exec/java-exec/src/main/java/org/apache/drill/exec/ops/FragmentContext.java fd965a7 
>   exec/java-exec/src/main/java/org/apache/drill/exec/opt/BasicOptimizer.java 6e58ec7 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/OperatorCost.java 0bade41 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractPhysicalVisitor.java ec7244a 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/PhysicalVisitor.java 2e2d7fd 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/config/HashAggregate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggBatch.java 86fea4e 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggBatchCreator.java 36344f9 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggTemplate.java a8f39e3 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/Aggregator.java 2683045 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatch.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatchCreator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggregator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatch.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatchCreator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggTemplate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggregator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/ChainedHashTable.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTable.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableConfig.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableTemplate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/project/ProjectRecordBatch.java 3b8154e 
>   exec/java-exec/src/test/java/org/apache/drill/exec/expr/ExpressionTest.java 0de64b4 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/SimpleRootExec.java ccf7758 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/agg/TestHashAggr.java PRE-CREATION 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/common/TestHashTable.java PRE-CREATION 
>   exec/java-exec/src/test/resources/common/test_hashtable1.json PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/18347/diff/
> 
> 
> Testing
> -------
> 
> I have run several tests manually as part of TestHashAggr...these tests use TPC-H data and in particular a relatively large 'Orders' table.  However, I have not yet packaged the tests to run as part of JUnit since the location and size of the parquet files needs to be figured out. I will continue to work on that.  
> 
> 
> Thanks,
> 
> Aman Sinha
> 
>


Re: Review Request 18347: Implement hash aggregation operator and a hash table (for Drill-335) and new templates for aggregate functions (Drill-318)

Posted by Aman Sinha <as...@maprtech.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/18347/#review37170
-----------------------------------------------------------



exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableTemplate.java
<https://reviews.apache.org/r/18347/#comment68579>

    Good point. I initially had explicit allocation of the new vectors but in order to fix an issue with incorrect assignment of field ids, I added the clone method and it looks like in the process I am only allocating the container.  I will fix it. 



exec/java-exec/src/main/java/org/apache/drill/exec/record/VectorContainer.java
<https://reviews.apache.org/r/18347/#comment68580>

    This is related to the first comment. Yes, the desired behavior is to create new vectors of the same type.  Will fix this. 


- Aman Sinha


On March 6, 2014, 2:10 a.m., Aman Sinha wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/18347/
> -----------------------------------------------------------
> 
> (Updated March 6, 2014, 2:10 a.m.)
> 
> 
> Review request for drill, Jacques Nadeau, Jason Altekruse, Mehant Baid, and Steven Phillips.
> 
> 
> Bugs: Drill-318, Drill-335 and and
>     https://issues.apache.org/jira/browse/Drill-318
>     https://issues.apache.org/jira/browse/Drill-335
>     https://issues.apache.org/jira/browse/and
> 
> 
> Repository: drill-git
> 
> 
> Description
> -------
> 
> This patch contains the following main components: 
> 1. Implementation of the hash aggregation execution operator - this has two main parts: the HashAggTemplate and the HashAggBatch.  
> 2. Implementation of a hash table which is used by the hash aggregation.  The hash table hash two main parts: the HashTableTemplate and the ChainedHashTable. The hash table internally uses the notion of 'BatchHolder' to keep track of all keys that can fit within one batch of 64K values.  New BatcHolder objects are created as needed.  Each BatchHolder has its own vector container.  The HashAggregate also has a similar structure and it keeps track of the workspace variables.  
>  (NOTE: An initial design document for the hash aggregation and hash table was already attached with Drill-335.  The document has not yet been updated with the latest implementation ... will try to do that in the near future).
> 3. Jinfeng's changes to use workspace vectors in the generated code for aggregate functions (previously, for streaming aggregate we only needed to maintain workspace variable for 1 running group; however for hash aggregate we need to maintain it for all groups).  
> 4. Fix for Drill-318: because of #3 above, the previous fix for Drill-318 is not valid anymore.  I modified the template generation code for the aggregate functions such that they conform to the new infrastructure.  
> 5. The original AggTemplate, AggBatch and Aggregator classes have been moved to corresponding StreamingAggTemplate, StreamingAggBatch and StreamingAggregator in order to differentiate it from hash aggregation.  These appear as new files but the code there has not changed.  
> 
> 
> Diffs
> -----
> 
>   common/src/main/java/org/apache/drill/common/expression/fn/AggregationFunctions.java 5b46b78 
>   exec/java-exec/src/main/codegen/config.fmpp 8f1060a 
>   exec/java-exec/src/main/codegen/data/AggrTypes1.tdd PRE-CREATION 
>   exec/java-exec/src/main/codegen/data/AggrTypes2.tdd PRE-CREATION 
>   exec/java-exec/src/main/codegen/templates/AggrTypeFunctions1.java PRE-CREATION 
>   exec/java-exec/src/main/codegen/templates/AggrTypeFunctions2.java PRE-CREATION 
>   exec/java-exec/src/main/codegen/templates/TypeHelper.java 8c72cf7 
>   exec/java-exec/src/main/java/org/apache/drill/exec/compile/sig/GeneratorMapping.java f99150e 
>   exec/java-exec/src/main/java/org/apache/drill/exec/compile/sig/MappingSet.java 1cb2380 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/ClassGenerator.java e674eab 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/DrillAggFuncHolder.java 7622865 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/DrillFuncHolder.java dcdf1ea 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/FunctionConverter.java befa9bf 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/agg/impl/CountFunctions.java b0939f1 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/agg/impl/SumFunctions.java 57905fe 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/BitFunctions.java 41bb7c6 
>   exec/java-exec/src/main/java/org/apache/drill/exec/ops/FragmentContext.java 8462622 
>   exec/java-exec/src/main/java/org/apache/drill/exec/opt/BasicOptimizer.java aed4802 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/OperatorCost.java 0bade41 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractPhysicalVisitor.java ec7244a 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/PhysicalVisitor.java 2e2d7fd 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/config/HashAggregate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggBatch.java 86fea4e 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggBatchCreator.java 36344f9 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggTemplate.java a8f39e3 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/Aggregator.java 2683045 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatch.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatchCreator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggregator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatch.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatchCreator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggTemplate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggregator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/ChainedHashTable.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTable.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableConfig.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableTemplate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/project/ProjectRecordBatch.java 509d13b 
>   exec/java-exec/src/main/java/org/apache/drill/exec/record/VectorContainer.java 6e764a8 
>   exec/java-exec/src/test/java/org/apache/drill/exec/expr/ExpressionTest.java 0de64b4 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/SimpleRootExec.java ccf7758 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/TestSimpleFunctions.java f6a8096 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/agg/TestHashAggr.java PRE-CREATION 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/common/TestHashTable.java PRE-CREATION 
>   exec/java-exec/src/test/resources/common/test_hashtable1.json PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/18347/diff/
> 
> 
> Testing
> -------
> 
> I have run several tests manually as part of TestHashAggr...these tests use TPC-H data and in particular a relatively large 'Orders' table.  However, I have not yet packaged the tests to run as part of JUnit since the location and size of the parquet files needs to be figured out. I will continue to work on that.  
> 
> 
> Thanks,
> 
> Aman Sinha
> 
>


Re: Review Request 18347: Implement hash aggregation operator and a hash table (for Drill-335) and new templates for aggregate functions (Drill-318)

Posted by Jason Altekruse <al...@gmail.com>.

> On March 12, 2014, 4:36 p.m., Jason Altekruse wrote:
> > exec/java-exec/src/main/codegen/templates/AggrTypeFunctions2.java, line 79
> > <https://reviews.apache.org/r/18347/diff/2/?file=512026#file512026line79>
> >
> >     While this strategy will produce the most accurate results, if we are storing the running sum in a BigInt for all of the types, we will almost certainly hit an overflow for inputs like BigInt, and will likely also be an issue for integer columns in full scale datasets as well. We should look at keeping an approximate average as the running state with a count of values factored in so far. New values should be factored by adding distance(runningAvg, currVal) * 1/valCount to the runningAvg. The downside is values that should push the average up or down just a little bit, once a lot of other values have been factored in will be lost to rounding errors. This can be fixed the same as above, by batching subsets of values that will have a collected sub-average applied to the total average all at once. As it is approximate, this could still produce inaccurate results if the sub-groups are not large enough, but this could be adjusted on a per query or per table basis.

By distance I mean runningAvg - currVal, I was thinking I was abstracting way an absolute value by calling it distance, but you actually want a simple subtraction here.


- Jason


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/18347/#review36717
-----------------------------------------------------------


On March 6, 2014, 2:10 a.m., Aman Sinha wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/18347/
> -----------------------------------------------------------
> 
> (Updated March 6, 2014, 2:10 a.m.)
> 
> 
> Review request for drill, Jacques Nadeau, Jason Altekruse, Mehant Baid, and Steven Phillips.
> 
> 
> Bugs: Drill-318, Drill-335 and and
>     https://issues.apache.org/jira/browse/Drill-318
>     https://issues.apache.org/jira/browse/Drill-335
>     https://issues.apache.org/jira/browse/and
> 
> 
> Repository: drill-git
> 
> 
> Description
> -------
> 
> This patch contains the following main components: 
> 1. Implementation of the hash aggregation execution operator - this has two main parts: the HashAggTemplate and the HashAggBatch.  
> 2. Implementation of a hash table which is used by the hash aggregation.  The hash table hash two main parts: the HashTableTemplate and the ChainedHashTable. The hash table internally uses the notion of 'BatchHolder' to keep track of all keys that can fit within one batch of 64K values.  New BatcHolder objects are created as needed.  Each BatchHolder has its own vector container.  The HashAggregate also has a similar structure and it keeps track of the workspace variables.  
>  (NOTE: An initial design document for the hash aggregation and hash table was already attached with Drill-335.  The document has not yet been updated with the latest implementation ... will try to do that in the near future).
> 3. Jinfeng's changes to use workspace vectors in the generated code for aggregate functions (previously, for streaming aggregate we only needed to maintain workspace variable for 1 running group; however for hash aggregate we need to maintain it for all groups).  
> 4. Fix for Drill-318: because of #3 above, the previous fix for Drill-318 is not valid anymore.  I modified the template generation code for the aggregate functions such that they conform to the new infrastructure.  
> 5. The original AggTemplate, AggBatch and Aggregator classes have been moved to corresponding StreamingAggTemplate, StreamingAggBatch and StreamingAggregator in order to differentiate it from hash aggregation.  These appear as new files but the code there has not changed.  
> 
> 
> Diffs
> -----
> 
>   common/src/main/java/org/apache/drill/common/expression/fn/AggregationFunctions.java 5b46b78 
>   exec/java-exec/src/main/codegen/config.fmpp 8f1060a 
>   exec/java-exec/src/main/codegen/data/AggrTypes1.tdd PRE-CREATION 
>   exec/java-exec/src/main/codegen/data/AggrTypes2.tdd PRE-CREATION 
>   exec/java-exec/src/main/codegen/templates/AggrTypeFunctions1.java PRE-CREATION 
>   exec/java-exec/src/main/codegen/templates/AggrTypeFunctions2.java PRE-CREATION 
>   exec/java-exec/src/main/codegen/templates/TypeHelper.java 8c72cf7 
>   exec/java-exec/src/main/java/org/apache/drill/exec/compile/sig/GeneratorMapping.java f99150e 
>   exec/java-exec/src/main/java/org/apache/drill/exec/compile/sig/MappingSet.java 1cb2380 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/ClassGenerator.java e674eab 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/DrillAggFuncHolder.java 7622865 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/DrillFuncHolder.java dcdf1ea 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/FunctionConverter.java befa9bf 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/agg/impl/CountFunctions.java b0939f1 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/agg/impl/SumFunctions.java 57905fe 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/BitFunctions.java 41bb7c6 
>   exec/java-exec/src/main/java/org/apache/drill/exec/ops/FragmentContext.java 8462622 
>   exec/java-exec/src/main/java/org/apache/drill/exec/opt/BasicOptimizer.java aed4802 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/OperatorCost.java 0bade41 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractPhysicalVisitor.java ec7244a 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/PhysicalVisitor.java 2e2d7fd 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/config/HashAggregate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggBatch.java 86fea4e 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggBatchCreator.java 36344f9 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggTemplate.java a8f39e3 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/Aggregator.java 2683045 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatch.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatchCreator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggregator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatch.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatchCreator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggTemplate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggregator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/ChainedHashTable.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTable.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableConfig.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableTemplate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/project/ProjectRecordBatch.java 509d13b 
>   exec/java-exec/src/main/java/org/apache/drill/exec/record/VectorContainer.java 6e764a8 
>   exec/java-exec/src/test/java/org/apache/drill/exec/expr/ExpressionTest.java 0de64b4 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/SimpleRootExec.java ccf7758 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/TestSimpleFunctions.java f6a8096 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/agg/TestHashAggr.java PRE-CREATION 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/common/TestHashTable.java PRE-CREATION 
>   exec/java-exec/src/test/resources/common/test_hashtable1.json PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/18347/diff/
> 
> 
> Testing
> -------
> 
> I have run several tests manually as part of TestHashAggr...these tests use TPC-H data and in particular a relatively large 'Orders' table.  However, I have not yet packaged the tests to run as part of JUnit since the location and size of the parquet files needs to be figured out. I will continue to work on that.  
> 
> 
> Thanks,
> 
> Aman Sinha
> 
>


Re: Review Request 18347: Implement hash aggregation operator and a hash table (for Drill-335) and new templates for aggregate functions (Drill-318)

Posted by Ted Dunning <te...@gmail.com>.
Also, when doubles are being summed, kahan summation can preserve more bits of doubles. 

Sent from my iPhone

> On Mar 14, 2014, at 18:11, "Jacques Nadeau" <ja...@gmail.com> wrote:
> 
> 
> 
>>> On March 12, 2014, 4:36 p.m., Jason Altekruse wrote:
>>> exec/java-exec/src/main/codegen/data/AggrTypes1.tdd, line 43
>>> <https://reviews.apache.org/r/18347/diff/2/?file=512023#file512023line43>
>>> 
>>>    Is it possible that we would want to make the output type of a sum of BigInts (or even Ints with the dataset sizes we are expecting) a type other than BigInt? If we move to something like a double, or a BigDecimal we could store a more accurate guess of the sum, rather than just overflowing the output value, which I assume would be reasonably common. There is the problem that while doubles can store large values, they lose precision, so as we calculate the sum (assuming it is going to be a large number of small values) once we reach a large partial sum, adding a single value would not increment the sum. We would have to store intermediate sums in some kind of temporary storage until they are large enough to be added to the whole sum and not be lost to rounding.
> 
> I think moving to approximate results implicitly is dangerous.  Let's keep these as big int.  If someone has an overflow problem, they could just cast prior to summing.
> 
> 
> - Jacques
> 
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/18347/#review36717
> -----------------------------------------------------------
> 
> 
>> On March 6, 2014, 2:10 a.m., Aman Sinha wrote:
>> 
>> -----------------------------------------------------------
>> This is an automatically generated e-mail. To reply, visit:
>> https://reviews.apache.org/r/18347/
>> -----------------------------------------------------------
>> 
>> (Updated March 6, 2014, 2:10 a.m.)
>> 
>> 
>> Review request for drill, Jacques Nadeau, Jason Altekruse, Mehant Baid, and Steven Phillips.
>> 
>> 
>> Bugs: Drill-318, Drill-335 and and
>>    https://issues.apache.org/jira/browse/Drill-318
>>    https://issues.apache.org/jira/browse/Drill-335
>>    https://issues.apache.org/jira/browse/and
>> 
>> 
>> Repository: drill-git
>> 
>> 
>> Description
>> -------
>> 
>> This patch contains the following main components: 
>> 1. Implementation of the hash aggregation execution operator - this has two main parts: the HashAggTemplate and the HashAggBatch.  
>> 2. Implementation of a hash table which is used by the hash aggregation.  The hash table hash two main parts: the HashTableTemplate and the ChainedHashTable. The hash table internally uses the notion of 'BatchHolder' to keep track of all keys that can fit within one batch of 64K values.  New BatcHolder objects are created as needed.  Each BatchHolder has its own vector container.  The HashAggregate also has a similar structure and it keeps track of the workspace variables.  
>> (NOTE: An initial design document for the hash aggregation and hash table was already attached with Drill-335.  The document has not yet been updated with the latest implementation ... will try to do that in the near future).
>> 3. Jinfeng's changes to use workspace vectors in the generated code for aggregate functions (previously, for streaming aggregate we only needed to maintain workspace variable for 1 running group; however for hash aggregate we need to maintain it for all groups).  
>> 4. Fix for Drill-318: because of #3 above, the previous fix for Drill-318 is not valid anymore.  I modified the template generation code for the aggregate functions such that they conform to the new infrastructure.  
>> 5. The original AggTemplate, AggBatch and Aggregator classes have been moved to corresponding StreamingAggTemplate, StreamingAggBatch and StreamingAggregator in order to differentiate it from hash aggregation.  These appear as new files but the code there has not changed.  
>> 
>> 
>> Diffs
>> -----
>> 
>>  common/src/main/java/org/apache/drill/common/expression/fn/AggregationFunctions.java 5b46b78 
>>  exec/java-exec/src/main/codegen/config.fmpp 8f1060a 
>>  exec/java-exec/src/main/codegen/data/AggrTypes1.tdd PRE-CREATION 
>>  exec/java-exec/src/main/codegen/data/AggrTypes2.tdd PRE-CREATION 
>>  exec/java-exec/src/main/codegen/templates/AggrTypeFunctions1.java PRE-CREATION 
>>  exec/java-exec/src/main/codegen/templates/AggrTypeFunctions2.java PRE-CREATION 
>>  exec/java-exec/src/main/codegen/templates/TypeHelper.java 8c72cf7 
>>  exec/java-exec/src/main/java/org/apache/drill/exec/compile/sig/GeneratorMapping.java f99150e 
>>  exec/java-exec/src/main/java/org/apache/drill/exec/compile/sig/MappingSet.java 1cb2380 
>>  exec/java-exec/src/main/java/org/apache/drill/exec/expr/ClassGenerator.java e674eab 
>>  exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/DrillAggFuncHolder.java 7622865 
>>  exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/DrillFuncHolder.java dcdf1ea 
>>  exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/FunctionConverter.java befa9bf 
>>  exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/agg/impl/CountFunctions.java b0939f1 
>>  exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/agg/impl/SumFunctions.java 57905fe 
>>  exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/BitFunctions.java 41bb7c6 
>>  exec/java-exec/src/main/java/org/apache/drill/exec/ops/FragmentContext.java 8462622 
>>  exec/java-exec/src/main/java/org/apache/drill/exec/opt/BasicOptimizer.java aed4802 
>>  exec/java-exec/src/main/java/org/apache/drill/exec/physical/OperatorCost.java 0bade41 
>>  exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractPhysicalVisitor.java ec7244a 
>>  exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/PhysicalVisitor.java 2e2d7fd 
>>  exec/java-exec/src/main/java/org/apache/drill/exec/physical/config/HashAggregate.java PRE-CREATION 
>>  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggBatch.java 86fea4e 
>>  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggBatchCreator.java 36344f9 
>>  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggTemplate.java a8f39e3 
>>  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/Aggregator.java 2683045 
>>  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatch.java PRE-CREATION 
>>  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatchCreator.java PRE-CREATION 
>>  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java PRE-CREATION 
>>  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggregator.java PRE-CREATION 
>>  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatch.java PRE-CREATION 
>>  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatchCreator.java PRE-CREATION 
>>  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggTemplate.java PRE-CREATION 
>>  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggregator.java PRE-CREATION 
>>  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/ChainedHashTable.java PRE-CREATION 
>>  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTable.java PRE-CREATION 
>>  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableConfig.java PRE-CREATION 
>>  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableTemplate.java PRE-CREATION 
>>  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/project/ProjectRecordBatch.java 509d13b 
>>  exec/java-exec/src/main/java/org/apache/drill/exec/record/VectorContainer.java 6e764a8 
>>  exec/java-exec/src/test/java/org/apache/drill/exec/expr/ExpressionTest.java 0de64b4 
>>  exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/SimpleRootExec.java ccf7758 
>>  exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/TestSimpleFunctions.java f6a8096 
>>  exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/agg/TestHashAggr.java PRE-CREATION 
>>  exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/common/TestHashTable.java PRE-CREATION 
>>  exec/java-exec/src/test/resources/common/test_hashtable1.json PRE-CREATION 
>> 
>> Diff: https://reviews.apache.org/r/18347/diff/
>> 
>> 
>> Testing
>> -------
>> 
>> I have run several tests manually as part of TestHashAggr...these tests use TPC-H data and in particular a relatively large 'Orders' table.  However, I have not yet packaged the tests to run as part of JUnit since the location and size of the parquet files needs to be figured out. I will continue to work on that.  
>> 
>> 
>> Thanks,
>> 
>> Aman Sinha
> 

Re: Review Request 18347: Implement hash aggregation operator and a hash table (for Drill-335) and new templates for aggregate functions (Drill-318)

Posted by Jacques Nadeau <ja...@gmail.com>.

> On March 12, 2014, 4:36 p.m., Jason Altekruse wrote:
> > exec/java-exec/src/main/codegen/data/AggrTypes1.tdd, line 43
> > <https://reviews.apache.org/r/18347/diff/2/?file=512023#file512023line43>
> >
> >     Is it possible that we would want to make the output type of a sum of BigInts (or even Ints with the dataset sizes we are expecting) a type other than BigInt? If we move to something like a double, or a BigDecimal we could store a more accurate guess of the sum, rather than just overflowing the output value, which I assume would be reasonably common. There is the problem that while doubles can store large values, they lose precision, so as we calculate the sum (assuming it is going to be a large number of small values) once we reach a large partial sum, adding a single value would not increment the sum. We would have to store intermediate sums in some kind of temporary storage until they are large enough to be added to the whole sum and not be lost to rounding.

I think moving to approximate results implicitly is dangerous.  Let's keep these as big int.  If someone has an overflow problem, they could just cast prior to summing.


- Jacques


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/18347/#review36717
-----------------------------------------------------------


On March 6, 2014, 2:10 a.m., Aman Sinha wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/18347/
> -----------------------------------------------------------
> 
> (Updated March 6, 2014, 2:10 a.m.)
> 
> 
> Review request for drill, Jacques Nadeau, Jason Altekruse, Mehant Baid, and Steven Phillips.
> 
> 
> Bugs: Drill-318, Drill-335 and and
>     https://issues.apache.org/jira/browse/Drill-318
>     https://issues.apache.org/jira/browse/Drill-335
>     https://issues.apache.org/jira/browse/and
> 
> 
> Repository: drill-git
> 
> 
> Description
> -------
> 
> This patch contains the following main components: 
> 1. Implementation of the hash aggregation execution operator - this has two main parts: the HashAggTemplate and the HashAggBatch.  
> 2. Implementation of a hash table which is used by the hash aggregation.  The hash table hash two main parts: the HashTableTemplate and the ChainedHashTable. The hash table internally uses the notion of 'BatchHolder' to keep track of all keys that can fit within one batch of 64K values.  New BatcHolder objects are created as needed.  Each BatchHolder has its own vector container.  The HashAggregate also has a similar structure and it keeps track of the workspace variables.  
>  (NOTE: An initial design document for the hash aggregation and hash table was already attached with Drill-335.  The document has not yet been updated with the latest implementation ... will try to do that in the near future).
> 3. Jinfeng's changes to use workspace vectors in the generated code for aggregate functions (previously, for streaming aggregate we only needed to maintain workspace variable for 1 running group; however for hash aggregate we need to maintain it for all groups).  
> 4. Fix for Drill-318: because of #3 above, the previous fix for Drill-318 is not valid anymore.  I modified the template generation code for the aggregate functions such that they conform to the new infrastructure.  
> 5. The original AggTemplate, AggBatch and Aggregator classes have been moved to corresponding StreamingAggTemplate, StreamingAggBatch and StreamingAggregator in order to differentiate it from hash aggregation.  These appear as new files but the code there has not changed.  
> 
> 
> Diffs
> -----
> 
>   common/src/main/java/org/apache/drill/common/expression/fn/AggregationFunctions.java 5b46b78 
>   exec/java-exec/src/main/codegen/config.fmpp 8f1060a 
>   exec/java-exec/src/main/codegen/data/AggrTypes1.tdd PRE-CREATION 
>   exec/java-exec/src/main/codegen/data/AggrTypes2.tdd PRE-CREATION 
>   exec/java-exec/src/main/codegen/templates/AggrTypeFunctions1.java PRE-CREATION 
>   exec/java-exec/src/main/codegen/templates/AggrTypeFunctions2.java PRE-CREATION 
>   exec/java-exec/src/main/codegen/templates/TypeHelper.java 8c72cf7 
>   exec/java-exec/src/main/java/org/apache/drill/exec/compile/sig/GeneratorMapping.java f99150e 
>   exec/java-exec/src/main/java/org/apache/drill/exec/compile/sig/MappingSet.java 1cb2380 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/ClassGenerator.java e674eab 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/DrillAggFuncHolder.java 7622865 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/DrillFuncHolder.java dcdf1ea 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/FunctionConverter.java befa9bf 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/agg/impl/CountFunctions.java b0939f1 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/agg/impl/SumFunctions.java 57905fe 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/BitFunctions.java 41bb7c6 
>   exec/java-exec/src/main/java/org/apache/drill/exec/ops/FragmentContext.java 8462622 
>   exec/java-exec/src/main/java/org/apache/drill/exec/opt/BasicOptimizer.java aed4802 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/OperatorCost.java 0bade41 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractPhysicalVisitor.java ec7244a 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/PhysicalVisitor.java 2e2d7fd 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/config/HashAggregate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggBatch.java 86fea4e 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggBatchCreator.java 36344f9 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggTemplate.java a8f39e3 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/Aggregator.java 2683045 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatch.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatchCreator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggregator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatch.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatchCreator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggTemplate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggregator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/ChainedHashTable.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTable.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableConfig.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableTemplate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/project/ProjectRecordBatch.java 509d13b 
>   exec/java-exec/src/main/java/org/apache/drill/exec/record/VectorContainer.java 6e764a8 
>   exec/java-exec/src/test/java/org/apache/drill/exec/expr/ExpressionTest.java 0de64b4 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/SimpleRootExec.java ccf7758 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/TestSimpleFunctions.java f6a8096 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/agg/TestHashAggr.java PRE-CREATION 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/common/TestHashTable.java PRE-CREATION 
>   exec/java-exec/src/test/resources/common/test_hashtable1.json PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/18347/diff/
> 
> 
> Testing
> -------
> 
> I have run several tests manually as part of TestHashAggr...these tests use TPC-H data and in particular a relatively large 'Orders' table.  However, I have not yet packaged the tests to run as part of JUnit since the location and size of the parquet files needs to be figured out. I will continue to work on that.  
> 
> 
> Thanks,
> 
> Aman Sinha
> 
>


Re: Review Request 18347: Implement hash aggregation operator and a hash table (for Drill-335) and new templates for aggregate functions (Drill-318)

Posted by Jason Altekruse <al...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/18347/#review36717
-----------------------------------------------------------



exec/java-exec/src/main/codegen/data/AggrTypes1.tdd
<https://reviews.apache.org/r/18347/#comment67800>

    Is it possible that we would want to make the output type of a sum of BigInts (or even Ints with the dataset sizes we are expecting) a type other than BigInt? If we move to something like a double, or a BigDecimal we could store a more accurate guess of the sum, rather than just overflowing the output value, which I assume would be reasonably common. There is the problem that while doubles can store large values, they lose precision, so as we calculate the sum (assuming it is going to be a large number of small values) once we reach a large partial sum, adding a single value would not increment the sum. We would have to store intermediate sums in some kind of temporary storage until they are large enough to be added to the whole sum and not be lost to rounding.



exec/java-exec/src/main/codegen/templates/AggrTypeFunctions2.java
<https://reviews.apache.org/r/18347/#comment67805>

    While this strategy will produce the most accurate results, if we are storing the running sum in a BigInt for all of the types, we will almost certainly hit an overflow for inputs like BigInt, and will likely also be an issue for integer columns in full scale datasets as well. We should look at keeping an approximate average as the running state with a count of values factored in so far. New values should be factored by adding distance(runningAvg, currVal) * 1/valCount to the runningAvg. The downside is values that should push the average up or down just a little bit, once a lot of other values have been factored in will be lost to rounding errors. This can be fixed the same as above, by batching subsets of values that will have a collected sub-average applied to the total average all at once. As it is approximate, this could still produce inaccurate results if the sub-groups are not large enough, but this could be adjusted on a per query or per table basis.


- Jason Altekruse


On March 6, 2014, 2:10 a.m., Aman Sinha wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/18347/
> -----------------------------------------------------------
> 
> (Updated March 6, 2014, 2:10 a.m.)
> 
> 
> Review request for drill, Jacques Nadeau, Jason Altekruse, Mehant Baid, and Steven Phillips.
> 
> 
> Bugs: Drill-318, Drill-335 and and
>     https://issues.apache.org/jira/browse/Drill-318
>     https://issues.apache.org/jira/browse/Drill-335
>     https://issues.apache.org/jira/browse/and
> 
> 
> Repository: drill-git
> 
> 
> Description
> -------
> 
> This patch contains the following main components: 
> 1. Implementation of the hash aggregation execution operator - this has two main parts: the HashAggTemplate and the HashAggBatch.  
> 2. Implementation of a hash table which is used by the hash aggregation.  The hash table hash two main parts: the HashTableTemplate and the ChainedHashTable. The hash table internally uses the notion of 'BatchHolder' to keep track of all keys that can fit within one batch of 64K values.  New BatcHolder objects are created as needed.  Each BatchHolder has its own vector container.  The HashAggregate also has a similar structure and it keeps track of the workspace variables.  
>  (NOTE: An initial design document for the hash aggregation and hash table was already attached with Drill-335.  The document has not yet been updated with the latest implementation ... will try to do that in the near future).
> 3. Jinfeng's changes to use workspace vectors in the generated code for aggregate functions (previously, for streaming aggregate we only needed to maintain workspace variable for 1 running group; however for hash aggregate we need to maintain it for all groups).  
> 4. Fix for Drill-318: because of #3 above, the previous fix for Drill-318 is not valid anymore.  I modified the template generation code for the aggregate functions such that they conform to the new infrastructure.  
> 5. The original AggTemplate, AggBatch and Aggregator classes have been moved to corresponding StreamingAggTemplate, StreamingAggBatch and StreamingAggregator in order to differentiate it from hash aggregation.  These appear as new files but the code there has not changed.  
> 
> 
> Diffs
> -----
> 
>   common/src/main/java/org/apache/drill/common/expression/fn/AggregationFunctions.java 5b46b78 
>   exec/java-exec/src/main/codegen/config.fmpp 8f1060a 
>   exec/java-exec/src/main/codegen/data/AggrTypes1.tdd PRE-CREATION 
>   exec/java-exec/src/main/codegen/data/AggrTypes2.tdd PRE-CREATION 
>   exec/java-exec/src/main/codegen/templates/AggrTypeFunctions1.java PRE-CREATION 
>   exec/java-exec/src/main/codegen/templates/AggrTypeFunctions2.java PRE-CREATION 
>   exec/java-exec/src/main/codegen/templates/TypeHelper.java 8c72cf7 
>   exec/java-exec/src/main/java/org/apache/drill/exec/compile/sig/GeneratorMapping.java f99150e 
>   exec/java-exec/src/main/java/org/apache/drill/exec/compile/sig/MappingSet.java 1cb2380 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/ClassGenerator.java e674eab 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/DrillAggFuncHolder.java 7622865 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/DrillFuncHolder.java dcdf1ea 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/FunctionConverter.java befa9bf 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/agg/impl/CountFunctions.java b0939f1 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/agg/impl/SumFunctions.java 57905fe 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/BitFunctions.java 41bb7c6 
>   exec/java-exec/src/main/java/org/apache/drill/exec/ops/FragmentContext.java 8462622 
>   exec/java-exec/src/main/java/org/apache/drill/exec/opt/BasicOptimizer.java aed4802 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/OperatorCost.java 0bade41 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractPhysicalVisitor.java ec7244a 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/PhysicalVisitor.java 2e2d7fd 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/config/HashAggregate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggBatch.java 86fea4e 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggBatchCreator.java 36344f9 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggTemplate.java a8f39e3 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/Aggregator.java 2683045 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatch.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatchCreator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggregator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatch.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatchCreator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggTemplate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggregator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/ChainedHashTable.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTable.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableConfig.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableTemplate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/project/ProjectRecordBatch.java 509d13b 
>   exec/java-exec/src/main/java/org/apache/drill/exec/record/VectorContainer.java 6e764a8 
>   exec/java-exec/src/test/java/org/apache/drill/exec/expr/ExpressionTest.java 0de64b4 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/SimpleRootExec.java ccf7758 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/TestSimpleFunctions.java f6a8096 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/agg/TestHashAggr.java PRE-CREATION 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/common/TestHashTable.java PRE-CREATION 
>   exec/java-exec/src/test/resources/common/test_hashtable1.json PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/18347/diff/
> 
> 
> Testing
> -------
> 
> I have run several tests manually as part of TestHashAggr...these tests use TPC-H data and in particular a relatively large 'Orders' table.  However, I have not yet packaged the tests to run as part of JUnit since the location and size of the parquet files needs to be figured out. I will continue to work on that.  
> 
> 
> Thanks,
> 
> Aman Sinha
> 
>


Re: Review Request 18347: Implement hash aggregation operator and a hash table (for Drill-335) and new templates for aggregate functions (Drill-318)

Posted by Aman Sinha <as...@maprtech.com>.

> On March 14, 2014, 1:43 a.m., Steven Phillips wrote:
> > exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableTemplate.java, line 116
> > <https://reviews.apache.org/r/18347/diff/2/?file=512058#file512058line116>
> >
> >     Won't the cloned containers share the vectors with the original if you use this method? Maybe I'm missing something, but this looks like it will cause corruption.

Fixed in latest patch.


> On March 14, 2014, 1:43 a.m., Steven Phillips wrote:
> > exec/java-exec/src/main/java/org/apache/drill/exec/record/VectorContainer.java, line 89
> > <https://reviews.apache.org/r/18347/diff/2/?file=512060#file512060line89>
> >
> >     After calling clone, the new container and incoming container will have references to the same vectors? Is the desired behavior to create new vectors of the same type?

This method is not needed anymore. Uploaded latest patch.


- Aman


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/18347/#review37149
-----------------------------------------------------------


On March 15, 2014, 8:55 p.m., Aman Sinha wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/18347/
> -----------------------------------------------------------
> 
> (Updated March 15, 2014, 8:55 p.m.)
> 
> 
> Review request for drill, Jacques Nadeau, Jason Altekruse, Mehant Baid, and Steven Phillips.
> 
> 
> Bugs: Drill-318, Drill-335 and and
>     https://issues.apache.org/jira/browse/Drill-318
>     https://issues.apache.org/jira/browse/Drill-335
>     https://issues.apache.org/jira/browse/and
> 
> 
> Repository: drill-git
> 
> 
> Description
> -------
> 
> This patch contains the following main components: 
> 1. Implementation of the hash aggregation execution operator - this has two main parts: the HashAggTemplate and the HashAggBatch.  
> 2. Implementation of a hash table which is used by the hash aggregation.  The hash table hash two main parts: the HashTableTemplate and the ChainedHashTable. The hash table internally uses the notion of 'BatchHolder' to keep track of all keys that can fit within one batch of 64K values.  New BatcHolder objects are created as needed.  Each BatchHolder has its own vector container.  The HashAggregate also has a similar structure and it keeps track of the workspace variables.  
>  (NOTE: An initial design document for the hash aggregation and hash table was already attached with Drill-335.  The document has not yet been updated with the latest implementation ... will try to do that in the near future).
> 3. Jinfeng's changes to use workspace vectors in the generated code for aggregate functions (previously, for streaming aggregate we only needed to maintain workspace variable for 1 running group; however for hash aggregate we need to maintain it for all groups).  
> 4. Fix for Drill-318: because of #3 above, the previous fix for Drill-318 is not valid anymore.  I modified the template generation code for the aggregate functions such that they conform to the new infrastructure.  
> 5. The original AggTemplate, AggBatch and Aggregator classes have been moved to corresponding StreamingAggTemplate, StreamingAggBatch and StreamingAggregator in order to differentiate it from hash aggregation.  These appear as new files but the code there has not changed.  
> 
> 
> Diffs
> -----
> 
>   common/src/main/java/org/apache/drill/common/expression/fn/AggregationFunctions.java 5b46b78 
>   exec/java-exec/src/main/codegen/config.fmpp 8f1060a 
>   exec/java-exec/src/main/codegen/data/AggrTypes1.tdd PRE-CREATION 
>   exec/java-exec/src/main/codegen/data/AggrTypes2.tdd PRE-CREATION 
>   exec/java-exec/src/main/codegen/templates/AggrTypeFunctions1.java PRE-CREATION 
>   exec/java-exec/src/main/codegen/templates/AggrTypeFunctions2.java PRE-CREATION 
>   exec/java-exec/src/main/codegen/templates/TypeHelper.java 8c72cf7 
>   exec/java-exec/src/main/java/org/apache/drill/exec/compile/sig/GeneratorMapping.java f99150e 
>   exec/java-exec/src/main/java/org/apache/drill/exec/compile/sig/MappingSet.java 1cb2380 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/ClassGenerator.java e674eab 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/DrillAggFuncHolder.java 7622865 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/DrillFuncHolder.java dcdf1ea 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/FunctionConverter.java befa9bf 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/agg/impl/CountFunctions.java b0939f1 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/agg/impl/SumFunctions.java 57905fe 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/BitFunctions.java 41bb7c6 
>   exec/java-exec/src/main/java/org/apache/drill/exec/ops/FragmentContext.java 8462622 
>   exec/java-exec/src/main/java/org/apache/drill/exec/opt/BasicOptimizer.java 4ea87e0 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/OperatorCost.java 0bade41 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractPhysicalVisitor.java ec7244a 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/PhysicalVisitor.java 2e2d7fd 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/config/HashAggregate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggBatch.java 86fea4e 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggBatchCreator.java 36344f9 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggTemplate.java a8f39e3 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/Aggregator.java 2683045 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatch.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatchCreator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggregator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatch.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatchCreator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggTemplate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggregator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/ChainedHashTable.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTable.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableConfig.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableTemplate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/project/ProjectRecordBatch.java 509d13b 
>   exec/java-exec/src/test/java/org/apache/drill/exec/expr/ExpressionTest.java 0de64b4 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/SimpleRootExec.java ccf7758 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/TestSimpleFunctions.java f6a8096 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/agg/TestHashAggr.java PRE-CREATION 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/common/TestHashTable.java PRE-CREATION 
>   exec/java-exec/src/test/resources/common/test_hashtable1.json PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/18347/diff/
> 
> 
> Testing
> -------
> 
> I have run several tests manually as part of TestHashAggr...these tests use TPC-H data and in particular a relatively large 'Orders' table.  However, I have not yet packaged the tests to run as part of JUnit since the location and size of the parquet files needs to be figured out. I will continue to work on that.  
> 
> 
> Thanks,
> 
> Aman Sinha
> 
>


Re: Review Request 18347: Implement hash aggregation operator and a hash table (for Drill-335) and new templates for aggregate functions (Drill-318)

Posted by Steven Phillips <sp...@maprtech.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/18347/#review37149
-----------------------------------------------------------



exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableTemplate.java
<https://reviews.apache.org/r/18347/#comment68534>

    Won't the cloned containers share the vectors with the original if you use this method? Maybe I'm missing something, but this looks like it will cause corruption.



exec/java-exec/src/main/java/org/apache/drill/exec/record/VectorContainer.java
<https://reviews.apache.org/r/18347/#comment68536>

    After calling clone, the new container and incoming container will have references to the same vectors? Is the desired behavior to create new vectors of the same type?


- Steven Phillips


On March 6, 2014, 2:10 a.m., Aman Sinha wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/18347/
> -----------------------------------------------------------
> 
> (Updated March 6, 2014, 2:10 a.m.)
> 
> 
> Review request for drill, Jacques Nadeau, Jason Altekruse, Mehant Baid, and Steven Phillips.
> 
> 
> Bugs: Drill-318, Drill-335 and and
>     https://issues.apache.org/jira/browse/Drill-318
>     https://issues.apache.org/jira/browse/Drill-335
>     https://issues.apache.org/jira/browse/and
> 
> 
> Repository: drill-git
> 
> 
> Description
> -------
> 
> This patch contains the following main components: 
> 1. Implementation of the hash aggregation execution operator - this has two main parts: the HashAggTemplate and the HashAggBatch.  
> 2. Implementation of a hash table which is used by the hash aggregation.  The hash table hash two main parts: the HashTableTemplate and the ChainedHashTable. The hash table internally uses the notion of 'BatchHolder' to keep track of all keys that can fit within one batch of 64K values.  New BatcHolder objects are created as needed.  Each BatchHolder has its own vector container.  The HashAggregate also has a similar structure and it keeps track of the workspace variables.  
>  (NOTE: An initial design document for the hash aggregation and hash table was already attached with Drill-335.  The document has not yet been updated with the latest implementation ... will try to do that in the near future).
> 3. Jinfeng's changes to use workspace vectors in the generated code for aggregate functions (previously, for streaming aggregate we only needed to maintain workspace variable for 1 running group; however for hash aggregate we need to maintain it for all groups).  
> 4. Fix for Drill-318: because of #3 above, the previous fix for Drill-318 is not valid anymore.  I modified the template generation code for the aggregate functions such that they conform to the new infrastructure.  
> 5. The original AggTemplate, AggBatch and Aggregator classes have been moved to corresponding StreamingAggTemplate, StreamingAggBatch and StreamingAggregator in order to differentiate it from hash aggregation.  These appear as new files but the code there has not changed.  
> 
> 
> Diffs
> -----
> 
>   common/src/main/java/org/apache/drill/common/expression/fn/AggregationFunctions.java 5b46b78 
>   exec/java-exec/src/main/codegen/config.fmpp 8f1060a 
>   exec/java-exec/src/main/codegen/data/AggrTypes1.tdd PRE-CREATION 
>   exec/java-exec/src/main/codegen/data/AggrTypes2.tdd PRE-CREATION 
>   exec/java-exec/src/main/codegen/templates/AggrTypeFunctions1.java PRE-CREATION 
>   exec/java-exec/src/main/codegen/templates/AggrTypeFunctions2.java PRE-CREATION 
>   exec/java-exec/src/main/codegen/templates/TypeHelper.java 8c72cf7 
>   exec/java-exec/src/main/java/org/apache/drill/exec/compile/sig/GeneratorMapping.java f99150e 
>   exec/java-exec/src/main/java/org/apache/drill/exec/compile/sig/MappingSet.java 1cb2380 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/ClassGenerator.java e674eab 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/DrillAggFuncHolder.java 7622865 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/DrillFuncHolder.java dcdf1ea 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/FunctionConverter.java befa9bf 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/agg/impl/CountFunctions.java b0939f1 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/agg/impl/SumFunctions.java 57905fe 
>   exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/BitFunctions.java 41bb7c6 
>   exec/java-exec/src/main/java/org/apache/drill/exec/ops/FragmentContext.java 8462622 
>   exec/java-exec/src/main/java/org/apache/drill/exec/opt/BasicOptimizer.java aed4802 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/OperatorCost.java 0bade41 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractPhysicalVisitor.java ec7244a 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/PhysicalVisitor.java 2e2d7fd 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/config/HashAggregate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggBatch.java 86fea4e 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggBatchCreator.java 36344f9 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggTemplate.java a8f39e3 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/Aggregator.java 2683045 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatch.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatchCreator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggregator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatch.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatchCreator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggTemplate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggregator.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/ChainedHashTable.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTable.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableConfig.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableTemplate.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/project/ProjectRecordBatch.java 509d13b 
>   exec/java-exec/src/main/java/org/apache/drill/exec/record/VectorContainer.java 6e764a8 
>   exec/java-exec/src/test/java/org/apache/drill/exec/expr/ExpressionTest.java 0de64b4 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/SimpleRootExec.java ccf7758 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/TestSimpleFunctions.java f6a8096 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/agg/TestHashAggr.java PRE-CREATION 
>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/common/TestHashTable.java PRE-CREATION 
>   exec/java-exec/src/test/resources/common/test_hashtable1.json PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/18347/diff/
> 
> 
> Testing
> -------
> 
> I have run several tests manually as part of TestHashAggr...these tests use TPC-H data and in particular a relatively large 'Orders' table.  However, I have not yet packaged the tests to run as part of JUnit since the location and size of the parquet files needs to be figured out. I will continue to work on that.  
> 
> 
> Thanks,
> 
> Aman Sinha
> 
>


Re: Review Request 18347: Implement hash aggregation operator and a hash table (for Drill-335) and new templates for aggregate functions (Drill-318)

Posted by Aman Sinha <as...@maprtech.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/18347/
-----------------------------------------------------------

(Updated March 15, 2014, 8:55 p.m.)


Review request for drill, Jacques Nadeau, Jason Altekruse, Mehant Baid, and Steven Phillips.


Changes
-------

Addressed some code review comments and rebased my changes (actually cherry-picked since it worked smoother compared to rebasing) on upstream/master. 


Bugs: Drill-318, Drill-335 and and
    https://issues.apache.org/jira/browse/Drill-318
    https://issues.apache.org/jira/browse/Drill-335
    https://issues.apache.org/jira/browse/and


Repository: drill-git


Description
-------

This patch contains the following main components: 
1. Implementation of the hash aggregation execution operator - this has two main parts: the HashAggTemplate and the HashAggBatch.  
2. Implementation of a hash table which is used by the hash aggregation.  The hash table hash two main parts: the HashTableTemplate and the ChainedHashTable. The hash table internally uses the notion of 'BatchHolder' to keep track of all keys that can fit within one batch of 64K values.  New BatcHolder objects are created as needed.  Each BatchHolder has its own vector container.  The HashAggregate also has a similar structure and it keeps track of the workspace variables.  
 (NOTE: An initial design document for the hash aggregation and hash table was already attached with Drill-335.  The document has not yet been updated with the latest implementation ... will try to do that in the near future).
3. Jinfeng's changes to use workspace vectors in the generated code for aggregate functions (previously, for streaming aggregate we only needed to maintain workspace variable for 1 running group; however for hash aggregate we need to maintain it for all groups).  
4. Fix for Drill-318: because of #3 above, the previous fix for Drill-318 is not valid anymore.  I modified the template generation code for the aggregate functions such that they conform to the new infrastructure.  
5. The original AggTemplate, AggBatch and Aggregator classes have been moved to corresponding StreamingAggTemplate, StreamingAggBatch and StreamingAggregator in order to differentiate it from hash aggregation.  These appear as new files but the code there has not changed.  


Diffs (updated)
-----

  common/src/main/java/org/apache/drill/common/expression/fn/AggregationFunctions.java 5b46b78 
  exec/java-exec/src/main/codegen/config.fmpp 8f1060a 
  exec/java-exec/src/main/codegen/data/AggrTypes1.tdd PRE-CREATION 
  exec/java-exec/src/main/codegen/data/AggrTypes2.tdd PRE-CREATION 
  exec/java-exec/src/main/codegen/templates/AggrTypeFunctions1.java PRE-CREATION 
  exec/java-exec/src/main/codegen/templates/AggrTypeFunctions2.java PRE-CREATION 
  exec/java-exec/src/main/codegen/templates/TypeHelper.java 8c72cf7 
  exec/java-exec/src/main/java/org/apache/drill/exec/compile/sig/GeneratorMapping.java f99150e 
  exec/java-exec/src/main/java/org/apache/drill/exec/compile/sig/MappingSet.java 1cb2380 
  exec/java-exec/src/main/java/org/apache/drill/exec/expr/ClassGenerator.java e674eab 
  exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/DrillAggFuncHolder.java 7622865 
  exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/DrillFuncHolder.java dcdf1ea 
  exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/FunctionConverter.java befa9bf 
  exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/agg/impl/CountFunctions.java b0939f1 
  exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/agg/impl/SumFunctions.java 57905fe 
  exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/BitFunctions.java 41bb7c6 
  exec/java-exec/src/main/java/org/apache/drill/exec/ops/FragmentContext.java 8462622 
  exec/java-exec/src/main/java/org/apache/drill/exec/opt/BasicOptimizer.java 4ea87e0 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/OperatorCost.java 0bade41 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractPhysicalVisitor.java ec7244a 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/PhysicalVisitor.java 2e2d7fd 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/config/HashAggregate.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggBatch.java 86fea4e 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggBatchCreator.java 36344f9 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggTemplate.java a8f39e3 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/Aggregator.java 2683045 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatch.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatchCreator.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggregator.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatch.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatchCreator.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggTemplate.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggregator.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/ChainedHashTable.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTable.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableConfig.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableTemplate.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/project/ProjectRecordBatch.java 509d13b 
  exec/java-exec/src/test/java/org/apache/drill/exec/expr/ExpressionTest.java 0de64b4 
  exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/SimpleRootExec.java ccf7758 
  exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/TestSimpleFunctions.java f6a8096 
  exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/agg/TestHashAggr.java PRE-CREATION 
  exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/common/TestHashTable.java PRE-CREATION 
  exec/java-exec/src/test/resources/common/test_hashtable1.json PRE-CREATION 

Diff: https://reviews.apache.org/r/18347/diff/


Testing
-------

I have run several tests manually as part of TestHashAggr...these tests use TPC-H data and in particular a relatively large 'Orders' table.  However, I have not yet packaged the tests to run as part of JUnit since the location and size of the parquet files needs to be figured out. I will continue to work on that.  


Thanks,

Aman Sinha


Re: Review Request 18347: Implement hash aggregation operator and a hash table (for Drill-335) and new templates for aggregate functions (Drill-318)

Posted by Aman Sinha <as...@maprtech.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/18347/
-----------------------------------------------------------

(Updated March 6, 2014, 2:10 a.m.)


Review request for drill, Jacques Nadeau, Jason Altekruse, Mehant Baid, and Steven Phillips.


Changes
-------

This second patch is a cumulative patch that includes the previous patch plus the following changes/enhancements:

1. It includes support for the code generation of the probe side of a hash join (previously, we only needed the 'build' side for the hash aggregate)
2. Fix for a index out of bound error that I encountered while doing testing - this was related to using the wrong TypedFieldId when accessing the hash table's container. This is resolved by doing the creation of htContainer as part of ChainedHashTable and making a clone of it in HashTableTemplate. 
3. Address a review comment: Allocate the links, hashValues and startIndices using our vector allocators (off-heap) instead of on the heap.   


Bugs: Drill-318, Drill-335 and and
    https://issues.apache.org/jira/browse/Drill-318
    https://issues.apache.org/jira/browse/Drill-335
    https://issues.apache.org/jira/browse/and


Repository: drill-git


Description
-------

This patch contains the following main components: 
1. Implementation of the hash aggregation execution operator - this has two main parts: the HashAggTemplate and the HashAggBatch.  
2. Implementation of a hash table which is used by the hash aggregation.  The hash table hash two main parts: the HashTableTemplate and the ChainedHashTable. The hash table internally uses the notion of 'BatchHolder' to keep track of all keys that can fit within one batch of 64K values.  New BatcHolder objects are created as needed.  Each BatchHolder has its own vector container.  The HashAggregate also has a similar structure and it keeps track of the workspace variables.  
 (NOTE: An initial design document for the hash aggregation and hash table was already attached with Drill-335.  The document has not yet been updated with the latest implementation ... will try to do that in the near future).
3. Jinfeng's changes to use workspace vectors in the generated code for aggregate functions (previously, for streaming aggregate we only needed to maintain workspace variable for 1 running group; however for hash aggregate we need to maintain it for all groups).  
4. Fix for Drill-318: because of #3 above, the previous fix for Drill-318 is not valid anymore.  I modified the template generation code for the aggregate functions such that they conform to the new infrastructure.  
5. The original AggTemplate, AggBatch and Aggregator classes have been moved to corresponding StreamingAggTemplate, StreamingAggBatch and StreamingAggregator in order to differentiate it from hash aggregation.  These appear as new files but the code there has not changed.  


Diffs (updated)
-----

  common/src/main/java/org/apache/drill/common/expression/fn/AggregationFunctions.java 5b46b78 
  exec/java-exec/src/main/codegen/config.fmpp 8f1060a 
  exec/java-exec/src/main/codegen/data/AggrTypes1.tdd PRE-CREATION 
  exec/java-exec/src/main/codegen/data/AggrTypes2.tdd PRE-CREATION 
  exec/java-exec/src/main/codegen/templates/AggrTypeFunctions1.java PRE-CREATION 
  exec/java-exec/src/main/codegen/templates/AggrTypeFunctions2.java PRE-CREATION 
  exec/java-exec/src/main/codegen/templates/TypeHelper.java 8c72cf7 
  exec/java-exec/src/main/java/org/apache/drill/exec/compile/sig/GeneratorMapping.java f99150e 
  exec/java-exec/src/main/java/org/apache/drill/exec/compile/sig/MappingSet.java 1cb2380 
  exec/java-exec/src/main/java/org/apache/drill/exec/expr/ClassGenerator.java e674eab 
  exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/DrillAggFuncHolder.java 7622865 
  exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/DrillFuncHolder.java dcdf1ea 
  exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/FunctionConverter.java befa9bf 
  exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/agg/impl/CountFunctions.java b0939f1 
  exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/agg/impl/SumFunctions.java 57905fe 
  exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/BitFunctions.java 41bb7c6 
  exec/java-exec/src/main/java/org/apache/drill/exec/ops/FragmentContext.java 8462622 
  exec/java-exec/src/main/java/org/apache/drill/exec/opt/BasicOptimizer.java aed4802 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/OperatorCost.java 0bade41 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractPhysicalVisitor.java ec7244a 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/PhysicalVisitor.java 2e2d7fd 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/config/HashAggregate.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggBatch.java 86fea4e 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggBatchCreator.java 36344f9 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggTemplate.java a8f39e3 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/Aggregator.java 2683045 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatch.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatchCreator.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggregator.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatch.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatchCreator.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggTemplate.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggregator.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/ChainedHashTable.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTable.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableConfig.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableTemplate.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/project/ProjectRecordBatch.java 509d13b 
  exec/java-exec/src/main/java/org/apache/drill/exec/record/VectorContainer.java 6e764a8 
  exec/java-exec/src/test/java/org/apache/drill/exec/expr/ExpressionTest.java 0de64b4 
  exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/SimpleRootExec.java ccf7758 
  exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/TestSimpleFunctions.java f6a8096 
  exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/agg/TestHashAggr.java PRE-CREATION 
  exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/common/TestHashTable.java PRE-CREATION 
  exec/java-exec/src/test/resources/common/test_hashtable1.json PRE-CREATION 

Diff: https://reviews.apache.org/r/18347/diff/


Testing
-------

I have run several tests manually as part of TestHashAggr...these tests use TPC-H data and in particular a relatively large 'Orders' table.  However, I have not yet packaged the tests to run as part of JUnit since the location and size of the parquet files needs to be figured out. I will continue to work on that.  


Thanks,

Aman Sinha