You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@hive.apache.org by Siying Dong <si...@fb.com> on 2011/04/20 20:28:29 UTC

Review Request: Input Sampling Splits

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/633/
-----------------------------------------------------------

Review request for hive, Ning Zhang and namit jain.


Summary
-------

We need a better input sampling to serve at least two purposes:
1. test their queries against a smaller data set
2. understand more about how the data look like without scanning the whole table.
A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs.


This addresses bug HIVE-2121.
    https://issues.apache.org/jira/browse/HIVE-2121


Diffs
-----

  trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1095244 
  trunk/conf/hive-default.xml 1095244 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1095244 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1095244 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1095244 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1095244 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1095244 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1095244 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1095244 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1095244 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1095244 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1095244 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1095244 
  trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q PRE-CREATION 
  trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION 
  trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION 
  trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out PRE-CREATION 
  trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION 
  trunk/ql/src/test/results/clientpositive/bucket1.q.out 1095244 
  trunk/ql/src/test/results/clientpositive/bucket2.q.out 1095244 
  trunk/ql/src/test/results/clientpositive/bucket3.q.out 1095244 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1095244 
  trunk/ql/src/test/results/clientpositive/sample1.q.out 1095244 
  trunk/ql/src/test/results/clientpositive/sample10.q.out 1095244 
  trunk/ql/src/test/results/clientpositive/sample2.q.out 1095244 
  trunk/ql/src/test/results/clientpositive/sample3.q.out 1095244 
  trunk/ql/src/test/results/clientpositive/sample4.q.out 1095244 
  trunk/ql/src/test/results/clientpositive/sample5.q.out 1095244 
  trunk/ql/src/test/results/clientpositive/sample6.q.out 1095244 
  trunk/ql/src/test/results/clientpositive/sample7.q.out 1095244 
  trunk/ql/src/test/results/clientpositive/sample8.q.out 1095244 
  trunk/ql/src/test/results/clientpositive/sample9.q.out 1095244 

Diff: https://reviews.apache.org/r/633/diff


Testing
-------

TestCliDriver TestNegativeCliDriver, manual tests on real clusters.


Thanks,

Siying

Re: Review Request: Input Sampling Splits

Posted by namit jain <nj...@fb.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/633/#review605
-----------------------------------------------------------



trunk/shims/src/0.20/java/org/apache/hadoop/hive/shims/Hadoop20Shims.java
<https://reviews.apache.org/r/633/#comment1249>

    talked to siying offline -
    
    the check:
    
    if (split instanceof Hadoop20Shims.InputSplitShim)
    
    
    is not needed - this can be replaced by an assert.
    
    Same in Hadoop20SShims.
    
    
    Otherwise looks good


- namit


On 2011-04-28 08:32:17, Siying Dong wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/633/
> -----------------------------------------------------------
> 
> (Updated 2011-04-28 08:32:17)
> 
> 
> Review request for hive, Ning Zhang and namit jain.
> 
> 
> Summary
> -------
> 
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.
> 
> This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs.
> 
> 
> This addresses bug HIVE-2121.
>     https://issues.apache.org/jira/browse/HIVE-2121
> 
> 
> Diffs
> -----
> 
>   trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1096852 
>   trunk/conf/hive-default.xml 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1096852 
>   trunk/ql/src/test/queries/clientnegative/split_sample_out_of_range.q PRE-CREATION 
>   trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION 
>   trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION 
>   trunk/ql/src/test/results/clientnegative/split_sample_out_of_range.q.out PRE-CREATION 
>   trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION 
>   trunk/ql/src/test/results/clientpositive/bucket1.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/bucket2.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/bucket3.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample1.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample10.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample2.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample3.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample4.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample5.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample6.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample7.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample8.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample9.q.out 1096852 
>   trunk/shims/src/0.20/java/org/apache/hadoop/hive/shims/Hadoop20Shims.java 1096852 
>   trunk/shims/src/0.20S/java/org/apache/hadoop/hive/shims/Hadoop20SShims.java 1096852 
>   trunk/shims/src/common/java/org/apache/hadoop/hive/shims/HadoopShims.java 1096852 
> 
> Diff: https://reviews.apache.org/r/633/diff
> 
> 
> Testing
> -------
> 
> TestCliDriver TestNegativeCliDriver, manual tests on real clusters.
> 
> 
> Thanks,
> 
> Siying
> 
>

Re: Review Request: Input Sampling Splits

Posted by Siying Dong <si...@fb.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/633/
-----------------------------------------------------------

(Updated 2011-04-28 08:32:17.534107)


Review request for hive, Ning Zhang and namit jain.


Changes
-------

Two changes made according to Namit's comments:
1. explain will print out some about the sampling. (It might not be the best way to print but it follows the framework)
2. the granularity of sampling is down from split-level to HDFS block level.


Summary
-------

We need a better input sampling to serve at least two purposes:
1. test their queries against a smaller data set
2. understand more about how the data look like without scanning the whole table.
A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs.


This addresses bug HIVE-2121.
    https://issues.apache.org/jira/browse/HIVE-2121


Diffs (updated)
-----

  trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1096852 
  trunk/conf/hive-default.xml 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1096852 
  trunk/ql/src/test/queries/clientnegative/split_sample_out_of_range.q PRE-CREATION 
  trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION 
  trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION 
  trunk/ql/src/test/results/clientnegative/split_sample_out_of_range.q.out PRE-CREATION 
  trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION 
  trunk/ql/src/test/results/clientpositive/bucket1.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/bucket2.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/bucket3.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample1.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample10.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample2.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample3.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample4.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample5.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample6.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample7.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample8.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample9.q.out 1096852 
  trunk/shims/src/0.20/java/org/apache/hadoop/hive/shims/Hadoop20Shims.java 1096852 
  trunk/shims/src/0.20S/java/org/apache/hadoop/hive/shims/Hadoop20SShims.java 1096852 
  trunk/shims/src/common/java/org/apache/hadoop/hive/shims/HadoopShims.java 1096852 

Diff: https://reviews.apache.org/r/633/diff


Testing
-------

TestCliDriver TestNegativeCliDriver, manual tests on real clusters.


Thanks,

Siying

Re: Review Request: Input Sampling Splits

Posted by Ning Zhang <nz...@fb.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/633/#review567
-----------------------------------------------------------



trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
<https://reviews.apache.org/r/633/#comment1205>

    This function could be quite expensive inside the loops. You may want to test a case where there are large # of partitions and each partition contains a large # of small files. 


- Ning


On 2011-04-26 21:19:18, Siying Dong wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/633/
> -----------------------------------------------------------
> 
> (Updated 2011-04-26 21:19:18)
> 
> 
> Review request for hive, Ning Zhang and namit jain.
> 
> 
> Summary
> -------
> 
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.
> 
> This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs.
> 
> 
> This addresses bug HIVE-2121.
>     https://issues.apache.org/jira/browse/HIVE-2121
> 
> 
> Diffs
> -----
> 
>   trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1096852 
>   trunk/conf/hive-default.xml 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1096852 
>   trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q PRE-CREATION 
>   trunk/ql/src/test/queries/clientnegative/split_sample_out_of_range.q PRE-CREATION 
>   trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION 
>   trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION 
>   trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out PRE-CREATION 
>   trunk/ql/src/test/results/clientnegative/split_sample_out_of_range.q.out PRE-CREATION 
>   trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION 
>   trunk/ql/src/test/results/clientpositive/bucket1.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/bucket2.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/bucket3.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample1.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample10.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample2.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample3.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample4.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample5.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample6.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample7.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample8.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample9.q.out 1096852 
> 
> Diff: https://reviews.apache.org/r/633/diff
> 
> 
> Testing
> -------
> 
> TestCliDriver TestNegativeCliDriver, manual tests on real clusters.
> 
> 
> Thanks,
> 
> Siying
> 
>

Re: Review Request: Input Sampling Splits

Posted by Siying Dong <si...@fb.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/633/
-----------------------------------------------------------

(Updated 2011-04-26 21:19:18.557345)


Review request for hive, Ning Zhang and namit jain.


Changes
-------

Addressing most of Ning's comments.


Summary
-------

We need a better input sampling to serve at least two purposes:
1. test their queries against a smaller data set
2. understand more about how the data look like without scanning the whole table.
A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs.


This addresses bug HIVE-2121.
    https://issues.apache.org/jira/browse/HIVE-2121


Diffs (updated)
-----

  trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1096852 
  trunk/conf/hive-default.xml 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1096852 
  trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q PRE-CREATION 
  trunk/ql/src/test/queries/clientnegative/split_sample_out_of_range.q PRE-CREATION 
  trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION 
  trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION 
  trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out PRE-CREATION 
  trunk/ql/src/test/results/clientnegative/split_sample_out_of_range.q.out PRE-CREATION 
  trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION 
  trunk/ql/src/test/results/clientpositive/bucket1.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/bucket2.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/bucket3.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample1.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample10.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample2.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample3.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample4.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample5.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample6.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample7.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample8.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample9.q.out 1096852 

Diff: https://reviews.apache.org/r/633/diff


Testing
-------

TestCliDriver TestNegativeCliDriver, manual tests on real clusters.


Thanks,

Siying

Re: Review Request: Input Sampling Splits

Posted by Siying Dong <si...@fb.com>.


> On 2011-04-26 20:50:30, Siying Dong wrote:
> > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java, line 498
> > <https://reviews.apache.org/r/633/diff/1/?file=16093#file16093line498>
> >
> >     I feel like it is a little hard to explain what this sample guarantees. It basically only guarantees that we fetch at least the sampled percentage of source data. Not exact number, nor guarantee for #rows. I think an option to disable it is a way to avoid confusion in some ways. How do you think?
> 
> Ning Zhang wrote:
>     I think if we specify clearly the semantics of block-level sample in the wiki/documentation, there shouldn't be much confusion. In fact I think it is much easier to explain than the bucket-based sampling. In addition if the user has confusions about the semantics, throwing an SemanticException won't help them understand.  I think the only use case for this parameter is to act as a gatekeeper to this feature if we found a bug in it and want to disable the feature quickly. That should be able to be achieved by switching branches quickly. If we have a gatekeeper parameter for each feature, the conf will grow unnecessarily large quickly.

OK. I'll remove this switch. We need to document and communicate very well to users. People will easily misunderstand this.


> On 2011-04-26 20:50:30, Siying Dong wrote:
> > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java, line 6392
> > <https://reviews.apache.org/r/633/diff/1/?file=16093#file16093line6392>
> >
> >     limit can be combined with block sampling. Just this optimization for limit doesn't make sense when users already sample the input data and we won't get much benefit.
> 
> Ning Zhang wrote:
>     I think combining these two still makes sense: 1) as you mentioned block sampling is not limiting on the # of rows, but limit is. Combining these two allows the users to get approximately N rows quickly. 2) this restriction makes an exception in terms of the query language composition. From the language syntax, it is allowed and makes senses to combine block-sampling and limit, but the user will get a SemanticException if they do. I think SemanticException should be thrown only when there is a legitimate semantic error (e.g., the percentage number is negative). If you feel that it is not a major use case and would rather do it in a follow-up JIRA, we should document it in TODO and file a JIRA for it.

I think it is the misunderstanding here. Limit works with split sampling well. No exception will be thrown in any of those two combination and the result will be what we expected.
This condition only disabled the optimization that runs against a smaller data set for some limit queries. With split sampling, user already specific what a percentage to sample, there is no need to further run the query against a small subset of the inputs. From test case split_sample.q, you can already see how they work together well.


- Siying


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/633/#review561
-----------------------------------------------------------


On 2011-04-26 21:19:18, Siying Dong wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/633/
> -----------------------------------------------------------
> 
> (Updated 2011-04-26 21:19:18)
> 
> 
> Review request for hive, Ning Zhang and namit jain.
> 
> 
> Summary
> -------
> 
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.
> 
> This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs.
> 
> 
> This addresses bug HIVE-2121.
>     https://issues.apache.org/jira/browse/HIVE-2121
> 
> 
> Diffs
> -----
> 
>   trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1096852 
>   trunk/conf/hive-default.xml 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1096852 
>   trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q PRE-CREATION 
>   trunk/ql/src/test/queries/clientnegative/split_sample_out_of_range.q PRE-CREATION 
>   trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION 
>   trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION 
>   trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out PRE-CREATION 
>   trunk/ql/src/test/results/clientnegative/split_sample_out_of_range.q.out PRE-CREATION 
>   trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION 
>   trunk/ql/src/test/results/clientpositive/bucket1.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/bucket2.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/bucket3.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample1.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample10.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample2.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample3.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample4.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample5.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample6.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample7.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample8.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample9.q.out 1096852 
> 
> Diff: https://reviews.apache.org/r/633/diff
> 
> 
> Testing
> -------
> 
> TestCliDriver TestNegativeCliDriver, manual tests on real clusters.
> 
> 
> Thanks,
> 
> Siying
> 
>

Re: Review Request: Input Sampling Splits

Posted by Ning Zhang <nz...@fb.com>.


> On 2011-04-26 20:50:30, Siying Dong wrote:
> > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java, line 498
> > <https://reviews.apache.org/r/633/diff/1/?file=16093#file16093line498>
> >
> >     I feel like it is a little hard to explain what this sample guarantees. It basically only guarantees that we fetch at least the sampled percentage of source data. Not exact number, nor guarantee for #rows. I think an option to disable it is a way to avoid confusion in some ways. How do you think?

I think if we specify clearly the semantics of block-level sample in the wiki/documentation, there shouldn't be much confusion. In fact I think it is much easier to explain than the bucket-based sampling. In addition if the user has confusions about the semantics, throwing an SemanticException won't help them understand.  I think the only use case for this parameter is to act as a gatekeeper to this feature if we found a bug in it and want to disable the feature quickly. That should be able to be achieved by switching branches quickly. If we have a gatekeeper parameter for each feature, the conf will grow unnecessarily large quickly. 


> On 2011-04-26 20:50:30, Siying Dong wrote:
> > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java, line 6392
> > <https://reviews.apache.org/r/633/diff/1/?file=16093#file16093line6392>
> >
> >     limit can be combined with block sampling. Just this optimization for limit doesn't make sense when users already sample the input data and we won't get much benefit.

I think combining these two still makes sense: 1) as you mentioned block sampling is not limiting on the # of rows, but limit is. Combining these two allows the users to get approximately N rows quickly. 2) this restriction makes an exception in terms of the query language composition. From the language syntax, it is allowed and makes senses to combine block-sampling and limit, but the user will get a SemanticException if they do. I think SemanticException should be thrown only when there is a legitimate semantic error (e.g., the percentage number is negative). If you feel that it is not a major use case and would rather do it in a follow-up JIRA, we should document it in TODO and file a JIRA for it. 


- Ning


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/633/#review561
-----------------------------------------------------------


On 2011-04-26 21:19:18, Siying Dong wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/633/
> -----------------------------------------------------------
> 
> (Updated 2011-04-26 21:19:18)
> 
> 
> Review request for hive, Ning Zhang and namit jain.
> 
> 
> Summary
> -------
> 
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.
> 
> This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs.
> 
> 
> This addresses bug HIVE-2121.
>     https://issues.apache.org/jira/browse/HIVE-2121
> 
> 
> Diffs
> -----
> 
>   trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1096852 
>   trunk/conf/hive-default.xml 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1096852 
>   trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q PRE-CREATION 
>   trunk/ql/src/test/queries/clientnegative/split_sample_out_of_range.q PRE-CREATION 
>   trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION 
>   trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION 
>   trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out PRE-CREATION 
>   trunk/ql/src/test/results/clientnegative/split_sample_out_of_range.q.out PRE-CREATION 
>   trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION 
>   trunk/ql/src/test/results/clientpositive/bucket1.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/bucket2.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/bucket3.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample1.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample10.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample2.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample3.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample4.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample5.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample6.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample7.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample8.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample9.q.out 1096852 
> 
> Diff: https://reviews.apache.org/r/633/diff
> 
> 
> Testing
> -------
> 
> TestCliDriver TestNegativeCliDriver, manual tests on real clusters.
> 
> 
> Thanks,
> 
> Siying
> 
>

Re: Review Request: Input Sampling Splits

Posted by Ning Zhang <nz...@fb.com>.


> On 2011-04-26 20:50:30, Siying Dong wrote:
> > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java, line 498
> > <https://reviews.apache.org/r/633/diff/1/?file=16093#file16093line498>
> >
> >     I feel like it is a little hard to explain what this sample guarantees. It basically only guarantees that we fetch at least the sampled percentage of source data. Not exact number, nor guarantee for #rows. I think an option to disable it is a way to avoid confusion in some ways. How do you think?
> 
> Ning Zhang wrote:
>     I think if we specify clearly the semantics of block-level sample in the wiki/documentation, there shouldn't be much confusion. In fact I think it is much easier to explain than the bucket-based sampling. In addition if the user has confusions about the semantics, throwing an SemanticException won't help them understand.  I think the only use case for this parameter is to act as a gatekeeper to this feature if we found a bug in it and want to disable the feature quickly. That should be able to be achieved by switching branches quickly. If we have a gatekeeper parameter for each feature, the conf will grow unnecessarily large quickly.
> 
> Siying Dong wrote:
>     OK. I'll remove this switch. We need to document and communicate very well to users. People will easily misunderstand this.

Sounds good. Yes, we should clearly document it in the wiki by what it does and does not. 


> On 2011-04-26 20:50:30, Siying Dong wrote:
> > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java, line 6392
> > <https://reviews.apache.org/r/633/diff/1/?file=16093#file16093line6392>
> >
> >     limit can be combined with block sampling. Just this optimization for limit doesn't make sense when users already sample the input data and we won't get much benefit.
> 
> Ning Zhang wrote:
>     I think combining these two still makes sense: 1) as you mentioned block sampling is not limiting on the # of rows, but limit is. Combining these two allows the users to get approximately N rows quickly. 2) this restriction makes an exception in terms of the query language composition. From the language syntax, it is allowed and makes senses to combine block-sampling and limit, but the user will get a SemanticException if they do. I think SemanticException should be thrown only when there is a legitimate semantic error (e.g., the percentage number is negative). If you feel that it is not a major use case and would rather do it in a follow-up JIRA, we should document it in TODO and file a JIRA for it.
> 
> Siying Dong wrote:
>     I think it is the misunderstanding here. Limit works with split sampling well. No exception will be thrown in any of those two combination and the result will be what we expected.
>     This condition only disabled the optimization that runs against a smaller data set for some limit queries. With split sampling, user already specific what a percentage to sample, there is no need to further run the query against a small subset of the inputs. From test case split_sample.q, you can already see how they work together well.

oh I see. Thanks for the clarification!


- Ning


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/633/#review561
-----------------------------------------------------------


On 2011-04-26 21:19:18, Siying Dong wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/633/
> -----------------------------------------------------------
> 
> (Updated 2011-04-26 21:19:18)
> 
> 
> Review request for hive, Ning Zhang and namit jain.
> 
> 
> Summary
> -------
> 
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.
> 
> This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs.
> 
> 
> This addresses bug HIVE-2121.
>     https://issues.apache.org/jira/browse/HIVE-2121
> 
> 
> Diffs
> -----
> 
>   trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1096852 
>   trunk/conf/hive-default.xml 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1096852 
>   trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q PRE-CREATION 
>   trunk/ql/src/test/queries/clientnegative/split_sample_out_of_range.q PRE-CREATION 
>   trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION 
>   trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION 
>   trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out PRE-CREATION 
>   trunk/ql/src/test/results/clientnegative/split_sample_out_of_range.q.out PRE-CREATION 
>   trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION 
>   trunk/ql/src/test/results/clientpositive/bucket1.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/bucket2.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/bucket3.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample1.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample10.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample2.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample3.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample4.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample5.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample6.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample7.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample8.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample9.q.out 1096852 
> 
> Diff: https://reviews.apache.org/r/633/diff
> 
> 
> Testing
> -------
> 
> TestCliDriver TestNegativeCliDriver, manual tests on real clusters.
> 
> 
> Thanks,
> 
> Siying
> 
>

Re: Review Request: Input Sampling Splits

Posted by Siying Dong <si...@fb.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/633/#review561
-----------------------------------------------------------



trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
<https://reviews.apache.org/r/633/#comment1196>

    I feel like it is a little hard to explain what this sample guarantees. It basically only guarantees that we fetch at least the sampled percentage of source data. Not exact number, nor guarantee for #rows. I think an option to disable it is a way to avoid confusion in some ways. How do you think?



trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
<https://reviews.apache.org/r/633/#comment1197>

    limit can be combined with block sampling. Just this optimization for limit doesn't make sense when users already sample the input data and we won't get much benefit.


- Siying


On 2011-04-20 18:28:29, Siying Dong wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/633/
> -----------------------------------------------------------
> 
> (Updated 2011-04-20 18:28:29)
> 
> 
> Review request for hive, Ning Zhang and namit jain.
> 
> 
> Summary
> -------
> 
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.
> 
> This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs.
> 
> 
> This addresses bug HIVE-2121.
>     https://issues.apache.org/jira/browse/HIVE-2121
> 
> 
> Diffs
> -----
> 
>   trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1095244 
>   trunk/conf/hive-default.xml 1095244 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1095244 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1095244 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1095244 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1095244 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1095244 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1095244 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1095244 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1095244 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1095244 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1095244 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1095244 
>   trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q PRE-CREATION 
>   trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION 
>   trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION 
>   trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out PRE-CREATION 
>   trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION 
>   trunk/ql/src/test/results/clientpositive/bucket1.q.out 1095244 
>   trunk/ql/src/test/results/clientpositive/bucket2.q.out 1095244 
>   trunk/ql/src/test/results/clientpositive/bucket3.q.out 1095244 
>   trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1095244 
>   trunk/ql/src/test/results/clientpositive/sample1.q.out 1095244 
>   trunk/ql/src/test/results/clientpositive/sample10.q.out 1095244 
>   trunk/ql/src/test/results/clientpositive/sample2.q.out 1095244 
>   trunk/ql/src/test/results/clientpositive/sample3.q.out 1095244 
>   trunk/ql/src/test/results/clientpositive/sample4.q.out 1095244 
>   trunk/ql/src/test/results/clientpositive/sample5.q.out 1095244 
>   trunk/ql/src/test/results/clientpositive/sample6.q.out 1095244 
>   trunk/ql/src/test/results/clientpositive/sample7.q.out 1095244 
>   trunk/ql/src/test/results/clientpositive/sample8.q.out 1095244 
>   trunk/ql/src/test/results/clientpositive/sample9.q.out 1095244 
> 
> Diff: https://reviews.apache.org/r/633/diff
> 
> 
> Testing
> -------
> 
> TestCliDriver TestNegativeCliDriver, manual tests on real clusters.
> 
> 
> Thanks,
> 
> Siying
> 
>

Re: Review Request: Input Sampling Splits

Posted by Ning Zhang <nz...@fb.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/633/#review558
-----------------------------------------------------------



trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
<https://reviews.apache.org/r/633/#comment1178>

    The naming of this parameter is a little bit confusing: the parameter key is called "randomnumber" but the value of it is a fixed number.  Do you mean this number is actually the seed to generate samples?



trunk/conf/hive-default.xml
<https://reviews.apache.org/r/633/#comment1179>

    same as above.



trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
<https://reviews.apache.org/r/633/#comment1195>

    better add a comment for this function explain what the rationale behind sampling splits.



trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
<https://reviews.apache.org/r/633/#comment1192>

    we should declare retLists as interface (List) rather than implementation (ArrayList)



trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
<https://reviews.apache.org/r/633/#comment1193>

    same here, should declare it as Map rather than HashMap.



trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
<https://reviews.apache.org/r/633/#comment1194>

    can you add comments here? 



trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
<https://reviews.apache.org/r/633/#comment1182>

    can you add a comment on what this Map is used for and what are the key and value of the Map?



trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
<https://reviews.apache.org/r/633/#comment1184>

    I think we don't need to introduce this parameter at all. For one it is a new feature rather than a different code path for an old feature. We don't need the "fallback" protection by a new parameter. Secondly, throwing a SemanticException here can only make the user asking how to solve this problem, which is to set the parameter to true. So it seems that it doesn't make sense to set the parameter to false in any cases. So why not remove the this parameter to make it cleaner. 



trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
<https://reviews.apache.org/r/633/#comment1189>

    you may want to check the percentage number (if it is a valid double and within the range [0,100]) and throw SemanticException if it is invalid before creating a SplitSample object.



trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
<https://reviews.apache.org/r/633/#comment1190>

    why limit cannot be combined with block sampling?



trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java
<https://reviews.apache.org/r/633/#comment1191>

    This comment doesn't belong to this class.


- Ning


On 2011-04-20 18:28:29, Siying Dong wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/633/
> -----------------------------------------------------------
> 
> (Updated 2011-04-20 18:28:29)
> 
> 
> Review request for hive, Ning Zhang and namit jain.
> 
> 
> Summary
> -------
> 
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.
> 
> This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs.
> 
> 
> This addresses bug HIVE-2121.
>     https://issues.apache.org/jira/browse/HIVE-2121
> 
> 
> Diffs
> -----
> 
>   trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1095244 
>   trunk/conf/hive-default.xml 1095244 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1095244 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1095244 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1095244 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1095244 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1095244 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1095244 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1095244 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1095244 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1095244 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1095244 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1095244 
>   trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q PRE-CREATION 
>   trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION 
>   trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION 
>   trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out PRE-CREATION 
>   trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION 
>   trunk/ql/src/test/results/clientpositive/bucket1.q.out 1095244 
>   trunk/ql/src/test/results/clientpositive/bucket2.q.out 1095244 
>   trunk/ql/src/test/results/clientpositive/bucket3.q.out 1095244 
>   trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1095244 
>   trunk/ql/src/test/results/clientpositive/sample1.q.out 1095244 
>   trunk/ql/src/test/results/clientpositive/sample10.q.out 1095244 
>   trunk/ql/src/test/results/clientpositive/sample2.q.out 1095244 
>   trunk/ql/src/test/results/clientpositive/sample3.q.out 1095244 
>   trunk/ql/src/test/results/clientpositive/sample4.q.out 1095244 
>   trunk/ql/src/test/results/clientpositive/sample5.q.out 1095244 
>   trunk/ql/src/test/results/clientpositive/sample6.q.out 1095244 
>   trunk/ql/src/test/results/clientpositive/sample7.q.out 1095244 
>   trunk/ql/src/test/results/clientpositive/sample8.q.out 1095244 
>   trunk/ql/src/test/results/clientpositive/sample9.q.out 1095244 
> 
> Diff: https://reviews.apache.org/r/633/diff
> 
> 
> Testing
> -------
> 
> TestCliDriver TestNegativeCliDriver, manual tests on real clusters.
> 
> 
> Thanks,
> 
> Siying
> 
>