You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@hive.apache.org by "Siying Dong (JIRA)" <ji...@apache.org> on 2011/04/20 02:28:05 UTC

[jira] [Created] (HIVE-2121) Input Sampling By Splits

Input Sampling By Splits
------------------------

                 Key: HIVE-2121
                 URL: https://issues.apache.org/jira/browse/HIVE-2121
             Project: Hive
          Issue Type: New Feature
            Reporter: Siying Dong
            Assignee: Siying Dong


We need a better input sampling to serve at least two purposes:
1. test their queries against a smaller data set
2. understand more about how the data look like without scanning the whole table.
A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2121) Input Sampling By Splits

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022277#comment-13022277 ] 

jiraposter@reviews.apache.org commented on HIVE-2121:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/633/
-----------------------------------------------------------

Review request for hive, Ning Zhang and namit jain.


Summary
-------

We need a better input sampling to serve at least two purposes:
1. test their queries against a smaller data set
2. understand more about how the data look like without scanning the whole table.
A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs.


This addresses bug HIVE-2121.
    https://issues.apache.org/jira/browse/HIVE-2121


Diffs
-----

  trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1095244 
  trunk/conf/hive-default.xml 1095244 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1095244 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1095244 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1095244 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1095244 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1095244 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1095244 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1095244 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1095244 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1095244 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1095244 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1095244 
  trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q PRE-CREATION 
  trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION 
  trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION 
  trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out PRE-CREATION 
  trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION 
  trunk/ql/src/test/results/clientpositive/bucket1.q.out 1095244 
  trunk/ql/src/test/results/clientpositive/bucket2.q.out 1095244 
  trunk/ql/src/test/results/clientpositive/bucket3.q.out 1095244 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1095244 
  trunk/ql/src/test/results/clientpositive/sample1.q.out 1095244 
  trunk/ql/src/test/results/clientpositive/sample10.q.out 1095244 
  trunk/ql/src/test/results/clientpositive/sample2.q.out 1095244 
  trunk/ql/src/test/results/clientpositive/sample3.q.out 1095244 
  trunk/ql/src/test/results/clientpositive/sample4.q.out 1095244 
  trunk/ql/src/test/results/clientpositive/sample5.q.out 1095244 
  trunk/ql/src/test/results/clientpositive/sample6.q.out 1095244 
  trunk/ql/src/test/results/clientpositive/sample7.q.out 1095244 
  trunk/ql/src/test/results/clientpositive/sample8.q.out 1095244 
  trunk/ql/src/test/results/clientpositive/sample9.q.out 1095244 

Diff: https://reviews.apache.org/r/633/diff


Testing
-------

TestCliDriver TestNegativeCliDriver, manual tests on real clusters.


Thanks,

Siying



> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-2121) Input Sampling By Splits

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-2121:
-----------------------------

    Status: Open  (was: Patch Available)

> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch, HIVE-2121.3.patch, HIVE-2121.4.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-2121) Input Sampling By Splits

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carl Steinbach updated HIVE-2121:
---------------------------------

      Component/s: Query Processor
    Fix Version/s: 0.8.0

> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>             Fix For: 0.8.0
>
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch, HIVE-2121.3.patch, HIVE-2121.4.patch, HIVE-2121.5.patch, HIVE-2121.6.patch, HIVE-2121.7.patch, HIVE-2121.8.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-2121) Input Sampling By Splits

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-2121:
-----------------------------

    Status: Open  (was: Patch Available)

> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch, HIVE-2121.3.patch, HIVE-2121.4.patch, HIVE-2121.5.patch, HIVE-2121.6.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2121) Input Sampling By Splits

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025438#comment-13025438 ] 

jiraposter@reviews.apache.org commented on HIVE-2121:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/633/
-----------------------------------------------------------

(Updated 2011-04-26 21:19:18.557345)


Review request for hive, Ning Zhang and namit jain.


Changes
-------

Addressing most of Ning's comments.


Summary
-------

We need a better input sampling to serve at least two purposes:
1. test their queries against a smaller data set
2. understand more about how the data look like without scanning the whole table.
A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs.


This addresses bug HIVE-2121.
    https://issues.apache.org/jira/browse/HIVE-2121


Diffs (updated)
-----

  trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1096852 
  trunk/conf/hive-default.xml 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1096852 
  trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q PRE-CREATION 
  trunk/ql/src/test/queries/clientnegative/split_sample_out_of_range.q PRE-CREATION 
  trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION 
  trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION 
  trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out PRE-CREATION 
  trunk/ql/src/test/results/clientnegative/split_sample_out_of_range.q.out PRE-CREATION 
  trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION 
  trunk/ql/src/test/results/clientpositive/bucket1.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/bucket2.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/bucket3.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample1.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample10.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample2.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample3.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample4.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample5.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample6.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample7.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample8.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample9.q.out 1096852 

Diff: https://reviews.apache.org/r/633/diff


Testing
-------

TestCliDriver TestNegativeCliDriver, manual tests on real clusters.


Thanks,

Siying



> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch, HIVE-2121.3.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-2121) Input Sampling By Splits

Posted by "Siying Dong (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siying Dong updated HIVE-2121:
------------------------------

    Status: Patch Available  (was: Open)

> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch, HIVE-2121.3.patch, HIVE-2121.4.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-2121) Input Sampling By Splits

Posted by "Siying Dong (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siying Dong updated HIVE-2121:
------------------------------

    Attachment: HIVE-2121.1.patch

> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2121) Input Sampling By Splits

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13026721#comment-13026721 ] 

jiraposter@reviews.apache.org commented on HIVE-2121:
-----------------------------------------------------

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/633/#review605
-----------------------------------------------------------

trunk/shims/src/0.20/java/org/apache/hadoop/hive/shims/Hadoop20Shims.java
<https://reviews.apache.org/r/633/#comment1249>

    talked to siying offline -

    the check:

    if (split instanceof Hadoop20Shims.InputSplitShim)

    is not needed - this can be replaced by an assert.

    Same in Hadoop20SShims.

    Otherwise looks good

- namit

On 2011-04-28 08:32:17, Siying Dong wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/633/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-04-28 08:32:17)
bq.  
bq.  
bq.  Review request for hive, Ning Zhang and namit jain.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  We need a better input sampling to serve at least two purposes:
bq.  1. test their queries against a smaller data set
bq.  2. understand more about how the data look like without scanning the whole table.
bq.  A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.
bq.  
bq.  This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs.
bq.  
bq.  
bq.  This addresses bug HIVE-2121.
bq.      https://issues.apache.org/jira/browse/HIVE-2121
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1096852 
bq.    trunk/conf/hive-default.xml 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1096852 
bq.    trunk/ql/src/test/queries/clientnegative/split_sample_out_of_range.q PRE-CREATION 
bq.    trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION 
bq.    trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION 
bq.    trunk/ql/src/test/results/clientnegative/split_sample_out_of_range.q.out PRE-CREATION 
bq.    trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION 
bq.    trunk/ql/src/test/results/clientpositive/bucket1.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/bucket2.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/bucket3.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample1.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample10.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample2.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample3.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample4.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample5.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample6.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample7.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample8.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample9.q.out 1096852 
bq.    trunk/shims/src/0.20/java/org/apache/hadoop/hive/shims/Hadoop20Shims.java 1096852 
bq.    trunk/shims/src/0.20S/java/org/apache/hadoop/hive/shims/Hadoop20SShims.java 1096852 
bq.    trunk/shims/src/common/java/org/apache/hadoop/hive/shims/HadoopShims.java 1096852 
bq.  
bq.  Diff: https://reviews.apache.org/r/633/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  TestCliDriver TestNegativeCliDriver, manual tests on real clusters.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Siying
bq.  
bq.

> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch, HIVE-2121.3.patch, HIVE-2121.4.patch, HIVE-2121.5.patch, HIVE-2121.6.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2121) Input Sampling By Splits

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13026204#comment-13026204 ] 

jiraposter@reviews.apache.org commented on HIVE-2121:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/633/
-----------------------------------------------------------

(Updated 2011-04-28 08:32:17.534107)


Review request for hive, Ning Zhang and namit jain.


Changes
-------

Two changes made according to Namit's comments:
1. explain will print out some about the sampling. (It might not be the best way to print but it follows the framework)
2. the granularity of sampling is down from split-level to HDFS block level.


Summary
-------

We need a better input sampling to serve at least two purposes:
1. test their queries against a smaller data set
2. understand more about how the data look like without scanning the whole table.
A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs.


This addresses bug HIVE-2121.
    https://issues.apache.org/jira/browse/HIVE-2121


Diffs (updated)
-----

  trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1096852 
  trunk/conf/hive-default.xml 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1096852 
  trunk/ql/src/test/queries/clientnegative/split_sample_out_of_range.q PRE-CREATION 
  trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION 
  trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION 
  trunk/ql/src/test/results/clientnegative/split_sample_out_of_range.q.out PRE-CREATION 
  trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION 
  trunk/ql/src/test/results/clientpositive/bucket1.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/bucket2.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/bucket3.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample1.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample10.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample2.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample3.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample4.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample5.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample6.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample7.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample8.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample9.q.out 1096852 
  trunk/shims/src/0.20/java/org/apache/hadoop/hive/shims/Hadoop20Shims.java 1096852 
  trunk/shims/src/0.20S/java/org/apache/hadoop/hive/shims/Hadoop20SShims.java 1096852 
  trunk/shims/src/common/java/org/apache/hadoop/hive/shims/HadoopShims.java 1096852 

Diff: https://reviews.apache.org/r/633/diff


Testing
-------

TestCliDriver TestNegativeCliDriver, manual tests on real clusters.


Thanks,

Siying



> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch, HIVE-2121.3.patch, HIVE-2121.4.patch, HIVE-2121.5.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2121) Input Sampling By Splits

Posted by "Siying Dong (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13021878#comment-13021878 ] 

Siying Dong commented on HIVE-2121:
-----------------------------------

still need to modify some test results.

> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2121) Input Sampling By Splits

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025919#comment-13025919 ] 

Namit Jain commented on HIVE-2121:
----------------------------------

A few other comments:

   The percentage of data read is currently a function 
   of the split size - we should mark the last split 
   specially, and only read the required data at runtime.
   I mean, if we have only 1 split, and we need to sample
   10% of the data, there should be a way to do so. Currently,
   it seems impossible.

> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch, HIVE-2121.3.patch, HIVE-2121.4.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2121) Input Sampling By Splits

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13026779#comment-13026779 ] 

Namit Jain commented on HIVE-2121:
----------------------------------

+1

> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch, HIVE-2121.3.patch, HIVE-2121.4.patch, HIVE-2121.5.patch, HIVE-2121.6.patch, HIVE-2121.7.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2121) Input Sampling By Splits

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13026719#comment-13026719 ] 

Namit Jain commented on HIVE-2121:
----------------------------------

minor comments in review-board.



> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch, HIVE-2121.3.patch, HIVE-2121.4.patch, HIVE-2121.5.patch, HIVE-2121.6.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2121) Input Sampling By Splits

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025498#comment-13025498 ] 

jiraposter@reviews.apache.org commented on HIVE-2121:
-----------------------------------------------------

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/633/#review567
-----------------------------------------------------------

trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
<https://reviews.apache.org/r/633/#comment1205>

    This function could be quite expensive inside the loops. You may want to test a case where there are large # of partitions and each partition contains a large # of small files. 

- Ning

On 2011-04-26 21:19:18, Siying Dong wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/633/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-04-26 21:19:18)
bq.  
bq.  
bq.  Review request for hive, Ning Zhang and namit jain.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  We need a better input sampling to serve at least two purposes:
bq.  1. test their queries against a smaller data set
bq.  2. understand more about how the data look like without scanning the whole table.
bq.  A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.
bq.  
bq.  This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs.
bq.  
bq.  
bq.  This addresses bug HIVE-2121.
bq.      https://issues.apache.org/jira/browse/HIVE-2121
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1096852 
bq.    trunk/conf/hive-default.xml 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1096852 
bq.    trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q PRE-CREATION 
bq.    trunk/ql/src/test/queries/clientnegative/split_sample_out_of_range.q PRE-CREATION 
bq.    trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION 
bq.    trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION 
bq.    trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out PRE-CREATION 
bq.    trunk/ql/src/test/results/clientnegative/split_sample_out_of_range.q.out PRE-CREATION 
bq.    trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION 
bq.    trunk/ql/src/test/results/clientpositive/bucket1.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/bucket2.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/bucket3.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample1.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample10.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample2.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample3.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample4.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample5.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample6.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample7.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample8.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample9.q.out 1096852 
bq.  
bq.  Diff: https://reviews.apache.org/r/633/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  TestCliDriver TestNegativeCliDriver, manual tests on real clusters.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Siying
bq.  
bq.

> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch, HIVE-2121.3.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-2121) Input Sampling By Splits

Posted by "Siying Dong (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siying Dong updated HIVE-2121:
------------------------------

    Attachment: HIVE-2121.8.patch

fix tests or test outputs.

> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch, HIVE-2121.3.patch, HIVE-2121.4.patch, HIVE-2121.5.patch, HIVE-2121.6.patch, HIVE-2121.7.patch, HIVE-2121.8.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2121) Input Sampling By Splits

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022285#comment-13022285 ] 

Ning Zhang commented on HIVE-2121:
----------------------------------

I will take a look.

> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-2121) Input Sampling By Splits

Posted by "Siying Dong (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siying Dong updated HIVE-2121:
------------------------------

    Attachment: HIVE-2121.4.patch

remove the switch. 
I ran it against about 800 partitions input and didn't see obvious client time increase.

> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch, HIVE-2121.3.patch, HIVE-2121.4.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2121) Input Sampling By Splits

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13026846#comment-13026846 ] 

Namit Jain commented on HIVE-2121:
----------------------------------

Some tests are failing:

split_sample.q


Some tests in TestMinimrCliDriver (log updates)

almost all test in TestParse (log updates)





> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch, HIVE-2121.3.patch, HIVE-2121.4.patch, HIVE-2121.5.patch, HIVE-2121.6.patch, HIVE-2121.7.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-2121) Input Sampling By Splits

Posted by "Siying Dong (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siying Dong updated HIVE-2121:
------------------------------

    Attachment: HIVE-2121.1.patch

> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-2121) Input Sampling By Splits

Posted by "Siying Dong (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siying Dong updated HIVE-2121:
------------------------------

    Attachment: HIVE-2121.2.patch

> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-2121) Input Sampling By Splits

Posted by "Siying Dong (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siying Dong updated HIVE-2121:
------------------------------

    Status: Patch Available  (was: Open)

> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch, HIVE-2121.3.patch, HIVE-2121.4.patch, HIVE-2121.5.patch, HIVE-2121.6.patch, HIVE-2121.7.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2121) Input Sampling By Splits

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025614#comment-13025614 ] 

Namit Jain commented on HIVE-2121:
----------------------------------

A few other comments:

1. The behavior should be reflected in explain plan extended.
   Yongqiang recently added a similar plan change in pathtoAlias etc.
   in HIVE-2126. Take a look as an example.
2. It would increase the utility if I can add a new session variable
   where I specify the split percentage - this will be applicable to 
   all tables (this can also be done in a follow-up)

> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch, HIVE-2121.3.patch, HIVE-2121.4.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2121) Input Sampling By Splits

Posted by "Siying Dong (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022271#comment-13022271 ] 

Siying Dong commented on HIVE-2121:
-----------------------------------

changed test outputs. test outputs of sample tests changed as I chanced token name for sampling using buckets from TOK_TABLESAMPLE to TOK_TABLEBUCKETSAMPLE (as I am adding a TOK_TABLESPLITSAMPLE). Other than that, there should be no change.

This patch has a limit: even if we only sample one split out, execution is not switched to local mode if possible, as getSplits() is called in job submit part, which already passed the step to choose local mode.

> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2121) Input Sampling By Splits

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025476#comment-13025476 ] 

jiraposter@reviews.apache.org commented on HIVE-2121:
-----------------------------------------------------

bq.  On 2011-04-26 20:50:30, Siying Dong wrote:
bq.  > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java, line 498
bq.  > <https://reviews.apache.org/r/633/diff/1/?file=16093#file16093line498>
bq.  >
bq.  >     I feel like it is a little hard to explain what this sample guarantees. It basically only guarantees that we fetch at least the sampled percentage of source data. Not exact number, nor guarantee for #rows. I think an option to disable it is a way to avoid confusion in some ways. How do you think?
bq.  
bq.  Ning Zhang wrote:
bq.      I think if we specify clearly the semantics of block-level sample in the wiki/documentation, there shouldn't be much confusion. In fact I think it is much easier to explain than the bucket-based sampling. In addition if the user has confusions about the semantics, throwing an SemanticException won't help them understand.  I think the only use case for this parameter is to act as a gatekeeper to this feature if we found a bug in it and want to disable the feature quickly. That should be able to be achieved by switching branches quickly. If we have a gatekeeper parameter for each feature, the conf will grow unnecessarily large quickly.
bq.  
bq.  Siying Dong wrote:
bq.      OK. I'll remove this switch. We need to document and communicate very well to users. People will easily misunderstand this.

Sounds good. Yes, we should clearly document it in the wiki by what it does and does not. 

bq.  On 2011-04-26 20:50:30, Siying Dong wrote:
bq.  > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java, line 6392
bq.  > <https://reviews.apache.org/r/633/diff/1/?file=16093#file16093line6392>
bq.  >
bq.  >     limit can be combined with block sampling. Just this optimization for limit doesn't make sense when users already sample the input data and we won't get much benefit.
bq.  
bq.  Ning Zhang wrote:
bq.      I think combining these two still makes sense: 1) as you mentioned block sampling is not limiting on the # of rows, but limit is. Combining these two allows the users to get approximately N rows quickly. 2) this restriction makes an exception in terms of the query language composition. From the language syntax, it is allowed and makes senses to combine block-sampling and limit, but the user will get a SemanticException if they do. I think SemanticException should be thrown only when there is a legitimate semantic error (e.g., the percentage number is negative). If you feel that it is not a major use case and would rather do it in a follow-up JIRA, we should document it in TODO and file a JIRA for it.
bq.  
bq.  Siying Dong wrote:
bq.      I think it is the misunderstanding here. Limit works with split sampling well. No exception will be thrown in any of those two combination and the result will be what we expected.
bq.      This condition only disabled the optimization that runs against a smaller data set for some limit queries. With split sampling, user already specific what a percentage to sample, there is no need to further run the query against a small subset of the inputs. From test case split_sample.q, you can already see how they work together well.

oh I see. Thanks for the clarification!

- Ning

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/633/#review561
-----------------------------------------------------------

On 2011-04-26 21:19:18, Siying Dong wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/633/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-04-26 21:19:18)
bq.  
bq.  
bq.  Review request for hive, Ning Zhang and namit jain.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  We need a better input sampling to serve at least two purposes:
bq.  1. test their queries against a smaller data set
bq.  2. understand more about how the data look like without scanning the whole table.
bq.  A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.
bq.  
bq.  This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs.
bq.  
bq.  
bq.  This addresses bug HIVE-2121.
bq.      https://issues.apache.org/jira/browse/HIVE-2121
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1096852 
bq.    trunk/conf/hive-default.xml 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1096852 
bq.    trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q PRE-CREATION 
bq.    trunk/ql/src/test/queries/clientnegative/split_sample_out_of_range.q PRE-CREATION 
bq.    trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION 
bq.    trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION 
bq.    trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out PRE-CREATION 
bq.    trunk/ql/src/test/results/clientnegative/split_sample_out_of_range.q.out PRE-CREATION 
bq.    trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION 
bq.    trunk/ql/src/test/results/clientpositive/bucket1.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/bucket2.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/bucket3.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample1.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample10.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample2.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample3.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample4.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample5.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample6.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample7.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample8.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample9.q.out 1096852 
bq.  
bq.  Diff: https://reviews.apache.org/r/633/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  TestCliDriver TestNegativeCliDriver, manual tests on real clusters.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Siying
bq.  
bq.

> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch, HIVE-2121.3.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2121) Input Sampling By Splits

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025107#comment-13025107 ] 

jiraposter@reviews.apache.org commented on HIVE-2121:
-----------------------------------------------------

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/633/#review558
-----------------------------------------------------------

trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
<https://reviews.apache.org/r/633/#comment1178>

    The naming of this parameter is a little bit confusing: the parameter key is called "randomnumber" but the value of it is a fixed number.  Do you mean this number is actually the seed to generate samples?

trunk/conf/hive-default.xml
<https://reviews.apache.org/r/633/#comment1179>

    same as above.

trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
<https://reviews.apache.org/r/633/#comment1195>

    better add a comment for this function explain what the rationale behind sampling splits.

trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
<https://reviews.apache.org/r/633/#comment1192>

    we should declare retLists as interface (List) rather than implementation (ArrayList)

trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
<https://reviews.apache.org/r/633/#comment1193>

    same here, should declare it as Map rather than HashMap.

trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
<https://reviews.apache.org/r/633/#comment1194>

    can you add comments here? 

trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
<https://reviews.apache.org/r/633/#comment1182>

    can you add a comment on what this Map is used for and what are the key and value of the Map?

trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
<https://reviews.apache.org/r/633/#comment1184>

    I think we don't need to introduce this parameter at all. For one it is a new feature rather than a different code path for an old feature. We don't need the "fallback" protection by a new parameter. Secondly, throwing a SemanticException here can only make the user asking how to solve this problem, which is to set the parameter to true. So it seems that it doesn't make sense to set the parameter to false in any cases. So why not remove the this parameter to make it cleaner. 

trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
<https://reviews.apache.org/r/633/#comment1189>

    you may want to check the percentage number (if it is a valid double and within the range [0,100]) and throw SemanticException if it is invalid before creating a SplitSample object.

trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
<https://reviews.apache.org/r/633/#comment1190>

    why limit cannot be combined with block sampling?

trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java
<https://reviews.apache.org/r/633/#comment1191>

    This comment doesn't belong to this class.

- Ning

On 2011-04-20 18:28:29, Siying Dong wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/633/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-04-20 18:28:29)
bq.  
bq.  
bq.  Review request for hive, Ning Zhang and namit jain.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  We need a better input sampling to serve at least two purposes:
bq.  1. test their queries against a smaller data set
bq.  2. understand more about how the data look like without scanning the whole table.
bq.  A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.
bq.  
bq.  This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs.
bq.  
bq.  
bq.  This addresses bug HIVE-2121.
bq.      https://issues.apache.org/jira/browse/HIVE-2121
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1095244 
bq.    trunk/conf/hive-default.xml 1095244 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1095244 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1095244 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1095244 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1095244 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1095244 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1095244 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1095244 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1095244 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1095244 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1095244 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1095244 
bq.    trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q PRE-CREATION 
bq.    trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION 
bq.    trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION 
bq.    trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out PRE-CREATION 
bq.    trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION 
bq.    trunk/ql/src/test/results/clientpositive/bucket1.q.out 1095244 
bq.    trunk/ql/src/test/results/clientpositive/bucket2.q.out 1095244 
bq.    trunk/ql/src/test/results/clientpositive/bucket3.q.out 1095244 
bq.    trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1095244 
bq.    trunk/ql/src/test/results/clientpositive/sample1.q.out 1095244 
bq.    trunk/ql/src/test/results/clientpositive/sample10.q.out 1095244 
bq.    trunk/ql/src/test/results/clientpositive/sample2.q.out 1095244 
bq.    trunk/ql/src/test/results/clientpositive/sample3.q.out 1095244 
bq.    trunk/ql/src/test/results/clientpositive/sample4.q.out 1095244 
bq.    trunk/ql/src/test/results/clientpositive/sample5.q.out 1095244 
bq.    trunk/ql/src/test/results/clientpositive/sample6.q.out 1095244 
bq.    trunk/ql/src/test/results/clientpositive/sample7.q.out 1095244 
bq.    trunk/ql/src/test/results/clientpositive/sample8.q.out 1095244 
bq.    trunk/ql/src/test/results/clientpositive/sample9.q.out 1095244 
bq.  
bq.  Diff: https://reviews.apache.org/r/633/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  TestCliDriver TestNegativeCliDriver, manual tests on real clusters.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Siying
bq.  
bq.

> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-2121) Input Sampling By Splits

Posted by "Siying Dong (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siying Dong updated HIVE-2121:
------------------------------

    Attachment: HIVE-2121.7.patch

move instanceof InputSplitShim to assert.

> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch, HIVE-2121.3.patch, HIVE-2121.4.patch, HIVE-2121.5.patch, HIVE-2121.6.patch, HIVE-2121.7.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2121) Input Sampling By Splits

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025465#comment-13025465 ] 

jiraposter@reviews.apache.org commented on HIVE-2121:
-----------------------------------------------------

bq.  On 2011-04-26 20:50:30, Siying Dong wrote:
bq.  > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java, line 498
bq.  > <https://reviews.apache.org/r/633/diff/1/?file=16093#file16093line498>
bq.  >
bq.  >     I feel like it is a little hard to explain what this sample guarantees. It basically only guarantees that we fetch at least the sampled percentage of source data. Not exact number, nor guarantee for #rows. I think an option to disable it is a way to avoid confusion in some ways. How do you think?

I think if we specify clearly the semantics of block-level sample in the wiki/documentation, there shouldn't be much confusion. In fact I think it is much easier to explain than the bucket-based sampling. In addition if the user has confusions about the semantics, throwing an SemanticException won't help them understand.  I think the only use case for this parameter is to act as a gatekeeper to this feature if we found a bug in it and want to disable the feature quickly. That should be able to be achieved by switching branches quickly. If we have a gatekeeper parameter for each feature, the conf will grow unnecessarily large quickly. 

bq.  On 2011-04-26 20:50:30, Siying Dong wrote:
bq.  > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java, line 6392
bq.  > <https://reviews.apache.org/r/633/diff/1/?file=16093#file16093line6392>
bq.  >
bq.  >     limit can be combined with block sampling. Just this optimization for limit doesn't make sense when users already sample the input data and we won't get much benefit.

I think combining these two still makes sense: 1) as you mentioned block sampling is not limiting on the # of rows, but limit is. Combining these two allows the users to get approximately N rows quickly. 2) this restriction makes an exception in terms of the query language composition. From the language syntax, it is allowed and makes senses to combine block-sampling and limit, but the user will get a SemanticException if they do. I think SemanticException should be thrown only when there is a legitimate semantic error (e.g., the percentage number is negative). If you feel that it is not a major use case and would rather do it in a follow-up JIRA, we should document it in TODO and file a JIRA for it. 

- Ning

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/633/#review561
-----------------------------------------------------------

On 2011-04-26 21:19:18, Siying Dong wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/633/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-04-26 21:19:18)
bq.  
bq.  
bq.  Review request for hive, Ning Zhang and namit jain.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  We need a better input sampling to serve at least two purposes:
bq.  1. test their queries against a smaller data set
bq.  2. understand more about how the data look like without scanning the whole table.
bq.  A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.
bq.  
bq.  This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs.
bq.  
bq.  
bq.  This addresses bug HIVE-2121.
bq.      https://issues.apache.org/jira/browse/HIVE-2121
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1096852 
bq.    trunk/conf/hive-default.xml 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1096852 
bq.    trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q PRE-CREATION 
bq.    trunk/ql/src/test/queries/clientnegative/split_sample_out_of_range.q PRE-CREATION 
bq.    trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION 
bq.    trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION 
bq.    trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out PRE-CREATION 
bq.    trunk/ql/src/test/results/clientnegative/split_sample_out_of_range.q.out PRE-CREATION 
bq.    trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION 
bq.    trunk/ql/src/test/results/clientpositive/bucket1.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/bucket2.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/bucket3.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample1.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample10.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample2.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample3.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample4.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample5.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample6.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample7.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample8.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample9.q.out 1096852 
bq.  
bq.  Diff: https://reviews.apache.org/r/633/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  TestCliDriver TestNegativeCliDriver, manual tests on real clusters.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Siying
bq.  
bq.

> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch, HIVE-2121.3.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-2121) Input Sampling By Splits

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carl Steinbach updated HIVE-2121:
---------------------------------

    Release Note: This patch adds support for the 'TABLESAMPLE(x PERCENT)' clause.
    
> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>             Fix For: 0.8.0
>
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch, HIVE-2121.3.patch, HIVE-2121.4.patch, HIVE-2121.5.patch, HIVE-2121.6.patch, HIVE-2121.7.patch, HIVE-2121.8.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2121) Input Sampling By Splits

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022262#comment-13022262 ] 

Ning Zhang commented on HIVE-2121:
----------------------------------

Siying, can you create a review request on the review board?

> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2121) Input Sampling By Splits

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025426#comment-13025426 ] 

jiraposter@reviews.apache.org commented on HIVE-2121:
-----------------------------------------------------

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/633/#review561
-----------------------------------------------------------

trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
<https://reviews.apache.org/r/633/#comment1196>

    I feel like it is a little hard to explain what this sample guarantees. It basically only guarantees that we fetch at least the sampled percentage of source data. Not exact number, nor guarantee for #rows. I think an option to disable it is a way to avoid confusion in some ways. How do you think?

trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
<https://reviews.apache.org/r/633/#comment1197>

    limit can be combined with block sampling. Just this optimization for limit doesn't make sense when users already sample the input data and we won't get much benefit.

- Siying

On 2011-04-20 18:28:29, Siying Dong wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/633/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-04-20 18:28:29)
bq.  
bq.  
bq.  Review request for hive, Ning Zhang and namit jain.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  We need a better input sampling to serve at least two purposes:
bq.  1. test their queries against a smaller data set
bq.  2. understand more about how the data look like without scanning the whole table.
bq.  A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.
bq.  
bq.  This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs.
bq.  
bq.  
bq.  This addresses bug HIVE-2121.
bq.      https://issues.apache.org/jira/browse/HIVE-2121
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1095244 
bq.    trunk/conf/hive-default.xml 1095244 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1095244 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1095244 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1095244 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1095244 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1095244 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1095244 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1095244 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1095244 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1095244 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1095244 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1095244 
bq.    trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q PRE-CREATION 
bq.    trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION 
bq.    trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION 
bq.    trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out PRE-CREATION 
bq.    trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION 
bq.    trunk/ql/src/test/results/clientpositive/bucket1.q.out 1095244 
bq.    trunk/ql/src/test/results/clientpositive/bucket2.q.out 1095244 
bq.    trunk/ql/src/test/results/clientpositive/bucket3.q.out 1095244 
bq.    trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1095244 
bq.    trunk/ql/src/test/results/clientpositive/sample1.q.out 1095244 
bq.    trunk/ql/src/test/results/clientpositive/sample10.q.out 1095244 
bq.    trunk/ql/src/test/results/clientpositive/sample2.q.out 1095244 
bq.    trunk/ql/src/test/results/clientpositive/sample3.q.out 1095244 
bq.    trunk/ql/src/test/results/clientpositive/sample4.q.out 1095244 
bq.    trunk/ql/src/test/results/clientpositive/sample5.q.out 1095244 
bq.    trunk/ql/src/test/results/clientpositive/sample6.q.out 1095244 
bq.    trunk/ql/src/test/results/clientpositive/sample7.q.out 1095244 
bq.    trunk/ql/src/test/results/clientpositive/sample8.q.out 1095244 
bq.    trunk/ql/src/test/results/clientpositive/sample9.q.out 1095244 
bq.  
bq.  Diff: https://reviews.apache.org/r/633/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  TestCliDriver TestNegativeCliDriver, manual tests on real clusters.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Siying
bq.  
bq.

> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-2121) Input Sampling By Splits

Posted by "Siying Dong (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siying Dong updated HIVE-2121:
------------------------------

    Attachment:     (was: HIVE-2121.1.patch)

> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-2121) Input Sampling By Splits

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-2121:
-----------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]
          Status: Resolved  (was: Patch Available)

Committed. Thanks Siying

> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch, HIVE-2121.3.patch, HIVE-2121.4.patch, HIVE-2121.5.patch, HIVE-2121.6.patch, HIVE-2121.7.patch, HIVE-2121.8.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-2121) Input Sampling By Splits

Posted by "Siying Dong (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siying Dong updated HIVE-2121:
------------------------------

    Attachment: HIVE-2121.4.patch

remove the switch

> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch, HIVE-2121.3.patch, HIVE-2121.4.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2121) Input Sampling By Splits

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025468#comment-13025468 ] 

jiraposter@reviews.apache.org commented on HIVE-2121:
-----------------------------------------------------

bq.  On 2011-04-26 20:50:30, Siying Dong wrote:
bq.  > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java, line 498
bq.  > <https://reviews.apache.org/r/633/diff/1/?file=16093#file16093line498>
bq.  >
bq.  >     I feel like it is a little hard to explain what this sample guarantees. It basically only guarantees that we fetch at least the sampled percentage of source data. Not exact number, nor guarantee for #rows. I think an option to disable it is a way to avoid confusion in some ways. How do you think?
bq.  
bq.  Ning Zhang wrote:
bq.      I think if we specify clearly the semantics of block-level sample in the wiki/documentation, there shouldn't be much confusion. In fact I think it is much easier to explain than the bucket-based sampling. In addition if the user has confusions about the semantics, throwing an SemanticException won't help them understand.  I think the only use case for this parameter is to act as a gatekeeper to this feature if we found a bug in it and want to disable the feature quickly. That should be able to be achieved by switching branches quickly. If we have a gatekeeper parameter for each feature, the conf will grow unnecessarily large quickly.

OK. I'll remove this switch. We need to document and communicate very well to users. People will easily misunderstand this.

bq.  On 2011-04-26 20:50:30, Siying Dong wrote:
bq.  > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java, line 6392
bq.  > <https://reviews.apache.org/r/633/diff/1/?file=16093#file16093line6392>
bq.  >
bq.  >     limit can be combined with block sampling. Just this optimization for limit doesn't make sense when users already sample the input data and we won't get much benefit.
bq.  
bq.  Ning Zhang wrote:
bq.      I think combining these two still makes sense: 1) as you mentioned block sampling is not limiting on the # of rows, but limit is. Combining these two allows the users to get approximately N rows quickly. 2) this restriction makes an exception in terms of the query language composition. From the language syntax, it is allowed and makes senses to combine block-sampling and limit, but the user will get a SemanticException if they do. I think SemanticException should be thrown only when there is a legitimate semantic error (e.g., the percentage number is negative). If you feel that it is not a major use case and would rather do it in a follow-up JIRA, we should document it in TODO and file a JIRA for it.

I think it is the misunderstanding here. Limit works with split sampling well. No exception will be thrown in any of those two combination and the result will be what we expected.
This condition only disabled the optimization that runs against a smaller data set for some limit queries. With split sampling, user already specific what a percentage to sample, there is no need to further run the query against a small subset of the inputs. From test case split_sample.q, you can already see how they work together well.

- Siying

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/633/#review561
-----------------------------------------------------------

On 2011-04-26 21:19:18, Siying Dong wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/633/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-04-26 21:19:18)
bq.  
bq.  
bq.  Review request for hive, Ning Zhang and namit jain.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  We need a better input sampling to serve at least two purposes:
bq.  1. test their queries against a smaller data set
bq.  2. understand more about how the data look like without scanning the whole table.
bq.  A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.
bq.  
bq.  This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs.
bq.  
bq.  
bq.  This addresses bug HIVE-2121.
bq.      https://issues.apache.org/jira/browse/HIVE-2121
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1096852 
bq.    trunk/conf/hive-default.xml 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1096852 
bq.    trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q PRE-CREATION 
bq.    trunk/ql/src/test/queries/clientnegative/split_sample_out_of_range.q PRE-CREATION 
bq.    trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION 
bq.    trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION 
bq.    trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out PRE-CREATION 
bq.    trunk/ql/src/test/results/clientnegative/split_sample_out_of_range.q.out PRE-CREATION 
bq.    trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION 
bq.    trunk/ql/src/test/results/clientpositive/bucket1.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/bucket2.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/bucket3.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample1.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample10.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample2.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample3.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample4.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample5.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample6.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample7.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample8.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample9.q.out 1096852 
bq.  
bq.  Diff: https://reviews.apache.org/r/633/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  TestCliDriver TestNegativeCliDriver, manual tests on real clusters.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Siying
bq.  
bq.

> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch, HIVE-2121.3.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-2121) Input Sampling By Splits

Posted by "Siying Dong (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siying Dong updated HIVE-2121:
------------------------------

    Attachment: HIVE-2121.6.patch

forgot a file.

> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch, HIVE-2121.3.patch, HIVE-2121.4.patch, HIVE-2121.5.patch, HIVE-2121.6.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-2121) Input Sampling By Splits

Posted by "Siying Dong (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siying Dong updated HIVE-2121:
------------------------------

    Attachment: HIVE-2121.3.patch

addressing most of Ning's comments, except the two places I commented on review board.

> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch, HIVE-2121.3.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-2121) Input Sampling By Splits

Posted by "Siying Dong (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siying Dong updated HIVE-2121:
------------------------------

    Attachment:     (was: HIVE-2121.4.patch)

> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch, HIVE-2121.3.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-2121) Input Sampling By Splits

Posted by "Siying Dong (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siying Dong updated HIVE-2121:
------------------------------

    Attachment: HIVE-2121.5.patch

Two changes made according to Namit's comments:
1. explain will print out some about the sampling. (It might not be the best way to print but it follows the framework)
2. the granularity of sampling is down from split-level to HDFS block level.

> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch, HIVE-2121.3.patch, HIVE-2121.4.patch, HIVE-2121.5.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2121) Input Sampling By Splits

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027125#comment-13027125 ] 

Namit Jain commented on HIVE-2121:
----------------------------------

running tests

> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch, HIVE-2121.3.patch, HIVE-2121.4.patch, HIVE-2121.5.patch, HIVE-2121.6.patch, HIVE-2121.7.patch, HIVE-2121.8.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-2121) Input Sampling By Splits

Posted by "Siying Dong (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siying Dong updated HIVE-2121:
------------------------------

    Status: Patch Available  (was: Open)

> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch, HIVE-2121.3.patch, HIVE-2121.4.patch, HIVE-2121.5.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2121) Input Sampling By Splits

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025611#comment-13025611 ] 

Namit Jain commented on HIVE-2121:
----------------------------------

Add the new configuration variables in the name of the jira

> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch, HIVE-2121.3.patch, HIVE-2121.4.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira