Posted to issues@hawq.apache.org by kavinderd <gi...@git.apache.org> on 2017/04/26 22:12:19 UTC

[GitHub] incubator-hawq pull request #1224: HAWQ-1440. Support ANALYZE for all Hive E...

GitHub user kavinderd opened a pull request:

    https://github.com/apache/incubator-hawq/pull/1224

    HAWQ-1440. Support ANALYZE for all Hive External Tables

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kavinderd/incubator-hawq HAWQ-1440

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-hawq/pull/1224.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1224
    
----
commit fb67dcbc5e98f50639c3d9f732d0b9cf260f8b1d
Author: Kavinder Dhaliwal <ka...@gmail.com>
Date:   2017-04-25T22:14:47Z

    HAWQ-1440. Support ANALYZE for all Hive External Tables

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-hawq pull request #1224: HAWQ-1440. Support ANALYZE for all Hive E...

Posted by kavinderd <gi...@git.apache.org>.
Github user kavinderd commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq/pull/1224#discussion_r113578883
  
    --- Diff: pxf/pxf-hive/src/main/java/org/apache/hawq/pxf/plugins/hive/HiveDataFragmenter.java ---
    @@ -466,7 +466,14 @@ private boolean buildSingleFilter(Object filter,
          */
         @Override
         public FragmentsStats getFragmentsStats() throws Exception {
    -        throw new UnsupportedOperationException(
    -                "ANALYZE for Hive plugin is not supported");
    +        Metadata.Item tblDesc = HiveUtilities.extractTableFromName(inputData.getDataSource());
    +        Table tbl = HiveUtilities.getHiveTable(client, tblDesc);
    +        Metadata metadata = new Metadata(tblDesc);
    +        HiveUtilities.getSchema(tbl, metadata);
    +
    +        long split_count = Long.parseLong(tbl.getParameters().get("numFiles"));
    --- End diff --
    
    Hmm, ok I'll use that metadata stat instead



[GitHub] incubator-hawq pull request #1224: HAWQ-1440. Support ANALYZE for all Hive E...

Posted by kavinderd <gi...@git.apache.org>.
Github user kavinderd commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq/pull/1224#discussion_r113817022
  
    --- Diff: pxf/pxf-hive/src/main/java/org/apache/hawq/pxf/plugins/hive/HiveDataFragmenter.java ---
    @@ -466,7 +466,14 @@ private boolean buildSingleFilter(Object filter,
          */
         @Override
         public FragmentsStats getFragmentsStats() throws Exception {
    -        throw new UnsupportedOperationException(
    -                "ANALYZE for Hive plugin is not supported");
    +        Metadata.Item tblDesc = HiveUtilities.extractTableFromName(inputData.getDataSource());
    +        Table tbl = HiveUtilities.getHiveTable(client, tblDesc);
    +        Metadata metadata = new Metadata(tblDesc);
    +        HiveUtilities.getSchema(tbl, metadata);
    +
    +        long split_count = Long.parseLong(tbl.getParameters().get("numFiles"));
    --- End diff --
    
    @sansanichfb @shivzone It is possible to have files that are larger than an HDFS block/split size. However, I think this is an anomaly, especially with ORC, where creating many small files is preferred to increase concurrency and parallelism. Given that this is an edge case, and that HAWQ only uses the number of splits to calculate its sampling ratio for statistics collection, is the current implementation acceptable?
    
    I personally don't think that getting an accurate number of splits for the ANALYZE case is worth running a function like https://github.com/apache/incubator-hawq/blob/master/pxf/pxf-hive/src/main/java/org/apache/hawq/pxf/plugins/hive/HiveDataFragmenter.java#L284 just to get a handle on the number of splits
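
For readers following the trade-off above: the diff reads the file count straight from the table parameters with `Long.parseLong(tbl.getParameters().get("numFiles"))`. If the `numFiles` entry is absent (for instance, when statistics were never gathered for the table), `get` returns null and `Long.parseLong` throws a `NumberFormatException`. The following is a minimal sketch of a more defensive lookup, not code from the pull request. The class `TableStats`, the method `fileCount`, and the fallback value are hypothetical names introduced only for illustration, and a plain `Map<String, String>` stands in for the table parameters.

```java
import java.util.Map;

/** Hypothetical helper, not part of PXF: reads the "numFiles" table statistic defensively. */
public final class TableStats {
    private TableStats() {}

    /**
     * Returns the file count stored under "numFiles", or the fallback when the
     * entry is missing, not a number, or negative.
     */
    public static long fileCount(Map<String, String> tableParameters, long fallback) {
        if (tableParameters == null) {
            return fallback;
        }
        String raw = tableParameters.get("numFiles");
        if (raw == null) {
            return fallback; // statistic not recorded for this table
        }
        try {
            long parsed = Long.parseLong(raw.trim());
            return parsed >= 0 ? parsed : fallback;
        } catch (NumberFormatException e) {
            return fallback; // value present but not a valid long
        }
    }
}
```

A caller would pass something like `TableStats.fileCount(tbl.getParameters(), 1L)`. Whether a fallback of 1 is sensible for the sampling-ratio calculation is an assumption that would need checking against how HAWQ consumes the value.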



[GitHub] incubator-hawq pull request #1224: HAWQ-1440. Support ANALYZE for all Hive E...

Posted by sansanichfb <gi...@git.apache.org>.
Github user sansanichfb commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq/pull/1224#discussion_r113576418
  
    --- Diff: pxf/pxf-hive/src/main/java/org/apache/hawq/pxf/plugins/hive/HiveDataFragmenter.java ---
    @@ -466,7 +466,14 @@ private boolean buildSingleFilter(Object filter,
          */
         @Override
         public FragmentsStats getFragmentsStats() throws Exception {
    -        throw new UnsupportedOperationException(
    -                "ANALYZE for Hive plugin is not supported");
    +        Metadata.Item tblDesc = HiveUtilities.extractTableFromName(inputData.getDataSource());
    +        Table tbl = HiveUtilities.getHiveTable(client, tblDesc);
    +        Metadata metadata = new Metadata(tblDesc);
    +        HiveUtilities.getSchema(tbl, metadata);
    +
    +        long split_count = Long.parseLong(tbl.getParameters().get("numFiles"));
    --- End diff --
    
    I think the split size is configured by the "mapred.max.split.size" property of a job, so if a file happens to be bigger than the split size, we might have multiple splits for one file.
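
To make the concern above concrete, here is a rough sketch of how the split count can diverge from the file count once a file exceeds the maximum split size. It uses a simple ceiling model, which is an assumption: real input formats also take into account things like minimum split size and format-specific boundaries, so the actual number may differ. The class `SplitEstimate` and its methods are hypothetical and are not part of PXF or Hive.

```java
/** Hypothetical illustration: estimates splits per file with a plain ceiling division. */
public final class SplitEstimate {
    private SplitEstimate() {}

    /** Number of splits for one file: ceil(fileSize / maxSplitSize), 0 for an empty file. */
    public static long splitsForFile(long fileSizeBytes, long maxSplitSizeBytes) {
        if (maxSplitSizeBytes <= 0) {
            throw new IllegalArgumentException("maxSplitSizeBytes must be positive");
        }
        if (fileSizeBytes <= 0) {
            return 0;
        }
        long full = fileSizeBytes / maxSplitSizeBytes;
        // Add one split for the remainder, if any; avoids overflow from (a + b - 1) / b.
        return fileSizeBytes % maxSplitSizeBytes == 0 ? full : full + 1;
    }

    /** Sum of per-file estimates; equals the file count only when no file exceeds the limit. */
    public static long totalSplits(long[] fileSizesBytes, long maxSplitSizeBytes) {
        long total = 0;
        for (long size : fileSizesBytes) {
            total += splitsForFile(size, maxSplitSizeBytes);
        }
        return total;
    }
}
```

Under this model, a table with three files of 100 MB, 256 MB, and 300 MB and a 256 MB maximum split size has 3 files but an estimated 4 splits, which is the gap between `numFiles` and the true split count being discussed.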



[GitHub] incubator-hawq pull request #1224: HAWQ-1440. Support ANALYZE for all Hive E...

Posted by shivzone <gi...@git.apache.org>.
Github user shivzone commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq/pull/1224#discussion_r113588287
  
    --- Diff: pxf/pxf-hive/src/main/java/org/apache/hawq/pxf/plugins/hive/HiveDataFragmenter.java ---
    @@ -466,7 +466,14 @@ private boolean buildSingleFilter(Object filter,
          */
         @Override
         public FragmentsStats getFragmentsStats() throws Exception {
    -        throw new UnsupportedOperationException(
    -                "ANALYZE for Hive plugin is not supported");
    +        Metadata.Item tblDesc = HiveUtilities.extractTableFromName(inputData.getDataSource());
    +        Table tbl = HiveUtilities.getHiveTable(client, tblDesc);
    +        Metadata metadata = new Metadata(tblDesc);
    +        HiveUtilities.getSchema(tbl, metadata);
    +
    +        long split_count = Long.parseLong(tbl.getParameters().get("numFiles"));
    --- End diff --
    
    Yes, can you please test against a fairly large ORC file to see the number of splits/blocks we return?



[GitHub] incubator-hawq pull request #1224: HAWQ-1440. Support ANALYZE for all Hive E...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/incubator-hawq/pull/1224

