You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@hive.apache.org by "Anja Gruenheid (JIRA)" <ji...@apache.org> on 2011/02/02 01:52:28 UTC

[jira] Created: (HIVE-1940) Query Optimization Using Column Metadata and Histograms

Query Optimization Using Column Metadata and Histograms
-------------------------------------------------------

                 Key: HIVE-1940
                 URL: https://issues.apache.org/jira/browse/HIVE-1940
             Project: Hive
          Issue Type: New Feature
          Components: Metastore, Query Processor
            Reporter: Anja Gruenheid


The current basis for cost-based query optimization in Hive is information gathered on tables and partitions. To make further improvements in query optimization possible, the next step is to develop and implement possibilities to gather information on columns as discussed in issue HIVE-33. After that, an implementation of histograms is a possible option to use and collect run-time statistics. Next to the actual implementation of these features, it is also necessary to develop a consistent storage model for the MetaStore.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Updated: (HIVE-1940) Query Optimization Using Column Metadata and Histograms

Posted by "Anja Gruenheid (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anja Gruenheid updated HIVE-1940:
---------------------------------

    Attachment: HiveMetaStore.pdf

Hive MetaStore Model - 02/05/2011

> Query Optimization Using Column Metadata and Histograms
> -------------------------------------------------------
>
>                 Key: HIVE-1940
>                 URL: https://issues.apache.org/jira/browse/HIVE-1940
>             Project: Hive
>          Issue Type: New Feature
>          Components: Metastore, Query Processor
>            Reporter: Anja Gruenheid
>         Attachments: HiveMetaStore.pdf
>
>
> The current basis for cost-based query optimization in Hive is information gathered on tables and partitions. To make further improvements in query optimization possible, the next step is to develop and implement possibilities to gather information on columns as discussed in issue HIVE-33. After that, an implementation of histograms is a possible option to use and collect run-time statistics. Next to the actual implementation of these features, it is also necessary to develop a consistent storage model for the MetaStore.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (HIVE-1940) Query Optimization Using Column Metadata and Histograms

Posted by "Anja Gruenheid (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990910#comment-12990910 ] 

Anja Gruenheid commented on HIVE-1940:
--------------------------------------

I found out that the IDXS metastore tables are generated when I create an index for the first time.

> Query Optimization Using Column Metadata and Histograms
> -------------------------------------------------------
>
>                 Key: HIVE-1940
>                 URL: https://issues.apache.org/jira/browse/HIVE-1940
>             Project: Hive
>          Issue Type: New Feature
>          Components: Metastore, Query Processor
>            Reporter: Anja Gruenheid
>
> The current basis for cost-based query optimization in Hive is information gathered on tables and partitions. To make further improvements in query optimization possible, the next step is to develop and implement possibilities to gather information on columns as discussed in issue HIVE-33. After that, an implementation of histograms is a possible option to use and collect run-time statistics. Next to the actual implementation of these features, it is also necessary to develop a consistent storage model for the MetaStore.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (HIVE-1940) Query Optimization Using Column Metadata and Histograms

Posted by "Anja Gruenheid (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994987#comment-12994987 ] 

Anja Gruenheid commented on HIVE-1940:
--------------------------------------

I tried figuring out datanucleus and the creation of the initial metastore model, but I don't quite understand it:

When I create the metastore in MySQL, I generate the jars by running ant model-jar in the hive/metastore folder. When I then run hive, metastore tables are generated according the command that I use (eg show tables) in MySQL. I referenced org.datanucleus.store.rdbms.SchemaTool instead of jpox before generating the jar file, but it didn't change anything. Basically, there has to be an overview of all metastore tables that can possibly be invoked. My question is: where?

Thanks a lot for your help!

> Query Optimization Using Column Metadata and Histograms
> -------------------------------------------------------
>
>                 Key: HIVE-1940
>                 URL: https://issues.apache.org/jira/browse/HIVE-1940
>             Project: Hive
>          Issue Type: New Feature
>          Components: Metastore, Query Processor
>            Reporter: Anja Gruenheid
>         Attachments: HiveMetaStore.pdf
>
>
> The current basis for cost-based query optimization in Hive is information gathered on tables and partitions. To make further improvements in query optimization possible, the next step is to develop and implement possibilities to gather information on columns as discussed in issue HIVE-33. After that, an implementation of histograms is a possible option to use and collect run-time statistics. Next to the actual implementation of these features, it is also necessary to develop a consistent storage model for the MetaStore.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (HIVE-1940) Query Optimization Using Column Metadata and Histograms

Posted by "Anja Gruenheid (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990331#comment-12990331 ] 

Anja Gruenheid commented on HIVE-1940:
--------------------------------------

I have set up the last stable version, but as far as I understood, some features have been added during the current iteration, which also have had impact on the design of the MetaStore. Is there an up-to-date overview of the MetaStore somewhere or should I retrace the updates that have been made since the last release?

If I can collect all the data that I need, I'll create the model.

> Query Optimization Using Column Metadata and Histograms
> -------------------------------------------------------
>
>                 Key: HIVE-1940
>                 URL: https://issues.apache.org/jira/browse/HIVE-1940
>             Project: Hive
>          Issue Type: New Feature
>          Components: Metastore, Query Processor
>            Reporter: Anja Gruenheid
>
> The current basis for cost-based query optimization in Hive is information gathered on tables and partitions. To make further improvements in query optimization possible, the next step is to develop and implement possibilities to gather information on columns as discussed in issue HIVE-33. After that, an implementation of histograms is a possible option to use and collect run-time statistics. Next to the actual implementation of these features, it is also necessary to develop a consistent storage model for the MetaStore.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (HIVE-1940) Query Optimization Using Column Metadata and Histograms

Posted by "Anja Gruenheid (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991022#comment-12991022 ] 

Anja Gruenheid commented on HIVE-1940:
--------------------------------------

Here is the metastore model that I generated with MySQL Workbench: http://home.in.tum.de/~gruenhei/HiveMetaStore.pdf
Comparing this model to the one displayed in the index wiki, I noticed the two tables PARTITIONS and PARTITION_KEY_VALS are missing in my model. Do you have any idea how I can create them? I tried adding partitions on tables, but that just created entries in table PARTITION_KEYS.

> Query Optimization Using Column Metadata and Histograms
> -------------------------------------------------------
>
>                 Key: HIVE-1940
>                 URL: https://issues.apache.org/jira/browse/HIVE-1940
>             Project: Hive
>          Issue Type: New Feature
>          Components: Metastore, Query Processor
>            Reporter: Anja Gruenheid
>
> The current basis for cost-based query optimization in Hive is information gathered on tables and partitions. To make further improvements in query optimization possible, the next step is to develop and implement possibilities to gather information on columns as discussed in issue HIVE-33. After that, an implementation of histograms is a possible option to use and collect run-time statistics. Next to the actual implementation of these features, it is also necessary to develop a consistent storage model for the MetaStore.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (HIVE-1940) Query Optimization Using Column Metadata and Histograms

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990265#comment-12990265 ] 

John Sichi commented on HIVE-1940:
----------------------------------

Hi Anja,

To get a DDL script, you can install Hive and then get your DBMS to generate a script.  For example, with MySQL, you can use the mysqldump utility with --no-data option.

For Derby, see

http://www.vogella.de/articles/ApacheDerby/article.html#usage_sqldump

For an E/R diagram, I had good results with the open source tool Power Architect:

http://www.sqlpower.ca/page/architect

(Some manual layout required after reverse engineering.)  You can see an example here:

http://wiki.apache.org/hadoop/Hive/IndexDev#Metastore_Model

If you produce a diagram for the complete metastore, we can get it published in the wiki for others to use.


> Query Optimization Using Column Metadata and Histograms
> -------------------------------------------------------
>
>                 Key: HIVE-1940
>                 URL: https://issues.apache.org/jira/browse/HIVE-1940
>             Project: Hive
>          Issue Type: New Feature
>          Components: Metastore, Query Processor
>            Reporter: Anja Gruenheid
>
> The current basis for cost-based query optimization in Hive is information gathered on tables and partitions. To make further improvements in query optimization possible, the next step is to develop and implement possibilities to gather information on columns as discussed in issue HIVE-33. After that, an implementation of histograms is a possible option to use and collect run-time statistics. Next to the actual implementation of these features, it is also necessary to develop a consistent storage model for the MetaStore.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (HIVE-1940) Query Optimization Using Column Metadata and Histograms

Posted by "Anja Gruenheid (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990868#comment-12990868 ] 

Anja Gruenheid commented on HIVE-1940:
--------------------------------------

I created the metastore as you suggested, but I'm missing a couple of tables like IDXS. I used MySQL as local database and adjusted the parameters accordingly.
When I create tables, I can see them in the metastore via MySQL, so it definitely is working.

> Query Optimization Using Column Metadata and Histograms
> -------------------------------------------------------
>
>                 Key: HIVE-1940
>                 URL: https://issues.apache.org/jira/browse/HIVE-1940
>             Project: Hive
>          Issue Type: New Feature
>          Components: Metastore, Query Processor
>            Reporter: Anja Gruenheid
>
> The current basis for cost-based query optimization in Hive is information gathered on tables and partitions. To make further improvements in query optimization possible, the next step is to develop and implement possibilities to gather information on columns as discussed in issue HIVE-33. After that, an implementation of histograms is a possible option to use and collect run-time statistics. Next to the actual implementation of these features, it is also necessary to develop a consistent storage model for the MetaStore.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-1940) Query Optimization Using Column Metadata and Histograms

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carl Steinbach updated HIVE-1940:
---------------------------------

    Attachment: Agruenheid_ideas11.pdf
    
> Query Optimization Using Column Metadata and Histograms
> -------------------------------------------------------
>
>                 Key: HIVE-1940
>                 URL: https://issues.apache.org/jira/browse/HIVE-1940
>             Project: Hive
>          Issue Type: New Feature
>          Components: Metastore, Query Processor, Statistics
>            Reporter: Anja Gruenheid
>         Attachments: Agruenheid_ideas11.pdf, HiveMetaStore.pdf
>
>
> The current basis for cost-based query optimization in Hive is information gathered on tables and partitions. To make further improvements in query optimization possible, the next step is to develop and implement possibilities to gather information on columns as discussed in issue HIVE-33. After that, an implementation of histograms is a possible option to use and collect run-time statistics. Next to the actual implementation of these features, it is also necessary to develop a consistent storage model for the MetaStore.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Resolved] (HIVE-1940) Query Optimization Using Column Metadata and Histograms

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carl Steinbach resolved HIVE-1940.
----------------------------------

    Resolution: Duplicate

Resolving this as a duplicated of HIVE-1938 and HIVE-1362.
                
> Query Optimization Using Column Metadata and Histograms
> -------------------------------------------------------
>
>                 Key: HIVE-1940
>                 URL: https://issues.apache.org/jira/browse/HIVE-1940
>             Project: Hive
>          Issue Type: New Feature
>          Components: Metastore, Query Processor, Statistics
>            Reporter: Anja Gruenheid
>         Attachments: Agruenheid_ideas11.pdf, HiveMetaStore.pdf
>
>
> The current basis for cost-based query optimization in Hive is information gathered on tables and partitions. To make further improvements in query optimization possible, the next step is to develop and implement possibilities to gather information on columns as discussed in issue HIVE-33. After that, an implementation of histograms is a possible option to use and collect run-time statistics. Next to the actual implementation of these features, it is also necessary to develop a consistent storage model for the MetaStore.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-1940) Query Optimization Using Column Metadata and Histograms

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13274832#comment-13274832 ] 

Carl Steinbach commented on HIVE-1940:
--------------------------------------

@Anja: I noticed that you published a paper about Hive column statistics. Any chance you would be willing to share this code?
                
> Query Optimization Using Column Metadata and Histograms
> -------------------------------------------------------
>
>                 Key: HIVE-1940
>                 URL: https://issues.apache.org/jira/browse/HIVE-1940
>             Project: Hive
>          Issue Type: New Feature
>          Components: Metastore, Query Processor, Statistics
>            Reporter: Anja Gruenheid
>         Attachments: Agruenheid_ideas11.pdf, HiveMetaStore.pdf
>
>
> The current basis for cost-based query optimization in Hive is information gathered on tables and partitions. To make further improvements in query optimization possible, the next step is to develop and implement possibilities to gather information on columns as discussed in issue HIVE-33. After that, an implementation of histograms is a possible option to use and collect run-time statistics. Next to the actual implementation of these features, it is also necessary to develop a consistent storage model for the MetaStore.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (HIVE-1940) Query Optimization Using Column Metadata and Histograms

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991542#comment-12991542 ] 

John Sichi commented on HIVE-1940:
----------------------------------

Awesome diagram!  Can you add it as an attachment and check the radio button to grant license to ASF so that we can use it in the Hive wiki?

Try loading some data into your partitions; maybe it deferred that part of the schema creation until then.

There's a tool which can force generation of the entire schema:

http://www.datanucleus.org/products/accessplatform/rdbms/schematool.html

There's an ant target generate-schema which invokes it (in metastore/build.xml), but it's out-of-date because it still references jpox instead of datanucleus (e.g. it should be invoking org.datanucleus.store.rdbms.SchemaTool instead of org.jpox.SchemaTool).  If you get it working, submit a patch and we can update it.


> Query Optimization Using Column Metadata and Histograms
> -------------------------------------------------------
>
>                 Key: HIVE-1940
>                 URL: https://issues.apache.org/jira/browse/HIVE-1940
>             Project: Hive
>          Issue Type: New Feature
>          Components: Metastore, Query Processor
>            Reporter: Anja Gruenheid
>
> The current basis for cost-based query optimization in Hive is information gathered on tables and partitions. To make further improvements in query optimization possible, the next step is to develop and implement possibilities to gather information on columns as discussed in issue HIVE-33. After that, an implementation of histograms is a possible option to use and collect run-time statistics. Next to the actual implementation of these features, it is also necessary to develop a consistent storage model for the MetaStore.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-1940) Query Optimization Using Column Metadata and Histograms

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carl Steinbach updated HIVE-1940:
---------------------------------

    Component/s: Statistics

> Query Optimization Using Column Metadata and Histograms
> -------------------------------------------------------
>
>                 Key: HIVE-1940
>                 URL: https://issues.apache.org/jira/browse/HIVE-1940
>             Project: Hive
>          Issue Type: New Feature
>          Components: Metastore, Query Processor, Statistics
>            Reporter: Anja Gruenheid
>         Attachments: HiveMetaStore.pdf
>
>
> The current basis for cost-based query optimization in Hive is information gathered on tables and partitions. To make further improvements in query optimization possible, the next step is to develop and implement possibilities to gather information on columns as discussed in issue HIVE-33. After that, an implementation of histograms is a possible option to use and collect run-time statistics. Next to the actual implementation of these features, it is also necessary to develop a consistent storage model for the MetaStore.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (HIVE-1940) Query Optimization Using Column Metadata and Histograms

Posted by "Anja Gruenheid (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989494#comment-12989494 ] 

Anja Gruenheid commented on HIVE-1940:
--------------------------------------

As first step, I would like to take a closer look at collecting meta data on the column level. In issue HIVE-33, five different statistics are described (# distinct values, # null values, 3 min values, 3 max values, avg size of column) that have been proposed as column meta data. As reference, I would take the implementation of the table/partition meta data collection.
As far as I can tell, deriving histograms is a little bit more complex than obtaining column information, which is why I want to start out with that.

Is there an up-to-date MetaStore DDL script or an E/R model?

> Query Optimization Using Column Metadata and Histograms
> -------------------------------------------------------
>
>                 Key: HIVE-1940
>                 URL: https://issues.apache.org/jira/browse/HIVE-1940
>             Project: Hive
>          Issue Type: New Feature
>          Components: Metastore, Query Processor
>            Reporter: Anja Gruenheid
>
> The current basis for cost-based query optimization in Hive is information gathered on tables and partitions. To make further improvements in query optimization possible, the next step is to develop and implement possibilities to gather information on columns as discussed in issue HIVE-33. After that, an implementation of histograms is a possible option to use and collect run-time statistics. Next to the actual implementation of these features, it is also necessary to develop a consistent storage model for the MetaStore.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (HIVE-1940) Query Optimization Using Column Metadata and Histograms

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990411#comment-12990411 ] 

John Sichi commented on HIVE-1940:
----------------------------------

If you just svn update to the tip of trunk and build/install from there, you'll get the latest metastore.  Substantial additions since 0.6 include support for indexes, authorization, and various database properties.

> Query Optimization Using Column Metadata and Histograms
> -------------------------------------------------------
>
>                 Key: HIVE-1940
>                 URL: https://issues.apache.org/jira/browse/HIVE-1940
>             Project: Hive
>          Issue Type: New Feature
>          Components: Metastore, Query Processor
>            Reporter: Anja Gruenheid
>
> The current basis for cost-based query optimization in Hive is information gathered on tables and partitions. To make further improvements in query optimization possible, the next step is to develop and implement possibilities to gather information on columns as discussed in issue HIVE-33. After that, an implementation of histograms is a possible option to use and collect run-time statistics. Next to the actual implementation of these features, it is also necessary to develop a consistent storage model for the MetaStore.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira