Posted to dev@hive.apache.org by "Julian Hyde (JIRA)" <ji...@apache.org> on 2014/10/15 20:51:37 UTC
[jira] [Commented] (HIVE-8467) Table Copy - Background, incremental data load

    [ https://issues.apache.org/jira/browse/HIVE-8467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14172767#comment-14172767 ] 

Julian Hyde commented on HIVE-8467:
-----------------------------------

I see this as a particular kind of materialized view. In general, a materialized view is a table whose contents are guaranteed to be the same as the result of executing a particular query. In this case, that query is simply 'select * from t'.

We don't have materialized view support yet, but I have been working on lattices in Calcite (formerly known as Optiq) (see OPTIQ-344) and there is a lot of interest in adding them to Hive. Each materialized "tile" in a lattice is a materialized view of the form 'select d1, d2, sum(m1), count(m2) from t group by d1, d2'.
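
To make the invariant concrete, here is a minimal sketch that uses SQLite (through Python's standard sqlite3 module) as a stand-in for Hive. The table contents, the names "tile" and "c", and the helper functions are hypothetical and only illustrate the idea that a materialized view is a stored table that must match the result of its defining query; it is not how Hive or Calcite would implement it.

```python
import sqlite3

# In-memory SQLite database standing in for Hive (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (d1 TEXT, d2 TEXT, m1 INTEGER, m2 INTEGER);
    INSERT INTO t VALUES
        ('a', 'x', 1, 10),
        ('a', 'x', 2, NULL),
        ('a', 'y', 3, 30),
        ('b', 'x', 4, 40);
""")

# Defining query of a lattice "tile", and of a plain table copy.
TILE_QUERY = """
    SELECT d1, d2, SUM(m1) AS sum_m1, COUNT(m2) AS count_m2
    FROM t GROUP BY d1, d2
"""
COPY_QUERY = "SELECT * FROM t"


def materialize(name, query):
    """(Re)build the stored table `name` from its defining query."""
    conn.execute(f"DROP TABLE IF EXISTS {name}")
    conn.execute(f"CREATE TABLE {name} AS {query}")


def is_fresh(name, query):
    """True if the stored table holds exactly the rows the query returns now."""
    stored = sorted(conn.execute(f"SELECT * FROM {name}").fetchall(), key=repr)
    expected = sorted(conn.execute(query).fetchall(), key=repr)
    return stored == expected


materialize("tile", TILE_QUERY)
materialize("c", COPY_QUERY)
tile_rows = sorted(conn.execute("SELECT * FROM tile").fetchall())
```

Once new rows arrive in t, is_fresh("c", COPY_QUERY) returns False until materialize is called again, which is the gap that a background, incremental load would have to close.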

So, let's talk about whether we could change the syntax to 'create materialized view' and still deliver the functionality you need. Of course, if the user enters anything other than 'select * from t order by k1, k2' they would get an error.

In terms of query planning, I strongly recommend that you build on the CBO work powered by Calcite. Let's suppose there is a table T and a copy C. After translating the query to a Calcite RelNode tree, there will be a TableAccessRel(T). After reading the metadata, we should create a TableAccessRel(C) and tell Calcite that it is equivalent to TableAccessRel(T).

That's all you need to do. Calcite will take it from there. Assuming the stats indicate that C is better (and they should, right, because the ORC representation will be smaller?) then the query will end up using C. But if, say, T has a partitioning scheme which is more suitable for a particular query, then Calcite will choose T.
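
As a rough illustration of that choice, the following toy model picks the cheapest scan out of a set of scans declared equivalent. It does not use Calcite's API or its cost model: the ScanCandidate class, the scan_cost formula, the PRUNED_FRACTION constant and all the numbers are assumptions made up for this sketch.

```python
from dataclasses import dataclass

# Assumed fraction of data read when a filter hits a partition column.
PRUNED_FRACTION = 0.01


@dataclass(frozen=True)
class ScanCandidate:
    """One way of reading the same logical rows (hypothetical stats)."""
    table: str
    row_count: int
    avg_row_bytes: int
    partition_columns: frozenset = frozenset()


def scan_cost(scan, filter_columns=frozenset()):
    """Estimated bytes read; partition pruning applies if a filter column matches."""
    bytes_read = scan.row_count * scan.avg_row_bytes
    if scan.partition_columns & filter_columns:
        bytes_read *= PRUNED_FRACTION
    return bytes_read


def choose_scan(equivalent_scans, filter_columns=frozenset()):
    """Pick the cheapest member of a set of equivalent scans."""
    return min(equivalent_scans, key=lambda s: scan_cost(s, filter_columns))


# T: wide text rows, partitioned by "ds". C: compact ORC copy, unpartitioned.
T = ScanCandidate("T", 1_000_000, 200, frozenset({"ds"}))
C = ScanCandidate("C", 1_000_000, 40)

full_scan_choice = choose_scan([T, C])
filtered_choice = choose_scan([T, C], filter_columns=frozenset({"ds"}))
```

With these made-up stats, an unfiltered query picks C because the copy is smaller, while a query filtering on ds picks T because pruning makes the original cheaper to read.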

> Table Copy - Background, incremental data load
> ----------------------------------------------
>
>                 Key: HIVE-8467
>                 URL: https://issues.apache.org/jira/browse/HIVE-8467
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Rajat Venkatesh
>         Attachments: Table Copies.pdf
>
>
> Traditionally, Hive and other tools in the Hadoop ecosystem haven't required a load stage. However, with recent developments, Hive is much more performant when data is stored in specific formats like ORC, Parquet, Avro, etc. Technologies like Presto also work much better with certain data formats. At the same time, data is generated or obtained from 3rd parties in non-optimal formats such as CSV, tab-delimited or JSON. Many times, it's not an option to change the data format at the source. We've found that users either use sub-optimal formats or spend a large amount of effort creating and maintaining copies. We want to propose a new construct - Table Copy - to help “load” data into an optimal storage format.
> I am going to attach a PDF document with a lot more details, especially addressing how this is different from bulk loads in relational DBs or materialized views.
> Looking forward to hearing whether others see a similar need to formalize conversion of data to different storage formats. If yes, are the details in the PDF document a good start?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)