Posted to dev@hive.apache.org by "Rajat Venkatesh (JIRA)" <ji...@apache.org> on 2014/10/15 18:01:33 UTC

[jira] [Updated] (HIVE-8467) Table Copy - Background, incremental data load

     [ https://issues.apache.org/jira/browse/HIVE-8467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajat Venkatesh updated HIVE-8467:
----------------------------------
    Attachment: Table Copies.pdf

> Table Copy - Background, incremental data load
> ----------------------------------------------
>
>                 Key: HIVE-8467
>                 URL: https://issues.apache.org/jira/browse/HIVE-8467
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Rajat Venkatesh
>         Attachments: Table Copies.pdf
>
>
> Traditionally, Hive and other tools in the Hadoop eco-system haven't required a load stage. However, with recent developments, Hive is much more performant when data is stored in specific formats like ORC, Parquet, Avro etc. Technologies like Presto also work much better with certain data formats. At the same time, data is generated or obtained from 3rd parties in non-optimal formats such as CSV, tab-delimited or JSON. Often it is not an option to change the data format at the source. We've found that users either use sub-optimal formats or spend significant effort creating and maintaining copies. We want to propose a new construct - Table Copy - to help “load” data into an optimal storage format.
> I am going to attach a PDF document with a lot more details, especially addressing how this differs from bulk loads in relational DBs or materialized views.
> Looking forward to hearing whether others see a similar need to formalize conversion of data to different storage formats. If yes, are the details in the PDF document a good start?
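
[Editor's note: a minimal HiveQL sketch of the manual copy-and-convert workflow the description refers to. Table and column names (raw_events, events_orc, event_time) and the watermark variable are hypothetical, not from the proposal; the actual Table Copy design is in the attached PDF.]

    -- Raw data lands as delimited text in a drop zone (assumed path).
    CREATE EXTERNAL TABLE raw_events (
      event_time STRING,
      user_id    BIGINT,
      payload    STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/incoming/events';

    -- Hand-maintained optimized copy in ORC; the initial load is a CTAS.
    CREATE TABLE events_orc STORED AS ORC
    AS SELECT * FROM raw_events;

    -- Incremental refresh is typically hand-rolled, e.g. driven by a
    -- watermark the user tracks externally (hivevar substitution assumed):
    INSERT INTO TABLE events_orc
    SELECT * FROM raw_events
    WHERE event_time > '${hivevar:last_load_ts}';

Keeping raw_events and events_orc in sync is exactly the ongoing maintenance burden the proposal aims to formalize with a background, incremental Table Copy.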



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)