Posted to issues@hive.apache.org by "Mohammad Adnan (JIRA)" <ji...@apache.org> on 2016/11/07 13:56:58 UTC

[jira] [Commented] (HIVE-12860) Add WITH HEADER option to INSERT OVERWRITE DIRECTORY

    [ https://issues.apache.org/jira/browse/HIVE-12860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15644238#comment-15644238 ] 

Mohammad Adnan commented on HIVE-12860:
---------------------------------------

Is there an expected release version for this?
Also, the workaround of UNIONing the query result with a header table seems very hackish. Header values are always strings, whereas the columns they are unioned with may hold integer values. I am not sure whether this can cause a problem, but it sounds very tricky.
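
To make the type concern concrete, here is a rough, untested sketch of the UNION workaround. The table {{report_source(id INT, name STRING)}}, the output path and the {{sort_rank}} column are hypothetical names chosen for illustration; none of them come from the issue. It also assumes a Hive version that accepts a {{SELECT}} without a {{FROM}} clause and an {{ORDER BY}} on a column that is not in the outer select list, which may not hold everywhere.
{code}
-- Hypothetical source table: report_source(id INT, name STRING)
INSERT OVERWRITE DIRECTORY '/tmp/report_with_header'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT u.id, u.name
FROM (
  -- Header row: every value is a string literal
  SELECT 0 AS sort_rank, 'id' AS id, 'name' AS name
  UNION ALL
  -- Data rows: the integer column is cast explicitly so both branches agree on type
  SELECT 1 AS sort_rank, CAST(s.id AS STRING) AS id, s.name AS name
  FROM report_source s
) u
-- Synthesised rank column forces the header row to sort first
ORDER BY u.sort_rank;
{code}
The explicit {{CAST}} avoids relying on however Hive reconciles a string branch with an integer branch, but it means every non-string column has to be cast by hand, which is part of why the approach feels fragile.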

> Add WITH HEADER option to INSERT OVERWRITE DIRECTORY
> ----------------------------------------------------
>
>                 Key: HIVE-12860
>                 URL: https://issues.apache.org/jira/browse/HIVE-12860
>             Project: Hive
>          Issue Type: New Feature
>          Components: Hive
>            Reporter: Elliot West
>            Assignee: Elliot West
>
> _As a Hive user_
> _I'd like the option to seamlessly write out a header row to file system based result sets_
> _So that I can generate reports with a specification that mandates a header row._
> h3. Motivations
> There is a significant use-case where Hive is used to construct a scheduled data processing pipeline that generates a report in HDFS for consumption by some third party (internal or external). This report may then be transferred out of the system for consumption by other tools or processes. It is not uncommon for the third party to specify that the report includes a header row at the start of the file. The current options for adding headers are difficult to use effectively and elegantly.
> h3. Acceptance criteria
> * {{INSERT OVERWRITE DIRECTORY}} commands can be invoked with an option to include a header row at the start of the result set file.
> * The header row will contain the column names derived from the accompanying {{SELECT}} query.
> * It will likely be the case that multiple tasks will be writing the final file of the query result set. In this event only the task writing the first chunk of the file should emit the header row.
> h3. Proposed HQL changes
> {code}
> 1.  INSERT OVERWRITE [LOCAL] DIRECTORY directory1
> 2.    [ROW FORMAT row_format] [STORED AS file_format]
> 3.    [WITH HEADER]
> 4.    SELECT ... FROM ...
> {code}
> It is proposed that the {{WITH HEADER}} stanza at line 3 be introduced to enable this feature.
> h3. Current workarounds
> * It is usually suggested that users set the CLI option {{hive.cli.print.header=true}} and capture the result set from standard out. However, this does not work well in scheduled, headless environments such as the Oozie Hive action. This can also push the file handling into shell scripts and complicate the process of getting the report into HDFS.
> * To keep report processing entirely within the domain of Hive, some users {{UNION}} the result of their query with a tiny table of a single row containing the header names. A synthesised rank column is used with an {{ORDER BY}} to ensure that the header is written to the very start of the file. See [this example on Stack Overflow|http://stackoverflow.com/questions/15139561/adding-column-headers-to-hive-result-set/25214480#25214480].
> h3. References
> * HIVE-138: Original request for header functionality.
> * [Hive Wiki: writing data into the file system from queries|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Writingdataintothefilesystemfromqueries].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)