Posted to issues@phoenix.apache.org by "Josh Elser (JIRA)" <ji...@apache.org> on 2019/06/03 19:43:00 UTC

[jira] [Comment Edited] (PHOENIX-5258) Add support to parse header from the input CSV file as input columns for CsvBulkLoadTool

    [ https://issues.apache.org/jira/browse/PHOENIX-5258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16854984#comment-16854984 ] 

Josh Elser edited comment on PHOENIX-5258 at 6/3/19 7:42 PM:
-------------------------------------------------------------

{quote}How does skipping the header generally work in CsvBulkLoadTool? Does it skip the first line for every input split? Isn't it possible that the same CSV file is split into two InputSplits, and our InputFormat skips the first line of each split, resulting in one fewer actual row?
{quote}
Good question! This is handled by the custom RecordReader: it unwraps the InputSplit and consumes the first record only when the split starts at the beginning of the file: [https://github.com/apache/phoenix/blob/20bc74145762d2b19e80b609bec901489accd5cb/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/PhoenixTextInputFormat.java#L60-L70]
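
In rough outline, the pattern looks like this (class and field names are illustrative, not Phoenix's actual code; see the link above for the real implementation):

{code:java}
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Delegates to LineRecordReader and drops the first line only when this
// split starts at byte 0, i.e. only in the split that actually holds the header.
public class HeaderSkippingRecordReader extends RecordReader<LongWritable, Text> {
  private final LineRecordReader delegate = new LineRecordReader();

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException {
    delegate.initialize(split, context);
    // Splits that begin mid-file cannot contain the header, so they are untouched.
    if (((FileSplit) split).getStart() == 0) {
      delegate.nextKeyValue(); // consume and discard the header line
    }
  }

  @Override public boolean nextKeyValue() throws IOException { return delegate.nextKeyValue(); }
  @Override public LongWritable getCurrentKey() { return delegate.getCurrentKey(); }
  @Override public Text getCurrentValue() { return delegate.getCurrentValue(); }
  @Override public float getProgress() throws IOException { return delegate.getProgress(); }
  @Override public void close() throws IOException { delegate.close(); }
}
{code}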

There isn't a safe way to do this unless:
 # You unwrap the InputSplit, rewind to the head of the file, and read the first line there (despite the InputSplit telling you not to do that).
 # You read the first line from all input CSV files and cache them in the job configuration.
 # You figure out a way to disallow splitting of the files at the InputFormat level (prevent a split from ever happening when this option is enabled).

I don't like option number 2. There may be issues with option number 1, but in theory it should work. For option number 3, I don't know offhand where to point: you would want to find where the MapReduce code turns input files into InputSplits and work out how that logic works (one possible hook is sketched below).
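
For what it's worth, the standard MapReduce hook for preventing splits is {{FileInputFormat#isSplitable}}: returning false keeps each file in a single split. A rough sketch of option number 3, assuming a made-up configuration key:

{code:java}
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Refuses to split input files while header parsing is enabled, so the
// header line can only ever appear at the start of a mapper's input.
public class NonSplittingTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    // "phoenix.csv.parse.header" is a hypothetical property name.
    boolean parseHeader =
        context.getConfiguration().getBoolean("phoenix.csv.parse.header", false);
    return !parseHeader && super.isSplitable(context, file);
  }
}
{code}

The trade-off is lost parallelism: each input file becomes exactly one map task.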


> Add support to parse header from the input CSV file as input columns for CsvBulkLoadTool
> ----------------------------------------------------------------------------------------
>
>                 Key: PHOENIX-5258
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-5258
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Prashant Vithani
>            Assignee: Prashant Vithani
>            Priority: Minor
>             Fix For: 4.15.0, 5.1.0
>
>         Attachments: PHOENIX-5258-4.x-HBase-1.4.001.patch, PHOENIX-5258-4.x-HBase-1.4.patch, PHOENIX-5258-master.001.patch, PHOENIX-5258-master.patch
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> Currently, CsvBulkLoadTool does not support reading a header from the input CSV and expects the content of the CSV to match the table schema. Support for a header can be added to dynamically map the input columns to the schema.
> The proposed solution is to introduce another option for the tool, `--parse-header`. If this option is passed, the input column list is constructed by reading the first line of the input CSV file.
>  * If there is only one file, read the header from the first line and generate the `ColumnInfo` list.
>  * If there are multiple files, read the header from all the files, and throw an error if the headers across files do not match.
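>
> A minimal sketch of the cross-file header check described above (method and variable names are illustrative, not the patch's actual code):
> {code:java}
> import java.io.BufferedReader;
> import java.io.IOException;
> import java.io.InputStreamReader;
> import java.nio.charset.StandardCharsets;
> import java.util.Arrays;
> import java.util.List;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public final class HeaderCheck {
>   // Reads the header line of every input file, fails fast on a mismatch,
>   // and derives the input column list from the shared header.
>   static List<String> readSharedHeader(FileSystem fs, List<Path> inputs) throws IOException {
>     String expected = null;
>     for (Path csv : inputs) {
>       try (BufferedReader r = new BufferedReader(
>           new InputStreamReader(fs.open(csv), StandardCharsets.UTF_8))) {
>         String header = r.readLine();
>         if (header == null) {
>           throw new IllegalArgumentException("Empty input file: " + csv);
>         }
>         if (expected == null) {
>           expected = header;
>         } else if (!expected.equals(header)) {
>           throw new IllegalArgumentException("CSV headers do not match: " + csv);
>         }
>       }
>     }
>     return Arrays.asList(expected.split(","));
>   }
> }
> {code}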



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)