You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "BELUGA BEHR (JIRA)" <ji...@apache.org> on 2018/03/12 20:46:00 UTC
[jira] [Commented] (HIVE-5795) Hive should be able to skip header and footer rows when reading data file for a table

    [ https://issues.apache.org/jira/browse/HIVE-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395861#comment-16395861 ] 

BELUGA BEHR commented on HIVE-5795:
-----------------------------------

This is an oldie, but a goodie.

I was looking at this ticket and trying to decide if I should propose that we drop this feature from Hive 3.0. I saw that there are some reports here of there being issues... a suggestion that this should be an "experimental" feature. For how long should that go on for?

I just ran some sample data threw this feature using Google's [NGram|http://storage.googleapis.com/books/ngrams/books/datasetsv2.html] data. I converted it to be comma separated (CSV) instead of tab separated (TSV) before loading it into HDFS. I also added a header to each file in the Hive table: {{ngram,year,match_count,volume_count}}.
{code:sql}
-- 2.2 GB / 1 File
CREATE EXTERNAL TABLE ngram1 (
  ngram string,
  year int,
  match_count int,
  volume_count int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/admin/ngram1';

-- 2.2 GB / 1 File
CREATE EXTERNAL TABLE ngram2 (
  ngram string,
  year int,
  match_count int,
  volume_count int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/admin/ngram2'
tblproperties ("skip.header.line.count"="1");

-- 5 Files / Each file has 1 header + 9 rows of data (10 total rows)
CREATE EXTERNAL TABLE ngram3 (
  ngram string,
  year int,
  match_count int,
  volume_count int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/admin/ngram3'
tblproperties ("skip.header.line.count"="1");
{code}
{code:java|title=select count(1) from ngram1}
INFO  : Compiling command(queryId=hive_20180312132727_5a75ce57-53e5-4ab2-9b97-31f8b3502969): select count(1) from ngram1
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:bigint, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20180312132727_5a75ce57-53e5-4ab2-9b97-31f8b3502969); Time taken: 0.047 seconds
INFO  : Executing command(queryId=hive_20180312132727_5a75ce57-53e5-4ab2-9b97-31f8b3502969): select count(1) from ngram1
INFO  : Query ID = hive_20180312132727_5a75ce57-53e5-4ab2-9b97-31f8b3502969
INFO  : Total jobs = 1
INFO  : Launching Job 1 out of 1
INFO  : Starting task [Stage-1:MAPRED] in serial mode
INFO  : Number of reduce tasks determined at compile time: 1
INFO  : In order to change the average load for a reducer (in bytes):
INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
INFO  : In order to limit the maximum number of reducers:
INFO  :   set hive.exec.reducers.max=<number>
INFO  : In order to set a constant number of reducers:
INFO  :   set mapreduce.job.reduces=<number>
INFO  : number of splits:9
INFO  : Submitting tokens for job: job_1520871396884_0011
INFO  : The url to track the job: http://myhost:8088/proxy/application_1520871396884_0011/
INFO  : Starting Job = job_1520871396884_0011, Tracking URL = http://myhost:8088/proxy/application_1520871396884_0011/
INFO  : Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1520871396884_0011
INFO  : Hadoop job information for Stage-1: number of mappers: 9; number of reducers: 1
INFO  : 2018-03-12 13:27:32,281 Stage-1 map = 0%,  reduce = 0%
INFO  : 2018-03-12 13:27:51,424 Stage-1 map = 11%,  reduce = 0%, Cumulative CPU 8.69 sec
INFO  : 2018-03-12 13:27:53,520 Stage-1 map = 70%,  reduce = 0%, Cumulative CPU 91.66 sec
INFO  : 2018-03-12 13:27:54,553 Stage-1 map = 85%,  reduce = 0%, Cumulative CPU 93.15 sec
INFO  : 2018-03-12 13:27:55,586 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 97.5 sec
INFO  : 2018-03-12 13:28:01,793 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 100.0 sec
INFO  : MapReduce Total cumulative CPU time: 1 minutes 40 seconds 0 msec
INFO  : Ended Job = job_1520871396884_0011
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 9  Reduce: 1   Cumulative CPU: 100.0 sec   HDFS Read: 2317486328 HDFS Write: 10 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 1 minutes 40 seconds 0 msec
INFO  : Completed executing command(queryId=hive_20180312132727_5a75ce57-53e5-4ab2-9b97-31f8b3502969); Time taken: 36.666 seconds
INFO  : OK

-- Result: 110075425
{code}
{code:java|title=select count(1) from ngram2}
INFO  : Compiling command(queryId=hive_20180312133030_804a096b-f5d8-46c8-a255-4a958e6d2f0d): select count(1) from ngram2
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:bigint, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20180312133030_804a096b-f5d8-46c8-a255-4a958e6d2f0d); Time taken: 0.046 seconds
INFO  : Executing command(queryId=hive_20180312133030_804a096b-f5d8-46c8-a255-4a958e6d2f0d): select count(1) from ngram2
INFO  : Query ID = hive_20180312133030_804a096b-f5d8-46c8-a255-4a958e6d2f0d
INFO  : Total jobs = 1
INFO  : Launching Job 1 out of 1
INFO  : Starting task [Stage-1:MAPRED] in serial mode
INFO  : Number of reduce tasks determined at compile time: 1
INFO  : In order to change the average load for a reducer (in bytes):
INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
INFO  : In order to limit the maximum number of reducers:
INFO  :   set hive.exec.reducers.max=<number>
INFO  : In order to set a constant number of reducers:
INFO  :   set mapreduce.job.reduces=<number>
INFO  : number of splits:9
INFO  : Submitting tokens for job: job_1520871396884_0012
INFO  : The url to track the job: http://myhost:8088/proxy/application_1520871396884_0012/
INFO  : Starting Job = job_1520871396884_0012, Tracking URL = http://myhost:8088/proxy/application_1520871396884_0012/
INFO  : Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1520871396884_0012
INFO  : Hadoop job information for Stage-1: number of mappers: 9; number of reducers: 1
INFO  : 2018-03-12 13:30:17,376 Stage-1 map = 0%,  reduce = 0%
INFO  : 2018-03-12 13:30:36,432 Stage-1 map = 11%,  reduce = 0%, Cumulative CPU 8.83 sec
INFO  : 2018-03-12 13:30:37,478 Stage-1 map = 44%,  reduce = 0%, Cumulative CPU 39.6 sec
INFO  : 2018-03-12 13:30:38,513 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU 75.89 sec
INFO  : 2018-03-12 13:30:39,543 Stage-1 map = 85%,  reduce = 0%, Cumulative CPU 87.82 sec
INFO  : 2018-03-12 13:30:40,573 Stage-1 map = 93%,  reduce = 0%, Cumulative CPU 89.99 sec
INFO  : 2018-03-12 13:30:41,604 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 91.79 sec
INFO  : 2018-03-12 13:30:46,771 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 94.58 sec
INFO  : MapReduce Total cumulative CPU time: 1 minutes 34 seconds 580 msec
INFO  : Ended Job = job_1520871396884_0012
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 9  Reduce: 1   Cumulative CPU: 94.58 sec   HDFS Read: 2317486734 HDFS Write: 10 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 1 minutes 34 seconds 580 msec
INFO  : Completed executing command(queryId=hive_20180312133030_804a096b-f5d8-46c8-a255-4a958e6d2f0d); Time taken: 35.149 seconds
INFO  : OK

-- Result: 110075424
{code}
{code:java|title=select count(1) from ngram3}
INFO  : Executing command(queryId=hive_20180312133232_a10fcc3c-d8d6-4377-86b5-c76e9cdc9c8e): select count(1) from ngram3
INFO  : Query ID = hive_20180312133232_a10fcc3c-d8d6-4377-86b5-c76e9cdc9c8e
INFO  : Total jobs = 1
INFO  : Launching Job 1 out of 1
INFO  : Starting task [Stage-1:MAPRED] in serial mode
INFO  : Number of reduce tasks determined at compile time: 1
INFO  : In order to change the average load for a reducer (in bytes):
INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
INFO  : In order to limit the maximum number of reducers:
INFO  :   set hive.exec.reducers.max=<number>
INFO  : In order to set a constant number of reducers:
INFO  :   set mapreduce.job.reduces=<number>
INFO  : number of splits:1
INFO  : Submitting tokens for job: job_1520871396884_0013
INFO  : The url to track the job: http://myhost:8088/proxy/application_1520871396884_0013/
INFO  : Starting Job = job_1520871396884_0013, Tracking URL = http://myhost:8088/proxy/application_1520871396884_0013/
INFO  : Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1520871396884_0013
INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
INFO  : 2018-03-12 13:32:21,292 Stage-1 map = 0%,  reduce = 0%
INFO  : 2018-03-12 13:32:26,519 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.44 sec
INFO  : 2018-03-12 13:32:32,755 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.0 sec
INFO  : MapReduce Total cumulative CPU time: 4 seconds 0 msec
INFO  : Ended Job = job_1520871396884_0013
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.0 sec   HDFS Read: 9229 HDFS Write: 3 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 4 seconds 0 msec
INFO  : Completed executing command(queryId=hive_20180312133232_a10fcc3c-d8d6-4377-86b5-c76e9cdc9c8e); Time taken: 17.337 seconds
INFO  : OK

-- Result: 45
{code}
From this little test, we can see that the *skip.header.line.count* seems to be working correctly. For a large file with 9 splits/mapper, we can see that it correctly removed only 1 row from the 'count' results. Hive did not process the entire file in one Mapper; it correctly processed in parallel. For a data set with 5 files (10 rows each file), Hive correctly trimmed off one row from each file, for a total of 45 row of content, across multiple files in the context of a single Mapper.

And yet, I do no see any [documentation|https://www.google.com/search?q=site:apache.org+%22skip.header.line.count%22] covering this feature, except in some [Impala documentation|https://impala.apache.org/docs/build/html/topics/impala_txtfile.html].

Is this feature still useful or should the feature be dropped and the guidance would be to do some sort of pre-processing before loading into HDFS? If this feature is being kept, it should be better documented.

> Hive should be able to skip header and footer rows when reading data file for a table
> -------------------------------------------------------------------------------------
>
>                 Key: HIVE-5795
>                 URL: https://issues.apache.org/jira/browse/HIVE-5795
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Shuaishuai Nie
>            Assignee: Shuaishuai Nie
>            Priority: Major
>              Labels: TODOC13
>             Fix For: 0.13.0
>
>         Attachments: HIVE-5795.1.patch, HIVE-5795.2.patch, HIVE-5795.3.patch, HIVE-5795.4.patch, HIVE-5795.5.patch
>
>
> Hive should be able to skip header and footer lines when reading data file from table. In this way, user don't need to processing data which generated by other application with a header or footer and directly use the file for table operations.
> To implement this, the idea is adding new properties in table descriptions to define the number of lines in header and footer and skip them when reading the record from record reader. An DDL example for creating a table with header and footer should be like this:
> {code}
> Create external table testtable (name string, message string) row format delimited fields terminated by '\t' lines terminated by '\n' location '/testtable' tblproperties ("skip.header.line.count"="1", "skip.footer.line.count"="2");
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)