Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/02/01 07:48:34 UTC

[GitHub] [hudi] GintokiYs opened a new issue #2513: [SUPPORT]Hive-Cli set hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat and query error

GintokiYs opened a new issue #2513:
URL: https://github.com/apache/hudi/issues/2513


   **Describe the problem you faced**
   When I insert data through Hudi-Spark and synchronize it to Hive, I can use Hive-Cli to query this COW table and get the data (hudi-hadoop-mr-bundle-0.6.0 has been placed under ${HIVE_HOME}/lib).
   
   ```
   hive> select * from hudi_imp_par_mor_local_x1 where serial_no = '10000301345';
   Query ID = root_20210201150400_fbb4e52b-c41d-4d6b-b1b8-4678a6642f2d
   Total jobs = 1
   Launching Job 1 out of 1
   Number of reduce tasks is set to 0 since there's no reduce operator
   21/02/01 15:04:02 INFO client.RMProxy: Connecting to ResourceManager at node103/10.20.29.103:8032
   21/02/01 15:04:02 INFO client.RMProxy: Connecting to ResourceManager at node103/10.20.29.103:8032
   Starting Job = job_1611822796186_0064, Tracking URL = http://node103:8088/proxy/application_1611822796186_0064/
   Kill Command = /opt/cloudera/parcels/CDH-6.2.1-1.cdh6.2.1.p0.1425774/lib/hadoop/bin/hadoop job  -kill job_1611822796186_0064
   Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
   2021-02-01 15:04:10,216 Stage-1 map = 0%,  reduce = 0%
   2021-02-01 15:04:17,510 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 6.01 sec
   MapReduce Total cumulative CPU time: 6 seconds 10 msec
   Ended Job = job_1611822796186_0064
   MapReduce Jobs Launched:
   Stage-Stage-1: Map: 1   Cumulative CPU: 6.01 sec   HDFS Read: 10327962 HDFS Write: 397 HDFS EC Read: 0 SUCCESS
   Total MapReduce CPU Time Spent: 6 seconds 10 msec
   OK
   20210201145958  20210201145958_0_9      10000301345/001942775096/2      20190909        e3332789-77e5-4e6b-a0cd-24e87814c572-0_0-6-8_20210201145958.parquet     10000301345     NULL    20190505        001942775096   2       251942775095    1942775095      401345  A       222     223     02      301346  NULL    NULL    NULL    NULL    NULL    25      1612162791775   10000301345/001942775096/2      20190909
   Time taken: 17.764 seconds, Fetched: 1 row(s)
   ```
   But when I **set hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat**, I encountered the following error.
   ```
   hive> set hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat;
   hive> select * from hudi_imp_par_mor_local_x1 where serial_no = '10000301345';
   Query ID = root_20210201151617_b6b7ee22-cc7a-4e22-b318-2e952d74e8dc
   Total jobs = 1
   Launching Job 1 out of 1
   Number of reduce tasks is set to 0 since there's no reduce operator
   21/02/01 15:16:17 INFO client.RMProxy: Connecting to ResourceManager at node103/10.20.29.103:8032
   21/02/01 15:16:17 INFO client.RMProxy: Connecting to ResourceManager at node103/10.20.29.103:8032
   21/02/01 15:16:18 INFO utils.HoodieInputFormatUtils: Reading hoodie metadata from path hdfs://nameservice1/tmp/hudi/db1/hudi_imp_par_mor_local_x1
   21/02/01 15:16:18 INFO table.HoodieTableMetaClient: Loading HoodieTableMetaClient from hdfs://nameservice1/tmp/hudi/db1/hudi_imp_par_mor_local_x1
   21/02/01 15:16:18 INFO fs.FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://nameservice1], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@78d6692f, file:/etc/hive/conf.cloudera.hive/hive-site.xml], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_815164917_1, ugi=root (auth:SIMPLE)]]]
   21/02/01 15:16:18 INFO table.HoodieTableConfig: Loading table properties from hdfs://nameservice1/tmp/hudi/db1/hudi_imp_par_mor_local_x1/.hoodie/hoodie.properties
   21/02/01 15:16:18 INFO table.HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from hdfs://nameservice1/tmp/hudi/db1/hudi_imp_par_mor_local_x1
   21/02/01 15:16:18 INFO hadoop.HoodieParquetInputFormat: Found a total of 1 groups
   21/02/01 15:16:18 INFO timeline.HoodieActiveTimeline: Loaded instants [[20210201145958__commit__COMPLETED], [20210201150644__commit__COMPLETED]]
   21/02/01 15:16:18 INFO view.HoodieTableFileSystemView: Adding file-groups for partition :20190909, #FileGroups=1
   21/02/01 15:16:18 INFO view.HoodieTableFileSystemView: Adding file-groups for partition :20180909, #FileGroups=2
   21/02/01 15:16:18 INFO view.HoodieTableFileSystemView: Adding file-groups for partition :20181230, #FileGroups=1
   21/02/01 15:16:18 INFO view.AbstractTableFileSystemView: addFilesToView: NumFiles=8, NumFileGroups=4, FileGroupsCreationTime=10, StoreTimeTaken=4
   21/02/01 15:16:18 INFO utils.HoodieInputFormatUtils: Total paths to process after hoodie filter 4
   Starting Job = job_1611822796186_0067, Tracking URL = http://node103:8088/proxy/application_1611822796186_0067/
   Kill Command = /opt/cloudera/parcels/CDH-6.2.1-1.cdh6.2.1.p0.1425774/lib/hadoop/bin/hadoop job  -kill job_1611822796186_0067
   Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 0
   2021-02-01 15:16:25,401 Stage-1 map = 0%,  reduce = 0%
   2021-02-01 15:16:50,198 Stage-1 map = 100%,  reduce = 0%
   Ended Job = job_1611822796186_0067 with errors
   Error during job, obtaining debugging information...
   Examining task ID: task_1611822796186_0067_m_000003 (and more) from job job_1611822796186_0067
   Examining task ID: task_1611822796186_0067_m_000002 (and more) from job job_1611822796186_0067
   
   Task with the most failures(4):
   -----
   Task ID:
     task_1611822796186_0067_m_000000
   
   URL:
     http://node103:8088/taskdetails.jsp?jobid=job_1611822796186_0067&tipid=task_1611822796186_0067_m_000000
   -----
   Diagnostic Messages for this Task:
   Error: java.lang.RuntimeException: java.lang.NullPointerException
           at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:169)
           at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
           at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:465)
           at org.apache.hadoop.mapred.MapTask.run(MapTask.java:349)
           at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
           at java.security.AccessController.doPrivileged(Native Method)
           at javax.security.auth.Subject.doAs(Subject.java:422)
           at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
           at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
   Caused by: java.lang.NullPointerException
           at org.apache.hadoop.hive.ql.exec.AbstractMapOperator.getNominalPath(AbstractMapOperator.java:101)
           at org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp(MapOperator.java:447)
           at org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1109)
           at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:477)
           at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:160)
           ... 8 more
   
   
   FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
   MapReduce Jobs Launched:
   Stage-Stage-1: Map: 4   HDFS Read: 0 HDFS Write: 0 HDFS EC Read: 0 FAIL
   Total MapReduce CPU Time Spent: 0 msec
   ```
   
   **Environment Description**
   
   * Hudi version :0.6.0
   
   * Spark version :2.4.0+cdh6.2.1
   
   * Hive version :2.1.1+cdh6.2.1
   
   * Hadoop version :3.0.0+cdh6.2.1
   
   * Storage (HDFS/S3/GCS..) :HDFS
   
   * Running on Docker? (yes/no) :no
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] n3nash commented on issue #2513: [SUPPORT]Hive-Cli set hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat and query error

Posted by GitBox <gi...@apache.org>.
n3nash commented on issue #2513:
URL: https://github.com/apache/hudi/issues/2513#issuecomment-771416134


   @GintokiYs You should not set the Hive input format that way.
   
   Set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat instead. As long as your table is registered as a Hudi table in the Hive Metastore (its input format there should be `org.apache.hudi.hadoop.HoodieParquetInputFormat`), HiveInputFormat will automatically find the Hoodie input format and use it.
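   A minimal Hive-Cli sketch of this suggestion (table name taken from the thread; `SHOW CREATE TABLE` is just one way to verify the registered input format):
   
   ```
   -- Use the generic HiveInputFormat; when the table is registered with
   -- org.apache.hudi.hadoop.HoodieParquetInputFormat, it delegates to it automatically.
   set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
   
   -- Verify how the table is registered in the Hive Metastore: the INPUTFORMAT
   -- clause should show org.apache.hudi.hadoop.HoodieParquetInputFormat.
   show create table hudi_imp_par_mor_local_x1;
   
   select * from hudi_imp_par_mor_local_x1 where serial_no = '10000301345';
   ```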
   
   





[GitHub] [hudi] n3nash closed issue #2513: [SUPPORT]Hive-Cli set hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat and query error

Posted by GitBox <gi...@apache.org>.
n3nash closed issue #2513:
URL: https://github.com/apache/hudi/issues/2513


   





[GitHub] [hudi] GintokiYs commented on issue #2513: [SUPPORT]Hive-Cli set hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat and query error

Posted by GitBox <gi...@apache.org>.
GintokiYs commented on issue #2513:
URL: https://github.com/apache/hudi/issues/2513#issuecomment-771453564


   @n3nash Thank you for your reply.
   When I update data in the Hudi table, a Hive-Cli query returns two records with the same primary key, while the same query through Spark-SQL behaves correctly (only one record).
   I want to know how to keep the stale historical record out of Hive-Cli query results.
   The following output is from the Hive-Cli query; (10000301345/001942775096/2) is one of my composite primary keys.
   ```
   hive> select * from hudi_imp_par_mor_local_x1 where serial_no = '10000301345';
   Query ID = root_20210202160414_00dbbdc9-5d2a-490a-b5ba-dcdccf2c8c1b
   Total jobs = 1
   Launching Job 1 out of 1
   Number of reduce tasks is set to 0 since there's no reduce operator
   21/02/02 16:04:14 INFO client.RMProxy: Connecting to ResourceManager at node103/10.20.29.103:8032
   21/02/02 16:04:14 INFO client.RMProxy: Connecting to ResourceManager at node103/10.20.29.103:8032
   Starting Job = job_1611822796186_0114, Tracking URL = http://node103:8088/proxy/application_1611822796186_0114/
   Kill Command = /opt/cloudera/parcels/CDH-6.2.1-1.cdh6.2.1.p0.1425774/lib/hadoop/bin/hadoop job  -kill job_1611822796186_0114
   Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
   2021-02-02 16:04:22,116 Stage-1 map = 0%,  reduce = 0%
   2021-02-02 16:04:30,370 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 6.91 sec
   MapReduce Total cumulative CPU time: 6 seconds 910 msec
   Ended Job = job_1611822796186_0114
   MapReduce Jobs Launched:
   Stage-Stage-1: Map: 1   Cumulative CPU: 6.91 sec   HDFS Read: 20638550 HDFS Write: 711 HDFS EC Read: 0 SUCCESS
   Total MapReduce CPU Time Spent: 6 seconds 910 msec
   OK
   20210201150644  20210201150644_0_178    10000301345/001942775096/2      20190909        e3332789-77e5-4e6b-a0cd-24e87814c572-0_0-22-53_20210201150644.parquet   10000301345     NULL    20190505        001942775096     2       251942775095    1942775095      401345  D       222     223     02      301346  NULL    NULL    NULL    NULL    NULL    25      1612163195163   10000301345/001942775096/2      20190909
   20210201145958  20210201145958_0_9      10000301345/001942775096/2      20190909        e3332789-77e5-4e6b-a0cd-24e87814c572-0_0-6-8_20210201145958.parquet     10000301345     NULL    20190505        001942775096     2       251942775095    1942775095      401345  A       222     223     02      301346  NULL    NULL    NULL    NULL    NULL    25      1612162791775   10000301345/001942775096/2      20190909
   Time taken: 17.288 seconds, Fetched: 2 row(s)
   ```





[GitHub] [hudi] nsivabalan commented on issue #2513: [SUPPORT]Hive-Cli set hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat and query error

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2513:
URL: https://github.com/apache/hudi/issues/2513#issuecomment-774502966


   @GintokiYs: a few quick questions while Nishith follows up on your ticket.
   
   - Were you running an older version of Hudi and encountered this after an upgrade? In other words, were you able to run successfully on an older Hudi version, and hit this issue with 0.7.0?
   - Is this affecting your production? Trying to gauge the severity.
   - Or are you trying out a POC, and this is your first time trying out Hudi?





[GitHub] [hudi] nsivabalan edited a comment on issue #2513: [SUPPORT]Hive-Cli set hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat and query error

Posted by GitBox <gi...@apache.org>.
nsivabalan edited a comment on issue #2513:
URL: https://github.com/apache/hudi/issues/2513#issuecomment-810434740


   @GintokiYs : once you respond, can you please remove the "awaiting-user-response" label from the issue. If possible, add the "awaiting-community-help" label.
   








[GitHub] [hudi] n3nash commented on issue #2513: [SUPPORT]Hive-Cli set hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat and query error

Posted by GitBox <gi...@apache.org>.
n3nash commented on issue #2513:
URL: https://github.com/apache/hudi/issues/2513#issuecomment-824538766


   @GintokiYs Closing this ticket due to inactivity. If you continue to see this issue, please re-open.











[GitHub] [hudi] n3nash commented on issue #2513: [SUPPORT]Hive-Cli set hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat and query error

Posted by GitBox <gi...@apache.org>.
n3nash commented on issue #2513:
URL: https://github.com/apache/hudi/issues/2513#issuecomment-813863772


   @GintokiYs Gentle ping on my comment above








[GitHub] [hudi] n3nash commented on issue #2513: [SUPPORT]Hive-Cli set hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat and query error

Posted by GitBox <gi...@apache.org>.
n3nash commented on issue #2513:
URL: https://github.com/apache/hudi/issues/2513#issuecomment-809997181


   @GintokiYs Sorry for the delayed response. Can you please check the partition paths of the two records with the same record key: are they the same or different? Also, are you able to reproduce this case with the docker quickstart utils (https://hudi.apache.org/docs/docker_demo.html)?
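   For example, the partition paths could be compared via Hudi's meta columns (a sketch; this assumes the standard `_hoodie_*` meta columns are present on the synced table):
   
   ```
   -- _hoodie_partition_path shows whether the duplicate rows landed in the
   -- same partition; _hoodie_file_name shows which base file each came from.
   select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, _hoodie_file_name
   from hudi_imp_par_mor_local_x1
   where serial_no = '10000301345';
   ```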





[GitHub] [hudi] nsivabalan commented on issue #2513: [SUPPORT]Hive-Cli set hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat and query error

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2513:
URL: https://github.com/apache/hudi/issues/2513#issuecomment-809541159


   @n3nash : user is awaiting your response. 

