Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/02/01 07:48:34 UTC
[GitHub] [hudi] GintokiYs opened a new issue #2513: [SUPPORT]Hive-Cli set hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat and query error
GintokiYs opened a new issue #2513:
URL: https://github.com/apache/hudi/issues/2513
**Describe the problem you faced**
When I insert data through Hudi-Spark and synchronize the data to Hive, I can use Hive-CLI to query this COW table and get the data (hudi-hadoop-mr-bundle-0.6.0 has been placed under ${HIVE_HOME}/lib):
```
hive> select * from hudi_imp_par_mor_local_x1 where serial_no = '10000301345';
Query ID = root_20210201150400_fbb4e52b-c41d-4d6b-b1b8-4678a6642f2d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
21/02/01 15:04:02 INFO client.RMProxy: Connecting to ResourceManager at node103/10.20.29.103:8032
21/02/01 15:04:02 INFO client.RMProxy: Connecting to ResourceManager at node103/10.20.29.103:8032
Starting Job = job_1611822796186_0064, Tracking URL = http://node103:8088/proxy/application_1611822796186_0064/
Kill Command = /opt/cloudera/parcels/CDH-6.2.1-1.cdh6.2.1.p0.1425774/lib/hadoop/bin/hadoop job -kill job_1611822796186_0064
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2021-02-01 15:04:10,216 Stage-1 map = 0%, reduce = 0%
2021-02-01 15:04:17,510 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 6.01 sec
MapReduce Total cumulative CPU time: 6 seconds 10 msec
Ended Job = job_1611822796186_0064
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 6.01 sec HDFS Read: 10327962 HDFS Write: 397 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 10 msec
OK
20210201145958 20210201145958_0_9 10000301345/001942775096/2 20190909 e3332789-77e5-4e6b-a0cd-24e87814c572-0_0-6-8_20210201145958.parquet 10000301345 NULL 20190505 001942775096 2 251942775095 1942775095 401345 A 222 223 02 301346 NULL NULL NULL NULL NULL 25 1612162791775 10000301345/001942775096/2 20190909
Time taken: 17.764 seconds, Fetched: 1 row(s)
```
But when I **set hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat**, I encountered the following error.
```
hive> set hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat;
hive> select * from hudi_imp_par_mor_local_x1 where serial_no = '10000301345';
Query ID = root_20210201151617_b6b7ee22-cc7a-4e22-b318-2e952d74e8dc
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
21/02/01 15:16:17 INFO client.RMProxy: Connecting to ResourceManager at node103/10.20.29.103:8032
21/02/01 15:16:17 INFO client.RMProxy: Connecting to ResourceManager at node103/10.20.29.103:8032
21/02/01 15:16:18 INFO utils.HoodieInputFormatUtils: Reading hoodie metadata from path hdfs://nameservice1/tmp/hudi/db1/hudi_imp_par_mor_local_x1
21/02/01 15:16:18 INFO table.HoodieTableMetaClient: Loading HoodieTableMetaClient from hdfs://nameservice1/tmp/hudi/db1/hudi_imp_par_mor_local_x1
21/02/01 15:16:18 INFO fs.FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://nameservice1], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@78d6692f, file:/etc/hive/conf.cloudera.hive/hive-site.xml], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_815164917_1, ugi=root (auth:SIMPLE)]]]
21/02/01 15:16:18 INFO table.HoodieTableConfig: Loading table properties from hdfs://nameservice1/tmp/hudi/db1/hudi_imp_par_mor_local_x1/.hoodie/hoodie.properties
21/02/01 15:16:18 INFO table.HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from hdfs://nameservice1/tmp/hudi/db1/hudi_imp_par_mor_local_x1
21/02/01 15:16:18 INFO hadoop.HoodieParquetInputFormat: Found a total of 1 groups
21/02/01 15:16:18 INFO timeline.HoodieActiveTimeline: Loaded instants [[20210201145958__commit__COMPLETED], [20210201150644__commit__COMPLETED]]
21/02/01 15:16:18 INFO view.HoodieTableFileSystemView: Adding file-groups for partition :20190909, #FileGroups=1
21/02/01 15:16:18 INFO view.HoodieTableFileSystemView: Adding file-groups for partition :20180909, #FileGroups=2
21/02/01 15:16:18 INFO view.HoodieTableFileSystemView: Adding file-groups for partition :20181230, #FileGroups=1
21/02/01 15:16:18 INFO view.AbstractTableFileSystemView: addFilesToView: NumFiles=8, NumFileGroups=4, FileGroupsCreationTime=10, StoreTimeTaken=4
21/02/01 15:16:18 INFO utils.HoodieInputFormatUtils: Total paths to process after hoodie filter 4
Starting Job = job_1611822796186_0067, Tracking URL = http://node103:8088/proxy/application_1611822796186_0067/
Kill Command = /opt/cloudera/parcels/CDH-6.2.1-1.cdh6.2.1.p0.1425774/lib/hadoop/bin/hadoop job -kill job_1611822796186_0067
Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 0
2021-02-01 15:16:25,401 Stage-1 map = 0%, reduce = 0%
2021-02-01 15:16:50,198 Stage-1 map = 100%, reduce = 0%
Ended Job = job_1611822796186_0067 with errors
Error during job, obtaining debugging information...
Examining task ID: task_1611822796186_0067_m_000003 (and more) from job job_1611822796186_0067
Examining task ID: task_1611822796186_0067_m_000002 (and more) from job job_1611822796186_0067
Task with the most failures(4):
-----
Task ID:
task_1611822796186_0067_m_000000
URL:
http://node103:8088/taskdetails.jsp?jobid=job_1611822796186_0067&tipid=task_1611822796186_0067_m_000000
-----
Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: java.lang.NullPointerException
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:169)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:465)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:349)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hive.ql.exec.AbstractMapOperator.getNominalPath(AbstractMapOperator.java:101)
at org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp(MapOperator.java:447)
at org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1109)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:477)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:160)
... 8 more
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: Map: 4 HDFS Read: 0 HDFS Write: 0 HDFS EC Read: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
```
**Environment Description**
* Hudi version :0.6.0
* Spark version :2.4.0+cdh6.2.1
* Hive version :2.1.1+cdh6.2.1
* Hadoop version :3.0.0+cdh6.2.1
* Storage (HDFS/S3/GCS..) :HDFS
* Running on Docker? (yes/no) :no
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] n3nash commented on issue #2513: [SUPPORT]Hive-Cli set hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat and query error
Posted by GitBox <gi...@apache.org>.
n3nash commented on issue #2513:
URL: https://github.com/apache/hudi/issues/2513#issuecomment-771416134
@GintokiYs You should not set the Hive input format that way.
Set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat instead. As long as your table is registered as a Hudi table and the Hive Metastore shows its input format as `org.apache.hudi.hadoop.HoodieParquetInputFormat`, HiveInputFormat will automatically find the HoodieParquetInputFormat and use it.
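A minimal sketch of the recommended session setup, using the table name from the thread (adjust names for your environment):

```sql
-- Use Hive's generic input format; for a table registered with
-- HoodieParquetInputFormat, HiveInputFormat delegates to it automatically.
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

-- Verify how the table is registered in the Hive Metastore; the
-- "InputFormat:" line should show org.apache.hudi.hadoop.HoodieParquetInputFormat.
describe formatted hudi_imp_par_mor_local_x1;

select * from hudi_imp_par_mor_local_x1 where serial_no = '10000301345';
```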
[GitHub] [hudi] n3nash closed issue #2513: [SUPPORT]Hive-Cli set hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat and query error
Posted by GitBox <gi...@apache.org>.
n3nash closed issue #2513:
URL: https://github.com/apache/hudi/issues/2513
[GitHub] [hudi] GintokiYs commented on issue #2513: [SUPPORT]Hive-Cli set hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat and query error
Posted by GitBox <gi...@apache.org>.
GintokiYs commented on issue #2513:
URL: https://github.com/apache/hudi/issues/2513#issuecomment-771453564
@n3nash Thank you for your reply.
When I update the data in the Hudi table, a Hive-CLI query returns two records with the same primary key, while the same query through Spark SQL is correct (only one record).
I want to know how to resolve this problem of historical data appearing in Hive-CLI queries.
The following is the result of the Hive-CLI query, where (10000301345/001942775096/2) is one of my composite primary keys.
```
hive> select * from hudi_imp_par_mor_local_x1 where serial_no = '10000301345';
Query ID = root_20210202160414_00dbbdc9-5d2a-490a-b5ba-dcdccf2c8c1b
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
21/02/02 16:04:14 INFO client.RMProxy: Connecting to ResourceManager at node103/10.20.29.103:8032
21/02/02 16:04:14 INFO client.RMProxy: Connecting to ResourceManager at node103/10.20.29.103:8032
Starting Job = job_1611822796186_0114, Tracking URL = http://node103:8088/proxy/application_1611822796186_0114/
Kill Command = /opt/cloudera/parcels/CDH-6.2.1-1.cdh6.2.1.p0.1425774/lib/hadoop/bin/hadoop job -kill job_1611822796186_0114
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2021-02-02 16:04:22,116 Stage-1 map = 0%, reduce = 0%
2021-02-02 16:04:30,370 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 6.91 sec
MapReduce Total cumulative CPU time: 6 seconds 910 msec
Ended Job = job_1611822796186_0114
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 6.91 sec HDFS Read: 20638550 HDFS Write: 711 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 910 msec
OK
20210201150644 20210201150644_0_178 10000301345/001942775096/2 20190909 e3332789-77e5-4e6b-a0cd-24e87814c572-0_0-22-53_20210201150644.parquet 10000301345 NULL 20190505 001942775096 2 251942775095 1942775095 401345 D 222 223 02 301346 NULL NULL NULL NULL NULL 25 1612163195163 10000301345/001942775096/2 20190909
20210201145958 20210201145958_0_9 10000301345/001942775096/2 20190909 e3332789-77e5-4e6b-a0cd-24e87814c572-0_0-6-8_20210201145958.parquet 10000301345 NULL 20190505 001942775096 2 251942775095 1942775095 401345 A 222 223 02 301346 NULL NULL NULL NULL NULL 25 1612162791775 10000301345/001942775096/2 20190909
Time taken: 17.288 seconds, Fetched: 2 row(s)
```
[GitHub] [hudi] nsivabalan commented on issue #2513: [SUPPORT]Hive-Cli set hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat and query error
Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2513:
URL: https://github.com/apache/hudi/issues/2513#issuecomment-774502966
@GintokiYs : a few quick questions as Nishith follows up on your ticket.
- Were you running an older version of Hudi and encountered this after an upgrade? In other words, were you able to run successfully on an older Hudi version and only hit this issue with 0.7.0?
- Is this affecting your production? Trying to gauge the severity.
- Or are you trying out a POC, and this is your first time trying Hudi?
[GitHub] [hudi] nsivabalan edited a comment on issue #2513: [SUPPORT]Hive-Cli set hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat and query error
Posted by GitBox <gi...@apache.org>.
nsivabalan edited a comment on issue #2513:
URL: https://github.com/apache/hudi/issues/2513#issuecomment-810434740
@GintokiYs : once you respond, can you please remove the "awaiting-user-response" label from the issue? If possible, add the "awaiting-community-help" label.
[GitHub] [hudi] n3nash commented on issue #2513: [SUPPORT]Hive-Cli set hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat and query error
Posted by GitBox <gi...@apache.org>.
n3nash commented on issue #2513:
URL: https://github.com/apache/hudi/issues/2513#issuecomment-824538766
@GintokiYs Closing this ticket due to inactivity. If you continue to see this issue, please re-open.
[GitHub] [hudi] n3nash commented on issue #2513: [SUPPORT]Hive-Cli set hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat and query error
Posted by GitBox <gi...@apache.org>.
n3nash commented on issue #2513:
URL: https://github.com/apache/hudi/issues/2513#issuecomment-813863772
@GintokiYs Gentle ping on my comment above
[GitHub] [hudi] n3nash commented on issue #2513: [SUPPORT]Hive-Cli set hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat and query error
Posted by GitBox <gi...@apache.org>.
n3nash commented on issue #2513:
URL: https://github.com/apache/hudi/issues/2513#issuecomment-809997181
@GintokiYs Sorry for the delayed response. Can you please check the partition paths of the 2 records with the same record key: are they the same or different? Also, are you able to reproduce this case in the Docker quickstart utils (https://hudi.apache.org/docs/docker_demo.html)?
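A query along these lines (a sketch using the Hudi metadata columns already visible in the outputs above) can show whether the duplicates share a partition path:

```sql
-- _hoodie_record_key and _hoodie_partition_path are Hudi's metadata columns;
-- the same record key appearing under different partition paths would explain
-- why two rows survive for one logical record.
select _hoodie_record_key, _hoodie_partition_path, _hoodie_commit_time
from hudi_imp_par_mor_local_x1
where serial_no = '10000301345';
```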
[GitHub] [hudi] nsivabalan commented on issue #2513: [SUPPORT]Hive-Cli set hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat and query error
Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2513:
URL: https://github.com/apache/hudi/issues/2513#issuecomment-809541159
@n3nash : the user is awaiting your response.