Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/08/15 16:12:48 UTC

[GitHub] [hudi] affei opened a new issue #3478: [SUPPORT] Unexpected Hive behaviour

affei opened a new issue #3478:
URL: https://github.com/apache/hudi/issues/3478


   **Describe the problem you faced**
   
   Using `set hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat` gives an error:
   
   `Error: java.lang.RuntimeException: java.lang.IndexOutOfBoundsException: Index: 3, Size: 3 at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.<init>(VectorizedParquetRecordReader.java:128) at org.apache.hadoop.hive.ql.io.parquet.VectorizedParquetInputFormat.getRecordReader(VectorizedParquetInputFormat.java:41) at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:69) at org.apache.hudi.hadoop.HoodieParquetInputFormat.getRecordReader(HoodieParquetInputFormat.java:216) at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:169) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164) Caused by: java.lang.IndexOutOfBoundsException: Index: 3, Size: 3 at java.util.ArrayList.rangeCheck(ArrayList.java:659) at java.util.ArrayList.get(ArrayList.java:435) at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.getProjectedGroupFields(DataWritableReadSupport.java:121) at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.getSchemaByName(DataWritableReadSupport.java:181) at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:369) at org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.getSplit(ParquetRecordReaderBase.java:84) at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.<init>(VectorizedParquetRecordReader.java:122)`
   
   With `hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat` there is no error, but the result includes data from all files, not only those of the latest commits.
   
   The problem appears only with aggregates. A query like `select * from orders` does not fail, but a query such as
   `select order_id, count(*) as co
   from orders
   group by order_id
   order by co desc`
   or just
   `select count(*) as co from orders`
   raises the error above.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. set hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat
   2. run a select with `id, count(*)` on the table; this raises the error (at least for me)
   3. set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
   4. run a select with `id, count(*)` on the table; it returns duplicated records, but id is a primary key, so there should be no duplicates. It looks like the query reads all Parquet files, not just those of the latest commits
   
   **Expected behavior**
   
   `hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat` should not raise an error and should return values without duplicates
   
   **Environment Description**
   
   * Hudi version : master
   
   * Spark version : 2.4.7
   
   * Hive version : 2.3.8
   
   * Hadoop version : 2.7.3.2.6.5.1175-1
   
   * Storage (HDFS/S3/GCS..) : HDFS
   
   * Running on Docker? (yes/no) : no
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] affei commented on issue #3478: [SUPPORT] Unexpected Hive behaviour

Posted by GitBox <gi...@apache.org>.

affei commented on issue #3478:
URL: https://github.com/apache/hudi/issues/3478#issuecomment-1023365219


   Tested with `hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat`; everything worked as expected. Closing the ticket.
   Thanks for the help!


[GitHub] [hudi] nsivabalan commented on issue #3478: [SUPPORT] Unexpected Hive behaviour

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #3478:
URL: https://github.com/apache/hudi/issues/3478#issuecomment-998386815


   @codope: do we have a tracking JIRA for this?


[GitHub] [hudi] codope commented on issue #3478: [SUPPORT] Unexpected Hive behaviour

Posted by GitBox <gi...@apache.org>.

codope commented on issue #3478:
URL: https://github.com/apache/hudi/issues/3478#issuecomment-998432591


   This should no longer be an issue. `HoodieParquetInputFormat` will not provide a real-time view. Instead, set `hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat`.
   @affei Can you confirm whether you still face the issue with `HoodieCombineHiveInputFormat`?
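
   A minimal sketch of the suggested session setting, assuming the `orders` table and the aggregate query from the original report (neither has been verified against a real cluster here):

   ```sql
   -- Use Hudi's combine input format for this Hive session.
   set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;

   -- One of the aggregate queries from the report that failed with HoodieParquetInputFormat.
   select count(*) as co from orders;
   ```

   The same property can also be passed on the command line with `--hiveconf hive.input.format=...`, as in the beeline invocations elsewhere in this thread.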


[GitHub] [hudi] nsivabalan commented on issue #3478: [SUPPORT] Unexpected Hive behaviour

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #3478:
URL: https://github.com/apache/hudi/issues/3478#issuecomment-1018642537


   @affei: not sure if this matters, but for a partitioned dataset it is the pair of partition path and record key that is unique within a given Hudi table. So the output can contain duplicate record keys across different partitions. Can you confirm that by duplicates you meant records having the same value for both partition path and record key?
   
   If you wish to have globally unique record keys, you may have to choose one of the GLOBAL index types.
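
   One way to check this, sketched under the assumption that the table is the `orders` table from the report and that it exposes Hudi's `_hoodie_partition_path` and `_hoodie_record_key` metadata columns:

   ```sql
   -- List (partition path, record key) pairs that occur more than once.
   -- An empty result means duplicate keys, if any, only occur across different partitions.
   select `_hoodie_partition_path`, `_hoodie_record_key`, count(*) as cnt
   from orders
   group by `_hoodie_partition_path`, `_hoodie_record_key`
   having count(*) > 1;
   ```

   If this returns rows while the session uses a Hudi-aware input format, the duplicates are within a single partition and are not explained by the per-partition uniqueness described above.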
   


[GitHub] [hudi] codope commented on issue #3478: [SUPPORT] Unexpected Hive behaviour

Posted by GitBox <gi...@apache.org>.

codope commented on issue #3478:
URL: https://github.com/apache/hudi/issues/3478#issuecomment-905141112


   I can reproduce this, though not with the exact stack trace. The same query runs fine with `HiveInputFormat`.
   ```
   # beeline -u jdbc:hive2://hiveserver:10000   --hiveconf hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat   --hiveconf hive.stats.autogather=false --verbose -e "select count(*) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG';"
   issuing: !connect jdbc:hive2://hiveserver:10000 '' ''
   Connecting to jdbc:hive2://hiveserver:10000
   Connected to: Apache Hive (version 2.3.3)
   Driver: Hive JDBC (version 1.2.1.spark2)
   Transaction isolation: TRANSACTION_REPEATABLE_READ
   Executing command: select count(*) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG';
   WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
   Getting log thread is interrupted, since query is done!
   Error: org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
   	at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:380)
   	at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:257)
   	at org.apache.hive.service.cli.operation.SQLOperation.access$800(SQLOperation.java:91)
   	at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:348)
   	at java.security.AccessController.doPrivileged(Native Method)
   	at javax.security.auth.Subject.doAs(Subject.java:422)
   	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1840)
   	at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:362)
   	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
   	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   	at java.lang.Thread.run(Thread.java:748) (state=08S01,code=2)
   java.sql.SQLException: org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
   	at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:380)
   	at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:257)
   	at org.apache.hive.service.cli.operation.SQLOperation.access$800(SQLOperation.java:91)
   	at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:348)
   	at java.security.AccessController.doPrivileged(Native Method)
   	at javax.security.auth.Subject.doAs(Subject.java:422)
   	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1840)
   	at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:362)
   	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
   	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   	at java.lang.Thread.run(Thread.java:748)
   
   	at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:296)
   	at org.apache.hive.beeline.Commands.execute(Commands.java:848)
   	at org.apache.hive.beeline.Commands.sql(Commands.java:713)
   	at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:973)
   	at org.apache.hive.beeline.BeeLine.initArgs(BeeLine.java:720)
   	at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:757)
   	at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:484)
   	at org.apache.hive.beeline.BeeLine.main(BeeLine.java:467)
   Closing: 0: jdbc:hive2://hiveserver:10000
   
   
   # beeline -u jdbc:hive2://hiveserver:10000   --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat   --hiveconf hive.stats.autogather=false --verbose -e "select count(*) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG';"
   issuing: !connect jdbc:hive2://hiveserver:10000 '' ''
   Connecting to jdbc:hive2://hiveserver:10000
   Connected to: Apache Hive (version 2.3.3)
   Driver: Hive JDBC (version 1.2.1.spark2)
   Transaction isolation: TRANSACTION_REPEATABLE_READ
   Executing command: select count(*) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG';
   WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
   Getting log thread is interrupted, since query is done!
   +------+--+
   | _c0  |
   +------+--+
   | 2    |
   +------+--+
   1 row selected (1.772 seconds)
   Beeline version 1.2.1.spark2 by Apache Hive
   Closing: 0: jdbc:hive2://hiveserver:10000
   ```
   
   Will check and get back.


[GitHub] [hudi] nsivabalan commented on issue #3478: [SUPPORT] Unexpected Hive behaviour

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #3478:
URL: https://github.com/apache/hudi/issues/3478#issuecomment-1020603348


   @danny0405: did you get a chance to follow up with the author directly? Wondering if we can close this due to inactivity. As suggested, `HiveInputFormat` or `HoodieCombineHiveInputFormat` should be used for `hive.input.format`.
   


[GitHub] [hudi] affei commented on issue #3478: [SUPPORT] Unexpected Hive behaviour

Posted by GitBox <gi...@apache.org>.

affei commented on issue #3478:
URL: https://github.com/apache/hudi/issues/3478#issuecomment-1022073749


   @nsivabalan To be honest, I don't remember whether it was a partitioned or non-partitioned table. We migrated to 0.10.0 last week; I can check the Hive behaviour there and come back with results.


[GitHub] [hudi] nsivabalan commented on issue #3478: [SUPPORT] Unexpected Hive behaviour

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #3478:
URL: https://github.com/apache/hudi/issues/3478#issuecomment-1008550588


   @affei: hey, any updates for us on this, please?


[GitHub] [hudi] nsivabalan commented on issue #3478: [SUPPORT] Unexpected Hive behaviour

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #3478:
URL: https://github.com/apache/hudi/issues/3478#issuecomment-1023790632


   awesome, thanks for updating! 


[GitHub] [hudi] affei closed issue #3478: [SUPPORT] Unexpected Hive behaviour

Posted by GitBox <gi...@apache.org>.

affei closed issue #3478:
URL: https://github.com/apache/hudi/issues/3478


   


[GitHub] [hudi] nsivabalan commented on issue #3478: [SUPPORT] Unexpected Hive behaviour

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #3478:
URL: https://github.com/apache/hudi/issues/3478#issuecomment-1018639801


   @codope: I assume you are saying that with `HoodieCombineHiveInputFormat` you are seeing duplicated data (the user reported seeing duplicates with `CombineHiveInputFormat`).
   


[GitHub] [hudi] danny0405 commented on issue #3478: [SUPPORT] Unexpected Hive behaviour

Posted by GitBox <gi...@apache.org>.

danny0405 commented on issue #3478:
URL: https://github.com/apache/hudi/issues/3478#issuecomment-971100825


   Hi @codope, can we solve this in release 0.10.0?

