Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/11/18 03:17:33 UTC

[GitHub] [hudi] liujinhui1994 opened a new issue #4027: [SUPPORT] Structured streaming Async clustering IndexOutOfBoundsException

liujinhui1994 opened a new issue #4027:
URL: https://github.com/apache/hudi/issues/4027


   **Environment Description**
   
   * Hudi version : master branch
   
   * Spark version : 2.4.5
   
   * Hive version :
   
   * Hadoop version : 3.0.0
   
   * Storage (HDFS/S3/GCS..) :
   
   * Running on Docker? (yes/no) : no
   
   **Stacktrace**
   
   ```
   java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
   	at java.util.ArrayList.rangeCheck(ArrayList.java:659)
   	at java.util.ArrayList.get(ArrayList.java:435)
   	at org.apache.hudi.execution.bulkinsert.BulkInsertMapFunction.call(BulkInsertMapFunction.java:64)
   	at org.apache.hudi.execution.bulkinsert.BulkInsertMapFunction.call(BulkInsertMapFunction.java:37)
   	at org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
   	at org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
   	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:874)
   	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:874)
   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
   	at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
   	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
   	at org.apache.spark.scheduler.Task.run(Task.scala:123)
   	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:413)
   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1551)
   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:419)
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   	at java.lang.Thread.run(Thread.java:748)
   ```
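
The top frames of the trace show `ArrayList.get` being called with index 0 on an empty list inside `BulkInsertMapFunction.call`. The snippet below is a minimal sketch of that failure mode next to a defensive guard. It is not the Hudi code: the class `EmptyListGuard`, the method `firstOrFail`, and the parameter name `fileIdPrefixes` are hypothetical names chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

public class EmptyListGuard {

    // Hypothetical helper: return the first element, or fail with a message
    // that points at the likely cause instead of a bare index error.
    static String firstOrFail(List<String> fileIdPrefixes) {
        if (fileIdPrefixes.isEmpty()) {
            throw new IllegalStateException(
                "No file ID prefixes available; the plan may have zero output file groups");
        }
        return fileIdPrefixes.get(0);
    }

    public static void main(String[] args) {
        List<String> empty = new ArrayList<>();

        // Unguarded access: this is the pattern that raises IndexOutOfBoundsException.
        try {
            empty.get(0);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("unguarded access failed: " + e.getClass().getSimpleName());
        }

        // Guarded access: same condition, but reported with context.
        try {
            firstOrFail(empty);
        } catch (IllegalStateException e) {
            System.out.println("guarded access failed: " + e.getMessage());
        }
    }
}
```

A guard like this does not fix the underlying cause; it only makes the empty-input condition visible earlier and with a clearer message.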
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] liujinhui1994 edited a comment on issue #4027: [SUPPORT] Structured streaming Async clustering IndexOutOfBoundsException

Posted by GitBox <gi...@apache.org>.
liujinhui1994 edited a comment on issue #4027:
URL: https://github.com/apache/hudi/issues/4027#issuecomment-972486708


   If this is a usage problem on my side, please let me know.
   If you need me to provide any other information, please let me know.
   Thanks.





[GitHub] [hudi] nsivabalan commented on issue #4027: [SUPPORT] Structured streaming Async clustering IndexOutOfBoundsException

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4027:
URL: https://github.com/apache/hudi/issues/4027#issuecomment-991845749


   @codope: can you take a look, please?





[GitHub] [hudi] liujinhui1994 commented on issue #4027: [SUPPORT] Structured streaming Async clustering IndexOutOfBoundsException

Posted by GitBox <gi...@apache.org>.
liujinhui1994 commented on issue #4027:
URL: https://github.com/apache/hudi/issues/4027#issuecomment-972485283


   ![image](https://user-images.githubusercontent.com/25769285/142346330-f90a656e-3401-452a-9be4-4b320cd68275.png)
   





[GitHub] [hudi] liujinhui1994 commented on issue #4027: [SUPPORT] Structured streaming Async clustering IndexOutOfBoundsException

Posted by GitBox <gi...@apache.org>.
liujinhui1994 commented on issue #4027:
URL: https://github.com/apache/hudi/issues/4027#issuecomment-1001939593


   @nsivabalan It is still not working; I get the same error.





[GitHub] [hudi] nsivabalan commented on issue #4027: [SUPPORT] Structured streaming Async clustering IndexOutOfBoundsException

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4027:
URL: https://github.com/apache/hudi/issues/4027#issuecomment-997341873


   Hey @liujinhui1994: can you try with 0.10.0 or the latest master? It looks like we made a fix around 0 outputfileGroups by Nov 12 in [this](https://github.com/apache/hudi/pull/3833/files#r738037336) patch.





[GitHub] [hudi] nsivabalan commented on issue #4027: [SUPPORT] Structured streaming Async clustering IndexOutOfBoundsException

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4027:
URL: https://github.com/apache/hudi/issues/4027#issuecomment-1002678785


   Hey, I tried to reproduce this locally and could not.
   https://gist.github.com/nsivabalan/7d6ea90ebfa76f9a53abedfa562562b7
   
   Can you confirm a few things:
   1. Is your table MOR?
   2. If yes, do you have any file groups without any base files, with just log files? From the code, I see that Hudi clustering considers only the Parquet file size and not the log file sizes.
   3. Can you enable info logging and let us know what you see for "Adding one clustering group"?
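
To illustrate the concern in point 2, here is a minimal sketch of how sizing a clustering group from base file sizes alone could produce zero output file groups when a group holds only log files. The sizing rule used here (ceiling of total base bytes over a target file size) and all names (`ClusteringGroupSizing`, `FileSlice`, `numOutputFileGroups`) are assumptions for illustration, not taken from the Hudi code.

```java
import java.util.Arrays;
import java.util.List;

public class ClusteringGroupSizing {

    // Hypothetical file slice: size of the base (Parquet) file and of its log files, in bytes.
    static final class FileSlice {
        final long baseFileBytes;
        final long logFileBytes;

        FileSlice(long baseFileBytes, long logFileBytes) {
            this.baseFileBytes = baseFileBytes;
            this.logFileBytes = logFileBytes;
        }
    }

    // Assumed rule: ceil(sum of base file sizes / target size). Log file sizes are ignored.
    static int numOutputFileGroups(List<FileSlice> slices, long targetFileBytes) {
        long totalBaseBytes = 0L;
        for (FileSlice slice : slices) {
            totalBaseBytes += slice.baseFileBytes;
        }
        return (int) ((totalBaseBytes + targetFileBytes - 1) / targetFileBytes);
    }

    public static void main(String[] args) {
        long target = 128L * 1024 * 1024; // 128 MiB target file size (illustrative value)

        // Slices that have log files but no base file contribute zero bytes.
        List<FileSlice> logOnly = Arrays.asList(
            new FileSlice(0L, 50_000_000L),
            new FileSlice(0L, 70_000_000L));

        // Slices with base files are sized as expected.
        List<FileSlice> withBase = Arrays.asList(
            new FileSlice(200_000_000L, 0L),
            new FileSlice(100_000_000L, 10_000_000L));

        System.out.println("log-only group:  " + numOutputFileGroups(logOnly, target));
        System.out.println("with base files: " + numOutputFileGroups(withBase, target));
    }
}
```

Under this assumed rule the log-only group yields 0 output file groups, which is the kind of input that would leave a downstream list empty.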





[GitHub] [hudi] xushiyan commented on issue #4027: [SUPPORT] Structured streaming Async clustering IndexOutOfBoundsException

Posted by GitBox <gi...@apache.org>.
xushiyan commented on issue #4027:
URL: https://github.com/apache/hudi/issues/4027#issuecomment-974701743


   @liujinhui1994 
   
   > Starting clustering for a group, parallelism:0 commit:20211102011441.
   
   This comes from the clustering plan, which was created from the replace commit metadata. Can you post the content of the replace commit metadata for this commit?





[GitHub] [hudi] nsivabalan commented on issue #4027: [SUPPORT] Structured streaming Async clustering IndexOutOfBoundsException

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4027:
URL: https://github.com/apache/hudi/issues/4027#issuecomment-1003784541


   thanks. sure. 





[GitHub] [hudi] liujinhui1994 commented on issue #4027: [SUPPORT] Structured streaming Async clustering IndexOutOfBoundsException

Posted by GitBox <gi...@apache.org>.
liujinhui1994 commented on issue #4027:
URL: https://github.com/apache/hudi/issues/4027#issuecomment-972486455


   ![image](https://user-images.githubusercontent.com/25769285/142346521-c6c53e17-6d94-47d2-859e-01c6f31648b6.png)
   





[GitHub] [hudi] liujinhui1994 commented on issue #4027: [SUPPORT] Structured streaming Async clustering IndexOutOfBoundsException

Posted by GitBox <gi...@apache.org>.
liujinhui1994 commented on issue #4027:
URL: https://github.com/apache/hudi/issues/4027#issuecomment-1005506439


   > thanks. sure.
   
   I have now emptied the historical data directory and rerun the same code, and the problem no longer occurs.
   Thanks.





[GitHub] [hudi] liujinhui1994 commented on issue #4027: [SUPPORT] Structured streaming Async clustering IndexOutOfBoundsException

Posted by GitBox <gi...@apache.org>.
liujinhui1994 commented on issue #4027:
URL: https://github.com/apache/hudi/issues/4027#issuecomment-979803519


   ![WeChat image_20211126170421](https://user-images.githubusercontent.com/25769285/143555243-c0f6ee49-8525-4933-9f4b-1e73a6ebe842.png)
   





[GitHub] [hudi] liujinhui1994 commented on issue #4027: [SUPPORT] Structured streaming Async clustering IndexOutOfBoundsException

Posted by GitBox <gi...@apache.org>.
liujinhui1994 commented on issue #4027:
URL: https://github.com/apache/hudi/issues/4027#issuecomment-979804506


   @xushiyan Sorry for the late reply. The JSON is too long, so I captured only part of it.





[GitHub] [hudi] liujinhui1994 commented on issue #4027: [SUPPORT] Structured streaming Async clustering IndexOutOfBoundsException

Posted by GitBox <gi...@apache.org>.
liujinhui1994 commented on issue #4027:
URL: https://github.com/apache/hudi/issues/4027#issuecomment-1001942141


   I used the latest master.





[GitHub] [hudi] liujinhui1994 commented on issue #4027: [SUPPORT] Structured streaming Async clustering IndexOutOfBoundsException

Posted by GitBox <gi...@apache.org>.
liujinhui1994 commented on issue #4027:
URL: https://github.com/apache/hudi/issues/4027#issuecomment-1003295308


   > Hey, I tried to reproduce locally and could not. https://gist.github.com/nsivabalan/7d6ea90ebfa76f9a53abedfa562562b7
   > 
   > can you confirm few things:
   > 
   > 1. is your table MOR?
   > 2. If yes, do you have any file groups with any base files but just log files? From the code, I see hudi clustering considers only  parquet file size and not the log file sizes.
   > 3. Can you enable info logging and let us know what you see for "Adding one clustering group"
   
   Thanks for the reply. I use COW.
   I'm trying a few things and will report back once I find anything new.











[GitHub] [hudi] liujinhui1994 commented on issue #4027: [SUPPORT] Structured streaming Async clustering IndexOutOfBoundsException

Posted by GitBox <gi...@apache.org>.
liujinhui1994 commented on issue #4027:
URL: https://github.com/apache/hudi/issues/4027#issuecomment-997530594


   OK, I am trying it. Thanks, @nsivabalan.





[GitHub] [hudi] liujinhui1994 closed issue #4027: [SUPPORT] Structured streaming Async clustering IndexOutOfBoundsException

Posted by GitBox <gi...@apache.org>.
liujinhui1994 closed issue #4027:
URL: https://github.com/apache/hudi/issues/4027


   

