Posted to commits@hudi.apache.org by "Aload (via GitHub)" <gi...@apache.org> on 2023/04/27 07:47:31 UTC

[GitHub] [hudi] Aload opened a new issue, #6102: [SUPPORT]Missing data problem，exigency！！！

Aload opened a new issue, #6102:
URL: https://github.com/apache/hudi/issues/6102

**Describe the problem you faced**

A clear and concise description of the problem.

**To Reproduce**

Steps to reproduce the behavior:

1. Use Flink to consume from Kafka and write to a Hudi MOR table in real time.
2. Use Spark 3.2.1 to read the MOR table for pre-aggregation.
3. The data written to the table does not match the data Flink actually consumed, and the difference is large.

**Expected behavior**

A clear and concise description of what you expected to happen.

**Environment Description**

* Hudi version : 0.11.1

* Spark version : 3.2.1

* Hive version : 2.3.7

* Hadoop version : 3.0.0

* Storage (HDFS/S3/GCS..) : HDFS

* Running on Docker? (yes/no) : no

**Additional context**

Add any other context about the problem here.

**Stacktrace**

I use Flink 1.14.4 to consume from Kafka and write to Hudi. After consumption finished, I stopped the job and confirmed that the data Flink consumed was consistent with the data shown in Kafka Tool. When I then checked the table with Spark 3.2.1, I read far fewer records than actually exist, and some records had disappeared entirely. At the same time, we created a new consumer program that writes the same data into ClickHouse in real time, and found that ClickHouse had far more data than Hudi: some records present in ClickHouse were missing from Hudi. No exceptions occurred anywhere in the whole process.
![lQLPJxZ9o8vZmozNAvzNBaywhje58RmW_UUCzuojgEBCAA_1452_764](https://user-images.githubusercontent.com/13082598/178867202-f3ac12de-b175-4f31-aa82-d54cdbb0969c.png)
![lQLPJxZ9o8vZmmPNAojNBbCwCNd9dHpR1NICzuojf0BjAA_1456_648](https://user-images.githubusercontent.com/13082598/178867239-e80bc551-fd49-42f8-bac2-b445faf482ad.png)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] Aload commented on issue #6102: [SUPPORT]Missing data problem，exigency！！！

Posted by GitBox <gi...@apache.org>.

Aload commented on issue #6102:
URL: https://github.com/apache/hudi/issues/6102#issuecomment-1231030156

   > @Aload: with the patched version, do you still see any data loss?
   
   Sorry, that version caused the data loss, so I downgraded and have not upgraded again. I plan to wait for 0.12.1 before considering an upgrade.



[GitHub] [hudi] yuzhaojing commented on issue #6102: [SUPPORT]Missing data problem，exigency！！！

Posted by GitBox <gi...@apache.org>.

yuzhaojing commented on issue #6102:
URL: https://github.com/apache/hudi/issues/6102#issuecomment-1198989990

   Can you share the logs for the Spark read, from both the driver and the executors?



[GitHub] [hudi] yuzhaojing commented on issue #6102: [SUPPORT]Missing data problem，exigency！！！

Posted by GitBox <gi...@apache.org>.

yuzhaojing commented on issue #6102:
URL: https://github.com/apache/hudi/issues/6102#issuecomment-1189725432

   > `System.setProperty("HADOOP_USER_NAME", "hdfs")
   > val session: SparkSession = SparkSession.builder()
   >   .appName(this.getClass.getName)
   >   .master("local[*]")
   >   .config("spark.yarn.queue", "root.users.hdfs")
   >   .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
   >   .config("spark.kryoserializer.buffer.max", "512m")
   >   .enableHiveSupport()
   >   .getOrCreate()
   > session.sparkContext.setLogLevel("ERROR")
   > private val sourceDf: DataFrame = session
   >   .read
   >   .format("hudi")
   >   // .option(DataSourceReadOptions.REALTIME_MERGE.key(),REALTIME_SKIP_MERGE_OPT_VAL)
   >   // .option(DataSourceReadOptions.QUERY_TYPE.key(),QUERY_TYPE_SNAPSHOT_OPT_VAL)
   >   .load("/hoodie/data/ods/ods_equipment_data")
   > sourceDf.show(1000)
   > // sourceDf.createTempView("tmp_water")
   > println(sourceDf.count())`
   
   You can try changing `REALTIME_SKIP_MERGE_OPT_VAL` to `REALTIME_PAYLOAD_COMBINE_OPT_VAL`.
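
A minimal sketch of the two merge types named in this suggestion, assuming the string values behind the `DataSourceReadOptions` constants (`hoodie.datasource.merge.type` set to `payload_combine` or `skip_merge`). The object and value names are hypothetical and are not part of Hudi:

```scala
// Hypothetical helper: read options for a snapshot query with an explicit merge type.
object HudiMergeTypes {
  val QueryTypeKey = "hoodie.datasource.query.type" // key of DataSourceReadOptions.QUERY_TYPE
  val MergeTypeKey = "hoodie.datasource.merge.type" // key of DataSourceReadOptions.REALTIME_MERGE

  // REALTIME_PAYLOAD_COMBINE_OPT_VAL: merge log records into base records via the payload class
  val payloadCombine: Map[String, String] =
    Map(QueryTypeKey -> "snapshot", MergeTypeKey -> "payload_combine")

  // REALTIME_SKIP_MERGE_OPT_VAL: return base-file and log-file records without merging them
  val skipMerge: Map[String, String] =
    Map(QueryTypeKey -> "snapshot", MergeTypeKey -> "skip_merge")
}
```

Either map can be passed to the reader with `.options(HudiMergeTypes.payloadCombine)` before `.load(...)`. With `skip_merge`, a key that exists in both a base file and a log file can appear more than once, so row counts from the two modes are not directly comparable.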
   
   



[GitHub] [hudi] nsivabalan commented on issue #6102: [SUPPORT]Missing data problem，exigency！！！

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #6102:
URL: https://github.com/apache/hudi/issues/6102#issuecomment-1229347829

   @Aload: with the patched version, do you still see any data loss?



[GitHub] [hudi] danny0405 commented on issue #6102: [SUPPORT]Missing data problem，exigency！！！

Posted by GitBox <gi...@apache.org>.

danny0405 commented on issue #6102:
URL: https://github.com/apache/hudi/issues/6102#issuecomment-1183933456

   You read the table using Flink, is that right?



[GitHub] [hudi] danny0405 commented on issue #6102: [SUPPORT]Missing data problem，exigency！！！

Posted by GitBox <gi...@apache.org>.

danny0405 commented on issue #6102:
URL: https://github.com/apache/hudi/issues/6102#issuecomment-1200799619

   You can try https://github.com/apache/hudi/pull/6182, which contains several fixes. One of them is https://github.com/apache/hudi/pull/6182/commits/528ad498ff936b73ce3af1f3f5603013bafad65c, which fixes a bug that can cause data loss.



[GitHub] [hudi] Aload commented on issue #6102: [SUPPORT]Missing data problem，exigency！！！

Posted by GitBox <gi...@apache.org>.

Aload commented on issue #6102:
URL: https://github.com/apache/hudi/issues/6102#issuecomment-1200570116

   > Can you share the logs for the Spark read, from both the driver and the executors?
   
   I'm sorry, I have already rolled back the version. It was deployed in production, so I had to withdraw it promptly.



[GitHub] [hudi] yuzhaojing commented on issue #6102: [SUPPORT]Missing data problem，exigency！！！

Posted by GitBox <gi...@apache.org>.

yuzhaojing commented on issue #6102:
URL: https://github.com/apache/hudi/issues/6102#issuecomment-1189203105

   Can you provide the configuration for the Spark read? I would like to know whether this is caused by using a read-optimized query.
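
For context on this question: on a MOR table, a read-optimized query reads only the compacted base files, so records still sitting in log files do not appear until compaction runs, whereas a snapshot query merges base and log files at read time. A minimal sketch of the two option sets, assuming the standard Hudi Spark datasource key `hoodie.datasource.query.type`. The object and value names are hypothetical:

```scala
// Hypothetical helper: option maps for comparing Hudi query types on a MOR table.
object HudiQueryTypes {
  // Key of DataSourceReadOptions.QUERY_TYPE
  val QueryTypeKey = "hoodie.datasource.query.type"

  // Snapshot: merges base files with log files at read time
  val snapshot: Map[String, String] = Map(QueryTypeKey -> "snapshot")

  // Read-optimized: reads base files only, so uncompacted log records are not visible
  val readOptimized: Map[String, String] = Map(QueryTypeKey -> "read_optimized")
}
```

Given a `SparkSession` named `spark` and the table path, `spark.read.format("hudi").options(HudiQueryTypes.snapshot).load(path).count()` can be compared with the same call using `HudiQueryTypes.readOptimized`. A large gap between the two counts would point at uncompacted log files rather than lost writes.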



[GitHub] [hudi] danny0405 commented on issue #6102: [SUPPORT]Missing data problem，exigency！！！

Posted by GitBox <gi...@apache.org>.

danny0405 commented on issue #6102:
URL: https://github.com/apache/hudi/issues/6102#issuecomment-1306539496

   0.12.1 is expected to solve the problem; feel free to re-open the issue if the problem still exists.



[GitHub] [hudi] danny0405 commented on issue #6102: [SUPPORT]Missing data problem，exigency！！！

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.

danny0405 commented on issue #6102:
URL: https://github.com/apache/hudi/issues/6102#issuecomment-1527499721

   I cannot reproduce this anymore; it should be fixed by https://github.com/apache/hudi/pull/6179 and https://github.com/apache/hudi/pull/8079.



[GitHub] [hudi] danny0405 closed issue #6102: [SUPPORT]Missing data problem，exigency！！！

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.

danny0405 closed issue #6102: [SUPPORT]Missing data problem，exigency！！！
URL: https://github.com/apache/hudi/issues/6102



[GitHub] [hudi] Aload commented on issue #6102: [SUPPORT]Missing data problem，exigency！！！

Posted by GitBox <gi...@apache.org>.

Aload commented on issue #6102:
URL: https://github.com/apache/hudi/issues/6102#issuecomment-1184039621

   > You read the table using Flink, is that right?
   
   I use Flink to write and Spark to read.
   
   



[GitHub] [hudi] Aload commented on issue #6102: [SUPPORT]Missing data problem，exigency！！！

Posted by GitBox <gi...@apache.org>.

Aload commented on issue #6102:
URL: https://github.com/apache/hudi/issues/6102#issuecomment-1189706574

   > Can you provide the configuration for the Spark read? I would like to know whether this is caused by using a read-optimized query.
   
   `  import org.apache.spark.sql.{DataFrame, SparkSession}
     // Access HDFS as the "hdfs" user
     System.setProperty("HADOOP_USER_NAME", "hdfs")
     val session: SparkSession = SparkSession.builder()
       .appName(this.getClass.getName)
       .master("local[*]")
       .config("spark.yarn.queue", "root.users.hdfs")
       .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
       .config("spark.kryoserializer.buffer.max", "512m")
       .enableHiveSupport()
       .getOrCreate()
     session.sparkContext.setLogLevel("ERROR")
     // Read the Hudi table with default options; the explicit read options are commented out
     private val sourceDf: DataFrame = session
       .read
       .format("hudi")
       //    .option(DataSourceReadOptions.REALTIME_MERGE.key(),REALTIME_SKIP_MERGE_OPT_VAL)
       //    .option(DataSourceReadOptions.QUERY_TYPE.key(),QUERY_TYPE_SNAPSHOT_OPT_VAL)
       .load("/hoodie/data/ods/ods_equipment_data")
     sourceDf.show(1000)
     //  sourceDf.createTempView("tmp_water")
     println(sourceDf.count())
   `



[GitHub] [hudi] danny0405 closed issue #6102: [SUPPORT]Missing data problem，exigency！！！

Posted by GitBox <gi...@apache.org>.

danny0405 closed issue #6102: [SUPPORT]Missing data problem，exigency！！！
URL: https://github.com/apache/hudi/issues/6102



[GitHub] [hudi] Aload commented on issue #6102: [SUPPORT]Missing data problem，exigency！！！

Posted by GitBox <gi...@apache.org>.

Aload commented on issue #6102:
URL: https://github.com/apache/hudi/issues/6102#issuecomment-1202265022

   > You can try #6182, which contains several fixes. One of them is [528ad49](https://github.com/apache/hudi/commit/528ad498ff936b73ce3af1f3f5603013bafad65c), which fixes a bug that can cause data loss.
   
   Got it. Will this fix be merged into a new branch?
   
   



[GitHub] [hudi] Aload commented on issue #6102: [SUPPORT]Missing data problem，exigency！！！

Posted by GitBox <gi...@apache.org>.

Aload commented on issue #6102:
URL: https://github.com/apache/hudi/issues/6102#issuecomment-1189982920

   > REALTIME_SKIP_MERGE_OPT_VAL
   
   ![image](https://user-images.githubusercontent.com/13082598/179935157-5902eda2-7fed-4c46-afc6-78ed2e2fcd82.png)
   We tried that. It is already the default, and it does not help.
   
   



[GitHub] [hudi] xushiyan commented on issue #6102: [SUPPORT]Missing data problem，exigency！！！

Posted by GitBox <gi...@apache.org>.

xushiyan commented on issue #6102:
URL: https://github.com/apache/hudi/issues/6102#issuecomment-1296302757

   @Aload now that 0.12.1 is out, have you given it a try? We would like to know how it goes.



[GitHub] [hudi] codope commented on issue #6102: [SUPPORT]Missing data problem，exigency！！！

Posted by "codope (via GitHub)" <gi...@apache.org>.

codope commented on issue #6102:
URL: https://github.com/apache/hudi/issues/6102#issuecomment-1525011926

   Reopening to validate against master. cc @danny0405 

