
Posted to commits@hudi.apache.org by "xmubeta (via GitHub)" <gi...@apache.org> on 2023/02/03 11:30:02 UTC

[GitHub] [hudi] xmubeta opened a new issue, #7844: [SUPPORT] Job became slow because reading archived commit files

xmubeta opened a new issue, #7844:
URL: https://github.com/apache/hudi/issues/7844

**Describe the problem you faced**

We are running AWS Glue streaming with Hudi 0.12.1 to write data from Kafka. The table is MOR for quick ingestion, and compaction runs inline because we don't want to run a separate job to perform compaction.
Initially each commit took 1.5 minutes. After one day it slowed to 3-4 minutes, and after a few days it took more than 10 minutes.
We noticed that the slowness was caused by Hudi reading the archived commit files after each commit. Because there were tons of archived commit files under the .hoodie/archived directory on S3, this took a long time. After I deleted those files, commits became fast again, but the speed degraded again later.
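
To check whether the slowdown tracks the number of archived files, one could count them over time. The following is a minimal sketch, not part of the original report. It assumes the table uses the default archive folder name (`archived`, under `.hoodie/`) and works on any list of `(key, size)` pairs, for example one collected from an S3 listing. The helper name `summarize_archive` and the sample keys are hypothetical.

```python
from collections import namedtuple

ArchiveSummary = namedtuple("ArchiveSummary", ["file_count", "total_bytes"])


def summarize_archive(objects, table_prefix, archive_folder="archived"):
    """Count objects under <table_prefix>/.hoodie/<archive_folder>/.

    objects: iterable of (key, size_in_bytes) pairs from any object listing.
    archive_folder: assumed default folder name; adjust if the table overrides it.
    """
    prefix = table_prefix.rstrip("/") + "/.hoodie/" + archive_folder + "/"
    count = 0
    total = 0
    for key, size in objects:
        # Skip the folder marker itself and anything outside the archive folder.
        if key.startswith(prefix) and key != prefix:
            count += 1
            total += size
    return ArchiveSummary(count, total)


# Hypothetical listing; real keys and sizes would come from the bucket.
sample_listing = [
    ("warehouse/t1/.hoodie/archived/.commits_.archive.1_1-0-1", 100),
    ("warehouse/t1/.hoodie/archived/.commits_.archive.2_1-0-1", 250),
    ("warehouse/t1/.hoodie/20230203113000.deltacommit", 10),
    ("warehouse/t1/dt=2023-02-03/data.parquet", 5000),
]
summary = summarize_archive(sample_listing, "warehouse/t1")
```

Logging `summary.file_count` alongside the commit duration for each micro-batch would show whether the two grow together.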

I did some research and found that Hudi currently does not support purging these archived files automatically. There is a feature to merge small archived files; I tried it and the situation improved, but it was still not fast enough.
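
For reference, the archive-merge feature mentioned above is controlled by a few writer options. This is a sketch under assumptions: the option names are the ones I believe Hudi 0.11+ exposes for archive merging, the values are illustrative rather than tuned recommendations, and the helper `with_archive_merge` is hypothetical.

```python
# Illustrative values only; check the Hudi configuration reference for your version.
ARCHIVE_MERGE_OPTIONS = {
    # Merge small archive files together (already enabled in the config in this report).
    "hoodie.archive.merge.enable": "true",
    # How many small archive files to merge in one pass.
    "hoodie.archive.merge.files.batch.size": "10",
    # Archive files below this size (in bytes) count as "small".
    "hoodie.archive.merge.small.file.limit.bytes": str(20 * 1024 * 1024),
}


def with_archive_merge(base_options):
    """Return a copy of base_options with the archive-merge options applied."""
    merged = dict(base_options)
    merged.update(ARCHIVE_MERGE_OPTIONS)
    return merged
```

Merging reduces the number of archive files but does not delete archived history, which matches the observation that it helped without fully solving the slowdown.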

Further research suggests this might be related to the Hive metastore sync. But I do need the sync. Is there any way to speed up this procedure? Thank you.

**To Reproduce**

Steps to reproduce the behavior:

1. Set up a Kafka cluster.
2. Write a PySpark job that reads a stream from Kafka and sinks it to Hudi on S3.
3. Hudi config:

'className' : 'org.apache.hudi',
'hoodie.datasource.hive_sync.use_jdbc':'false',
'hoodie.datasource.write.partitionpath.field': partitionKey,
'hoodie.datasource.write.recordkey.field': primaryKey,
'hoodie.datasource.write.precombine.field': timestamp,
'hoodie.datasource.write.operation': 'insert',
'hoodie.table.name': targetTableName,
'hoodie.consistency.check.enabled': 'true',
'hoodie.datasource.write.hive_style_partitioning':'true',
'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
'hoodie.datasource.hive_sync.database': targetDBName,
'hoodie.datasource.hive_sync.table': targetTableName,
'hoodie.datasource.hive_sync.enable': 'true',
'hoodie.datasource.hive_sync.partition_fields': partitionKey,
'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
'hoodie.datasource.hive_sync.support_timestamp':'true',
'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.SimpleKeyGenerator',
'hoodie.archive.merge.enable':'true',
'hoodie.index.type':'BLOOM', #SIMPLE
'hoodie.compact.inline': 'true',
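
The reproduction steps can be sketched roughly as follows. This is not the reporter's actual job: it uses plain Spark Structured Streaming with `foreachBatch` instead of Glue's own streaming API, it covers only a subset of the options listed above, and all function names, parameters, and the `parse_fn` callback (which turns the raw Kafka `key`/`value` columns into table columns) are hypothetical.

```python
def build_hudi_options(table, database, record_key, partition_key, precombine_key):
    """Assemble a subset of the writer options from this report as a dict."""
    return {
        "hoodie.table.name": table,
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
        "hoodie.datasource.write.operation": "insert",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_key,
        "hoodie.datasource.write.precombine.field": precombine_key,
        "hoodie.datasource.write.hive_style_partitioning": "true",
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.use_jdbc": "false",
        "hoodie.datasource.hive_sync.database": database,
        "hoodie.datasource.hive_sync.table": table,
        "hoodie.datasource.hive_sync.partition_fields": partition_key,
        "hoodie.archive.merge.enable": "true",
        "hoodie.compact.inline": "true",
    }


def write_batch(batch_df, options, target_path):
    """Append one micro-batch to the Hudi table; each call produces one commit."""
    batch_df.write.format("hudi").options(**options).mode("append").save(target_path)


def start_stream(spark, bootstrap_servers, topic, parse_fn, options,
                 target_path, checkpoint_path):
    """Read from Kafka and write every micro-batch to Hudi.

    spark is an existing SparkSession; parse_fn maps the raw Kafka DataFrame
    to a DataFrame containing the record key, partition, and precombine columns.
    """
    source = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        .load()
    )
    return (
        parse_fn(source)
        .writeStream
        .foreachBatch(lambda df, batch_id: write_batch(df, options, target_path))
        .option("checkpointLocation", checkpoint_path)
        .start()
    )
```

Timing each `write_batch` call would give the per-commit durations quoted in the problem description.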

**Environment Description**

* Hudi version : 0.12.1

* Spark version : 3.1 (AWS Glue 3.0)

* Hive version :

* Hadoop version :

* Storage (HDFS/S3/GCS..) : S3

* Running on Docker? (yes/no) :

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] nsivabalan commented on issue #7844: [SUPPORT] Job became slow because reading archived commit files

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.

nsivabalan commented on issue #7844:
URL: https://github.com/apache/hudi/issues/7844#issuecomment-1418498523

   Yes, we have already fixed this, and the fix will be part of 0.13.0.
   Thanks for reporting.
   I will go ahead and close the issue since it's already fixed.


[GitHub] [hudi] xmubeta commented on issue #7844: [SUPPORT] Job became slow because reading archived commit files

Posted by "xmubeta (via GitHub)" <gi...@apache.org>.

xmubeta commented on issue #7844:
URL: https://github.com/apache/hudi/issues/7844#issuecomment-1415810777

   I just happened to find PR https://github.com/apache/hudi/pull/7561. It seems to be helpful. Is that true? Thank you.


[GitHub] [hudi] nsivabalan closed issue #7844: [SUPPORT] Job became slow because reading archived commit files

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.

nsivabalan closed issue #7844: [SUPPORT] Job became slow because reading archived commit files
URL: https://github.com/apache/hudi/issues/7844

