Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/11/25 14:10:33 UTC

[GitHub] [hudi] ygordefraga opened a new issue #2280: [SUPPORT] Simple record key and composite partition

ygordefraga opened a new issue #2280:
URL: https://github.com/apache/hudi/issues/2280


   Hi there,
   
   I have a problem with Hudi partitioning when there is a single record key field and multiple partition fields. I tried several approaches, but none of them worked.
   
   The DataFrame I'm working with is pretty simple. It contains all the messages sent by the organizations. Example below:
   
   message_id | timestamp | status | organization_id | year | month | day
   ------------ | ------------- | ------------- | ------------- | ------------- | ------------- | -------------
   bdabfa6f-2a3e-4c17-acd7-350227473ae4 | 2020-11-25T10:00:00Z | SENT | 0b38bec3-15ac-4e57-9bb9-48d7de412ffa | 2020 | 11 | 25
   203d5495-9b5d-4003-b7f3-ab312a70db40 | 2020-11-25T11:00:00Z | SENT | 75e498d4-c979-4a12-b8df-1051c7976d34 | 2020 | 11 | 25
   09fa0543-cf5a-4e6b-9d16-ad14a8a7058a | 2020-10-22T09:00:00Z | NOT_SENT | 0b38bec3-15ac-4e57-9bb9-48d7de412ffa | 2020 | 10 | 22
   
   This DataFrame will become a COW table, and the configs are set as follows:
   
   ```
   "hoodie.datasource.write.insert.drop.duplicates" -> "true"
   "hoodie.insert.shuffle.parallelism" -> "32"
   "hoodie.finalize.write.parallelism" -> "32"
   "hoodie.datasource.write.recordkey.field" -> "message_id"
   "hoodie.datasource.write.precombine.field" -> timestampColumn
   "hoodie.datasource.write.partitionpath.field" -> "organization_id:SIMPLE,year:SIMPLE,month:SIMPLE,day:SIMPLE"
   "hoodie.datasource.write.keygenerator.class" -> classOf[ComplexKeyGenerator].getName
   ```
   I followed the [official docs](https://hudi.apache.org/docs/writing_data.html#key-generation) to set `"hoodie.datasource.write.partitionpath.field"`. I decided to extract `year`, `month` and `day` from `timestamp` to make partitioning easier.
   
   The **problem** is that when I write the table with the configs shown above, the partitions look like `messages/default/default/default/default`, but I want them **to look like this**: `messages/organization_id=<organization_id>/year=<year>/month=<month>/day=<day>`.
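   
   A possible explanation, offered as an assumption rather than a confirmed diagnosis: the `field:TYPE` suffix (for example `year:SIMPLE`) appears to be the syntax of the `CustomKeyGenerator` that ships with later Hudi releases, whereas `ComplexKeyGenerator` expects a plain comma-separated list of column names. If so, a name such as `organization_id:SIMPLE` matches no column and each level falls back to `default`. A minimal Scala sketch of the adjusted options follows; `partitionOptions` is a hypothetical helper, the key generator class name is passed in so that no package path is assumed, and `hoodie.datasource.write.hive_style_partitioning` is assumed to be available in the Hudi version in use:
   
   ```scala
   // Hypothetical helper: builds the partition-related write options.
   def partitionOptions(keyGeneratorClass: String): Map[String, String] = Map(
     "hoodie.datasource.write.recordkey.field" -> "message_id",
     // Plain column names, no ":SIMPLE" suffix (assumption: ComplexKeyGenerator syntax).
     "hoodie.datasource.write.partitionpath.field" -> "organization_id,year,month,day",
     "hoodie.datasource.write.keygenerator.class" -> keyGeneratorClass,
     // Assumed to produce column=value directory names for each partition level.
     "hoodie.datasource.write.hive_style_partitioning" -> "true"
   )
   ```
   
   These options would be merged with the remaining settings listed earlier, passing `classOf[ComplexKeyGenerator].getName` as the argument.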
   
   However, it did work when I added one more column holding the value `organization_id=<organization_id>/year=<year>/month=<month>/day=<day>` for each row and used that column as the value of `"hoodie.datasource.write.partitionpath.field"`. With that setup, the Spark job took half an hour to write 300k rows (1000 organizations).
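   
   For reference, the value of that extra column can be sketched as a small pure function. `hiveStylePath` is a hypothetical name and not part of Hudi; in a Spark job the same string would be built with column expressions instead:
   
   ```scala
   // Hypothetical helper: formats one row's partition path in column=value style.
   def hiveStylePath(organizationId: String, year: Int, month: Int, day: Int): String =
     Seq(
       "organization_id" -> organizationId,
       "year" -> year.toString,
       "month" -> month.toString,
       "day" -> day.toString
     ).map { case (name, value) => s"$name=$value" }.mkString("/")
   ```
   
   For the first row of the example table, `hiveStylePath("0b38bec3-15ac-4e57-9bb9-48d7de412ffa", 2020, 11, 25)` gives `organization_id=0b38bec3-15ac-4e57-9bb9-48d7de412ffa/year=2020/month=11/day=25`.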
   
   How can I make it work correctly?
   
   > Hudi Version = 0.5.2
   > Spark Version = 2.4.5
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] ygordefraga removed a comment on issue #2280: [SUPPORT] Slow insert into COW tables with multi level partitions

Posted by GitBox <gi...@apache.org>.
ygordefraga removed a comment on issue #2280:
URL: https://github.com/apache/hudi/issues/2280#issuecomment-736666402


   >> Please note that the next version of hudi will come with consolidated metadata which will remove the listing altogether.
   
   What do you mean by that?





[GitHub] [hudi] bvaradar commented on issue #2280: [SUPPORT] Slow insert into COW tables with multi level partitions

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #2280:
URL: https://github.com/apache/hudi/issues/2280#issuecomment-735927848


   @ygordefraga: This could be coming from the increase in the number of partitions.
   It could be related to https://github.com/apache/hudi/issues/2269#issuecomment-733299492 
   
   Also, note that since you increased the number of partitions with the additional partitioning level, keeping the same number of executors won't be an exactly apples-to-apples comparison.
   
   Can you try 0.6.0 (which has incremental cleaning support) to see if you get better performance?
   Please note that the next version of Hudi will come with consolidated metadata, which will remove the listing altogether.





[GitHub] [hudi] ygordefraga closed issue #2280: [SUPPORT] Simple record key and composite partition

Posted by GitBox <gi...@apache.org>.
ygordefraga closed issue #2280:
URL: https://github.com/apache/hudi/issues/2280


   





[GitHub] [hudi] ygordefraga removed a comment on issue #2280: [SUPPORT] Slow insert into COW tables with multi level partitions

Posted by GitBox <gi...@apache.org>.
ygordefraga removed a comment on issue #2280:
URL: https://github.com/apache/hudi/issues/2280#issuecomment-738010863


   I used Hudi 0.5.3 and these are the results:
   
   ![Screenshot from 2020-12-03 10-19-52](https://user-images.githubusercontent.com/20543207/101027135-803ef000-3556-11eb-98e6-e555bd10d2e9.png)
   
   Thank you, guys!





[GitHub] [hudi] bvaradar commented on issue #2280: [SUPPORT] Slow insert into COW tables with multi level partitions

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #2280:
URL: https://github.com/apache/hudi/issues/2280#issuecomment-736750785


   > > Please note that the next version of hudi will come with consolidated metadata which will remove the listing altogether.
   > 
   > What do you mean by that?
   
   The file listing improvements described in https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+and+Query+Planning+Improvements will be available in the 0.7 release.





[GitHub] [hudi] ygordefraga removed a comment on issue #2280: [SUPPORT] Slow insert into COW tables with multi level partitions

Posted by GitBox <gi...@apache.org>.
ygordefraga removed a comment on issue #2280:
URL: https://github.com/apache/hudi/issues/2280#issuecomment-733991146


   ![Screenshot from 2020-11-25 20-32-52](https://user-images.githubusercontent.com/20543207/100291781-749f6800-2f5d-11eb-9726-8c27ed55bf50.png)
   
   That's a picture of the time it takes to process the data with partitions by organization_id.





[GitHub] [hudi] vinothchandar commented on issue #2280: [SUPPORT] Slow insert into COW tables with multi level partitions

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #2280:
URL: https://github.com/apache/hudi/issues/2280#issuecomment-735945530


   Looks like Balaji beat me to it. :)





[GitHub] [hudi] ygordefraga commented on issue #2280: [SUPPORT] Slow insert into COW tables with multi level partitions

Posted by GitBox <gi...@apache.org>.
ygordefraga commented on issue #2280:
URL: https://github.com/apache/hudi/issues/2280#issuecomment-736666402


   >> Please note that the next version of hudi will come with consolidated metadata which will remove the listing altogether.
   
   What do you mean by that?





[GitHub] [hudi] ygordefraga commented on issue #2280: [SUPPORT] Slow insert into COW tables with multi level partitions

Posted by GitBox <gi...@apache.org>.
ygordefraga commented on issue #2280:
URL: https://github.com/apache/hudi/issues/2280#issuecomment-736807941


   Writing got faster after setting `hoodie.clean.automatic=false`. I'll try Hudi 0.6 (or 0.5.3), compare, and post the results here.
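   
   For reference, a sketch of how that setting could be added to the writer options in Scala. `withoutAutoClean` is a hypothetical helper; only the `hoodie.clean.automatic` key comes from this thread:
   
   ```scala
   // Hypothetical helper: returns the given write options with automatic cleaning disabled.
   def withoutAutoClean(options: Map[String, String]): Map[String, String] =
     options + ("hoodie.clean.automatic" -> "false")
   ```
   
   The result would be passed to the writer in place of the original options map.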





[GitHub] [hudi] bvaradar commented on issue #2280: [SUPPORT] Slow insert into COW tables with multi level partitions

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #2280:
URL: https://github.com/apache/hudi/issues/2280#issuecomment-736752118


   Regarding the bundle issue, the workaround posted should have worked. Another option is to try 0.5.3 which has the support too. 





[GitHub] [hudi] ygordefraga commented on issue #2280: [SUPPORT] Slow insert into COW tables with multi level partitions

Posted by GitBox <gi...@apache.org>.
ygordefraga commented on issue #2280:
URL: https://github.com/apache/hudi/issues/2280#issuecomment-738010863


   I used Hudi 0.5.3 and these are the results:
   
   ![Screenshot from 2020-12-03 10-19-52](https://user-images.githubusercontent.com/20543207/101027135-803ef000-3556-11eb-98e6-e555bd10d2e9.png)
   
   Thank you, guys!





[GitHub] [hudi] ygordefraga commented on issue #2280: [SUPPORT] Slow insert into COW tables with multi level partitions

Posted by GitBox <gi...@apache.org>.
ygordefraga commented on issue #2280:
URL: https://github.com/apache/hudi/issues/2280#issuecomment-736666569









[GitHub] [hudi] ygordefraga removed a comment on issue #2280: [SUPPORT] Slow insert into COW tables with multi level partitions

Posted by GitBox <gi...@apache.org>.
ygordefraga removed a comment on issue #2280:
URL: https://github.com/apache/hudi/issues/2280#issuecomment-736666569


   > Please note that the next version of hudi will come with consolidated metadata which will remove the listing altogether.
   
   What do you mean by that?





[GitHub] [hudi] ygordefraga commented on issue #2280: [SUPPORT] Composite partition performance

Posted by GitBox <gi...@apache.org>.
ygordefraga commented on issue #2280:
URL: https://github.com/apache/hudi/issues/2280#issuecomment-733991146


   ![Screenshot from 2020-11-25 20-32-52](https://user-images.githubusercontent.com/20543207/100291781-749f6800-2f5d-11eb-9726-8c27ed55bf50.png)
   
   That's a picture of the time it takes to process the data with partitions by organization_id.





[GitHub] [hudi] ygordefraga commented on issue #2280: [SUPPORT] Composite partition performance

Posted by GitBox <gi...@apache.org>.
ygordefraga commented on issue #2280:
URL: https://github.com/apache/hudi/issues/2280#issuecomment-733833453


   I changed the subject of this support issue because I already solved the problem I had before.





[GitHub] [hudi] ygordefraga removed a comment on issue #2280: [SUPPORT] Slow insert into COW tables with multi level partitions

Posted by GitBox <gi...@apache.org>.
ygordefraga removed a comment on issue #2280:
URL: https://github.com/apache/hudi/issues/2280#issuecomment-733833453


   I changed the subject of this support issue because I already solved the problem I had before.





[GitHub] [hudi] ygordefraga commented on issue #2280: [SUPPORT] Slow insert into COW tables with multi level partitions

Posted by GitBox <gi...@apache.org>.
ygordefraga commented on issue #2280:
URL: https://github.com/apache/hudi/issues/2280#issuecomment-736806224


   > Regarding the bundle issue, the workaround posted should have worked. Another option is to try 0.5.3 which has the support too.
   
   I'll try it again.





[GitHub] [hudi] ygordefraga commented on issue #2280: [SUPPORT] Slow insert into COW tables with multi level partitions

Posted by GitBox <gi...@apache.org>.
ygordefraga commented on issue #2280:
URL: https://github.com/apache/hudi/issues/2280#issuecomment-736580869


   > @ygordefraga : This could be coming from the increase in the number of partitions.
   > This could be related to [#2269 (comment)](https://github.com/apache/hudi/issues/2269#issuecomment-733299492)
   > 
   > Also, note that since you increased the number of partitions with the additional partitioning level, keeping the same number of executors won't be an exactly apples-to-apples comparison.
   > 
   > Can you try 0.6.0 (which has incremental cleaning support) to see if you get better performance?
   > Please note that the next version of Hudi will come with consolidated metadata, which will remove the listing altogether.
   
   Is "hoodie.clean.automatic" available in Hudi 0.5.2? And does cleaning make sense for COW tables when there are no upserts (my case)? What does cleaning do in this situation?
   
   I tried Hudi 0.6, but I ran into this [issue](https://issues.apache.org/jira/browse/HUDI-1117) and could not solve it. Any suggestions?
   





[GitHub] [hudi] ygordefraga commented on issue #2280: [SUPPORT] Slow insert into COW tables with multi level partitions

Posted by GitBox <gi...@apache.org>.
ygordefraga commented on issue #2280:
URL: https://github.com/apache/hudi/issues/2280#issuecomment-738011563


   I used Hudi 0.5.3 and the results were really good.
   
   ![Screenshot from 2020-12-03 10-19-52](https://user-images.githubusercontent.com/20543207/101027340-aa90ad80-3556-11eb-8565-8ac134eee182.png)
   
   Thank you, guys!





[GitHub] [hudi] ygordefraga closed issue #2280: [SUPPORT] Slow insert into COW tables with multi level partitions

Posted by GitBox <gi...@apache.org>.
ygordefraga closed issue #2280:
URL: https://github.com/apache/hudi/issues/2280


   

