Posted to commits@hudi.apache.org by "selvaraj (Jira)" <ji...@apache.org> on 2021/08/25 23:05:00 UTC

[jira] [Updated] (HUDI-2363) COW : Listing leaf files and directories twice

     [ https://issues.apache.org/jira/browse/HUDI-2363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

selvaraj updated HUDI-2363:
---------------------------
    Description: 
Team,

In our organization we are still using Hudi 0.5.0. We plan to upgrade to the latest version in a couple of quarters.

Problem scenario:

Many use cases in our project use COW tables with Hive sync disabled. One of the Hudi tables contains two years' worth of data, partitioned by date. For every write to this table, I notice that the "Listing leaf files and directories" job is triggered twice; normally it is triggered only once. A screenshot is attached.

 

Once the first "Listing leaf files and directories" job finishes, logs for a second listing of leaf files and directories appear.

I spent some time investigating the source code but couldn't trace where exactly it is being invoked.

 

How can it be avoided here? Unfortunately, this is adding more latency to our flow.

 

  was:
Team,

In our organization we are still using Hudi 0.5.0. We plan to upgrade to the latest version in a couple of quarters.

Problem scenario:

Many use cases in our project use COW tables with Hive sync disabled. One of the Hudi tables contains two years' worth of data, partitioned by date. For every write to this table, I notice that the "Listing leaf files and directories" job is triggered twice; normally it is triggered only once. A screenshot is attached.

 

Once the first "Listing leaf files and directories" job finished, I noticed the warning below, and then logs for a second listing of leaf files and directories appeared.

21/08/24 20:40:40 *WARN* SharedInMemoryCache: Evicting cached table partition metadata from memory due to size constraints (spark.sql.hive.filesourcePartitionFileCacheSize = 262144000 bytes). This may impact query planning performance.

 

I spent some time investigating the source code but couldn't trace where exactly it is being invoked.

 

Is there any relationship between the warning message and the listing happening twice?

How can it be avoided here?

 


> COW : Listing leaf files and directories twice
> ----------------------------------------------
>
>                 Key: HUDI-2363
>                 URL: https://issues.apache.org/jira/browse/HUDI-2363
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: Writer Core
>            Reporter: selvaraj
>            Priority: Major
>         Attachments: Screen Shot 2021-08-25 at 5.36.52 PM.png
>
>
> Team,
> In our organization we are still using Hudi 0.5.0. We plan to upgrade to the latest version in a couple of quarters.
> Problem scenario:
> Many use cases in our project use COW tables with Hive sync disabled. One of the Hudi tables contains two years' worth of data, partitioned by date. For every write to this table, I notice that the "Listing leaf files and directories" job is triggered twice; normally it is triggered only once. A screenshot is attached.
>  
> Once the first "Listing leaf files and directories" job finishes, logs for a second listing of leaf files and directories appear.
> I spent some time investigating the source code but couldn't trace where exactly it is being invoked.
>  
> How can it be avoided here? Unfortunately, this is adding more latency to our flow.
>  


