Posted to commits@hudi.apache.org by "selvaraj (Jira)" <ji...@apache.org> on 2021/08/25 22:48:00 UTC

[jira] [Created] (HUDI-2363) COW : Listing leaf files and directories twice

selvaraj created HUDI-2363:
------------------------------

             Summary: COW : Listing leaf files and directories twice
                 Key: HUDI-2363
                 URL: https://issues.apache.org/jira/browse/HUDI-2363
             Project: Apache Hudi
          Issue Type: Bug
          Components: Writer Core
            Reporter: selvaraj
         Attachments: Screen Shot 2021-08-25 at 5.36.52 PM.png

Team,

In our organization we are still using Hudi 0.5.0. We plan to upgrade to the latest version in a couple of quarters.

Problem scenario:

Many use cases in our project use COW with hive sync disabled. One of the Hudi tables contains two years' worth of data, partitioned by date. For every write on this table, I notice that the "Listing leaf files and directories" job is triggered twice. Normally it is triggered only once. A screenshot is attached.

 

Once the first "Listing leaf files and directories" job was done, I noticed the warning below, and then the logs for a second listing of leaf files and directories rolled.

21/08/24 20:40:40 *WARN* SharedInMemoryCache: Evicting cached table partition metadata from memory due to size constraints (spark.sql.hive.filesourcePartitionFileCacheSize = 262144000 bytes). This may impact query planning performance.
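
For reference, the limit in the warning works out to 250 MiB, which as far as I know is Spark's default for this setting. Below is a minimal sketch of that arithmetic and of how a larger value could be formatted as a spark-submit argument. The 1 GiB value and the helper names are my own assumptions for illustration, and I have not confirmed that raising the limit avoids the second listing.

```python
# Byte value reported in the SharedInMemoryCache warning.
CACHE_BYTES = 262144000


def to_mib(n_bytes):
    """Convert a byte count to mebibytes."""
    return n_bytes / (1024 * 1024)


def conf_flag(n_bytes):
    """Hypothetical helper: format the setting as a key=value pair for --conf."""
    return f"spark.sql.hive.filesourcePartitionFileCacheSize={n_bytes}"


# Size of the current limit in MiB.
print(to_mib(CACHE_BYTES))

# Assumed larger value (1 GiB), chosen only for illustration.
larger = 1024 * 1024 * 1024
print("--conf", conf_flag(larger))
```

My understanding is that this setting is read when the Spark session starts, so it would have to be passed at submit time rather than changed on a running session.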

 

I spent some time investigating the source code but couldn't trace where exactly the second listing is being invoked.

 

Is there any relationship between the warning message and the listing happening twice?

How can it be avoided here?

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)