Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/07/18 00:20:01 UTC

[GitHub] [hudi] umehrot2 edited a comment on issue #1829: [SUPPORT] S3 slow file listing causes Hudi read performance.

umehrot2 edited a comment on issue #1829:
URL: https://github.com/apache/hudi/issues/1829#issuecomment-660389870


   @zuyanton In your test with regular parquet tables you are probably not setting the following property in the spark config: ```spark.sql.hive.convertMetastoreParquet=false```. Only when you set this property to ```false``` will Spark use the `Parquet InputFormat` as well as its listing code. Otherwise, by default, Spark uses its native listing (parallelized over the cluster) and its native parquet readers, which are supposed to be faster.
   
   However, the way Hudi works is that it goes through an `InputFormat` implementation. Thus, for a fair comparison, when you test regular parquet with Spark you should set ```spark.sql.hive.convertMetastoreParquet=false```; I think you will then observe behavior quite similar to what you are seeing. Would you mind trying that out once?
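   
   For reference, a minimal sketch of setting that property when building a Spark session (the table name `my_parquet_table` is a hypothetical placeholder):
   
   ```scala
   import org.apache.spark.sql.SparkSession
   
   val spark = SparkSession.builder()
     .appName("parquet-inputformat-comparison")
     // With this set to false, Spark reads Hive metastore parquet tables
     // through the Hive SerDe/InputFormat path instead of its native
     // parquet reader and native (parallelized) file listing.
     .config("spark.sql.hive.convertMetastoreParquet", "false")
     .enableHiveSupport()
     .getOrCreate()
   
   // Hypothetical Hive-registered parquet table.
   spark.sql("SELECT COUNT(*) FROM my_parquet_table").show()
   ```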
   
   But @bvaradar, irrespective of that, I think for Hudi we should always compare our performance against standard Spark performance (native listing and reading), and not the performance of Spark when it is made to go through the `InputFormat`. So we need to get this fixed either way if we are to be comparable to Spark parquet performance, which uses parallelized listing over the cluster.
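   
   To make that comparison concrete, here is a minimal sketch of timing the same query down both paths; the query and the table name `my_parquet_table` are hypothetical placeholders:
   
   ```scala
   // Simple wall-clock timer for a block of code.
   def time[T](label: String)(block: => T): T = {
     val start = System.nanoTime()
     val result = block
     println(s"$label took ${(System.nanoTime() - start) / 1e9} s")
     result
   }
   
   // Native Spark path (the default): parallelized listing + native parquet reader.
   spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")
   spark.sql("REFRESH TABLE my_parquet_table") // invalidate any cached relation/listing
   time("native reader") {
     spark.sql("SELECT COUNT(*) FROM my_parquet_table").collect()
   }
   
   // InputFormat path, which is closer to how Hudi tables are read today.
   spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
   spark.sql("REFRESH TABLE my_parquet_table")
   time("InputFormat reader") {
     spark.sql("SELECT COUNT(*) FROM my_parquet_table").collect()
   }
   ```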
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org