Posted to user@spark.apache.org by Steve Loughran <st...@hortonworks.com> on 2017/05/03 13:55:35 UTC

Re: parquet optimal file structure - flat vs nested

> On 30 Apr 2017, at 09:19, Zeming Yu <ze...@gmail.com> wrote:
> 
> Hi,
> 
> We're building a parquet based data lake. I was under the impression that flat files are more efficient than deeply nested files (say 3 or 4 levels down). Is that correct?
> 
> Thanks,
> Zeming

Where's the data going to live: HDFS or an object store? If it's somewhere like Amazon S3, I'd be biased towards the flatter structure. The way the client libraries mimic tree walking is pretty expensive in terms of HTTP calls, and those calls all take place during the initial, serialized query planning stage, so their latency adds up rather than overlapping.
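
To make the cost argument concrete, here is a rough sketch in Python. It is a toy model only: it assumes one listing request per directory visited, ignores pagination and any client-side optimizations, and does not reflect how any particular S3 client is implemented. The names count_list_calls and build_layout are made up for this illustration.

```python
# Toy model (assumption): walking a tree costs one "list" request per
# directory visited. Real object-store clients differ in the details.

def count_list_calls(tree):
    """Count the directory listings needed to walk `tree`.

    `tree` maps names to either a nested dict (a directory) or None (a file).
    """
    calls = 1  # one listing for this directory
    for child in tree.values():
        if isinstance(child, dict):
            calls += count_list_calls(child)
    return calls


def build_layout(levels, fanout, files_per_leaf):
    """Build a hypothetical layout with `levels` of nested directories."""
    if levels == 0:
        return {f"part-{i}.parquet": None for i in range(files_per_leaf)}
    return {
        f"d{i}": build_layout(levels - 1, fanout, files_per_leaf)
        for i in range(fanout)
    }


# 1000 files in a single directory.
flat = build_layout(levels=0, fanout=0, files_per_leaf=1000)
# The same 1000 files, one per leaf directory, three levels down.
nested = build_layout(levels=3, fanout=10, files_per_leaf=1)

flat_calls = count_list_calls(flat)
nested_calls = count_list_calls(nested)
print("flat:", flat_calls, "nested:", nested_calls)
```

Under these assumptions, the flat layout needs a single listing while the three-level layout needs one for the root plus one for every directory beneath it. If each listing is an HTTP round trip made one after another during planning, that difference shows up as wall-clock delay before any work starts.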