Posted to dev@parquet.apache.org by "Gabor Szadovszky (JIRA)" <ji...@apache.org> on 2019/03/25 09:37:00 UTC

[jira] [Commented] (PARQUET-1549) Option for one block per file in MapReduce output

    [ https://issues.apache.org/jira/browse/PARQUET-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16800515#comment-16800515 ] 

Gabor Szadovszky commented on PARQUET-1549:
-------------------------------------------

What is not clear to me in your design is how the different file names/paths are generated. The current way of finalizing/padding a row group is based on the configuration and driven by the parquet-mr library. How can the act of ending the current file and starting a new one be driven by the library if it does not know the requested name? Maybe a kind of name-generator interface could help, but I am not sure whether it would over-complicate the design.
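
For illustration, such an interface might look like the sketch below (the interface and method names are hypothetical, not an existing parquet-mr API):

    import org.apache.hadoop.fs.Path;

    // Hypothetical sketch -- not part of parquet-mr. A pluggable generator
    // the library could call whenever it finalizes a row group and needs
    // a path for the next output file.
    public interface OutputFileNameGenerator {
      // baseFile: the path of the first file written by the task.
      // fileIndex: 0 for the first file, incremented for each new file.
      Path nextFilePath(Path baseFile, int fileIndex);
    }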

But why would we need this implementation in the first place? Currently, parquet-mr handles the row groups (in the different blocks) in parallel (processed on different nodes) by using the Hadoop InputSplits. This way it does not matter whether the row group is a separate file or only a separate HDFS block of the same file. If Impala cannot handle the row groups similarly then, I think, it is a lack of functionality on the Impala side and not on the parquet-mr side.
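
For reference, a minimal sketch of how the read side is wired up today (the input path is a placeholder; GroupReadSupport is just the example read support):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.parquet.hadoop.ParquetInputFormat;
    import org.apache.parquet.hadoop.example.GroupReadSupport;

    Job job = Job.getInstance(conf, "read-parquet");
    job.setInputFormatClass(ParquetInputFormat.class);
    ParquetInputFormat.setReadSupportClass(job, GroupReadSupport.class);
    // setInputPaths is inherited from FileInputFormat
    ParquetInputFormat.setInputPaths(job, new Path("/data/table"));
    // The generated InputSplits line up with row groups, so a file that
    // contains several row groups is still processed by several map tasks
    // in parallel.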

> Option for one block per file in MapReduce output
> -------------------------------------------------
>
>                 Key: PARQUET-1549
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1549
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-mr
>    Affects Versions: 1.10.0
>            Reporter: Gustavo Figueiredo
>            Priority: Minor
>
> When we create PARQUET files using a MapReduce application with the current ParquetOutputFormat implementation, we don't have any option to reliably limit the number of blocks (row groups) generated per file.
> The implemented configuration option 'parquet.block.size' (ParquetOutputFormat.BLOCK_SIZE) refers to the amount of data that goes into one block, but there is no guarantee that this will be the only block in a file. If one sets this configuration option to a very high value, it's likely there will be a single block per PARQUET file. However, this approach might lead to undesirably big files, so it is not a good option in some scenarios.
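> For example, the row-group size is set like this today (a sketch; the 128 MB value is only illustrative):
>
>   import org.apache.hadoop.mapreduce.Job;
>   import org.apache.parquet.hadoop.ParquetOutputFormat;
>
>   Job job = Job.getInstance(conf, "write-parquet");
>   job.setOutputFormatClass(ParquetOutputFormat.class);
>   // Target row-group ('block') size in bytes; one output file may
>   // still end up containing several row groups of roughly this size.
>   ParquetOutputFormat.setBlockSize(job, 128 * 1024 * 1024);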
> This behaviour can't be achieved by the client's 'mapper' either. Although there are some helpful classes in the Hadoop API, such as 'MultipleOutputs', we don't have enough information available in the 'mapper' code to have this kind of control, unless one uses unsafe 'hacks' to gather information from private fields.
> For instance, suppose we have an ETL application that loads data from HBASE regions (there might be one or more MAPs per region) and produces PARQUET files to be consumed in IMPALA tables (there might be one or more PARQUET files per MAP task). To simplify, let's say there is no 'REDUCE' task in this application.
> For concreteness, let's say one could use 'org.apache.hadoop.hbase.mapreduce.TableInputFormat' as the input and 'org.apache.parquet.hadoop.ParquetOutputFormat' as the output of such a job.
> Following the guidelines for maximum query performance of Impala queries in the HADOOP ecosystem, each PARQUET file should be approximately equal in size to an HDFS block and there should be only one single block of data (row group) in each of them (see https://impala.apache.org/docs/build/html/topics/impala_perf_cookbook.html#perf_cookbook__perf_cookbook_parquet_block_size).
> Currently we are only able to do this by trial and error with different configuration options.
> It would be nice to have a new boolean configuration option (let's call it 'parquet.split.file.per.block') related to the existing 'parquet.block.size'. If it's set to false (the default value), we would have the current behaviour. If it's set to true, we would have a different PARQUET file generated for each 'block' created, all coming from the same ParquetRecordWriter.
> In doing so, we would only have to worry about tuning the 'parquet.block.size' parameter in order to generate PARQUET files with one single block per file, with a size close to the configured HDFS block size.
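> With the proposed option, the tuning would reduce to something like this sketch (hypothetical; the option does not exist in parquet-mr today, and the 256 MB value is only an example):
>
>   // Hypothetical: one output file per row group.
>   conf.setBoolean("parquet.split.file.per.block", true);
>   // Keep row groups close to the HDFS block size (dfs.blocksize), so
>   // each output file carries a single, block-sized row group.
>   conf.setLong("parquet.block.size", 256L * 1024 * 1024);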
>  
> In order to implement this new feature, we only need to change a few classes in the 'org.apache.parquet.hadoop' package, namely:
>  InternalParquetRecordWriter
>  ParquetFileWriter
>  ParquetOutputFormat
>  ParquetRecordWriter
> Briefly, these are the changes needed:
>  InternalParquetRecordWriter:
>  The field 'ParquetFileWriter parquetFileWriter' should not be 'final' anymore, since we want to be able to replace it during the task.
>  The method 'checkBlockSizeReached' should call a new method 'startNewFile' just after the call to 'flushRowGroupToStore'.
>  The new method 'startNewFile' should have all the logic for closing the current file and starting a new one at the same location with a proper filename (see the sketch below).
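> A minimal sketch of the intended flow (hypothetical and simplified; 'splitFilePerBlock', 'conf', 'schema' and 'generateNextPath' are assumed to be made available to the writer, and the real 'checkBlockSizeReached' also adapts its check interval):
>
>   private void checkBlockSizeReached() throws IOException {
>     long memSize = columnStore.getBufferedSize();
>     if (memSize > nextRowGroupSize) {
>       flushRowGroupToStore();
>       if (splitFilePerBlock) {   // new flag from the proposed option
>         startNewFile();          // new: switch to a fresh output file
>       }
>       initStore();
>     }
>   }
>
>   // Hypothetical new method: finalize the current file, open the next.
>   private void startNewFile() throws IOException {
>     parquetFileWriter.end(extraMetaData);   // write footer and close
>     Path nextPath = generateNextPath();     // naming scheme TBD
>     parquetFileWriter = new ParquetFileWriter(conf, schema, nextPath);
>     parquetFileWriter.start();
>   }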
>  
>  ParquetFileWriter
>  The constructor argument 'OutputFile file' should be persisted as a new member field and exposed by a new public method. This information is useful for the 'startNewFile' implementation mentioned above.
>  The field 'MessageType schema' should be exposed by a new public method as well. This information is also useful for 'startNewFile' (both additions are sketched below).
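> For illustration, the two additions might look like this (hypothetical sketch):
>
>   private final OutputFile file;   // constructor argument, now retained
>
>   public OutputFile getOutputFile() { return file; }
>   public MessageType getSchema() { return schema; }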
>  
>  ParquetOutputFormat
>  The existing private method 'getMaxPaddingSize' should be made 'public' or at least package-private. This information is useful for the 'startNewFile' implementation mentioned above.
>  The new configuration option 'parquet.split.file.per.block' should be declared here like the other ones (sketched below). The new behaviour in 'InternalParquetRecordWriter' is conditioned on this configuration option.
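> A hypothetical sketch of the declaration, following the pattern of the existing options:
>
>   public static final String SPLIT_FILE_PER_BLOCK = "parquet.split.file.per.block";
>
>   public static boolean getSplitFilePerBlock(Configuration conf) {
>     return conf.getBoolean(SPLIT_FILE_PER_BLOCK, false);
>   }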
>  
>  ParquetRecordWriter
>  Just pass the configuration option along to the internal InternalParquetRecordWriter instance (see the sketch below).
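> For example (hypothetical sketch inside the ParquetRecordWriter constructor):
>
>   boolean splitFilePerBlock = ParquetOutputFormat.getSplitFilePerBlock(conf);
>   // ...then pass 'splitFilePerBlock' as an additional constructor
>   // argument to the InternalParquetRecordWriter instance.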



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)