Posted to user@hive.apache.org by John Omernik <jo...@omernik.com> on 2015/08/18 21:05:54 UTC

Parquet Files in Hive - Settings

Is there a good writeup on the settings that can be tweaked in Hive as
they pertain to writing Parquet files? For example, on some obscure
pages I've found settings like parquet.compression,
parquet.dictionary.page.size and parquet.enable.dictionary, but they were
in reference to stock MapReduce jobs, not Hive, and thus I don't even
know what the defaults for these are when using Hive. I tried doing hive
-e "set" | grep "parquet\." but these settings aren't there.

Any documentation on what these are, what Hive uses as defaults, etc., and
how I can optimize my Parquet writing with Hive would be appreciated.

Re: Parquet Files in Hive - Settings

Posted by Shannon Ladymon <sl...@gmail.com>.
I've created a JIRA (HIVE-11598
<https://issues.apache.org/jira/browse/HIVE-11598>) for adding more
documentation on Parquet properties to the Hive wiki. If you know of
anything in particular that should be added, please let me know and I'll do
my best to add it.

RE: Parquet Files in Hive - Settings

Posted by Ryan Harris <Ry...@zionsbancorp.com>.
Most of these are Parquet settings...

from https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java:
 * # The block size is the size of a row group being buffered in memory
 * # this limits the memory usage when writing
 * # Larger values will improve the IO when reading but consume more memory when writing
 * parquet.block.size=134217728 # in bytes, default = 128 * 1024 * 1024
 *
 * # The page size is for compression. When reading, each page can be decompressed independently.
 * # A block is composed of pages. The page is the smallest unit that must be read fully to access a single record.
 * # If this value is too small, the compression will deteriorate
 * parquet.page.size=1048576 # in bytes, default = 1 * 1024 * 1024
 *
 * # There is one dictionary page per column per row group when dictionary encoding is used.
 * # The dictionary page size works like the page size but for dictionary
 * parquet.dictionary.page.size=1048576 # in bytes, default = 1 * 1024 * 1024
 *
 * # The compression algorithm used to compress pages
 * parquet.compression=UNCOMPRESSED # one of: UNCOMPRESSED, SNAPPY, GZIP, LZO. Default: UNCOMPRESSED. Supersedes mapred.output.compress*
 *
 * # The write support class to convert the records written to the OutputFormat into the events accepted by the record consumer
 * # Usually provided by a specific ParquetOutputFormat subclass
 * parquet.write.support.class= # fully qualified name
 *
 * # To enable/disable dictionary encoding
 * parquet.enable.dictionary=true # false to disable dictionary encoding
 *
 * # To enable/disable summary metadata aggregation at the end of a MR job
 * # The default is true (enabled)
 * parquet.enable.summary-metadata=true # false to disable summary aggregation

public class ParquetOutputFormat<T> extends FileOutputFormat<Void, T> {
  private static final Log LOG = Log.getLog(ParquetOutputFormat.class);

  public static final String BLOCK_SIZE           = "parquet.block.size";
  public static final String PAGE_SIZE            = "parquet.page.size";
  public static final String COMPRESSION          = "parquet.compression";
  public static final String WRITE_SUPPORT_CLASS  = "parquet.write.support.class";
  public static final String DICTIONARY_PAGE_SIZE = "parquet.dictionary.page.size";
  public static final String ENABLE_DICTIONARY    = "parquet.enable.dictionary";
  public static final String VALIDATION           = "parquet.validation";
  public static final String WRITER_VERSION       = "parquet.writer.version";
  public static final String ENABLE_JOB_SUMMARY   = "parquet.enable.summary-metadata";
  public static final String MEMORY_POOL_RATIO    = "parquet.memory.pool.ratio";
  public static final String MIN_MEMORY_ALLOCATION = "parquet.memory.min.chunk.size";
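
As a rough sketch of how those writer properties can be driven from a Hive
session (assuming your Hive version passes them through to the Parquet
writer; the table names below are made up for illustration):

  SET parquet.compression=SNAPPY;            -- default is UNCOMPRESSED
  SET parquet.block.size=268435456;          -- 256 MB row groups; default is 128 MB
  SET parquet.page.size=1048576;             -- 1 MB pages (the default)
  SET parquet.dictionary.page.size=1048576;  -- 1 MB dictionary pages (the default)
  SET parquet.enable.dictionary=true;        -- dictionary encoding on (the default)

  INSERT OVERWRITE TABLE events_parquet
  SELECT * FROM events_text;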

Some of the variables (e.g. parquet.enable.summary-metadata) may not currently be exposed via Hive. Others (parquet.block.size, parquet.compression, parquet.enable.dictionary) have been exposed by Hive-specific JIRAs (HIVE-7685, HIVE-7858, HIVE-8823).
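
For instance, HIVE-7858 made the compression codec settable per table
rather than per session; a minimal sketch (hypothetical table):

  CREATE TABLE events_parquet (
    id BIGINT,
    event_time STRING
  )
  STORED AS PARQUET
  TBLPROPERTIES ('parquet.compression'='SNAPPY');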

I'm also aware of these Hive-specific ones:
hive.parquet.timestamp.skip.conversion (HIVE-9482), documented here:
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.parquet.timestamp.skip.conversion
parquet.column.index.access (HIVE-6938, HIVE-7800), documented here:
https://cwiki.apache.org/confluence/display/Hive/Parquet
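
A quick session-level sketch of those two (per the wiki entries above;
exact behavior varies by Hive version):

  -- Resolve Parquet columns by position instead of by name (HIVE-6938, HIVE-7800).
  SET parquet.column.index.access=true;
  -- Skip the UTC conversion when reading Parquet files written by other tools (HIVE-9482).
  SET hive.parquet.timestamp.skip.conversion=true;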

There may be others that I'm not aware of, but they would likely be in one of the JIRAs tracked by HIVE-8120.

It would certainly be nice if there were a single place to find all of this in the Hive documentation... many of the individual JIRA notes indicate that it "should be documented in Hive's wiki," yet that doesn't appear to have occurred.

https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties would seem like a logical location, but it is definitely incomplete at the moment.
