Posted to user@spark.apache.org by sh...@tsmc.com on 2015/07/17 07:56:17 UTC

what is : ParquetFileReader: reading summary file ?

Hi all,

our scenario is to generate lots of folders containing Parquet files and
then use "add partition" to add these folder locations to a Hive table.
When we read the Hive table using Spark, the logs below show up, and
reading these summary files takes a lot of time. This no longer happens
from the second or third time we query the table through SQL in HiveContext.
Does that mean the Parquet reader does some caching by itself? Thanks!


Jul 17, 2015 1:05:40 PM INFO: parquet.hadoop.ParquetFileReader: Initiating
action with parallelism: 5
Jul 17, 2015 1:05:40 PM INFO: parquet.hadoop.ParquetFileReader: reading
summary file: hdfs://f14ecat/HDFS/NEW_TCHART/20150701/LDSN/_common_metadata
Jul 17, 2015 1:05:40 PM INFO: parquet.hadoop.ParquetFileReader: reading
summary file: hdfs://f14ecat/HDFS/NEW_TCHART/20150702/MECC/_common_metadata
Jul 17, 2015 1:05:40 PM INFO: parquet.hadoop.ParquetFileReader: reading
summary file: hdfs://f14ecat/HDFS/NEW_TCHART/20150702/MCOX/_common_metadata
Jul 17, 2015 1:05:40 PM INFO: parquet.hadoop.ParquetFileReader: reading
summary file: hdfs://f14ecat/HDFS/NEW_TCHART/20150629/LCTE/_common_metadata
Jul 17, 2015 1:05:40 PM INFO: parquet.hadoop.ParquetFileReader: reading
summary file: hdfs://f14ecat/HDFS/NEW_TCHART/20150630/MDNS/_common_metadata
Jul 17, 2015 1:05:41 PM INFO: parquet.hadoop.ParquetFileReader: reading
summary file: hdfs://f14ecat/HDFS/NEW_TCHART/20150701/VSHM/_common_metadata
Jul 17, 2015 1:05:41 PM INFO: parquet.hadoop.ParquetFileReader: reading
summary file: hdfs://f14ecat/HDFS/NEW_TCHART/20150624/LSCB/_common_metadata
Jul 17, 2015 1:05:41 PM INFO: parquet.hadoop.ParquetFileReader: reading
summary file: hdfs://f14ecat/HDFS/NEW_TCHART/20150628/MPD8/_common_metadata
Jul 17, 2015 1:05:41 PM INFO: parquet.hadoop.ParquetFileReader: reading
summary file: hdfs://f14ecat/HDFS/NEW_TCHART/20150703/VSHM/_common_metadata
Jul 17, 2015 1:05:41 PM INFO: parquet.hadoop.ParquetFileReader: reading
summary file: hdfs://f14ecat/HDFS/NEW_TCHART/20150630/LIHI/_common_metadata
Jul 17, 2015 1:05:41 PM INFO: parquet.hadoop.ParquetFileReader: reading
summary file: hdfs://f14ecat/HDFS/NEW_TCHART/20150701/LESE/_common_metadata
Jul 17, 2015 1:05:41 PM INFO: parquet.hadoop.ParquetFileReader: reading
summary file: hdfs://f14ecat/HDFS/NEW_TCHART/20150626/MPD8/_common_metadata
Jul 17, 2015 1:05:41 PM INFO: parquet.hadoop.ParquetFileReader: reading
summary file: hdfs://f14ecat/HDFS/NEW_TCHART/20150624/MDHK/_common_metadata
Jul 17, 2015 1:05:41 PM INFO: parquet.hadoop.ParquetFileReader: reading
summary file: hdfs://f14ecat/HDFS/NEW_TCHART/20150628/VEMH/_common_metadata
Jul 17, 2015 1:05:41 PM INFO: parquet.hadoop.ParquetFileReader: reading
summary file: hdfs://f14ecat/HDFS/NEW_TCHART/20150626/MDHK/_common_metadata
Jul 17, 2015 1:05:41 PM INFO: parquet.hadoop.ParquetFileReader: reading
summary file: hdfs://f14ecat/HDFS/NEW_TCHART/20150628/LSCB/_common_metadata
Jul 17, 2015 1:05:41 PM INFO: parquet.hadoop.ParquetFileReader: reading
summary file: hdfs://f14ecat/HDFS/NEW_TCHART/20150627/LESR/_common_metadata
Jul 17, 2015 1:05:41 PM INFO: parquet.hadoop.ParquetFileReader: reading
summary file: hdfs://f14ecat/HDFS/NEW_TCHART/20150703/LESE/_common_metadata

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

Re: what is : ParquetFileReader: reading summary file ?

Posted by Cheng Lian <li...@gmail.com>.

Yes, Spark SQL's Parquet support needs to do some metadata discovery the
first time it reads a folder containing Parquet files, and the discovered
metadata is then cached.

Cheng
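
To make the "discover once, then reuse" behaviour described above concrete, here is a minimal toy sketch in plain Python. It is not Spark's actual implementation: the class MetadataCache, the function fake_read_summary and the refresh method are hypothetical names made up for illustration, and the summary read is simulated rather than done against HDFS. The two folder paths are taken from the log in the original question.

```python
from typing import Callable, Dict, List


class MetadataCache:
    """Toy model: read a folder's summary once, reuse it on later queries."""

    def __init__(self, read_summary: Callable[[str], dict]) -> None:
        # read_summary stands in for the expensive per-folder summary read.
        self._read_summary = read_summary
        self._cache: Dict[str, dict] = {}

    def metadata_for(self, folder: str) -> dict:
        # Only the first request for a folder triggers a read.
        if folder not in self._cache:
            self._cache[folder] = self._read_summary(folder)
        return self._cache[folder]

    def refresh(self, folder: str) -> None:
        # Forget one folder so that the next request reads it again.
        self._cache.pop(folder, None)


reads: List[str] = []  # records every simulated summary read


def fake_read_summary(folder: str) -> dict:
    # Hypothetical stand-in for reading <folder>/_common_metadata.
    reads.append(folder)
    return {"summary_file": folder + "/_common_metadata"}


folders = [
    "hdfs://f14ecat/HDFS/NEW_TCHART/20150701/LDSN",
    "hdfs://f14ecat/HDFS/NEW_TCHART/20150702/MECC",
]

cache = MetadataCache(fake_read_summary)
for _ in range(3):  # three "queries" over the same partition folders
    for folder in folders:
        cache.metadata_for(folder)

print(len(reads))  # prints 2: only the first query reads the summary files
```

This mirrors what the question observed: the "reading summary file" work is paid on the first query and skipped on later ones. A cache like this can go stale if folders change after the first read, which is why the sketch includes a refresh step. The Spark SQL programming guide of that era lists a spark.sql.parquet.cacheMetadata option for this caching; check the guide for your version.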


