Posted to issues@spark.apache.org by "zhangxiongfei (JIRA)" <ji...@apache.org> on 2015/04/16 07:23:58 UTC
[jira] [Commented] (SPARK-6921) Spark SQL API "saveAsParquetFile" will output tachyon file with different block size
[ https://issues.apache.org/jira/browse/SPARK-6921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497580#comment-14497580 ]
zhangxiongfei commented on SPARK-6921:
--------------------------------------
I also did another test on HDFS files.
1) Set the HDFS block size to 256 MB:
sc.hadoopConfiguration.setLong("dfs.blocksize", 268435456)
2) Save the DataFrame as a Parquet file on HDFS. This writes files with a block size of 128 MB:
ta3.saveAsParquetFile("/user/zhangxf/adClick-parquet-no-compress");
3) Save the DataFrame as a Parquet file on HDFS again. This writes files with a block size of 256 MB:
ta3.saveAsParquetFile("hdfs://namendoe:8020/user/zhangxf/adClick-parquet-no-compress");
It seems that different HDFS path schemes lead to different block sizes.
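One plausible mechanism behind the scheme-dependent behavior is instance caching: if a filesystem object for the default (scheme-less) path was created before the block-size setting changed, it keeps the old value, while a fully qualified URI keys a fresh instance that sees the new one. The toy model below is NOT Hadoop's actual FileSystem cache, just a sketch of that hypothesis; all class names are invented and "namenode:8020" is a placeholder authority.

```python
# Toy model (hypothetical, not Hadoop's real implementation) of a
# filesystem cache keyed by (scheme, authority), where each instance
# snapshots the configuration at creation time.
from urllib.parse import urlparse

class ToyFileSystem:
    def __init__(self, conf):
        # Snapshot the block size when the instance is created.
        self.block_size = conf["dfs.blocksize"]

_cache = {}

def get_fs(uri, conf):
    parsed = urlparse(uri)
    # A bare path has no scheme/authority, so it keys ("", ""),
    # a different cache slot than a fully qualified hdfs:// URI.
    key = (parsed.scheme, parsed.netloc)
    if key not in _cache:
        _cache[key] = ToyFileSystem(dict(conf))
    return _cache[key]

conf = {"dfs.blocksize": 134217728}          # 128 MB, the initial default
get_fs("/user/zhangxf/earlier-load", conf)   # default-FS instance cached early

conf["dfs.blocksize"] = 268435456            # 256 MB, set later on the driver

bare = get_fs("/user/zhangxf/adClick-parquet-no-compress", conf)
qualified = get_fs("hdfs://namenode:8020/user/zhangxf/adClick-parquet-no-compress", conf)

print(bare.block_size // (1024 * 1024))       # 128: stale cached instance
print(qualified.block_size // (1024 * 1024))  # 256: fresh instance, new conf
```

Under this (unverified) reading, the bare path reuses the stale cached instance while the qualified URI picks up the updated configuration, matching the 128 MB vs 256 MB split observed above.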
> Spark SQL API "saveAsParquetFile" will output tachyon file with different block size
> ------------------------------------------------------------------------------------
>
> Key: SPARK-6921
> URL: https://issues.apache.org/jira/browse/SPARK-6921
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.3.0
> Reporter: zhangxiongfei
> Priority: Blocker
>
> I ran the code below in the Spark shell to access Parquet files in Tachyon.
> 1. First, created a DataFrame by loading a bunch of Parquet files from Tachyon:
> val ta3 = sqlContext.parquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m");
> 2. Second, set "fs.local.block.size" to 256 MB to make sure that the block size of the output files in Tachyon is 256 MB:
> sc.hadoopConfiguration.setLong("fs.local.block.size", 268435456)
> 3. Third, saved the above DataFrame as Parquet files stored in Tachyon:
> ta3.saveAsParquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test");
> After the above code ran successfully, the output Parquet files were stored in Tachyon, but these files have different block sizes. Below is the information for those files in the path "tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test":
> File Name             Size       Block Size  In-Memory  Pin  Creation Time
> _SUCCESS              0.00 B     256.00 MB   100%       NO   04-13-2015 17:48:23:519
> _common_metadata      1088.00 B  256.00 MB   100%       NO   04-13-2015 17:48:23:741
> _metadata             22.71 KB   256.00 MB   100%       NO   04-13-2015 17:48:23:646
> part-r-00001.parquet  177.19 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:626
> part-r-00002.parquet  177.21 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:636
> part-r-00003.parquet  177.02 MB  32.00 MB    100%       NO   04-13-2015 17:46:45:439
> part-r-00004.parquet  177.21 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:845
> part-r-00005.parquet  177.40 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:638
> part-r-00006.parquet  177.33 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:648
> It seems that the API saveAsParquetFile does not distribute/broadcast the Hadoop configuration to the executors the way other APIs such as saveAsTextFile do. The configuration "fs.local.block.size" only takes effect on the driver.
> If I set that configuration before loading the Parquet files, the problem goes away.
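The reporter's workaround (set the configuration before loading) is consistent with the configuration being snapshotted once, at load time, rather than re-read at write time. The sketch below is a toy model of that snapshot timing only; ToyJob and its default value are invented for illustration and do not correspond to any real Spark or Hadoop class.

```python
# Hypothetical sketch: a "job" that copies the driver configuration when
# it is created, mimicking a conf that is captured once and shipped to
# executors instead of being re-read at write time.
class ToyJob:
    def __init__(self, driver_conf):
        self.conf = dict(driver_conf)  # snapshot, not a live reference

    def output_block_size(self):
        # 32 MB stands in for whatever default applies when the
        # setting is absent from the snapshot.
        return self.conf.get("fs.local.block.size", 33554432)

driver_conf = {}
late_job = ToyJob(driver_conf)                   # created before the setting
driver_conf["fs.local.block.size"] = 268435456   # set too late: no effect
print(late_job.output_block_size())              # 33554432 (32 MB)

driver_conf2 = {"fs.local.block.size": 268435456}
early_job = ToyJob(driver_conf2)                 # setting present at creation
print(early_job.output_block_size())             # 268435456 (256 MB)
```

In this toy, setting the value before constructing the job (the reporter's workaround) is the only order that lands it in the snapshot, which mirrors the observed 32 MB vs 256 MB outcomes.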
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org