Posted to user@spark.apache.org by Stuart Layton <st...@gmail.com> on 2015/03/25 19:59:20 UTC
Can a DataFrame be saved to s3 directly using Parquet?
I'm trying to save a DataFrame to S3 as a Parquet file, but I'm getting
"Wrong FS" errors:
>>> df.saveAsParquetFile(parquetFile)
15/03/25 18:56:10 INFO storage.MemoryStore: ensureFreeSpace(46645) called with curMem=82744, maxMem=278302556
15/03/25 18:56:10 INFO storage.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 45.6 KB, free 265.3 MB)
15/03/25 18:56:10 INFO storage.MemoryStore: ensureFreeSpace(7078) called with curMem=129389, maxMem=278302556
15/03/25 18:56:10 INFO storage.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 6.9 KB, free 265.3 MB)
15/03/25 18:56:10 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on ip-172-31-1-219.ec2.internal:58280 (size: 6.9 KB, free: 265.4 MB)
15/03/25 18:56:10 INFO storage.BlockManagerMaster: Updated info of block broadcast_5_piece0
15/03/25 18:56:10 INFO spark.SparkContext: Created broadcast 5 from textFile at JSONRelation.scala:98
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/spark/python/pyspark/sql/dataframe.py", line 121, in saveAsParquetFile
    self._jdf.saveAsParquetFile(path)
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o22.saveAsParquetFile.
: java.lang.IllegalArgumentException: Wrong FS: s3n://com.my.bucket/spark-testing/, expected: hdfs://ec2-52-0-159-113.compute-1.amazonaws.com:9000
Is it possible to save a dataframe to s3 directly using parquet?
--
Stuart Layton
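
For readers unfamiliar with the "Wrong FS" message: it generally means a path was handed to a filesystem object bound to a different scheme or host. The sketch below is only an illustration of that kind of check in plain Python. It is not Hadoop's or Spark's actual code, and the function name check_path is hypothetical. The two URIs are the ones from the traceback above.

```python
from urllib.parse import urlparse


def check_path(fs_uri, path):
    """Raise ValueError if `path` does not belong to the filesystem at `fs_uri`.

    Hypothetical, simplified stand-in for a scheme/authority check;
    not the real Hadoop implementation.
    """
    fs, target = urlparse(fs_uri), urlparse(path)
    # A path without a scheme is taken as relative to the filesystem itself.
    if not target.scheme:
        return
    # Scheme (hdfs, s3n, ...) and authority (host:port or bucket) must both match.
    if (target.scheme, target.netloc) != (fs.scheme, fs.netloc):
        raise ValueError("Wrong FS: %s, expected: %s" % (path, fs_uri))


# The cluster's default filesystem and the S3 target from the traceback.
default_fs = "hdfs://ec2-52-0-159-113.compute-1.amazonaws.com:9000"
try:
    check_path(default_fs, "s3n://com.my.bucket/spark-testing/")
except ValueError as err:
    message = str(err)
```

In this sketch the s3n:// path is rejected because it is checked against the HDFS default filesystem rather than against a filesystem resolved from the path's own scheme.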
Re: Can a DataFrame be saved to s3 directly using Parquet?
Posted by Michael Armbrust <mi...@databricks.com>.
Until then, you can try:
sql("SET spark.sql.parquet.useDataSourceApi=false")
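
From PySpark, the workaround might look roughly like the sketch below. It assumes you already have a SQLContext and the DataFrame from the original post. The helper name save_parquet_with_workaround is hypothetical; the SET statement and saveAsParquetFile call are the ones quoted in this thread.

```python
def save_parquet_with_workaround(sql_context, df, path):
    """Turn off the Parquet data source API, then save `df` to `path`.

    Hypothetical helper: `sql_context` is assumed to be a Spark 1.3
    SQLContext and `df` a DataFrame created from it.
    """
    # Setting suggested in this thread as a stopgap until Spark 1.3.1.
    sql_context.sql("SET spark.sql.parquet.useDataSourceApi=false")
    # Same call that raised "Wrong FS" in the original traceback.
    df.saveAsParquetFile(path)
```

Assuming the shell exposes the context as sqlContext, the call would be save_parquet_with_workaround(sqlContext, df, "s3n://com.my.bucket/spark-testing/"). The setting has to be applied on the same context that produced the DataFrame.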
On Wed, Mar 25, 2015 at 12:15 PM, Michael Armbrust <mi...@databricks.com>
wrote:
> This will be fixed in Spark 1.3.1:
> https://issues.apache.org/jira/browse/SPARK-6351
>
> and is fixed in master/branch-1.3 if you want to compile from source
Re: Can a DataFrame be saved to s3 directly using Parquet?
Posted by Michael Armbrust <mi...@databricks.com>.
This will be fixed in Spark 1.3.1:
https://issues.apache.org/jira/browse/SPARK-6351
It is already fixed in master/branch-1.3 if you want to compile from source.