Posted to dev@spark.apache.org by Pei-Lun Lee <pl...@appier.com> on 2015/03/16 09:03:41 UTC
SparkSQL 1.3.0 cannot read parquet files from different file system
Hi,
I am using Spark 1.3.0 and cannot load parquet files from more than
one file system (say, one s3n://... path and one hdfs://... path). This
worked in older versions, and it still works in 1.3 if I set
spark.sql.parquet.useDataSourceApi=false.
One way to fix this is, instead of getting a single FileSystem from the
default configuration in ParquetRelation2, to call Path.getFileSystem for
each path.
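A minimal sketch of the per-path lookup described above, assuming the Hadoop client libraries are on the classpath. The helper name `fileSystemsFor` is hypothetical and this is not the actual ParquetRelation2 code; only `Path.getFileSystem(Configuration)` comes from the Hadoop API:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper for illustration; not the code in ParquetRelation2.
def fileSystemsFor(paths: Seq[String], conf: Configuration): Seq[(Path, FileSystem)] =
  paths.map { p =>
    val path = new Path(p)
    // Path.getFileSystem resolves the FileSystem from the path's own URI
    // (scheme and authority), so s3n:// and hdfs:// paths each get a
    // matching implementation instead of the single default one.
    path -> path.getFileSystem(conf)
  }
```

Resolving the FileSystem per path keeps the default configuration as a fallback only for paths that carry no scheme of their own.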
Here's the JIRA link and pull request:
https://issues.apache.org/jira/browse/SPARK-6351
https://github.com/apache/spark/pull/5039
Thanks,
--
Pei-Lun
Re: SparkSQL 1.3.0 cannot read parquet files from different file system
Posted by Pei-Lun Lee <pl...@appier.com>.
Looks like this is already solved in
https://issues.apache.org/jira/browse/SPARK-6330
On Mon, Mar 16, 2015 at 6:43 PM, Cheng Lian <li...@gmail.com> wrote:
> Oh sorry, I misread your question. I thought you were trying something
> like parquetFile("s3n://file1,hdfs://file2"). Yeah, it’s a valid bug.
> Thanks for opening the JIRA ticket and the PR!
>
>
> Cheng
>
> On 3/16/15 6:39 PM, Cheng Lian wrote:
>
> Hi Pei-Lun,
>
> We intentionally disallowed passing multiple comma-separated paths in
> 1.3.0. One of the reasons is that users reported that this fails when a file
> path contains an actual comma. In your case, you may do something like
> this:
>
>     // Load each file system's data separately, then combine the rows.
>     val s3nDF = parquetFile("s3n://...")
>     val hdfsDF = parquetFile("hdfs://...")
>     val finalDF = s3nDF.unionAll(hdfsDF)
>
> Cheng
>
> On 3/16/15 4:03 PM, Pei-Lun Lee wrote:
>
> Hi,
>
> I am using Spark 1.3.0 and cannot load parquet files from more than
> one file system (say, one s3n://... path and one hdfs://... path). This
> worked in older versions, and it still works in 1.3 if I set
> spark.sql.parquet.useDataSourceApi=false.
>
> One way to fix this is, instead of getting a single FileSystem from the
> default configuration in ParquetRelation2, to call Path.getFileSystem for
> each path.
>
> Here's the JIRA link and pull request:
> https://issues.apache.org/jira/browse/SPARK-6351
> https://github.com/apache/spark/pull/5039
>
> Thanks,
> --
> Pei-Lun
Re: SparkSQL 1.3.0 cannot read parquet files from different file system
Posted by Cheng Lian <li...@gmail.com>.
Oh sorry, I misread your question. I thought you were trying something
like parquetFile("s3n://file1,hdfs://file2"). Yeah, it’s a valid bug.
Thanks for opening the JIRA ticket and the PR!
Cheng
On 3/16/15 6:39 PM, Cheng Lian wrote:
> Hi Pei-Lun,
>
> We intentionally disallowed passing multiple comma-separated paths in
> 1.3.0. One of the reasons is that users reported that this fails when a
> file path contains an actual comma. In your case, you may do
> something like this:
>
>     // Load each file system's data separately, then combine the rows.
>     val s3nDF = parquetFile("s3n://...")
>     val hdfsDF = parquetFile("hdfs://...")
>     val finalDF = s3nDF.unionAll(hdfsDF)
>
> Cheng
>
> On 3/16/15 4:03 PM, Pei-Lun Lee wrote:
>
>> Hi,
>>
>> I am using Spark 1.3.0 and cannot load parquet files from more than
>> one file system (say, one s3n://... path and one hdfs://... path). This
>> worked in older versions, and it still works in 1.3 if I set
>> spark.sql.parquet.useDataSourceApi=false.
>>
>> One way to fix this is, instead of getting a single FileSystem from the
>> default configuration in ParquetRelation2, to call Path.getFileSystem for
>> each path.
>>
>> Here's the JIRA link and pull request:
>> https://issues.apache.org/jira/browse/SPARK-6351
>> https://github.com/apache/spark/pull/5039
>>
>> Thanks,
>> --
>> Pei-Lun
>>
>
Re: SparkSQL 1.3.0 cannot read parquet files from different file system
Posted by Cheng Lian <li...@gmail.com>.
Hi Pei-Lun,
We intentionally disallowed passing multiple comma-separated paths in
1.3.0. One of the reasons is that users reported that this fails when a file
path contains an actual comma. In your case, you may do something
like this:

    // Load each file system's data separately, then combine the rows.
    val s3nDF = parquetFile("s3n://...")
    val hdfsDF = parquetFile("hdfs://...")
    val finalDF = s3nDF.unionAll(hdfsDF)
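The same workaround extends to any number of locations. A minimal sketch, assuming an existing SQLContext and the Spark 1.3 DataFrame API; the helper name `loadParquetFrom` is hypothetical, and all inputs are assumed to share one schema, since unionAll combines columns by position:

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical helper for illustration: read each location on its own,
// so every path is resolved independently, then append the rows.
def loadParquetFrom(sqlContext: SQLContext, paths: Seq[String]): DataFrame = {
  require(paths.nonEmpty, "at least one path is required")
  paths
    .map(p => sqlContext.parquetFile(p)) // one DataFrame per location
    .reduce(_ unionAll _)                // keeps duplicates, like SQL UNION ALL
}
```

For example, `loadParquetFrom(sqlContext, Seq("s3n://...", "hdfs://..."))` would yield the same result as finalDF above.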
Cheng
On 3/16/15 4:03 PM, Pei-Lun Lee wrote:
> Hi,
>
> I am using Spark 1.3.0 and cannot load parquet files from more than
> one file system (say, one s3n://... path and one hdfs://... path). This
> worked in older versions, and it still works in 1.3 if I set
> spark.sql.parquet.useDataSourceApi=false.
>
> One way to fix this is, instead of getting a single FileSystem from the
> default configuration in ParquetRelation2, to call Path.getFileSystem for
> each path.
>
> Here's the JIRA link and pull request:
> https://issues.apache.org/jira/browse/SPARK-6351
> https://github.com/apache/spark/pull/5039
>
> Thanks,
> --
> Pei-Lun
>