Posted to issues@spark.apache.org by "Jork Zijlstra (JIRA)" <ji...@apache.org> on 2017/05/30 11:51:04 UTC

[jira] [Comment Edited] (SPARK-20799) Unable to infer schema for ORC on S3N when secrets are in the URL

    [ https://issues.apache.org/jira/browse/SPARK-20799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16029297#comment-16029297 ] 

Jork Zijlstra edited comment on SPARK-20799 at 5/30/17 11:50 AM:
-----------------------------------------------------------------

Hi [~dongjoon],

Sorry that it took some time to test the Parquet file. Our Spark cluster for the notebook got updated to Spark 2.1.1, but it wouldn't play nice with the notebook version, especially when using s3a paths. Using s3n paths I could generate the non-partitioned Parquet file.

It also seems to be a problem with Parquet files; it throws the same error:
{code}Exception in thread "main" org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;{code}
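
For context, a minimal sketch of the kind of read that hits this (the bucket, path, and embedded credentials below are placeholders, not our actual values):
{code}
// Sketch only: reading a non-partitioned Parquet file through an s3n URL
// that embeds the AWS credentials. Bucket, path, and keys are placeholders.
val df = spark.read.parquet("s3n://ACCESS_KEY:SECRET_KEY@my-bucket/parquet-no-partition/")
// => org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.
{code}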

[~stevel@apache.org] 
Thanks for the settings. I'm trying to get the notebook to play nice with s3a paths and am exploring the options now.

Don't you mean {code}
fs.s3a.bucket.site-2.access.key=my access key
fs.s3a.bucket.site-2.secret.key=my access secret
{code}
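
For what it's worth, here is roughly how I would pass those per-bucket keys through the SparkSession (the bucket name "site-2" and the values are placeholders, and per-bucket fs.s3a.bucket.* settings assume a Hadoop S3A client recent enough to support them):
{code}
import org.apache.spark.sql.SparkSession

// Sketch only: per-bucket S3A credentials passed as spark.hadoop.* properties,
// which Spark copies into the Hadoop configuration. Bucket name and values are placeholders.
val spark = SparkSession.builder
  .appName("s3a-per-bucket-credentials")
  .config("spark.hadoop.fs.s3a.bucket.site-2.access.key", "my access key")
  .config("spark.hadoop.fs.s3a.bucket.site-2.secret.key", "my access secret")
  .getOrCreate()

// With the keys in the configuration, the path no longer needs embedded credentials:
val df = spark.read.orc("s3a://site-2/path/to/data")
{code}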

Regards, jork


> Unable to infer schema for ORC on S3N when secrets are in the URL
> -----------------------------------------------------------------
>
>                 Key: SPARK-20799
>                 URL: https://issues.apache.org/jira/browse/SPARK-20799
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.1
>            Reporter: Jork Zijlstra
>            Priority: Minor
>
> We are getting the following exception: 
> {code}org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. It must be specified manually.{code}
> Combining the following factors will cause it (a reproduction sketch follows the list):
> - Use S3
> - Use the ORC format
> - Don't apply partitioning to the data
> - Embed AWS credentials in the path
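> A minimal reproduction sketch under those conditions (the bucket, path, and keys are placeholders, not real values):
> {code}
> // Sketch only: ORC data on S3N, no partitioning, AWS credentials embedded in the URL.
> // ACCESS_KEY, SECRET_KEY and the bucket/path are placeholders.
> val df = spark.read.orc("s3n://ACCESS_KEY:SECRET_KEY@my-bucket/orc-data/")
> // => org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. It must be specified manually.
> {code}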
> The problem is in PartitioningAwareFileIndex.allFiles():
> {code}
> leafDirToChildrenFiles.get(qualifiedPath)
>           .orElse { leafFiles.get(qualifiedPath).map(Array(_)) }
>           .getOrElse(Array.empty)
> {code}
> leafDirToChildrenFiles uses the path WITHOUT credentials as its key, while the qualifiedPath contains the path WITH credentials.
> So leafDirToChildrenFiles.get(qualifiedPath) doesn't find any files; no data is read and the schema cannot be inferred.
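> The mismatch can be illustrated with a plain Map lookup (a simplified sketch, not Spark's actual data structures; the paths are placeholders):
> {code}
> // Simplified sketch: the index is keyed by the path WITHOUT credentials,
> // but the lookup uses the qualified path WITH credentials, so it misses.
> val leafDirToChildrenFiles = Map("s3n://my-bucket/orc-data" -> Array("part-00000"))
> val qualifiedPath = "s3n://ACCESS_KEY:SECRET_KEY@my-bucket/orc-data"
> 
> val files = leafDirToChildrenFiles.get(qualifiedPath)   // None: the keys differ
>   .getOrElse(Array.empty[String])                       // => empty, so no data and no schema
> {code}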
> Spark does log the warning "S3xLoginHelper:90 - The Filesystem URI contains login details. This is insecure and may be unsupported in future.", but that warning should not mean it stops working entirely.
> Workaround:
> Move the AWS credentials from the path to the SparkSession
> {code}
> val spark = SparkSession.builder
> 	.config("spark.hadoop.fs.s3n.awsAccessKeyId", {awsAccessKeyId})
> 	.config("spark.hadoop.fs.s3n.awsSecretAccessKey", {awsSecretAccessKey})
> 	.getOrCreate()
> {code}


