Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2020/01/06 06:29:00 UTC

[jira] [Commented] (SPARK-30328) Fail to write local files with RDD.saveTextFile when setting the incorrect Hadoop configuration files

    [ https://issues.apache.org/jira/browse/SPARK-30328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17008572#comment-17008572 ] 

Hyukjin Kwon commented on SPARK-30328:
--------------------------------------

Why don't you set the Hadoop configuration correctly? I think failing fast isn't a horrible idea.

> Fail to write local files with RDD.saveTextFile when setting the incorrect Hadoop configuration files
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-30328
>                 URL: https://issues.apache.org/jira/browse/SPARK-30328
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.0
>            Reporter: chendihao
>            Priority: Major
>
> We find that incorrect Hadoop configuration files cause RDD.saveAsTextFile to fail when saving to the local file system. This is unexpected because we have specified a local URL, and the DataFrame.write.text API does not have this issue. It is easy to reproduce and verify with Spark 2.3.0.
> 1. Do not set the `HADOOP_CONF_DIR` environment variable.
> 2. Install pyspark and run the following Python script locally. This should work and save the files to the local file system.
> {code:java}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.master("local").getOrCreate()
> sc = spark.sparkContext
> # Save a small RDD; the file:// scheme should bypass HDFS entirely.
> rdd = sc.parallelize([1, 2, 3])
> rdd.saveAsTextFile("file:///tmp/rdd.text")
> {code}
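> If everything is set up correctly, this produces a directory `/tmp/rdd.text/` containing `part-*` files and a `_SUCCESS` marker, written entirely through the local file system.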
> 3. Set the `HADOOP_CONF_DIR` environment variable and put the Hadoop configuration files there. Make sure `core-site.xml` is well-formed but contains an unresolvable host name.
> 4. Run the same Python script again. When it tries to connect to HDFS and fails to resolve the host name, a Java exception is thrown.
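> For illustration, a minimal `core-site.xml` that should reproduce the problem could look like the following; `not-a-real-namenode` is a hypothetical placeholder for any host name that DNS cannot resolve:
> {code:xml}
> <?xml version="1.0"?>
> <configuration>
>   <!-- Well-formed XML, but the default file system points at an
>        unresolvable host. -->
>   <property>
>     <name>fs.defaultFS</name>
>     <value>hdfs://not-a-real-namenode:8020</value>
>   </property>
> </configuration>
> {code}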
> We think `saveAsTextFile("file:///...")` should not attempt to connect to HDFS at all, whether or not `HADOOP_CONF_DIR` is set. In fact, the following DataFrame code works with the same incorrect Hadoop configuration files:
> {code:java}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.master("local").getOrCreate()
> # Sample rows; the original snippet left `rows` undefined.
> rows = [("a", "1"), ("b", "2")]
> df = spark.createDataFrame(rows, ["attribute", "value"])
> df.write.parquet("file:///tmp/df.parquet")
> {code}
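> A possible workaround, assuming the failure comes from resolving `fs.defaultFS`: override that property for the session through Spark's `spark.hadoop.*` configuration passthrough so the local job never contacts HDFS. This is a sketch, not a verified fix for 2.3.0:
> {code:java}
> from pyspark.sql import SparkSession
> # spark.hadoop.<key> entries are copied into the underlying Hadoop
> # Configuration, shadowing whatever core-site.xml specifies.
> spark = (SparkSession.builder
>          .master("local")
>          .config("spark.hadoop.fs.defaultFS", "file:///")
>          .getOrCreate())
> rdd = spark.sparkContext.parallelize([1, 2, 3])
> rdd.saveAsTextFile("file:///tmp/rdd.text")
> {code}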



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org