Posted to issues@spark.apache.org by "chendihao (Jira)" <ji...@apache.org> on 2019/12/23 08:05:00 UTC

[jira] [Updated] (SPARK-30328) Fail to write local files with RDD.saveTextFile when setting the incorrect Hadoop configuration files

     [ https://issues.apache.org/jira/browse/SPARK-30328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chendihao updated SPARK-30328:
------------------------------
    Description: 
We find that incorrect Hadoop configuration files cause saving an RDD to the local file system to fail. This is unexpected because we explicitly specify a local `file://` URL, and the `DataFrame.write` API does not have this issue. It is easy to reproduce and verify with Spark 2.3.0.

1. Do not set the `HADOOP_CONF_DIR` environment variable.

2. Install pyspark and run the following local Python script. It should work and save the files to the local file system.
{code:java}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize([1, 2, 3])
rdd.saveAsTextFile("file:///tmp/rdd.text")
{code}
3. Set the `HADOOP_CONF_DIR` environment variable and put the Hadoop configuration files there. Make sure `core-site.xml` is well formed but points to an unresolvable host name, as in the sketch below.
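
As a sketch of step 3 (the host name below is a placeholder; any name that does not resolve reproduces the condition), a minimal `core-site.xml` could look like this:
{code:xml}
<configuration>
  <!-- fs.defaultFS points at a host name that cannot be resolved -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://unresolvable-host.example:8020</value>
  </property>
</configuration>
{code}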

4. Run the same Python script again. It now tries to connect to HDFS, fails to resolve the host name, and a Java exception is thrown.
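
To illustrate why the failure is surprising, the diagnostic sketch below (it relies on PySpark's private `_jsc`/`_jvm` py4j attributes, which are not public API) asks Hadoop which `FileSystem` implementation backs the explicit `file://` path; the URI scheme, not `fs.defaultFS`, should decide this:
{code:python}
from pyspark.sql import SparkSession

# Diagnostic sketch: uses PySpark's private py4j gateway (not public API).
spark = SparkSession.builder.master("local").getOrCreate()
sc = spark.sparkContext

hadoop_conf = sc._jsc.hadoopConfiguration()
path = sc._jvm.org.apache.hadoop.fs.Path("file:///tmp/rdd.text")
fs = path.getFileSystem(hadoop_conf)

# For a file:// URI this should print a local filesystem class,
# not DistributedFileSystem, regardless of what fs.defaultFS says.
print(fs.getClass().getName())
{code}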

We think `saveAsTextFile("file:///...")` should not attempt to connect to HDFS, no matter whether `HADOOP_CONF_DIR` is set. In fact, the following code works with the same incorrect Hadoop configuration files.
{code:java}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").getOrCreate()
rows = [("a", "1"), ("b", "2")]  # sample rows; any data will do
df = spark.createDataFrame(rows, ["attribute", "value"])
df.write.parquet("file:///tmp/df.parquet")
{code}
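
A possible mitigation sketch (untested against this exact report; `spark.hadoop.*` is Spark's standard pass-through for Hadoop properties) is to override the broken `fs.defaultFS` with the local filesystem when a job only writes local files:
{code:python}
from pyspark.sql import SparkSession

# Mitigation sketch: override the default filesystem so the HDFS host
# from core-site.xml never needs to be resolved for local-only jobs.
spark = (SparkSession.builder
         .master("local")
         .config("spark.hadoop.fs.defaultFS", "file:///")
         .getOrCreate())

rdd = spark.sparkContext.parallelize([1, 2, 3])
rdd.saveAsTextFile("file:///tmp/rdd.text")
{code}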

> Fail to write local files with RDD.saveTextFile when setting the incorrect Hadoop configuration files
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-30328
>                 URL: https://issues.apache.org/jira/browse/SPARK-30328
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.0
>            Reporter: chendihao
>            Priority: Major
>



