Posted to user@spark.apache.org by MorEru <hs...@gmail.com> on 2015/07/06 22:47:31 UTC

Spark standalone cluster - Output file stored in temporary directory in worker

I have a Spark standalone cluster with 2 workers -

The master and one worker run on a single machine -- Machine 1
Another worker runs on a separate machine -- Machine 2

I am running a spark shell on the 2nd machine that reads a file from HDFS,
does some calculations on it, and stores the result back in HDFS.

This is how I read the file in spark shell -
val file = sc.textFile("hdfs://localhost:9000/user/root/table.csv")

And this is how I write the result back to a file -
finalRDD.saveAsTextFile("hdfs://localhost:9000/user/root/output_file")

When I run the code, it runs on the cluster and the job succeeds, with each
worker processing roughly half of the input file. I can also see the
processed records in the web UI.

But when I check HDFS on the 2nd machine, I find only one part of the output
file.

The other part is stored in HDFS on the 1st machine. But even that part is
not in the proper HDFS location; it is instead stored under a _temporary
directory.

In machine 2 -

root@worker:~# hadoop fs -ls ./output_file
Found 2 items
-rw-r--r--   3 root supergroup          0 2015-07-06 16:12 output_file/_SUCCESS
-rw-r--r--   3 root supergroup     984337 2015-07-06 16:12 output_file/part-00000

In machine 1 -

root@spark:~# hadoop fs -ls ./output_file/_temporary/0/task_201507061612_0003_m_000001
-rw-r--r--   3 root supergroup     971824 2015-07-06 16:12 output_file/_temporary/0/task_201507061612_0003_m_000001/part-00001


I have a couple of questions -

1. Shouldn't both parts be on worker 2, since the HDFS referred to in
saveAsTextFile is the local HDFS (see the sketch below)? Or will the output
always be split across the workers?
2. Why is the output stored under a _temporary directory on machine 1?
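
For reference, this is roughly what I expected the write path to look like if
every worker resolved the same HDFS. The hostname "machine2" below is only a
placeholder for whichever machine actually runs the namenode:

// Hypothetical variant: address the namenode by hostname instead of
// localhost, so every executor resolves the same HDFS. "machine2" is a
// placeholder for the actual namenode host.
val file = sc.textFile("hdfs://machine2:9000/user/root/table.csv")

// saveAsTextFile writes one part-NNNNN file per partition, and each
// partition is written out by the executor that computed it.
finalRDD.saveAsTextFile("hdfs://machine2:9000/user/root/output_file")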






Re: Spark standalone cluster - Output file stored in temporary directory in worker

Posted by maxdml <ma...@gmail.com>.
I think the properties that you have in your hdfs-site.xml should go in the
core-site.xml (at least for the namenode.name and datanode.data ones). I
might be wrong here, but that's what I have in my setup.

You should also add hadoop.tmp.dir in your core-site.xml. That might be the
source of your inconsistency.

As for hadoop-env.sh, I just use it to export variables such as
HADOOP_PREFIX, LOG_DIR, CONF_DIR and JAVA_HOME.
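
For the hadoop.tmp.dir part, a minimal sketch of what I mean in core-site.xml
(the path below is just an example; use whatever persistent local directory
you prefer):

<property>
  <!-- example path only; pick a persistent local directory -->
  <name>hadoop.tmp.dir</name>
  <value>/usr/local/hadoop_store/tmp</value>
</property>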






Re: Spark standalone cluster - Output file stored in temporary directory in worker

Posted by MorEru <hs...@gmail.com>.
core-site.xml 

<configuration>
<property>
   <name>fs.default.name</name>
   <value>hdfs://localhost:9000</value>
</property>
</configuration>

hdfs-site.xml -

<configuration>
<property>
   <name>dfs.replication</name>
   <value>1</value>
 </property>
 <property>
   <name>dfs.namenode.name.dir</name>
   <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
 </property>
 <property>
   <name>dfs.datanode.data.dir</name>
   <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
 </property>
</configuration>

I have not made any changes to the default hadoop-env.sh apart from manually
adding the JAVA_HOME entry.

What should the properties be configured to? Should they point to the master
HDFS where the file is actually present?
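
For example, would both machines need their core-site.xml to point at the
same namenode host, something like this? ("machine2" below is just a
placeholder for the host that actually runs the namenode.)

<property>
  <!-- placeholder hostname; would replace localhost on both machines -->
  <name>fs.default.name</name>
  <value>hdfs://machine2:9000</value>
</property>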

Thanks.






Re: Spark standalone cluster - Output file stored in temporary directory in worker

Posted by maxdml <ma...@gmail.com>.
Can you share your hadoop configuration files please?

- etc/hadoop/core-site.xml
- etc/hadoop/hdfs-site.xml
- etc/hadoop/hadoop-env.sh

AFAIK, the following properties should be configured:

hadoop.tmp.dir, dfs.namenode.name.dir, dfs.datanode.data.dir and
dfs.namenode.checkpoint.dir

Otherwise, an HDFS slave will use its default temporary folder to save
blocks.
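
Roughly along these lines (the paths are only examples; adjust them to your
own layout). hadoop.tmp.dir goes in core-site.xml:

<property>
  <!-- example path only -->
  <name>hadoop.tmp.dir</name>
  <value>/usr/local/hadoop_store/tmp</value>
</property>

and the dfs.* directories go in hdfs-site.xml:

<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>
<property>
  <!-- example path only -->
  <name>dfs.namenode.checkpoint.dir</name>
  <value>file:/usr/local/hadoop_store/hdfs/secondarynamenode</value>
</property>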


