Posted to user@spark.apache.org by Ognen Duzlevski <og...@nengoiksvelzud.com> on 2014/03/24 22:00:00 UTC

Writing RDDs to HDFS

Is someRDD.saveAsTextFile("hdfs://ip:port/path/final_filename.txt") 
supposed to work? Meaning, can I save files to the HDFS fs this way?

I tried:

val r = sc.parallelize(List(1,2,3,4,5,6,7,8))
r.saveAsTextFile("hdfs://ip:port/path/file.txt")

and it is just hanging. At the same time on my HDFS it created file.txt 
but as a directory which has subdirectories (the final one is empty).

Thanks!
Ognen

Re: Writing RDDs to HDFS

Posted by Ognen Duzlevski <og...@nengoiksvelzud.com>.
Hmm. Strange. Even the below hangs.

val r = sc.parallelize(List(1,2,3,4,5,6,7,8))
r.count

I then looked at the web UI on port 8080 and realized that the Spark
shell is in WAITING status, since another job is running on the
standalone cluster. This may sound like a very stupid question, but my
expectation was that I could submit multiple jobs at the same time
and some kind of fair strategy would run them in turn.
What Spark basics have I slept through? :)

Thanks!
Ognen

On 3/24/14, 4:00 PM, Ognen Duzlevski wrote:
> Is someRDD.saveAsTextFile("hdfs://ip:port/path/final_filename.txt") 
> supposed to work? Meaning, can I save files to the HDFS fs this way?
>
> I tried:
>
> val r = sc.parallelize(List(1,2,3,4,5,6,7,8))
> r.saveAsTextFile("hdfs://ip:port/path/file.txt")
>
> and it is just hanging. At the same time on my HDFS it created 
> file.txt but as a directory which has subdirectories (the final one is 
> empty).
>
> Thanks!
> Ognen
>
> -- 
> “No matter what they ever do to us, we must always act for the love of our people and the earth. We must not react out of hatred against those who have no sense.”
> -- John Trudell

Re: Writing RDDs to HDFS

Posted by Ognen Duzlevski <og...@plainvanillagames.com>.
Well, my long-running app has 512M per executor on a 16-node cluster
where each machine has 16G of RAM. I could not run a second application
until I restricted spark.cores.max. As soon as I restricted the
cores, I was able to run a second job at the same time.
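
In case it helps anyone, this is roughly what that looks like when
creating the context (the master URL and the numbers below are
placeholders, not my exact settings):

import org.apache.spark.{SparkConf, SparkContext}

// Cap this application at 4 cores cluster-wide and 512M per executor,
// so the standalone scheduler leaves resources for other applications.
val conf = new SparkConf()
  .setMaster("spark://master:7077")
  .setAppName("my-app")
  .set("spark.cores.max", "4")
  .set("spark.executor.memory", "512m")
val sc = new SparkContext(conf)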

Ognen

On 3/24/14, 7:46 PM, Yana Kadiyska wrote:
> Ognen, can you comment if you were actually able to run two jobs
> concurrently with just restricting spark.cores.max? I run Shark on the
> same cluster and was not able to see a standalone job get in (since
> Shark is a "long running" job) until I restricted both spark.cores.max
> _and_ spark.executor.memory. Just curious if I did something wrong.
>
> On Mon, Mar 24, 2014 at 7:48 PM, Ognen Duzlevski
> <og...@plainvanillagames.com> wrote:
>> Just so I can close this thread (in case anyone else runs into this stuff) -
>> I did sleep through the basics of Spark ;). The answer to why my job is in
>> waiting state (hanging) is here:
>> http://spark.incubator.apache.org/docs/latest/spark-standalone.html#resource-scheduling
>>
>>
>> Ognen
>>
>> On 3/24/14, 5:01 PM, Diana Carroll wrote:
>>
>> Ognen:
>>
>> I don't know why your process is hanging, sorry.  But I do know that the way
>> saveAsTextFile works is that you give it a path to a directory, not a file.
>> The "file" is saved in multiple parts, corresponding to the partitions.
>> (part-00000, part-00001 etc.)
>>
>> (Presumably it does this because it allows each partition to be saved on the
>> local disk, to minimize network traffic.  It's how Hadoop works, too.)
>>
>>
>>
>>
>> On Mon, Mar 24, 2014 at 5:00 PM, Ognen Duzlevski <og...@nengoiksvelzud.com>
>> wrote:
>>> Is someRDD.saveAsTextFile("hdfs://ip:port/path/final_filename.txt")
>>> supposed to work? Meaning, can I save files to the HDFS fs this way?
>>>
>>> I tried:
>>>
>>> val r = sc.parallelize(List(1,2,3,4,5,6,7,8))
>>> r.saveAsTextFile("hdfs://ip:port/path/file.txt")
>>>
>>> and it is just hanging. At the same time on my HDFS it created file.txt
>>> but as a directory which has subdirectories (the final one is empty).
>>>
>>> Thanks!
>>> Ognen

Re: Writing RDDs to HDFS

Posted by Yana Kadiyska <ya...@gmail.com>.
Ognen, can you comment if you were actually able to run two jobs
concurrently with just restricting spark.cores.max? I run Shark on the
same cluster and was not able to see a standalone job get in (since
Shark is a "long running" job) until I restricted both spark.cores.max
_and_ spark.executor.memory. Just curious if I did something wrong.

On Mon, Mar 24, 2014 at 7:48 PM, Ognen Duzlevski
<og...@plainvanillagames.com> wrote:
> Just so I can close this thread (in case anyone else runs into this stuff) -
> I did sleep through the basics of Spark ;). The answer to why my job is in
> waiting state (hanging) is here:
> http://spark.incubator.apache.org/docs/latest/spark-standalone.html#resource-scheduling
>
>
> Ognen
>
> On 3/24/14, 5:01 PM, Diana Carroll wrote:
>
> Ognen:
>
> I don't know why your process is hanging, sorry.  But I do know that the way
> saveAsTextFile works is that you give it a path to a directory, not a file.
> The "file" is saved in multiple parts, corresponding to the partitions.
> (part-00000, part-00001 etc.)
>
> (Presumably it does this because it allows each partition to be saved on the
> local disk, to minimize network traffic.  It's how Hadoop works, too.)
>
>
>
>
> On Mon, Mar 24, 2014 at 5:00 PM, Ognen Duzlevski <og...@nengoiksvelzud.com>
> wrote:
>>
>> Is someRDD.saveAsTextFile("hdfs://ip:port/path/final_filename.txt")
>> supposed to work? Meaning, can I save files to the HDFS fs this way?
>>
>> I tried:
>>
>> val r = sc.parallelize(List(1,2,3,4,5,6,7,8))
>> r.saveAsTextFile("hdfs://ip:port/path/file.txt")
>>
>> and it is just hanging. At the same time on my HDFS it created file.txt
>> but as a directory which has subdirectories (the final one is empty).
>>
>> Thanks!
>> Ognen
>
>
>
> --
> "A distributed system is one in which the failure of a computer you didn't
> even know existed can render your own computer unusable"
> -- Leslie Lamport

Re: Writing RDDs to HDFS

Posted by Ognen Duzlevski <og...@plainvanillagames.com>.
Just so I can close this thread (in case anyone else runs into this 
stuff) - I did sleep through the basics of Spark ;). The answer to why
my job is in waiting state (hanging) is here: 
http://spark.incubator.apache.org/docs/latest/spark-standalone.html#resource-scheduling

Ognen

On 3/24/14, 5:01 PM, Diana Carroll wrote:
> Ognen:
>
> I don't know why your process is hanging, sorry.  But I do know that 
> the way saveAsTextFile works is that you give it a path to a 
> directory, not a file.  The "file" is saved in multiple parts, 
> corresponding to the partitions. (part-00000, part-00001 etc.)
>
> (Presumably it does this because it allows each partition to be saved 
> on the local disk, to minimize network traffic.  It's how Hadoop 
> works, too.)
>
>
>
>
> On Mon, Mar 24, 2014 at 5:00 PM, Ognen Duzlevski 
> <ognen@nengoiksvelzud.com> wrote:
>
>     Is
>     someRDD.saveAsTextFile("hdfs://ip:port/path/final_filename.txt")
>     supposed to work? Meaning, can I save files to the HDFS fs this way?
>
>     I tried:
>
>     val r = sc.parallelize(List(1,2,3,4,5,6,7,8))
>     r.saveAsTextFile("hdfs://ip:port/path/file.txt")
>
>     and it is just hanging. At the same time on my HDFS it created
>     file.txt but as a directory which has subdirectories (the final
>     one is empty).
>
>     Thanks!
>     Ognen
>
>

-- 
"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable"
-- Leslie Lamport


Re: Writing RDDs to HDFS

Posted by Ognen Duzlevski <og...@plainvanillagames.com>.
Diana, thanks. I am not very well acquainted with HDFS. I use
hdfs dfs -put to put things into the filesystem as files (and
sc.textFile to get stuff out of them in Spark), and I see that they
appear to be saved as files replicated across 3 of the 16 nodes in the
HDFS cluster (which in my case is also my Spark cluster); hence, I was
puzzled to get a directory this time. What you are saying makes sense,
I suppose. As for the hanging, I am curious about that myself.
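
(One detail that made the directory layout make sense to me, using the
same placeholder URI as before: sc.textFile happily takes a directory,
so the output of saveAsTextFile can be read back as if it were a single
file.)

// Reads every part-NNNNN file under the directory saveAsTextFile created.
val back = sc.textFile("hdfs://ip:port/path/file.txt")
back.count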

Ognen

On 3/24/14, 5:01 PM, Diana Carroll wrote:
> Ognen:
>
> I don't know why your process is hanging, sorry.  But I do know that 
> the way saveAsTextFile works is that you give it a path to a 
> directory, not a file.  The "file" is saved in multiple parts, 
> corresponding to the partitions. (part-00000, part-00001 etc.)
>
> (Presumably it does this because it allows each partition to be saved 
> on the local disk, to minimize network traffic.  It's how Hadoop 
> works, too.)
>
>
>
>
> On Mon, Mar 24, 2014 at 5:00 PM, Ognen Duzlevski 
> <ognen@nengoiksvelzud.com> wrote:
>
>     Is
>     someRDD.saveAsTextFile("hdfs://ip:port/path/final_filename.txt")
>     supposed to work? Meaning, can I save files to the HDFS fs this way?
>
>     I tried:
>
>     val r = sc.parallelize(List(1,2,3,4,5,6,7,8))
>     r.saveAsTextFile("hdfs://ip:port/path/file.txt")
>
>     and it is just hanging. At the same time on my HDFS it created
>     file.txt but as a directory which has subdirectories (the final
>     one is empty).
>
>     Thanks!
>     Ognen
>
>

Re: Writing RDDs to HDFS

Posted by Diana Carroll <dc...@cloudera.com>.
Ognen:

I don't know why your process is hanging, sorry.  But I do know that the
way saveAsTextFile works is that you give it a path to a directory, not a
file.  The "file" is saved in multiple parts, corresponding to the
partitions. (part-00000, part-00001 etc.)

(Presumably it does this because it allows each partition to be saved on
the local disk, to minimize network traffic.  It's how Hadoop works, too.)
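
For example, a quick sketch (the path here is made up):

// Each partition becomes one part file under the target directory.
val r = sc.parallelize(List(1,2,3,4,5,6,7,8), 4)
r.saveAsTextFile("hdfs://ip:port/path/out")
// HDFS now has out/part-00000 through out/part-00003
// (plus, typically, a _SUCCESS marker written by Hadoop).

// If you really need a single part file, collapse to one partition
// first; this funnels all the data through one task, so only do it
// for small outputs:
r.coalesce(1).saveAsTextFile("hdfs://ip:port/path/out_single")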




On Mon, Mar 24, 2014 at 5:00 PM, Ognen Duzlevski
<og...@nengoiksvelzud.com> wrote:

> Is someRDD.saveAsTextFile("hdfs://ip:port/path/final_filename.txt")
> supposed to work? Meaning, can I save files to the HDFS fs this way?
>
> I tried:
>
> val r = sc.parallelize(List(1,2,3,4,5,6,7,8))
> r.saveAsTextFile("hdfs://ip:port/path/file.txt")
>
> and it is just hanging. At the same time on my HDFS it created file.txt
> but as a directory which has subdirectories (the final one is empty).
>
> Thanks!
> Ognen
>