Posted to user@spark.apache.org by gtanguy <g....@gmail.com> on 2014/04/09 18:05:52 UTC

How does Spark handle RDDs via HDFS?

Hello everybody,

I am wondering how Spark handles its RDDs via HDFS: what happens if, during
a map phase, I need data that is not present locally?

What I am working on:
I am working on a recommendation algorithm: matrix factorization (MF), using
stochastic gradient descent as the optimizer. For now my algorithm works
locally, but to anticipate further needs I would like to parallelize it using
Spark 0.9.0 on HDFS (without YARN).
I saw the logistic regression (LR) SGD example in MLlib. Matrix factorization
can be viewed as multiple logistic regression iterations, so I will follow
that example to implement it. The only difference is that my dataset is
composed of 3 files:
	User.csv -> (UserID, age, sex, ...)
	Item.csv -> (ItemID, color, size, ...)
	Obs.csv  -> (UserID, ItemID, rating)

What I understand:
In the LR example we have only the 'Obs.csv' file. Given that we have 3
machines, the file will be split across the 3 machines; during the map phase,
the LR algorithm runs on each of the 3 slaves against its local data, so each
one processes 1/3 of the data. During the reduce phase, we just average the
results returned by the slaves. No network communication is needed during the
LR process except for the reduce step: all the data used during the map phase
is local.
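
If I understand correctly, the pattern looks roughly like this (a simplified
sketch, not real MLlib code; 'obs' stands for the RDD of parsed observations
and localSgd is a placeholder for the per-partition gradient step):

    // Each partition (~1/3 of Obs.csv per machine) is trained on locally;
    // only the small weight vectors cross the network during the reduce.
    val localModels = obs.mapPartitions(rows => Iterator(localSgd(rows.toSeq)))
    val numParts    = obs.partitions.length
    val avgModel    = localModels
      .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
      .map(_ / numParts)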

What I am wondering:
In my case, MF needs on each machine all of the 'User.csv' file, 1/3 of the
'Item.csv' file and 1/3 of 'Obs.csv' to operate. When HDFS distributes my 3
files, I will have 1/3 of each file on each datanode.
1. What will happen when my MF algorithm is executed on each node?
2. Network communications will be needed for the 2/3 of 'User.csv' that is
not local, right?
3. Will network communications be optimized as follows: while one
computation runs, will the data needed for the next computation be loaded in
the background (so that the time taken by network communication does not add
to the computation time)?

Any help is highly appreciated. 

Best regards,

Germain Tanguy.

Re: How does Spark handle RDDs via HDFS?

Posted by gtanguy <g....@gmail.com>.
Yes, that helps me understand better how Spark works. But it is also what I
was afraid of: I think the network communications will take too much time for
my job.

I will continue to look for a trick to avoid these network communications.

I saw on the hadoop website that : "To minimize global bandwidth consumption
and read latency, HDFS tries to satisfy a read request from a replica that
is closest to the reader. If there exists a replica on the same rack as the
reader node, then that replica is preferred to satisfy the read request"

Maybe if I manage to combine part of Spark with some of this behavior, it
could work.

Thank you very much for your answer.

Germain.

Re: How does Spark handle RDDs via HDFS?

Posted by Andrew Ash <an...@andrewash.com>.
The typical way to handle that use case would be to join the 3 files
together into one RDD and then do the factorization on that.  There will
definitely be network traffic during the initial join to get everything
into one table, and after that there will likely be more network traffic
for various shuffle joins that are needed.
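
Something along these lines, in Spark 0.9 Scala (an untested sketch; the
master URL, paths, and column positions are assumptions about your setup):

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._  // pair-RDD functions such as join

    val sc    = new SparkContext("spark://master:7077", "MF")
    val users = sc.textFile("hdfs:///data/User.csv")
      .map { line => val f = line.split(","); (f(0), f.drop(1).mkString(",")) }  // keyed by UserID
    val items = sc.textFile("hdfs:///data/Item.csv")
      .map { line => val f = line.split(","); (f(0), f.drop(1).mkString(",")) }  // keyed by ItemID
    val obs   = sc.textFile("hdfs:///data/Obs.csv")
      .map { line => val f = line.split(","); (f(0), (f(1), f(2).toDouble)) }    // (UserID, (ItemID, rating))

    // Join on UserID, then re-key by ItemID and join again; each join may shuffle.
    val full = obs.join(users)
      .map { case (u, ((i, r), uf)) => (i, (u, r, uf)) }
      .join(items)  // (ItemID, ((UserID, rating, userFeatures), itemFeatures))

That gives one RDD carrying everything the factorization needs for each
observation.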

So for your individual questions:

1) It's not so much that your algorithm is executed on each node, but that
the algorithm is executed on the cluster and small component tasks are run
on each node.  Each step in the algorithm happens across the whole cluster
concurrently.
2) Network communication will be needed to move data around during the
algorithm.  I'm not familiar with MF, but if you need the whole dataset on
a machine at once for some reason, then you'll have lots of network
communication.  (If it is only User.csv that has to be everywhere, a
broadcast variable can help -- see the first sketch after this list.)
3) This question sounds a bit like pre-fetching -- does Spark start loading
data that it will need "soon" but does not need yet?  That does not happen;
Spark RDDs are lazy (the technical term), meaning no computation happens
until an action forces it.  Think of that like stacking operations on top of
data.  Stacking operations without evaluating them is cheap, and once you
evaluate the whole thing Spark can pipeline the stages, which keeps the
resident memory set lower.  (See the second sketch after this list.)
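
On (2) specifically: if User.csv is small enough to fit in memory on each
worker, a broadcast variable is a common way to ship it to every machine
once instead of shuffling it around on every pass.  This isn't mentioned
in your thread, just an option.  A rough sketch, reusing the sc and obs
from above (paths and columns are still assumptions):

    // Collect the small User.csv on the driver and broadcast it to all
    // workers; tasks then look users up locally instead of joining over
    // the network.
    val userMap = sc.textFile("hdfs:///data/User.csv")
      .map { line => val f = line.split(","); (f(0), f.drop(1).mkString(",")) }
      .collect()
      .toMap
    val usersBc = sc.broadcast(userMap)

    val obsWithUsers = obs.map { case (userId, (itemId, rating)) =>
      (itemId, (usersBc.value(userId), rating))   // local lookup, no shuffle
    }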
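
And to make the laziness in (3) concrete: nothing below touches HDFS until
the count() at the end, at which point the whole chain runs as one
pipelined job (file name assumed as before):

    val lines  = sc.textFile("hdfs:///data/Obs.csv")  // nothing is read yet
    val parsed = lines.map(_.split(","))              // still nothing
    val good   = parsed.filter(_(2).toDouble > 3.0)   // still nothing
    val n      = good.count()                         // action: everything above runs now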

Does that help?

Andrew