Posted to user@spark.apache.org by rapelly kartheek <ka...@gmail.com> on 2014/09/03 11:02:56 UTC

RDDs

Hi,

Can someone tell me what kind of operations can be performed on a
replicated RDD? What are the use cases of a replicated RDD?

One basic doubt that has been bothering me for a long time: what is the
difference between an application and a job in Spark parlance? I am
confused because of the Hadoop jargon.

Thank you

Re: RDDs

Posted by Manas Kar <ma...@gmail.com>.

The above is a great example using threads.
Does anyone have an example using a Scala/Akka Future to do the same?
I am looking for an example that uses an Akka Future and does
something if the Future times out.

On Tue, Mar 3, 2015 at 9:16 AM, Manas Kar <ma...@gmail.com>
wrote:

> The above is a great example using threads.
> Does anyone have an example using a Scala/Akka Future to do the same?
> I am looking for an example that uses an Akka Future and does
> something if the Future times out.

Re: RDDs

Posted by Manas Kar <ma...@gmail.com>.

The above is a great example using threads.
Does anyone have an example using a Scala/Akka Future to do the same?
I am looking for an example that uses an Akka Future and does
something if the Future times out.

On Tue, Mar 3, 2015 at 7:00 AM, Kartheek.R <ka...@gmail.com> wrote:

> Hi TD,
> "You can always run two jobs on the same cached RDD, and they can run in
> parallel (assuming you launch the 2 jobs from two different threads)"
>
> Is this a correct way to launch jobs from two different threads?

Re: RDDs

Posted by "Kartheek.R" <ka...@gmail.com>.

Hi TD,
"You can always run two jobs on the same cached RDD, and they can run in
parallel (assuming you launch the 2 jobs from two different threads)"

Is this a correct way to launch jobs from two different threads?

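// `logData` (the cached RDD) and `end` (the number of iterations) are defined elsewhere.
val threadA = new Thread(new Runnable {
  def run(): Unit = {
    for (i <- 0 until end) {
      // count is an action, so each iteration runs one Spark job.
      val numAs = logData.filter(line => line.contains("a"))
      println("Lines with a: %s".format(numAs.count))
    }
  }
})

val threadB = new Thread(new Runnable {
  def run(): Unit = {
    for (i <- 0 until end) {
      val numBs = logData.filter(line => line.contains("b"))
      println("Lines with b: %s".format(numBs.count))
    }
  }
})

// Nothing is submitted until the threads are started.
threadA.start()
threadB.start()
threadA.join()
threadB.join()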
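
The same idea can also be expressed with scala.concurrent.Future instead of raw threads, which makes it easy to react when a job takes too long. What follows is only a minimal sketch, not a definitive implementation: `FutureTimeoutSketch` and `runWithTimeout` are hypothetical names introduced for illustration, and the by-name `job` argument stands in for a Spark action such as `logData.filter(...).count`.

```scala
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object FutureTimeoutSketch {
  // Runs `job` on a thread of the global execution context and waits at most
  // `timeout` for its result. Returns Some(result) if the job finished in time
  // and None if the wait timed out. Exceptions thrown by `job` are rethrown.
  def runWithTimeout[T](timeout: FiniteDuration)(job: => T): Option[T] = {
    val future = Future(job)
    try Some(Await.result(future, timeout))
    catch { case _: TimeoutException => None }
  }

  def main(args: Array[String]): Unit = {
    // Stand-ins for Spark actions: one quick job and one that is too slow.
    val fast = runWithTimeout(2.seconds) { 21 * 2 }
    val slow = runWithTimeout(100.millis) { Thread.sleep(2000); "done" }
    println(s"fast = $fast")
    println(s"slow = $slow")
  }
}
```

With these arguments the first call returns Some(42) and the second returns None, because the second job sleeps for longer than its timeout. Note that timing out only abandons the wait: the computation itself keeps running. For a real Spark job, one option would be to tag it with `SparkContext.setJobGroup` and cancel it with `SparkContext.cancelJobGroup` once the timeout fires.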
    



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDDs-tp13343p21892.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

Re: RDDs

Posted by Tathagata Das <ta...@gmail.com>.

Yes, Raymond is right. You can always run two jobs on the same cached RDD,
and they can run in parallel (assuming you launch the 2 jobs from two
different threads). However, with one copy of each RDD partition, the tasks
of the two jobs will experience some slot contention. So if you replicate it,
you can eke out a bit more parallelism, as there will be less contention.
However, it is still not possible to guarantee that there will be no
contention at all: there may simply be many times more tasks than slots
available in the cluster, in which case the jobs will contend for slots even
if data locality is ignored.
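
As a rough illustration of the point above, the sketch below caches an RDD with two replicas per partition and then launches two jobs on it concurrently (here via scala.concurrent.Future, whose bodies run on threads of the global execution context). It is a minimal sketch under assumptions that are not from this thread: the names `ReplicatedParallelJobs` and `countsInParallel`, the sample data, and the `local[4]` master are made up for illustration. `StorageLevel.MEMORY_ONLY_2` is the built-in storage level that requests two copies of each partition; in local mode there are no other nodes to replicate to, so the benefit only shows up on a real cluster.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ReplicatedParallelJobs {
  // Counts the lines containing "a" and the lines containing "b"
  // as two jobs submitted from different threads.
  def countsInParallel(logData: RDD[String]): (Long, Long) = {
    // Request two replicas of every cached partition.
    logData.persist(StorageLevel.MEMORY_ONLY_2)
    logData.count() // the first action materializes the cache

    // Each Future submits one job; the scheduler may run them side by side.
    val countA = Future(logData.filter(_.contains("a")).count())
    val countB = Future(logData.filter(_.contains("b")).count())
    (Await.result(countA, 5.minutes), Await.result(countB, 5.minutes))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, purely for illustration.
    val conf = new SparkConf().setAppName("ReplicatedParallelJobs").setMaster("local[4]")
    val sc = new SparkContext(conf)
    try {
      val logData = sc.parallelize(Seq("apple", "bob", "banana", "cherry"))
      val (as, bs) = countsInParallel(logData)
      println(s"Lines with a: $as, lines with b: $bs")
    } finally sc.stop()
  }
}
```

Whether the two jobs actually overlap still depends on how many task slots are free, which is exactly the contention described above.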


On Wed, Sep 3, 2014 at 11:08 PM, Liu, Raymond <ra...@intel.com> wrote:

> Actually, a replicated RDD and running parallel jobs on the same RDD are
> two unrelated concepts.
> A replicated RDD just stores its data on multiple nodes; this helps with HA
> and provides a better chance of data locality. It is still one RDD, not two
> separate RDDs.
> As for running two jobs on the same RDD, it doesn't matter whether the
> RDD is replicated or not. You can always do it if you wish to.

Fwd: RDDs

Posted by rapelly kartheek <ka...@gmail.com>.

---------- Forwarded message ----------
From: rapelly kartheek <ka...@gmail.com>
Date: Thu, Sep 4, 2014 at 11:49 AM
Subject: Re: RDDs
To: "Liu, Raymond" <ra...@intel.com>


Thank you Raymond.
I am clearer now. So, if an RDD is replicated over multiple nodes (i.e.,
say two sets of nodes, as it is a collection of chunks), can we run two jobs
concurrently and separately on these two sets of nodes?


On Thu, Sep 4, 2014 at 11:38 AM, Liu, Raymond <ra...@intel.com> wrote:

> Actually, a replicated RDD and running parallel jobs on the same RDD are
> two unrelated concepts.
> A replicated RDD just stores its data on multiple nodes; this helps with HA
> and provides a better chance of data locality. It is still one RDD, not two
> separate RDDs.
> As for running two jobs on the same RDD, it doesn't matter whether the
> RDD is replicated or not. You can always do it if you wish to.

RE: RDDs

Posted by "Liu, Raymond" <ra...@intel.com>.

Actually, a replicated RDD and running parallel jobs on the same RDD are two unrelated concepts.
A replicated RDD just stores its data on multiple nodes; this helps with HA and provides a better chance of data locality. It is still one RDD, not two separate RDDs.
As for running two jobs on the same RDD, it doesn't matter whether the RDD is replicated or not. You can always do it if you wish to.
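
To illustrate, replication is just a property of the storage level an RDD is persisted with. The following is a minimal sketch, assuming a local-mode SparkContext; the names `ReplicationLevelSketch` and `replicationOf` and the sample data are hypothetical and only there for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object ReplicationLevelSketch {
  // Number of copies of each partition requested by the RDD's storage level.
  def replicationOf(rdd: RDD[_]): Int = rdd.getStorageLevel.replication

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, purely for illustration.
    val conf = new SparkConf().setAppName("ReplicationLevelSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val single = sc.parallelize(1 to 100).persist(StorageLevel.MEMORY_ONLY)
      val replicated = sc.parallelize(1 to 100).persist(StorageLevel.MEMORY_ONLY_2)
      println(s"MEMORY_ONLY replication:   ${replicationOf(single)}")
      println(s"MEMORY_ONLY_2 replication: ${replicationOf(replicated)}")

      // Either way it is one RDD, used like any other.
      println(replicated.map(_ * 2).reduce(_ + _))
    } finally sc.stop()
  }
}
```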


Best Regards,
Raymond Liu

-----Original Message-----
From: Kartheek.R [mailto:kartheek.mbms@gmail.com] 
Sent: Thursday, September 04, 2014 1:24 PM
To: user@spark.incubator.apache.org
Subject: RE: RDDs

Thank you Raymond and Tobias. 
Yeah, I am very clear about what I was asking. I was talking about "replicated" RDDs only. Now that I've got my understanding of job and application validated, I wanted to know if we can replicate an RDD and run two jobs (that need the same RDD) of an application in parallel?

-Karthk

RE: RDDs

Posted by "Kartheek.R" <ka...@gmail.com>.

Thank you Raymond and Tobias.
Yeah, I am very clear about what I was asking. I was talking about
"replicated" RDDs only. Now that I've got my understanding of job and
application validated, I wanted to know if we can replicate an RDD and run
two jobs (that need the same RDD) of an application in parallel?

-Karthk




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDDs-tp13343p13416.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

RE: RDDs

Posted by "Liu, Raymond" <ra...@intel.com>.

I'm not sure what you are referring to by "replicated RDD". If you simply mean RDDs, then yes, read the API docs and the paper as Tobias mentioned.
If you are actually focusing on the word "replicated", then that is for fault tolerance, and it is probably mostly used in the streaming case for receiver-created RDDs.

In Spark, an application is your user program, and a job is an internal scheduling concept: a group of RDD operations. Your application might invoke several jobs.
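
To illustrate the distinction, here is a minimal sketch of one application that invokes two jobs. It assumes a local-mode SparkContext; the names `ApplicationVsJob` and `run` and the sample data are hypothetical and only there for illustration. The application is everything that runs against the one SparkContext, and each action (here `count`) submits a separate job.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ApplicationVsJob {
  // Runs two actions, and therefore two jobs, within a single application.
  def run(sc: SparkContext): (Long, Long) = {
    val lines = sc.parallelize(Seq("spark is fast", "hadoop mapreduce", "spark rdd"))

    // Transformations are lazy: nothing is scheduled yet.
    val words = lines.flatMap(_.split(" "))
    val sparkLines = lines.filter(_.contains("spark"))

    // Each action submits its own job to the scheduler.
    val wordCount = words.count()           // first job
    val sparkLineCount = sparkLines.count() // second job
    (wordCount, sparkLineCount)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, purely for illustration.
    val conf = new SparkConf().setAppName("ApplicationVsJob").setMaster("local[2]")
    val sc = new SparkContext(conf) // the application starts here
    try {
      val (wordCount, sparkLineCount) = run(sc)
      println(s"words: $wordCount, lines mentioning spark: $sparkLineCount")
    } finally sc.stop() // and ends here
  }
}
```

Internally, each job is further split into stages and tasks by the scheduler, but from the program's point of view the unit that gets launched is the action.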


Best Regards,
Raymond Liu

From: rapelly kartheek [mailto:kartheek.mbms@gmail.com] 
Sent: Wednesday, September 03, 2014 5:03 PM
To: user@spark.apache.org
Subject: RDDs

Hi,
Can someone tell me what kind of operations can be performed on a replicated RDD? What are the use cases of a replicated RDD?
One basic doubt that has been bothering me for a long time: what is the difference between an application and a job in Spark parlance? I am confused because of the Hadoop jargon.
Thank you

Re: RDDs

Posted by Tobias Pfeiffer <tg...@preferred.jp>.

Hello,

On Wed, Sep 3, 2014 at 6:02 PM, rapelly kartheek <ka...@gmail.com>
wrote:
>
> Can someone tell me what kind of operations can be performed on a
> replicated RDD? What are the use cases of a replicated RDD?
>

I suggest you read

https://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds
as an introduction; it lists many of the transformations and output
operations you can use.
Personally, I also found it quite helpful to read the paper about RDDs:
  http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

> One basic doubt that has been bothering me for a long time: what is the
> difference between an application and a job in Spark parlance? I am
> confused because of the Hadoop jargon.
>

OK, someone else might answer that. I am confused myself by application,
job, task, stage, etc. ;-)

Tobias

Re: RDDs

Posted by "Kartheek.R" <ka...@gmail.com>.

Thank you yuanbosoft. 




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDDs-tp13343p13444.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.