Posted to user@spark.apache.org by David Thomas <dt...@gmail.com> on 2014/02/19 03:33:31 UTC

Mutating RDD

Let's say I have an RDD of text files from HDFS. During the runtime, is it
possible to check for new files in a particular directory and if present,
add them to the existing RDD?

Re: Mutating RDD

Posted by Tathagata Das <ta...@gmail.com>.
To add to the discussion: Spark Streaming's text file stream automatically
detects new files and generates RDDs from them. For example, if you run
10-second batches, all new files (of the same format) created in the
directory during each interval are read into a per-interval RDD. You can
then do whatever you want with those RDDs.

var unionRDD = ...

streamingContext.textFileStream(<directory>).foreachRDD(rdd => {
     // do what you want with the RDD
     // if you want to keep unioning
     unionRDD = unionRDD.union(rdd)
})

However, note that repeatedly unioning RDDs can rapidly increase the number
of partitions in the unioned RDD, which may degrade performance. Consider
calling RDD.coalesce periodically to reduce the number of partitions.
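
As a rough sketch of the union-then-coalesce idea above (not a definitive
implementation): the object name UnionNewFiles, the helper addBatch, the
MaxPartitions threshold of 64, the local[2] master, and taking the directory
from args(0) are all assumptions made for illustration. Only textFileStream,
foreachRDD, union, and coalesce come from the discussion itself.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object UnionNewFiles {
  // Hypothetical threshold; tune it for your cluster and data volume.
  val MaxPartitions = 64

  // Union a new batch into the accumulated RDD. If the result has more
  // than MaxPartitions partitions, coalesce it back down (no shuffle).
  def addBatch(acc: RDD[String], batch: RDD[String]): RDD[String] = {
    val merged = acc.union(batch)
    if (merged.partitions.length > MaxPartitions) merged.coalesce(MaxPartitions)
    else merged
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption for trying this out on one machine.
    val conf = new SparkConf().setAppName("UnionNewFiles").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10)) // 10-second batches

    // Start from an empty RDD and grow it as new files arrive.
    var unionRDD: RDD[String] = ssc.sparkContext.emptyRDD[String]

    // args(0) is the directory to watch (assumed to be passed on the command line).
    ssc.textFileStream(args(0)).foreachRDD { rdd =>
      unionRDD = addBatch(unionRDD, rdd)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Checking the partition count on every batch is just one way to read
"periodically"; coalescing every N batches would work as well.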

TD


On Wed, Feb 19, 2014 at 5:44 AM, Ashish Rangole <ar...@gmail.com> wrote:

> You could also look at how the Spark Streaming DStream does what you
> described.
>
> Take a look at Spark StreamingContext.textFileStream implementation.
> On Feb 18, 2014 8:02 PM, "David Thomas" <dt...@gmail.com> wrote:
>
>> Perfect.
>>
>>
>> On Tue, Feb 18, 2014 at 7:58 PM, Mayur Rustagi <ma...@gmail.com> wrote:
>>
>>> RDD is immutable so modification of RDD is not possible, you can
>>> generate a new RDD unioning the two RDD created from new files and old
>>> in-memory RDD.
>>> Regards
>>> Mayur
>>>
>>> Mayur Rustagi
>>> Ph: +919632149971
>>> http://www.sigmoidanalytics.com
>>> https://twitter.com/mayur_rustagi
>>>
>>>
>>>
>>> On Tue, Feb 18, 2014 at 6:33 PM, David Thomas <dt...@gmail.com> wrote:
>>>
>>>> Let's say I have an RDD of text files from HDFS. During the runtime, is
>>>> it possible to check for new files in a particular directory and if
>>>> present, add them to the existing RDD?
>>>>
>>>
>>>
>>

Re: Mutating RDD

Posted by Ashish Rangole <ar...@gmail.com>.
You could also look at how the Spark Streaming DStream does what you
described.

Take a look at Spark StreamingContext.textFileStream implementation.

Re: Mutating RDD

Posted by David Thomas <dt...@gmail.com>.
Perfect.



Re: Mutating RDD

Posted by Mayur Rustagi <ma...@gmail.com>.
An RDD is immutable, so you cannot modify it in place. Instead, you can
generate a new RDD by unioning the RDD created from the new files with the
old in-memory RDD.
Regards
Mayur

Mayur Rustagi
Ph: +919632149971
http://www.sigmoidanalytics.com
https://twitter.com/mayur_rustagi
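
A minimal sketch of the batch-mode union described in this reply, assuming
the caller already knows which paths are new. How new files get detected is
not covered here. The names UnionOldAndNew and withNewFiles, the local[2]
master, and the command-line argument layout are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object UnionOldAndNew {
  // Returns a new RDD containing the existing lines plus the lines of the
  // new files. Neither input is modified, since RDDs are immutable.
  def withNewFiles(sc: SparkContext,
                   existing: RDD[String],
                   newPaths: Seq[String]): RDD[String] =
    if (newPaths.isEmpty) existing
    else existing.union(sc.textFile(newPaths.mkString(","))) // comma-separated path list

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption for trying this out on one machine.
    val sc = new SparkContext(
      new SparkConf().setAppName("UnionOldAndNew").setMaster("local[2]"))

    // args(0): the original input path; remaining args: newly found files.
    val existing = sc.textFile(args(0)).cache()
    val updated = withNewFiles(sc, existing, args.drop(1).toSeq)

    println(s"lines before: ${existing.count()}, after: ${updated.count()}")
    sc.stop()
  }
}
```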


