You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@spark.apache.org by Sourav Chandra <so...@livestream.com> on 2014/02/14 02:49:24 UTC

Spark streaming questions

HI,

I have couple of questions:

1. While going through the spark-streaming code, I found out there is one
configuration in JobScheduler/Generator (spark.streaming.concurrentJobs)
which is set to 1. There is no documentation for this parameter. After
setting this to 1000 in driver program, our streaming application's
performance is improved.

What is this variable used for? Is it safe to use/tweak this parameter?

2. Can someone explain the usage of MapOutputTracker, BlockManager
component. I have gone through the youtube video of Matei about spark
internals but this was not covered in detail.

3. Can someone explain the usage of cache w.r.t spark streaming? For
example if we do stream.cache(), will the cache remain constatnt with all
the partitions of rdd present across the nodes for that stream, OR will it
be regularly updated as in while new batch is coming?

Thanks,
-- 

Sourav Chandra

Senior Software Engineer

· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·

sourav.chandra@livestream.com

o: +91 80 4121 8723

m: +91 988 699 3746

skype: sourav.chandra

Livestream

"Ajmera Summit", First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main, 3rd
Block, Koramangala Industrial Area,

Bangalore 560034

www.livestream.com

Re: Spark streaming questions

Posted by Tathagata Das <ta...@gmail.com>.

@sourav To answer your original questions

1. Repartition takes more time as it explicitly redistributes the data over
the whole cluster. It is a tradeoff between between load balancing and
latency. Regarading foreach stage doing shuffle, the naming of the stages
is a little confusing. A stage that is initiated by the RDD.foreach
operation will involve reading of outputs from the map outputs from the
previous stage. So there is shuffle read involved in that stage, as shown
by the numbers.

2. If you are using a broadcast variable like this

val bcast = sparkContext.broadcast(...)

dstream.map(x => {
     // use broadcast variable bcast
})

Then you can convert it to something like this.

var bcast = ...    // not a val

dstream.transform(rdd => {

        // update the bcast variable if it has not been updated for a long
time.
        if (currentTime - lastUpdateTime > threshold) {
               bcast = rdd.sparkContext.broadcast(...)
               lastUpdateTime = currentTime
        }

        rdd.map(x => {
              // use bcast
        })
})

3. MEMORY_AND_DISK_SER_2 is correct.

4. The list.appy() is probably for the stage that is writing the RDD
checkpoint.

5. What happens when you add more workers and enable repartition?

Btw RDD.foreach(saveToCassandra)   ... I am not sure how efficient this is.
Does the saveToCassandra() setup new connections to cassandra every time or
reuses connections across function calls? It does setup every time, then
calling this function ONCE for EVERY record to be pushed is a bad idea.
Consider using foreachPartition() where you can setup once per partition
and then push the whole partition.

TD




On Wed, Feb 19, 2014 at 10:47 AM, dachuan <hd...@gmail.com> wrote:

> I am curious about the scalability of spark streaming, too. the sosp 2013
> paper demonstrates the scalability by video app and mobile millennium job.
>
>
> On Wed, Feb 19, 2014 at 1:29 PM, Sourav Chandra <
> sourav.chandra@livestream.com> wrote:
>
>> Thanks Mayur for the response.
>>
>> While testing even though we are increasing number of workers the
>> performance did not improve. Is it because of only 1 NetworkReceiver? (We
>> are using KafkaDStream)
>>
>> How can we increase the throughput as spark streaming says the
>> performance should increase linearly as we add more nodes.
>>
>> Also I have some more open points asked along with snapshots regarding
>> Stages.
>>
>> Thanks,
>> Sourav
>>
>>
>> On Wed, Feb 19, 2014 at 11:50 PM, Mayur Rustagi <ma...@gmail.com>wrote:
>>
>>> You need heap if you are collect-ing a lot of data to save into
>>> cassandra, collect pulls all data to driver hence needs heap,
>>> Scheduler is quite fast already, not that much CPU dependent, but ya
>>> more CPU bring more love.
>>>
>>> Checkpointing files are cleaned up automatically
>>>
>>> Both master and worker are relying on same system then writing data on
>>> disk and fetching will impact , as both are relying on the same machine. I
>>> am not sure what do you mean by impact?
>>>
>>> You can inherit  the KafkaInputDStream and modify it, or create your
>>> own & if its helpful contribute it back.
>>>
>>>
>>>
>>> Mayur Rustagi
>>> Ph: +919632149971
>>> h <https://twitter.com/mayur_rustagi>ttp://www.sigmoidanalytics.com
>>> https://twitter.com/mayur_rustagi
>>>
>>>
>>>
>>> On Tue, Feb 18, 2014 at 10:08 PM, Sourav Chandra <
>>> sourav.chandra@livestream.com> wrote:
>>>
>>>> Thanks for the reply. In the driver program I am saving the values to
>>>> Cassandra so driver also should not have much heap/cpu. Please correct me
>>>> If I am wrong.
>>>>
>>>> As driver is the one from where scheduling is triggered shouldn't it be
>>>> having more CPU for faster scheduling/less scheduling delay?
>>>>
>>>> Still there are some open points/doubts to be clarified. Eagerly
>>>> waiting for the response
>>>>
>>>> Thanks,
>>>> Sourav
>>>>
>>>>
>>>> On Wed, Feb 19, 2014 at 9:40 AM, Andrew Ash <an...@andrewash.com>wrote:
>>>>
>>>>> In my experience, you don't need much horsepower on the master or
>>>>> worker nodes.  If you're bringing large data back to the driver (e.g. with
>>>>> .take or .collect) you can cause OOMs on the driver, so bump the heap if
>>>>> that's the case.  But the majority of your memory requirements will be in
>>>>> the executors, which are JVMs that the Worker spins up for each application
>>>>> (in the standalone mode cluster).
>>>>>
>>>>> Ajdrew
>>>>>
>>>>>
>>>>> On Tue, Feb 18, 2014 at 8:07 PM, Sourav Chandra <
>>>>> sourav.chandra@livestream.com> wrote:
>>>>>
>>>>>> Waiting for response :)
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 18, 2014 at 1:09 PM, Sourav Chandra <
>>>>>> sourav.chandra@livestream.com> wrote:
>>>>>>
>>>>>>> I have couple of questions below:
>>>>>>>
>>>>>>> 1. What is the memory,CPU requirement for Master and Worker and
>>>>>>> Driver process? As per my understanding it should not be any higher than
>>>>>>> what default settings is at least for Master and Worker. As Driver does the
>>>>>>> actual DAG scheduling and all it should be fast process?
>>>>>>> Please correct me if I am wrong.  Also let me know the system
>>>>>>> requirements for all the 3 process.
>>>>>>>
>>>>>>> 2. If we run worker and master on same node, Is over spilling of RDD
>>>>>>> to disk or memory usage harmful for master? As per my understanding it
>>>>>>> should not impact as master and worker does very little thing (at kleast
>>>>>>> what is seen from logs), It is the executor whose performance will be
>>>>>>> degraded? Please correct me if I am wrong
>>>>>>>
>>>>>>> 3.  I was going through KafkaInputDStream and found out its only
>>>>>>> writing kafka message and partitioning key into block generator not other
>>>>>>> info like partition,offset. Is there any way to incorporate these or do we
>>>>>>> have to create our own DSTream for this.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Sourav
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Feb 17, 2014 at 5:16 PM, Sourav Chandra <
>>>>>>> sourav.chandra@livestream.com> wrote:
>>>>>>>
>>>>>>>> One more question regarding check-pointing:
>>>>>>>>
>>>>>>>>  - What is the cleanup mechanism of checkpoint directory for
>>>>>>>> streaming application? Will older files be deleted automatically by spark?
>>>>>>>> Do we need to set up up scheduler task ? If so what is the strategy to
>>>>>>>> safely remove checkpoint files without disrupting ongoing process and disk
>>>>>>>> space?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Sourav
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Feb 17, 2014 at 3:38 PM, Sourav Chandra <
>>>>>>>> sourav.chandra@livestream.com> wrote:
>>>>>>>>
>>>>>>>>> I did not see any improvement if we set spark.streaming.blockInterval =
>>>>>>>>> 100 and it degrades if I use repartition as mentioned,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Feb 17, 2014 at 3:31 PM, Sourav Chandra <
>>>>>>>>> sourav.chandra@livestream.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi TD,
>>>>>>>>>>
>>>>>>>>>> Hope you have had a nice weekend.
>>>>>>>>>>
>>>>>>>>>> I am giving you a brief overview if what we are and we are trying
>>>>>>>>>> to achieve using spark streaming
>>>>>>>>>>
>>>>>>>>>> We are building a realtime analytics application using spark
>>>>>>>>>> streaming.
>>>>>>>>>> We are internet video broadcasting company and reltime analytics
>>>>>>>>>> should show no. of likes/comments/concuurent viewers per broadcast happened.
>>>>>>>>>>
>>>>>>>>>> Below is the overview of what we are doing:
>>>>>>>>>>
>>>>>>>>>> Spark properties :
>>>>>>>>>> - batch interval is set as 1 second
>>>>>>>>>> - spark.executor.memory = 10g
>>>>>>>>>> - spark.streaming.concurrentJobs = 1000
>>>>>>>>>> - spark.streaming.blockInterval = 100
>>>>>>>>>>
>>>>>>>>>> Create couple of broadcast variable to be used inside the Step 2
>>>>>>>>>> below
>>>>>>>>>>
>>>>>>>>>> 1. We are reading the analytics trigger messages from kafka using
>>>>>>>>>> kafkainputstream and then reparitioning as per your suggestion
>>>>>>>>>>    val kafkaStream = KafkaUtils.createStream(...).repartition(12)
>>>>>>>>>>
>>>>>>>>>> 2. Process the message read form kafka and generates a bunch of
>>>>>>>>>> related messages for analysis. In this step we use previously created
>>>>>>>>>> broadcast variables to get metadata about incoming message like - which
>>>>>>>>>> device it was generated, which country etc.
>>>>>>>>>>    val processedStream = kafkaStream.flatMap(...).map(s => (s,1))
>>>>>>>>>> // include count = 1 for each of generated message
>>>>>>>>>>
>>>>>>>>>> 3. Reducing the stream for last 1 second
>>>>>>>>>>    val reducedStream =
>>>>>>>>>> processedStream.reduceByKeyAndWindow((a:Int,b:Int) => a + b, Seconds(1),
>>>>>>>>>> Seconds(1), 12).checkpoint(Seconds(10))
>>>>>>>>>>
>>>>>>>>>> 4. Filtring out the above reducedStream to get 3 streams out of
>>>>>>>>>> it - second, muinute and hour resolutioned
>>>>>>>>>>    val secStream  = reducedStream.filter(_._1.resolution.label ==
>>>>>>>>>> "second")
>>>>>>>>>>    val minStream  = reducedStream.filter(_._1.resolution.label ==
>>>>>>>>>> "minute")
>>>>>>>>>>    val hourStream = reducedStream.filter(_._1.resolution.label ==
>>>>>>>>>> "hour")
>>>>>>>>>>
>>>>>>>>>> 5. Saving each of stream in cassandra in different tables (for
>>>>>>>>>> example secStream goes to sec table, minStream goes to min table and so on)
>>>>>>>>>>    secStream.foreachRDD(rdd => rdd.foreach(saveToCassandra()))
>>>>>>>>>>    ...
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Now couple of open points:
>>>>>>>>>>
>>>>>>>>>> 1. Once repartiotn is called I observed below things which I need
>>>>>>>>>> clarification about:
>>>>>>>>>>    - Why foreach requires has so many shuffle read/write now? It
>>>>>>>>>> takes 4-5 seconds more if i use repartition(20).cache() than earlier where
>>>>>>>>>> I did not use repartition though ican see combineByKey stage has 12 tasks.
>>>>>>>>>> If I use repartition only it takes almost 1.5 times more than no
>>>>>>>>>> repartiton.
>>>>>>>>>>
>>>>>>>>>> 2. How can we use broadcast variable? How can we
>>>>>>>>>> re-submit/re-create the variables. Can you give some example?
>>>>>>>>>>
>>>>>>>>>> 3. Still I can see the apply stage on List.scala. What could the
>>>>>>>>>> reason?
>>>>>>>>>>
>>>>>>>>>> 4. Regarding storage level as we are using kafka dstream it is
>>>>>>>>>> MEMORY_AND_DISK_SER_2 instead of MEMORY_ONLY_2 as per code. Can you confirm
>>>>>>>>>> this? I got bit cionfused as you jhad mentioned this is MEMORY_ONLY_2
>>>>>>>>>>
>>>>>>>>>> 5. Still there is no improvement in performance even thogugh I
>>>>>>>>>> start more worker process.
>>>>>>>>>>
>>>>>>>>>> I have attached all the relevant snapshots from stage ui for your
>>>>>>>>>> reference.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Sourav
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, Feb 15, 2014 at 3:55 PM, Tathagata Das <
>>>>>>>>>> tathagata.das1565@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Depends on how you are using the broadcast variables. Can you
>>>>>>>>>>> give a high level overview of what DStream operations you are using and
>>>>>>>>>>> where does the broadcast variable get used?
>>>>>>>>>>>
>>>>>>>>>>> TD
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Feb 14, 2014 at 7:22 PM, Sourav Chandra <
>>>>>>>>>>> sourav.chandra@livestream.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi TD,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks a lot for going through all the questions scatted across
>>>>>>>>>>>> the mails and answering each one of them. Much appreciated
>>>>>>>>>>>>
>>>>>>>>>>>> I will get back with more details of code, stage ui once I am
>>>>>>>>>>>> in office on Monday.
>>>>>>>>>>>>
>>>>>>>>>>>> BTW, if I re-broadcast i.e. creating broadcast variables again
>>>>>>>>>>>> in some timer thread will this be reflected in the closures passed inside
>>>>>>>>>>>> the transformations? As i read somewhere spark will do some closure cleanup
>>>>>>>>>>>> before actually sending them to other components?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Sourav
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Feb 15, 2014 at 5:31 AM, Tathagata Das <
>>>>>>>>>>>> tathagata.das1565@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Okay, thats a lots of mails to respond to! Let me try to do it
>>>>>>>>>>>>> point by point. I hope I cover all of the raised concerns.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. STAGE PARALLELISM: I was confused about the stages. Yes,
>>>>>>>>>>>>> increasing the number of reducers to 12 should increase the tasks for the
>>>>>>>>>>>>> stage marked as "foreach" (thats the reduce stage, bad naming). To increase
>>>>>>>>>>>>> the parallelism of the map stage, you can do two things
>>>>>>>>>>>>>   (i) First repartition the data to larger number of
>>>>>>>>>>>>> partitions and then apply rest of the computation. For example if you were
>>>>>>>>>>>>> doing kafkaStream.map(....).reduceByKeyAndWindow(....), you can do
>>>>>>>>>>>>> kafkaStream.repartition(20).map(...).reduceByKeyAndWindow(...).
>>>>>>>>>>>>>  (ii) You can also try setting the
>>>>>>>>>>>>> spark.streaming.blockInterval configuration. This configuration decides how
>>>>>>>>>>>>> many blocks of data is created with received data every second. Default is
>>>>>>>>>>>>> 200ms, so it makes 4-5 blocks per second. You can either increase the batch
>>>>>>>>>>>>> interval or reduce the block interval.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2. APPLY STAGE: I am not entirely sure what that stage is
>>>>>>>>>>>>> without looking at all Spark and Spark Streaming the operations that you
>>>>>>>>>>>>> are doing in your program. And a large snapshot of the stages UI.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 3. PERSIST LEVEL: DStream has two functions - persist(), which
>>>>>>>>>>>>> has the default StorageLevel of MEMORY_ONLY_SER, and
>>>>>>>>>>>>> persist(StorageLevel...... ) where you can specify the storage level. When
>>>>>>>>>>>>> you use StorageLevel.MEMORY_ONLY_SER or MEMORY_ONLY_SER_2 (that is without
>>>>>>>>>>>>> disk in it), it wont fall off to disk. It will just be lost. To fall of to
>>>>>>>>>>>>> disk you have to use MEMORY_AND_DISK_SER or MEMORY_AND_DISK_SER_2. Note
>>>>>>>>>>>>> that, SER = keep data serialized, good for GC behavior (see programming
>>>>>>>>>>>>> guide), and _2 = replicate twice.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 4. BROADCAST FAILURE:
>>>>>>>>>>>>> When the cleaner ttl is set, everything gets cleaned,
>>>>>>>>>>>>> including broadcast variables. Hence the file backing the broadcast
>>>>>>>>>>>>> variable is getting delete, and the tasks are failing. If you are using the
>>>>>>>>>>>>> same broadcast variable for all batches, it is probably a good idea to
>>>>>>>>>>>>> re-broadcast the data (thatis, create new broadcast variables with the
>>>>>>>>>>>>> necessary data) periodically. The period should obviously be less than the
>>>>>>>>>>>>> ttl.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 5. ACTIVE STAGES: Yes, 1000 means, it can run 1000 jobs in
>>>>>>>>>>>>> parallel. I am not sure what your usecase actually is that requires running
>>>>>>>>>>>>> 1000 jobs in parallel? Are you generating 1000 jobs EVERY batch? If you are
>>>>>>>>>>>>> generating N jobs every batch, then makes sense to have the concurrentJobs
>>>>>>>>>>>>> set to around N, maybe up to 2 * N.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 6: 30 failed: probably considers the multiple attempts for
>>>>>>>>>>>>> each failed tasks.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hope this helps.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> TD
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Feb 14, 2014 at 2:08 AM, Sourav Chandra <
>>>>>>>>>>>>> sourav.chandra@livestream.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi TD,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think the FileNotFound is due to spark.cleaner.ttl
>>>>>>>>>>>>>> parameter which is set to 3600 sec i.e. 1 hour. Thats why the temp metadata
>>>>>>>>>>>>>> files are deleted.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please correct me if I am wrong. Also If that is the case why
>>>>>>>>>>>>>> it did not download again and create the file? Is is because our
>>>>>>>>>>>>>> application is doing nothing i.e. no messages from kafka?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Will it be downloaded if application again start receiving
>>>>>>>>>>>>>> data?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Sourav
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Feb 14, 2014 at 2:55 PM, Sourav Chandra <
>>>>>>>>>>>>>> sourav.chandra@livestream.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi TD,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have kept running the streaming application for ~1hr
>>>>>>>>>>>>>>> though there is no messages present in Kafka , just to check the memory
>>>>>>>>>>>>>>> usage and all and then found out the stages have started failing (with
>>>>>>>>>>>>>>> exception java.io.FileNotFoundException
>>>>>>>>>>>>>>> (java.io.FileNotFoundException:
>>>>>>>>>>>>>>> http://10.10.127.230:57124/broadcast_1)) and there are 1000
>>>>>>>>>>>>>>> active stages
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Questions:
>>>>>>>>>>>>>>>  1. Why it suddenly started failing and not able to find
>>>>>>>>>>>>>>> broadcast _1 file? Is there any background cleanup causes this? How can we
>>>>>>>>>>>>>>> overcome this?
>>>>>>>>>>>>>>>  2. Is the 1000 actve stages are because of
>>>>>>>>>>>>>>> spark.streaming.concurrentJobs parameter?
>>>>>>>>>>>>>>>  3. Why these stages are in hanging state (the ui showing no
>>>>>>>>>>>>>>> tasks started)?
>>>>>>>>>>>>>>>      Shouldn't these also fail? what is the logic behind
>>>>>>>>>>>>>>> this?
>>>>>>>>>>>>>>>  4. Why taks:Succeed:Total in failed stages showing like
>>>>>>>>>>>>>>> (0/12)(30 failed)  I can understand it has total 12 tasks and none
>>>>>>>>>>>>>>> succeeded. From where its getting the 30 failed? Is it internal retry. If
>>>>>>>>>>>>>>> so why it is not same for all other failed stages/
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have attached the snapshots.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Feb 14, 2014 at 2:35 PM, Pankaj Mittal <
>>>>>>>>>>>>>>> pankaj.mittal@livestream.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi TD,
>>>>>>>>>>>>>>>> There is no persist method which accepts boolean. There is
>>>>>>>>>>>>>>>> only persist(MEMORY_LEVEL) or default persist.
>>>>>>>>>>>>>>>> I have a question, RDDs remain in cache for some remember
>>>>>>>>>>>>>>>> time which is initialised to slide duration, but is it possible to set this
>>>>>>>>>>>>>>>> to let's say an hour without changing slide duration ?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>> Pankaj
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Feb 14, 2014 at 7:50 AM, Tathagata Das <
>>>>>>>>>>>>>>>> tathagata.das1565@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Answers inline. Hope these answer your questions.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> TD
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, Feb 13, 2014 at 5:49 PM, Sourav Chandra <
>>>>>>>>>>>>>>>>> sourav.chandra@livestream.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> HI,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I have couple of questions:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 1. While going through the spark-streaming code, I found
>>>>>>>>>>>>>>>>>> out there is one configuration in JobScheduler/Generator
>>>>>>>>>>>>>>>>>> (spark.streaming.concurrentJobs) which is set to 1. There is no
>>>>>>>>>>>>>>>>>> documentation for this parameter. After setting this to 1000 in driver
>>>>>>>>>>>>>>>>>> program, our streaming application's performance is improved.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> That is a parameter that allows Spark Stremaing to launch
>>>>>>>>>>>>>>>>> multiple Spark jobs simultaneously. While it can improve the performance in
>>>>>>>>>>>>>>>>> many scenarios (as it has in your case), it can actually increase the
>>>>>>>>>>>>>>>>> processing time of each batch and increase end-to-end latency in certain
>>>>>>>>>>>>>>>>> scenarios. So it is something that needs to be used with caution. That
>>>>>>>>>>>>>>>>> said, we should have definitely exposed it in the documentation.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> What is this variable used for? Is it safe to use/tweak
>>>>>>>>>>>>>>>>>> this parameter?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 2. Can someone explain the usage of MapOutputTracker,
>>>>>>>>>>>>>>>>>> BlockManager component. I have gone through the youtube video of Matei
>>>>>>>>>>>>>>>>>> about spark internals but this was not covered in detail.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I am not sure if there is a detailed document anywhere
>>>>>>>>>>>>>>>>> that explains but I can give you a high level overview of the both.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> BlockManager is like a distributed key-value store for
>>>>>>>>>>>>>>>>> large blobs (called blocks) of data. It has a master-worker architecture
>>>>>>>>>>>>>>>>> (loosely it is like the HDFS file system) where the BlockManager at the
>>>>>>>>>>>>>>>>> workers store the data blocks and BlockManagerMaster stores the metadata
>>>>>>>>>>>>>>>>> for what blocks are stored where. All the cached RDD's partitions and
>>>>>>>>>>>>>>>>> shuffle data are stored and managed by the BlockManager. It also transfers
>>>>>>>>>>>>>>>>> the blocks between the workers as needed (shuffles etc all happen through
>>>>>>>>>>>>>>>>> the block manager). Specifically for spark streaming, the data received
>>>>>>>>>>>>>>>>> from outside is stored in the BlockManager of the worker nodes, and the IDs
>>>>>>>>>>>>>>>>> of the blocks are reported to the BlockManagerMaster.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> MapOutputTrackers is a simpler component that keeps track
>>>>>>>>>>>>>>>>> of the location of the output of the map stage, so that workers running the
>>>>>>>>>>>>>>>>> reduce stage knows which machines to pull the data from. That also has the
>>>>>>>>>>>>>>>>> master-worker component - master has the full knowledge of the mapoutput
>>>>>>>>>>>>>>>>> and the worker component on-demand pulls that knowledge from the master
>>>>>>>>>>>>>>>>> component when the reduce tasks are executed on the worker.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 3. Can someone explain the usage of cache w.r.t spark
>>>>>>>>>>>>>>>>>> streaming? For example if we do stream.cache(), will the cache remain
>>>>>>>>>>>>>>>>>> constant with all the partitions of RDDs present across the nodes for that
>>>>>>>>>>>>>>>>>> stream, OR will it be regularly updated as in while new batch is coming?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If you call DStream.persist (persist == cache = true),
>>>>>>>>>>>>>>>>> then all RDDs generated by the DStream will be persisted in the cache (in
>>>>>>>>>>>>>>>>> the BlockManager). As new RDDs are generated and persisted, old RDDs from
>>>>>>>>>>>>>>>>> the same DStream will fall out of memory. either by LRU or explicitly if
>>>>>>>>>>>>>>>>> spark.streaming.unpersist is set to true.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Sourav Chandra
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Senior Software Engineer
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
>>>>>>>>>>>>>>>>>> · · · ·
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> sourav.chandra@livestream.com
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> o: +91 80 4121 8723
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> m: +91 988 699 3746
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> skype: sourav.chandra
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Livestream
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> "Ajmera Summit", First Floor, #3/D, 68 Ward, 3rd Cross,
>>>>>>>>>>>>>>>>>> 7th C Main, 3rd Block, Koramangala Industrial Area,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Bangalore 560034
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> www.livestream.com
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Sourav Chandra
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Senior Software Engineer
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
>>>>>>>>>>>>>>> · · ·
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> sourav.chandra@livestream.com
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> o: +91 80 4121 8723
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> m: +91 988 699 3746
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> skype: sourav.chandra
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Livestream
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> "Ajmera Summit", First Floor, #3/D, 68 Ward, 3rd Cross, 7th
>>>>>>>>>>>>>>> C Main, 3rd Block, Koramangala Industrial Area,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Bangalore 560034
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> www.livestream.com
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Sourav Chandra
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Senior Software Engineer
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
>>>>>>>>>>>>>> · ·
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> sourav.chandra@livestream.com
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> o: +91 80 4121 8723
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> m: +91 988 699 3746
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> skype: sourav.chandra
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Livestream
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> "Ajmera Summit", First Floor, #3/D, 68 Ward, 3rd Cross, 7th C
>>>>>>>>>>>>>> Main, 3rd Block, Koramangala Industrial Area,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Bangalore 560034
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> www.livestream.com
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>>
>>>>>>>>>>>> Sourav Chandra
>>>>>>>>>>>>
>>>>>>>>>>>> Senior Software Engineer
>>>>>>>>>>>>
>>>>>>>>>>>> · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
>>>>>>>>>>>> ·
>>>>>>>>>>>>
>>>>>>>>>>>> sourav.chandra@livestream.com
>>>>>>>>>>>>
>>>>>>>>>>>> o: +91 80 4121 8723
>>>>>>>>>>>>
>>>>>>>>>>>> m: +91 988 699 3746
>>>>>>>>>>>>
>>>>>>>>>>>> skype: sourav.chandra
>>>>>>>>>>>>
>>>>>>>>>>>> Livestream
>>>>>>>>>>>>
>>>>>>>>>>>> "Ajmera Summit", First Floor, #3/D, 68 Ward, 3rd Cross, 7th C
>>>>>>>>>>>> Main, 3rd Block, Koramangala Industrial Area,
>>>>>>>>>>>>
>>>>>>>>>>>> Bangalore 560034
>>>>>>>>>>>>
>>>>>>>>>>>> www.livestream.com
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>>
>>>>>>>>>> Sourav Chandra
>>>>>>>>>>
>>>>>>>>>> Senior Software Engineer
>>>>>>>>>>
>>>>>>>>>> · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
>>>>>>>>>>
>>>>>>>>>> sourav.chandra@livestream.com
>>>>>>>>>>
>>>>>>>>>> o: +91 80 4121 8723
>>>>>>>>>>
>>>>>>>>>> m: +91 988 699 3746
>>>>>>>>>>
>>>>>>>>>> skype: sourav.chandra
>>>>>>>>>>
>>>>>>>>>> Livestream
>>>>>>>>>>
>>>>>>>>>> "Ajmera Summit", First Floor, #3/D, 68 Ward, 3rd Cross, 7th C
>>>>>>>>>> Main, 3rd Block, Koramangala Industrial Area,
>>>>>>>>>>
>>>>>>>>>> Bangalore 560034
>>>>>>>>>>
>>>>>>>>>> www.livestream.com
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> Sourav Chandra
>>>>>>>>>
>>>>>>>>> Senior Software Engineer
>>>>>>>>>
>>>>>>>>> · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
>>>>>>>>>
>>>>>>>>> sourav.chandra@livestream.com
>>>>>>>>>
>>>>>>>>> o: +91 80 4121 8723
>>>>>>>>>
>>>>>>>>> m: +91 988 699 3746
>>>>>>>>>
>>>>>>>>> skype: sourav.chandra
>>>>>>>>>
>>>>>>>>> Livestream
>>>>>>>>>
>>>>>>>>> "Ajmera Summit", First Floor, #3/D, 68 Ward, 3rd Cross, 7th C
>>>>>>>>> Main, 3rd Block, Koramangala Industrial Area,
>>>>>>>>>
>>>>>>>>> Bangalore 560034
>>>>>>>>>
>>>>>>>>> www.livestream.com
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> Sourav Chandra
>>>>>>>>
>>>>>>>> Senior Software Engineer
>>>>>>>>
>>>>>>>> · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
>>>>>>>>
>>>>>>>> sourav.chandra@livestream.com
>>>>>>>>
>>>>>>>> o: +91 80 4121 8723
>>>>>>>>
>>>>>>>> m: +91 988 699 3746
>>>>>>>>
>>>>>>>> skype: sourav.chandra
>>>>>>>>
>>>>>>>> Livestream
>>>>>>>>
>>>>>>>> "Ajmera Summit", First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main,
>>>>>>>> 3rd Block, Koramangala Industrial Area,
>>>>>>>>
>>>>>>>> Bangalore 560034
>>>>>>>>
>>>>>>>> www.livestream.com
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> Sourav Chandra
>>>>>>>
>>>>>>> Senior Software Engineer
>>>>>>>
>>>>>>> · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
>>>>>>>
>>>>>>> sourav.chandra@livestream.com
>>>>>>>
>>>>>>> o: +91 80 4121 8723
>>>>>>>
>>>>>>> m: +91 988 699 3746
>>>>>>>
>>>>>>> skype: sourav.chandra
>>>>>>>
>>>>>>> Livestream
>>>>>>>
>>>>>>> "Ajmera Summit", First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main,
>>>>>>> 3rd Block, Koramangala Industrial Area,
>>>>>>>
>>>>>>> Bangalore 560034
>>>>>>>
>>>>>>> www.livestream.com
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Sourav Chandra
>>>>>>
>>>>>> Senior Software Engineer
>>>>>>
>>>>>> · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
>>>>>>
>>>>>> sourav.chandra@livestream.com
>>>>>>
>>>>>> o: +91 80 4121 8723
>>>>>>
>>>>>> m: +91 988 699 3746
>>>>>>
>>>>>> skype: sourav.chandra
>>>>>>
>>>>>> Livestream
>>>>>>
>>>>>> "Ajmera Summit", First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main,
>>>>>> 3rd Block, Koramangala Industrial Area,
>>>>>>
>>>>>> Bangalore 560034
>>>>>>
>>>>>> www.livestream.com
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Sourav Chandra
>>>>
>>>> Senior Software Engineer
>>>>
>>>> · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
>>>>
>>>> sourav.chandra@livestream.com
>>>>
>>>> o: +91 80 4121 8723
>>>>
>>>> m: +91 988 699 3746
>>>>
>>>> skype: sourav.chandra
>>>>
>>>> Livestream
>>>>
>>>> "Ajmera Summit", First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main, 3rd
>>>> Block, Koramangala Industrial Area,
>>>>
>>>> Bangalore 560034
>>>>
>>>> www.livestream.com
>>>>
>>>
>>>
>>
>>
>> --
>>
>> Sourav Chandra
>>
>> Senior Software Engineer
>>
>> · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
>>
>> sourav.chandra@livestream.com
>>
>> o: +91 80 4121 8723
>>
>> m: +91 988 699 3746
>>
>> skype: sourav.chandra
>>
>> Livestream
>>
>> "Ajmera Summit", First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main, 3rd
>> Block, Koramangala Industrial Area,
>>
>> Bangalore 560034
>>
>> www.livestream.com
>>
>
>
>
> --
> Dachuan Huang
> Cellphone: 614-390-7234
> 2015 Neil Avenue
> Ohio State University
> Columbus, Ohio
> U.S.A.
> 43210
>