Posted to user@spark.apache.org by Saiph Kappa <sa...@gmail.com> on 2015/03/03 03:47:34 UTC

Re: throughput in the web console?

I performed repartitioning and everything went fine with respect to the
number of CPU cores being used (and the corresponding batch times). However,
I noticed something very strange: inside a map operation I was doing a very
simple calculation, always on the same dataset (small enough to be processed
entirely in a single batch); then I iterated over the RDDs and calculated the
mean with foreachRDD(rdd => println("MEAN: " + rdd.mean())). I noticed that
for different numbers of partitions (for instance, 4 and 8), the result of
the mean is different. Why does this happen?
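
(For context, a minimal, self-contained sketch of the pattern described
above; the input source and the per-record computation are made up for
illustration. For a fixed dataset, mean() is deterministic up to
floating-point summation order, which can vary slightly with the number of
partitions:)

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(
  new SparkConf().setAppName("MeanPerBatch"), Seconds(10))

// Hypothetical source; the actual dataset is not shown in the thread.
val lines = ssc.socketTextStream("localhost", 9999)

// Simple per-record computation, repartitioned, then the per-batch mean.
val values = lines.map(_.length.toDouble).repartition(8)
values.foreachRDD(rdd => println("MEAN: " + rdd.mean()))

ssc.start()
ssc.awaitTermination()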

On Thu, Feb 26, 2015 at 7:03 PM, Tathagata Das <td...@databricks.com> wrote:

> If you have one receiver and you are doing only map-like operations, then
> the processing will primarily happen on one machine. To use all the
> machines, either receive in parallel with multiple receivers, or spread out
> the computation by explicitly repartitioning the received streams
> (DStream.repartition) with sufficient partitions to load-balance across
> more machines.
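>
> For example, a minimal sketch of both options (host, port, and partition
> counts are made up for illustration):
>
> import org.apache.spark.SparkConf
> import org.apache.spark.streaming.{Seconds, StreamingContext}
>
> val ssc = new StreamingContext(
>   new SparkConf().setAppName("ParallelIngest"), Seconds(10))
>
> // Option 1: receive in parallel with multiple receivers, then union them.
> val streams = (1 to 4).map(_ => ssc.socketTextStream("stream-host", 9999))
> val unioned = ssc.union(streams)
>
> // Option 2: a single receiver, with the computation spread out by
> // explicitly repartitioning the received stream.
> val spread = ssc.socketTextStream("stream-host", 9999).repartition(16)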
>
> TD
>
> On Thu, Feb 26, 2015 at 9:52 AM, Saiph Kappa <sa...@gmail.com>
> wrote:
>
>> One more question: while processing the exact same batch, I noticed that
>> giving more CPUs to the worker does not decrease the duration of the
>> batch. I tried this with 4 and 8 CPUs. With only 1 CPU the duration did
>> increase, but apart from that the values were pretty similar, whether I
>> was using 4, 6, or 8 CPUs.
>>
>> On Thu, Feb 26, 2015 at 5:35 PM, Saiph Kappa <sa...@gmail.com>
>> wrote:
>>
>>> By setting spark.eventLog.enabled to true, it is possible to see the
>>> application UI after the application has finished its execution;
>>> however, the Streaming tab is no longer visible.
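>>>
>>> (For reference, event logging can be set e.g. in
>>> conf/spark-defaults.conf; the log directory below is an assumed example:)
>>>
>>> spark.eventLog.enabled  true
>>> # any HDFS or local directory readable by the history server
>>> spark.eventLog.dir      hdfs:///spark-events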
>>>
>>> For measuring the duration of batches in the code, I am doing something
>>> like this:
>>>
>>> wordCharValues.foreachRDD(rdd => {
>>>   val startTick = System.currentTimeMillis()
>>>   val result = rdd.take(1)  // force evaluation of the batch's RDD
>>>   val timeDiff = System.currentTimeMillis() - startTick
>>> })
>>>
>>> But my question is: is it possible to see the rate/throughput
>>> (records/sec) when I have a stream that processes log files appearing in
>>> a folder?
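>>>
>>> (One rough way to approximate it in code, assuming a known batch
>>> interval, is to count each batch and divide by the interval; the path is
>>> made up:)
>>>
>>> val batchSeconds = 10  // must match the StreamingContext batch interval
>>> val logs = ssc.textFileStream("/path/to/folder")
>>> logs.foreachRDD { rdd =>
>>>   println(s"rate: ${rdd.count().toDouble / batchSeconds} records/sec")
>>> }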
>>>
>>>
>>>
>>> On Thu, Feb 26, 2015 at 1:36 AM, Tathagata Das <td...@databricks.com>
>>> wrote:
>>>
>>>> Yes. # tuples processed in a batch = sum of all the tuples received by
>>>> all the receivers.
>>>>
>>>> In the screenshot, there was a batch with 69.9K records, and there was
>>>> a batch which took 1 s 473 ms. These two batches may or may not be the
>>>> same batch.
>>>>
>>>> TD
>>>>
>>>> On Wed, Feb 25, 2015 at 10:11 AM, Josh J <jo...@gmail.com> wrote:
>>>>
>>>>> If I'm using the Kafka receiver, can I assume the number of records
>>>>> processed in the batch is the sum of the number of records processed
>>>>> by the Kafka receivers?
>>>>>
>>>>> So in the attached screenshot, the maximum number of tuples processed
>>>>> in a batch is 42.7K + 27.2K = 69.9K, with a maximum processing time of
>>>>> 1 second 473 ms?
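>>>>>
>>>>> (For illustration, a sketch of two Kafka receivers whose per-batch
>>>>> record counts would add up in the UI; the ZooKeeper quorum, consumer
>>>>> group, and topic are made up:)
>>>>>
>>>>> import org.apache.spark.streaming.kafka.KafkaUtils
>>>>>
>>>>> val streams = (1 to 2).map { _ =>
>>>>>   KafkaUtils.createStream(ssc, "zk:2181", "my-group", Map("topic" -> 1))
>>>>> }
>>>>> val all = ssc.union(streams)  // counts from both receivers sum per batch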
>>>>>
>>>>> On Wed, Feb 25, 2015 at 8:48 AM, Akhil Das <akhil@sigmoidanalytics.com
>>>>> > wrote:
>>>>>
>>>>>> By throughput, do you mean the number of events processed, etc.?
>>>>>>
>>>>>> [image: Inline image 1]
>>>>>>
>>>>>> The Streaming tab already has these statistics.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks
>>>>>> Best Regards
>>>>>>
>>>>>> On Wed, Feb 25, 2015 at 9:59 PM, Josh J <jo...@gmail.com> wrote:
>>>>>>
>>>>>>>
>>>>>>> On Wed, Feb 25, 2015 at 7:54 AM, Akhil Das <
>>>>>>> akhil@sigmoidanalytics.com> wrote:
>>>>>>>
>>>>>>>> For Spark Streaming applications, there is already a tab called
>>>>>>>> "Streaming" which displays the basic statistics.
>>>>>>>
>>>>>>>
>>>>>>> Would I just need to extend this tab to add the throughput?
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>>> For additional commands, e-mail: user-help@spark.apache.org
>>>>>
>>>>
>>>>
>>>
>>
>

Re: throughput in the web console?

Posted by Saiph Kappa <sa...@gmail.com>.
Sorry I made a mistake. Please ignore my question.
