Posted to user@spark.apache.org by Rohit Karlupia <ro...@qubole.com> on 2018/03/22 03:36:39 UTC

Open sourcing Sparklens: Qubole's Spark Tuning Tool

Hi,

Happy to announce the availability of Sparklens as an open source project. It
helps in understanding the scalability limits of Spark applications and
can be a useful guide on the path towards tuning applications for lower
runtime or cost.

Please clone from here: https://github.com/qubole/sparklens
Old blogpost: https://www.qubole.com/blog/introducing-quboles-spark-tuning-tool/

thanks,
rohitk

PS: Thanks for your patience. It took a couple of months to get back on this.

Re: Open sourcing Sparklens: Qubole's Spark Tuning Tool

Posted by Rohit Karlupia <ro...@qubole.com>.

Let me be more specific:

With GC/CPU aware task scheduling, the user doesn't have to worry about
specifying cores carefully. Even if the user always specifies cores = 100 or
1024 for every executor, they will still not get an OOM (in the vast majority
of cases). Internally, the scheduler will vary the number of tasks assigned to
each executor, ensuring that the executor doesn't run into GC cycles or cause
useless context switches. In short, as long as users configure cores per
executor on the higher side, it will be harmless in general and can
actually help increase the throughput of the system by utilising
memory or CPU capacity that would otherwise go unused.

*For example:* let's say we are using a 64GB machine with 8 cores, running
one big 54GB executor with 8 cores. This results in about 7GB of memory
per task on average (54 / 8). It is possible that some tasks take more
than 7GB and some take less. Consider a case where one task takes
34GB of memory. Such a stage is very likely to fail if the other 7 tasks
scheduled at the same time need more than 20GB of memory in total (54
- 34). The usual approach to solving this problem without changing the
application would be to sacrifice cores and increase memory per core. The
stable configuration in this case could be 2 cores for the 54GB executor, which
results in wasting 6 cores "throughout" the application.
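
As a rough sketch of the arithmetic in this example (the helper names are
made up for illustration and are not part of Spark or Sparklens):

```python
# Hypothetical helpers for illustration only; not part of Spark or Sparklens.

def memory_per_task_gb(executor_memory_gb: float, executor_cores: int) -> float:
    """Average memory available to each task when every core runs a task."""
    return executor_memory_gb / executor_cores


def idle_cores(machine_cores: int, executor_cores: int) -> int:
    """Cores on the machine that the executor is never allowed to use."""
    return machine_cores - executor_cores


if __name__ == "__main__":
    # 54GB executor with 8 cores: 6.75GB (roughly 7GB) per task on average.
    print(memory_per_task_gb(54, 8))
    # One task needs 34GB, leaving 54 - 34 = 20GB for the other 7 tasks.
    print(54 - 34)
    # Dropping to 2 cores gives 27GB per task but idles 6 of the 8 cores.
    print(memory_per_task_gb(54, 2), idle_cores(8, 2))
```

The point of the numbers: the static fix buys headroom for the one 34GB task
by giving up 6 cores for the entire run, even in stages that have no skew.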

With GC/CPU aware task scheduling one can configure the same executor with,
say, 64 cores and the application is very likely to succeed. Being aware of
GC, the scheduler will stop scheduling tasks on the executor, making it
possible for the running task to consume all 54GB of memory. This ensures
that we only "sacrifice" cores when necessary, not in general and not
for the whole duration of the application. On the other hand, if the
scheduler finds that in spite of running 8 concurrent tasks we still
have memory and CPU to spare, it will schedule more tasks, up to 64 as
configured. So we not only get stability against skew but also
higher throughput when possible.
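
To make the idea concrete, here is a toy sketch of the kind of admission
decision described above. It is not the actual scheduler (those details have
not been posted in this thread); the class, the metric names and the
thresholds are all assumptions chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class ExecutorState:
    """Hypothetical snapshot of one executor (all fields are assumptions)."""
    configured_cores: int    # upper bound on concurrent tasks, e.g. 64
    running_tasks: int       # tasks currently running on the executor
    gc_time_fraction: float  # share of recent wall-clock time spent in GC
    cpu_utilization: float   # 0.0 to 1.0


def can_accept_task(state: ExecutorState,
                    max_gc_fraction: float = 0.10,
                    max_cpu: float = 0.90) -> bool:
    """Decide whether one more task may be scheduled on this executor."""
    if state.running_tasks >= state.configured_cores:
        return False  # never exceed the configured cores
    if state.running_tasks == 0:
        return True   # an idle executor always gets work, so progress is made
    # Back off while the executor shows memory (GC) or CPU pressure.
    return (state.gc_time_fraction < max_gc_fraction
            and state.cpu_utilization < max_cpu)


if __name__ == "__main__":
    # One skewed task is hogging memory: hold back further tasks.
    print(can_accept_task(ExecutorState(64, 1, 0.35, 0.20)))
    # 8 tasks running with memory and CPU to spare: schedule more.
    print(can_accept_task(ExecutorState(64, 8, 0.02, 0.40)))
```

With these made-up thresholds the first call returns False and the second
returns True, which mirrors the two cases above: cores are given up only
while there is pressure, and extra tasks are admitted when there is slack.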

Hope that helps.

thanks,
rohitk

On Tue, Mar 27, 2018 at 9:20 AM, Fawze Abujaber <fa...@gmail.com> wrote:

> Thanks for the update.
>
> What about cores per executor?
>
> On Tue, 27 Mar 2018 at 6:45 Rohit Karlupia <ro...@qubole.com> wrote:
>
>> Thanks Fawze!
>>
>> On the memory front, I am currently working on GC and CPU aware task
>> scheduling. I see wonderful results based on my tests so far. Once the
>> feature is complete and available, Spark will work with whatever memory is
>> provided (as long as it is enough for the largest possible task). It will
>> also allow you to run, say, 64 concurrent tasks on an 8 core machine, if the
>> nature of the tasks doesn't lead to memory or CPU contention. Essentially, why
>> worry about tuning memory when you can let Spark take care of it automatically
>> based on memory pressure? Will post details when we are ready. So yes, we
>> are working on memory, but it will not be a tool but a transparent feature.
>>
>> thanks,
>> rohitk
>>
>>
>>
>>
>> On Tue, Mar 27, 2018 at 7:53 AM, Fawze Abujaber <fa...@gmail.com>
>> wrote:
>>
>>> Hi Rohit,
>>>
>>> I would like to thank you for the unlimited patience and support that
>>> you are providing here and behind the scenes for all of us.
>>>
>>> The tool is amazing and easy to use, and most of the metrics are easy to
>>> understand ...
>>>
>>> As for whether we need to run it in cluster mode and all the time, I think
>>> we can skip that, as one or a few runs can give you the big picture of how
>>> the job is running with different configurations, and it's not too
>>> complicated to run it using spark-submit.
>>>
>>> I think it would be very helpful if Sparklens could also show how the
>>> job runs with different configurations of cores and memory. A Spark job
>>> with 1 executor and 1 core will run differently from a Spark job with 1
>>> executor and 3 cores, and the same surely holds for different executor memory.
>>>
>>> Overall, it is a very good starting point, but it will be a GAME CHANGER
>>> to get these metrics in the tool.
>>>
>>> @Rohit, huge THANK YOU
>>>
>>> On Mon, Mar 26, 2018 at 1:35 PM, Rohit Karlupia <ro...@qubole.com>
>>> wrote:
>>>
>>>> Hi Shmuel,
>>>>
>>>> In general it is hard to pinpoint the exact code which is responsible
>>>> for a specific stage. For example, when using Spark SQL, depending upon the
>>>> kind of joins and aggregations used in a single line of query, we will
>>>> have multiple stages in the Spark application. I usually try to split the
>>>> code into smaller chunks and also use the Spark UI, which has a special
>>>> section for SQL. It can also show specific backtraces, but as I explained
>>>> earlier they might not be very helpful. Sparklens does help you ask the
>>>> right questions, but is not mature enough to answer all of them.
>>>>
>>>> Understanding the report:
>>>>
>>>> *1) The first part shows the total aggregate metrics for the application.*
>>>>
>>>> Printing application meterics.....
>>>>
>>>>  AggregateMetrics (Application Metrics) total measurements 1869
>>>>                 NAME                        SUM                MIN           MAX                MEAN
>>>>  diskBytesSpilled                            0.0 KB         0.0 KB         0.0 KB              0.0 KB
>>>>  executorRuntime                            15.1 hh         3.0 ms         4.0 mm             29.1 ss
>>>>  inputBytesRead                             26.1 GB         0.0 KB        43.8 MB             14.3 MB
>>>>  jvmGCTime                                  11.0 mm         0.0 ms         2.1 ss            354.0 ms
>>>>  memoryBytesSpilled                        314.2 GB         0.0 KB         1.1 GB            172.1 MB
>>>>  outputBytesWritten                          0.0 KB         0.0 KB         0.0 KB              0.0 KB
>>>>  peakExecutionMemory                         0.0 KB         0.0 KB         0.0 KB              0.0 KB
>>>>  resultSize                                 12.9 MB         2.0 KB        40.9 KB              7.1 KB
>>>>  shuffleReadBytesRead                      107.7 GB         0.0 KB       276.0 MB             59.0 MB
>>>>  shuffleReadFetchWaitTime                    2.0 ms         0.0 ms         0.0 ms              0.0 ms
>>>>  shuffleReadLocalBlocks                       2,318              0             68                   1
>>>>  shuffleReadRecordsRead               3,413,511,099              0      8,251,926           1,826,383
>>>>  shuffleReadRemoteBlocks                    291,126              0            824                 155
>>>>  shuffleWriteBytesWritten                  107.6 GB         0.0 KB       257.6 MB             58.9 MB
>>>>  shuffleWriteRecordsWritten           3,408,133,175              0      7,959,055           1,823,506
>>>>  shuffleWriteTime                            8.7 mm         0.0 ms         1.8 ss            278.2 ms
>>>>  taskDuration                               15.4 hh        12.0 ms         4.1 mm             29.7 ss
>>>>
>>>>
>>>> *2) Here we show the number of hosts used and executors per host. I have seen users set executor memory to 33GB on a 64GB machine: a direct waste of 31GB of memory.*
>>>>
>>>> Total Hosts 135
>>>>
>>>>
>>>> Host server86.cluster.com startTime 02:26:21:081 executors count 3
>>>> Host server164.cluster.com startTime 02:30:12:204 executors count 1
>>>> Host server28.cluster.com startTime 02:31:09:023 executors count 1
>>>> Host server78.cluster.com startTime 02:26:08:844 executors count 5
>>>> Host server124.cluster.com startTime 02:26:10:523 executors count 3
>>>> Host server100.cluster.com startTime 02:30:24:073 executors count 1
>>>> Done printing host timeline
>>>> *3) Times at which executors were added. Not all executors are available at the start of the application.*
>>>>
>>>> Printing executors timeline....
>>>> Total Hosts 135
>>>> Total Executors 250
>>>> At 02:26 executors added 52 & removed  0 currently available 52
>>>> At 02:27 executors added 10 & removed  0 currently available 62
>>>> At 02:28 executors added 13 & removed  0 currently available 75
>>>> At 02:29 executors added 81 & removed  0 currently available 156
>>>> At 02:30 executors added 48 & removed  0 currently available 204
>>>> At 02:31 executors added 45 & removed  0 currently available 249
>>>> At 02:32 executors added 1 & removed  0 currently available 250
>>>>
>>>>
>>>> *4) How the stages within the jobs were scheduled. Helps you understand which stages ran in parallel and which are dependent on others.*
>>>>
>>>> Printing Application timeline
>>>> 02:26:47:654      Stage 3 ended : maxTaskTime 3117 taskCount 1
>>>> 02:26:47:708      Stage 4 started : duration 00m 02s
>>>> 02:26:49:898      Stage 4 ended : maxTaskTime 226 taskCount 200
>>>> 02:26:49:901 JOB 3 ended
>>>> 02:26:56:234 JOB 4 started : duration 08m 28s
>>>> [      5 |||||||                                                                         ]
>>>> [      6  |||||||||||||||||||                                                            ]
>>>> [      9                   ||||||||                                                      ]
>>>> [     10     ||||||||||||||                                                              ]
>>>> [     11                                                                                 ]
>>>> [     12                     ||                                                          ]
>>>> [     13                       ||||                                                      ]
>>>> [     14                           |||||||||||||||                                       ]
>>>> [     15                                          |||||||||||||||||||||||||||||||||||||| ]
>>>> 02:26:58:095      Stage 5 started : duration 00m 44s
>>>> 02:27:42:816      Stage 5 ended : maxTaskTime 37214 taskCount 23
>>>> 02:27:03:478      Stage 6 started : duration 02m 04s
>>>> 02:29:07:517      Stage 6 ended : maxTaskTime 35578 taskCount 601
>>>> 02:28:56:449      Stage 9 started : duration 00m 46s
>>>> 02:29:42:625      Stage 9 ended : maxTaskTime 7196 taskCount 200
>>>> 02:27:22:343      Stage 10 started : duration 01m 33s
>>>> 02:28:56:333      Stage 10 ended : maxTaskTime 49203 taskCount 39
>>>> 02:27:23:910      Stage 11 started : duration 00m 00s
>>>> 02:27:24:422      Stage 11 ended : maxTaskTime 298 taskCount 2
>>>> 02:29:06:902      Stage 12 started : duration 00m 12s
>>>> 02:29:19:350      Stage 12 ended : maxTaskTime 11511 taskCount 200
>>>> 02:29:19:413      Stage 13 started : duration 00m 25s
>>>> 02:29:44:444      Stage 13 ended : maxTaskTime 24924 taskCount 200
>>>> 02:29:44:491      Stage 14 started : duration 01m 36s
>>>> 02:31:20:873      Stage 14 ended : maxTaskTime 86194 taskCount 200
>>>> 02:31:20:973      Stage 15 started : duration 04m 03s
>>>> 02:35:24:346      Stage 15 ended : maxTaskTime 238747 taskCount 200
>>>> 02:35:24:347 JOB 4 ended
>>>> 02:35:28:841 app ended
>>>> *5) I guess these metrics are well explained.*
>>>>
>>>>
>>>>  Time spent in Driver vs Executors
>>>>  Driver WallClock Time    01m 02s   10.66%
>>>>  Executor WallClock Time  08m 43s   89.34%
>>>>  Total WallClock Time     09m 46s
>>>>
>>>>
>>>>
>>>> Minimum possible time for the app based on the critical path (with infinite resources)   07m 59s
>>>> Minimum possible time for the app with same executors, perfect parallelism and zero skew 02m 15s
>>>> If we were to run this app with single executor and single core                          15h 08m
>>>>
>>>>
>>>>  Total cores available to the app 750
>>>>
>>>>  OneCoreComputeHours: Measure of total compute power available from cluster. One core in the executor, running
>>>>                       for one hour, counts as one OneCoreComputeHour. Executors with 4 cores, will have 4 times
>>>>                       the OneCoreComputeHours compared to one with just one core. Similarly, one core executor
>>>>                       running for 4 hours will OnCoreComputeHours equal to 4 core executor running for 1 hour.
>>>>
>>>>  Driver Utilization (Cluster idle because of driver)
>>>>
>>>>  Total OneCoreComputeHours available                            122h 07m
>>>>  Total OneCoreComputeHours available (AutoScale Aware)           77h 25m
>>>>  OneCoreComputeHours wasted by driver                            13h 01m
>>>>
>>>>  AutoScale Aware: Most of the calculations by this tool will assume that all executors are available throughout
>>>>                   the runtime of the application. The number above is printed to show possible caution to be
>>>>                   taken in interpreting the efficiency metrics.
>>>>
>>>>  Cluster Utilization (Executors idle because of lack of tasks or skew)
>>>>
>>>>  Executor OneCoreComputeHours available                 109h 06m
>>>>  Executor OneCoreComputeHours used                       15h 07m        13.86%
>>>>  OneCoreComputeHours wasted                              93h 59m        86.14%
>>>>
>>>>  App Level Wastage Metrics (Driver + Executor)
>>>>
>>>>  OneCoreComputeHours wasted Driver               10.66%
>>>>  OneCoreComputeHours wasted Executor             76.96%
>>>>  OneCoreComputeHours wasted Total                87.62%
>>>>
>>>>
>>>>
>>>> *6) Here we use simulation to estimate how the application wall clock time will vary as we change the number of executors. The goal is to run the application at 100% cluster utilization and minimum time. Look for ROI in terms of wall clock time due to additional executors. Also, if the application is not scaling, this is a good time to revisit the application and look for why it is not scaling.*
>>>>
>>>>  App completion time and cluster utilization estimates with different executor counts
>>>>
>>>>  Real App Duration 09m 46s
>>>>  Model Estimation  08m 01s
>>>>  Model Error       17%
>>>>
>>>>  NOTE: 1) Model error could be large when auto-scaling is enabled.
>>>>        2) Model doesn't handles multiple jobs run via thread-pool. For better insights into
>>>>           application scalability, please try such jobs one by one without thread-pool.
>>>>
>>>>
>>>>  Executor count    25  ( 10%) estimated time 17m 07s and estimated cluster utilization 70.61%
>>>>  Executor count    50  ( 20%) estimated time 12m 15s and estimated cluster utilization 49.34%
>>>>  Executor count   125  ( 50%) estimated time 08m 25s and estimated cluster utilization 28.72%
>>>>  Executor count   200  ( 80%) estimated time 08m 15s and estimated cluster utilization 18.29%
>>>>  Executor count   250  (100%) estimated time 08m 01s and estimated cluster utilization 15.06%
>>>>  Executor count   275  (110%) estimated time 08m 00s and estimated cluster utilization 13.72%
>>>>  Executor count   300  (120%) estimated time 07m 59s and estimated cluster utilization 12.61%
>>>>  Executor count   375  (150%) estimated time 07m 59s and estimated cluster utilization 10.09%
>>>>  Executor count   500  (200%) estimated time 07m 59s and estimated cluster utilization 7.57%
>>>>  Executor count   750  (300%) estimated time 07m 59s and estimated cluster utilization 5.04%
>>>>  Executor count  1000  (400%) estimated time 07m 59s and estimated cluster utilization 3.78%
>>>>  Executor count  1250  (500%) estimated time 07m 59s and estimated cluster utilization 3.03%
>>>>
>>>> *7) These two sections are for finding out which stages are taking most of the wall-clock time and why. It is either not enough parallelism or skew. Parallelism is easier to fix. Fixing skew will require changing the application in a way that creates more uniform tasks.*
>>>> Total tasks in all stages 1869
>>>> Per Stage  Utilization
>>>> Stage-ID   Wall    Task      Task     IO%    Input     Output    ----Shuffle-----    -WallClockTime-    --OneCoreComputeHours---   MaxTaskMem
>>>>           Clock%  Runtime%   Count                               Input  |  Output    Measured | Ideal   Available| Used%|Wasted%
>>>>        0    0.00    0.00         1    0.0   64.0 KB    0.0 KB    0.0 KB    0.0 KB    00m 02s   00m 00s    00h 27m    0.0  100.0    0.0 KB
>>>>        1    0.00    0.00         1    0.0   64.0 KB    0.0 KB    0.0 KB    0.0 KB    00m 02s   00m 00s    00h 30m    0.1   99.9    0.0 KB
>>>>        2    0.00    0.00         1    0.0   90.0 KB    0.0 KB    0.0 KB    0.0 KB    00m 03s   00m 00s    00h 37m    0.1   99.9    0.0 KB
>>>>        3    0.00    0.01         1    0.0  867.1 KB    0.0 KB    0.0 KB  148.4 KB    00m 04s   00m 00s    01h 01m    0.1   99.9    0.0 KB
>>>>        4    0.00    0.00       200    0.0    0.0 KB    0.0 KB  148.4 KB    0.0 KB    00m 02s   00m 00s    00h 27m    0.1   99.9    0.0 KB
>>>>        5    6.00    1.15        23    0.2  402.1 MB    0.0 KB    0.0 KB    1.3 GB    00m 44s   00m 00s    09h 19m    1.9   98.1    0.0 KB
>>>>        6   17.00   19.92       601    7.1   17.2 GB    0.0 KB    0.0 KB    1.8 GB    02m 04s   00m 14s    25h 50m   11.7   88.3    0.0 KB
>>>>        9    6.00    0.73       200    2.9    6.9 GB    0.0 KB  409.5 MB    2.8 GB    00m 46s   00m 00s    09h 37m    1.2   98.8    0.0 KB
>>>>       10   13.00    2.27        39    0.3  807.8 MB    0.0 KB    0.0 KB    2.5 GB    01m 33s   00m 01s    19h 34m    1.7   98.3    0.0 KB
>>>>       11    0.00    0.00         2    0.0   31.5 KB    0.0 KB    0.0 KB   60.0 KB    00m 00s   00m 00s    00h 06m    0.1   99.9    0.0 KB
>>>>       12    1.00    2.15       200    0.3  758.7 MB    0.0 KB    2.3 GB    1.5 GB    00m 12s   00m 01s    02h 35m   12.6   87.4    0.0 KB
>>>>       13    3.00    5.91       200    0.0    0.0 KB    0.0 KB    1.5 GB   47.5 GB    00m 25s   00m 04s    05h 12m   17.1   82.9    0.0 KB
>>>>       14   13.00   19.83       200    0.0    0.0 KB    0.0 KB   50.3 GB   50.3 GB    01m 36s   00m 14s    20h 04m   14.9   85.1    0.0 KB
>>>>       15   34.00   48.02       200    0.0    0.0 KB    0.0 KB   53.2 GB    0.0 KB    04m 03s   00m 34s    50h 42m   14.3   85.7    0.0 KB
>>>>
>>>>
>>>>  Stage-ID WallClock  OneCore       Task   PRatio    -----Task------   OIRatio  |* ShuffleWrite% ReadFetch%   GC%  *|
>>>>           Stage%     ComputeHours  Count            Skew   StageSkew
>>>>       0    0.32         00h 00m       1    0.00     1.00     0.37     0.00     |*   0.00           0.00    15.10  *|
>>>>       1    0.35         00h 00m       1    0.00     1.00     0.38     0.00     |*   0.00           0.00    15.56  *|
>>>>       2    0.43         00h 00m       1    0.00     1.00     0.45     0.00     |*   0.00           0.00     8.88  *|
>>>>       3    0.70         00h 00m       1    0.00     1.00     0.63     0.17     |*   4.51           0.00     6.74  *|
>>>>       4    0.31         00h 00m     200    0.27    37.67     0.10     0.00     |*   0.00           0.04    23.79  *|
>>>>       5    6.38         00h 10m      23    0.03     1.42     0.83     3.18     |*   1.08           0.00     2.72  *|
>>>>       6   17.68         03h 00m     601    0.80     2.07     0.29     0.10     |*   0.60           0.00     1.90  *|
>>>>       9    6.58         00h 06m     200    0.27     5.20     0.16     0.38     |*   4.74          13.24     4.04  *|
>>>>      10   13.40         00h 20m      39    0.05     1.67     0.52     3.17     |*   1.10           0.00     1.96  *|
>>>>      11    0.07         00h 00m       2    0.00     1.00     0.58     1.91     |*  13.59           0.00     0.00  *|
>>>>      12    1.77         00h 19m     200    0.27     1.99     0.92     0.50     |*   1.85          19.63     3.09  *|
>>>>      13    3.57         00h 53m     200    0.27     1.59     1.00    31.42     |*   6.06          12.25     1.33  *|
>>>>      14   13.74         02h 59m     200    0.27     1.65     0.89     1.00     |*   1.84           2.38     0.83  *|
>>>>      15   34.69         07h 15m     200    0.27     1.88     0.98     0.00     |*   0.00           4.21     0.88  *|
>>>>
>>>> PRatio:        Number of tasks in stage divided by number of cores. Represents degree of
>>>>                parallelism in the stage
>>>> TaskSkew:      Duration of largest task in stage divided by duration of median task.
>>>>                Represents degree of skew in the stage
>>>> TaskStageSkew: Duration of largest task in stage divided by total duration of the stage.
>>>>                Represents the impact of the largest task on stage time.
>>>> OIRatio:       Output to input ration. Total output of the stage (results + shuffle write)
>>>>                divided by total input (input data + shuffle read)
>>>>
>>>> These metrics below represent distribution of time within the stage
>>>>
>>>> ShuffleWrite:  Amount of time spent in shuffle writes across all tasks in the given
>>>>                stage as a percentage
>>>> ReadFetch:     Amount of time spent in shuffle read across all tasks in the given
>>>>                stage as a percentage
>>>> GC:            Amount of time spent in GC across all tasks in the given stage as a
>>>>                percentage
>>>>
>>>> If the stage contributes large percentage to overall application time, we could look into
>>>> these metrics to check which part (Shuffle write, read fetch or GC is responsible)
>>>>
>>>> thanks,
>>>>
>>>> rohitk
>>>>
>>>>
>>>>
>>>> On Mon, Mar 26, 2018 at 1:38 AM, Shmuel Blitz <
>>>> shmuel.blitz@similarweb.com> wrote:
>>>>
>>>>> Hi Rohit,
>>>>>
>>>>> Thanks for the analysis.
>>>>>
>>>>> I can use repartition on the slow task. But how can I tell what part
>>>>> of the code is in charge of the slow tasks?
>>>>>
>>>>> It would be great if you could further explain the rest of the output.
>>>>>
>>>>> Thanks in advance,
>>>>> Shmuel
>>>>>
>>>>> On Sun, Mar 25, 2018 at 12:46 PM, Rohit Karlupia <ro...@qubole.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks Shmuel for trying out Sparklens!
>>>>>>
>>>>>> A couple of things that I noticed:
>>>>>> 1) 250 executors is probably overkill for this job. It would run in
>>>>>> the same time with around 100.
>>>>>> 2) Many of the stages that take a long time have only 200 tasks, whereas
>>>>>> we have 750 cores available for the job. 200 is the default value for
>>>>>> spark.sql.shuffle.partitions. Alternatively, you could try
>>>>>> increasing the value of spark.sql.shuffle.partitions to at least 750.
>>>>>>
>>>>>> thanks,
>>>>>> rohitk
>>>>>>
>>>>>> On Sun, Mar 25, 2018 at 1:25 PM, Shmuel Blitz <
>>>>>> shmuel.blitz@similarweb.com> wrote:
>>>>>>
>>>>>>> I ran it on a single job.
>>>>>>> SparkLens has an overhead on the job duration. I'm not ready to
>>>>>>> enable it by default on all our jobs.
>>>>>>>
>>>>>>> Attached is the output.
>>>>>>>
>>>>>>> Still trying to understand what exactly it means.
>>>>>>>
>>>>>>> On Sun, Mar 25, 2018 at 10:40 AM, Fawze Abujaber <fa...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Nice!
>>>>>>>>
>>>>>>>> Shmuel, were you able to run it at the cluster level or for a specific
>>>>>>>> job?
>>>>>>>>
>>>>>>>> Did you configure it in spark-defaults.conf?
>>>>>>>>
>>>>>>>> On Sun, 25 Mar 2018 at 10:34 Shmuel Blitz <
>>>>>>>> shmuel.blitz@similarweb.com> wrote:
>>>>>>>>
>>>>>>>>> Just to let you know, I have managed to run SparkLens on our
>>>>>>>>> cluster.
>>>>>>>>>
>>>>>>>>> I switched to the spark_1.6 branch, and also compiled against the
>>>>>>>>> specific image of Spark we are using (cdh5.7.6).
>>>>>>>>>
>>>>>>>>> Now I need to figure out what the output means... :P
>>>>>>>>>
>>>>>>>>> Shmuel
>>>>>>>>>
>>>>>>>>> On Fri, Mar 23, 2018 at 7:24 PM, Fawze Abujaber <fawzeaj@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Quick question:
>>>>>>>>>>
>>>>>>>>>> How do I add --jars /path/to/sparklens_2.11-0.1.0.jar to
>>>>>>>>>> spark-defaults.conf? Should it be:
>>>>>>>>>>
>>>>>>>>>> spark.driver.extraClassPath /path/to/sparklens_2.11-0.1.0.jar, or
>>>>>>>>>> should I use the spark.jars option? Could anyone give an example of how
>>>>>>>>>> it should look, and say whether the path for the jar should be an HDFS
>>>>>>>>>> path, as I'm using it in cluster mode?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Mar 23, 2018 at 6:33 AM, Fawze Abujaber <
>>>>>>>>>> fawzeaj@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Shmuel,
>>>>>>>>>>>
>>>>>>>>>>> Did you compile the code against the right branch for Spark 1.6?
>>>>>>>>>>>
>>>>>>>>>>> I tested it and it looks like it is working, and now I'm testing the
>>>>>>>>>>> branch more widely. Please use the branch for Spark 1.6.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Mar 23, 2018 at 12:43 AM, Shmuel Blitz <
>>>>>>>>>>> shmuel.blitz@similarweb.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Rohit,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for sharing this great tool.
>>>>>>>>>>>> I tried running a Spark job with the tool, but it failed with
>>>>>>>>>>>> an *IncompatibleClassChangeError* exception.
>>>>>>>>>>>>
>>>>>>>>>>>> I have opened an issue on GitHub (https://github.com/qubole/sparklens/issues/1).
>>>>>>>>>>>>
>>>>>>>>>>>> Shmuel
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Mar 22, 2018 at 5:05 PM, Shmuel Blitz <
>>>>>>>>>>>> shmuel.blitz@similarweb.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We will give this a try and report back.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Shmuel
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Mar 22, 2018 at 4:22 PM, Rohit Karlupia <
>>>>>>>>>>>>> rohitk@qubole.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks everyone!
>>>>>>>>>>>>>> Please share how it works and how it doesn't. Both help.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Fawze, I just made a few changes to make this work with Spark
>>>>>>>>>>>>>> 1.6. Can you please try building from branch *spark_1.6*?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>>> rohitk
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Mar 22, 2018 at 10:18 AM, Fawze Abujaber <
>>>>>>>>>>>>>> fawzeaj@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It's super amazing .... I see it was tested on Spark 2.0.0
>>>>>>>>>>>>>>> and above. What about Spark 1.6, which is still part of Cloudera's main
>>>>>>>>>>>>>>> versions?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We have a large number of Spark applications on version 1.6.0.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Mar 22, 2018 at 6:38 AM, Holden Karau <
>>>>>>>>>>>>>>> holden@pigscanfly.ca> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Super exciting! I look forward to digging through it this
>>>>>>>>>>>>>>>> weekend.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Mar 21, 2018 at 9:33 PM ☼ R Nair (रविशंकर नायर) <
>>>>>>>>>>>>>>>> ravishankar.nair@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Excellent. You filled a missing link.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>> Passion
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Mar 21, 2018 at 11:36 PM, Rohit Karlupia <
>>>>>>>>>>>>>>>>> rohitk@qubole.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Happy to announce the availability of Sparklens as an open
>>>>>>>>>>>>>>>>>> source project. It helps in understanding the scalability limits of Spark
>>>>>>>>>>>>>>>>>> applications and can be a useful guide on the path towards tuning
>>>>>>>>>>>>>>>>>> applications for lower runtime or cost.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Please clone from here: https://github.com/qubole/sparklens
>>>>>>>>>>>>>>>>>> Old blogpost: https://www.qubole.com/blog/introducing-quboles-spark-tuning-tool/
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>>>>>>> rohitk
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> PS: Thanks for your patience. It took a couple of months to
>>>>>>>>>>>>>>>>>> get back on this.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Shmuel Blitz
>>>>>>>>>>>>> Big Data Developer
>>>>>>>>>>>>> Email: shmuel.blitz@similarweb.com
>>>>>>>>>>>>> www.similarweb.com
>>>>>>>>>>>>> <https://www.facebook.com/SimilarWeb/>
>>>>>>>>>>>>> <https://www.linkedin.com/company/429838/>
>>>>>>>>>>>>> <https://twitter.com/similarweb>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Take Care
>>> Fawze Abujaber
>>>
>>
>> --
> Take Care
> Fawze Abujaber
>

Re: Open sourcing Sparklens: Qubole's Spark Tuning Tool

Posted by Fawze Abujaber <fa...@gmail.com>.

Thanks for the update.

What about cores per executor?

On Tue, 27 Mar 2018 at 6:45 Rohit Karlupia <ro...@qubole.com> wrote:

> Thanks Fawze!
>
> On the memory front, I am currently working on GC and CPU aware task
> scheduling. I see wonderful results based on my tests so far. Once the
> feature is complete and available, Spark will work with whatever memory is
> provided (as long as it is enough for the largest possible task). It will
> also allow you to run, say, 64 concurrent tasks on an 8 core machine, if the
> nature of the tasks doesn't lead to memory or CPU contention. Essentially, why
> worry about tuning memory when you can let Spark take care of it automatically
> based on memory pressure? Will post details when we are ready. So yes, we
> are working on memory, but it will not be a tool but a transparent feature.
>
> thanks,
> rohitk
>
>
>
>
> On Tue, Mar 27, 2018 at 7:53 AM, Fawze Abujaber <fa...@gmail.com> wrote:
>
>> Hi Rohit,
>>
>> I would like to thank you for the unlimited patience and support that you
>> are providing here and behind the scenes for all of us.
>>
>> The tool is amazing and easy to use, and most of the metrics are easy to
>> understand ...
>>
>> As for whether we need to run it in cluster mode and all the time, I think
>> we can skip that, as one or a few runs can give you the big picture of how
>> the job is running with different configurations, and it's not too
>> complicated to run it using spark-submit.
>>
>> I think it would be very helpful if Sparklens could also show how the
>> job runs with different configurations of cores and memory. A Spark job
>> with 1 executor and 1 core will run differently from a Spark job with 1
>> executor and 3 cores, and the same surely holds for different executor memory.
>>
>> Overall, it is a very good starting point, but it will be a GAME CHANGER
>> to get these metrics in the tool.
>>
>> @Rohit, huge THANK YOU
>>
>> On Mon, Mar 26, 2018 at 1:35 PM, Rohit Karlupia <ro...@qubole.com>
>> wrote:
>>
>>> Hi Shmuel,
>>>
>>> In general it is hard to pinpoint the exact code that is responsible
>>> for a specific stage. For example, when using Spark SQL, depending upon the
>>> kind of joins and aggregations used in a single line of the query, we will
>>> have multiple stages in the Spark application. I usually try to split the
>>> code into smaller chunks and also use the Spark UI, which has a special
>>> section for SQL. It can also show specific backtraces, but as I explained
>>> earlier they might not be very helpful. Sparklens does help you ask the
>>> right questions, but is not mature enough to answer all of them.
>>>
>>> Understanding the report:
>>>
>>> *1) The first part shows total aggregate metrics for the application.*
>>>
>>> Printing application meterics.....
>>>
>>>  AggregateMetrics (Application Metrics) total measurements 1869
>>>                 NAME                        SUM                MIN           MAX                MEAN
>>>  diskBytesSpilled                            0.0 KB         0.0 KB         0.0 KB              0.0 KB
>>>  executorRuntime                            15.1 hh         3.0 ms         4.0 mm             29.1 ss
>>>  inputBytesRead                             26.1 GB         0.0 KB        43.8 MB             14.3 MB
>>>  jvmGCTime                                  11.0 mm         0.0 ms         2.1 ss            354.0 ms
>>>  memoryBytesSpilled                        314.2 GB         0.0 KB         1.1 GB            172.1 MB
>>>  outputBytesWritten                          0.0 KB         0.0 KB         0.0 KB              0.0 KB
>>>  peakExecutionMemory                         0.0 KB         0.0 KB         0.0 KB              0.0 KB
>>>  resultSize                                 12.9 MB         2.0 KB        40.9 KB              7.1 KB
>>>  shuffleReadBytesRead                      107.7 GB         0.0 KB       276.0 MB             59.0 MB
>>>  shuffleReadFetchWaitTime                    2.0 ms         0.0 ms         0.0 ms              0.0 ms
>>>  shuffleReadLocalBlocks                       2,318              0             68                   1
>>>  shuffleReadRecordsRead               3,413,511,099              0      8,251,926           1,826,383
>>>  shuffleReadRemoteBlocks                    291,126              0            824                 155
>>>  shuffleWriteBytesWritten                  107.6 GB         0.0 KB       257.6 MB             58.9 MB
>>>  shuffleWriteRecordsWritten           3,408,133,175              0      7,959,055           1,823,506
>>>  shuffleWriteTime                            8.7 mm         0.0 ms         1.8 ss            278.2 ms
>>>  taskDuration                               15.4 hh        12.0 ms         4.1 mm             29.7 ss
>>>
>>>
>>> *2) Here we show the number of hosts used and executors per host. I have seen users set executor memory to 33GB on a 64GB host. Only one such executor fits per host, which is a direct waste of 31GB of memory.*
>>>
>>> Total Hosts 135
>>>
>>>
>>> Host server86.cluster.com startTime 02:26:21:081 executors count 3
>>> Host server164.cluster.com startTime 02:30:12:204 executors count 1
>>> Host server28.cluster.com startTime 02:31:09:023 executors count 1
>>> Host server78.cluster.com startTime 02:26:08:844 executors count 5
>>> Host server124.cluster.com startTime 02:26:10:523 executors count 3
>>> Host server100.cluster.com startTime 02:30:24:073 executors count 1
>>> Done printing host timeline
>>> *3) Time at which executors were added. Not all executors are available at the start of the application.*
>>>
>>> Printing executors timeline....
>>> Total Hosts 135
>>> Total Executors 250
>>> At 02:26 executors added 52 & removed  0 currently available 52
>>> At 02:27 executors added 10 & removed  0 currently available 62
>>> At 02:28 executors added 13 & removed  0 currently available 75
>>> At 02:29 executors added 81 & removed  0 currently available 156
>>> At 02:30 executors added 48 & removed  0 currently available 204
>>> At 02:31 executors added 45 & removed  0 currently available 249
>>> At 02:32 executors added 1 & removed  0 currently available 250
>>>
>>>
>>> *4) How the stages within the jobs were scheduled. Helps you understand which stages ran in parallel and which are dependent on others.*
>>>
>>> Printing Application timeline
>>> 02:26:47:654      Stage 3 ended : maxTaskTime 3117 taskCount 1
>>> 02:26:47:708      Stage 4 started : duration 00m 02s
>>> 02:26:49:898      Stage 4 ended : maxTaskTime 226 taskCount 200
>>> 02:26:49:901 JOB 3 ended
>>> 02:26:56:234 JOB 4 started : duration 08m 28s
>>> [      5 |||||||                                                                         ]
>>> [      6  |||||||||||||||||||                                                            ]
>>> [      9                   ||||||||                                                      ]
>>> [     10     ||||||||||||||                                                              ]
>>> [     11                                                                                 ]
>>> [     12                     ||                                                          ]
>>> [     13                       ||||                                                      ]
>>> [     14                           |||||||||||||||                                       ]
>>> [     15                                          |||||||||||||||||||||||||||||||||||||| ]
>>> 02:26:58:095      Stage 5 started : duration 00m 44s
>>> 02:27:42:816      Stage 5 ended : maxTaskTime 37214 taskCount 23
>>> 02:27:03:478      Stage 6 started : duration 02m 04s
>>> 02:29:07:517      Stage 6 ended : maxTaskTime 35578 taskCount 601
>>> 02:28:56:449      Stage 9 started : duration 00m 46s
>>> 02:29:42:625      Stage 9 ended : maxTaskTime 7196 taskCount 200
>>> 02:27:22:343      Stage 10 started : duration 01m 33s
>>> 02:28:56:333      Stage 10 ended : maxTaskTime 49203 taskCount 39
>>> 02:27:23:910      Stage 11 started : duration 00m 00s
>>> 02:27:24:422      Stage 11 ended : maxTaskTime 298 taskCount 2
>>> 02:29:06:902      Stage 12 started : duration 00m 12s
>>> 02:29:19:350      Stage 12 ended : maxTaskTime 11511 taskCount 200
>>> 02:29:19:413      Stage 13 started : duration 00m 25s
>>> 02:29:44:444      Stage 13 ended : maxTaskTime 24924 taskCount 200
>>> 02:29:44:491      Stage 14 started : duration 01m 36s
>>> 02:31:20:873      Stage 14 ended : maxTaskTime 86194 taskCount 200
>>> 02:31:20:973      Stage 15 started : duration 04m 03s
>>> 02:35:24:346      Stage 15 ended : maxTaskTime 238747 taskCount 200
>>> 02:35:24:347 JOB 4 ended
>>> 02:35:28:841 app ended
>>> *5) I guess these metrics are well explained.*
>>>
>>>
>>>  Time spent in Driver vs Executors
>>>  Driver WallClock Time    01m 02s   10.66%
>>>  Executor WallClock Time  08m 43s   89.34%
>>>  Total WallClock Time     09m 46s
>>>
>>>
>>>
>>> Minimum possible time for the app based on the critical path (with infinite resources)   07m 59s
>>> Minimum possible time for the app with same executors, perfect parallelism and zero skew 02m 15s
>>> If we were to run this app with single executor and single core                          15h 08m
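A note on the three estimates above: the numbers in this report appear consistent with simple arithmetic, where the "perfect parallelism and zero skew" figure is roughly the driver time plus the single-core time spread evenly over all 750 cores. The sketch below only checks that arithmetic against the reported values. It is an assumption about how the figure comes about, not necessarily the exact formula Sparklens uses, and the helper names are made up for illustration.

```python
def to_seconds(hours=0, minutes=0, seconds=0):
    """Convert an h/m/s reading from the report into seconds."""
    return hours * 3600 + minutes * 60 + seconds


def fmt_mm_ss(total_seconds):
    """Format seconds in the report's 'MMm SSs' style, rounded to the nearest second."""
    total = round(total_seconds)
    return f"{total // 60:02d}m {total % 60:02d}s"


# Values copied from the report above.
driver_time = to_seconds(minutes=1, seconds=2)       # Driver WallClock Time
single_core_time = to_seconds(hours=15, minutes=8)   # single executor, single core
total_cores = 750                                    # Total cores available to the app

# Assumed model: driver work stays serial, executor work divides evenly over cores.
ideal_time = driver_time + single_core_time / total_cores

print(fmt_mm_ss(ideal_time))  # prints "02m 15s", matching the report
```

The gap between this 02m 15s and the 07m 59s critical-path estimate is the part that more executors cannot remove. It comes from stage dependencies, low task counts and skew.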
>>>
>>>
>>>  Total cores available to the app 750
>>>
>>>  OneCoreComputeHours: Measure of total compute power available from cluster. One core in the executor, running
>>>                       for one hour, counts as one OneCoreComputeHour. Executors with 4 cores, will have 4 times
>>>                       the OneCoreComputeHours compared to one with just one core. Similarly, one core executor
>>>                       running for 4 hours will OnCoreComputeHours equal to 4 core executor running for 1 hour.
>>>
>>>  Driver Utilization (Cluster idle because of driver)
>>>
>>>  Total OneCoreComputeHours available                            122h 07m
>>>  Total OneCoreComputeHours available (AutoScale Aware)           77h 25m
>>>  OneCoreComputeHours wasted by driver                            13h 01m
>>>
>>>  AutoScale Aware: Most of the calculations by this tool will assume that all executors are available throughout
>>>                   the runtime of the application. The number above is printed to show possible caution to be
>>>                   taken in interpreting the efficiency metrics.
>>>
>>>  Cluster Utilization (Executors idle because of lack of tasks or skew)
>>>
>>>  Executor OneCoreComputeHours available                 109h 06m
>>>  Executor OneCoreComputeHours used                       15h 07m        13.86%
>>>  OneCoreComputeHours wasted                              93h 59m        86.14%
>>>
>>>  App Level Wastage Metrics (Driver + Executor)
>>>
>>>  OneCoreComputeHours wasted Driver               10.66%
>>>  OneCoreComputeHours wasted Executor             76.96%
>>>  OneCoreComputeHours wasted Total                87.62%
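The percentages in the utilization blocks above can be reproduced from the OneCoreComputeHours figures themselves. This is a small sketch using only numbers copied from the report. The helper names are hypothetical, and since the report rounds to whole minutes, the match is to two decimals only.

```python
def minutes(hours, mins):
    """Convert an 'HHh MMm' reading from the report into minutes."""
    return hours * 60 + mins


def pct(part, whole):
    """Percentage of whole, rounded to two decimals like the report."""
    return round(100.0 * part / whole, 2)


# Values copied from the report above.
total_available = minutes(122, 7)      # Total OneCoreComputeHours available
driver_wasted = minutes(13, 1)         # OneCoreComputeHours wasted by driver
executor_available = minutes(109, 6)   # Executor OneCoreComputeHours available
executor_used = minutes(15, 7)         # Executor OneCoreComputeHours used
executor_wasted = minutes(93, 59)      # OneCoreComputeHours wasted

# Cluster utilization: measured against executor-side hours only.
print(pct(executor_used, executor_available))    # 13.86
print(pct(executor_wasted, executor_available))  # 86.14

# App-level wastage: measured against total hours (driver + executor).
print(pct(driver_wasted, total_available))                    # 10.66
print(pct(executor_wasted, total_available))                  # 76.96
print(pct(driver_wasted + executor_wasted, total_available))  # 87.62
```

Note that the same 93h 59m of executor waste shows up as 86.14% in one block and 76.96% in the other, simply because the denominators differ.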
>>>
>>>
>>>
>>> *6) Here we use the simulation to estimate how the application's wall clock time will vary as we change the number of executors. The goal is to run the application at 100% cluster utilization and minimum time. Look for ROI in terms of wall clock time due to additional executors. Also, if the application is not scaling, this is a good time to revisit the application and look for why it is not scaling.*
>>>
>>>  App completion time and cluster utilization estimates with different executor counts
>>>
>>>  Real App Duration 09m 46s
>>>  Model Estimation  08m 01s
>>>  Model Error       17%
>>>
>>>  NOTE: 1) Model error could be large when auto-scaling is enabled.
>>>        2) Model doesn't handles multiple jobs run via thread-pool. For better insights into
>>>           application scalability, please try such jobs one by one without thread-pool.
>>>
>>>
>>>  Executor count    25  ( 10%) estimated time 17m 07s and estimated cluster utilization 70.61%
>>>  Executor count    50  ( 20%) estimated time 12m 15s and estimated cluster utilization 49.34%
>>>  Executor count   125  ( 50%) estimated time 08m 25s and estimated cluster utilization 28.72%
>>>  Executor count   200  ( 80%) estimated time 08m 15s and estimated cluster utilization 18.29%
>>>  Executor count   250  (100%) estimated time 08m 01s and estimated cluster utilization 15.06%
>>>  Executor count   275  (110%) estimated time 08m 00s and estimated cluster utilization 13.72%
>>>  Executor count   300  (120%) estimated time 07m 59s and estimated cluster utilization 12.61%
>>>  Executor count   375  (150%) estimated time 07m 59s and estimated cluster utilization 10.09%
>>>  Executor count   500  (200%) estimated time 07m 59s and estimated cluster utilization 7.57%
>>>  Executor count   750  (300%) estimated time 07m 59s and estimated cluster utilization 5.04%
>>>  Executor count  1000  (400%) estimated time 07m 59s and estimated cluster utilization 3.78%
>>>  Executor count  1250  (500%) estimated time 07m 59s and estimated cluster utilization 3.03%
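One way to read the ROI out of the table above is to look at how many seconds each step up in executor count actually saves. The sketch below does this with the first few rows copied from the report. The function names and the 5% tolerance are illustrative assumptions, not something Sparklens provides.

```python
# (executor count, estimated time in seconds), copied from the table above.
estimates = [
    (25, 17 * 60 + 7),
    (50, 12 * 60 + 15),
    (125, 8 * 60 + 25),
    (200, 8 * 60 + 15),
    (250, 8 * 60 + 1),
    (275, 8 * 60 + 0),
    (300, 7 * 60 + 59),
    (375, 7 * 60 + 59),
]


def marginal_gains(rows):
    """Seconds saved per step, as (executor count, seconds saved vs. previous row)."""
    return [
        (count, prev_time - time)
        for (_, prev_time), (count, time) in zip(rows, rows[1:])
    ]


def smallest_within(rows, baseline_seconds, tolerance):
    """Smallest executor count whose estimate is within `tolerance` of the baseline."""
    limit = baseline_seconds * (1 + tolerance)
    return min(count for count, time in rows if time <= limit)


for count, saved in marginal_gains(estimates):
    print(f"up to {count:4d} executors: saves {saved:3d}s")

# Smallest count that stays within 5% of the estimate for the actual 250 executors.
print(smallest_within(estimates, baseline_seconds=8 * 60 + 1, tolerance=0.05))  # 125
```

Going from 25 to 50 executors saves almost five minutes, while going from 125 to 250 saves only 24 seconds. That is the diminishing return the table is meant to expose.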
>>>
>>> *7) These two sections are for finding out which stages are taking most of the wall-clock time and why. It is either not enough parallelism or skew. Parallelism is easier to fix. Fixing skew will require changing the application in a way that creates more uniform tasks.*
>>> Total tasks in all stages 1869
>>> Per Stage  Utilization
>>> Stage-ID   Wall    Task      Task     IO%    Input     Output    ----Shuffle-----    -WallClockTime-    --OneCoreComputeHours---   MaxTaskMem
>>>           Clock%  Runtime%   Count                               Input  |  Output    Measured | Ideal   Available| Used%|Wasted%
>>>        0    0.00    0.00         1    0.0   64.0 KB    0.0 KB    0.0 KB    0.0 KB    00m 02s   00m 00s    00h 27m    0.0  100.0    0.0 KB
>>>        1    0.00    0.00         1    0.0   64.0 KB    0.0 KB    0.0 KB    0.0 KB    00m 02s   00m 00s    00h 30m    0.1   99.9    0.0 KB
>>>        2    0.00    0.00         1    0.0   90.0 KB    0.0 KB    0.0 KB    0.0 KB    00m 03s   00m 00s    00h 37m    0.1   99.9    0.0 KB
>>>        3    0.00    0.01         1    0.0  867.1 KB    0.0 KB    0.0 KB  148.4 KB    00m 04s   00m 00s    01h 01m    0.1   99.9    0.0 KB
>>>        4    0.00    0.00       200    0.0    0.0 KB    0.0 KB  148.4 KB    0.0 KB    00m 02s   00m 00s    00h 27m    0.1   99.9    0.0 KB
>>>        5    6.00    1.15        23    0.2  402.1 MB    0.0 KB    0.0 KB    1.3 GB    00m 44s   00m 00s    09h 19m    1.9   98.1    0.0 KB
>>>        6   17.00   19.92       601    7.1   17.2 GB    0.0 KB    0.0 KB    1.8 GB    02m 04s   00m 14s    25h 50m   11.7   88.3    0.0 KB
>>>        9    6.00    0.73       200    2.9    6.9 GB    0.0 KB  409.5 MB    2.8 GB    00m 46s   00m 00s    09h 37m    1.2   98.8    0.0 KB
>>>       10   13.00    2.27        39    0.3  807.8 MB    0.0 KB    0.0 KB    2.5 GB    01m 33s   00m 01s    19h 34m    1.7   98.3    0.0 KB
>>>       11    0.00    0.00         2    0.0   31.5 KB    0.0 KB    0.0 KB   60.0 KB    00m 00s   00m 00s    00h 06m    0.1   99.9    0.0 KB
>>>       12    1.00    2.15       200    0.3  758.7 MB    0.0 KB    2.3 GB    1.5 GB    00m 12s   00m 01s    02h 35m   12.6   87.4    0.0 KB
>>>       13    3.00    5.91       200    0.0    0.0 KB    0.0 KB    1.5 GB   47.5 GB    00m 25s   00m 04s    05h 12m   17.1   82.9    0.0 KB
>>>       14   13.00   19.83       200    0.0    0.0 KB    0.0 KB   50.3 GB   50.3 GB    01m 36s   00m 14s    20h 04m   14.9   85.1    0.0 KB
>>>       15   34.00   48.02       200    0.0    0.0 KB    0.0 KB   53.2 GB    0.0 KB    04m 03s   00m 34s    50h 42m   14.3   85.7    0.0 KB
>>>
>>>
>>>  Stage-ID WallClock  OneCore       Task   PRatio    -----Task------   OIRatio  |* ShuffleWrite% ReadFetch%   GC%  *|
>>>           Stage%     ComputeHours  Count            Skew   StageSkew
>>>       0    0.32         00h 00m       1    0.00     1.00     0.37     0.00     |*   0.00           0.00    15.10  *|
>>>       1    0.35         00h 00m       1    0.00     1.00     0.38     0.00     |*   0.00           0.00    15.56  *|
>>>       2    0.43         00h 00m       1    0.00     1.00     0.45     0.00     |*   0.00           0.00     8.88  *|
>>>       3    0.70         00h 00m       1    0.00     1.00     0.63     0.17     |*   4.51           0.00     6.74  *|
>>>       4    0.31         00h 00m     200    0.27    37.67     0.10     0.00     |*   0.00           0.04    23.79  *|
>>>       5    6.38         00h 10m      23    0.03     1.42     0.83     3.18     |*   1.08           0.00     2.72  *|
>>>       6   17.68         03h 00m     601    0.80     2.07     0.29     0.10     |*   0.60           0.00     1.90  *|
>>>       9    6.58         00h 06m     200    0.27     5.20     0.16     0.38     |*   4.74          13.24     4.04  *|
>>>      10   13.40         00h 20m      39    0.05     1.67     0.52     3.17     |*   1.10           0.00     1.96  *|
>>>      11    0.07         00h 00m       2    0.00     1.00     0.58     1.91     |*  13.59           0.00     0.00  *|
>>>      12    1.77         00h 19m     200    0.27     1.99     0.92     0.50     |*   1.85          19.63     3.09  *|
>>>      13    3.57         00h 53m     200    0.27     1.59     1.00    31.42     |*   6.06          12.25     1.33  *|
>>>      14   13.74         02h 59m     200    0.27     1.65     0.89     1.00     |*   1.84           2.38     0.83  *|
>>>      15   34.69         07h 15m     200    0.27     1.88     0.98     0.00     |*   0.00           4.21     0.88  *|
>>>
>>> PRatio:        Number of tasks in stage divided by number of cores. Represents degree of
>>>                parallelism in the stage
>>> TaskSkew:      Duration of largest task in stage divided by duration of median task.
>>>                Represents degree of skew in the stage
>>> TaskStageSkew: Duration of largest task in stage divided by total duration of the stage.
>>>                Represents the impact of the largest task on stage time.
>>> OIRatio:       Output to input ration. Total output of the stage (results + shuffle write)
>>>                divided by total input (input data + shuffle read)
>>>
>>> These metrics below represent distribution of time within the stage
>>>
>>> ShuffleWrite:  Amount of time spent in shuffle writes across all tasks in the given
>>>                stage as a percentage
>>> ReadFetch:     Amount of time spent in shuffle read across all tasks in the given
>>>                stage as a percentage
>>> GC:            Amount of time spent in GC across all tasks in the given stage as a
>>>                percentage
>>>
>>> If the stage contributes large percentage to overall application time, we could look into
>>> these metrics to check which part (Shuffle write, read fetch or GC is responsible)
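To make the PRatio and TaskStageSkew definitions concrete, here is a short sketch that recomputes them for stages 14 and 15 from the task counts, timestamps and maxTaskTime values in the application timeline (section 4). The function names are hypothetical. The inputs are taken from this report, and maxTaskTime is assumed to be in milliseconds.

```python
def to_millis(timestamp):
    """Parse the report's 'HH:MM:SS:mmm' timestamps into milliseconds."""
    hours, mins, secs, millis = (int(part) for part in timestamp.split(":"))
    return ((hours * 60 + mins) * 60 + secs) * 1000 + millis


def p_ratio(task_count, total_cores):
    """PRatio: number of tasks in the stage divided by number of cores."""
    return task_count / total_cores


def task_stage_skew(max_task_millis, stage_start, stage_end):
    """TaskStageSkew: largest task duration divided by total stage duration."""
    return max_task_millis / (to_millis(stage_end) - to_millis(stage_start))


TOTAL_CORES = 750  # "Total cores available to the app"

# Stage 15: 200 tasks, maxTaskTime 238747, ran 02:31:20:973 to 02:35:24:346.
print(round(p_ratio(200, TOTAL_CORES), 2))                                   # 0.27
print(round(task_stage_skew(238747, "02:31:20:973", "02:35:24:346"), 2))     # 0.98

# Stage 14: 200 tasks, maxTaskTime 86194, ran 02:29:44:491 to 02:31:20:873.
print(round(task_stage_skew(86194, "02:29:44:491", "02:31:20:873"), 2))      # 0.89

# Stage 6: 601 tasks.
print(round(p_ratio(601, TOTAL_CORES), 2))                                   # 0.8
```

A PRatio of 0.27 means almost three quarters of the cores have nothing to run in that stage, and a TaskStageSkew close to 1 means a single task spans nearly the whole stage.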
>>>
>>> thanks,
>>>
>>> rohitk
>>>
>>>
>>>
>>> On Mon, Mar 26, 2018 at 1:38 AM, Shmuel Blitz <
>>> shmuel.blitz@similarweb.com> wrote:
>>>
>>>> Hi Rohit,
>>>>
>>>> Thanks for the analysis.
>>>>
>>>> I can use repartition on the slow task. But how can I tell what part of
>>>> the code is in charge of the slow tasks?
>>>>
>>>> It would be great if you could further explain the rest of the output.
>>>>
>>>> Thanks in advance,
>>>> Shmuel
>>>>
>>>> On Sun, Mar 25, 2018 at 12:46 PM, Rohit Karlupia <ro...@qubole.com>
>>>> wrote:
>>>>
>>>>> Thanks Shmuel for trying out Sparklens!
>>>>>
>>>>> A couple of things that I noticed:
>>>>> 1) 250 executors is probably overkill for this job. It would run in
>>>>> about the same time with around 100.
>>>>> 2) Many of the stages that take a long time have only 200 tasks, whereas we
>>>>> have 750 cores available for the job. 200 is the default value for
>>>>> spark.sql.shuffle.partitions. Alternatively, you could try increasing
>>>>> the value of spark.sql.shuffle.partitions to at least 750.
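For point 2, a minimal sketch of deriving the setting from the cluster shape and passing it to spark-submit. The helper name is hypothetical, and 3 cores per executor is an assumption inferred from the report (750 cores across 250 executors). `--conf` and `spark.sql.shuffle.partitions` are the standard spark-submit flag and Spark SQL property.

```python
def shuffle_partition_conf(executors, cores_per_executor, tasks_per_core=1):
    """Build spark-submit arguments giving each core at least `tasks_per_core` shuffle tasks."""
    partitions = executors * cores_per_executor * tasks_per_core
    return ["--conf", f"spark.sql.shuffle.partitions={partitions}"]


# 250 executors with 3 cores each is the 750-core setup from the report.
args = shuffle_partition_conf(executors=250, cores_per_executor=3)
print(" ".join(args))  # --conf spark.sql.shuffle.partitions=750
```

With around 100 executors, as suggested in point 1, the same rule gives 300 partitions, so the two suggestions can be combined.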
>>>>>
>>>>> thanks,
>>>>> rohitk
>>>>>
>>>>> On Sun, Mar 25, 2018 at 1:25 PM, Shmuel Blitz <
>>>>> shmuel.blitz@similarweb.com> wrote:
>>>>>
>>>>>> I ran it on a single job.
>>>>>> SparkLens has an overhead on the job duration. I'm not ready to
>>>>>> enable it by default on all our jobs.
>>>>>>
>>>>>> Attached is the output.
>>>>>>
>>>>>> Still trying to understand what exactly it means.
>>>>>>
>>>>>> On Sun, Mar 25, 2018 at 10:40 AM, Fawze Abujaber <fa...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Nice!
>>>>>>>
>>>>>>> Shmuel, Were you able to run on a cluster level or for a specific
>>>>>>> job?
>>>>>>>
>>>>>>> Did you configure it in spark-defaults.conf?
>>>>>>>
>>>>>>> On Sun, 25 Mar 2018 at 10:34 Shmuel Blitz <
>>>>>>> shmuel.blitz@similarweb.com> wrote:
>>>>>>>
>>>>>>>> Just to let you know, I have managed to run SparkLens on our
>>>>>>>> cluster.
>>>>>>>>
>>>>>>>> I switched to the spark_1.6 branch, and also compiled against the
>>>>>>>> specific image of Spark we are using (cdh5.7.6).
>>>>>>>>
>>>>>>>> Now I need to figure out what the output means... :P
>>>>>>>>
>>>>>>>> Shmuel
>>>>>>>>
>>>>>>>> On Fri, Mar 23, 2018 at 7:24 PM, Fawze Abujaber <fa...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Quick question:
>>>>>>>>>
>>>>>>>>> How do I add the --jars /path/to/sparklens_2.11-0.1.0.jar option to
>>>>>>>>> spark-defaults.conf? Should I use:
>>>>>>>>>
>>>>>>>>> spark.driver.extraClassPath /path/to/sparklens_2.11-0.1.0.jar
>>>>>>>>>
>>>>>>>>> or should I use the spark.jars option? Could anyone give an example of how
>>>>>>>>> it should look, and say whether the path for the jar should be an HDFS
>>>>>>>>> path, as I'm using it in cluster mode?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Mar 23, 2018 at 6:33 AM, Fawze Abujaber <fawzeaj@gmail.com
>>>>>>>>> > wrote:
>>>>>>>>>
>>>>>>>>>> Hi Shmuel,
>>>>>>>>>>
>>>>>>>>>> Did you compile the code against the right branch for Spark 1.6?
>>>>>>>>>>
>>>>>>>>>> I tested it and it looks like it is working, and I am now testing the
>>>>>>>>>> branch more widely. Please use the branch for Spark 1.6.
>>>>>>>>>>
>>>>>>>>>> On Fri, Mar 23, 2018 at 12:43 AM, Shmuel Blitz <
>>>>>>>>>> shmuel.blitz@similarweb.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Rohit,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for sharing this great tool.
>>>>>>>>>>> I tried running a Spark job with the tool, but it failed with an
>>>>>>>>>>> *IncompatibleClassChangeError* exception.
>>>>>>>>>>>
>>>>>>>>>>> I have opened an issue on GitHub
>>>>>>>>>>> (https://github.com/qubole/sparklens/issues/1).
>>>>>>>>>>>
>>>>>>>>>>> Shmuel
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Mar 22, 2018 at 5:05 PM, Shmuel Blitz <
>>>>>>>>>>> shmuel.blitz@similarweb.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>
>>>>>>>>>>>> We will give this a try and report back.
>>>>>>>>>>>>
>>>>>>>>>>>> Shmuel
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Mar 22, 2018 at 4:22 PM, Rohit Karlupia <
>>>>>>>>>>>> rohitk@qubole.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks everyone!
>>>>>>>>>>>>> Please share how it works and how it doesn't. Both help.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Fawze, I just made a few changes to make this work with Spark
>>>>>>>>>>>>> 1.6. Can you please try building from branch *spark_1.6*?
>>>>>>>>>>>>>
>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>> rohitk
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Mar 22, 2018 at 10:18 AM, Fawze Abujaber <
>>>>>>>>>>>>> fawzeaj@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> It's super amazing .... I see it was tested on Spark 2.0.0
>>>>>>>>>>>>>> and above. What about Spark 1.6, which is still part of Cloudera's main
>>>>>>>>>>>>>> versions?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We have a vast number of Spark applications on version 1.6.0.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Mar 22, 2018 at 6:38 AM, Holden Karau <
>>>>>>>>>>>>>> holden@pigscanfly.ca> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Super exciting! I look forward to digging through it this
>>>>>>>>>>>>>>> weekend.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Mar 21, 2018 at 9:33 PM ☼ R Nair (रविशंकर नायर) <
>>>>>>>>>>>>>>> ravishankar.nair@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Excellent. You filled a missing link.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>> Passion
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Mar 21, 2018 at 11:36 PM, Rohit Karlupia <
>>>>>>>>>>>>>>>> rohitk@qubole.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Happy to announce the availability of Sparklens as open
>>>>>>>>>>>>>>>>> source project. It helps in understanding the  scalability limits of spark
>>>>>>>>>>>>>>>>> applications and can be a useful guide on the path towards tuning
>>>>>>>>>>>>>>>>> applications for lower runtime or cost.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Please clone from here:
>>>>>>>>>>>>>>>>> https://github.com/qubole/sparklens
>>>>>>>>>>>>>>>>> Old blogpost:
>>>>>>>>>>>>>>>>> https://www.qubole.com/blog/introducing-quboles-spark-tuning-tool/
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>>>>>> rohitk
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> PS: Thanks for the patience. It took couple of months to
>>>>>>>>>>>>>>>>> get back on this.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Shmuel Blitz
>>>>>>>>>>>> Big Data Developer
>>>>>>>>>>>> Email: shmuel.blitz@similarweb.com
>>>>>>>>>>>> www.similarweb.com
>>>>>>>>>>>> <https://www.facebook.com/SimilarWeb/>
>>>>>>>>>>>> <https://www.linkedin.com/company/429838/>
>>>>>>>>>>>> <https://twitter.com/similarweb>
>> --
>> Take Care
>> Fawze Abujaber
>>
>
> --
Take Care
Fawze Abujaber

Re: Open sourcing Sparklens: Qubole's Spark Tuning Tool

Posted by Rohit Karlupia <ro...@qubole.com>.

Thanks Fawze!

On the memory front, I am currently working on GC- and CPU-aware task
scheduling. I see wonderful results based on my tests so far. Once the
feature is complete and available, Spark will work with whatever memory is
provided (as long as it is enough for the largest possible task). It will also
allow you to run, say, 64 concurrent tasks on an 8-core machine, if the nature
of the tasks doesn't lead to memory or CPU contention. Essentially, why worry
about tuning memory when you can let Spark take care of it automatically
based on memory pressure? Will post details when we are ready. So yes, we
are working on memory, but it will not be a tool but a transparent feature.

thanks,
rohitk




On Tue, Mar 27, 2018 at 7:53 AM, Fawze Abujaber <fa...@gmail.com> wrote:

> Hi Rohit,
>
> I would like to thank you for the unlimited patience and support that you
> are providing here and behind the scenes for all of us.
>
> The tool is amazing and easy to use, and most of the metrics are easy to
> understand ...
>
> As for whether we need to run it in cluster mode and all the time, I think we
> can skip that, as one or a few runs give you the big picture of how the
> job is running with different configurations, and it's not too
> complicated to run it using spark-submit.
>
> I think it would be very helpful if Sparklens could also show how the
> job would run with different configurations of cores and memory. A Spark job
> with 1 executor and 1 core will run differently from a Spark job with 1 executor
> and 3 cores, and the same surely holds for different executor memory settings.
>
> Overall, it is a very good starting point, but it will be a GAME CHANGER
> to get these metrics in the tool.
>
> @Rohit, huge THANK YOU
>
> On Mon, Mar 26, 2018 at 1:35 PM, Rohit Karlupia <ro...@qubole.com> wrote:
>
>> Hi Shmuel,
>>
>> In general it is hard to pinpoint the exact code that is responsible for
>> a specific stage. For example, when using Spark SQL, depending upon the kind
>> of joins and aggregations used in a single line of the query, we will have
>> multiple stages in the Spark application. I usually try to split the code
>> into smaller chunks and also use the Spark UI, which has a special section for
>> SQL. It can also show specific backtraces, but as I explained earlier they
>> might not be very helpful. Sparklens does help you ask the right questions,
>> but is not mature enough to answer all of them.
>>
>> Understanding the report:
>>
>> *1) The first part shows total aggregate metrics for the application.*
>>
>> Printing application meterics.....
>>
>>  AggregateMetrics (Application Metrics) total measurements 1869
>>                 NAME                        SUM                MIN           MAX                MEAN
>>  diskBytesSpilled                            0.0 KB         0.0 KB         0.0 KB              0.0 KB
>>  executorRuntime                            15.1 hh         3.0 ms         4.0 mm             29.1 ss
>>  inputBytesRead                             26.1 GB         0.0 KB        43.8 MB             14.3 MB
>>  jvmGCTime                                  11.0 mm         0.0 ms         2.1 ss            354.0 ms
>>  memoryBytesSpilled                        314.2 GB         0.0 KB         1.1 GB            172.1 MB
>>  outputBytesWritten                          0.0 KB         0.0 KB         0.0 KB              0.0 KB
>>  peakExecutionMemory                         0.0 KB         0.0 KB         0.0 KB              0.0 KB
>>  resultSize                                 12.9 MB         2.0 KB        40.9 KB              7.1 KB
>>  shuffleReadBytesRead                      107.7 GB         0.0 KB       276.0 MB             59.0 MB
>>  shuffleReadFetchWaitTime                    2.0 ms         0.0 ms         0.0 ms              0.0 ms
>>  shuffleReadLocalBlocks                       2,318              0             68                   1
>>  shuffleReadRecordsRead               3,413,511,099              0      8,251,926           1,826,383
>>  shuffleReadRemoteBlocks                    291,126              0            824                 155
>>  shuffleWriteBytesWritten                  107.6 GB         0.0 KB       257.6 MB             58.9 MB
>>  shuffleWriteRecordsWritten           3,408,133,175              0      7,959,055           1,823,506
>>  shuffleWriteTime                            8.7 mm         0.0 ms         1.8 ss            278.2 ms
>>  taskDuration                               15.4 hh        12.0 ms         4.1 mm             29.7 ss
>>
>>
>> *2) Here we show the number of hosts used and executors per host. I have seen users set executor memory to 33GB on a 64GB host. Only one such executor fits per host, which is a direct waste of 31GB of memory.*
>>
>> Total Hosts 135
>>
>>
>> Host server86.cluster.com startTime 02:26:21:081 executors count 3
>> Host server164.cluster.com startTime 02:30:12:204 executors count 1
>> Host server28.cluster.com startTime 02:31:09:023 executors count 1
>> Host server78.cluster.com startTime 02:26:08:844 executors count 5
>> Host server124.cluster.com startTime 02:26:10:523 executors count 3
>> Host server100.cluster.com startTime 02:30:24:073 executors count 1
>> Done printing host timeline
>> *3) Time at which executors were added. Not all executors are available at the start of the application.*
>>
>> Printing executors timeline....
>> Total Hosts 135
>> Total Executors 250
>> At 02:26 executors added 52 & removed  0 currently available 52
>> At 02:27 executors added 10 & removed  0 currently available 62
>> At 02:28 executors added 13 & removed  0 currently available 75
>> At 02:29 executors added 81 & removed  0 currently available 156
>> At 02:30 executors added 48 & removed  0 currently available 204
>> At 02:31 executors added 45 & removed  0 currently available 249
>> At 02:32 executors added 1 & removed  0 currently available 250
>>
>>
>> *4) How the stages within the jobs were scheduled. Helps you understand which stages ran in parallel and which are dependent on others.*
>>
>> Printing Application timeline
>> 02:26:47:654      Stage 3 ended : maxTaskTime 3117 taskCount 1
>> 02:26:47:708      Stage 4 started : duration 00m 02s
>> 02:26:49:898      Stage 4 ended : maxTaskTime 226 taskCount 200
>> 02:26:49:901 JOB 3 ended
>> 02:26:56:234 JOB 4 started : duration 08m 28s
>> [      5 |||||||                                                                         ]
>> [      6  |||||||||||||||||||                                                            ]
>> [      9                   ||||||||                                                      ]
>> [     10     ||||||||||||||                                                              ]
>> [     11                                                                                 ]
>> [     12                     ||                                                          ]
>> [     13                       ||||                                                      ]
>> [     14                           |||||||||||||||                                       ]
>> [     15                                          |||||||||||||||||||||||||||||||||||||| ]
>> 02:26:58:095      Stage 5 started : duration 00m 44s
>> 02:27:42:816      Stage 5 ended : maxTaskTime 37214 taskCount 23
>> 02:27:03:478      Stage 6 started : duration 02m 04s
>> 02:29:07:517      Stage 6 ended : maxTaskTime 35578 taskCount 601
>> 02:28:56:449      Stage 9 started : duration 00m 46s
>> 02:29:42:625      Stage 9 ended : maxTaskTime 7196 taskCount 200
>> 02:27:22:343      Stage 10 started : duration 01m 33s
>> 02:28:56:333      Stage 10 ended : maxTaskTime 49203 taskCount 39
>> 02:27:23:910      Stage 11 started : duration 00m 00s
>> 02:27:24:422      Stage 11 ended : maxTaskTime 298 taskCount 2
>> 02:29:06:902      Stage 12 started : duration 00m 12s
>> 02:29:19:350      Stage 12 ended : maxTaskTime 11511 taskCount 200
>> 02:29:19:413      Stage 13 started : duration 00m 25s
>> 02:29:44:444      Stage 13 ended : maxTaskTime 24924 taskCount 200
>> 02:29:44:491      Stage 14 started : duration 01m 36s
>> 02:31:20:873      Stage 14 ended : maxTaskTime 86194 taskCount 200
>> 02:31:20:973      Stage 15 started : duration 04m 03s
>> 02:35:24:346      Stage 15 ended : maxTaskTime 238747 taskCount 200
>> 02:35:24:347 JOB 4 ended
>> 02:35:28:841 app ended
>> *5) I guess these metrics are well explained *
>>
>>
>>  Time spent in Driver vs Executors
>>  Driver WallClock Time    01m 02s   10.66%
>>  Executor WallClock Time  08m 43s   89.34%
>>  Total WallClock Time     09m 46s
>>
>>
>>
>> Minimum possible time for the app based on the critical path (with infinite resources)   07m 59s
>> Minimum possible time for the app with same executors, perfect parallelism and zero skew 02m 15s
>> If we were to run this app with single executor and single core                          15h 08m
>>
>>
>>  Total cores available to the app 750
>>
>>  OneCoreComputeHours: Measure of total compute power available from cluster. One core in the executor, running
>>                       for one hour, counts as one OneCoreComputeHour. Executors with 4 cores, will have 4 times
>>                       the OneCoreComputeHours compared to one with just one core. Similarly, one core executor
>>                       running for 4 hours will OnCoreComputeHours equal to 4 core executor running for 1 hour.
>>
>>  Driver Utilization (Cluster idle because of driver)
>>
>>  Total OneCoreComputeHours available                            122h 07m
>>  Total OneCoreComputeHours available (AutoScale Aware)           77h 25m
>>  OneCoreComputeHours wasted by driver                            13h 01m
>>
>>  AutoScale Aware: Most of the calculations by this tool will assume that all executors are available throughout
>>                   the runtime of the application. The number above is printed to show possible caution to be
>>                   taken in interpreting the efficiency metrics.
>>
>>  Cluster Utilization (Executors idle because of lack of tasks or skew)
>>
>>  Executor OneCoreComputeHours available                 109h 06m
>>  Executor OneCoreComputeHours used                       15h 07m        13.86%
>>  OneCoreComputeHours wasted                              93h 59m        86.14%
>>
>>  App Level Wastage Metrics (Driver + Executor)
>>
>>  OneCoreComputeHours wasted Driver               10.66%
>>  OneCoreComputeHours wasted Executor             76.96%
>>  OneCoreComputeHours wasted Total                87.62%
>>
>>
>>
>> *6) Here we use simulation to estimate how the application's wall-clock time will vary as we change the number of executors. The goal is to run the application at 100% cluster utilization and in minimum time. Look for the ROI, in terms of wall-clock time, of additional executors. If the application is not scaling, this is a good time to revisit it and find out why.*
>>
>>  App completion time and cluster utilization estimates with different executor counts
>>
>>  Real App Duration 09m 46s
>>  Model Estimation  08m 01s
>>  Model Error       17%
>>
>>  NOTE: 1) Model error could be large when auto-scaling is enabled.
>>        2) Model doesn't handles multiple jobs run via thread-pool. For better insights into
>>           application scalability, please try such jobs one by one without thread-pool.
>>
>>
>>  Executor count    25  ( 10%) estimated time 17m 07s and estimated cluster utilization 70.61%
>>  Executor count    50  ( 20%) estimated time 12m 15s and estimated cluster utilization 49.34%
>>  Executor count   125  ( 50%) estimated time 08m 25s and estimated cluster utilization 28.72%
>>  Executor count   200  ( 80%) estimated time 08m 15s and estimated cluster utilization 18.29%
>>  Executor count   250  (100%) estimated time 08m 01s and estimated cluster utilization 15.06%
>>  Executor count   275  (110%) estimated time 08m 00s and estimated cluster utilization 13.72%
>>  Executor count   300  (120%) estimated time 07m 59s and estimated cluster utilization 12.61%
>>  Executor count   375  (150%) estimated time 07m 59s and estimated cluster utilization 10.09%
>>  Executor count   500  (200%) estimated time 07m 59s and estimated cluster utilization 7.57%
>>  Executor count   750  (300%) estimated time 07m 59s and estimated cluster utilization 5.04%
>>  Executor count  1000  (400%) estimated time 07m 59s and estimated cluster utilization 3.78%
>>  Executor count  1250  (500%) estimated time 07m 59s and estimated cluster utilization 3.03%
>>
>> *7) These two sections help find out which stages take most of the wall-clock time and why. The cause is either insufficient parallelism or skew. Parallelism is easier to fix. Fixing skew requires changing the application in a way that creates more uniform tasks.*
>> Total tasks in all stages 1869
>> Per Stage  Utilization
>> Stage-ID   Wall    Task      Task     IO%    Input     Output    ----Shuffle-----    -WallClockTime-    --OneCoreComputeHours---   MaxTaskMem
>>           Clock%  Runtime%   Count                               Input  |  Output    Measured | Ideal   Available| Used%|Wasted%
>>        0    0.00    0.00         1    0.0   64.0 KB    0.0 KB    0.0 KB    0.0 KB    00m 02s   00m 00s    00h 27m    0.0  100.0    0.0 KB
>>        1    0.00    0.00         1    0.0   64.0 KB    0.0 KB    0.0 KB    0.0 KB    00m 02s   00m 00s    00h 30m    0.1   99.9    0.0 KB
>>        2    0.00    0.00         1    0.0   90.0 KB    0.0 KB    0.0 KB    0.0 KB    00m 03s   00m 00s    00h 37m    0.1   99.9    0.0 KB
>>        3    0.00    0.01         1    0.0  867.1 KB    0.0 KB    0.0 KB  148.4 KB    00m 04s   00m 00s    01h 01m    0.1   99.9    0.0 KB
>>        4    0.00    0.00       200    0.0    0.0 KB    0.0 KB  148.4 KB    0.0 KB    00m 02s   00m 00s    00h 27m    0.1   99.9    0.0 KB
>>        5    6.00    1.15        23    0.2  402.1 MB    0.0 KB    0.0 KB    1.3 GB    00m 44s   00m 00s    09h 19m    1.9   98.1    0.0 KB
>>        6   17.00   19.92       601    7.1   17.2 GB    0.0 KB    0.0 KB    1.8 GB    02m 04s   00m 14s    25h 50m   11.7   88.3    0.0 KB
>>        9    6.00    0.73       200    2.9    6.9 GB    0.0 KB  409.5 MB    2.8 GB    00m 46s   00m 00s    09h 37m    1.2   98.8    0.0 KB
>>       10   13.00    2.27        39    0.3  807.8 MB    0.0 KB    0.0 KB    2.5 GB    01m 33s   00m 01s    19h 34m    1.7   98.3    0.0 KB
>>       11    0.00    0.00         2    0.0   31.5 KB    0.0 KB    0.0 KB   60.0 KB    00m 00s   00m 00s    00h 06m    0.1   99.9    0.0 KB
>>       12    1.00    2.15       200    0.3  758.7 MB    0.0 KB    2.3 GB    1.5 GB    00m 12s   00m 01s    02h 35m   12.6   87.4    0.0 KB
>>       13    3.00    5.91       200    0.0    0.0 KB    0.0 KB    1.5 GB   47.5 GB    00m 25s   00m 04s    05h 12m   17.1   82.9    0.0 KB
>>       14   13.00   19.83       200    0.0    0.0 KB    0.0 KB   50.3 GB   50.3 GB    01m 36s   00m 14s    20h 04m   14.9   85.1    0.0 KB
>>       15   34.00   48.02       200    0.0    0.0 KB    0.0 KB   53.2 GB    0.0 KB    04m 03s   00m 34s    50h 42m   14.3   85.7    0.0 KB
>>
>>
>>  Stage-ID WallClock  OneCore       Task   PRatio    -----Task------   OIRatio  |* ShuffleWrite% ReadFetch%   GC%  *|
>>           Stage%     ComputeHours  Count            Skew   StageSkew
>>       0    0.32         00h 00m       1    0.00     1.00     0.37     0.00     |*   0.00           0.00    15.10  *|
>>       1    0.35         00h 00m       1    0.00     1.00     0.38     0.00     |*   0.00           0.00    15.56  *|
>>       2    0.43         00h 00m       1    0.00     1.00     0.45     0.00     |*   0.00           0.00     8.88  *|
>>       3    0.70         00h 00m       1    0.00     1.00     0.63     0.17     |*   4.51           0.00     6.74  *|
>>       4    0.31         00h 00m     200    0.27    37.67     0.10     0.00     |*   0.00           0.04    23.79  *|
>>       5    6.38         00h 10m      23    0.03     1.42     0.83     3.18     |*   1.08           0.00     2.72  *|
>>       6   17.68         03h 00m     601    0.80     2.07     0.29     0.10     |*   0.60           0.00     1.90  *|
>>       9    6.58         00h 06m     200    0.27     5.20     0.16     0.38     |*   4.74          13.24     4.04  *|
>>      10   13.40         00h 20m      39    0.05     1.67     0.52     3.17     |*   1.10           0.00     1.96  *|
>>      11    0.07         00h 00m       2    0.00     1.00     0.58     1.91     |*  13.59           0.00     0.00  *|
>>      12    1.77         00h 19m     200    0.27     1.99     0.92     0.50     |*   1.85          19.63     3.09  *|
>>      13    3.57         00h 53m     200    0.27     1.59     1.00    31.42     |*   6.06          12.25     1.33  *|
>>      14   13.74         02h 59m     200    0.27     1.65     0.89     1.00     |*   1.84           2.38     0.83  *|
>>      15   34.69         07h 15m     200    0.27     1.88     0.98     0.00     |*   0.00           4.21     0.88  *|
>>
>> PRatio:        Number of tasks in stage divided by number of cores. Represents degree of
>>                parallelism in the stage
>> TaskSkew:      Duration of largest task in stage divided by duration of median task.
>>                Represents degree of skew in the stage
>> TaskStageSkew: Duration of largest task in stage divided by total duration of the stage.
>>                Represents the impact of the largest task on stage time.
>> OIRatio:       Output to input ration. Total output of the stage (results + shuffle write)
>>                divided by total input (input data + shuffle read)
>>
>> These metrics below represent distribution of time within the stage
>>
>> ShuffleWrite:  Amount of time spent in shuffle writes across all tasks in the given
>>                stage as a percentage
>> ReadFetch:     Amount of time spent in shuffle read across all tasks in the given
>>                stage as a percentage
>> GC:            Amount of time spent in GC across all tasks in the given stage as a
>>                percentage
>>
>> If the stage contributes large percentage to overall application time, we could look into
>> these metrics to check which part (Shuffle write, read fetch or GC is responsible)
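As a rough check of the PRatio and TaskStageSkew definitions above, the sketch below recomputes both from figures printed elsewhere in this report: the 750 available cores, the per-stage task counts, and the stage start/end stamps in the application timeline. The helper names are hypothetical and are not part of Sparklens, and the parsing assumes the HH:MM:SS:mmm stamp format shown in the timeline.

```python
def to_millis(stamp: str) -> int:
    """Convert an 'HH:MM:SS:mmm' timeline stamp to milliseconds."""
    hours, minutes, seconds, millis = (int(part) for part in stamp.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def p_ratio(task_count: int, total_cores: int) -> float:
    """Number of tasks in the stage divided by the number of cores."""
    return task_count / total_cores


def task_stage_skew(max_task_ms: int, started: str, ended: str) -> float:
    """Duration of the largest task divided by the stage's total duration."""
    return max_task_ms / (to_millis(ended) - to_millis(started))


# Stages 12-15 have 200 tasks each on 750 cores; stage 6 has 601 tasks.
print(round(p_ratio(200, 750), 2))   # 0.27
print(round(p_ratio(601, 750), 2))   # 0.8

# Stage 15: maxTaskTime 238747 ms, started 02:31:20:973, ended 02:35:24:346.
print(round(task_stage_skew(238747, "02:31:20:973", "02:35:24:346"), 2))  # 0.98
```

A PRatio well below 1 means most cores had nothing to run in that stage, and a TaskStageSkew close to 1 means a single task accounted for nearly the whole stage duration.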
>>
>> thanks,
>>
>> rohitk
>>
>>
>>
>> On Mon, Mar 26, 2018 at 1:38 AM, Shmuel Blitz <
>> shmuel.blitz@similarweb.com> wrote:
>>
>>> Hi Rohit,
>>>
>>> Thanks for the analysis.
>>>
>>> I can use repartition on the slow task. But how can I tell what part of
>>> the code is in charge of the slow tasks?
>>>
>>> It would be great if you could further explain the rest of the output.
>>>
>>> Thanks in advance,
>>> Shmuel
>>>
>>> On Sun, Mar 25, 2018 at 12:46 PM, Rohit Karlupia <ro...@qubole.com>
>>> wrote:
>>>
>>>> Thanks Shmuel for trying out Sparklens!
>>>>
>>>> A couple of things that I noticed:
>>>> 1) 250 executors is probably overkill for this job. It would run in
>>>> about the same time with around 100.
>>>> 2) Many of the stages that take a long time have only 200 tasks, whereas
>>>> we have 750 cores available for the job. 200 is the default value for
>>>> spark.sql.shuffle.partitions. Alternatively, you could try increasing
>>>> the value of spark.sql.shuffle.partitions to at least 750.
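To make the second point concrete, here is a minimal sketch of the arithmetic behind it. `spark.sql.shuffle.partitions` is the real Spark SQL setting; the helper names and the `waves` parameter are illustrative assumptions, not anything Sparklens provides.

```python
def suggest_shuffle_partitions(total_cores: int, waves: int = 1) -> int:
    """Smallest partition count that gives every core `waves` tasks (hypothetical helper)."""
    return total_cores * waves


def conf_flag(partitions: int) -> str:
    """Format the setting as a spark-submit --conf argument."""
    return f"--conf spark.sql.shuffle.partitions={partitions}"


cores = 750
default_partitions = 200

# With the default, at most 200 of the 750 cores can be busy in a shuffle stage.
print(f"cores busy with default: {default_partitions / cores:.0%}")
print(conf_flag(suggest_shuffle_partitions(cores)))
```

Whether one task per core is the right target depends on the data volume per partition, so treat 750 as a starting point to measure against rather than a tuned value.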
>>>>
>>>> thanks,
>>>> rohitk
>>>>
>>>> On Sun, Mar 25, 2018 at 1:25 PM, Shmuel Blitz <
>>>> shmuel.blitz@similarweb.com> wrote:
>>>>
>>>>> I ran it on a single job.
>>>>> SparkLens has an overhead on the job duration. I'm not ready to enable
>>>>> it by default on all our jobs.
>>>>>
>>>>> Attached is the output.
>>>>>
>>>>> Still trying to understand what exactly it means.
>>>>>
>>>>> On Sun, Mar 25, 2018 at 10:40 AM, Fawze Abujaber <fa...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Nice!
>>>>>>
>>>>>> Shmuel, Were you able to run on a cluster level or for a specific job?
>>>>>>
>>>>>> Did you configure it in spark-defaults.conf?
>>>>>>
>>>>>> On Sun, 25 Mar 2018 at 10:34 Shmuel Blitz <
>>>>>> shmuel.blitz@similarweb.com> wrote:
>>>>>>
>>>>>>> Just to let you know, I have managed to run SparkLens on our cluster.
>>>>>>>
>>>>>>> I switched to the spark_1.6 branch, and also compiled against the
>>>>>>> specific image of Spark we are using (cdh5.7.6).
>>>>>>>
>>>>>>> Now I need to figure out what the output means... :P
>>>>>>>
>>>>>>> Shmuel
>>>>>>>
>>>>>>> On Fri, Mar 23, 2018 at 7:24 PM, Fawze Abujaber <fa...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Quick question:
>>>>>>>>
>>>>>>>> How do I add --jars /path/to/sparklens_2.11-0.1.0.jar to
>>>>>>>> spark-defaults.conf? Should I use
>>>>>>>>
>>>>>>>> spark.driver.extraClassPath /path/to/sparklens_2.11-0.1.0.jar, or
>>>>>>>> should I use the spark.jars option? Could anyone give an example of how
>>>>>>>> it should look, and say whether the path for the jar should be an HDFS
>>>>>>>> path, as I'm using it in cluster mode?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Mar 23, 2018 at 6:33 AM, Fawze Abujaber <fa...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Shmuel,
>>>>>>>>>
>>>>>>>>> Did you compile the code against the right branch for Spark 1.6?
>>>>>>>>>
>>>>>>>>> I tested it and it looks like it is working, and now I'm testing the
>>>>>>>>> branch more widely. Please use the branch for Spark 1.6.
>>>>>>>>>
>>>>>>>>> On Fri, Mar 23, 2018 at 12:43 AM, Shmuel Blitz <
>>>>>>>>> shmuel.blitz@similarweb.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Rohit,
>>>>>>>>>>
>>>>>>>>>> Thanks for sharing this great tool.
>>>>>>>>>> I tried running a Spark job with the tool, but it failed with an
>>>>>>>>>> *IncompatibleClassChangeError* exception.
>>>>>>>>>>
>>>>>>>>>> I have opened an issue on GitHub
>>>>>>>>>> (https://github.com/qubole/sparklens/issues/1).
>>>>>>>>>>
>>>>>>>>>> Shmuel
>>>>>>>>>>
>>>>>>>>>> On Thu, Mar 22, 2018 at 5:05 PM, Shmuel Blitz <
>>>>>>>>>> shmuel.blitz@similarweb.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks.
>>>>>>>>>>>
>>>>>>>>>>> We will give this a try and report back.
>>>>>>>>>>>
>>>>>>>>>>> Shmuel
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Mar 22, 2018 at 4:22 PM, Rohit Karlupia <
>>>>>>>>>>> rohitk@qubole.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks everyone!
>>>>>>>>>>>> Please share how it works and how it doesn't. Both help.
>>>>>>>>>>>>
>>>>>>>>>>>> Fawze, I just made a few changes to make this work with Spark 1.6.
>>>>>>>>>>>> Can you please try building from branch *spark_1.6*?
>>>>>>>>>>>>
>>>>>>>>>>>> thanks,
>>>>>>>>>>>> rohitk
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Mar 22, 2018 at 10:18 AM, Fawze Abujaber <
>>>>>>>>>>>> fawzeaj@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> It's super amazing .... i see it was tested on spark 2.0.0 and
>>>>>>>>>>>>> above, what about Spark 1.6 which is still part of Cloudera's main versions?
>>>>>>>>>>>>>
>>>>>>>>>>>>> We have a large number of Spark applications on version 1.6.0.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Mar 22, 2018 at 6:38 AM, Holden Karau <
>>>>>>>>>>>>> holden@pigscanfly.ca> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Super exciting! I look forward to digging through it this
>>>>>>>>>>>>>> weekend.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Mar 21, 2018 at 9:33 PM ☼ R Nair (रविशंकर नायर) <
>>>>>>>>>>>>>> ravishankar.nair@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Excellent. You filled a missing link.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Passion
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Mar 21, 2018 at 11:36 PM, Rohit Karlupia <
>>>>>>>>>>>>>>> rohitk@qubole.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Happy to announce the availability of Sparklens as open
>>>>>>>>>>>>>>>> source project. It helps in understanding the  scalability limits of spark
>>>>>>>>>>>>>>>> applications and can be a useful guide on the path towards tuning
>>>>>>>>>>>>>>>> applications for lower runtime or cost.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Please clone from here: https://github.com/qubole/sparklens
>>>>>>>>>>>>>>>> Old blogpost: https://www.qubole.com/blog/introducing-quboles-spark-tuning-tool/
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>>>>> rohitk
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> PS: Thanks for the patience. It took couple of months to
>>>>>>>>>>>>>>>> get back on this.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Shmuel Blitz
>>>>>>>>>>> Big Data Developer
>>>>>>>>>>> Email: shmuel.blitz@similarweb.com
>>>>>>>>>>> www.similarweb.com
>>>>>>>>>>> <https://www.facebook.com/SimilarWeb/>
>>>>>>>>>>> <https://www.linkedin.com/company/429838/>
>>>>>>>>>>> <https://twitter.com/similarweb>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>
>
> --
> Take Care
> Fawze Abujaber
>

Re: Open sourcing Sparklens: Qubole's Spark Tuning Tool

Posted by Fawze Abujaber <fa...@gmail.com>.

Hi Rohit,

I would like to thank you for the unlimited patience and support that you
are providing here and behind the scenes for all of us.

The tool is amazing and easy to use, and most of the metrics are easy to
understand.

As for whether we need to run it in cluster mode and all the time, I think
we can skip that: one or a few runs give you the big picture of how the job
behaves with different configurations, and it's not too complicated to run
it using spark-submit.

I think it would be very helpful if Sparklens could also show how the job
would run with different configurations of cores and memory. A Spark job
with 1 executor and 1 core will run differently from one with 1 executor
and 3 cores, and the same surely holds when comparing different executor
memory settings.

Overall, it is a very good starting point, but getting these metrics into
the tool would be a GAME CHANGER.

@Rohit, a huge THANK YOU

On Mon, Mar 26, 2018 at 1:35 PM, Rohit Karlupia <ro...@qubole.com> wrote:

> Hi Shmuel,
>
> In general it is hard to pinpoint the exact code that is responsible for
> a specific stage. For example, when using Spark SQL, depending upon the
> kinds of joins and aggregations used in a single line of a query, we will
> have multiple stages in the Spark application. I usually try to split the
> code into smaller chunks and also use the Spark UI, which has a special
> section for SQL. It can also show specific backtraces, but as I explained
> earlier they might not be very helpful. Sparklens does help you ask the
> right questions, but it is not mature enough to answer all of them.
>
> Understanding the report:
>
> *1) The first part shows the total aggregate metrics for the application.*
>
> Printing application meterics.....
>
>  AggregateMetrics (Application Metrics) total measurements 1869
>                 NAME                        SUM                MIN           MAX                MEAN
>  diskBytesSpilled                            0.0 KB         0.0 KB         0.0 KB              0.0 KB
>  executorRuntime                            15.1 hh         3.0 ms         4.0 mm             29.1 ss
>  inputBytesRead                             26.1 GB         0.0 KB        43.8 MB             14.3 MB
>  jvmGCTime                                  11.0 mm         0.0 ms         2.1 ss            354.0 ms
>  memoryBytesSpilled                        314.2 GB         0.0 KB         1.1 GB            172.1 MB
>  outputBytesWritten                          0.0 KB         0.0 KB         0.0 KB              0.0 KB
>  peakExecutionMemory                         0.0 KB         0.0 KB         0.0 KB              0.0 KB
>  resultSize                                 12.9 MB         2.0 KB        40.9 KB              7.1 KB
>  shuffleReadBytesRead                      107.7 GB         0.0 KB       276.0 MB             59.0 MB
>  shuffleReadFetchWaitTime                    2.0 ms         0.0 ms         0.0 ms              0.0 ms
>  shuffleReadLocalBlocks                       2,318              0             68                   1
>  shuffleReadRecordsRead               3,413,511,099              0      8,251,926           1,826,383
>  shuffleReadRemoteBlocks                    291,126              0            824                 155
>  shuffleWriteBytesWritten                  107.6 GB         0.0 KB       257.6 MB             58.9 MB
>  shuffleWriteRecordsWritten           3,408,133,175              0      7,959,055           1,823,506
>  shuffleWriteTime                            8.7 mm         0.0 ms         1.8 ss            278.2 ms
>  taskDuration                               15.4 hh        12.0 ms         4.1 mm             29.7 ss
>
>
> *2) Here we show the number of hosts used and the executors per host. I have seen users set executor memory to 33GB on a 64GB machine, which directly wastes 31GB of memory.*
>
> Total Hosts 135
>
>
> Host server86.cluster.com startTime 02:26:21:081 executors count 3
> Host server164.cluster.com startTime 02:30:12:204 executors count 1
> Host server28.cluster.com startTime 02:31:09:023 executors count 1
> Host server78.cluster.com startTime 02:26:08:844 executors count 5
> Host server124.cluster.com startTime 02:26:10:523 executors count 3
> Host server100.cluster.com startTime 02:30:24:073 executors count 1
> Done printing host timeline
> *3) Time at which executors were added. Not all executors are available at the start of the application.*
>
> Printing executors timeline....
> Total Hosts 135
> Total Executors 250
> At 02:26 executors added 52 & removed  0 currently available 52
> At 02:27 executors added 10 & removed  0 currently available 62
> At 02:28 executors added 13 & removed  0 currently available 75
> At 02:29 executors added 81 & removed  0 currently available 156
> At 02:30 executors added 48 & removed  0 currently available 204
> At 02:31 executors added 45 & removed  0 currently available 249
> At 02:32 executors added 1 & removed  0 currently available 250
>
>
> *4) How the stages within the jobs were scheduled. Helps you understand which stages ran in parallel and which are dependent on others.*
>
> Printing Application timeline
> 02:26:47:654      Stage 3 ended : maxTaskTime 3117 taskCount 1
> 02:26:47:708      Stage 4 started : duration 00m 02s
> 02:26:49:898      Stage 4 ended : maxTaskTime 226 taskCount 200
> 02:26:49:901 JOB 3 ended
> 02:26:56:234 JOB 4 started : duration 08m 28s
> [      5 |||||||                                                                         ]
> [      6  |||||||||||||||||||                                                            ]
> [      9                   ||||||||                                                      ]
> [     10     ||||||||||||||                                                              ]
> [     11                                                                                 ]
> [     12                     ||                                                          ]
> [     13                       ||||                                                      ]
> [     14                           |||||||||||||||                                       ]
> [     15                                          |||||||||||||||||||||||||||||||||||||| ]
> 02:26:58:095      Stage 5 started : duration 00m 44s
> 02:27:42:816      Stage 5 ended : maxTaskTime 37214 taskCount 23
> 02:27:03:478      Stage 6 started : duration 02m 04s
> 02:29:07:517      Stage 6 ended : maxTaskTime 35578 taskCount 601
> 02:28:56:449      Stage 9 started : duration 00m 46s
> 02:29:42:625      Stage 9 ended : maxTaskTime 7196 taskCount 200
> 02:27:22:343      Stage 10 started : duration 01m 33s
> 02:28:56:333      Stage 10 ended : maxTaskTime 49203 taskCount 39
> 02:27:23:910      Stage 11 started : duration 00m 00s
> 02:27:24:422      Stage 11 ended : maxTaskTime 298 taskCount 2
> 02:29:06:902      Stage 12 started : duration 00m 12s
> 02:29:19:350      Stage 12 ended : maxTaskTime 11511 taskCount 200
> 02:29:19:413      Stage 13 started : duration 00m 25s
> 02:29:44:444      Stage 13 ended : maxTaskTime 24924 taskCount 200
> 02:29:44:491      Stage 14 started : duration 01m 36s
> 02:31:20:873      Stage 14 ended : maxTaskTime 86194 taskCount 200
> 02:31:20:973      Stage 15 started : duration 04m 03s
> 02:35:24:346      Stage 15 ended : maxTaskTime 238747 taskCount 200
> 02:35:24:347 JOB 4 ended
> 02:35:28:841 app ended
> *5) I guess these metrics are well explained *
>
>
>  Time spent in Driver vs Executors
>  Driver WallClock Time    01m 02s   10.66%
>  Executor WallClock Time  08m 43s   89.34%
>  Total WallClock Time     09m 46s
>
>
>
> Minimum possible time for the app based on the critical path (with infinite resources)   07m 59s
> Minimum possible time for the app with same executors, perfect parallelism and zero skew 02m 15s
> If we were to run this app with single executor and single core                          15h 08m
>
>
>  Total cores available to the app 750
>
>  OneCoreComputeHours: Measure of total compute power available from cluster. One core in the executor, running
>                       for one hour, counts as one OneCoreComputeHour. Executors with 4 cores, will have 4 times
>                       the OneCoreComputeHours compared to one with just one core. Similarly, one core executor
>                       running for 4 hours will OnCoreComputeHours equal to 4 core executor running for 1 hour.
>
>  Driver Utilization (Cluster idle because of driver)
>
>  Total OneCoreComputeHours available                            122h 07m
>  Total OneCoreComputeHours available (AutoScale Aware)           77h 25m
>  OneCoreComputeHours wasted by driver                            13h 01m
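The OneCoreComputeHours definition is simply cores multiplied by wall-clock time. The sketch below applies it to the 750 cores and the whole-second wall-clock times printed earlier in this report. The function names are hypothetical. Because the inputs here are rounded to whole seconds, the results land a few minutes away from the reported 122h 07m and 13h 01m, presumably because the tool works from millisecond timestamps.

```python
def one_core_compute_seconds(cores: int, wall_clock_seconds: int) -> int:
    """Compute capacity in core-seconds: one core for one second counts as 1."""
    return cores * wall_clock_seconds


def as_hours_minutes(core_seconds: int) -> str:
    """Format core-seconds in the report's 'NNh MMm' style (seconds truncated)."""
    minutes = core_seconds // 60
    return f"{minutes // 60}h {minutes % 60:02d}m"


# A 4-core executor running for 1 hour equals a 1-core executor running for 4 hours.
assert one_core_compute_seconds(4, 3600) == one_core_compute_seconds(1, 4 * 3600)

total_wall_clock = 9 * 60 + 46    # "Total WallClock Time 09m 46s"
driver_wall_clock = 1 * 60 + 2    # "Driver WallClock Time 01m 02s"

print(as_hours_minutes(one_core_compute_seconds(750, total_wall_clock)))   # 122h 05m
print(as_hours_minutes(one_core_compute_seconds(750, driver_wall_clock)))  # 12h 55m
```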
>
>  AutoScale Aware: Most of the calculations by this tool will assume that all executors are available throughout
>                   the runtime of the application. The number above is printed to show possible caution to be
>                   taken in interpreting the efficiency metrics.
>
>  Cluster Utilization (Executors idle because of lack of tasks or skew)
>
>  Executor OneCoreComputeHours available                 109h 06m
>  Executor OneCoreComputeHours used                       15h 07m        13.86%
>  OneCoreComputeHours wasted                              93h 59m        86.14%
>
>  App Level Wastage Metrics (Driver + Executor)
>
>  OneCoreComputeHours wasted Driver               10.66%
>  OneCoreComputeHours wasted Executor             76.96%
>  OneCoreComputeHours wasted Total                87.62%
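One reading that reproduces the three app-level percentages: the driver share is the driver's fraction of wall-clock time, and the executor share is the executor-phase wastage (86.14%) scaled by the executor's fraction of wall-clock time (89.34%). This is inferred from the numbers in the report, not taken from the Sparklens source, and the function name is hypothetical.

```python
def app_level_wastage(driver_pct: float, executor_pct: float,
                      wasted_in_executor_phase_pct: float) -> tuple:
    """Return (driver, executor, total) wastage as percentages of the whole app."""
    wasted_executor = wasted_in_executor_phase_pct * executor_pct / 100.0
    return driver_pct, wasted_executor, driver_pct + wasted_executor


driver, executor, total = app_level_wastage(
    driver_pct=10.66,                    # Driver WallClock Time share
    executor_pct=89.34,                  # Executor WallClock Time share
    wasted_in_executor_phase_pct=86.14,  # OneCoreComputeHours wasted (cluster utilization)
)
print(round(driver, 2), round(executor, 2), round(total, 2))  # 10.66 76.96 87.62
```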
>
>
>
> *6) Here we use simulation to estimate how the application's wall-clock time will vary as we change the number of executors. The goal is to run the application at 100% cluster utilization and in minimum time. Look for the ROI, in terms of wall-clock time, of additional executors. If the application is not scaling, this is a good time to revisit it and find out why.*
>
>  App completion time and cluster utilization estimates with different executor counts
>
>  Real App Duration 09m 46s
>  Model Estimation  08m 01s
>  Model Error       17%
>
>  NOTE: 1) Model error could be large when auto-scaling is enabled.
>        2) Model doesn't handles multiple jobs run via thread-pool. For better insights into
>           application scalability, please try such jobs one by one without thread-pool.
>
>
>  Executor count    25  ( 10%) estimated time 17m 07s and estimated cluster utilization 70.61%
>  Executor count    50  ( 20%) estimated time 12m 15s and estimated cluster utilization 49.34%
>  Executor count   125  ( 50%) estimated time 08m 25s and estimated cluster utilization 28.72%
>  Executor count   200  ( 80%) estimated time 08m 15s and estimated cluster utilization 18.29%
>  Executor count   250  (100%) estimated time 08m 01s and estimated cluster utilization 15.06%
>  Executor count   275  (110%) estimated time 08m 00s and estimated cluster utilization 13.72%
>  Executor count   300  (120%) estimated time 07m 59s and estimated cluster utilization 12.61%
>  Executor count   375  (150%) estimated time 07m 59s and estimated cluster utilization 10.09%
>  Executor count   500  (200%) estimated time 07m 59s and estimated cluster utilization 7.57%
>  Executor count   750  (300%) estimated time 07m 59s and estimated cluster utilization 5.04%
>  Executor count  1000  (400%) estimated time 07m 59s and estimated cluster utilization 3.78%
>  Executor count  1250  (500%) estimated time 07m 59s and estimated cluster utilization 3.03%
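The utilization column in this table can be approximated as the compute actually used divided by the compute that would be available at each executor count. The sketch below assumes 3 cores per executor (750 cores across 250 executors) and takes the 15h 07m of used OneCoreComputeHours from the cluster utilization section. It is a back-of-the-envelope check with hypothetical names, not the Sparklens simulation itself, and it matches the table only to within a few hundredths of a percentage point.

```python
USED_CORE_SECONDS = 15 * 3600 + 7 * 60   # "Executor OneCoreComputeHours used 15h 07m"
CORES_PER_EXECUTOR = 750 // 250          # assumption: cores are spread evenly


def estimated_utilization(executors: int, estimated_seconds: int) -> float:
    """Used core-seconds as a percentage of core-seconds available."""
    available = executors * CORES_PER_EXECUTOR * estimated_seconds
    return 100.0 * USED_CORE_SECONDS / available


# (executor count, estimated minutes, estimated seconds) copied from the table.
rows = [(25, 17, 7), (50, 12, 15), (125, 8, 25), (250, 8, 1), (500, 7, 59)]
for executors, minutes, seconds in rows:
    pct = estimated_utilization(executors, minutes * 60 + seconds)
    print(f"{executors:5d} executors -> about {pct:.1f}% utilization")
```

The work done stays fixed while capacity grows with the executor count, so once the estimated time stops improving (around 07m 59s here), each extra executor only lowers utilization.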
>
> *7) These two sections help find out which stages take most of the wall-clock time and why. The cause is either insufficient parallelism or skew. Parallelism is easier to fix. Fixing skew requires changing the application in a way that creates more uniform tasks.*
> Total tasks in all stages 1869
> Per Stage  Utilization
> Stage-ID   Wall    Task      Task     IO%    Input     Output    ----Shuffle-----    -WallClockTime-    --OneCoreComputeHours---   MaxTaskMem
>           Clock%  Runtime%   Count                               Input  |  Output    Measured | Ideal   Available| Used%|Wasted%
>        0    0.00    0.00         1    0.0   64.0 KB    0.0 KB    0.0 KB    0.0 KB    00m 02s   00m 00s    00h 27m    0.0  100.0    0.0 KB
>        1    0.00    0.00         1    0.0   64.0 KB    0.0 KB    0.0 KB    0.0 KB    00m 02s   00m 00s    00h 30m    0.1   99.9    0.0 KB
>        2    0.00    0.00         1    0.0   90.0 KB    0.0 KB    0.0 KB    0.0 KB    00m 03s   00m 00s    00h 37m    0.1   99.9    0.0 KB
>        3    0.00    0.01         1    0.0  867.1 KB    0.0 KB    0.0 KB  148.4 KB    00m 04s   00m 00s    01h 01m    0.1   99.9    0.0 KB
>        4    0.00    0.00       200    0.0    0.0 KB    0.0 KB  148.4 KB    0.0 KB    00m 02s   00m 00s    00h 27m    0.1   99.9    0.0 KB
>        5    6.00    1.15        23    0.2  402.1 MB    0.0 KB    0.0 KB    1.3 GB    00m 44s   00m 00s    09h 19m    1.9   98.1    0.0 KB
>        6   17.00   19.92       601    7.1   17.2 GB    0.0 KB    0.0 KB    1.8 GB    02m 04s   00m 14s    25h 50m   11.7   88.3    0.0 KB
>        9    6.00    0.73       200    2.9    6.9 GB    0.0 KB  409.5 MB    2.8 GB    00m 46s   00m 00s    09h 37m    1.2   98.8    0.0 KB
>       10   13.00    2.27        39    0.3  807.8 MB    0.0 KB    0.0 KB    2.5 GB    01m 33s   00m 01s    19h 34m    1.7   98.3    0.0 KB
>       11    0.00    0.00         2    0.0   31.5 KB    0.0 KB    0.0 KB   60.0 KB    00m 00s   00m 00s    00h 06m    0.1   99.9    0.0 KB
>       12    1.00    2.15       200    0.3  758.7 MB    0.0 KB    2.3 GB    1.5 GB    00m 12s   00m 01s    02h 35m   12.6   87.4    0.0 KB
>       13    3.00    5.91       200    0.0    0.0 KB    0.0 KB    1.5 GB   47.5 GB    00m 25s   00m 04s    05h 12m   17.1   82.9    0.0 KB
>       14   13.00   19.83       200    0.0    0.0 KB    0.0 KB   50.3 GB   50.3 GB    01m 36s   00m 14s    20h 04m   14.9   85.1    0.0 KB
>       15   34.00   48.02       200    0.0    0.0 KB    0.0 KB   53.2 GB    0.0 KB    04m 03s   00m 34s    50h 42m   14.3   85.7    0.0 KB
>
>
>  Stage-ID WallClock  OneCore       Task   PRatio    -----Task------   OIRatio  |* ShuffleWrite% ReadFetch%   GC%  *|
>           Stage%     ComputeHours  Count            Skew   StageSkew
>       0    0.32         00h 00m       1    0.00     1.00     0.37     0.00     |*   0.00           0.00    15.10  *|
>       1    0.35         00h 00m       1    0.00     1.00     0.38     0.00     |*   0.00           0.00    15.56  *|
>       2    0.43         00h 00m       1    0.00     1.00     0.45     0.00     |*   0.00           0.00     8.88  *|
>       3    0.70         00h 00m       1    0.00     1.00     0.63     0.17     |*   4.51           0.00     6.74  *|
>       4    0.31         00h 00m     200    0.27    37.67     0.10     0.00     |*   0.00           0.04    23.79  *|
>       5    6.38         00h 10m      23    0.03     1.42     0.83     3.18     |*   1.08           0.00     2.72  *|
>       6   17.68         03h 00m     601    0.80     2.07     0.29     0.10     |*   0.60           0.00     1.90  *|
>       9    6.58         00h 06m     200    0.27     5.20     0.16     0.38     |*   4.74          13.24     4.04  *|
>      10   13.40         00h 20m      39    0.05     1.67     0.52     3.17     |*   1.10           0.00     1.96  *|
>      11    0.07         00h 00m       2    0.00     1.00     0.58     1.91     |*  13.59           0.00     0.00  *|
>      12    1.77         00h 19m     200    0.27     1.99     0.92     0.50     |*   1.85          19.63     3.09  *|
>      13    3.57         00h 53m     200    0.27     1.59     1.00    31.42     |*   6.06          12.25     1.33  *|
>      14   13.74         02h 59m     200    0.27     1.65     0.89     1.00     |*   1.84           2.38     0.83  *|
>      15   34.69         07h 15m     200    0.27     1.88     0.98     0.00     |*   0.00           4.21     0.88  *|
>
> PRatio:        Number of tasks in stage divided by number of cores. Represents degree of
>                parallelism in the stage
> TaskSkew:      Duration of largest task in stage divided by duration of median task.
>                Represents degree of skew in the stage
> TaskStageSkew: Duration of largest task in stage divided by total duration of the stage.
>                Represents the impact of the largest task on stage time.
> OIRatio:       Output to input ration. Total output of the stage (results + shuffle write)
>                divided by total input (input data + shuffle read)
>
> These metrics below represent distribution of time within the stage
>
> ShuffleWrite:  Amount of time spent in shuffle writes across all tasks in the given
>                stage as a percentage
> ReadFetch:     Amount of time spent in shuffle read across all tasks in the given
>                stage as a percentage
> GC:            Amount of time spent in GC across all tasks in the given stage as a
>                percentage
>
> If the stage contributes large percentage to overall application time, we could look into
> these metrics to check which part (Shuffle write, read fetch or GC is responsible)
>
> thanks,
>
> rohitk
>
>
>
> On Mon, Mar 26, 2018 at 1:38 AM, Shmuel Blitz <shmuel.blitz@similarweb.com
> > wrote:
>
>> Hi Rohit,
>>
>> Thanks for the analysis.
>>
>> I can use repartition on the slow task. But how can I tell what part of
>> the code is in charge of the slow tasks?
>>
>> It would be great if you could further explain the rest of the output.
>>
>> Thanks in advance,
>> Shmuel
>>
>> On Sun, Mar 25, 2018 at 12:46 PM, Rohit Karlupia <ro...@qubole.com>
>> wrote:
>>
>>> Thanks Shmuel for trying out Sparklens!
>>>
>>> A couple of things that I noticed:
>>> 1) 250 executors is probably overkill for this job. It would run in the
>>> same time with around 100.
>>> 2) Many of the stages that take a long time have only 200 tasks, whereas
>>> we have 750 cores available for the job. 200 is the default value for
>>> spark.sql.shuffle.partitions. Alternatively, you could try increasing
>>> the value of spark.sql.shuffle.partitions to at least 750.
>>>
>>> thanks,
>>> rohitk
>>>
>>> On Sun, Mar 25, 2018 at 1:25 PM, Shmuel Blitz <
>>> shmuel.blitz@similarweb.com> wrote:
>>>
>>>> I ran it on a single job.
>>>> SparkLens has an overhead on the job duration. I'm not ready to enable
>>>> it by default on all our jobs.
>>>>
>>>> Attached is the output.
>>>>
>>>> Still trying to understand what exactly it means.
>>>>
>>>> On Sun, Mar 25, 2018 at 10:40 AM, Fawze Abujaber <fa...@gmail.com>
>>>> wrote:
>>>>
>>>>> Nice!
>>>>>
>>>>> Shmuel, were you able to run it at the cluster level or for a specific job?
>>>>>
>>>>> Did you configure it in spark-defaults.conf?
>>>>>
>>>>> On Sun, 25 Mar 2018 at 10:34 Shmuel Blitz <sh...@similarweb.com>
>>>>> wrote:
>>>>>
>>>>>> Just to let you know, I have managed to run SparkLens on our cluster.
>>>>>>
>>>>>> I switched to the spark_1.6 branch, and also compiled against the
>>>>>> specific image of Spark we are using (cdh5.7.6).
>>>>>>
>>>>>> Now I need to figure out what the output means... :P
>>>>>>
>>>>>> Shmuel
>>>>>>
>>>>>> On Fri, Mar 23, 2018 at 7:24 PM, Fawze Abujaber <fa...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Quick question:
>>>>>>>
>>>>>>> How do I add --jars /path/to/sparklens_2.11-0.1.0.jar to
>>>>>>> spark-defaults.conf? Should I use:
>>>>>>>
>>>>>>> spark.driver.extraClassPath /path/to/sparklens_2.11-0.1.0.jar
>>>>>>>
>>>>>>> or the spark.jars option? Could anyone give an example of how it
>>>>>>> should look, and say whether the path to the jar should be an HDFS
>>>>>>> path, as I'm using it in cluster mode?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Mar 23, 2018 at 6:33 AM, Fawze Abujaber <fa...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Shmuel,
>>>>>>>>
>>>>>>>> Did you compile the code from the right branch for Spark 1.6?
>>>>>>>>
>>>>>>>> I tested it and it seems to work, and I'm now testing the branch
>>>>>>>> more widely. Please use the branch for Spark 1.6.
>>>>>>>>
>>>>>>>> On Fri, Mar 23, 2018 at 12:43 AM, Shmuel Blitz <
>>>>>>>> shmuel.blitz@similarweb.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Rohit,
>>>>>>>>>
>>>>>>>>> Thanks for sharing this great tool.
>>>>>>>>> I tried running a spark job with the tool, but it failed with an
>>>>>>>>> *IncompatibleClassChangeError* exception.
>>>>>>>>>
>>>>>>>>> I have opened an issue on GitHub (https://github.com/qubole/sparklens/issues/1).
>>>>>>>>>
>>>>>>>>> Shmuel
>>>>>>>>>
>>>>>>>>> On Thu, Mar 22, 2018 at 5:05 PM, Shmuel Blitz <
>>>>>>>>> shmuel.blitz@similarweb.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks.
>>>>>>>>>>
>>>>>>>>>> We will give this a try and report back.
>>>>>>>>>>
>>>>>>>>>> Shmuel
>>>>>>>>>>
>>>>>>>>>> On Thu, Mar 22, 2018 at 4:22 PM, Rohit Karlupia <
>>>>>>>>>> rohitk@qubole.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks everyone!
>>>>>>>>>>> Please share how it works and how it doesn't. Both help.
>>>>>>>>>>>
>>>>>>>>>>> Fawze, I just made a few changes to make this work with Spark 1.6.
>>>>>>>>>>> Can you please try building from branch *spark_1.6*?
>>>>>>>>>>>
>>>>>>>>>>> thanks,
>>>>>>>>>>> rohitk
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Mar 22, 2018 at 10:18 AM, Fawze Abujaber <
>>>>>>>>>>> fawzeaj@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> It's super amazing. I see it was tested on Spark 2.0.0 and
>>>>>>>>>>>> above. What about Spark 1.6, which is still part of Cloudera's main versions?
>>>>>>>>>>>>
>>>>>>>>>>>> We have a large number of Spark applications on version 1.6.0.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Mar 22, 2018 at 6:38 AM, Holden Karau <
>>>>>>>>>>>> holden@pigscanfly.ca> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Super exciting! I look forward to digging through it this
>>>>>>>>>>>>> weekend.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Mar 21, 2018 at 9:33 PM ☼ R Nair (रविशंकर नायर) <
>>>>>>>>>>>>> ravishankar.nair@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Excellent. You filled a missing link.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> Passion
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Mar 21, 2018 at 11:36 PM, Rohit Karlupia <
>>>>>>>>>>>>>> rohitk@qubole.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Happy to announce the availability of Sparklens as open
>>>>>>>>>>>>>>> source project. It helps in understanding the  scalability limits of spark
>>>>>>>>>>>>>>> applications and can be a useful guide on the path towards tuning
>>>>>>>>>>>>>>> applications for lower runtime or cost.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Please clone from here: https://github.com/qubole/sparklens
>>>>>>>>>>>>>>> Old blogpost: https://www.qubole.com/blog/introducing-quboles-spark-tuning-tool/
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>>>> rohitk
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> PS: Thanks for the patience. It took couple of months to get
>>>>>>>>>>>>>>> back on this.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Shmuel Blitz
>>>>>>>>>> Big Data Developer
>>>>>>>>>> Email: shmuel.blitz@similarweb.com
>>>>>>>>>> www.similarweb.com
>>>>>>>>>> <https://www.facebook.com/SimilarWeb/>
>>>>>>>>>> <https://www.linkedin.com/company/429838/>
>>>>>>>>>> <https://twitter.com/similarweb>
>
>


-- 
Take Care
Fawze Abujaber

Re: Open sourcing Sparklens: Qubole's Spark Tuning Tool

Posted by Rohit Karlupia <ro...@qubole.com>.

Hi Shmuel,

In general it is hard to pinpoint the exact code responsible for a
specific stage. For example, when using Spark SQL, a single line of query
can produce multiple stages in the Spark application, depending on the
kinds of joins and aggregations used. I usually try to split the code
into smaller chunks and also use the Spark UI, which has a special section
for SQL. It can also show specific backtraces, but as I explained earlier
they might not be very helpful. Sparklens does help you ask the right
questions, but it is not mature enough to answer all of them.

Understanding the report:

*1) The first part shows total aggregate metrics for the application.*

Printing application meterics.....

 AggregateMetrics (Application Metrics) total measurements 1869
 NAME                                   SUM           MIN           MAX          MEAN
 diskBytesSpilled                    0.0 KB        0.0 KB        0.0 KB        0.0 KB
 executorRuntime                    15.1 hh        3.0 ms        4.0 mm       29.1 ss
 inputBytesRead                     26.1 GB        0.0 KB       43.8 MB       14.3 MB
 jvmGCTime                          11.0 mm        0.0 ms        2.1 ss      354.0 ms
 memoryBytesSpilled                314.2 GB        0.0 KB        1.1 GB      172.1 MB
 outputBytesWritten                  0.0 KB        0.0 KB        0.0 KB        0.0 KB
 peakExecutionMemory                 0.0 KB        0.0 KB        0.0 KB        0.0 KB
 resultSize                         12.9 MB        2.0 KB       40.9 KB        7.1 KB
 shuffleReadBytesRead              107.7 GB        0.0 KB      276.0 MB       59.0 MB
 shuffleReadFetchWaitTime            2.0 ms        0.0 ms        0.0 ms        0.0 ms
 shuffleReadLocalBlocks               2,318             0            68             1
 shuffleReadRecordsRead       3,413,511,099             0     8,251,926     1,826,383
 shuffleReadRemoteBlocks            291,126             0           824           155
 shuffleWriteBytesWritten          107.6 GB        0.0 KB      257.6 MB       58.9 MB
 shuffleWriteRecordsWritten   3,408,133,175             0     7,959,055     1,823,506
 shuffleWriteTime                    8.7 mm        0.0 ms        1.8 ss      278.2 ms
 taskDuration                       15.4 hh       12.0 ms        4.1 mm       29.7 ss


*2) Here we show the number of hosts used and the executors per host. I
have seen users set executor memory to 33GB on a 64GB host, which is a
direct waste of 31GB of memory.*

Total Hosts 135


Host server86.cluster.com startTime 02:26:21:081 executors count 3
Host server164.cluster.com startTime 02:30:12:204 executors count 1
Host server28.cluster.com startTime 02:31:09:023 executors count 1
Host server78.cluster.com startTime 02:26:08:844 executors count 5
Host server124.cluster.com startTime 02:26:10:523 executors count 3
Host server100.cluster.com startTime 02:30:24:073 executors count 1
Done printing host timeline
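
The 33GB example in section 2 is simple arithmetic, sketched below in Python. This is only an illustration: the function name is made up for this note, and it ignores OS and YARN overhead, which would reduce the usable memory further.

```python
def wasted_host_memory_gb(host_memory_gb, executor_memory_gb):
    """Memory left unused on a host after placing as many executors as fit.

    Simplification (assumption): no OS, container or off-heap overhead.
    """
    executors_per_host = host_memory_gb // executor_memory_gb
    return host_memory_gb - executors_per_host * executor_memory_gb


# The case from the text: only one 33GB executor fits on a 64GB host.
print(wasted_host_memory_gb(64, 33))  # 31

# A hypothetical alternative size: two 30GB executors fit, leaving 4GB.
print(wasted_host_memory_gb(64, 30))  # 4
```

Picking an executor size that divides the host memory more evenly is what removes the waste.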

*3) Times at which executors were added. Not all executors are
available at the start of the application.*

Printing executors timeline....
Total Hosts 135
Total Executors 250
At 02:26 executors added 52 & removed  0 currently available 52
At 02:27 executors added 10 & removed  0 currently available 62
At 02:28 executors added 13 & removed  0 currently available 75
At 02:29 executors added 81 & removed  0 currently available 156
At 02:30 executors added 48 & removed  0 currently available 204
At 02:31 executors added 45 & removed  0 currently available 249
At 02:32 executors added 1 & removed  0 currently available 250


*4) How the stages within the jobs were scheduled. Helps you
understand which stages ran in parallel and which are dependent on
others.*

Printing Application timeline
02:26:47:654      Stage 3 ended : maxTaskTime 3117 taskCount 1
02:26:47:708      Stage 4 started : duration 00m 02s
02:26:49:898      Stage 4 ended : maxTaskTime 226 taskCount 200
02:26:49:901 JOB 3 ended
02:26:56:234 JOB 4 started : duration 08m 28s
[      5 |||||||                                                                         ]
[      6  |||||||||||||||||||                                                            ]
[      9                   ||||||||                                                      ]
[     10     ||||||||||||||                                                              ]
[     11                                                                                 ]
[     12                     ||                                                          ]
[     13                       ||||                                                      ]
[     14                           |||||||||||||||                                       ]
[     15                                          |||||||||||||||||||||||||||||||||||||| ]
02:26:58:095      Stage 5 started : duration 00m 44s
02:27:42:816      Stage 5 ended : maxTaskTime 37214 taskCount 23
02:27:03:478      Stage 6 started : duration 02m 04s
02:29:07:517      Stage 6 ended : maxTaskTime 35578 taskCount 601
02:28:56:449      Stage 9 started : duration 00m 46s
02:29:42:625      Stage 9 ended : maxTaskTime 7196 taskCount 200
02:27:22:343      Stage 10 started : duration 01m 33s
02:28:56:333      Stage 10 ended : maxTaskTime 49203 taskCount 39
02:27:23:910      Stage 11 started : duration 00m 00s
02:27:24:422      Stage 11 ended : maxTaskTime 298 taskCount 2
02:29:06:902      Stage 12 started : duration 00m 12s
02:29:19:350      Stage 12 ended : maxTaskTime 11511 taskCount 200
02:29:19:413      Stage 13 started : duration 00m 25s
02:29:44:444      Stage 13 ended : maxTaskTime 24924 taskCount 200
02:29:44:491      Stage 14 started : duration 01m 36s
02:31:20:873      Stage 14 ended : maxTaskTime 86194 taskCount 200
02:31:20:973      Stage 15 started : duration 04m 03s
02:35:24:346      Stage 15 ended : maxTaskTime 238747 taskCount 200
02:35:24:347 JOB 4 ended
02:35:28:841 app ended

*5) I think these metrics are well explained in the output itself.*


 Time spent in Driver vs Executors
 Driver WallClock Time    01m 02s   10.66%
 Executor WallClock Time  08m 43s   89.34%
 Total WallClock Time     09m 46s



 Minimum possible time for the app based on the critical path (with infinite resources)   07m 59s
 Minimum possible time for the app with same executors, perfect parallelism and zero skew 02m 15s
 If we were to run this app with single executor and single core                          15h 08m
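
The 02m 15s figure above is consistent with a simple reading: driver time plus the single-core time spread evenly over all 750 cores. The Python sketch below checks that arithmetic. It is an assumption about how the number can be derived, not taken from the Sparklens source, and the function name is made up for this note.

```python
def ideal_runtime_seconds(driver_seconds, single_core_seconds, cores):
    """Driver time plus executor work divided evenly over all cores.

    Assumes perfect parallelism and zero skew, as the report line states.
    """
    return driver_seconds + single_core_seconds / cores


driver = 1 * 60 + 2                 # Driver WallClock Time: 01m 02s
single_core = 15 * 3600 + 8 * 60    # single executor, single core: 15h 08m
ideal = ideal_runtime_seconds(driver, single_core, 750)

print(f"{int(ideal // 60):02d}m {round(ideal % 60):02d}s")  # 02m 15s
```

The gap between this value and the 07m 59s critical-path figure is the part that more executors cannot remove.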


 Total cores available to the app 750

 OneCoreComputeHours: Measure of total compute power available from cluster. One core in the executor, running
                      for one hour, counts as one OneCoreComputeHour. Executors with 4 cores, will have 4 times
                      the OneCoreComputeHours compared to one with just one core. Similarly, one core executor
                      running for 4 hours will OnCoreComputeHours equal to 4 core executor running for 1 hour.
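
As a rough illustration of the definition above, OneCoreComputeHours is just cores multiplied by hours. The Python sketch below uses a made-up helper name and the whole-second durations printed in this report, so it only approximates the 122h 07m that Sparklens computes from exact timestamps.

```python
def one_core_compute_hours(cores, hours):
    """Compute capacity: one core running for one hour counts as 1.0."""
    return cores * hours


# A 4-core executor for 1 hour equals a 1-core executor for 4 hours.
print(one_core_compute_hours(4, 1) == one_core_compute_hours(1, 4))  # True

# 750 cores for the total wall-clock time of 09m 46s.
total_wallclock_hours = (9 * 60 + 46) / 3600
total_available = one_core_compute_hours(750, total_wallclock_hours)
print(round(total_available, 2))  # about 122 hours, close to the reported 122h 07m
```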

 Driver Utilization (Cluster idle because of driver)

 Total OneCoreComputeHours available                            122h 07m
 Total OneCoreComputeHours available (AutoScale Aware)           77h 25m
 OneCoreComputeHours wasted by driver                            13h 01m

 AutoScale Aware: Most of the calculations by this tool will assume that all executors are available throughout
                  the runtime of the application. The number above is printed to show possible caution to be
                  taken in interpreting the efficiency metrics.

 Cluster Utilization (Executors idle because of lack of tasks or skew)

 Executor OneCoreComputeHours available                 109h 06m
 Executor OneCoreComputeHours used                       15h 07m        13.86%
 OneCoreComputeHours wasted                              93h 59m        86.14%

 App Level Wastage Metrics (Driver + Executor)

 OneCoreComputeHours wasted Driver               10.66%
 OneCoreComputeHours wasted Executor             76.96%
 OneCoreComputeHours wasted Total                87.62%
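
The percentages in this section can be reproduced from the hour and minute figures printed above. The Python sketch below is a check of that arithmetic, with helper names made up for this note. Note that the executor wastage appears as 86.14% when measured against executor time and as 76.96% when measured against the total available time.

```python
def to_minutes(hours, minutes):
    """Convert an 'HHh MMm' figure from the report into minutes."""
    return hours * 60 + minutes


def percent(part, whole):
    return round(100.0 * part / whole, 2)


total_available = to_minutes(122, 7)     # Total OneCoreComputeHours available
wasted_by_driver = to_minutes(13, 1)     # OneCoreComputeHours wasted by driver
executor_available = to_minutes(109, 6)  # Executor OneCoreComputeHours available
executor_used = to_minutes(15, 7)        # Executor OneCoreComputeHours used
executor_wasted = to_minutes(93, 59)     # OneCoreComputeHours wasted

# Cluster utilization: relative to executor-available time.
print(percent(executor_used, executor_available))    # 13.86
print(percent(executor_wasted, executor_available))  # 86.14

# App-level wastage: relative to total available time.
print(percent(wasted_by_driver, total_available))    # 10.66
print(percent(executor_wasted, total_available))     # 76.96
print(percent(wasted_by_driver + executor_wasted, total_available))  # 87.62
```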



*6) Here we use simulation to estimate how the application's wall-clock
time will vary as we change the number of executors. The goal is to run
the application at 100% cluster utilization and minimum time. Look for
the ROI in wall-clock time from additional executors. Also, if the
application is not scaling, this is a good time to revisit the
application and find out why it is not scaling.*

 App completion time and cluster utilization estimates with different executor counts

 Real App Duration 09m 46s
 Model Estimation  08m 01s
 Model Error       17%

 NOTE: 1) Model error could be large when auto-scaling is enabled.
       2) Model doesn't handles multiple jobs run via thread-pool. For better insights into
          application scalability, please try such jobs one by one without thread-pool.


 Executor count    25  ( 10%) estimated time 17m 07s and estimated cluster utilization 70.61%
 Executor count    50  ( 20%) estimated time 12m 15s and estimated cluster utilization 49.34%
 Executor count   125  ( 50%) estimated time 08m 25s and estimated cluster utilization 28.72%
 Executor count   200  ( 80%) estimated time 08m 15s and estimated cluster utilization 18.29%
 Executor count   250  (100%) estimated time 08m 01s and estimated cluster utilization 15.06%
 Executor count   275  (110%) estimated time 08m 00s and estimated cluster utilization 13.72%
 Executor count   300  (120%) estimated time 07m 59s and estimated cluster utilization 12.61%
 Executor count   375  (150%) estimated time 07m 59s and estimated cluster utilization 10.09%
 Executor count   500  (200%) estimated time 07m 59s and estimated cluster utilization 7.57%
 Executor count   750  (300%) estimated time 07m 59s and estimated cluster utilization 5.04%
 Executor count  1000  (400%) estimated time 07m 59s and estimated cluster utilization 3.78%
 Executor count  1250  (500%) estimated time 07m 59s and estimated cluster utilization 3.03%
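
The utilization column above can be approximated by dividing the compute actually used (15h 07m) by the compute that would be available at each executor count. The Python sketch below is an assumption about the relationship, not the Sparklens implementation. It assumes 3 cores per executor (750 cores over 250 executors) and uses the rounded times from the report, so results differ slightly from the printed values.

```python
USED_CORE_SECONDS = 15 * 3600 + 7 * 60   # Executor OneCoreComputeHours used
CORES_PER_EXECUTOR = 750 // 250          # assumption: 3 cores per executor


def estimated_utilization_pct(executors, estimated_seconds):
    """Used compute as a percentage of the compute available for the run."""
    available = executors * CORES_PER_EXECUTOR * estimated_seconds
    return 100.0 * USED_CORE_SECONDS / available


# (executor count, estimated time in seconds) taken from the table above.
for executors, seconds in [(25, 17 * 60 + 7), (250, 8 * 60 + 1), (1250, 7 * 60 + 59)]:
    print(executors, round(estimated_utilization_pct(executors, seconds), 2))
```

Since the estimated time stops improving at around 07m 59s, every executor added beyond that point only lowers utilization.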

*7) These two sections help find which stages take most of the
wall-clock time and why. The cause is either insufficient parallelism or
skew. Parallelism is easier to fix. Fixing skew requires changing the
application in a way that creates more uniform tasks.*

Total tasks in all stages 1869
Per Stage  Utilization
Stage-ID   Wall    Task      Task     IO%    Input     Output    ----Shuffle-----    -WallClockTime-    --OneCoreComputeHours---   MaxTaskMem
          Clock%  Runtime%   Count                               Input  |  Output    Measured | Ideal   Available| Used%|Wasted%
       0    0.00    0.00         1    0.0   64.0 KB    0.0 KB    0.0 KB    0.0 KB    00m 02s   00m 00s    00h 27m    0.0  100.0    0.0 KB
       1    0.00    0.00         1    0.0   64.0 KB    0.0 KB    0.0 KB    0.0 KB    00m 02s   00m 00s    00h 30m    0.1   99.9    0.0 KB
       2    0.00    0.00         1    0.0   90.0 KB    0.0 KB    0.0 KB    0.0 KB    00m 03s   00m 00s    00h 37m    0.1   99.9    0.0 KB
       3    0.00    0.01         1    0.0  867.1 KB    0.0 KB    0.0 KB  148.4 KB    00m 04s   00m 00s    01h 01m    0.1   99.9    0.0 KB
       4    0.00    0.00       200    0.0    0.0 KB    0.0 KB  148.4 KB    0.0 KB    00m 02s   00m 00s    00h 27m    0.1   99.9    0.0 KB
       5    6.00    1.15        23    0.2  402.1 MB    0.0 KB    0.0 KB    1.3 GB    00m 44s   00m 00s    09h 19m    1.9   98.1    0.0 KB
       6   17.00   19.92       601    7.1   17.2 GB    0.0 KB    0.0 KB    1.8 GB    02m 04s   00m 14s    25h 50m   11.7   88.3    0.0 KB
       9    6.00    0.73       200    2.9    6.9 GB    0.0 KB  409.5 MB    2.8 GB    00m 46s   00m 00s    09h 37m    1.2   98.8    0.0 KB
      10   13.00    2.27        39    0.3  807.8 MB    0.0 KB    0.0 KB    2.5 GB    01m 33s   00m 01s    19h 34m    1.7   98.3    0.0 KB
      11    0.00    0.00         2    0.0   31.5 KB    0.0 KB    0.0 KB   60.0 KB    00m 00s   00m 00s    00h 06m    0.1   99.9    0.0 KB
      12    1.00    2.15       200    0.3  758.7 MB    0.0 KB    2.3 GB    1.5 GB    00m 12s   00m 01s    02h 35m   12.6   87.4    0.0 KB
      13    3.00    5.91       200    0.0    0.0 KB    0.0 KB    1.5 GB   47.5 GB    00m 25s   00m 04s    05h 12m   17.1   82.9    0.0 KB
      14   13.00   19.83       200    0.0    0.0 KB    0.0 KB   50.3 GB   50.3 GB    01m 36s   00m 14s    20h 04m   14.9   85.1    0.0 KB
      15   34.00   48.02       200    0.0    0.0 KB    0.0 KB   53.2 GB    0.0 KB    04m 03s   00m 34s    50h 42m   14.3   85.7    0.0 KB


 Stage-ID WallClock  OneCore       Task   PRatio    -----Task------   OIRatio  |* ShuffleWrite% ReadFetch%   GC%  *|
          Stage%     ComputeHours  Count            Skew   StageSkew
      0    0.32         00h 00m       1    0.00     1.00     0.37     0.00     |*   0.00           0.00    15.10  *|
      1    0.35         00h 00m       1    0.00     1.00     0.38     0.00     |*   0.00           0.00    15.56  *|
      2    0.43         00h 00m       1    0.00     1.00     0.45     0.00     |*   0.00           0.00     8.88  *|
      3    0.70         00h 00m       1    0.00     1.00     0.63     0.17     |*   4.51           0.00     6.74  *|
      4    0.31         00h 00m     200    0.27    37.67     0.10     0.00     |*   0.00           0.04    23.79  *|
      5    6.38         00h 10m      23    0.03     1.42     0.83     3.18     |*   1.08           0.00     2.72  *|
      6   17.68         03h 00m     601    0.80     2.07     0.29     0.10     |*   0.60           0.00     1.90  *|
      9    6.58         00h 06m     200    0.27     5.20     0.16     0.38     |*   4.74          13.24     4.04  *|
     10   13.40         00h 20m      39    0.05     1.67     0.52     3.17     |*   1.10           0.00     1.96  *|
     11    0.07         00h 00m       2    0.00     1.00     0.58     1.91     |*  13.59           0.00     0.00  *|
     12    1.77         00h 19m     200    0.27     1.99     0.92     0.50     |*   1.85          19.63     3.09  *|
     13    3.57         00h 53m     200    0.27     1.59     1.00    31.42     |*   6.06          12.25     1.33  *|
     14   13.74         02h 59m     200    0.27     1.65     0.89     1.00     |*   1.84           2.38     0.83  *|
     15   34.69         07h 15m     200    0.27     1.88     0.98     0.00     |*   0.00           4.21     0.88  *|

PRatio:        Number of tasks in stage divided by number of cores. Represents degree of
               parallelism in the stage
TaskSkew:      Duration of largest task in stage divided by duration of median task.
               Represents degree of skew in the stage
TaskStageSkew: Duration of largest task in stage divided by total duration of the stage.
               Represents the impact of the largest task on stage time.
OIRatio:       Output to input ration. Total output of the stage (results + shuffle write)
               divided by total input (input data + shuffle read)
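
To make two of these definitions concrete, the Python sketch below recomputes PRatio and TaskStageSkew for a few stages from numbers printed elsewhere in this report (task counts, maxTaskTime in milliseconds, and the stage start and end timestamps). The helper names are made up for this note. TaskSkew is left out because it needs the median task duration, which the report does not print.

```python
TOTAL_CORES = 750  # "Total cores available to the app"


def to_ms(timestamp):
    """Convert an 'HH:MM:SS:mmm' timestamp from the timeline to milliseconds."""
    hours, minutes, seconds, millis = map(int, timestamp.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def p_ratio(task_count, cores=TOTAL_CORES):
    """Tasks in the stage divided by available cores."""
    return task_count / cores


def task_stage_skew(max_task_ms, stage_start, stage_end):
    """Longest task duration divided by the stage's wall-clock duration."""
    return max_task_ms / (to_ms(stage_end) - to_ms(stage_start))


print(round(p_ratio(200), 2))   # 0.27, stages with 200 tasks
print(round(p_ratio(601), 2))   # 0.8, stage 6
print(round(task_stage_skew(238747, "02:31:20:973", "02:35:24:346"), 2))  # 0.98, stage 15
print(round(task_stage_skew(86194, "02:29:44:491", "02:31:20:873"), 2))   # 0.89, stage 14
```

A PRatio of 0.27 means roughly three quarters of the cores have nothing to do in that stage, and a TaskStageSkew near 1.0 means a single task accounts for nearly the whole stage duration.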

These metrics below represent distribution of time within the stage

ShuffleWrite:  Amount of time spent in shuffle writes across all tasks in the given
               stage as a percentage
ReadFetch:     Amount of time spent in shuffle read across all tasks in the given
               stage as a percentage
GC:            Amount of time spent in GC across all tasks in the given stage as a
               percentage

If the stage contributes large percentage to overall application time, we could look into
these metrics to check which part (Shuffle write, read fetch or GC is responsible)

thanks,

rohitk




Re: Open sourcing Sparklens: Qubole's Spark Tuning Tool

Posted by Shmuel Blitz <sh...@similarweb.com>.

Hi Rohit,

Thanks for the analysis.

I can use repartition for the slow tasks. But how can I tell which part of the
code is responsible for the slow tasks?

It would be great if you could further explain the rest of the output.

Thanks in advance,
Shmuel

On Sun, Mar 25, 2018 at 12:46 PM, Rohit Karlupia <ro...@qubole.com> wrote:

> Thanks Shmuel for trying out Sparklens!
>
> A couple of things that I noticed:
> 1) 250 executors is probably overkill for this job. It would run in the same
> time with around 100.
> 2) Many of the stages that take a long time have only 200 tasks, whereas we
> have 750 cores available for the job. 200 is the default value for
> spark.sql.shuffle.partitions. Alternatively, you could try increasing the
> value of spark.sql.shuffle.partitions to at least 750.
>
> thanks,
> rohitk


-- 
Shmuel Blitz
Big Data Developer
Email: shmuel.blitz@similarweb.com
www.similarweb.com
<https://www.facebook.com/SimilarWeb/>
<https://www.linkedin.com/company/429838/> <https://twitter.com/similarweb>

Re: Open sourcing Sparklens: Qubole's Spark Tuning Tool

Posted by Rohit Karlupia <ro...@qubole.com>.

Thanks Shmuel for trying out Sparklens!

A couple of things that I noticed:
1) 250 executors is probably overkill for this job. It would run in the same
time with around 100.
2) Many of the stages that take a long time have only 200 tasks, whereas we
have 750 cores available for the job. 200 is the default value for
spark.sql.shuffle.partitions. Alternatively, you could try increasing the
value of spark.sql.shuffle.partitions to at least 750.
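
The arithmetic behind point 2 can be sketched roughly as follows. This is only
an illustration, assuming one task per core and no other stage running at the
same time; stage_waves is a hypothetical helper, not part of Spark or
Sparklens.

```python
import math


def stage_waves(num_tasks, total_cores):
    """Return (waves, idle_core_slots) for one stage.

    Assumes each task occupies exactly one core and that the stage has the
    whole cluster to itself (a simplification).
    """
    # Number of scheduling rounds needed to run every task once.
    waves = math.ceil(num_tasks / total_cores)
    # Core slots that stay unused across those rounds.
    idle_core_slots = waves * total_cores - num_tasks
    return waves, idle_core_slots


# 200 shuffle partitions (the default) on 750 cores: most cores have no task.
print(stage_waves(200, 750))
# 750 shuffle partitions on 750 cores: every core gets one task.
print(stage_waves(750, 750))
```

Under these assumptions a 200-task stage leaves 550 of the 750 cores without
work. The setting can be raised at submit time with
--conf spark.sql.shuffle.partitions=750.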

thanks,
rohitk

On Sun, Mar 25, 2018 at 1:25 PM, Shmuel Blitz <sh...@similarweb.com>
wrote:

> I ran it on a single job.
> Sparklens adds overhead to the job duration. I'm not ready to enable it
> by default on all our jobs.
>
> Attached is the output.
>
> Still trying to understand what exactly it means.

Re: Open sourcing Sparklens: Qubole's Spark Tuning Tool

Posted by Shmuel Blitz <sh...@similarweb.com>.

I ran it on a single job.
Sparklens adds overhead to the job duration. I'm not ready to enable it
by default on all our jobs.

Attached is the output.

Still trying to understand what exactly it means.

On Sun, Mar 25, 2018 at 10:40 AM, Fawze Abujaber <fa...@gmail.com> wrote:

> Nice!
>
> Shmuel, were you able to run it at the cluster level or only for a specific job?
>
> Did you configure it in spark-defaults.conf?


-- 
Shmuel Blitz
Big Data Developer
Email: shmuel.blitz@similarweb.com
www.similarweb.com
<https://www.facebook.com/SimilarWeb/>
<https://www.linkedin.com/company/429838/> <https://twitter.com/similarweb>

Re: Open sourcing Sparklens: Qubole's Spark Tuning Tool

Posted by Fawze Abujaber <fa...@gmail.com>.

Nice!

Shmuel, were you able to run it at the cluster level or only for a specific job?

Did you configure it in spark-defaults.conf?

On Sun, 25 Mar 2018 at 10:34 Shmuel Blitz <sh...@similarweb.com>
wrote:

> Just to let you know, I have managed to run Sparklens on our cluster.
>
> I switched to the spark_1.6 branch, and also compiled against the specific
> image of Spark we are using (cdh5.7.6).
>
> Now I need to figure out what the output means... :P
>
> Shmuel

Re: Open sourcing Sparklens: Qubole's Spark Tuning Tool

Posted by Shmuel Blitz <sh...@similarweb.com>.

Just to let you know, I have managed to run Sparklens on our cluster.

I switched to the spark_1.6 branch, and also compiled against the specific
image of Spark we are using (cdh5.7.6).

Now I need to figure out what the output means... :P

Shmuel

On Fri, Mar 23, 2018 at 7:24 PM, Fawze Abujaber <fa...@gmail.com> wrote:

> Quick question:
>
> How should I add --jars /path/to/sparklens_2.11-0.1.0.jar to
> spark-defaults.conf? Should I use:
>
> spark.driver.extraClassPath /path/to/sparklens_2.11-0.1.0.jar
>
> or the spark.jars option? Could anyone give an example of how it should
> look, and say whether the path to the jar should be an HDFS path, since I'm
> running in cluster mode?


-- 
Shmuel Blitz
Big Data Developer
Email: shmuel.blitz@similarweb.com
www.similarweb.com
<https://www.facebook.com/SimilarWeb/>
<https://www.linkedin.com/company/429838/> <https://twitter.com/similarweb>

Re: Open sourcing Sparklens: Qubole's Spark Tuning Tool

Posted by Fawze Abujaber <fa...@gmail.com>.

Quick question:

How should I add --jars /path/to/sparklens_2.11-0.1.0.jar to
spark-defaults.conf? Should I use:

spark.driver.extraClassPath /path/to/sparklens_2.11-0.1.0.jar

or the spark.jars option? Could anyone give an example of how it should
look, and say whether the path to the jar should be an HDFS path, since I'm
running in cluster mode?
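
For readers with the same question, one possible shape of the entries is
sketched below. It is not an answer given in this thread. It assumes that
spark.jars is used as the configuration-file counterpart of --jars, that the
jar has been uploaded to HDFS (the hdfs:// prefix on the path is an
assumption), and that the listener class is
com.qubole.sparklens.QuboleJobListener registered through
spark.extraListeners, which should be checked against the Sparklens README for
the version in use. sparklens_defaults is a hypothetical helper that only
renders the text.

```python
# Hypothetical helper (not part of Spark or Sparklens): renders the
# spark-defaults.conf lines that would mirror the --jars command-line option.

def sparklens_defaults(jar_path, listener="com.qubole.sparklens.QuboleJobListener"):
    """Return spark-defaults.conf lines as 'key value' strings."""
    props = {
        # spark.jars is the configuration-file counterpart of --jars.
        "spark.jars": jar_path,
        # Assumed listener class name; verify it in the Sparklens README.
        "spark.extraListeners": listener,
    }
    return [f"{key} {value}" for key, value in props.items()]


# Assumption: the jar lives on HDFS so it can be fetched in cluster mode.
lines = sparklens_defaults("hdfs:///path/to/sparklens_2.11-0.1.0.jar")
print("\n".join(lines))
```

Whether spark.jars alone is enough on a Spark 1.6 cluster, or whether
spark.driver.extraClassPath with a local path is also needed, is best
confirmed by testing on the target cluster.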




On Fri, Mar 23, 2018 at 6:33 AM, Fawze Abujaber <fa...@gmail.com> wrote:

> Hi Shmuel,
>
> Did you compile the code from the right branch for Spark 1.6?
>
> I tested it and it appears to work; I'm now testing the branch more widely.
> Please use the branch for Spark 1.6.

Re: Open sourcing Sparklens: Qubole's Spark Tuning Tool

Posted by Fawze Abujaber <fa...@gmail.com>.

Hi Shmuel,

Did you compile the code against the right branch for Spark 1.6?

I tested it and it appears to work; I am now running a wider set of tests on
that branch. Please use the branch for Spark 1.6.

On Fri, Mar 23, 2018 at 12:43 AM, Shmuel Blitz <sh...@similarweb.com>
wrote:

> Hi Rohit,
>
> Thanks for sharing this great tool.
> I tried running a Spark job with the tool, but it failed with an
> *IncompatibleClassChangeError* exception.
>
> I have opened an issue on GitHub
> (https://github.com/qubole/sparklens/issues/1).
>
> Shmuel
>
> --
> Shmuel Blitz
> Big Data Developer
> Email: shmuel.blitz@similarweb.com
> www.similarweb.com
> <https://www.facebook.com/SimilarWeb/>
> <https://www.linkedin.com/company/429838/>
> <https://twitter.com/similarweb>
>

Re: Open sourcing Sparklens: Qubole's Spark Tuning Tool

Posted by Shmuel Blitz <sh...@similarweb.com>.

Hi Rohit,

Thanks for sharing this great tool.
I tried running a Spark job with the tool, but it failed with an
*IncompatibleClassChangeError* exception.

I have opened an issue on GitHub
(https://github.com/qubole/sparklens/issues/1).

Shmuel

On Thu, Mar 22, 2018 at 5:05 PM, Shmuel Blitz <sh...@similarweb.com>
wrote:

> Thanks.
>
> We will give this a try and report back.
>
> Shmuel
>
> --
> Shmuel Blitz
> Big Data Developer
> Email: shmuel.blitz@similarweb.com
> www.similarweb.com
> <https://www.facebook.com/SimilarWeb/>
> <https://www.linkedin.com/company/429838/>
> <https://twitter.com/similarweb>
>



-- 
Shmuel Blitz
Big Data Developer
Email: shmuel.blitz@similarweb.com
www.similarweb.com
<https://www.facebook.com/SimilarWeb/>
<https://www.linkedin.com/company/429838/> <https://twitter.com/similarweb>

Re: Open sourcing Sparklens: Qubole's Spark Tuning Tool

Posted by Shmuel Blitz <sh...@similarweb.com>.

Thanks.

We will give this a try and report back.

Shmuel

On Thu, Mar 22, 2018 at 4:22 PM, Rohit Karlupia <ro...@qubole.com> wrote:

> Thanks everyone!
> Please share how it works and how it doesn't. Both help.
>
> Fawze, I just made a few changes to make this work with Spark 1.6. Can you
> please try building from the branch *spark_1.6*?
>
> thanks,
> rohitk


-- 
Shmuel Blitz
Big Data Developer
Email: shmuel.blitz@similarweb.com
www.similarweb.com
<https://www.facebook.com/SimilarWeb/>
<https://www.linkedin.com/company/429838/> <https://twitter.com/similarweb>

Re: Open sourcing Sparklens: Qubole's Spark Tuning Tool

Posted by Rohit Karlupia <ro...@qubole.com>.

Thanks everyone!
Please share how it works and how it doesn't. Both help.

Fawze, I just made a few changes to make this work with Spark 1.6. Can you
please try building from the branch *spark_1.6*?

thanks,
rohitk



On Thu, Mar 22, 2018 at 10:18 AM, Fawze Abujaber <fa...@gmail.com> wrote:

> It's super amazing... I see it was tested on Spark 2.0.0 and above. What
> about Spark 1.6, which is still part of Cloudera's main versions?
>
> We have a large number of Spark applications on version 1.6.0.

Re: Open sourcing Sparklens: Qubole's Spark Tuning Tool

Posted by Fawze Abujaber <fa...@gmail.com>.

It's super amazing... I see it was tested on Spark 2.0.0 and above. What
about Spark 1.6, which is still part of Cloudera's main versions?

We have a large number of Spark applications on version 1.6.0.

On Thu, Mar 22, 2018 at 6:38 AM, Holden Karau <ho...@pigscanfly.ca> wrote:

> Super exciting! I look forward to digging through it this weekend.
>
> Twitter: https://twitter.com/holdenkarau
>

Re: Open sourcing Sparklens: Qubole's Spark Tuning Tool

Posted by Holden Karau <ho...@pigscanfly.ca>.

Super exciting! I look forward to digging through it this weekend.

On Wed, Mar 21, 2018 at 9:33 PM ☼ R Nair (रविशंकर नायर) <
ravishankar.nair@gmail.com> wrote:

> Excellent. You filled a missing link.
>
> Best,
> Passion
> --
Twitter: https://twitter.com/holdenkarau

Re: Open sourcing Sparklens: Qubole's Spark Tuning Tool

Posted by "☼ R Nair (रविशंकर नायर)" <ra...@gmail.com>.

Excellent. You filled a missing link.

Best,
Passion

On Wed, Mar 21, 2018 at 11:36 PM, Rohit Karlupia <ro...@qubole.com> wrote:

> Hi,
>
> Happy to announce the availability of Sparklens as open source project. It
> helps in understanding the  scalability limits of spark applications and
> can be a useful guide on the path towards tuning applications for lower
> runtime or cost.
>
> Please clone from here: https://github.com/qubole/sparklens
> Old blogpost:
> https://www.qubole.com/blog/introducing-quboles-spark-tuning-tool/
>
> thanks,
> rohitk
>
> PS: Thanks for the patience. It took couple of months to get back on this.
>