Posted to user@spark.apache.org by Pipster Neko <ot...@gmail.com> on 2019/06/05 03:05:02 UTC

Spark Streaming: Task not distributed

Hi,

I am curious how records are assigned to tasks, since, as you can see in
the screenshot below, one specific executor handles noticeably more tasks
than the others.
The setup is this (a small SparkConf sketch follows the list):

   - Spark version 2.3.1
   - The Spark Streaming job runs on Spark Standalone with the following
   configuration:
      - spark.cores.max: 105
      - executor-memory: 4G
      - driver-memory: 2G
      - spark.memory.storageFraction: 0.1
      - spark.streaming.kafka.maxRatePerPartition: 15000
      - batch duration: 20 seconds
   - Each batch finishes in ~9 seconds and consumes ~800k records
   - The Spark Standalone cluster has:
      - workers: 15 (8 cores, 30G memory per worker)
      - cores: 120
      - memory: 455.6G
   - The job consumes from a Kafka topic with 60 partitions
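
For reference, a minimal sketch of how those settings map onto SparkConf
(the app name is a placeholder; executor-memory and driver-memory are
passed on the spark-submit command line):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("streaming-top-n")                             // placeholder app name
  .set("spark.cores.max", "105")
  .set("spark.memory.storageFraction", "0.1")
  .set("spark.streaming.kafka.maxRatePerPartition", "15000")

// The 20-second batch duration is set on the StreamingContext (see the
// job sketch further down). With 60 Kafka partitions the rate limit caps
// a batch at 60 partitions * 15000 records/s * 20 s = 18,000,000 records,
// far above the ~800k records actually consumed per batch.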

The Spark Streaming job consumes records from Kafka using
org.apache.spark.streaming.kafka010.KafkaUtils; the record format is
JSON. The job applies map and filter transformations (the data type
being transformed is a class with 50 fields), does no repartitioning,
sinks the result to another topic with 60 partitions, and finally maps
the records to pairs (timestamp as key, class as value) ->
countByValue -> sorts by count and prints the top 10 records.
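
To make that concrete, here is a rough sketch of what the job does
(class, topic, and field names are placeholders, not the actual ones;
the JSON parsing and the Kafka sink are stubbed out, and the real class
has 50 fields rather than two):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

// stand-in for the real 50-field class
case class Event(timestamp: Long, payload: String)

object StreamingTopN {

  // placeholder: the real job parses the JSON record into the 50-field class
  def parseEvent(json: String): Option[Event] =
    Some(Event(System.currentTimeMillis(), json))

  // placeholder: the real job builds a Kafka producer per partition and
  // sends each record to the 60-partition output topic
  def writeToKafka(records: Iterator[Event]): Unit =
    records.foreach(_ => ())

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-top-n") // settings as listed above
    val ssc  = new StreamingContext(conf, Seconds(20))       // 20-second batches

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "kafka:9092",                  // placeholder brokers
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "streaming-top-n",
      "auto.offset.reset"  -> "latest"
    )

    // direct stream over the 60-partition source topic
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](Seq("source-topic"), kafkaParams) // placeholder topic
    )

    // map and filter only, no repartitioning
    val events = stream
      .map(record => parseEvent(record.value()))
      .filter(_.isDefined)
      .map(_.get)

    // sink the transformed records to the other 60-partition topic
    events.foreachRDD(rdd => rdd.foreachPartition(writeToKafka))

    // (timestamp, event) pairs -> countByValue -> sort by count -> top 10
    events.map(e => (e.timestamp, e))
      .countByValue()
      .foreachRDD { rdd =>
        rdd.sortBy(_._2, ascending = false).take(10).foreach(println)
      }

    ssc.start()
    ssc.awaitTermination()
  }
}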

I would like to tune and improve this job, and I hope someone can
explain this behavior and point me to where I should look.

[image: Spark UI Executors page screenshot:
screencapture-aratupstream201-prod-hnd2-bdd-local-4040-executors-2019-06-04-18_28_59.png]

Thanks in advance!

A