You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/08/19 07:27:53 UTC
[GitHub] [spark] kamalbanga edited a comment on issue #25449: [PYSPARK]
Simpler countByValue using collections' Counter
kamalbanga edited a comment on issue #25449: [PYSPARK] Simpler countByValue using collections' Counter
URL: https://github.com/apache/spark/pull/25449#issuecomment-521977070
I benchmarked it and the existing implementation is faster 🤦♂
```python
from pyspark import SparkContext, SparkConf
from random import choices
from collections import defaultdict, Counter
from contextlib import contextmanager
from operator import add
import time
MAX_NUM = int(1e9)
RDD_SIZE = int(1e7)
NUM_PARTITIONS = 4
@contextmanager
def timethis(snippet):
start = time.time()
yield
print(f'Duration of {snippet}: {(time.time() - start):.1f} seconds')
def countval1(iterator):
yield Counter(iterator)
def countval2(iterator):
counts = defaultdict(int)
for k in iterator:
counts[k] += 1
yield counts
def mergeMaps(m1, m2):
for k, v in m1.items():
m2[k] += v
return m2
random_integers = choices(population=range(MAX_NUM), k=RDD_SIZE)
sc = SparkContext(conf=SparkConf().setAppName('Benchmark'))
random_rdd = sc.parallelize(random_integers, NUM_PARTITIONS)
with timethis('Spark Counter'):
agg1 = random_rdd.mapPartitions(countval1).reduce(add)
with timethis('Spark defaultdict'):
agg2 = random_rdd.mapPartitions(countval2).reduce(mergeMaps)
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org