Posted to reviews@spark.apache.org by staple <gi...@git.apache.org> on 2014/09/11 18:09:01 UTC

[GitHub] spark pull request: [SPARK-3488][MLLIB] Cache python RDDs after de...

GitHub user staple opened a pull request:

    https://github.com/apache/spark/pull/2362

    [SPARK-3488][MLLIB] Cache python RDDs after deserialization for relevant iterative learners.

    When running an iterative learning algorithm, it makes sense for the input RDD to be cached for improved performance. Previously, when learning was applied to a python RDD, the serialized python RDD was always cached; in scala, that cached RDD was then mapped to an uncached RDD of deserialized objects, and this uncached RDD was passed to the learning algorithm. Because the RDD of deserialized data was uncached, learning algorithms implicitly deserialized the same data repeatedly, on every iteration.
    
    This patch moves RDD caching after deserialization for learning algorithms that should be called with a cached RDD. For algorithms that implement their own caching internally, the input RDD is no longer cached. Below I’ve listed each learning routine accessible from python, the location where caching was previously enabled, and the location (if any) where caching is now enabled by this patch; a minimal sketch of the before/after caching placement follows the list.
    
    LogisticRegressionWithSGD:
    was: python (in _regression_train_wrapper/_get_unmangled_labeled_point_rdd)
    now: jvm (trainRegressionModel)
    
    SVMWithSGD:
    was: python (in _regression_train_wrapper/_get_unmangled_labeled_point_rdd)
    now: jvm (trainRegressionModel)
    
    NaiveBayes:
    was: python (in _get_unmangled_labeled_point_rdd)
    now: none
    
    KMeans:
    was: python (in _get_unmangled_double_vector_rdd)
    now: jvm (trainKMeansModel)
    
    ALS:
    was: python (in _get_unmangled_rdd)
    now: none
    
    LinearRegressionWithSGD:
    was: python (in _regression_train_wrapper/_get_unmangled_labeled_point_rdd)
    now: jvm (trainRegressionModel)
    
    LassoWithSGD:
    was: python (in _regression_train_wrapper/_get_unmangled_labeled_point_rdd)
    now: jvm (trainRegressionModel)
    
    RidgeRegressionWithSGD:
    was: python (in _regression_train_wrapper/_get_unmangled_labeled_point_rdd)
    now: jvm (trainRegressionModel)
    
    DecisionTree:
    was: python (in _get_unmangled_labeled_point_rdd)
    now: none
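
    For illustration, here is a minimal, hypothetical Scala sketch of the before/after caching placement. It is not the actual PythonMLLibAPI code; the object and method names (`CachingSketch`, `deserialize`, `before`, `after`) are placeholders only.

        import org.apache.spark.mllib.regression.LabeledPoint
        import org.apache.spark.rdd.RDD

        object CachingSketch {
          // Stand-in for the real python -> JVM SerDe step.
          def deserialize(bytes: Array[Byte]): LabeledPoint = ???

          // Before: the serialized RDD arrived already cache()-ed from python, but the
          // mapped RDD of LabeledPoints was uncached, so each iteration re-deserialized it.
          def before(serialized: RDD[Array[Byte]]): RDD[LabeledPoint] =
            serialized.map(deserialize)

          // After: python no longer caches; the jvm wrapper caches the deserialized RDD
          // before handing it to the iterative learner, so deserialization happens once.
          def after(serialized: RDD[Array[Byte]]): RDD[LabeledPoint] =
            serialized.map(deserialize).cache()
        }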

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/staple/spark SPARK-3488

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2362.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2362
    
----
commit 7042ebc2214f13c7d5d5acd28fcfa0478c1ddf2c
Author: Aaron Staple <aa...@gmail.com>
Date:   2014-09-11T05:11:11Z

    [SPARK-3488][MLLIB] Cache python RDDs after deserialization for relevant iterative learners.

----




[GitHub] spark pull request: [SPARK-3488][MLLIB] Cache python RDDs after de...

Posted by staple <gi...@git.apache.org>.
Github user staple commented on the pull request:

    https://github.com/apache/spark/pull/2362#issuecomment-55539838
  
    I ran a simple logistic regression performance test on my local machine (Ubuntu desktop w/ 8 GB RAM). I used two data sizes: 2m records, which was not memory constrained, and 10m records, which was memory constrained (generating log messages such as `CacheManager: Not enough space to cache partition`). I tested without this patch, with this patch, and with a modified version of this patch using `MEMORY_ONLY_SER` to persist the deserialized objects. Here are the results (each reported runtime, in seconds, is the mean of 3 runs):
    
    2m records:
    master | 47.9099563758
    w/ patch | 32.1143682798
    w/ MEMORY_ONLY_SER | 79.4589416981
    
    10m records:
    master | 2130.3178509871
    w/ patch | 3232.856136322
    w/ MEMORY_ONLY_SER | 2772.3923886617
    
    It looks like, when running in memory, this patch provides a 33% speed improvement over master, while the `MEMORY_ONLY_SER` version is 66% slower than master. In the test case with insufficient memory to keep all the `cache()`-ed training rdd partitions cached at once, this patch is 52% slower than master while `MEMORY_ONLY_SER` is 30% slower.
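
    For reference, here are the three storage configurations being compared, as a rough Scala sketch (not the benchmark code itself); `points` stands for the deserialized LabeledPoint rdd inside the jvm wrapper, and the mode names are just labels for this comparison.

        import org.apache.spark.mllib.regression.LabeledPoint
        import org.apache.spark.rdd.RDD
        import org.apache.spark.storage.StorageLevel

        def applyStorage(points: RDD[LabeledPoint], mode: String): RDD[LabeledPoint] =
          mode match {
            case "master" => points                                       // deserialized rdd left uncached
            case "patch"  => points.cache()                               // MEMORY_ONLY, deserialized objects
            case "ser"    => points.persist(StorageLevel.MEMORY_ONLY_SER) // re-serialized in the jvm
            case other    => sys.error("unknown mode: " + other)
          }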
    
    I’m not that familiar with the typical mllib memory profile. Do you think the in-memory result here would be similar to a real world run?
    
    Finally, here are the test scripts. Let me know if they seem reasonable. The data generation was roughly inspired by your mllib perf test in spark-perf.
    
    Data generation:
    
        import random

        from pyspark import SparkContext
        from pyspark.mllib.regression import LabeledPoint

        class NormalGenerator:
            """Draws values from a normal distribution with randomly chosen mu and sigma."""
            def __init__(self):
                self.mu = random.random()
                self.sigma = random.random()

            def __call__(self, rnd):
                return rnd.normalvariate(self.mu, self.sigma)

        class PointGenerator:
            """Generates 5-feature LabeledPoints from one of two class-conditional distributions."""
            def __init__(self):
                self.generators = [[NormalGenerator() for _ in range(5)] for _ in range(2)]

            def __call__(self, rnd):
                label = rnd.choice([0, 1])
                return LabeledPoint(float(label), [g(rnd) for g in self.generators[label]])

        pointGenerator = PointGenerator()
        sc = SparkContext()

        def generatePoints(n):
            # Generate n points across 10 partitions, each seeded by its partition index.
            def generateData(index):
                rnd = random.Random(hash(str(index)))
                for _ in range(n / 10):
                    yield pointGenerator(rnd)

            points = sc.parallelize(range(10), 10).flatMap(generateData)
            print points.count()  # force evaluation and report the number of generated points
            points.saveAsPickleFile('logistic%.0e' % n)

        generatePoints(int(2e6))
        generatePoints(int(1e7))
    
    Test:
    
        import sys
        import time

        from pyspark import SparkContext
        from pyspark.mllib.classification import LogisticRegressionWithSGD

        sc = SparkContext()
        # Load the pickled LabeledPoint RDD written by the data generation script.
        points = sc.pickleFile(sys.argv[1])
        start = time.time()
        model = LogisticRegressionWithSGD.train(points, 100)  # 100 iterations
        print 'Runtime: %s' % (time.time() - start)
        print model.weights





[GitHub] spark pull request: [SPARK-3488][MLLIB] Cache python RDDs after de...

Posted by staple <gi...@git.apache.org>.
Github user staple commented on the pull request:

    https://github.com/apache/spark/pull/2362#issuecomment-55636095
  
    @davies understood, thanks for the feedback. It sounds like for now the preference is to continue caching the python serialized version because the reduced memory footprint is currently worth the cpu cost of repeated deserialization.
    
    Would it make sense to preserve the portions of this patch that drop caching for the NaiveBayes, ALS, and DecisionTree learners, which I do not believe require external caching to prevent repeated RDD re-evaluation during learning? NaiveBayes only evaluates its input RDD once, while ALS and DecisionTree internally persist transformations of their input RDDs.




[GitHub] spark pull request: [SPARK-3488][MLLIB] Cache python RDDs after de...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2362#issuecomment-55287917
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-3488][MLLIB] Cache python RDDs after de...

Posted by staple <gi...@git.apache.org>.
Github user staple commented on the pull request:

    https://github.com/apache/spark/pull/2362#issuecomment-55762293
  
    @davies I created a separate PR for disabling automatic caching for some learners: https://github.com/apache/spark/pull/2412




[GitHub] spark pull request: [SPARK-3488][MLLIB] Cache python RDDs after de...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2362#issuecomment-55308253
  
    I think you could pick whichever algorithm you think will show the most difference.
    
    For the repeated warning, it's probably not hard to make it show only once.




[GitHub] spark pull request: [SPARK-3488][MLLIB] Cache python RDDs after de...

Posted by staple <gi...@git.apache.org>.
Github user staple commented on the pull request:

    https://github.com/apache/spark/pull/2362#issuecomment-55682979
  
    For the PR code, it looks like on each training iteration there are messages about not being able to cache partitions rdd_4_5 through rdd_4_27.
    
    For the master code, it looks like on each training iteration there are messages about not being able to cache partitions rdd_3_13 and rdd_3_15 through rdd_3_27.
    
    It looks to me like a greater proportion of the data can be cached in master, and the set of cached partitions seems consistent across all training iterations.




[GitHub] spark pull request: [SPARK-3488][MLLIB] Cache python RDDs after de...

Posted by staple <gi...@git.apache.org>.
Github user staple commented on the pull request:

    https://github.com/apache/spark/pull/2362#issuecomment-55681746
  
    Just to make sure I understand, are you saying you don't expect to see the measured performance hit with this PR? If so, I can investigate further.




[GitHub] spark pull request: [SPARK-3488][MLLIB] Cache python RDDs after de...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2362#issuecomment-55552191
  
    The benchmark results sound reasonable, thanks for confirming them. Caching the RDD in serialized form reduces memory usage and GC pressure, but adds some CPU overhead. Also, caching the serialized data coming from Python is better than re-serializing it in the JVM (master is always faster than w/ MEMORY_ONLY_SER).
    
    I think memory usage (sometimes running out of memory, which affects stability) matters more than CPU right now, so I would like to hold off on this change and maybe revisit it in the future.




[GitHub] spark pull request: [SPARK-3488][MLLIB] Cache python RDDs after de...

Posted by staple <gi...@git.apache.org>.
Github user staple closed the pull request at:

    https://github.com/apache/spark/pull/2362




[GitHub] spark pull request: [SPARK-3488][MLLIB] Cache python RDDs after de...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2362#issuecomment-55637953
  
    > Would it make sense to preserve the portions of this patch that drop caching for the NaiveBayes, ALS, and DecisionTree learners, which I do not believe require external caching to prevent repeated RDD re-evaluation during learning? NaiveBayes only evaluates its input RDD once, while ALS and DecisionTree internally persist transformations of their input RDDs.
    
    Those changes are helpful; you could do them in another PR.




[GitHub] spark pull request: [SPARK-3488][MLLIB] Cache python RDDs after de...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2362#issuecomment-55297787
  
    Can you run some benchmarks to show the difference?
    
    I doubt that caching the deserialized objects will be better than caching the serialized data; the serialized form relieves GC pressure a lot. That is why Spark SQL does it this way: the columns are serialized (and maybe compressed) for caching.
    
    Also, there are some cases where the cache is "none" after this patch; what does that mean?




[GitHub] spark pull request: [SPARK-3488][MLLIB] Cache python RDDs after de...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2362#issuecomment-55680211
  
    @staple How many iterations did you run? Did you generate the data or load it from disk/hdfs? Did you cache the Python RDD? When the dataset is not fully cached, I would still expect similar performance, but your result shows a big gap. Maybe it is due to rotating cached blocks.
    
    > 10m records:
    > master: 2130.3178509871
    > w/ patch: 3232.856136322




[GitHub] spark pull request: [SPARK-3488][MLLIB] Cache python RDDs after de...

Posted by staple <gi...@git.apache.org>.
Github user staple commented on the pull request:

    https://github.com/apache/spark/pull/2362#issuecomment-55303441
  
    Hi, I implemented this per discussion here https://github.com/apache/spark/pull/2347#issuecomment-55181535, assuming I understood the comment correctly. The context is that we are supposed to log a warning when running an iterative learning algorithm on an uncached rdd. What originally led me to identify SPARK-3488 is that if the deserialized python rdds are always uncached, a warning will always be logged.
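
    For context, this is roughly the kind of check involved; it is a simplified sketch, not the exact mllib code (the real implementation warns through the logging framework rather than println).

        import org.apache.spark.rdd.RDD
        import org.apache.spark.storage.StorageLevel

        // Warn when an iterative learner is handed an input rdd that is not persisted at any level.
        def warnIfUncached(input: RDD[_]): Unit = {
          if (input.getStorageLevel == StorageLevel.NONE) {
            println("WARN: input rdd is not cached; an iterative algorithm may recompute it on every iteration")
          }
        }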
    
    Obviously a meaningful performance difference would trump the implementation of this warning message, and I haven't measured performance - just discussed options in the above referenced pull request. But by way of comparison, is there any significant difference in memory pressure between caching a LabeledPoint rdd deserialized from python and caching a LabeledPoint rdd created natively in scala (which is the typical use case with a scala rather than python client)?
    
    If I should do some performance testing, are there any examples of tests and infrastructure you'd suggest as a starting point?
    
    'none' means the rdd is not cached within the python -> scala mllib interface, where previously it was cached. The learning algorithms for which rdds are no longer cached implement their own caching internally.




[GitHub] spark pull request: [SPARK-3488][MLLIB] Cache python RDDs after de...

Posted by staple <gi...@git.apache.org>.
Github user staple commented on the pull request:

    https://github.com/apache/spark/pull/2362#issuecomment-55681425
  
    @mengxr I ran for 100 iterations. I loaded the data from disk using python's SparkContext.pickleFile() (the disk is an SSD). I did not do any manual caching. For more details, you can also see the test scripts I included above.
    
    I also saved the logs from my test runs if those are helpful to see. During the 10m record run I saw many log messages like `CacheManager: Not enough space to cache partition`, which I interpreted as indicating a lack of caching due to memory exhaustion. But I haven't diagnosed the slowdown beyond that.

