Posted to issues@spark.apache.org by "Josh Rosen (JIRA)" <ji...@apache.org> on 2014/10/11 20:21:33 UTC
[jira] [Commented] (SPARK-3910) ./python/pyspark/mllib/classification.py doctests fails with module name pollution
[ https://issues.apache.org/jira/browse/SPARK-3910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168286#comment-14168286 ]
Josh Rosen commented on SPARK-3910:
-----------------------------------
This seems to work for me. If my current working directory is $SPARK_HOME and I run
{code}
./bin/pyspark python/pyspark/mllib/classification.py
{code}
then I don't see any circular import problems. Widely-used libraries like NumPy declare modules whose names shadow standard-library modules (such as {{np.random}}), so I don't think that doing this safely is impossible.
Are you trying to run {{classification.py}} from inside of the {{python/pyspark/mllib}} directory?
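The shadowing mechanism the reporter describes can be reproduced outside Spark. The sketch below (an assumed, throwaway setup, not Spark code) creates a directory containing a file named {{random.py}}, puts that directory at the front of {{sys.path}} the way running a script from inside it does, and shows that the local file shadows the standard-library {{random}} module:

```python
import os
import sys
import tempfile

# Hypothetical reproduction: a directory standing in for
# python/pyspark/mllib, containing a file named random.py.
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "random.py"), "w") as f:
    f.write("SHADOWED = True\n")

# Running a script puts its directory first on sys.path; mimic that.
sys.path.insert(0, workdir)
sys.modules.pop("random", None)   # forget the cached stdlib module
import random                     # now resolves to workdir/random.py

shadowed = getattr(random, "SHADOWED", False)
print(shadowed)
```

Any later stdlib import that does {{from random import ...}} (as {{tempfile}} does) would now hit the shadowing file instead, which is exactly how the cycle in the reported traceback forms.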
> ./python/pyspark/mllib/classification.py doctests fails with module name pollution
> ----------------------------------------------------------------------------------
>
> Key: SPARK-3910
> URL: https://issues.apache.org/jira/browse/SPARK-3910
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 1.2.0
> Environment: Mac OS X 10.9.5, Python 2.6.8, Java 1.8.0_20, Jinja2==2.7.3, MarkupSafe==0.23, Pygments==1.6, Sphinx==1.2.3, argparse==1.2.1, docutils==0.12, flake8==2.2.3, mccabe==0.2.1, numpy==1.9.0, pep8==1.5.7, psutil==2.1.3, pyflake8==0.1.9, pyflakes==0.8.1, unittest2==0.5.1, wsgiref==0.1.2
> Reporter: cocoatomo
> Labels: pyspark, testing
>
> In ./python/run-tests script, we run the doctests in ./pyspark/mllib/classification.py.
> The output is as follows:
> {noformat}
> $ ./python/run-tests
> ...
> Running test: pyspark/mllib/classification.py
> Traceback (most recent call last):
>   File "pyspark/mllib/classification.py", line 20, in <module>
>     import numpy
>   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/__init__.py", line 170, in <module>
>     from . import add_newdocs
>   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/add_newdocs.py", line 13, in <module>
>     from numpy.lib import add_newdoc
>   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/lib/__init__.py", line 8, in <module>
>     from .type_check import *
>   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/lib/type_check.py", line 11, in <module>
>     import numpy.core.numeric as _nx
>   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/core/__init__.py", line 46, in <module>
>     from numpy.testing import Tester
>   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/testing/__init__.py", line 13, in <module>
>     from .utils import *
>   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/testing/utils.py", line 15, in <module>
>     from tempfile import mkdtemp
>   File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/tempfile.py", line 34, in <module>
>     from random import Random as _Random
>   File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/mllib/random.py", line 24, in <module>
>     from pyspark.rdd import RDD
>   File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/__init__.py", line 51, in <module>
>     from pyspark.context import SparkContext
>   File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/context.py", line 22, in <module>
>     from tempfile import NamedTemporaryFile
> ImportError: cannot import name NamedTemporaryFile
> 0.07 real 0.04 user 0.02 sys
> Had test failures; see logs.
> {noformat}
> The problem is a cyclic import involving the tempfile module.
> The cause is that the pyspark.mllib.random module lives in the same directory as the pyspark.mllib.classification module.
> The classification module imports numpy, and numpy in turn imports tempfile internally.
> Because the first entry of sys.path is the directory "./python/pyspark/mllib" (where the executed file "classification.py" lives), tempfile's "from random import Random" picks up the pyspark.mllib.random module instead of the standard-library "random" module.
> The import chain then reaches tempfile again, closing the cycle.
> Summary: classification → numpy → tempfile → pyspark.mllib.random → tempfile → (cyclic import!!)
> Furthermore, the stat module is part of the standard library, and a pyspark.mllib.stat module also exists. This may cause similar trouble.
> commit: 0e8203f4fb721158fb27897680da476174d24c4b
> A fundamental solution is to avoid module names that collide with the standard library (currently "random" and "stat").
> The difficulty with this solution is that renaming pyspark.mllib.random and pyspark.mllib.stat may break code that already uses them.
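Short of renaming the modules, the collisions could at least be detected up front. Below is a hedged sketch (a hypothetical helper, not part of Spark's actual run-tests script) that flags .py files in a directory that would shadow standard-library modules once that directory lands at sys.path[0]:

```python
import os
import tempfile

# Names checked here are a small assumed sample of stdlib modules,
# not an exhaustive list.
STDLIB_NAMES = {"random", "stat", "tempfile", "types"}

def shadowed_stdlib_modules(directory):
    """Return the stdlib names that a .py file in `directory` would shadow."""
    present = {os.path.splitext(name)[0] for name in os.listdir(directory)
               if name.endswith(".py")}
    return sorted(STDLIB_NAMES & present)

# Demo against a throwaway directory standing in for python/pyspark/mllib.
demo = tempfile.mkdtemp()
for name in ("random.py", "stat.py", "classification.py"):
    open(os.path.join(demo, name), "w").close()

print(shadowed_stdlib_modules(demo))  # ['random', 'stat']
```

Such a check could run before the doctests and fail fast with a clear message, instead of surfacing as an opaque ImportError deep inside numpy's import chain.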
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org