Posted to issues@spark.apache.org by "Josh Rosen (JIRA)" <ji...@apache.org> on 2014/10/11 20:21:33 UTC

[jira] [Commented] (SPARK-3910) ./python/pyspark/mllib/classification.py doctests fails with module name pollution

    [ https://issues.apache.org/jira/browse/SPARK-3910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168286#comment-14168286 ] 

Josh Rosen commented on SPARK-3910:
-----------------------------------

This seems to work for me.  If my current working directory is $SPARK_HOME and I run

{code}
./bin/pyspark python/pyspark/mllib/classification.py
{code}

then I don't see any circular import problems.  Widely-used libraries like NumPy themselves declare modules that shadow standard-library names (such as {{np.random}}), so shadowing a standard-library name is not inherently a problem.

Are you trying to run {{classification.py}} from inside of the {{python/pyspark/mllib}} directory?
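For anyone who wants to see the mechanism in isolation, here is a small, self-contained repro of the sys.path shadowing described in the issue (the temp directory and the fake {{random.py}} are made up for the demo):

```python
import os
import sys
import tempfile

# When Python runs a script, it puts the script's directory at
# sys.path[0], so a local "random.py" wins over the standard library's
# random module. Simulate that with a throwaway directory.
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "random.py"), "w") as f:
    f.write("SHADOWED = True\n")

sys.path.insert(0, workdir)      # simulate sys.path[0] = script's dir
sys.modules.pop("random", None)  # forget the already-cached stdlib module

import random  # now resolves to workdir/random.py, not the stdlib
print(getattr(random, "SHADOWED", False))  # prints True
```

Running {{classification.py}} from inside {{python/pyspark/mllib}} triggers exactly this: {{sys.path[0]}} becomes that directory, and {{pyspark/mllib/random.py}} shadows the stdlib {{random}}.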

> ./python/pyspark/mllib/classification.py doctests fails with module name pollution
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-3910
>                 URL: https://issues.apache.org/jira/browse/SPARK-3910
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 1.2.0
>         Environment: Mac OS X 10.9.5, Python 2.6.8, Java 1.8.0_20, Jinja2==2.7.3, MarkupSafe==0.23, Pygments==1.6, Sphinx==1.2.3, argparse==1.2.1, docutils==0.12, flake8==2.2.3, mccabe==0.2.1, numpy==1.9.0, pep8==1.5.7, psutil==2.1.3, pyflake8==0.1.9, pyflakes==0.8.1, unittest2==0.5.1, wsgiref==0.1.2
>            Reporter: cocoatomo
>              Labels: pyspark, testing
>
> In the ./python/run-tests script, we run the doctests in ./pyspark/mllib/classification.py.
> The output is as follows:
> {noformat}
> $ ./python/run-tests
> ...
> Running test: pyspark/mllib/classification.py
> Traceback (most recent call last):
>   File "pyspark/mllib/classification.py", line 20, in <module>
>     import numpy
>   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/__init__.py", line 170, in <module>
>     from . import add_newdocs
>   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/add_newdocs.py", line 13, in <module>
>     from numpy.lib import add_newdoc
>   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/lib/__init__.py", line 8, in <module>
>     from .type_check import *
>   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/lib/type_check.py", line 11, in <module>
>     import numpy.core.numeric as _nx
>   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/core/__init__.py", line 46, in <module>
>     from numpy.testing import Tester
>   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/testing/__init__.py", line 13, in <module>
>     from .utils import *
>   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/testing/utils.py", line 15, in <module>
>     from tempfile import mkdtemp
>   File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/tempfile.py", line 34, in <module>
>     from random import Random as _Random
>   File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/mllib/random.py", line 24, in <module>
>     from pyspark.rdd import RDD
>   File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/__init__.py", line 51, in <module>
>     from pyspark.context import SparkContext
>   File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/context.py", line 22, in <module>
>     from tempfile import NamedTemporaryFile
> ImportError: cannot import name NamedTemporaryFile
>         0.07 real         0.04 user         0.02 sys
> Had test failures; see logs.
> {noformat}
> The problem is a cyclic import of the tempfile module.
> The cause is that the pyspark.mllib.random module lives in the same directory as the pyspark.mllib.classification module.
> The classification module imports the numpy module, and numpy internally imports the tempfile module.
> Because the first entry of sys.path is the directory "./python/pyspark/mllib" (where the executed file "classification.py" resides), tempfile's import of "random" picks up the pyspark.mllib.random module instead of the standard library "random" module.
> The import chain eventually reaches tempfile again, forming a cycle.
> Summary: classification → numpy → tempfile → pyspark.mllib.random → tempfile → (cyclic import!!)
> Furthermore, the stat module is part of the standard library, and a pyspark.mllib.stat module also exists, so it may cause the same trouble.
> commit: 0e8203f4fb721158fb27897680da476174d24c4b
> A fundamental solution is to avoid module names already used by the standard library (currently "random" and "stat").
> The difficulty with this solution is that renaming pyspark.mllib.random and pyspark.mllib.stat may break code that already uses them.
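Short of renaming, one possible stopgap (just a sketch, not the project's actual fix; the helper name is hypothetical) is to filter the executed script's own directory out of sys.path before any stdlib-shadowing import can occur:

```python
import os

def strip_script_dir(path_entries, script_dir):
    """Return a copy of path_entries (e.g. sys.path) without script_dir.

    Hypothetical helper: dropping the executed script's directory from
    the search path makes "import random" resolve to the standard
    library even when the script sits next to pyspark/mllib/random.py.
    """
    target = os.path.abspath(script_dir)
    return [p for p in path_entries
            if os.path.abspath(p or os.curdir) != target]

# Example: sys.path roughly as Python would build it when running
# ./python/pyspark/mllib/classification.py directly.
cleaned = strip_script_dir(
    ["/spark/python/pyspark/mllib", "/spark/python", "/usr/lib/python2.6"],
    "/spark/python/pyspark/mllib",
)
print(cleaned)  # ['/spark/python', '/usr/lib/python2.6']
```

This only papers over the symptom, though; any user script placed in that directory would hit the same shadowing, which is why renaming the modules is called the fundamental solution above.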



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org