Posted to issues@spark.apache.org by "Daniel Oakley (Jira)" <ji...@apache.org> on 2022/08/09 03:37:00 UTC

[jira] [Created] (SPARK-40011) Pandas API on Spark requires Pandas

Daniel Oakley created SPARK-40011:
-------------------------------------

             Summary: Pandas API on Spark requires Pandas
                 Key: SPARK-40011
                 URL: https://issues.apache.org/jira/browse/SPARK-40011
             Project: Spark
          Issue Type: Bug
          Components: Pandas API on Spark
    Affects Versions: 3.3.0
            Reporter: Daniel Oakley
             Fix For: 3.3.1


Pandas API on Spark includes code like:

> import pandas as pd
> from pandas.api.types import is_hashable, is_list_like  # type: ignore[attr-defined]

These imports fail with an ImportError if pandas is not installed on the Spark cluster.
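For example, on a node without pandas, even importing the module fails up front (a minimal sketch; the exact error message depends on the Spark and Python versions):

> import pyspark.pandas as ps
> # ImportError: pandas is missing, so the import itself blows up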

The Pandas API on Spark was supposed to be an API, not a pandas integration, so why does it require pandas to be installed?

In many environments, Spark jobs run on a variety of Spark clusters with no assurance that particular Python packages are installed at the system level.
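As a stopgap (not a fix), the PySpark packaging docs describe shipping a packed virtual environment that includes pandas alongside the job; a sketch, where the archive name and paths are placeholders:

> # Pack an environment containing pandas (e.g. with conda-pack), then:
> export PYSPARK_PYTHON=./environment/bin/python
> spark-submit --archives pyspark_conda_env.tar.gz#environment app.py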

Can this dependency be removed? Or could the required version of pandas be bundled with the Spark distribution? The same applies to numpy and other dependencies.
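For what it's worth, recent PySpark wheels appear to declare a "pandas_on_spark" extra that pulls in the pandas/pyarrow versions Spark expects, but it still has to be installed on every node (which assumes pip-managed clusters, exactly what is not guaranteed here):

> pip install "pyspark[pandas_on_spark]"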

If not, the docs should clearly state that this is not merely a Spark API that mirrors the pandas API, but something quite different.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org