Posted to issues@spark.apache.org by "Daniel Oakley (Jira)" <ji...@apache.org> on 2022/08/09 03:37:00 UTC
[jira] [Created] (SPARK-40011) Pandas API on Spark requires Pandas
Daniel Oakley created SPARK-40011:
-------------------------------------
Summary: Pandas API on Spark requires Pandas
Key: SPARK-40011
URL: https://issues.apache.org/jira/browse/SPARK-40011
Project: Spark
Issue Type: Bug
Components: Pandas API on Spark
Affects Versions: 3.3.0
Reporter: Daniel Oakley
Fix For: 3.3.1
Pandas API on Spark includes code like:
> import pandas as pd
> from pandas.api.types import is_hashable, is_list_like # type: ignore[attr-defined]
This breaks if you don't have pandas installed on your Spark cluster.
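As a minimal sketch of a possible workaround (names here are illustrative, not part of any Spark API), a job could probe for the optional dependency before touching pyspark.pandas, instead of failing with an unguarded import:

```python
# Sketch: check whether pandas is importable on this node before using
# the pandas API on Spark (pyspark.pandas). The helper name is assumed.
import importlib.util

def has_pandas() -> bool:
    """Return True if the pandas package can be imported on this node."""
    return importlib.util.find_spec("pandas") is not None

if has_pandas():
    # Safe to import pyspark.pandas here.
    pass
else:
    print("pandas is not installed; importing pyspark.pandas will fail")
```

This only reports the problem earlier; it does not remove the dependency itself.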
Pandas API on Spark was supposed to be an API, not a pandas integration; why, then, does it require pandas to be installed?
In many environments, Spark jobs run on a variety of clusters with no guarantee that particular Python packages are installed at the system level.
Can this dependency be removed? Or could the required version of pandas be bundled with the Spark distribution? The same applies to numpy and other dependencies.
If not, the docs should clearly state that this is not merely a Spark API that mirrors the pandas API, but something quite different.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)