Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/06/07 18:26:01 UTC

[GitHub] [spark] xinrong-databricks opened a new pull request, #36793: [SPARK-39406][PYTHON] Accept NumPy array in createDataFrame

xinrong-databricks opened a new pull request, #36793:
URL: https://github.com/apache/spark/pull/36793

   ### What changes were proposed in this pull request?
   Accept NumPy arrays in createDataFrame, with support for the existing dtypes.
   
   
   ### Why are the changes needed?
   As part of [SPARK-39405].
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, NumPy arrays are now accepted in createDataFrame:
   ```py
   ```
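   For illustration, a hypothetical usage sketch (assuming an active `SparkSession` named `spark`; the column names follow the conversion logic discussed in the review below):
   ```py
   import numpy as np

   # A 1-D array becomes a single column named "value".
   spark.createDataFrame(np.array([0.1, 0.2])).show()

   # A 2-D array becomes one column per array column.
   spark.createDataFrame(np.array([[1, 2], [3, 4]])).show()
   ```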
   
   
   ### How was this patch tested?
   Unit tests.




[GitHub] [spark] xinrong-databricks commented on a diff in pull request #36793: [SPARK-39406][PYTHON] Accept NumPy array in createDataFrame

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on code in PR #36793:
URL: https://github.com/apache/spark/pull/36793#discussion_r892890115


##########
python/pyspark/sql/tests/test_arrow.py:
##########
@@ -495,6 +509,22 @@ def test_schema_conversion_roundtrip(self):
         schema_rt = from_arrow_schema(arrow_schema)
         self.assertEqual(self.schema, schema_rt)
 
+    @unittest.skipIf(not have_numpy, cast(str, numpy_requirement_message))

Review Comment:
   How about checking if NumPy is installed?





[GitHub] [spark] HyukjinKwon commented on pull request #36793: [SPARK-39406][PYTHON] Accept NumPy array in createDataFrame

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #36793:
URL: https://github.com/apache/spark/pull/36793#issuecomment-1153328439

   Merged to master.




[GitHub] [spark] xinrong-databricks commented on a diff in pull request #36793: [SPARK-39406][PYTHON] Accept NumPy array in createDataFrame

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on code in PR #36793:
URL: https://github.com/apache/spark/pull/36793#discussion_r891581163


##########
python/pyspark/sql/session.py:
##########
@@ -952,12 +953,30 @@ def createDataFrame(  # type: ignore[misc]
             schema = [x.encode("utf-8") if not isinstance(x, str) else x for x in schema]
 
         try:
-            import pandas
+            import pandas as pd
 
             has_pandas = True
         except Exception:
             has_pandas = False
-        if has_pandas and isinstance(data, pandas.DataFrame):
+
+        try:
+            import numpy as np

Review Comment:
   All NumPy versions should be fine as long as pandas can convert the ndarray to a pandas DataFrame, so we only check the pandas minimum version.





[GitHub] [spark] xinrong-databricks commented on a diff in pull request #36793: [SPARK-39406][PYTHON] Accept NumPy array in createDataFrame

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on code in PR #36793:
URL: https://github.com/apache/spark/pull/36793#discussion_r893926925


##########
python/pyspark/sql/tests/test_arrow.py:
##########
@@ -495,6 +509,22 @@ def test_schema_conversion_roundtrip(self):
         schema_rt = from_arrow_schema(arrow_schema)
         self.assertEqual(self.schema, schema_rt)
 
+    @unittest.skipIf(not have_numpy, cast(str, numpy_requirement_message))

Review Comment:
   Sounds good! Now we assume that the testing environment has numpy installed.





[GitHub] [spark] xinrong-databricks commented on a diff in pull request #36793: [SPARK-39406][PYTHON] Accept NumPy array in createDataFrame

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on code in PR #36793:
URL: https://github.com/apache/spark/pull/36793#discussion_r895210329


##########
python/pyspark/sql/session.py:
##########
@@ -952,12 +953,29 @@ def createDataFrame(  # type: ignore[misc]
             schema = [x.encode("utf-8") if not isinstance(x, str) else x for x in schema]
 
         try:
-            import pandas
+            import pandas as pd
 
             has_pandas = True
         except Exception:
             has_pandas = False
-        if has_pandas and isinstance(data, pandas.DataFrame):
+
+        try:
+            import numpy as np

Review Comment:
   Good point! For consistency with the NumPy versions used in other modules, `numpy>=1.15` is added.
   
   May I ask what our rule is for deciding the minimum version of a library?
   
   Currently, we set `pandas>=1.0.5`, which requires `numpy>=1.13.3` (according to the [pandas installation guide](https://pandas.pydata.org/pandas-docs/version/1.0/getting_started/install.html)).
   
   CC @HyukjinKwon 





[GitHub] [spark] Yikun commented on a diff in pull request #36793: [SPARK-39406][PYTHON] Accept NumPy array in createDataFrame

Posted by GitBox <gi...@apache.org>.
Yikun commented on code in PR #36793:
URL: https://github.com/apache/spark/pull/36793#discussion_r894971731


##########
python/pyspark/sql/session.py:
##########
@@ -952,12 +953,29 @@ def createDataFrame(  # type: ignore[misc]
             schema = [x.encode("utf-8") if not isinstance(x, str) else x for x in schema]
 
         try:
-            import pandas
+            import pandas as pd
 
             has_pandas = True
         except Exception:
             has_pandas = False
-        if has_pandas and isinstance(data, pandas.DataFrame):
+
+        try:
+            import numpy as np
+
+            has_numpy = True
+        except Exception:
+            has_numpy = False
+
+        if has_numpy and isinstance(data, np.ndarray):
+            from pyspark.sql.pandas.utils import require_minimum_pandas_version
+
+            require_minimum_pandas_version()
+            if data.ndim not in [1, 2]:
+                raise ValueError("NumPy array input should be of 1 or 2 dimensions.")
+            column_names = ["value"] if data.ndim == 1 else ["_1", "_2"]
+            data = pd.DataFrame(data, columns=column_names)

Review Comment:
   question: For other NumPy types supported in the future, is the plan also to convert to a pandas DataFrame first, i.e. numpy --> pandas --> Spark DataFrame?



##########
python/pyspark/sql/session.py:
##########
@@ -952,12 +953,29 @@ def createDataFrame(  # type: ignore[misc]
             schema = [x.encode("utf-8") if not isinstance(x, str) else x for x in schema]
 
         try:
-            import pandas
+            import pandas as pd
 
             has_pandas = True
         except Exception:
             has_pandas = False
-        if has_pandas and isinstance(data, pandas.DataFrame):
+
+        try:
+            import numpy as np
+
+            has_numpy = True
+        except Exception:
+            has_numpy = False
+
+        if has_numpy and isinstance(data, np.ndarray):
+            from pyspark.sql.pandas.utils import require_minimum_pandas_version
+
+            require_minimum_pandas_version()

Review Comment:
   nit: If only numpy is installed but pandas is not, this will only raise the "pandas not installed" error. Users may be confused: why do I need to install pandas when I just want to use numpy?
   
   So maybe raise a clearer exception here that tells users a NumPy array is first converted to a pandas DataFrame in PySpark, so pandas is required, like:
   
   ```python
   if not has_pandas:
       # warn or raise a friendly exception explaining why pandas is needed
       ...
   
   from pyspark.sql.pandas.utils import require_minimum_pandas_version
   require_minimum_pandas_version()
   ```
   
   Or at least add a note here.
   



##########
python/pyspark/sql/session.py:
##########
@@ -952,12 +953,29 @@ def createDataFrame(  # type: ignore[misc]
             schema = [x.encode("utf-8") if not isinstance(x, str) else x for x in schema]
 
         try:
-            import pandas
+            import pandas as pd
 
             has_pandas = True
         except Exception:
             has_pandas = False
-        if has_pandas and isinstance(data, pandas.DataFrame):
+
+        try:
+            import numpy as np

Review Comment:
   Should we add `numpy` as a requirement in [setup.py](https://github.com/apache/spark/blob/master/python/setup.py#L266-L269)? And mention it in the dependencies doc?
   
   [1] https://github.com/apache/spark/blob/master/python/setup.py#L266-L269
   [2] https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc1-docs/_site/api/python/getting_started/install.html#dependencies





[GitHub] [spark] srowen commented on a diff in pull request #36793: [SPARK-39406][PYTHON] Accept NumPy array in createDataFrame

Posted by GitBox <gi...@apache.org>.
srowen commented on code in PR #36793:
URL: https://github.com/apache/spark/pull/36793#discussion_r895004388


##########
python/pyspark/sql/session.py:
##########
@@ -952,12 +953,29 @@ def createDataFrame(  # type: ignore[misc]
             schema = [x.encode("utf-8") if not isinstance(x, str) else x for x in schema]
 
         try:
-            import pandas
+            import pandas as pd
 
             has_pandas = True
         except Exception:
             has_pandas = False
-        if has_pandas and isinstance(data, pandas.DataFrame):
+
+        try:
+            import numpy as np
+
+            has_numpy = True
+        except Exception:
+            has_numpy = False
+
+        if has_numpy and isinstance(data, np.ndarray):
+            from pyspark.sql.pandas.utils import require_minimum_pandas_version
+
+            require_minimum_pandas_version()
+            if data.ndim not in [1, 2]:
+                raise ImportError("NumPy array input should be of 1 or 2 dimensions.")
+            column_names = ["value"] if data.ndim == 1 else ["_1", "_2"]

Review Comment:
   Ideally it infers the narrowest type supporting all values. I think there is another issue open about this.





[GitHub] [spark] xinrong-databricks commented on a diff in pull request #36793: [SPARK-39406][PYTHON] Accept NumPy array in createDataFrame

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on code in PR #36793:
URL: https://github.com/apache/spark/pull/36793#discussion_r895208448


##########
python/pyspark/sql/session.py:
##########
@@ -952,12 +953,29 @@ def createDataFrame(  # type: ignore[misc]
             schema = [x.encode("utf-8") if not isinstance(x, str) else x for x in schema]
 
         try:
-            import pandas
+            import pandas as pd
 
             has_pandas = True
         except Exception:
             has_pandas = False
-        if has_pandas and isinstance(data, pandas.DataFrame):
+
+        try:
+            import numpy as np
+
+            has_numpy = True
+        except Exception:
+            has_numpy = False
+
+        if has_numpy and isinstance(data, np.ndarray):
+            from pyspark.sql.pandas.utils import require_minimum_pandas_version
+
+            require_minimum_pandas_version()
+            if data.ndim not in [1, 2]:
+                raise ImportError("NumPy array input should be of 1 or 2 dimensions.")
+            column_names = ["value"] if data.ndim == 1 else ["_1", "_2"]

Review Comment:
   I see, then we may have to iterate through the values and find the Lowest Common Ancestor of all involved types.
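   
   For illustration only (not what this PR implements): NumPy's own promotion rules give one possible notion of a common type for a mixed list of NumPy and Python scalars. A hypothetical sketch:
   ```python
   import numpy as np

   # Find a common dtype for mixed NumPy/Python scalars via NumPy's
   # promotion rules (not necessarily the rule Spark would choose).
   values = [np.int8(1), 2, np.float32(3.0)]
   common = np.result_type(*[np.asarray(v) for v in values])
   print(common)  # e.g. float64 under NumPy's default promotion
   ```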





[GitHub] [spark] HyukjinKwon closed pull request #36793: [SPARK-39406][PYTHON] Accept NumPy array in createDataFrame

Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #36793: [SPARK-39406][PYTHON] Accept NumPy array in createDataFrame
URL: https://github.com/apache/spark/pull/36793




[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36793: [SPARK-39406][PYTHON] Accept NumPy array in createDataFrame

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #36793:
URL: https://github.com/apache/spark/pull/36793#discussion_r892985507


##########
python/pyspark/sql/tests/test_arrow.py:
##########
@@ -495,6 +509,22 @@ def test_schema_conversion_roundtrip(self):
         schema_rt = from_arrow_schema(arrow_schema)
         self.assertEqual(self.schema, schema_rt)
 
+    @unittest.skipIf(not have_numpy, cast(str, numpy_requirement_message))

Review Comment:
   That's fine. It's not the main code but testing. Let's just keep it simple.





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36793: [SPARK-39406][PYTHON] Accept NumPy array in createDataFrame

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #36793:
URL: https://github.com/apache/spark/pull/36793#discussion_r892865940


##########
python/pyspark/sql/tests/test_arrow.py:
##########
@@ -495,6 +509,22 @@ def test_schema_conversion_roundtrip(self):
         schema_rt = from_arrow_schema(arrow_schema)
         self.assertEqual(self.schema, schema_rt)
 
+    @unittest.skipIf(not have_numpy, cast(str, numpy_requirement_message))

Review Comment:
   The whole test class here assumes pandas and PyArrow are available, so it should be fine not to add this check.





[GitHub] [spark] xinrong-databricks commented on a diff in pull request #36793: [SPARK-39406][PYTHON] Accept NumPy array in createDataFrame

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on code in PR #36793:
URL: https://github.com/apache/spark/pull/36793#discussion_r891581163


##########
python/pyspark/sql/session.py:
##########
@@ -952,12 +953,30 @@ def createDataFrame(  # type: ignore[misc]
             schema = [x.encode("utf-8") if not isinstance(x, str) else x for x in schema]
 
         try:
-            import pandas
+            import pandas as pd
 
             has_pandas = True
         except Exception:
             has_pandas = False
-        if has_pandas and isinstance(data, pandas.DataFrame):
+
+        try:
+            import numpy as np

Review Comment:
   All NumPy versions should be fine as long as pandas can convert the ndarray to a pandas DataFrame.





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36793: [SPARK-39406][PYTHON] Accept NumPy array in createDataFrame

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #36793:
URL: https://github.com/apache/spark/pull/36793#discussion_r894039985


##########
python/pyspark/sql/session.py:
##########
@@ -952,12 +953,29 @@ def createDataFrame(  # type: ignore[misc]
             schema = [x.encode("utf-8") if not isinstance(x, str) else x for x in schema]
 
         try:
-            import pandas
+            import pandas as pd
 
             has_pandas = True
         except Exception:
             has_pandas = False
-        if has_pandas and isinstance(data, pandas.DataFrame):
+
+        try:
+            import numpy as np
+
+            has_numpy = True
+        except Exception:
+            has_numpy = False
+
+        if has_numpy and isinstance(data, np.ndarray):
+            from pyspark.sql.pandas.utils import require_minimum_pandas_version
+
+            require_minimum_pandas_version()
+            if data.ndim not in [1, 2]:
+                raise ImportError("NumPy array input should be of 1 or 2 dimensions.")
+            column_names = ["value"] if data.ndim == 1 else ["_1", "_2"]

Review Comment:
   This would have to be done separately, but just noting here: maybe we should make a case like `spark.createDataFrame([1, 2])` work too. `spark.createDataFrame([1, 2], "int")` works; the problem seems to be in schema inference.
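   
   To illustrate the two cases (assuming an active `SparkSession` named `spark`; the error text is paraphrased from memory):
   ```python
   # Works: the element type is given explicitly.
   spark.createDataFrame([1, 2], "int").show()

   # Currently fails during schema inference for plain scalars, with an error
   # along the lines of "Can not infer schema for type: <class 'int'>".
   spark.createDataFrame([1, 2]).show()
   ```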





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36793: [SPARK-39406][PYTHON] Accept NumPy array in createDataFrame

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #36793:
URL: https://github.com/apache/spark/pull/36793#discussion_r895251697


##########
python/pyspark/sql/session.py:
##########
@@ -952,12 +953,29 @@ def createDataFrame(  # type: ignore[misc]
             schema = [x.encode("utf-8") if not isinstance(x, str) else x for x in schema]
 
         try:
-            import pandas
+            import pandas as pd
 
             has_pandas = True
         except Exception:
             has_pandas = False
-        if has_pandas and isinstance(data, pandas.DataFrame):
+
+        try:
+            import numpy as np

Review Comment:
   Let's not add it for now, since this is optional for SQL.





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36793: [SPARK-39406][PYTHON] Accept NumPy array in createDataFrame

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #36793:
URL: https://github.com/apache/spark/pull/36793#discussion_r891811193


##########
python/pyspark/sql/tests/test_session.py:
##########
@@ -379,6 +381,54 @@ def test_use_custom_class_for_extensions(self):
         )
 
 
+class CreateDataFrame(unittest.TestCase):

Review Comment:
   Can we add this test to ArrowTests so it exercises both the Arrow-optimized and non-optimized paths?





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36793: [SPARK-39406][PYTHON] Accept NumPy array in createDataFrame

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #36793:
URL: https://github.com/apache/spark/pull/36793#discussion_r892865940


##########
python/pyspark/sql/tests/test_arrow.py:
##########
@@ -495,6 +509,22 @@ def test_schema_conversion_roundtrip(self):
         schema_rt = from_arrow_schema(arrow_schema)
         self.assertEqual(self.schema, schema_rt)
 
+    @unittest.skipIf(not have_numpy, cast(str, numpy_requirement_message))

Review Comment:
   The test cases here all assume pandas and PyArrow are available, so it should be fine not to add this check.





[GitHub] [spark] xinrong-databricks commented on a diff in pull request #36793: [SPARK-39406][PYTHON] Accept NumPy array in createDataFrame

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on code in PR #36793:
URL: https://github.com/apache/spark/pull/36793#discussion_r895206894


##########
python/pyspark/sql/session.py:
##########
@@ -952,12 +953,29 @@ def createDataFrame(  # type: ignore[misc]
             schema = [x.encode("utf-8") if not isinstance(x, str) else x for x in schema]
 
         try:
-            import pandas
+            import pandas as pd
 
             has_pandas = True
         except Exception:
             has_pandas = False
-        if has_pandas and isinstance(data, pandas.DataFrame):
+
+        try:
+            import numpy as np
+
+            has_numpy = True
+        except Exception:
+            has_numpy = False
+
+        if has_numpy and isinstance(data, np.ndarray):
+            from pyspark.sql.pandas.utils import require_minimum_pandas_version
+
+            require_minimum_pandas_version()
+            if data.ndim not in [1, 2]:
+                raise ValueError("NumPy array input should be of 1 or 2 dimensions.")
+            column_names = ["value"] if data.ndim == 1 else ["_1", "_2"]
+            data = pd.DataFrame(data, columns=column_names)

Review Comment:
   Yes, since pandas extends numpy dtypes.
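   
   A minimal sketch of that path (illustrative only; `spark` is assumed to be an active SparkSession):
   ```python
   import numpy as np
   import pandas as pd

   arr = np.array([[1, 2], [3, 4]])

   # Roughly what this PR does internally: ndarray -> pandas DataFrame -> Spark DataFrame.
   pdf = pd.DataFrame(arr, columns=["_1", "_2"])
   df = spark.createDataFrame(pdf)
   df.show()
   ```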





[GitHub] [spark] xinrong-databricks commented on pull request #36793: [SPARK-39406][PYTHON] Accept NumPy array in createDataFrame

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on PR #36793:
URL: https://github.com/apache/spark/pull/36793#issuecomment-1153244595

   Rebased to resolve conflicts in `python/docs/source/getting_started/install.rst` only.




[GitHub] [spark] itholic commented on pull request #36793: [SPARK-39406][PYTHON] Accept NumPy array in createDataFrame

Posted by GitBox <gi...@apache.org>.
itholic commented on PR #36793:
URL: https://github.com/apache/spark/pull/36793#issuecomment-1150640100

   Seems like it would be fixed when https://github.com/apache/spark/pull/36813 is merged.
   
   Let's rebase after that.




[GitHub] [spark] itholic commented on pull request #36793: [SPARK-39406][PYTHON] Accept NumPy array in createDataFrame

Posted by GitBox <gi...@apache.org>.
itholic commented on PR #36793:
URL: https://github.com/apache/spark/pull/36793#issuecomment-1150640677

   Otherwise, looks good if the tests pass.




[GitHub] [spark] xinrong-databricks commented on a diff in pull request #36793: [SPARK-39406][PYTHON] Accept NumPy array in createDataFrame

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on code in PR #36793:
URL: https://github.com/apache/spark/pull/36793#discussion_r894697512


##########
python/pyspark/sql/session.py:
##########
@@ -952,12 +953,29 @@ def createDataFrame(  # type: ignore[misc]
             schema = [x.encode("utf-8") if not isinstance(x, str) else x for x in schema]
 
         try:
-            import pandas
+            import pandas as pd
 
             has_pandas = True
         except Exception:
             has_pandas = False
-        if has_pandas and isinstance(data, pandas.DataFrame):
+
+        try:
+            import numpy as np
+
+            has_numpy = True
+        except Exception:
+            has_numpy = False
+
+        if has_numpy and isinstance(data, np.ndarray):
+            from pyspark.sql.pandas.utils import require_minimum_pandas_version
+
+            require_minimum_pandas_version()
+            if data.ndim not in [1, 2]:
+                raise ImportError("NumPy array input should be of 1 or 2 dimensions.")
+            column_names = ["value"] if data.ndim == 1 else ["_1", "_2"]

Review Comment:
   Also, the input list may consist of both NumPy scalars and native Python scalars, e.g. `spark.createDataFrame([np.int8(1), 2])`.





[GitHub] [spark] Yikun commented on a diff in pull request #36793: [SPARK-39406][PYTHON] Accept NumPy array in createDataFrame

Posted by GitBox <gi...@apache.org>.
Yikun commented on code in PR #36793:
URL: https://github.com/apache/spark/pull/36793#discussion_r894989727


##########
python/pyspark/sql/session.py:
##########
@@ -952,12 +953,29 @@ def createDataFrame(  # type: ignore[misc]
             schema = [x.encode("utf-8") if not isinstance(x, str) else x for x in schema]
 
         try:
-            import pandas
+            import pandas as pd
 
             has_pandas = True
         except Exception:
             has_pandas = False
-        if has_pandas and isinstance(data, pandas.DataFrame):
+
+        try:
+            import numpy as np
+
+            has_numpy = True
+        except Exception:
+            has_numpy = False
+
+        if has_numpy and isinstance(data, np.ndarray):
+            from pyspark.sql.pandas.utils import require_minimum_pandas_version
+
+            require_minimum_pandas_version()

Review Comment:
   nit: If only numpy is installed but pandas is not, this will only raise the "pandas not installed" error. Users may be confused: why do I need to install pandas when I just want to use numpy?
   
   So maybe raise a clearer exception here that tells users a NumPy array is first converted to a pandas DataFrame in PySpark, so pandas is required, like:
   
   ```python
   if not has_pandas:
       # warn or raise a friendly exception explaining why pandas is needed
       ...
   
   from pyspark.sql.pandas.utils import require_minimum_pandas_version
   require_minimum_pandas_version()
   ```
   
   Or at least add a note here.
   





[GitHub] [spark] xinrong-databricks commented on a diff in pull request #36793: [SPARK-39406][PYTHON] Accept NumPy array in createDataFrame

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on code in PR #36793:
URL: https://github.com/apache/spark/pull/36793#discussion_r895210608


##########
python/pyspark/sql/session.py:
##########
@@ -952,12 +953,29 @@ def createDataFrame(  # type: ignore[misc]
             schema = [x.encode("utf-8") if not isinstance(x, str) else x for x in schema]
 
         try:
-            import pandas
+            import pandas as pd
 
             has_pandas = True
         except Exception:
             has_pandas = False
-        if has_pandas and isinstance(data, pandas.DataFrame):
+
+        try:
+            import numpy as np
+
+            has_numpy = True
+        except Exception:
+            has_numpy = False
+
+        if has_numpy and isinstance(data, np.ndarray):
+            from pyspark.sql.pandas.utils import require_minimum_pandas_version
+
+            require_minimum_pandas_version()

Review Comment:
   What you suggested makes sense.
   
   My concern is that 
   ```py
   if not has_pandas:
       # warn or raise a friendly exception
   ```
   overlaps with the `require_minimum_pandas_version` check, since the latter checks `has_pandas` as well.
   
   I will leave a note for now and we may follow up with a better approach later. WDYT :)
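   
   For context, a simplified approximation of what that existing check already covers (this is not the actual `pyspark.sql.pandas.utils` source, just a hypothetical sketch of its behavior):
   ```python
   def require_minimum_pandas_version_sketch(minimum=(1, 0, 5)):
       # Hypothetical approximation: the real helper raises ImportError when
       # pandas is missing or older than the supported minimum version.
       try:
           import pandas
       except ImportError as error:
           raise ImportError("Pandas >= 1.0.5 must be installed.") from error
       installed = tuple(int(p) for p in pandas.__version__.split(".")[:3] if p.isdigit())
       if installed < minimum:
           raise ImportError(
               "Pandas >= %s must be installed; found %s."
               % (".".join(map(str, minimum)), pandas.__version__)
           )
   ```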





[GitHub] [spark] xinrong-databricks commented on a diff in pull request #36793: [SPARK-39406][PYTHON] Accept NumPy array in createDataFrame

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on code in PR #36793:
URL: https://github.com/apache/spark/pull/36793#discussion_r892863327


##########
python/pyspark/sql/tests/test_session.py:
##########
@@ -379,6 +381,54 @@ def test_use_custom_class_for_extensions(self):
         )
 
 
+class CreateDataFrame(unittest.TestCase):

Review Comment:
   Good idea! Adjusted.





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36793: [SPARK-39406][PYTHON] Accept NumPy array in createDataFrame

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #36793:
URL: https://github.com/apache/spark/pull/36793#discussion_r894068688


##########
python/pyspark/sql/session.py:
##########
@@ -952,12 +953,29 @@ def createDataFrame(  # type: ignore[misc]
             schema = [x.encode("utf-8") if not isinstance(x, str) else x for x in schema]
 
         try:
-            import pandas
+            import pandas as pd
 
             has_pandas = True
         except Exception:
             has_pandas = False
-        if has_pandas and isinstance(data, pandas.DataFrame):
+
+        try:
+            import numpy as np
+
+            has_numpy = True
+        except Exception:
+            has_numpy = False
+
+        if has_numpy and isinstance(data, np.ndarray):
+            from pyspark.sql.pandas.utils import require_minimum_pandas_version
+
+            require_minimum_pandas_version()
+            if data.ndim not in [1, 2]:
+                raise ImportError("NumPy array input should be of 1 or 2 dimensions.")

Review Comment:
   Wait, I think it shouldn't be an import error?







[GitHub] [spark] xinrong-databricks commented on pull request #36793: [SPARK-39406][PYTHON] Accept NumPy array in createDataFrame

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on PR #36793:
URL: https://github.com/apache/spark/pull/36793#issuecomment-1151589025

   Rebased to pass tests. Thanks @itholic! 




[GitHub] [spark] xinrong-databricks commented on a diff in pull request #36793: [SPARK-39406][PYTHON] Accept NumPy array in createDataFrame

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on code in PR #36793:
URL: https://github.com/apache/spark/pull/36793#discussion_r894695373


##########
python/pyspark/sql/session.py:
##########
@@ -952,12 +953,29 @@ def createDataFrame(  # type: ignore[misc]
             schema = [x.encode("utf-8") if not isinstance(x, str) else x for x in schema]
 
         try:
-            import pandas
+            import pandas as pd
 
             has_pandas = True
         except Exception:
             has_pandas = False
-        if has_pandas and isinstance(data, pandas.DataFrame):
+
+        try:
+            import numpy as np
+
+            has_numpy = True
+        except Exception:
+            has_numpy = False
+
+        if has_numpy and isinstance(data, np.ndarray):
+            from pyspark.sql.pandas.utils import require_minimum_pandas_version
+
+            require_minimum_pandas_version()
+            if data.ndim not in [1, 2]:
+                raise ImportError("NumPy array input should be of 1 or 2 dimensions.")
+            column_names = ["value"] if data.ndim == 1 else ["_1", "_2"]

Review Comment:
   We seem to infer the schema of the input list from the first element only. To support `spark.createDataFrame([1, 2])`, we may assume the elements in the list are of the same type. Is that acceptable?





[GitHub] [spark] xinrong-databricks commented on a diff in pull request #36793: [SPARK-39406][PYTHON] Accept NumPy array in createDataFrame

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on code in PR #36793:
URL: https://github.com/apache/spark/pull/36793#discussion_r894700485


##########
python/pyspark/sql/session.py:
##########
@@ -952,12 +953,29 @@ def createDataFrame(  # type: ignore[misc]
             schema = [x.encode("utf-8") if not isinstance(x, str) else x for x in schema]
 
         try:
-            import pandas
+            import pandas as pd
 
             has_pandas = True
         except Exception:
             has_pandas = False
-        if has_pandas and isinstance(data, pandas.DataFrame):
+
+        try:
+            import numpy as np
+
+            has_numpy = True
+        except Exception:
+            has_numpy = False
+
+        if has_numpy and isinstance(data, np.ndarray):
+            from pyspark.sql.pandas.utils import require_minimum_pandas_version
+
+            require_minimum_pandas_version()
+            if data.ndim not in [1, 2]:
+                raise ImportError("NumPy array input should be of 1 or 2 dimensions.")

Review Comment:
   Good catch! Corrected it and added a test case.


