Posted to issues@spark.apache.org by "Daniel Davies (Jira)" <ji...@apache.org> on 2021/12/26 20:23:00 UTC

[jira] [Commented] (SPARK-37697) Make it easier to convert numpy arrays to Spark Dataframes

    [ https://issues.apache.org/jira/browse/SPARK-37697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17465443#comment-17465443 ] 

Daniel Davies commented on SPARK-37697:
---------------------------------------

Hey Douglas,
 
I've definitely been caught by numpy types a few times in some of our Spark workflows, so I'd be keen to see this solved in PySpark directly as well.
 
Do you want to be able to create a DataFrame directly from a flat list of numpy numbers? This is supported in Pandas, but I don't think it's possible even with native Python types in Spark (my understanding is that the input to createDataFrame is required to be an iterable of rows), so maybe there's a discussion around supporting that. For example, running the below:
{code:java}
df = spark.createDataFrame([1,2,3,4,5]){code}
Fails with:
{code:java}
TypeError: Can not infer schema for type: <class 'int'> {code}
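As a side note, the usual workaround today is to wrap each scalar in a one-element row before calling createDataFrame. A minimal sketch (the actual createDataFrame call is left commented so the snippet runs without a SparkSession):
{code:java}
# Workaround sketch: createDataFrame expects an iterable of rows,
# so wrap each scalar in a one-element tuple first.
data = [1, 2, 3, 4, 5]
rows = [(x,) for x in data]

print(rows)  # [(1,), (2,), (3,), (4,), (5,)]
# df = spark.createDataFrame(rows, ["value"])  # succeeds today
{code}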
The more common issue I've come across, though, is that something like the following isn't supported:
{code:java}
df = spark.createDataFrame([np.arange(10), np.arange(10)]){code}
(I.e. where each row is a numpy array, and/or the overall input list of rows is itself wrapped in a numpy array.)
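For reference, ndarray's tolist() already produces exactly the shape createDataFrame accepts, native Python scalars included. A small sketch using only numpy (the createDataFrame call is commented, since it needs a SparkSession):
{code:java}
import numpy as np

arr = np.arange(3)
row = arr.tolist()     # [0, 1, 2] as native Python ints
print(type(row[0]))    # <class 'int'>, not numpy.int64

# Manual workaround today:
# df = spark.createDataFrame([a.tolist() for a in [np.arange(10), np.arange(10)]])
{code}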
 
I'd be happy to take on the work for this PR, whether for the first case or the second. (It would be my first contribution, so if you or anyone else think this is more complex than I'm currently estimating below, let me know.)
 
For creating a DataFrame from a flat iterable, this looks like it would be an addition to the createDataFrame logic around [here|https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L700]
 
For the second problem, i.e. where the input model stays the same but rows are provided as numpy arrays, I'd be keen to re-use numpy's ndarray.tolist() method. Not only does it convert the underlying array into nested Python lists, which PySpark already supports, but it also converts list items of numpy-specific types to native Python scalars. From a brief glance, this would need to be taken into account in three places:
 
- In the set of prepare functions in session.py [here|https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L912]
- In the conversion function [here|https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1447]
- In the schema inference function [here|https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1280]
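A rough sketch of what such a normalisation step could look like (the helper name _normalize_numpy is hypothetical, not an existing PySpark function; this is just the tolist() idea from above made concrete):
{code:java}
import numpy as np

def _normalize_numpy(data):
    """Hypothetical helper: convert numpy inputs to plain Python.

    - A top-level ndarray becomes a (possibly nested) list via tolist().
    - Each row that is itself an ndarray becomes a Python list, which
      also converts numpy scalar types to native Python scalars.
    """
    if isinstance(data, np.ndarray):
        return data.tolist()
    return [row.tolist() if isinstance(row, np.ndarray) else row
            for row in data]

print(_normalize_numpy([np.arange(3), np.arange(3)]))  # [[0, 1, 2], [0, 1, 2]]
{code}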
 
This would handle inputs where rows are numpy arrays of any dtype, but a bit more work would be needed to support a row like the following:
 
{code:java}
[1, 2, numpy.int64(3)]{code}
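For that mixed case, each numpy scalar inside an otherwise plain row would need converting individually, e.g. via numpy's generic.item(). A sketch of that per-element conversion:
{code:java}
import numpy as np

row = [1, 2, np.int64(3)]
# np.generic is the base class of numpy scalar types such as np.int64;
# .item() converts a numpy scalar to the closest native Python type.
clean = [x.item() if isinstance(x, np.generic) else x for x in row]
print(clean)  # [1, 2, 3], all native ints
{code}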
 
Hope that all makes sense. Let me know which of the two problems you meant and are more interested in solving.
 
I'd also be keen to get a review of whether either of these approaches makes sense.

> Make it easier to convert numpy arrays to Spark Dataframes
> ----------------------------------------------------------
>
>                 Key: SPARK-37697
>                 URL: https://issues.apache.org/jira/browse/SPARK-37697
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.1.2
>            Reporter: Douglas Moore
>            Priority: Major
>
> Make it easier to convert numpy arrays to dataframes.
> Often we receive errors:
>  
> {code:java}
> df = spark.createDataFrame(numpy.arange(10))
> Can not infer schema for type: <class 'numpy.int64'>
> {code}
>  
> OR
> {code:java}
> df = spark.createDataFrame(numpy.arange(10.))
> Can not infer schema for type: <class 'numpy.float64'>
> {code}
>  
> Today (Spark 3.x) we have to:
> {code:java}
> spark.createDataFrame(pd.DataFrame(numpy.arange(10.))) {code}
> Make this easier with a direct conversion from Numpy arrays to Spark Dataframes.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org