Posted to issues@spark.apache.org by "Leandro Ferrado (JIRA)" <ji...@apache.org> on 2016/10/12 02:40:20 UTC

[jira] [Comment Edited] (SPARK-11758) Missing Index column while creating a DataFrame from Pandas

    [ https://issues.apache.org/jira/browse/SPARK-11758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567329#comment-15567329 ] 

Leandro Ferrado edited comment on SPARK-11758 at 10/12/16 2:39 AM:
-------------------------------------------------------------------

Hi Holden. First, I would add just a single line in order to avoid the bad conversion of 'datetime' objects (so far, DataFrame.to_records(index=False) converts a Date column into a long-integer column). The idea is to convert all columns to string types first, so that DataFrame.to_records(index=False) doesn't mis-convert datetime.datetime objects. However, that can only be done if we define a pyspark.sql.dataframe.DataFrame with a schema of strings, or if we don't define a schema at all (in that case the function creates a schema of strings). So the modification only applies under the condition 'schema is None', and the snippet would be:

-------
if has_pandas and isinstance(data, pandas.DataFrame):
    if schema is None:
        schema = [str(x) for x in data.columns]
        # New line: cast all fields to strings since we don't have a defined schema
        data = data.astype(str)
    data = [r.tolist() for r in data.to_records(index=False)]
-------
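
As a quick illustration of the mis-conversion that the extra line guards against (a minimal sketch, not part of the patch; the column names 'name' and 'when' are invented, and the long-integer behaviour depends on the pandas/NumPy versions in use):

-------
import datetime
import pandas as pd

pdf = pd.DataFrame({'name': ['a', 'b'],
                    'when': [datetime.datetime(2016, 1, 1),
                             datetime.datetime(2016, 6, 1)]})

# Without the cast: 'when' is stored as datetime64[ns], and tolist() on a
# nanosecond-precision value yields a long integer (ns since the epoch).
print([r.tolist() for r in pdf.to_records(index=False)])

# With the cast: every field comes back as a plain string.
print([r.tolist() for r in pdf.astype(str).to_records(index=False)])
-------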

If the schema contains timestamps (e.g. TimestampType() or DateType()), a prior conversion of the Python datetime.datetime objects into a format that PySpark DataFrames can ingest is needed.
Regarding the 'index=False' argument, so far I can't think of a scenario in which an index per row is needed on a DataFrame, so that argument may be fine as it is; I'm not sure.
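
For the timestamp case, here is a minimal sketch of such a prior conversion, assuming a Spark 1.x sqlContext and a hypothetical pandas DataFrame pdf with columns 'name' and 'when':

-------
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

schema = StructType([StructField('name', StringType(), True),
                     StructField('when', TimestampType(), True)])

# Turn each pandas Timestamp back into a plain datetime.datetime before
# handing the rows to createDataFrame, so nothing goes through to_records().
rows = [(row['name'], row['when'].to_pydatetime())
        for _, row in pdf.iterrows()]
df = sqlContext.createDataFrame(rows, schema)
-------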



> Missing Index column while creating a DataFrame from Pandas 
> ------------------------------------------------------------
>
>                 Key: SPARK-11758
>                 URL: https://issues.apache.org/jira/browse/SPARK-11758
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 1.5.1
>         Environment: Linux Debian, PySpark, in local testing.
>            Reporter: Leandro Ferrado
>            Priority: Minor
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> In PySpark's SQLContext, when createDataFrame() is invoked with a pandas.DataFrame and a 'schema' of StructFields, the function _createFromLocal() converts the pandas.DataFrame but ignores two points:
> - The index column, because of the flag index=False
> - Timestamp records, because a Date column can't be the index and Pandas doesn't convert its records to a Timestamp type.
> So converting a DataFrame from Pandas to SQL works poorly in scenarios with temporal records.
> Doc: http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.to_records.html
> Affected code:
> def _createFromLocal(self, data, schema):
>     """
>     Create an RDD for DataFrame from a list or pandas.DataFrame, returns
>     the RDD and schema.
>     """
>     if has_pandas and isinstance(data, pandas.DataFrame):
>         if schema is None:
>             schema = [str(x) for x in data.columns]
>         data = [r.tolist() for r in data.to_records(index=False)]  # HERE
>     # ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org