Posted to issues@arrow.apache.org by "Tim Kellogg (Jira)" <ji...@apache.org> on 2020/02/22 15:18:00 UTC

[jira] [Commented] (ARROW-2818) [Python] Better error message when passing SparseDataFrame into Table.from_pandas

    [ https://issues.apache.org/jira/browse/ARROW-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17042602#comment-17042602 ] 

Tim Kellogg commented on ARROW-2818:
------------------------------------

Are there plans to support sparse tables/data frames?

In https://github.com/apache/arrow/issues/1894, the reason given for not supporting sparse tables was that Pandas had been unclear about its own sparse support. However, Pandas 1.0 moved that support to the DataFrame.sparse accessor, with sparse columns represented via a sparse dtype that interoperates with scipy sparse matrices (https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html). It seems that Pandas intends to keep maintaining sparse support.
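For anyone unfamiliar with the Pandas 1.0 layout, here is a minimal sketch of what I mean by support hanging off DataFrame.sparse. The column names and values are made up for illustration, and it assumes pandas >= 1.0 is installed.

```python
import pandas as pd

# Hypothetical one-hot style columns: mostly zeros, stored with a sparse dtype.
df = pd.DataFrame({
    "code_a": pd.arrays.SparseArray([0, 0, 1, 0], fill_value=0),
    "code_b": pd.arrays.SparseArray([1, 0, 0, 0], fill_value=0),
})

# Every column carries a SparseDtype rather than a plain integer dtype.
all_sparse = all(isinstance(dtype, pd.SparseDtype) for dtype in df.dtypes)

# Sparse-specific operations live on the .sparse accessor.
density = df.sparse.density       # fraction of values that are explicitly stored
dense_df = df.sparse.to_dense()   # ordinary DataFrame with dense columns

print(all_sparse, density)
print(dense_df.dtypes)
```

The dense copy at the end is an ordinary DataFrame, at the cost of materializing every zero.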

The problem isn’t going away. One-hot encoded data (a common representation in machine learning) is very sparse and will continue to be commonly used for the foreseeable future. 

There are 55,000 unique ICD-11 codes, so one-hot encoding ICD codes leads to very wide and sparse tables. There are many other examples, too.
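To put rough numbers on that, here is a back-of-the-envelope sketch. Only the 55,000 code count comes from the figure above; the row count, the one-code-per-row assumption, and the byte sizes are hypothetical choices for illustration.

```python
n_codes = 55_000        # unique ICD codes, figure quoted above
n_rows = 1_000_000      # hypothetical number of records
codes_per_row = 1       # assumption: each record carries a single code

dense_cells = n_rows * n_codes
nonzero_cells = n_rows * codes_per_row
density = nonzero_cells / dense_cells   # fraction of cells that are non-zero

# Assumed storage costs: 1 byte per dense cell (e.g. uint8),
# 4 bytes per stored column index in a sparse layout.
dense_bytes = dense_cells * 1
sparse_bytes = nonzero_cells * 4

print(f"density: {density:.6%}")
print(f"dense:  {dense_bytes / 1e9:.1f} GB")
print(f"sparse: {sparse_bytes / 1e6:.1f} MB")
```

Under these assumptions the dense table is several orders of magnitude larger than a sparse encoding of the same data.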

> [Python] Better error message when passing SparseDataFrame into Table.from_pandas
> ---------------------------------------------------------------------------------
>
>                 Key: ARROW-2818
>                 URL: https://issues.apache.org/jira/browse/ARROW-2818
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Wes McKinney
>            Assignee: Joris Van den Bossche
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.14.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This can be a rough edge for users. Note that pandas sparse support is being considered for deprecation
> original issue https://github.com/apache/arrow/issues/1894



--
This message was sent by Atlassian Jira
(v8.3.4#803005)