Posted to issues@arrow.apache.org by "Joris Van den Bossche (JIRA)" <ji...@apache.org> on 2019/07/31 07:59:00 UTC

[jira] [Commented] (ARROW-6001) [Python] Add from_pylist() and to_pylist() to pyarrow.Table to convert list of records

    [ https://issues.apache.org/jira/browse/ARROW-6001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16896888#comment-16896888 ] 

Joris Van den Bossche commented on ARROW-6001:
----------------------------------------------

I think the functionality to convert to / from a list of dicts (a "list of records") is something nice to have in pyarrow. The question is then where to fit it in or how to call the new method.

{quote}I think {{Table.from_arrays}} could be improved to accept other Python sequences{quote}

I personally would not add such functionality to {{from_arrays}}, which is working column-wise (the arrays you pass make up the columns of the resulting Table). That's a well defined scope, and I would keep functionality to convert row-wise input data in a separate function.

For {{from_pydict}}, it is similar: that function also currently works column-wise.
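
To make the column-wise versus row-wise distinction concrete, here is a minimal pure-Python sketch. It makes no pyarrow calls, and the helper name {{records_to_columns}} is hypothetical, used only for illustration. It shows the same data in both layouts and one way row-wise input could be transposed into the column-wise form, filling missing keys with None as the draft in the issue description does.

```python
# Pure-Python illustration; no pyarrow calls are made here.

# Column-wise: one sequence per column (the layout from_arrays / from_pydict work with).
columns = {"id": [1, 2, 3], "name": ["a", "b", None]}

# Row-wise: one dict per row (a "list of records"). The last row lacks "name".
records = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3}]


def records_to_columns(rows, names):
    """Transpose a list of dicts into a dict of column lists (hypothetical helper).

    Keys missing from a row become None.
    """
    return {name: [row.get(name) for row in rows] for name in names}


# Both layouts describe the same table.
assert records_to_columns(records, ["id", "name"]) == columns
```

The transposition step is exactly what a column-wise constructor does not need to do, which is one argument for keeping it in a separate function.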

So I think new methods such as {{from_pylist}} / {{to_pylist}} are the better approach.
The only thing I am not fully sure about is the name "pylist", as it does not directly convey that this is a list of rows as dicts (it could also be a list of column-wise arrays). In pandas, this is basically called {{from_records}}, but "records" could also be confusing in the Arrow context given that we have RecordBatches (although the method to convert a list of those is already called {{from_batches}}).
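
The naming concern can be illustrated with a small pure-Python sketch (again no pyarrow calls; {{columns_to_records}} is a hypothetical helper name). Both results below are ordinary Python lists, yet only the first is a list of rows as dicts, so "pylist" by itself does not say which one is meant.

```python
def columns_to_records(columns):
    """Turn a dict of equal-length column lists into a list of row dicts.

    Hypothetical helper for illustration only.
    """
    names = list(columns)
    # zip(*...) walks the columns in parallel, yielding one tuple per row.
    return [dict(zip(names, values)) for values in zip(*columns.values())]


columns = {"id": [1, 2], "name": ["a", "b"]}

# Interpretation 1: a list of rows, each row a dict.
as_rows = columns_to_records(columns)

# Interpretation 2: a list of column sequences.
as_column_list = list(columns.values())
```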

> [Python] Add from_pylist() and to_pylist() to pyarrow.Table to convert list of records
> --------------------------------------------------------------------------------------
>
>                 Key: ARROW-6001
>                 URL: https://issues.apache.org/jira/browse/ARROW-6001
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: David Lee
>            Priority: Minor
>
> I noticed that pyarrow.Table.to_pydict() exists, but pyarrow.Table.from_pydict() doesn't exist. There is a proposed ticket to create one, but it doesn't take into account potential mismatches between column order and number of columns.
> I'm including some code I've written and have been using to handle Arrow conversions to ordered dictionaries and lists of dictionaries. I've also included an example where this can be used to speed up pandas DataFrame.to_dict() by a factor of about 6.
>  
> {code:python}
> import pyarrow as pa
>
> def from_pylist(pylist, names=None, schema=None, safe=True):
>     """
>     Converts a python list of dictionaries to a pyarrow table
>     :param pylist: pylist list of dictionaries
>     :param names: list of column names
>     :param schema: pyarrow schema
>     :param safe: True or False
>     :return: arrow table
>     """
>     arrow_columns = list()
>     if schema:
>         for column in schema.names:
>             arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe, type=schema.types[schema.get_field_index(column)]))
>         arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
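>     else:
>         if not names:
>             # No names given: use the record keys, in order of first appearance
>             names = list(dict.fromkeys(key for row in pylist for key in row))
>         for column in names:
>             arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe))
>         arrow_table = pa.Table.from_arrays(arrow_columns, names)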
>     return arrow_table
> def to_pylist(arrow_table, index_columns=None):
>     """
>     Converts a pyarrow table to a python list of dictionaries
>     :param arrow_table: arrow table
>     :param index_columns: columns to index
>     :return: python list of dictionaries
>     """
>     pydict = arrow_table.to_pydict()
>     if index_columns:
>         columns = arrow_table.schema.names
>         columns.append("_index")
>         pylist = [{column: tuple([pydict[index_column][row] for index_column in index_columns]) if column == '_index' else pydict[column][row] for column in columns} for row in range(arrow_table.num_rows)]
>     else:
>         pylist = [{column: pydict[column][row] for column in arrow_table.schema.names} for row in range(arrow_table.num_rows)]
>     return pylist
> def from_pydict(pydict, names=None, schema=None, safe=True):
>     """
>     Converts a python ordered dictionary to a pyarrow table
>     :param pydict: ordered dictionary
>     :param names: list of column names
>     :param schema: pyarrow schema
>     :param safe: True or False
>     :return: arrow table
>     """
>     arrow_columns = list()
>     dict_columns = list(pydict.keys())
>     if schema:
>         for column in schema.names:
>             if column in pydict:
>                 arrow_columns.append(pa.array(pydict[column], safe=safe, type=schema.types[schema.get_field_index(column)]))
>             else:
>                 arrow_columns.append(pa.array([None] * len(pydict[dict_columns[0]]), safe=safe, type=schema.types[schema.get_field_index(column)]))
>         arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
>     else:
>         if not names:
>             names = dict_columns
>         for column in names:
>             if column in dict_columns:
>                 arrow_columns.append(pa.array(pydict[column], safe=safe))
>             else:
>                 arrow_columns.append(pa.array([None] * len(pydict[dict_columns[0]]), safe=safe))
>         arrow_table = pa.Table.from_arrays(arrow_columns, names)
>     return arrow_table
> def get_indexed_values(arrow_table, index_columns):
>     """
>     Returns the set of unique value tuples for a list of columns.
>     :param arrow_table: arrow_table
>     :param index_columns: list of column names
>     :return: set of tuples
>     """
>     pydict = arrow_table.to_pydict()
>     index_set = set([tuple([pydict[index_column][row] for index_column in index_columns]) for row in range(arrow_table.num_rows)])
>     return index_set
> {code}
> Here are my benchmarks comparing pandas to Arrow to Python against pandas DataFrame.to_dict():
>  
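> {code:python}
> import time
> import pyarrow as pa
>
> # panda_df1 and panda_df4 are pandas DataFrames defined elsewhere (not shown)
> # benchmark pandas conversion to Python objects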
> print('**benchmark 1 million rows**')
> start_time = time.time()
> python_df1 = panda_df1.to_dict(orient='records')
> total_time = time.time() - start_time
> print("pandas to python: " + str(total_time))
> start_time = time.time()
> arrow_df1 = pa.Table.from_pandas(panda_df1)
> pydict = arrow_df1.to_pydict()
> python_df1 = [{column: pydict[column][row] for column in arrow_df1.schema.names} for row in range(arrow_df1.num_rows)]
> total_time = time.time() - start_time
> print("pandas to arrow to python: " + str(total_time))
> print('**benchmark 4 million rows**')
> start_time = time.time()
> python_df4 = panda_df4.to_dict(orient='records')
> total_time = time.time() - start_time
> print("pandas to python:: " + str(total_time))
> start_time = time.time()
> arrow_df4 = pa.Table.from_pandas(panda_df4)
> pydict = arrow_df4.to_pydict()
> python_df4 = [{column: pydict[column][row] for column in arrow_df4.schema.names} for row in range(arrow_df4.num_rows)]
> total_time = time.time() - start_time
> print("pandas to arrow to python: " + str(total_time))
> {code}
>   
> {code:java}
> **benchmark 1 million rows**
> pandas to python: 13.204811334609985
> pandas to arrow to python: 2.00173282623291
> **benchmark 4 million rows**
> pandas to python:: 51.655067682266235
> pandas to arrow to python: 8.562284231185913
> {code}


