Posted to issues@beam.apache.org by "Brian Hulette (Jira)" <ji...@apache.org> on 2021/12/16 20:16:00 UTC
[jira] [Assigned] (BEAM-13421) Python DeferredDataFrame.xs differs from Pandas

     [ https://issues.apache.org/jira/browse/BEAM-13421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brian Hulette reassigned BEAM-13421:
------------------------------------

    Assignee: Brian Hulette

> Python DeferredDataFrame.xs differs from Pandas
> -----------------------------------------------
>
>                 Key: BEAM-13421
>                 URL: https://issues.apache.org/jira/browse/BEAM-13421
>             Project: Beam
>          Issue Type: Bug
>          Components: dsl-dataframe
>    Affects Versions: 2.34.0
>         Environment: Tested in Jupyter Notebook running in Docker.
> The docker file is produced by a modified version of https://github.com/fozziethebeat/gpu-jupyter/blob/master/.build/Dockerfile
>            Reporter: Keith Stevens
>            Assignee: Brian Hulette
>            Priority: P2
>
> When testing the `xs` method on DeferredDataFrames I'm seeing a few inconsistent results.  I have two minimal examples that showcase the errors.
>  
> First inconsistency: Beam's `xs` requires at least one leftover index field, while Pandas does not.
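> {code:python}
> import numpy as np
> import pandas as pd
> import apache_beam as beam
> from apache_beam.dataframe.io import read_parquet
> from apache_beam.options.pipeline_options import PipelineOptions
>
> with beam.Pipeline(options=PipelineOptions()) as pipeline: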
>     df = pd.DataFrame(
>         np.array([
>             ['state', 'day1', 12],
>             ['state', 'day1', 1],
>             ['state', 'day2', 14],
>             ['county', 'day1', 9],
>         ]),
>         columns=['provider', 'time', 'value'])
>     # Create just one index field
>     df = df.set_index(['provider'])
>     df.to_parquet('test.parquet')
>     
>     # Should print out
>     #           time value
>     # provider            
>     # state     day1    12
>     # state     day1     1
>     # state     day2    14
>     print(df.xs('state'))
>     
>     # Should emit the same data to a csv but instead dies due to
>     # Cannot remove 1 levels from an index with 1 levels: at least one level must be left.
>     test_df = (pipeline | read_parquet('test.parquet'))
>     (
>         test_df.xs('state').to_csv('test.csv')
>     ) {code}
> Second inconsistency: with two index fields, Beam fails on a bare assertion with no clear error message.
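> {code:python}
> import numpy as np
> import pandas as pd
> import apache_beam as beam
> from apache_beam.dataframe.io import read_parquet
> from apache_beam.options.pipeline_options import PipelineOptions
>
> with beam.Pipeline(options=PipelineOptions()) as pipeline: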
>     df = pd.DataFrame(
>         np.array([
>             ['state', 'day1', 12],
>             ['state', 'day1', 1],
>             ['state', 'day2', 14],
>             ['county', 'day1', 9],
>         ]),
>         columns=['provider', 'time', 'value'])
>     # Create two index fields to satisfy Beam
>     df = df.set_index(['provider', 'time'])
>     df.to_parquet('test.parquet')
>     
>     # Should print out
>     #      value
>     # time      
>     # day1    12
>     # day1     1
>     # day2    14
>     print(df.xs('state'))
>     
>     # Dies with no clear error at
>     # /opt/conda/lib/python3.9/site-packages/apache_beam/dataframe/transforms.py in output_partitioning_in_stage(expr, stage)
>     # 305 
>     # 306       # Anything that's not an input must have arguments
>     # 307       assert len(expr.args())
>     # 308 
>     # 309       arg_partitionings = set(
>     test_df = (pipeline | read_parquet('test.parquet'))
>     (
>         test_df.xs('state').to_csv('test.csv')
>     ) {code}
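
For reference, the Pandas side of both examples can be checked without Beam. The sketch below is pandas-only and is not part of the original report. The helper names `make_frame`, `xs_single_index`, and `xs_multi_index` are hypothetical, and the frame is built from plain Python lists rather than `np.array` so that `value` stays an integer column. It shows the output that the report expects the Beam `xs` call to match in each case.

```python
import pandas as pd


def make_frame() -> pd.DataFrame:
    """Build the same four rows used in the report (hypothetical helper)."""
    return pd.DataFrame(
        [
            ["state", "day1", 12],
            ["state", "day1", 1],
            ["state", "day2", 14],
            ["county", "day1", 9],
        ],
        columns=["provider", "time", "value"],
    )


def xs_single_index() -> pd.DataFrame:
    """First case: a single index level; pandas keeps 'provider' as the index."""
    df = make_frame().set_index(["provider"])
    return df.xs("state")


def xs_multi_index() -> pd.DataFrame:
    """Second case: two index levels; pandas drops 'provider' and keeps 'time'."""
    df = make_frame().set_index(["provider", "time"])
    return df.xs("state")


if __name__ == "__main__":
    print(xs_single_index())
    print(xs_multi_index())
```

In both cases Pandas returns the three `state` rows. Only the second call drops an index level, which is the behavior the Beam error message in the first example appears to assume for every call.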
