Posted to issues@beam.apache.org by "Keith Stevens (Jira)" <ji...@apache.org> on 2021/12/09 07:28:00 UTC
[jira] [Created] (BEAM-13421) Python DeferredDataFrame.xs differs from Pandas

Keith Stevens created BEAM-13421:
------------------------------------

             Summary: Python DeferredDataFrame.xs differs from Pandas
                 Key: BEAM-13421
                 URL: https://issues.apache.org/jira/browse/BEAM-13421
             Project: Beam
          Issue Type: Bug
          Components: sdk-py-core
    Affects Versions: 2.34.0
         Environment: Tested in Jupyter Notebook running in Docker.

The Dockerfile is produced by a modified version of https://github.com/fozziethebeat/gpu-jupyter/blob/master/.build/Dockerfile

            Reporter: Keith Stevens


When testing the `xs` method on a DeferredDataFrame, I see results that are inconsistent with Pandas. The two minimal examples below reproduce the errors.

 

First inconsistency: Beam's `xs` requires at least one index level to be left over after the selection, while Pandas does not.
{code:java}
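import apache_beam as beam
import numpy as np
import pandas as pd
from apache_beam.dataframe.io import read_parquet
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as pipeline: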
    df = pd.DataFrame(
        np.array([
            ['state', 'day1', 12],
            ['state', 'day1', 1],
            ['state', 'day2', 14],
            ['county', 'day1', 9],
        ]),
        columns=['provider', 'time', 'value'])
    # Create just one index field
    df = df.set_index(['provider'])
    df.to_parquet('test.parquet')
    
    # Should print out
    #           time value
    # provider            
    # state     day1    12
    # state     day1     1
    # state     day2    14
    print(df.xs('state'))
    
    # Should write the same rows to a CSV file, but instead fails with:
    #   Cannot remove 1 levels from an index with 1 levels: at least one level must be left.
    test_df = pipeline | read_parquet('test.parquet')
    test_df.xs('state').to_csv('test.csv')
{code}
Second inconsistency: with two index levels, Beam fails at a bare assertion that gives no explanatory message.
{code:java}
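import apache_beam as beam
import numpy as np
import pandas as pd
from apache_beam.dataframe.io import read_parquet
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as pipeline: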
    df = pd.DataFrame(
        np.array([
            ['state', 'day1', 12],
            ['state', 'day1', 1],
            ['state', 'day2', 14],
            ['county', 'day1', 9],
        ]),
        columns=['provider', 'time', 'value'])
    # Create two index fields to satisfy Beam
    df = df.set_index(['provider', 'time'])
    df.to_parquet('test.parquet')
    
    # Should print out
    #      value
    # time      
    # day1    12
    # day1     1
    # day2    14
    print(df.xs('state'))
    
    # Fails without a descriptive error message. The traceback ends at
    # /opt/conda/lib/python3.9/site-packages/apache_beam/dataframe/transforms.py in output_partitioning_in_stage(expr, stage)
    # 305 
    # 306       # Anything that's not an input must have arguments
    # 307       assert len(expr.args())
    # 308 
    # 309       arg_partitionings = set(
    test_df = pipeline | read_parquet('test.parquet')
    test_df.xs('state').to_csv('test.csv')
{code}
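For reference, the Pandas behaviour that both examples compare against can be checked on its own, without Beam or a parquet file. The sketch below is only an illustration: the variable names (`rows`, `single`, `multi`) are made up here, and the frame is built from a dict rather than the `np.array` call used above, on the assumption that keeping `value` numeric does not change how `xs` selects rows.

```python
import pandas as pd

# Same rows as in the report. Built from a dict (an assumption of this
# sketch) so that 'value' stays numeric instead of being coerced to str.
rows = {
    'provider': ['state', 'state', 'state', 'county'],
    'time': ['day1', 'day1', 'day2', 'day1'],
    'value': [12, 1, 14, 9],
}

# Case 1: a single index level. Pandas returns the matching rows and
# does not complain that no index level would be left over.
single = pd.DataFrame(rows).set_index(['provider'])
single_xs = single.xs('state')

# Case 2: two index levels. Pandas drops the matched 'provider' level
# and keeps 'time' as the index of the result.
multi = pd.DataFrame(rows).set_index(['provider', 'time'])
multi_xs = multi.xs('state')

print(single_xs)
print(multi_xs)
```

Both calls return the three `state` rows in Pandas, which is the result I would expect the Beam pipelines above to write to `test.csv`.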


