Posted to issues@beam.apache.org by "Brian Hulette (Jira)" <ji...@apache.org> on 2021/12/16 20:16:00 UTC
[jira] [Assigned] (BEAM-13421) Python DeferredDataFrame.xs differs from Pandas
[ https://issues.apache.org/jira/browse/BEAM-13421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Brian Hulette reassigned BEAM-13421:
------------------------------------
Assignee: Brian Hulette
> Python DeferredDataFrame.xs differs from Pandas
> -----------------------------------------------
>
> Key: BEAM-13421
> URL: https://issues.apache.org/jira/browse/BEAM-13421
> Project: Beam
> Issue Type: Bug
> Components: dsl-dataframe
> Affects Versions: 2.34.0
> Environment: Tested in Jupyter Notebook running in Docker.
> The docker file is produced by a modified version of https://github.com/fozziethebeat/gpu-jupyter/blob/master/.build/Dockerfile
> Reporter: Keith Stevens
> Assignee: Brian Hulette
> Priority: P2
>
> When testing the `xs` method on DeferredDataFrames I'm seeing a few inconsistent results. I have two minimal examples that showcase the errors.
>
> First inconsistency: Beam's `xs` requires at least one index field to be left over after the selection, while Pandas does not.
> {code:java}
> # Imports were omitted from the original snippet; these are the standard module paths.
> import apache_beam as beam
> import numpy as np
> import pandas as pd
> from apache_beam.dataframe.io import read_parquet
> from apache_beam.options.pipeline_options import PipelineOptions
>
> with beam.Pipeline(options=PipelineOptions()) as pipeline:
>     df = pd.DataFrame(
>         np.array([
>             ['state', 'day1', 12],
>             ['state', 'day1', 1],
>             ['state', 'day2', 14],
>             ['county', 'day1', 9],
>         ]),
>         columns=['provider', 'time', 'value'])
>     # Create just one index field
>     df = df.set_index(['provider'])
>     df.to_parquet('test.parquet')
>
>     # Should print out
>     #           time value
>     # provider
>     # state     day1    12
>     # state     day1     1
>     # state     day2    14
>     print(df.xs('state'))
>
>     # Should emit the same data to a csv but instead dies due to
>     # Cannot remove 1 levels from an index with 1 levels: at least one level must be left.
>     test_df = (pipeline | read_parquet('test.parquet'))
>     test_df.xs('state').to_csv('test.csv')
> {code}
> Second inconsistency: with two index fields, Beam dies on a bare assertion that gives no indication of the cause.
> {code:java}
> # The Beam imports were omitted from the original snippet; these are the standard module paths.
> import apache_beam as beam
> import numpy as np
> import pandas as pd
> from apache_beam.dataframe.io import read_parquet
> from apache_beam.options.pipeline_options import PipelineOptions
>
> with beam.Pipeline(options=PipelineOptions()) as pipeline:
>     df = pd.DataFrame(
>         np.array([
>             ['state', 'day1', 12],
>             ['state', 'day1', 1],
>             ['state', 'day2', 14],
>             ['county', 'day1', 9],
>         ]),
>         columns=['provider', 'time', 'value'])
>     # Create two index fields to satisfy Beam
>     df = df.set_index(['provider', 'time'])
>     df.to_parquet('test.parquet')
>
>     # Should print out
>     #      value
>     # time
>     # day1    12
>     # day1     1
>     # day2    14
>     print(df.xs('state'))
>
>     # Dies with no clear error at
>     # /opt/conda/lib/python3.9/site-packages/apache_beam/dataframe/transforms.py in output_partitioning_in_stage(expr, stage)
>     # 305
>     # 306 # Anything that's not an input must have arguments
>     # 307 assert len(expr.args())
>     # 308
>     # 309 arg_partitionings = set(
>     test_df = (pipeline | read_parquet('test.parquet'))
>     test_df.xs('state').to_csv('test.csv')
> {code}
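
For reference, the pandas behavior that both examples expect Beam to match can be checked without Beam or Parquet. The following is a minimal sketch using only pandas and NumPy; the helper name `make_frame` is hypothetical and not part of the report, and the data is the same four-row frame used above.

```python
import numpy as np
import pandas as pd


def make_frame(index_cols):
    """Build the four-row frame from the report and index it by index_cols.

    Hypothetical helper, introduced here only for illustration.
    """
    df = pd.DataFrame(
        np.array([
            ['state', 'day1', 12],
            ['state', 'day1', 1],
            ['state', 'day2', 14],
            ['county', 'day1', 9],
        ]),
        columns=['provider', 'time', 'value'])
    return df.set_index(index_cols)


# Case 1: single index field. pandas returns the matching rows and
# keeps the 'provider' index, so no index level has to be left over.
single = make_frame(['provider']).xs('state')

# Case 2: two index fields. pandas drops the matched 'provider' level
# and leaves 'time' as the index.
multi = make_frame(['provider', 'time']).xs('state')

print(single)
print(multi)
```

In both cases pandas returns the three `state` rows, which is the output the Beam pipelines above would be expected to write to `test.csv`.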