Posted to issues@beam.apache.org by "Keith Stevens (Jira)" <ji...@apache.org> on 2021/12/09 07:28:00 UTC
[jira] [Created] (BEAM-13421) Python DeferredDataFrame.xs differs from Pandas
Keith Stevens created BEAM-13421:
------------------------------------
Summary: Python DeferredDataFrame.xs differs from Pandas
Key: BEAM-13421
URL: https://issues.apache.org/jira/browse/BEAM-13421
Project: Beam
Issue Type: Bug
Components: sdk-py-core
Affects Versions: 2.34.0
Environment: Tested in Jupyter Notebook running in Docker.
The Docker image is built from a modified version of https://github.com/fozziethebeat/gpu-jupyter/blob/master/.build/Dockerfile
Reporter: Keith Stevens
When testing the `xs` method on DeferredDataFrames I'm seeing a few inconsistent results. I have two minimal examples that showcase the errors.
First inconsistency: Beam's `xs` requires at least one index level to be left over after the selection, while Pandas does not.
{code:python}
# Imports were omitted from the original snippet; these are the assumed ones.
import numpy as np
import pandas as pd

import apache_beam as beam
from apache_beam.dataframe.io import read_parquet
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as pipeline:
    df = pd.DataFrame(
        np.array([
            ['state', 'day1', 12],
            ['state', 'day1', 1],
            ['state', 'day2', 14],
            ['county', 'day1', 9],
        ]),
        columns=['provider', 'time', 'value'])
    # Create just one index field
    df = df.set_index(['provider'])
    df.to_parquet('test.parquet')

    # Should print out
    #           time value
    # provider
    # state     day1    12
    # state     day1     1
    # state     day2    14
    print(df.xs('state'))

    # Should emit the same data to a csv but instead dies due to
    # Cannot remove 1 levels from an index with 1 levels: at least one level must be left.
    test_df = pipeline | read_parquet('test.parquet')
    test_df.xs('state').to_csv('test.csv')
{code}
Second inconsistency: with two index fields, Beam dies on a bare assertion with no clear error message.
{code:python}
# Beam imports were omitted from the original snippet; these are the assumed ones.
import numpy as np
import pandas as pd

import apache_beam as beam
from apache_beam.dataframe.io import read_parquet
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as pipeline:
    df = pd.DataFrame(
        np.array([
            ['state', 'day1', 12],
            ['state', 'day1', 1],
            ['state', 'day2', 14],
            ['county', 'day1', 9],
        ]),
        columns=['provider', 'time', 'value'])
    # Create two index fields to satisfy Beam
    df = df.set_index(['provider', 'time'])
    df.to_parquet('test.parquet')

    # Should print out
    #      value
    # time
    # day1    12
    # day1     1
    # day2    14
    print(df.xs('state'))

    # Dies with no clear error at
    # /opt/conda/lib/python3.9/site-packages/apache_beam/dataframe/transforms.py in output_partitioning_in_stage(expr, stage)
    #     305
    #     306       # Anything that's not an input must have arguments
    #     307       assert len(expr.args())
    #     308
    #     309       arg_partitionings = set(
    test_df = pipeline | read_parquet('test.parquet')
    test_df.xs('state').to_csv('test.csv')
{code}
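For reference, here is a Pandas-only sketch of the behavior both examples are compared against. It does not use Beam at all. It builds the frame from a dict rather than `np.array` so that `value` keeps an integer dtype (that is a simplification on my part, not what the snippets above do), and the variable names `single`, `multi`, and `kept` are only illustrative.

```python
import pandas as pd

# Same rows as in the examples above.
df = pd.DataFrame({
    'provider': ['state', 'state', 'state', 'county'],
    'time': ['day1', 'day1', 'day2', 'day1'],
    'value': [12, 1, 14, 9],
})

# Case 1: a single index field. Pandas selects the matching rows and
# keeps 'provider' as the index instead of raising.
single = df.set_index(['provider']).xs('state')

# Case 2: two index fields. Pandas drops the selected 'provider' level
# and leaves 'time' as the index.
multi = df.set_index(['provider', 'time']).xs('state')

# With drop_level=False the selected level is kept, so both levels remain.
kept = df.set_index(['provider', 'time']).xs('state', drop_level=False)

print(single)
print(multi)
print(kept)
```

Whether `drop_level=False` sidesteps either Beam failure is untested here; it is shown only to make the Pandas semantics explicit.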