Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/06/05 00:26:16 UTC
[GitHub] [beam] damccorm opened a new issue, #21578: Nullable Integer support with pandas not working as expected
damccorm opened a new issue, #21578:
URL: https://github.com/apache/beam/issues/21578
I am reading data from a Parquet file, and one of the columns is a nullable integer ([https://pandas.pydata.org/docs/user_guide/integer_na.html#integer-na](https://pandas.pydata.org/docs/user_guide/integer_na.html#integer-na)).
I am not 100% sure I declared it correctly:
```
import typing
from typing import Dict, Iterable, List, Optional

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class Record(typing.NamedTuple):
    port: Optional[int]
    #port: str


recFields=set([i for i in Record.__dict__.keys() if i[:1] != '_'])
beam.coders.registry.register_coder(Record,beam.coders.RowCoder)


def extractDF(tuple):
    df=tuple[1].to_pandas()
    print(type(df.port.dtype))
    return df


input_patterns = ['data/*.parquet']

#local runner
options = PipelineOptions(flags=[], type_check_additional='all')


def toRecords(df):
    #df["port"]=None
    return df.to_dict('records')


with beam.Pipeline(options=options) as pipeline:
    lines = (pipeline | 'Create file patterns' >> beam.Create(input_patterns)
             | 'Read Parquet files' >> beam.io.ReadAllFromParquetBatched(columns=recFields,with_filename=True)
             | 'Extract DF' >> beam.Map(extractDF )
             | 'To dictionaries' >> beam.FlatMap(toRecords)
             | 'ToRows' >> beam.Map(lambda x: Record(**x)).with_output_types(Record)
             | "print">> beam.Map(print))
```
This fails with a type error.
When I uncomment the line in `toRecords` that sets the whole `port` column to None, it works fine.
Imported from Jira [BEAM-14228](https://issues.apache.org/jira/browse/BEAM-14228). Original Jira may contain additional context.
Reported by: kohlerm.