Posted to issues@beam.apache.org by "Brian Hulette (Jira)" <ji...@apache.org> on 2020/10/12 15:59:00 UTC
[jira] [Commented] (BEAM-10573) CSV files are loaded several times if they are too large
[ https://issues.apache.org/jira/browse/BEAM-10573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212465#comment-17212465 ]
Brian Hulette commented on BEAM-10573:
--------------------------------------
CC: [~chamikara], [~pabloem] any idea what's going on here?
> CSV files are loaded several times if they are too large
> --------------------------------------------------------
>
> Key: BEAM-10573
> URL: https://issues.apache.org/jira/browse/BEAM-10573
> Project: Beam
> Issue Type: Bug
> Components: io-py-files
> Affects Versions: 2.22.0
> Reporter: julien richard
> Priority: P1
>
> I have this small sample:
>
> {code:python}
> import csv
>
> import apache_beam as beam
> import apache_beam.io.filebasedsource
>
>
> class CsvFileSource(apache_beam.io.filebasedsource.FileBasedSource):
>     def read_records(self, file_name, range_tracker):
>         with open(file_name, 'r') as file:
>             reader = csv.DictReader(file)
>             print("Load CSV file")
>             for rec in reader:
>                 yield rec
>
>
> if __name__ == '__main__':
>     with beam.Pipeline() as p:
>         count_feature = (p
>             | 'create' >> beam.io.Read(CsvFileSource("myFile.csv"))
>             | 'count element' >> beam.combiners.Count.Globally()
>             | 'Print' >> beam.Map(print)
>         )
> {code}
>
>
> For some reason, if the CSV file is too large it is loaded several times.
> For example, for a file with 80000 rows (18.5 MB), the file is loaded 5 times,
> and at the end I have 400000 elements in my PCollection instead of 80000.
>
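A plausible explanation (not confirmed in this thread) is that the runner splits a {{FileBasedSource}} into several byte ranges, and the {{read_records}} override above ignores its {{range_tracker}} argument, so each split re-reads the entire file. A minimal plain-Python sketch of that arithmetic, with no Beam dependency (the split count and row count here are illustrative):

```python
import csv
import io

# A tiny stand-in for the CSV file: a header plus 4 data rows.
CSV_TEXT = "id,value\n" + "".join(f"{i},x\n" for i in range(4))

def read_records_ignoring_range(csv_text, start, stop):
    """The buggy pattern from the report: the assigned byte range
    [start, stop) is ignored, so every call yields every row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for rec in reader:
        yield rec

# Suppose the runner splits the file into 5 byte ranges and invokes the
# reader once per range: each invocation re-reads all 4 rows.
splits = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
records = [rec for (s, e) in splits
           for rec in read_records_ignoring_range(CSV_TEXT, s, e)]
print(len(records))  # 5 splits x 4 rows = 20 records instead of 4
```

This matches the reported numbers: 5 re-reads of an 80000-row file give 400000 elements. A reader that cannot honor byte ranges would need to tell Beam the source is unsplittable (e.g. via the {{splittable}} constructor argument of {{FileBasedSource}}).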
--
This message was sent by Atlassian Jira
(v8.3.4#803005)