Posted to issues@beam.apache.org by "julien richard (Jira)" <ji...@apache.org> on 2020/08/21 12:36:00 UTC

[jira] [Updated] (BEAM-10573) CSV files are loaded several times if they are too large

     [ https://issues.apache.org/jira/browse/BEAM-10573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

julien richard updated BEAM-10573:
----------------------------------
    Description: 
I have this small sample:

 
{code:python}
import apache_beam as beam
import apache_beam.io.filebasedsource
import csv


class CsvFileSource(apache_beam.io.filebasedsource.FileBasedSource):
    def read_records(self, file_name, range_tracker):
        # Reads the whole file on every call; the range_tracker is ignored.
        with open(file_name, 'r') as file:
            reader = csv.DictReader(file)
            print("Load CSV file")
            for rec in reader:
                yield rec


if __name__ == '__main__':
    with beam.Pipeline() as p:
        count_feature = (p
            | 'create' >> beam.io.Read(CsvFileSource("myFile.csv"))
            | 'count element' >> beam.combiners.Count.Globally()
            | 'Print' >> beam.Map(print))
{code}
 

 

For some reason, if the CSV file is too large, it is loaded several times.

For example, for a file with 80,000 rows (18.5 MB), the file is loaded 5 times.

At the end I have 400,000 elements in my PCollection instead of 80,000 (80,000 rows × 5 loads).
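A possible explanation (my assumption, not confirmed here): FileBasedSource is splittable by default, and the read_records above ignores the range_tracker it receives, so every split the runner creates re-reads the whole file. A minimal sketch of a workaround, assuming FileBasedSource's standard splittable constructor argument (the class name is mine):

{code:python}
import apache_beam.io.filebasedsource
import csv


class UnsplittableCsvFileSource(apache_beam.io.filebasedsource.FileBasedSource):
    def __init__(self, file_pattern):
        # splittable=False tells Beam this source cannot be divided into
        # byte ranges, so read_records runs exactly once per matched file.
        # (Hypothetical workaround, not verified against this report.)
        super().__init__(file_pattern, splittable=False)

    def read_records(self, file_name, range_tracker):
        with open(file_name, 'r') as file:
            reader = csv.DictReader(file)
            print("Load CSV file")
            for rec in reader:
                yield rec
{code}

With splittable=False the source should be read exactly once per matched file, which would give the expected 80,000 elements.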

 


> CSV files are loaded several times if they are too large
> --------------------------------------------------------
>
>                 Key: BEAM-10573
>                 URL: https://issues.apache.org/jira/browse/BEAM-10573
>             Project: Beam
>          Issue Type: Bug
>          Components: io-py-files
>    Affects Versions: 2.22.0
>            Reporter: julien richard
>            Priority: P2
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)