Posted to commits@beam.apache.org by "Ahmet Altay (JIRA)" <ji...@apache.org> on 2017/08/28 21:12:00 UTC

[jira] [Assigned] (BEAM-2815) Python DirectRunner is unusable with input files in the 100-250MB range

     [ https://issues.apache.org/jira/browse/BEAM-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ahmet Altay reassigned BEAM-2815:
---------------------------------

    Assignee: Ahmet Altay  (was: Thomas Groh)

> Python DirectRunner is unusable with input files in the 100-250MB range
> -----------------------------------------------------------------------
>
>                 Key: BEAM-2815
>                 URL: https://issues.apache.org/jira/browse/BEAM-2815
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-direct, sdk-py
>    Affects Versions: 2.1.0
>         Environment: python 2.7.10, beam 2.1, os x 
>            Reporter: Peter Hausel
>            Assignee: Ahmet Altay
>         Attachments: Screen Shot 2017-08-27 at 9.00.29 AM.png, Screen Shot 2017-08-27 at 9.06.00 AM.png
>
>
> The current Python DirectRunner implementation seems to be unusable with training data sets larger than tiny samples, which makes serious local development impossible or very cumbersome. I am aware of some of the limitations of the current DirectRunner implementation [1][2][3], but I am not sure whether this behavior is expected.
> [1] https://stackoverflow.com/a/44765621
> [2] https://issues.apache.org/jira/browse/BEAM-1442
> [3] https://beam.apache.org/documentation/runners/direct/
> Repro:
> The simple script below overwhelmed my laptop (MBP 2015), and I had to terminate the process after about 10 minutes (screenshots showing the high memory and CPU utilization are attached).
> {code}
> import argparse
>
> import apache_beam as beam
> from apache_beam.io import textio
> from apache_beam.options.pipeline_options import PipelineOptions
> from apache_beam.options.pipeline_options import SetupOptions
>
>
> def run(argv=None):
>     """Main entry point; defines and runs the pipeline."""
>     parser = argparse.ArgumentParser()
>     parser.add_argument('--input',
>                         dest='input',
>                         default='/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv',
>                         help='Input file to process.')
>     known_args, pipeline_args = parser.parse_known_args(argv)
>     pipeline_options = PipelineOptions(pipeline_args)
>     pipeline_options.view_as(SetupOptions).save_main_session = True
>     pipeline = beam.Pipeline(options=pipeline_options)
>     raw_data = (
>         pipeline
>         | 'ReadTrainData' >> textio.ReadFromText(known_args.input, skip_header_lines=1)
>         | 'Map' >> beam.Map(lambda line: line.lower())
>     )
>     result = pipeline.run()
>     result.wait_until_finish()
>     print(raw_data)
>
>
> if __name__ == '__main__':
>     run()
> {code}
> Example dataset:  https://catalog.data.gov/dataset/motor-vehicle-crashes-vehicle-information-beginning-2009
> For comparison:
> {code}
> lines = [line.lower() for line in open('/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv')]
> print(len(lines))
> {code}
> This plain Python script runs on the same hardware and dataset in 0m4.909s.
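
The plain-Python baseline quoted above could be made reproducible without downloading the dataset. The following is only a minimal sketch under stated assumptions: the synthetic file, its column names, and the helpers `write_synthetic_csv`, `lowercase_lines`, and `timed` are hypothetical and not part of the original report. Unlike the reporter's one-liner, it skips the header row so that it mirrors `skip_header_lines=1` in the pipeline. It uses only the standard library and should run on both Python 2.7 and Python 3.

```python
from __future__ import print_function

import io
import os
import tempfile
import time


def write_synthetic_csv(path, num_rows):
    """Write a header plus num_rows of made-up CSV rows (hypothetical data)."""
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(u'YEAR,CASE_ID,VEHICLE_MAKE\n')  # assumed column names
        for i in range(num_rows):
            f.write(u'2016,%d,FORD\n' % i)


def lowercase_lines(path):
    """Plain-Python baseline: skip the header, lowercase every remaining line."""
    with io.open(path, encoding='utf-8') as f:
        next(f)  # mirrors skip_header_lines=1 in the Beam pipeline
        return [line.lower() for line in f]


def timed(fn, *args):
    """Return (result, elapsed_seconds) for a single call of fn(*args)."""
    start = time.time()
    result = fn(*args)
    return result, time.time() - start


if __name__ == '__main__':
    tmp_dir = tempfile.mkdtemp()
    path = os.path.join(tmp_dir, 'synthetic.csv')
    write_synthetic_csv(path, 100000)
    lines, seconds = timed(lowercase_lines, path)
    print('%d lines lowercased in %.3fs' % (len(lines), seconds))
```

Pointing `lowercase_lines` at the real CSV, and wrapping the pipeline's `run()` call in the same `timed` helper, would give two wall-clock numbers measured the same way.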



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)