Posted to commits@beam.apache.org by "Peter Hausel (JIRA)" <ji...@apache.org> on 2017/08/28 21:05:00 UTC

[jira] [Created] (BEAM-2815) Python DirectRunner is unusable with input files in the 100-250MB range

Peter Hausel created BEAM-2815:
----------------------------------

             Summary: Python DirectRunner is unusable with input files in the 100-250MB range
                 Key: BEAM-2815
                 URL: https://issues.apache.org/jira/browse/BEAM-2815
             Project: Beam
          Issue Type: Bug
          Components: runner-direct
    Affects Versions: 2.1.0
         Environment: python 2.7.10, beam 2.1, os x 
            Reporter: Peter Hausel
            Assignee: Thomas Groh


The current Python DirectRunner implementation seems to be unusable with training data sets that are bigger than tiny samples, which makes serious local development impossible or at best very cumbersome. I am aware of some of the limitations of the current DirectRunner implementation [1][2][3], but I am not sure whether this behavior is expected.


[1] https://stackoverflow.com/a/44765621
[2] https://issues.apache.org/jira/browse/BEAM-1442
[3] https://beam.apache.org/documentation/runners/direct/

Repro:
The simple script below blew up my laptop (a 2015 MacBook Pro); I had to terminate the process after roughly 10 minutes. Screenshots showing the high memory and CPU utilization are attached.

{code}
import argparse

import apache_beam as beam
from apache_beam.io import textio
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions


def run(argv=None):
    """Main entry point; defines and runs the lower-casing pipeline."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--input',
                        dest='input',
                        default='/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv',
                        help='Input file to process.')
    known_args, pipeline_args = parser.parse_known_args(argv)
    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(SetupOptions).save_main_session = True
    pipeline = beam.Pipeline(options=pipeline_options)
    raw_data = (
        pipeline
        | 'ReadTrainData' >> textio.ReadFromText(known_args.input, skip_header_lines=1)
        | 'Map' >> beam.Map(lambda line: line.lower())
    )
    result = pipeline.run()
    result.wait_until_finish()
    print(raw_data)


if __name__ == '__main__':
    run()
{code}
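
As noted above, tiny samples are fine, so one way to keep iterating locally is to carve a small sample out of the full CSV and point the repro at that instead. A minimal sketch (the 10,000-line cut-off and the /tmp/crashes_sample.csv output path are arbitrary placeholders of mine, not part of the original report):

{code}
import itertools

# Copy the header plus roughly the first 10,000 rows into a small sample
# file; both the sample size and the destination path are arbitrary.
SRC = '/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv'
DST = '/tmp/crashes_sample.csv'

with open(SRC) as src, open(DST, 'w') as dst:
    dst.writelines(itertools.islice(src, 10000))
{code}

The repro can then be pointed at the sample via --input /tmp/crashes_sample.csv.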

Example dataset:  https://catalog.data.gov/dataset/motor-vehicle-crashes-vehicle-information-beginning-2009

For comparison:

{code}
lines = [line.lower() for line in open('/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv')]
print(len(lines))
{code}

This vanilla Python script runs on the same hardware and dataset in 0m4.909s.
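
The 0m4.909s figure looks like output from the shell's time builtin; for a self-contained, in-process version of the same baseline measurement, a minimal sketch would be (the use of time.time() and the print format are my additions, not part of the original report):

{code}
import time

INPUT = '/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv'

start = time.time()
with open(INPUT) as f:
    # Same work as the one-liner above: lower-case every line in memory.
    lines = [line.lower() for line in f]
print('%d lines lower-cased in %.3fs' % (len(lines), time.time() - start))
{code}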


