Posted to issues@madlib.apache.org by "Frank McQuillan (JIRA)" <ji...@apache.org> on 2018/08/24 00:21:00 UTC

[jira] [Assigned] (MADLIB-1265) Formalize the read data code for parallel segment data loading

     [ https://issues.apache.org/jira/browse/MADLIB-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank McQuillan reassigned MADLIB-1265:
---------------------------------------

    Assignee: Nikhil

> Formalize the read data code for parallel segment data loading
> --------------------------------------------------------------
>
>                 Key: MADLIB-1265
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1265
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Utilities
>            Reporter: Frank McQuillan
>            Assignee: Nikhil
>            Priority: Major
>             Fix For: v2.0
>
>
> Story
> We need to productize the read data code so that the madlib.fit function can call it along with the user-defined Python script.
> Details
> * Figure out how to copy read_data.py and the user code to all hosts.
> * The read_data script will call the user data script and then write the model. This way the user can easily iterate on their python file.
> * Think about error handling in read_data.py. If we write all errors to a log file, do we delete the log file every time the madlib.fit UDF is called? Do we need to rotate the log files?
> * We need to make sure that we take a lock on the model file while writing. An alternative that avoids the need for locking is to create one file per segment. We can append the segment id to the file name and use that name to create the external readable table.
> * read_data.py can be copied to all the segments during madlib install. This file can take the user_defined_module as an argument which will then be dynamically imported.
> * How will the Python memory be managed? The postgres process for each segment spawns a connection process because of the INSERT, which in turn spawns another process to run our executable command from the CREATE WRITABLE EXTERNAL WEB TABLE definition. This means that the memory is probably not restricted by Greenplum.
> * Should read_data.py also get the column names and the grouping col information? Can we pass the metadata without duplication? madlib.fit, which will be a PL/Python function, can take care of this:
>     a. Get the absolute path of the user-defined Python file and copy it to all the segments (this is up for discussion; maybe there is a better way to copy the user-defined code to all the segments).
>     b. Parse the columns to include and get the types of all the columns using plpy.execute(). Write this metadata information along with any other relevant information to a yml file and copy this to all the segments.
> * The grp col value should be written to the model file.
> * Since the data is read through a pipe, read_data.py can also provide an api to stream rows to the user defined python file.
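
A minimal Python sketch of how three of the points above might fit together: dynamically importing the user-defined module, streaming rows from the pipe, and writing one model file per segment so no lock is needed. This is an illustration under assumptions, not the MADlib implementation. The function names, the ".model" file suffix, the tab delimiter, and the contract that the user module exposes fit(rows) returning a string are all hypothetical.

```python
import importlib
import os


def model_path_for_segment(model_dir, model_name, segment_id):
    """Build a per-segment file name so segments never write to the same file."""
    # Hypothetical naming scheme: <model_name>_seg<segment_id>.model
    return os.path.join(model_dir, "{0}_seg{1}.model".format(model_name, segment_id))


def stream_rows(pipe, delimiter="\t"):
    """Yield one row at a time (as a list of text fields) from the piped input."""
    for line in pipe:
        line = line.rstrip("\n")
        if line:  # skip blank lines
            yield line.split(delimiter)


def run(user_module_name, segment_id, pipe, model_dir, model_name="model"):
    """Import the user module by name, feed it the rows, write the model."""
    importlib.invalidate_caches()
    user_module = importlib.import_module(user_module_name)
    # Assumed contract (hypothetical): the user module defines fit(rows) -> str.
    model = user_module.fit(stream_rows(pipe))
    path = model_path_for_segment(model_dir, model_name, segment_id)
    with open(path, "w") as f:
        f.write(model)
    return path
```

In a real read_data.py the pipe would presumably be sys.stdin and the module name a command-line argument. The segment id would have to come from the environment of the external table command; Greenplum documents a GP_SEGMENT_ID variable for EXECUTE commands, but whether that fits this design is an open question here.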



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)