Posted to issues@systemml.apache.org by "Mike Dusenberry (JIRA)" <ji...@apache.org> on 2017/07/27 00:38:00 UTC

[jira] [Created] (SYSTEMML-1813) Preprocessing simplification and cleanup

Mike Dusenberry created SYSTEMML-1813:
-----------------------------------------

             Summary: Preprocessing simplification and cleanup
                 Key: SYSTEMML-1813
                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1813
             Project: SystemML
          Issue Type: Improvement
            Reporter: Mike Dusenberry
            Assignee: Mike Dusenberry


In anticipation of near-future algorithmic improvements to the preprocessing that are intended to improve model training, this issue simplifies and cleans up the preprocessing code as follows.

- Previously, we were processing all slides into one large saved
DataFrame, and then splitting that DataFrame into train and validation
DataFrames.  We should simplify this by splitting the slide numbers
into train and validation sets, and then processing those slides
separately.  This skips the creation of the large DataFrame and
removes the need to split it into train and validation DataFrames,
which should provide a large performance benefit.  The DataFrame `union`
method can be used to combine two DataFrames row-wise.
- Previously, we maintained a list of "broken" slides that were manually
removed.  We should remove that manual list, and instead add a
try/except filtering step to automatically remove problematic slides.
- We should move ad-hoc sampling code into a new `sample` function.
- We should move code to add row indices to a DataFrame into a new
`add_row_indices` function.
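
As a rough illustration of the first two points, here is a minimal sketch in plain Python, with lists of tuples standing in for DataFrames. The names `split_slide_nums`, `process_slide`, and `process_slides`, the 80/20 split, and the rule that marks a slide as "broken" are all hypothetical and chosen only for the example; they are not taken from the actual preprocessing code.

```python
import random


def split_slide_nums(slide_nums, train_frac=0.8, seed=42):
    """Shuffle slide numbers and split them into train and validation lists.

    The 0.8 fraction and the seed are assumptions for this sketch.
    """
    nums = sorted(slide_nums)
    random.Random(seed).shuffle(nums)
    cut = int(len(nums) * train_frac)
    return nums[:cut], nums[cut:]


def process_slide(slide_num):
    """Hypothetical stand-in for the real per-slide processing.

    Returns a list of (slide_num, tile_index) rows and raises for slides
    that cannot be read. Here, multiples of 7 play the "broken" slides.
    """
    if slide_num % 7 == 0:
        raise IOError(f"cannot open slide {slide_num}")
    return [(slide_num, i) for i in range(3)]


def process_slides(slide_nums):
    """Process each slide, skipping any slide whose processing fails."""
    rows, skipped = [], []
    for n in slide_nums:
        try:
            rows.extend(process_slide(n))
        except Exception:  # broad on purpose: any failure drops the slide
            skipped.append(n)
    return rows, skipped


# Split the slide numbers first, then process each set separately,
# so no combined dataset is ever built and later split.
train_nums, val_nums = split_slide_nums(range(1, 21))
train_rows, train_skipped = process_slides(train_nums)
val_rows, val_skipped = process_slides(val_nums)
```

With this shape, no manual list of broken slides is needed: a slide that fails is recorded in `skipped` and left out of the output. In Spark, the same idea would apply to per-slide DataFrames rather than Python lists.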

The benefit is that near-future algorithmic improvements to the
preprocessing code will be much easier to incorporate.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)