Posted to issues@systemml.apache.org by "Mike Dusenberry (JIRA)" <ji...@apache.org> on 2017/08/18 18:09:00 UTC

[jira] [Closed] (SYSTEMML-1813) Preprocessing simplification and cleanup

     [ https://issues.apache.org/jira/browse/SYSTEMML-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Dusenberry closed SYSTEMML-1813.
-------------------------------------

> Preprocessing simplification and cleanup
> ----------------------------------------
>
>                 Key: SYSTEMML-1813
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1813
>             Project: SystemML
>          Issue Type: Improvement
>            Reporter: Mike Dusenberry
>            Assignee: Mike Dusenberry
>             Fix For: SystemML 1.0
>
>
> In anticipation of near-future algorithmic improvements to the preprocessing code that are intended to improve model training, this issue simplifies and cleans up that code as follows:
> - Previously, we were processing all slides into one large saved
> DataFrame, and then splitting that DataFrame into train and validation
> DataFrames.  We should simplify this by splitting the slide numbers
> into train and validation sets, and then processing those slides
> separately.  This will effectively skip the creation of the large DataFrame,
> and remove the need to split that large DataFrame into train/val ones,
> which should provide a large performance benefit.  The DataFrame `union`
> method can be used to combine two DataFrames row-wise.
> - Previously, we maintained a list of "broken" slides that were manually
> removed.  We should remove that manual list, and instead add a
> try/except filtering step to automatically remove problematic slides.
> - We should move ad-hoc sampling code into a new `sample` function.
> - We should move code to add row indices to a DataFrame into a new
> `add_row_indices` function.
> The benefit is that near-future algorithmic improvements to the
> preprocessing code will be much easier to incorporate.
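
The first two items above (splitting the slide numbers before processing, and replacing the manual "broken" list with a try/except filter) can be sketched in plain Python. This is only an illustration under stated assumptions, not the project's actual code: `split_slide_nums`, `process_slide`, and `process_slides` are hypothetical names, the failure condition is simulated, and lists of tuples stand in for the DataFrames that the real code would build (and could combine row-wise with `union`).

```python
import random


def split_slide_nums(slide_nums, train_frac=0.8, seed=None):
    """Shuffle slide numbers and split them into train and validation lists.

    Hypothetical helper: the split happens on slide numbers, before any
    slide is processed, so no combined dataset is ever built.
    """
    rng = random.Random(seed)
    shuffled = list(slide_nums)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    return shuffled[:n_train], shuffled[n_train:]


def process_slide(slide_num):
    """Hypothetical stand-in for per-slide preprocessing.

    Assumption for illustration only: slides whose number is divisible
    by 7 are treated as unreadable and raise an error.
    """
    if slide_num % 7 == 0:
        raise OSError("could not read slide %d" % slide_num)
    # Each "row" is (slide number, tile index) in place of real tile data.
    return [(slide_num, tile_idx) for tile_idx in range(3)]


def process_slides(slide_nums):
    """Process slides one by one, skipping any slide that raises."""
    rows = []
    skipped = []
    for slide_num in slide_nums:
        try:
            rows.extend(process_slide(slide_num))
        except Exception:
            # Problematic slides are filtered out automatically instead of
            # being listed by hand.
            skipped.append(slide_num)
    return rows, skipped


if __name__ == "__main__":
    train_nums, val_nums = split_slide_nums(range(1, 11), seed=0)
    train_rows, train_skipped = process_slides(train_nums)
    val_rows, val_skipped = process_slides(val_nums)
    print(len(train_rows), train_skipped, len(val_rows), val_skipped)
```

Recording the skipped slide numbers, rather than dropping them silently, keeps the automatic filter auditable; whether the real implementation does this is not stated in the issue.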


