Posted to issues@systemml.apache.org by "Mike Dusenberry (JIRA)" <ji...@apache.org> on 2017/08/18 18:09:00 UTC
[jira] [Closed] (SYSTEMML-1813) Preprocessing simplification and cleanup
[ https://issues.apache.org/jira/browse/SYSTEMML-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mike Dusenberry closed SYSTEMML-1813.
-------------------------------------
> Preprocessing simplification and cleanup
> ----------------------------------------
>
> Key: SYSTEMML-1813
> URL: https://issues.apache.org/jira/browse/SYSTEMML-1813
> Project: SystemML
> Issue Type: Improvement
> Reporter: Mike Dusenberry
> Assignee: Mike Dusenberry
> Fix For: SystemML 1.0
>
>
> In anticipation of near-future algorithmic improvements to the preprocessing that are intended to improve model training, this issue simplifies and cleans up the preprocessing code as follows:
> - Previously, we were processing all slides into one large saved
> DataFrame, and then splitting that DataFrame into train and validation
> DataFrames. We should simplify this by splitting the slide numbers
> into train and validation sets, and then processing those slides
> separately. This will effectively skip the creation of the large DataFrame,
> and remove the need to split that large DataFrame into train/val ones,
> which should provide a large performance benefit. The DataFrame `union`
> method can be used to combine two DataFrames row-wise.
> - Previously, we maintained a list of "broken" slides that were manually
> removed. We should remove that manual list, and instead add a
> try/except filtering step to automatically remove problematic slides.
> - We should move ad-hoc sampling code into a new `sample` function.
> - We should move code to add row indices to a DataFrame into a new
> `add_row_indices` function.
> The benefit is that near-future algorithmic improvements to the
> preprocessing code will be much easier to incorporate.
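
The first two items above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the project's actual code: lists stand in for Spark DataFrames (where the issue suggests `union` for combining per-slide results), and the names `split_slide_nums`, `process_slides`, and `fake_process_slide`, as well as the 80/20 split and the seed, are assumptions made up for this example.

```python
import random


def split_slide_nums(slide_nums, train_frac=0.8, seed=42):
    """Shuffle slide numbers and split them into train and validation lists.

    Hypothetical helper: the split fraction and seed are illustrative only.
    """
    nums = list(slide_nums)
    random.Random(seed).shuffle(nums)
    n_train = int(round(train_frac * len(nums)))
    return nums[:n_train], nums[n_train:]


def process_slides(slide_nums, process_slide):
    """Process each slide, skipping any slide whose processing raises.

    Replaces a manually maintained list of "broken" slides with a
    try/except filter. Returns the combined rows and the skipped slides.
    """
    rows, skipped = [], []
    for num in slide_nums:
        try:
            # With Spark DataFrames, this is where per-slide results
            # could be combined row-wise with `union`.
            rows.extend(process_slide(num))
        except Exception as err:
            skipped.append((num, repr(err)))
    return rows, skipped


def fake_process_slide(num):
    """Stand-in for real slide processing; every fifth slide is 'broken'."""
    if num % 5 == 0:
        raise OSError(f"cannot open slide {num}")
    return [(num, tile) for tile in range(3)]


# Split the slide numbers first, then process each set separately,
# so no single large intermediate collection is ever built and re-split.
train_nums, val_nums = split_slide_nums(range(1, 11))
train_rows, train_skipped = process_slides(train_nums, fake_process_slide)
val_rows, val_skipped = process_slides(val_nums, fake_process_slide)
```

Splitting on slide numbers before processing also keeps all tiles from a given slide in the same set, since the split happens at the slide level rather than the row level.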
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)