Posted to issues@systemml.apache.org by "Glenn Weidner (JIRA)" <ji...@apache.org> on 2017/09/09 05:09:00 UTC
[jira] [Updated] (SYSTEMML-1813) Preprocessing simplification and cleanup
[ https://issues.apache.org/jira/browse/SYSTEMML-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Glenn Weidner updated SYSTEMML-1813:
------------------------------------
Fix Version/s: SystemML 0.15 (was: SystemML 1.0)
> Preprocessing simplification and cleanup
> ----------------------------------------
>
> Key: SYSTEMML-1813
> URL: https://issues.apache.org/jira/browse/SYSTEMML-1813
> Project: SystemML
> Issue Type: Improvement
> Reporter: Mike Dusenberry
> Assignee: Mike Dusenberry
> Fix For: SystemML 0.15
>
>
> In anticipation of near-future algorithmic improvements to the preprocessing that are intended to improve model training, this issue simplifies and cleans up the preprocessing code as follows:
> - Previously, we were processing all slides into one large saved
> DataFrame, and then splitting that DataFrame into train and validation
> DataFrames. We should simplify this by splitting the slide numbers
> into train and validation sets, and then processing those slides
> separately. This will effectively skip the creation of the large DataFrame,
> and remove the need to split that large DataFrame into train/val ones,
> which should provide a large performance benefit. The DataFrame `union`
> method can be used to combine two DataFrames row-wise.
> - Previously, we maintained a list of "broken" slides that were manually
> removed. We should remove that manual list, and instead add a
> try/except filtering step to automatically remove problematic slides.
> - We should move ad-hoc sampling code into a new `sample` function.
> - We should move code to add row indices to a DataFrame into a new
> `add_row_indices` function.
> The benefit is that near-future algorithmic improvements to the
> preprocessing code will be much easier to incorporate.
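
The first two items in the description above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's code: the function names `split_slide_nums`, `process_slides`, and `fake_process_slide`, the 80/20 split, and the seed are all hypothetical, and plain Python lists stand in for Spark DataFrames. In a Spark-based pipeline the per-slide results would presumably be DataFrames combined row-wise with `union` rather than lists extended in place.

```python
import random


def split_slide_nums(slide_nums, train_frac=0.8, seed=42):
    """Shuffle slide numbers and split them into train and validation lists.

    train_frac and seed are illustrative defaults, not values from the issue.
    """
    nums = list(slide_nums)
    random.Random(seed).shuffle(nums)
    cut = int(len(nums) * train_frac)
    return nums[:cut], nums[cut:]


def process_slides(slide_nums, process_slide):
    """Process each slide, skipping any slide whose processing raises.

    This replaces a manually maintained list of "broken" slides with a
    try/except filter. Returns the collected rows and the skipped slides.
    """
    rows, skipped = [], []
    for slide_num in slide_nums:
        try:
            rows.extend(process_slide(slide_num))
        except Exception as err:  # broad on purpose: any failure drops the slide
            skipped.append((slide_num, err))
    return rows, skipped


def fake_process_slide(slide_num):
    """Hypothetical stand-in that treats every fifth slide as unreadable."""
    if slide_num % 5 == 0:
        raise IOError(f"cannot open slide {slide_num}")
    return [(slide_num, tile) for tile in range(3)]


# Split the slide numbers first, then process each subset separately,
# so no single large combined dataset is ever built and re-split.
train_nums, val_nums = split_slide_nums(range(1, 21))
train_rows, train_skipped = process_slides(train_nums, fake_process_slide)
val_rows, val_skipped = process_slides(val_nums, fake_process_slide)
```

Because the split happens on slide numbers, every row from a given slide lands in exactly one of the two sets, and a slide that fails to load is recorded in the skipped list instead of aborting the run.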
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)