Posted to issues@systemml.apache.org by "Mike Dusenberry (JIRA)" <ji...@apache.org> on 2017/07/27 00:38:00 UTC

[jira] [Created] (SYSTEMML-1813) Preprocessing simplification and cleanup

Mike Dusenberry created SYSTEMML-1813:
-----------------------------------------

Summary: Preprocessing simplification and cleanup
Key: SYSTEMML-1813
URL: https://issues.apache.org/jira/browse/SYSTEMML-1813
Project: SystemML
Issue Type: Improvement
Reporter: Mike Dusenberry
Assignee: Mike Dusenberry

In anticipation of near-future algorithmic improvements to the preprocessing that are intended to improve model training, this issue simplifies and cleans up the preprocessing code as follows.

- Previously, we were processing all slides into one large saved
DataFrame, and then splitting that DataFrame into train and validation
DataFrames. We should simplify this by splitting the slide numbers
into train and validation sets, and then processing those slides
separately. This will effectively skip the creation of the large DataFrame,
and remove the need to split that large DataFrame into train/val ones,
which should provide a large performance benefit. The DataFrame `union`
method can be used to combine two DataFrames row-wise.
- Previously, we maintained a list of "broken" slides that were manually
removed. We should remove that manual list, and instead add a
try/except filtering step to automatically remove problematic slides.
- We should move ad-hoc sampling code into a new `sample` function.
- We should move code to add row indices to a DataFrame into a new
`add_row_indices` function.
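
As a rough illustration of the first two items, the sketch below splits slide numbers into train and validation sets before any processing, then processes each set separately while skipping slides that raise an error. It is plain Python rather than the project's Spark code: `split_slide_nums`, `process_slides`, and `fake_process_slide` are hypothetical names, the 80/20 split and the seed are assumptions, and lists of tuples stand in for DataFrame rows. In the real pipeline the per-slide results would presumably be DataFrames combined row-wise with `union`.

```python
import random


def split_slide_nums(slide_nums, train_frac=0.8, seed=42):
    """Shuffle slide numbers and split them into train and validation lists.

    train_frac and seed are illustrative defaults, not values from the issue.
    """
    rng = random.Random(seed)
    shuffled = list(slide_nums)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    return shuffled[:n_train], shuffled[n_train:]


def process_slides(slide_nums, process_slide):
    """Process each slide, skipping any slide whose processing raises.

    This replaces a manually maintained list of "broken" slides with an
    automatic try/except filter. Returns (rows, skipped).
    """
    rows, skipped = [], []
    for num in slide_nums:
        try:
            rows.extend(process_slide(num))
        except Exception as err:  # deliberately broad: any failure drops the slide
            skipped.append((num, repr(err)))
    return rows, skipped


def fake_process_slide(num):
    """Hypothetical stand-in for the real slide reader; slide 3 is 'broken'."""
    if num == 3:
        raise IOError("cannot open slide 3")
    return [(num, tile) for tile in range(2)]


# Split first, then process each set on its own; no combined dataset is built.
train_nums, val_nums = split_slide_nums(range(1, 11))
train_rows, train_skipped = process_slides(train_nums, fake_process_slide)
val_rows, val_skipped = process_slides(val_nums, fake_process_slide)
```

Recording the skipped slide numbers alongside the error keeps the automatic filtering auditable, which a silent `except: pass` would not.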

The benefit is that near-future algorithmic improvements to the
preprocessing code will be much easier to incorporate.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)