You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@systemds.apache.org by GitBox <gi...@apache.org> on 2021/12/28 21:23:50 UTC

[GitHub] [systemds] mboehm7 edited a comment on pull request #1490: [SYSTEMDS-3256] Adding include vector parameter in cleaning primitives

mboehm7 edited a comment on pull request #1490:
URL: https://github.com/apache/systemds/pull/1490#issuecomment-1002285230


   Thanks for the continued effort on the cleaning framework @Shafaq-Siddiqi. After going through this PR, I had the feeling this splitting into train/test within the individual cleaning primitives is not at the right place. Instead of constructing the include vector during pipeline execution and splitting in each primitive, we could simply split the dataset before. In most primitives (let's take normalize() as an example) the state handling of colMins and colMaxs did not change at all, so upfront splitting would yield the same results. If I'm missing something, please let me know. 
   
   Coming back to state management, I would propose the following:
   * I'll add an evalList function that returns a list if a called function has multiple returns, that way we can reuse existing primitives without changes.
   * During pipeline enumeration, we then keep track of these state bundles and return them together with the top-k cleaning pipelines
   * We should further clearly separate between fitPipeline and applyPipeline and per primitive the frame that encodes primitives indicates the number and/or position of state variables.
   * We have pairs of fit (e.g., normalize) and apply (normalizeApply) functions where the former can call the latter as part of its implementation. If it helps to groups these pairs we can think of a library similar to our NN library
   that gets imported during pipeline enumeration. 
   
   Would that work for you? 
   Yes, this would work in general with all the primitives except the complex ones like mice. I will start working on the library idea as for enumeration we need wrappers with takes train and test datasets and internally call normalize and then normalizeApply inside a single box (i.e., function definition.)
   Thanks for the review and feedback.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org