Posted to dev@systemds.apache.org by GitBox <gi...@apache.org> on 2021/03/01 10:26:17 UTC

[GitHub] [systemds] Baunsgaard opened a new pull request #1195: [SYSTEMDS-2880] MultiLogReg consistancy

Baunsgaard opened a new pull request #1195:
URL: https://github.com/apache/systemds/pull/1195


   This commit makes the MultiLogReg predict and training scripts consistent in their handling of Y.
   Furthermore, it contains a small optimization: isNaN is applied to the sum of the matrix instead of summing isNaN over the whole matrix. This is faster in general, since the matrix is not scanned twice; the small downside is that the number of NaNs is no longer counted, but that does not matter here.
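   
   Roughly, the NaN check changes shape as follows (a minimal sketch with illustrative names, not the exact script code; it assumes isNaN accepts both matrix and scalar inputs, as implied by the discussion below):
   
   ```
   X = rand(rows=100, cols=10);
   
   # before: materializes an indicator matrix and counts the NaNs
   containsNaN = sum(isNaN(X)) > 0;
   
   # after: a single aggregate; any NaN in X propagates through the sum
   containsNaN = isNaN(sum(X));
   
   if( containsNaN )
     stop("MultiLogReg: input contains NaN values");
   ```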





[GitHub] [systemds] mboehm7 commented on pull request #1195: [SYSTEMDS-2880] MultiLogReg consistancy

Posted by GitBox <gi...@apache.org>.
mboehm7 commented on pull request #1195:
URL: https://github.com/apache/systemds/pull/1195#issuecomment-788225280


   First of all (now with a bit more time), thanks for taking the initiative to fix the comments and other potential issues. A few comments on the code though:
   
   * Please leave the `Y = ifelse(Y <= 0, max_y + 1, Y)` as it is (converting all <=0 labels to a new maximum label). That way it naturally applies to binary classification, where labels are often given as -1 and 1 (see the small sketch after this list).
   * Remove the call `LT = decompress(LT)` (that is exactly why we should disable its use at the script level other than in tests).
   * Personally, I would remove the support for one-hot encoded labels y in order to reduce the surface of the API. If kept, please move the checks for min/max y inside the respective branch.
   * The isNaN(X) returns an empty block if there are no NaNs (without allocation), so its overhead is actually very small. The sum can be faster due to multi-threading, but if that is true we might want to introduce the rewrite `sum(isNaN(X))!=0 --> isNaN(sum(X))` instead of modifying such scripts.
   * Fix the introduced typos (e.g., `witch`, `alreaddy`).
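   
   For illustration, the remapping on -1/+1 labels behaves as sketched below (toy data, illustrative variable names):
   
   ```
   # toy binary labels given as -1 / +1
   Y = matrix("-1 1 1 -1 1", rows=5, cols=1);
   
   # remap all labels <= 0 to a new maximum label:
   # max(Y) is 1 here, so every -1 becomes 2 and the labels end up in {1, 2}
   max_y = max(Y);
   Y = ifelse(Y <= 0, max_y + 1, Y);
   
   print(toString(t(Y)));  # prints 2 1 1 2 1 (up to formatting)
   ```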





[GitHub] [systemds] Baunsgaard commented on pull request #1195: [SYSTEMDS-2880] MultiLogReg consistancy

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on pull request #1195:
URL: https://github.com/apache/systemds/pull/1195#issuecomment-788323417


   
   >     * Please leave the `Y = ifelse(Y <= 0, max_y + 1, Y)` as it is (converting all <=0 labels to a new maximum label). That way it naturally applies to binary classification, where labels are often given as -1 and 1.
   
   >     * Personally, I would remove the support for one-hot encoded labels y in order to reduce the surface of the API. If kept, please move the checks for min/max y inside the respective branch
   
   I think this is not intuitive and should result in an error; people should preprocess instead of relying on the function to fix it.
   What is frustrating is that the 0th label suddenly becomes the last label; this confused me with MNIST.
   Also, with the current implementation (and this is the main reason why I am playing around with this), the model can end up with more or fewer labels than intended when calling predict.
   
   If the dataset contains classes 1-5 but for some reason the training data only contains 1-4, then the call to table in the model creates values to classify into 4 labels rather than the 5 intended originally. This sometimes happens if you slice covtype, where the last label is sometimes missing.
   More common is the predict call: if we call it with a single data point, let's say with label 3, then this same way of handling Y does not work because we end up with 3 columns instead of 5. Predict can be fixed by looking at the other arguments to the function, but I would rather have the Y input match the input to training.
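   
   For example (toy data; this assumes the table builtin's optional output dimensions, which can be used to pin the number of classes):
   
   ```
   # the data set has 5 classes, but this training slice only contains labels 1-4
   y = matrix("1 2 3 4 2 1", rows=6, cols=1);
   
   Y1 = table(seq(1, nrow(y)), y);              # 4 columns, i.e., max(y)
   Y2 = table(seq(1, nrow(y)), y, nrow(y), 5);  # 5 columns, as intended
   
   print("cols: " + ncol(Y1) + " vs " + ncol(Y2));  # cols: 4 vs 5
   ```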
   
   The most intuitive API here for me is:
   
   * if 1 column -> one-hot encode (do not change negative or 0 values, but stop execution if they are present)
   * if more than one column -> assume one-hot encoded and verify that all values are 0 or 1; ideally I would make this the default and not allow 1-column inputs.
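   
   Something along these lines (a rough sketch of what I mean, not the actual script; the function name is hypothetical):
   
   ```
   encodeLabels = function(Matrix[Double] Y) return (Matrix[Double] Y_enc) {
     if( ncol(Y) == 1 ) {
       # single-column labels: fail instead of silently remapping non-positive labels
       if( min(Y) <= 0 )
         stop("Labels must be positive integers; please preprocess Y");
       Y_enc = table(seq(1, nrow(Y)), Y);
     }
     else {
       # multi-column input: assume one-hot encoding and verify it
       if( min(Y) < 0 | max(Y) > 1 | sum(rowSums(Y) != 1) > 0 )
         stop("Multi-column Y must be a valid one-hot encoding");
       Y_enc = Y;
     }
   }
   ```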
   
   >     * Remove the call `LT = decompress(LT)` (that is exactly why we should disable its use at script level other than in tests)
   
   Oops, debugging code, "classic".
   
   >     * The isNaN(X) returns an empty block if there are no NaNs (without allocation), so its overhead is actually very small. The sum can be faster due to multi-threading, but if that is true we might want to introduce the rewrite `sum(isNaN(X))!=0 --> isNaN(sum(X))` instead of modifying such scripts.
   
   Sure, I mentioned this at the last meeting and you said the same. I double-checked the code... and you are right: it should not matter in the case where there are no NaN values. If there are some, the rewrite is faster, but then the execution will crash anyway. With the rewrite I could avoid supporting unary operations in compression, but I guess I will just have to add it.
   





[GitHub] [systemds] Baunsgaard closed pull request #1195: [SYSTEMDS-2880] MultiLogReg consistancy

Posted by GitBox <gi...@apache.org>.
Baunsgaard closed pull request #1195:
URL: https://github.com/apache/systemds/pull/1195


   





[GitHub] [systemds] mboehm7 commented on pull request #1195: [SYSTEMDS-2880] MultiLogReg consistancy

Posted by GitBox <gi...@apache.org>.
mboehm7 commented on pull request #1195:
URL: https://github.com/apache/systemds/pull/1195#issuecomment-787846349


   Please hold off on bringing this in - some of the changes look like regressions. I already fixed the consistency of mlogreg train and predict, and it must not call stop on -1/1 labels.





[GitHub] [systemds] Baunsgaard commented on pull request #1195: [SYSTEMDS-2880] MultiLogReg consistancy

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on pull request #1195:
URL: https://github.com/apache/systemds/pull/1195#issuecomment-787852061


   Okay, so replace all negative values with 0?
   

