Posted to dev@systemds.apache.org by GitBox <gi...@apache.org> on 2021/02/21 16:53:19 UTC

[GitHub] [systemds] Iseratho commented on pull request #1169: [WIP] Tokenizer Reference Implementation

Iseratho commented on pull request #1169:
URL: https://github.com/apache/systemds/pull/1169#issuecomment-782887952


   ### Consideration before merging the PR
   
   When representing the tokens in long format (i.e., a transformation that expands along rows: (rows: n, maxTokens: m, idCols: k) -> (m*n, k+2)), I get the following error in a subsequent `transformencode`: 
   > Job aborted due to stage failure: Task 0 in stage 10.0 failed 1 times, most recent failure: Lost task 0.0 in stage 10.0 (TID 18, localhost, executor driver): org.apache.sysds.runtime.DMLRuntimeException: Number of non-zeros mismatch on merge disjoint (target=1000x4, nnz target=4000, nnz source=3992)
   
   Unfortunately, I have not been able to fix this bug, since it does not occur in `tokenize` itself.
   However, I have since implemented a wide format (i.e., a transformation that expands along columns: (rows: n, maxTokens: m, idCols: k) -> (n, m+k)), with which I could not reproduce the issue. The current state of the PR uses this format in the test cases and passes all checks. 
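   
   To make the two output shapes concrete, here is a minimal sketch in plain Python. It is not the SystemDS implementation: the function names `to_long` and `to_wide`, the empty-string padding, and the assumption that the two extra long-format columns hold the token position and the token itself are all hypothetical and only serve as illustration.

```python
# Illustrative sketch only; not SystemDS code. All names are hypothetical.

def to_long(rows, max_tokens):
    """Expand along rows: one output row per (input row, token).

    Each output row has k id columns plus 2 extra columns, which are
    ASSUMED here to be the 1-based token position and the token.
    Yields at most n * max_tokens rows.
    """
    out = []
    for ids, tokens in rows:
        for pos, tok in enumerate(tokens[:max_tokens], start=1):
            out.append(list(ids) + [pos, tok])
    return out


def to_wide(rows, max_tokens, pad=""):
    """Expand along columns: one output row per input row.

    Each output row has k id columns plus max_tokens token columns;
    shorter token lists are padded (padding value is an assumption).
    """
    out = []
    for ids, tokens in rows:
        kept = list(tokens[:max_tokens])
        kept += [pad] * (max_tokens - len(kept))
        out.append(list(ids) + kept)
    return out


# n = 2 input rows, k = 1 id column, maxTokens m = 3
rows = [((1,), ["hello", "world"]), ((2,), ["foo"])]
long_rows = to_long(rows, max_tokens=3)
wide_rows = to_wide(rows, max_tokens=3)
```

   In this sketch the long format produces rows of width k+2 and the wide format produces exactly n rows of width m+k, matching the shapes given above.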
   
   My specific questions are:
   1. Does anyone know what the cause of the issue could be or how it could be fixed?
   2. Conversely, why does the issue not occur with the wide format? (I want to make sure that the code actually works rather than just hiding the error.)
   3. Should I drop support for the long format to circumvent the issue?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org