Posted to dev@systemds.apache.org by GitBox <gi...@apache.org> on 2021/02/21 22:26:33 UTC

[GitHub] [systemds] Iseratho edited a comment on pull request #1169: Tokenizer API and initial algorithms

Iseratho edited a comment on pull request #1169:
URL: https://github.com/apache/systemds/pull/1169#issuecomment-782887952


   ### Consideration when merging the PR
   
   When representing the tokens in long format (i.e., a transformation that expands on rows: (rows: n, maxTokens: m, idCols: k) -> (m*n, k+2)), I get the following error in a follow-up `transformencode`:
   > Job aborted due to stage failure: Task 0 in stage 10.0 failed 1 times, most recent failure: Lost task 0.0 in stage 10.0 (TID 18, localhost, executor driver): org.apache.sysds.runtime.DMLRuntimeException: Number of non-zeros mismatch on merge disjoint (target=1000x4, nnz target=4000, nnz source=3992)
   
   Unfortunately, I have not been able to fix this bug, since it does not occur in `tokenize` itself.
   However, I have since implemented a wide format (i.e., a transformation that expands on columns: (rows: n, maxTokens: m, idCols: k) -> (n, m+k)), with which I could not reproduce the issue. The current state of the PR uses this format in the test cases and passes all checks.
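
To make the two output shapes concrete, here is a minimal Python sketch of the long and wide layouts. It is only an illustration and not the code in this PR: the function names, the whitespace splitting, and the empty-string padding are all assumptions made for the example. In this sketch the long format emits one row per token that actually exists, so m*n is an upper bound on its row count.

```python
from typing import List, Sequence, Tuple

# Hypothetical input type: each row is (id column values, text to tokenize).
Row = Tuple[Tuple, str]


def tokenize_long(rows: Sequence[Row], max_tokens: int) -> List[tuple]:
    """Long format: one output row per token, laid out as ids + (position, token).

    For n input rows, m = max_tokens and k id columns, the result has
    at most m*n rows and k+2 columns.
    """
    out = []
    for ids, text in rows:
        tokens = text.split()[:max_tokens]  # assumed whitespace tokenizer
        for pos, tok in enumerate(tokens, start=1):
            out.append((*ids, pos, tok))
    return out


def tokenize_wide(rows: Sequence[Row], max_tokens: int, pad: str = "") -> List[tuple]:
    """Wide format: one output row per input row, laid out as ids + m token columns.

    The result has n rows and m+k columns; short rows are padded
    with `pad` (an assumption of this sketch).
    """
    out = []
    for ids, text in rows:
        tokens = text.split()[:max_tokens]
        tokens += [pad] * (max_tokens - len(tokens))
        out.append((*ids, *tokens))
    return out


# Two input rows (n=2), one id column (k=1), at most three tokens (m=3).
sample = [((1,), "a b c"), ((2,), "d e")]
long_rows = tokenize_long(sample, max_tokens=3)
wide_rows = tokenize_wide(sample, max_tokens=3)
```

With this sample, every long-format row has k+2 = 3 columns and every wide-format row has m+k = 4 columns.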
   
   I have commented out the test cases that do not work; they should be addressed in the future.
   If there is any other issue, just let me know. Other than this one issue, I consider this PR mergeable from my point of view.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org