Posted to dev@systemds.apache.org by GitBox <gi...@apache.org> on 2021/01/30 21:15:59 UTC

[GitHub] [systemds] Iseratho commented on pull request #1169: [WIP] Tokenizer Reference Implementation

Iseratho commented on pull request #1169:
URL: https://github.com/apache/systemds/pull/1169#issuecomment-770282063


   ##  Update on the PR
   The PR now contains a reference implementation for the tokenization API. It can be used for simple tokenization and is extensible with new algorithms.
   
   It provides the following features:
   - [x] 2 simple tokenization algorithms (i.e., whitespace and ngram)
   - [x] 2 output representations (i.e., count and position)
   - [x] algorithms are configurable via a JSON spec
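   The combination of the two algorithms with the two output representations can be sketched in plain Python. This is an illustrative sketch only, not the PR's actual Java implementation; the function names and the exact n-gram behavior are assumptions:
   
   ```python
   from collections import Counter
   
   def whitespace_tokenize(text):
       # Whitespace algorithm: split on runs of whitespace.
       return text.split()
   
   def ngram_tokenize(text, n=3):
       # N-gram algorithm: sliding character n-grams over the raw string
       # (hypothetical variant; the PR's tokenizer may differ in detail).
       return [text[i:i + n] for i in range(len(text) - n + 1)]
   
   def to_count(tokens):
       # "count" representation: token -> frequency.
       return dict(Counter(tokens))
   
   def to_position(tokens):
       # "position" representation: (index, token) pairs.
       return list(enumerate(tokens))
   
   # Either representation works with either algorithm.
   tokens = whitespace_tokenize("to be or not to be")
   print(to_count(tokens))     # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
   print(to_position(tokens))  # [(0, 'to'), (1, 'be'), (2, 'or'), ...]
   ```
   
   The key point is that `to_count` and `to_position` consume a plain token list, so any new algorithm that produces such a list works with both representations unchanged.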
   
   Notes on design considerations:
   - [x] output representation is independent of the tokenizer algorithm
   - [x] distributable function, as it does not create a dictionary for the tokens (tokens can be encoded with `transformencode`)
   - [x] API similar to transform functions (e.g., using JSON spec)
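   By analogy with the existing transform specs, a tokenizer spec could look roughly like the following. The key names and values here are assumptions for illustration only; the PR defines the actual schema:
   
   ```json
   {
     "algo": "ngram",
     "algo_params": {"ngram_size": 3},
     "out": "count",
     "tokenize_col": 2,
     "id_cols": [1]
   }
   ```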
   
   There is still an open issue regarding correctly setting the DataCharacteristics for Spark execution.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org