Posted to dev@systemds.apache.org by GitBox <gi...@apache.org> on 2020/07/20 17:45:47 UTC

[GitHub] [systemds] Iseratho commented on a change in pull request #993: [SYSTEMDS-265] Entity resolution pipelines and primitives.

Iseratho commented on a change in pull request #993:
URL: https://github.com/apache/systemds/pull/993#discussion_r457584817



##########
File path: scripts/staging/entity-resolution/README.md
##########
@@ -0,0 +1,99 @@
+# Entity Resolution
+
+## Pipeline design and primitives
+
+We provide two example scripts, `entity-clustering.dml` and `binary-entity-resolution.dml`. These scripts handle reading
+input files and writing output files, and call the functions provided in `primitives/pipeline.dml`.
+
+### Input files
+
+The provided scripts can read two types of input files. The token file is mandatory since it contains the row identifiers,
+while the embedding file is optional. Whether tokens and/or embeddings are actually used can be configured via command-line
+parameters to the scripts.
+
+#### Token files
+
+A token file is a CSV file with 3 columns. The first column is the string or integer row identifier, the second is the
+string token, and the third is the number of occurrences. This simple format serves as a bag-of-words representation.
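+
+For illustration only, a token file with two records (`id1`, `id2`) and made-up tokens and counts could look like this:
+
+```csv
+id1,entity,2
+id1,resolution,1
+id2,entity,1
+id2,matching,3
+```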
+
+#### Embedding files
+
+An embedding file is a CSV matrix file in which each row contains an embedding vector of arbitrary dimension. The rows are
+assumed to be in the same order as the row identifiers in the token file. This saves some computation and storage, but
+could be changed with minor modifications to the example scripts.
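+
+For example, assuming 4-dimensional embeddings and the two records above, an embedding file could look like this (values
+are purely illustrative):
+
+```csv
+0.12,-0.34,0.56,0.78
+0.91,0.02,-0.45,0.33
+```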
+
+### Primitives
+
+While the example scripts may be sufficient for many simple use cases, we aim to provide a toolkit of composable functions
+to facilitate more complex tasks. The top-level pipelines are defined as functions in `primitives/pipeline.dml`.
+The goal is that it should be relatively easy to copy one of these pipelines and swap out the primitive functions it uses
+in order to create a custom pipeline.
+
+To convert the input token file into a bag-of-words contingency table representation, we provide the functions
+`convert_frame_tokens_to_matrix_bow` and `convert_frame_tokens_to_matrix_bow_2` in `primitives/preprocessing.dml`.
+The latter is used to compute a compatible contingency table with a matching vocabulary for binary entity resolution.
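+
+A minimal sketch of such a preprocessing step in DML is shown below; the exact signature of
+`convert_frame_tokens_to_matrix_bow` is simplified here and should be taken from `primitives/preprocessing.dml`:
+
+```dml
+source("./primitives/preprocessing.dml") as pre;
+
+# Read the token file as a frame with (id, token, count) columns.
+tokens = read("data/tokens.csv", data_type="frame", format="csv", header=FALSE);
+
+# Hypothetical call: the real function may take additional arguments or return multiple values.
+X = pre::convert_frame_tokens_to_matrix_bow(tokens);
+```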
+
+We provide naive, constant-size blocking and locality-sensitive hashing (LSH) as functions in `primitives/blocking.dml`.
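+
+As a rough illustration of the idea behind constant-size blocking (not the actual implementation in
+`primitives/blocking.dml`), rows can be ordered by a cheap key and split into fixed-size blocks, so that candidate pairs
+are only generated within a block:
+
+```dml
+# X: n x d bag-of-words matrix, blockSize: illustrative constant
+blockSize = 100;
+n = nrow(X);
+key = rowSums(X);                                      # crude sort key, for illustration only
+ord = order(target=key, by=1, decreasing=FALSE, index.return=TRUE);  # ord[i,1] = original row at sorted position i
+blockIds = ceil(seq(1, n) / blockSize);                # block id of each sorted position
+```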
+
+For entity clustering, we provide only a simple clustering approach that makes each connected component of the adjacency
+matrix fully connected. This function is located in `primitives/clustering.dml`.
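+
+The underlying idea can be sketched in a few lines of DML via max-label propagation over the adjacency matrix (a
+simplified illustration, not the code in `primitives/clustering.dml`):
+
+```dml
+# A: symmetric n x n 0/1 adjacency matrix of matched pairs
+n = nrow(A);
+c = seq(1, n);                        # initial component label = row index
+diff = 1;
+while (diff > 0) {
+  c_new = max(c, rowMaxs(A * t(c)));  # adopt the largest label among neighbors
+  diff = sum(c_new != c);
+  c = c_new;
+}
+C = outer(c, t(c), "==");             # fully connected within each component (incl. self-pairs)
+```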
+
+To convert an adjacency matrix back into a list of pairs, we provide the functions `untable` and `untable_offset` in
+`primitives/postprocessing.dml`.
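+
+Conceptually, this corresponds to extracting the row and column indices of the non-zero entries, e.g. of the upper
+triangle to avoid duplicate pairs (a simplified sketch, not the actual `untable` code):
+
+```dml
+# A: n x n adjacency matrix; keep each pair only once via the upper triangle
+n = nrow(A);
+U = upper.tri(target=A, diag=FALSE, values=TRUE);
+rowIdx = seq(1, n) %*% matrix(1, rows=1, cols=n);     # row index of every cell
+colIdx = matrix(1, rows=n, cols=1) %*% t(seq(1, n));  # column index of every cell
+mask = U > 0;
+pairs = cbind(matrix(rowIdx * mask, rows=n*n, cols=1), matrix(colIdx * mask, rows=n*n, cols=1));
+pairs = removeEmpty(target=pairs, margin="rows");     # one (i, j) pair per row
+```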
+
+Finally, `primitives/evaluation.dml` defines some metrics that can be used to evaluate the performance of the entity
+resolution pipelines. They are used in the script `eval-entity-resolution.dml`. 
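+
+For instance, pairwise precision, recall, and F1-score can be computed from a predicted and a ground-truth pair matrix
+along these lines (a sketch of the general idea, not the exact code in `primitives/evaluation.dml`):
+
+```dml
+# P: predicted 0/1 pair matrix, T: ground-truth 0/1 pair matrix of the same shape
+tp = sum(P * T);
+precision = tp / max(sum(P), 1e-12);
+recall = tp / max(sum(T), 1e-12);
+f1 = 2 * precision * recall / max(precision + recall, 1e-12);
+print("F1-score: " + f1);
+```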
+
+## Testing and Examples
+
+The test data repository used to develop these scripts is available at
+[systemds-amls-project-data](https://github.com/skogler/systemds-amls-project-data). In the examples below, it is assumed
+that this repository is cloned as `data` in the SystemDS root folder. The data in that repository is sourced from the
+Uni Leipzig entity resolution [benchmark datasets](https://dbs.uni-leipzig.de/research/projects/object_matching/benchmark_datasets_for_entity_resolution).

Review comment:
    On the mentioned website there are no comparison values.
    For the DBLP-ACM dataset we get an F1-score of 0.8948.
    For the Affiliations dataset we get an F1-score of 0.1429.
    Note that we primarily focused on building the primitives and a basic pipeline.
    We suspect that these numbers could be improved considerably by focusing more on data preprocessing (e.g., stemming).



