Posted to dev@opennlp.apache.org by "jzonthemtn (via GitHub)" <gi...@apache.org> on 2023/03/24 13:04:32 UTC

[GitHub] [opennlp] jzonthemtn opened a new pull request, #523: OPENNLP-1442: Sentence transformers

jzonthemtn opened a new pull request, #523:
URL: https://github.com/apache/opennlp/pull/523

   Thank you for contributing to Apache OpenNLP.
   
   In order to streamline the review of the contribution we ask you
   to ensure the following steps have been taken:
   
   ### For all changes:
   - [X] Is there a JIRA ticket associated with this PR? Is it referenced 
        in the commit message?
   
   - [X] Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
   
   - [X] Has your PR been rebased against the latest commit within the target branch (typically main)?
   
   - [X] Is your initial contribution a single, squashed commit?
   
   ### For code changes:
   - [X] Have you ensured that the full suite of tests is executed via mvn clean install at the root opennlp folder?
   - [X] Have you written or updated unit tests to verify your changes?
   - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? 
   - [ ] If applicable, have you updated the LICENSE file, including the main LICENSE file in opennlp folder?
   - [ ] If applicable, have you updated the NOTICE file, including the main NOTICE file found in opennlp folder?
   
   ### For documentation related changes:
   - [ ] Have you ensured that format looks appropriate for the output in which it is rendered?
   
   ### Note:
   Please ensure that once the PR is submitted, you check GitHub Actions for build issues and submit an update to your PR as soon as possible.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@opennlp.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [opennlp] atarora commented on a diff in pull request #523: OPENNLP-1442: Sentence transformers

Posted by "atarora (via GitHub)" <gi...@apache.org>.
atarora commented on code in PR #523:
URL: https://github.com/apache/opennlp/pull/523#discussion_r1154581552


##########
opennlp-dl/src/main/java/opennlp/dl/vectors/SentenceVectorsDL.java:
##########
@@ -0,0 +1,112 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.dl.vectors;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.LongBuffer;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+
+import ai.onnxruntime.OnnxTensor;
+import ai.onnxruntime.OrtEnvironment;
+import ai.onnxruntime.OrtException;
+import ai.onnxruntime.OrtSession;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import opennlp.dl.AbstractDL;
+import opennlp.dl.Tokens;
+import opennlp.tools.tokenize.Tokenizer;
+import opennlp.tools.tokenize.WordpieceTokenizer;
+
+/**
+ * Facilitates the generation of sentence vectors using
+ * a sentence-transformers model converted to ONNX.
+ */
+public class SentenceVectorsDL extends AbstractDL {
+
+  private static final Logger logger = LoggerFactory.getLogger(SentenceVectorsDL.class);
+
+  /**
+   * Creates an instance of the class.
+   * @param model The file name of a sentence vectors ONNX model.
+   * @param vocabulary The file name of the vocabulary file for the model.
+   * @throws OrtException Thrown if the model cannot be loaded.
+   * @throws IOException Thrown if the vocabulary file cannot be loaded.
+   */
+  public SentenceVectorsDL(final File model, final File vocabulary)
+      throws OrtException, IOException {
+
+    env = OrtEnvironment.getEnvironment();
+    session = env.createSession(model.getPath(), new OrtSession.SessionOptions());
+    vocab = loadVocab(new File(vocabulary.getPath()));
+    tokenizer = new WordpieceTokenizer(vocab.keySet());

Review Comment:
   Out of curiosity, and maybe a stupid question: does this tokenize the same way as defined in tokenizer_config.config?
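
For context on the question above, here is a minimal sketch of generic greedy longest-match-first WordPiece splitting driven only by a vocabulary set. It is an illustration under stated assumptions, not the code of OpenNLP's `WordpieceTokenizer` or of the Hugging Face tokenizer: the class name `WordPieceSketch`, the `##` continuation prefix, and the `[UNK]` fallback are assumptions chosen to match common BERT vocabularies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration class; not part of OpenNLP.
class WordPieceSketch {

  // Splits a single word into sub-word pieces using greedy longest-match-first lookup.
  public static List<String> tokenizeWord(String word, Set<String> vocab) {
    final List<String> pieces = new ArrayList<>();
    int start = 0;
    while (start < word.length()) {
      int end = word.length();
      String match = null;
      // Shrink the candidate from the right until it is found in the vocabulary.
      while (start < end) {
        String candidate = word.substring(start, end);
        if (start > 0) {
          candidate = "##" + candidate; // assumed continuation marker
        }
        if (vocab.contains(candidate)) {
          match = candidate;
          break;
        }
        end--;
      }
      if (match == null) {
        // No piece matched at this position: the whole word becomes unknown.
        return List.of("[UNK]");
      }
      pieces.add(match);
      start = end;
    }
    return pieces;
  }
}
```

Note that a purely vocabulary-driven splitter like this one applies no normalization of its own. If the model's tokenizer configuration requests preprocessing such as lowercasing, the caller would have to apply it before splitting, otherwise a cased word can fall through to `[UNK]` even though its lowercase form is in the vocabulary.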





[GitHub] [opennlp] rzo1 commented on a diff in pull request #523: OPENNLP-1442: Sentence transformers

Posted by "rzo1 (via GitHub)" <gi...@apache.org>.
rzo1 commented on code in PR #523:
URL: https://github.com/apache/opennlp/pull/523#discussion_r1150074429


##########
opennlp-dl/README.md:
##########
@@ -4,44 +4,50 @@ This module provides OpenNLP interface implementations for ONNX models using the
 
 **Important**: This does not provide the ability to train models. Model training is done outside of OpenNLP. This code provides the ability to use ONNX models from OpenNLP.
 
-To build with example models, download the models to the `/src/test/resources` directory. (These are the exported models described below.)
+Models used in the tests are available in the opennlp evaluation test data.
 
-```
-
-export OPENNLP_DATA=/tmp/
-mkdir /tmp/dl-doccat /tmp/dl-namefinder
+## NameFinderDL
 
-# Document categorizer model
-wget https://www.dropbox.com/s/n9uzs8r4xm9rhxb/model.onnx?dl=0 -O $OPENNLP_DATA/dl-doccat/model.onnx
-wget https://www.dropbox.com/s/aw6yjc68jw0jts6/vocab.txt?dl=0 -O $OPENNLP_DATA/dl-doccat/vocab.txt
+* Export a Huggingface NER model to ONNX, e.g.:
 
-# Namefinder model
-wget https://www.dropbox.com/s/zgogq65gs9tyfm1/model.onnx?dl=0 -O $OPENNLP_DATA/dl-namefinder/model.onnx
-wget https://www.dropbox.com/s/3byt1jggly1dg98/vocab.txt?dl=0 -O $OPENNLP_DATA/dl-/namefinder/vocab.txt
+```
+python -m transformers.onnx --model=dslim/bert-base-NER --feature token-classification exported
 ```
 
-## TokenNameFinder
+## DocumentCategorizerDL
 
-* Export a Huggingface NER model to ONNX, e.g.:
+* Export a Huggingface classification (e.g. sentiment) model to ONNX, e.g.:
 
 ```
-python -m transformers.onnx --model=dslim/bert-base-NER --feature token-classification exported
+python -m transformers.onnx --model=nlptown/bert-base-multilingual-uncased-sentiment --feature sequence-classification exported
 ```
 
-* Copy the exported model to `src/test/resources/namefinder/model.onnx`.
-* Copy the model's [vocab.txt](https://huggingface.co/dslim/bert-base-NER/tree/main) to `src/test/resources/namefinder/vocab.txt`.
+## SentenceVectors
 
-Now you can run the tests in `NameFinderDLTest`.
+* Convert a sentence vectors model to ONNX, e.g.:
 
-## DocumentCategorizer
-
-* Export a Huggingface classification (e.g. sentiment) model to ONNX, e.g.:
+Install dependencies:
 
 ```
-python -m transformers.onnx --model=nlptown/bert-base-multilingual-uncased-sentiment --feature sequence-classification exported
+python3 -m pip install optimum onnx onnxruntime
+```
+
+Convert the model:
+
 ```

Review Comment:
   I think `python` should be supported, according to https://github.com/github/linguist/blob/master/lib/linguist/languages.yml and https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-and-highlighting-code-blocks





[GitHub] [opennlp] kinow commented on a diff in pull request #523: OPENNLP-1442: Sentence transformers

Posted by "kinow (via GitHub)" <gi...@apache.org>.
kinow commented on code in PR #523:
URL: https://github.com/apache/opennlp/pull/523#discussion_r1149277871


##########
opennlp-dl/README.md:
##########
@@ -4,44 +4,50 @@ This module provides OpenNLP interface implementations for ONNX models using the
 
 **Important**: This does not provide the ability to train models. Model training is done outside of OpenNLP. This code provides the ability to use ONNX models from OpenNLP.
 
-To build with example models, download the models to the `/src/test/resources` directory. (These are the exported models described below.)
+Models used in the tests are available in the opennlp evaluation test data.
 
-```
-
-export OPENNLP_DATA=/tmp/
-mkdir /tmp/dl-doccat /tmp/dl-namefinder
+## NameFinderDL
 
-# Document categorizer model
-wget https://www.dropbox.com/s/n9uzs8r4xm9rhxb/model.onnx?dl=0 -O $OPENNLP_DATA/dl-doccat/model.onnx
-wget https://www.dropbox.com/s/aw6yjc68jw0jts6/vocab.txt?dl=0 -O $OPENNLP_DATA/dl-doccat/vocab.txt
+* Export a Huggingface NER model to ONNX, e.g.:
 
-# Namefinder model
-wget https://www.dropbox.com/s/zgogq65gs9tyfm1/model.onnx?dl=0 -O $OPENNLP_DATA/dl-namefinder/model.onnx
-wget https://www.dropbox.com/s/3byt1jggly1dg98/vocab.txt?dl=0 -O $OPENNLP_DATA/dl-/namefinder/vocab.txt
+```
+python -m transformers.onnx --model=dslim/bert-base-NER --feature token-classification exported
 ```
 
-## TokenNameFinder
+## DocumentCategorizerDL
 
-* Export a Huggingface NER model to ONNX, e.g.:
+* Export a Huggingface classification (e.g. sentiment) model to ONNX, e.g.:
 
 ```
-python -m transformers.onnx --model=dslim/bert-base-NER --feature token-classification exported
+python -m transformers.onnx --model=nlptown/bert-base-multilingual-uncased-sentiment --feature sequence-classification exported
 ```
 
-* Copy the exported model to `src/test/resources/namefinder/model.onnx`.
-* Copy the model's [vocab.txt](https://huggingface.co/dslim/bert-base-NER/tree/main) to `src/test/resources/namefinder/vocab.txt`.
+## SentenceVectors
 
-Now you can run the tests in `NameFinderDLTest`.
+* Convert a sentence vectors model to ONNX, e.g.:
 
-## DocumentCategorizer
-
-* Export a Huggingface classification (e.g. sentiment) model to ONNX, e.g.:
+Install dependencies:
 
 ```
-python -m transformers.onnx --model=nlptown/bert-base-multilingual-uncased-sentiment --feature sequence-classification exported
+python3 -m pip install optimum onnx onnxruntime
+```
+
+Convert the model:
+
 ```

Review Comment:
   Not sure if \`\`\`bash or \`\`\`python is supported on README files for GitHub, but shouldn't matter.





[GitHub] [opennlp] jzonthemtn commented on a diff in pull request #523: OPENNLP-1442: Sentence transformers

Posted by "jzonthemtn (via GitHub)" <gi...@apache.org>.
jzonthemtn commented on code in PR #523:
URL: https://github.com/apache/opennlp/pull/523#discussion_r1147990411


##########
opennlp-dl/src/main/java/opennlp/dl/AbstractDL.java:
##########
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.dl;
+
+import java.io.BufferedReader;
+import java.io.File;
+import java.io.FileReader;
+import java.io.IOException;
+import java.util.HashMap;
+import java.util.Map;
+
+import ai.onnxruntime.OrtEnvironment;
+import ai.onnxruntime.OrtSession;
+
+import opennlp.tools.tokenize.Tokenizer;
+
+/**
+ * Base class for OpenNLP deep-learning classes using ONNX Runtime.
+ */
+public abstract class AbstractDL {
+
+  public static final String INPUT_IDS = "input_ids";
+  public static final String ATTENTION_MASK = "attention_mask";
+  public static final String TOKEN_TYPE_IDS = "token_type_ids";
+
+  protected OrtEnvironment env;
+  protected OrtSession session;
+  protected Tokenizer tokenizer;
+  protected Map<String, Integer> vocab;
+
+  /**
+   * Loads a vocabulary file from disk.
+   * @param vocabFile The vocabulary file.
+   * @return A map of vocabulary words to integer IDs.
+   * @throws IOException Thrown if the vocabulary file cannot be opened and read.
+   */
+  public Map<String, Integer> loadVocab(File vocabFile) throws IOException {
+
+    final Map<String, Integer> v = new HashMap<>();
+
+    BufferedReader br = new BufferedReader(new FileReader(vocabFile.getPath()));
+    String line = br.readLine();

Review Comment:
   I don't think so but I will check.





[GitHub] [opennlp] jzonthemtn commented on a diff in pull request #523: OPENNLP-1442: Sentence transformers

Posted by "jzonthemtn (via GitHub)" <gi...@apache.org>.
jzonthemtn commented on code in PR #523:
URL: https://github.com/apache/opennlp/pull/523#discussion_r1148381770


##########
opennlp-dl/src/main/java/opennlp/dl/AbstractDL.java:
##########
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.dl;
+
+import java.io.BufferedReader;
+import java.io.File;
+import java.io.FileReader;
+import java.io.IOException;
+import java.util.HashMap;
+import java.util.Map;
+
+import ai.onnxruntime.OrtEnvironment;
+import ai.onnxruntime.OrtSession;
+
+import opennlp.tools.tokenize.Tokenizer;
+
+/**
+ * Base class for OpenNLP deep-learning classes using ONNX Runtime.
+ */
+public abstract class AbstractDL {
+
+  public static final String INPUT_IDS = "input_ids";
+  public static final String ATTENTION_MASK = "attention_mask";
+  public static final String TOKEN_TYPE_IDS = "token_type_ids";
+
+  protected OrtEnvironment env;
+  protected OrtSession session;
+  protected Tokenizer tokenizer;
+  protected Map<String, Integer> vocab;
+
+  /**
+   * Loads a vocabulary file from disk.
+   * @param vocabFile The vocabulary file.
+   * @return A map of vocabulary words to integer IDs.
+   * @throws IOException Thrown if the vocabulary file cannot be opened and read.
+   */
+  public Map<String, Integer> loadVocab(File vocabFile) throws IOException {
+
+    final Map<String, Integer> v = new HashMap<>();
+
+    BufferedReader br = new BufferedReader(new FileReader(vocabFile.getPath()));
+    String line = br.readLine();

Review Comment:
   It wasn't supposed to! Fixed.





[GitHub] [opennlp] rzo1 commented on a diff in pull request #523: OPENNLP-1442: Sentence transformers

Posted by "rzo1 (via GitHub)" <gi...@apache.org>.
rzo1 commented on code in PR #523:
URL: https://github.com/apache/opennlp/pull/523#discussion_r1147876849


##########
opennlp-dl/src/main/java/opennlp/dl/AbstractDL.java:
##########
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.dl;
+
+import java.io.BufferedReader;
+import java.io.File;
+import java.io.FileReader;
+import java.io.IOException;
+import java.util.HashMap;
+import java.util.Map;
+
+import ai.onnxruntime.OrtEnvironment;
+import ai.onnxruntime.OrtSession;
+
+import opennlp.tools.tokenize.Tokenizer;
+
+/**
+ * Base class for OpenNLP deep-learning classes using ONNX Runtime.
+ */
+public abstract class AbstractDL {
+
+  public static final String INPUT_IDS = "input_ids";
+  public static final String ATTENTION_MASK = "attention_mask";
+  public static final String TOKEN_TYPE_IDS = "token_type_ids";
+
+  protected OrtEnvironment env;
+  protected OrtSession session;
+  protected Tokenizer tokenizer;
+  protected Map<String, Integer> vocab;
+
+  /**
+   * Loads a vocabulary file from disk.
+   * @param vocabFile The vocabulary file.
+   * @return A map of vocabulary words to integer IDs.
+   * @throws IOException Thrown if the vocabulary file cannot be opened and read.
+   */
+  public Map<String, Integer> loadVocab(File vocabFile) throws IOException {
+
+    final Map<String, Integer> v = new HashMap<>();
+
+    BufferedReader br = new BufferedReader(new FileReader(vocabFile.getPath()));

Review Comment:
   Try-with-resources or Files.readAllLines(...)? We should also define the encoding (UTF-8?) instead of relying on the platform default charset.
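
The suggestion above can be sketched as follows. This is one possible shape, not the change that was actually committed: the class name `VocabLoaderSketch` is hypothetical, and UTF-8 is assumed because the comment proposes it. The reader is opened in a try-with-resources block so it is closed even when reading fails, and the charset is passed explicitly.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class; not part of OpenNLP.
class VocabLoaderSketch {

  // Reads one token per line and maps each token to its zero-based line index.
  public static Map<String, Integer> loadVocab(Path vocabFile) throws IOException {
    final Map<String, Integer> vocab = new HashMap<>();
    // The explicit charset avoids depending on the platform default;
    // try-with-resources closes the reader on both success and failure.
    try (BufferedReader reader = Files.newBufferedReader(vocabFile, StandardCharsets.UTF_8)) {
      int id = 0;
      String line;
      while ((line = reader.readLine()) != null) {
        vocab.put(line, id++);
      }
    }
    return vocab;
  }
}
```

Because the loop reads every line, including the first, the token on line one receives ID 0.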



##########
opennlp-dl/src/main/java/opennlp/dl/AbstractDL.java:
##########
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.dl;
+
+import java.io.BufferedReader;
+import java.io.File;
+import java.io.FileReader;
+import java.io.IOException;
+import java.util.HashMap;
+import java.util.Map;
+
+import ai.onnxruntime.OrtEnvironment;
+import ai.onnxruntime.OrtSession;
+
+import opennlp.tools.tokenize.Tokenizer;
+
+/**
+ * Base class for OpenNLP deep-learning classes using ONNX Runtime.
+ */
+public abstract class AbstractDL {
+
+  public static final String INPUT_IDS = "input_ids";
+  public static final String ATTENTION_MASK = "attention_mask";
+  public static final String TOKEN_TYPE_IDS = "token_type_ids";
+
+  protected OrtEnvironment env;
+  protected OrtSession session;
+  protected Tokenizer tokenizer;
+  protected Map<String, Integer> vocab;
+
+  /**
+   * Loads a vocabulary file from disk.
+   * @param vocabFile The vocabulary file.
+   * @return A map of vocabulary words to integer IDs.
+   * @throws IOException Thrown if the vocabulary file cannot be opened and read.
+   */
+  public Map<String, Integer> loadVocab(File vocabFile) throws IOException {
+
+    final Map<String, Integer> v = new HashMap<>();
+
+    BufferedReader br = new BufferedReader(new FileReader(vocabFile.getPath()));
+    String line = br.readLine();

Review Comment:
   I guess it is intended to skip the first line?





[GitHub] [opennlp] rzo1 commented on a diff in pull request #523: OPENNLP-1442: Sentence transformers

Posted by "rzo1 (via GitHub)" <gi...@apache.org>.
rzo1 commented on code in PR #523:
URL: https://github.com/apache/opennlp/pull/523#discussion_r1148384275


##########
opennlp-dl/src/main/java/opennlp/dl/AbstractDL.java:
##########
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.dl;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.stream.Stream;
+
+import ai.onnxruntime.OrtEnvironment;
+import ai.onnxruntime.OrtSession;
+
+import opennlp.tools.tokenize.Tokenizer;
+
+/**
+ * Base class for OpenNLP deep-learning classes using ONNX Runtime.
+ */
+public abstract class AbstractDL {
+
+  public static final String INPUT_IDS = "input_ids";
+  public static final String ATTENTION_MASK = "attention_mask";
+  public static final String TOKEN_TYPE_IDS = "token_type_ids";
+
+  protected OrtEnvironment env;
+  protected OrtSession session;
+  protected Tokenizer tokenizer;
+  protected Map<String, Integer> vocab;
+
+  /**
+   * Loads a vocabulary file from disk.
+   * @param vocabFile The vocabulary file.
+   * @return A map of vocabulary words to integer IDs.
+   * @throws IOException Thrown if the vocabulary file cannot be opened and read.
+   */
+  public Map<String, Integer> loadVocab(final File vocabFile) throws IOException {
+
+    final Map<String, Integer> vocab = new HashMap<>();
+
+    int counter = 0;
+
+    try (Stream<String> lines = Files.lines(Path.of(vocabFile.getPath()))) {
+
+      lines.forEach(line -> {
+        vocab.put(line, counter);

Review Comment:
   Do we need to increment the counter? (I think it was there before.)
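
A note on the snippet under review: a local `int` captured by a lambda must be effectively final in Java, so it cannot simply be incremented inside `forEach`. One way to keep the stream-based loop and still assign increasing IDs is a mutable holder such as `AtomicInteger`. The sketch below is an assumption about how this could be written, not the fix that was merged; the class name `VocabCounterSketch` and the UTF-8 charset are hypothetical choices.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

// Hypothetical illustration class; not part of OpenNLP.
class VocabCounterSketch {

  // Maps each line of the vocabulary file to an increasing zero-based ID.
  public static Map<String, Integer> loadVocab(Path vocabFile) throws IOException {
    final Map<String, Integer> vocab = new HashMap<>();
    // The reference is final, but the value it holds can change inside the lambda.
    final AtomicInteger counter = new AtomicInteger(0);
    try (Stream<String> lines = Files.lines(vocabFile, StandardCharsets.UTF_8)) {
      // forEachOrdered keeps file order, so IDs follow line numbers.
      lines.forEachOrdered(line -> vocab.put(line, counter.getAndIncrement()));
    }
    return vocab;
  }
}
```

Without the increment, every token would map to the same ID, which is the problem the comment points at.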





[GitHub] [opennlp] jzonthemtn commented on pull request #523: OPENNLP-1442: Sentence transformers

Posted by "jzonthemtn (via GitHub)" <gi...@apache.org>.
jzonthemtn commented on PR #523:
URL: https://github.com/apache/opennlp/pull/523#issuecomment-1491983210

   > > I have not yet put those model files in the eval test data. So those tests will fail until they're in there.
   > 
   > Is it available, so we can merge?
   
   Not yet -- I will try to add them later today.




[GitHub] [opennlp] jzonthemtn merged pull request #523: OPENNLP-1442: Sentence transformers

Posted by "jzonthemtn (via GitHub)" <gi...@apache.org>.
jzonthemtn merged PR #523:
URL: https://github.com/apache/opennlp/pull/523




[GitHub] [opennlp] rzo1 commented on a diff in pull request #523: OPENNLP-1442: Sentence transformers

Posted by "rzo1 (via GitHub)" <gi...@apache.org>.
rzo1 commented on code in PR #523:
URL: https://github.com/apache/opennlp/pull/523#discussion_r1150073553


##########
opennlp-dl/README.md:
##########
@@ -4,44 +4,50 @@ This module provides OpenNLP interface implementations for ONNX models using the
 
 **Important**: This does not provide the ability to train models. Model training is done outside of OpenNLP. This code provides the ability to use ONNX models from OpenNLP.
 
-To build with example models, download the models to the `/src/test/resources` directory. (These are the exported models described below.)
+Models used in the tests are available in the opennlp evaluation test data.
 
-```
-
-export OPENNLP_DATA=/tmp/
-mkdir /tmp/dl-doccat /tmp/dl-namefinder
+## NameFinderDL
 
-# Document categorizer model
-wget https://www.dropbox.com/s/n9uzs8r4xm9rhxb/model.onnx?dl=0 -O $OPENNLP_DATA/dl-doccat/model.onnx
-wget https://www.dropbox.com/s/aw6yjc68jw0jts6/vocab.txt?dl=0 -O $OPENNLP_DATA/dl-doccat/vocab.txt
+* Export a Huggingface NER model to ONNX, e.g.:
 
-# Namefinder model
-wget https://www.dropbox.com/s/zgogq65gs9tyfm1/model.onnx?dl=0 -O $OPENNLP_DATA/dl-namefinder/model.onnx
-wget https://www.dropbox.com/s/3byt1jggly1dg98/vocab.txt?dl=0 -O $OPENNLP_DATA/dl-/namefinder/vocab.txt
+```
+python -m transformers.onnx --model=dslim/bert-base-NER --feature token-classification exported
 ```
 
-## TokenNameFinder
+## DocumentCategorizerDL
 
-* Export a Huggingface NER model to ONNX, e.g.:
+* Export a Huggingface classification (e.g. sentiment) model to ONNX, e.g.:
 
 ```
-python -m transformers.onnx --model=dslim/bert-base-NER --feature token-classification exported
+python -m transformers.onnx --model=nlptown/bert-base-multilingual-uncased-sentiment --feature sequence-classification exported
 ```
 
-* Copy the exported model to `src/test/resources/namefinder/model.onnx`.
-* Copy the model's [vocab.txt](https://huggingface.co/dslim/bert-base-NER/tree/main) to `src/test/resources/namefinder/vocab.txt`.
+## SentenceVectors
 
-Now you can run the tests in `NameFinderDLTest`.
+* Convert a sentence vectors model to ONNX, e.g.:

Review Comment:
   We should remove the `*` to get a consistent README, imho.
   
   ![image](https://user-images.githubusercontent.com/13417392/228144540-885c9a21-5d84-4148-959a-1910e1767f66.png)
   





[GitHub] [opennlp] jzonthemtn commented on a diff in pull request #523: OPENNLP-1442: Sentence transformers

Posted by "jzonthemtn (via GitHub)" <gi...@apache.org>.
jzonthemtn commented on code in PR #523:
URL: https://github.com/apache/opennlp/pull/523#discussion_r1150595315


##########
opennlp-dl/README.md:
##########
@@ -4,44 +4,50 @@ This module provides OpenNLP interface implementations for ONNX models using the
 
 **Important**: This does not provide the ability to train models. Model training is done outside of OpenNLP. This code provides the ability to use ONNX models from OpenNLP.
 
-To build with example models, download the models to the `/src/test/resources` directory. (These are the exported models described below.)
+Models used in the tests are available in the opennlp evaluation test data.
 
-```
-
-export OPENNLP_DATA=/tmp/
-mkdir /tmp/dl-doccat /tmp/dl-namefinder
+## NameFinderDL
 
-# Document categorizer model
-wget https://www.dropbox.com/s/n9uzs8r4xm9rhxb/model.onnx?dl=0 -O $OPENNLP_DATA/dl-doccat/model.onnx
-wget https://www.dropbox.com/s/aw6yjc68jw0jts6/vocab.txt?dl=0 -O $OPENNLP_DATA/dl-doccat/vocab.txt
+* Export a Huggingface NER model to ONNX, e.g.:
 
-# Namefinder model
-wget https://www.dropbox.com/s/zgogq65gs9tyfm1/model.onnx?dl=0 -O $OPENNLP_DATA/dl-namefinder/model.onnx
-wget https://www.dropbox.com/s/3byt1jggly1dg98/vocab.txt?dl=0 -O $OPENNLP_DATA/dl-/namefinder/vocab.txt
+```
+python -m transformers.onnx --model=dslim/bert-base-NER --feature token-classification exported
 ```
 
-## TokenNameFinder
+## DocumentCategorizerDL
 
-* Export a Huggingface NER model to ONNX, e.g.:
+* Export a Huggingface classification (e.g. sentiment) model to ONNX, e.g.:
 
 ```
-python -m transformers.onnx --model=dslim/bert-base-NER --feature token-classification exported
+python -m transformers.onnx --model=nlptown/bert-base-multilingual-uncased-sentiment --feature sequence-classification exported
 ```
 
-* Copy the exported model to `src/test/resources/namefinder/model.onnx`.
-* Copy the model's [vocab.txt](https://huggingface.co/dslim/bert-base-NER/tree/main) to `src/test/resources/namefinder/vocab.txt`.
+## SentenceVectors
 
-Now you can run the tests in `NameFinderDLTest`.
+* Convert a sentence vectors model to ONNX, e.g.:

Review Comment:
   Yes, good catch. I have removed the `*`s.





[GitHub] [opennlp] jzonthemtn commented on pull request #523: OPENNLP-1442: Sentence transformers

Posted by "jzonthemtn (via GitHub)" <gi...@apache.org>.
jzonthemtn commented on PR #523:
URL: https://github.com/apache/opennlp/pull/523#issuecomment-1492192829

   > > > I have not yet put those model files in the eval test data. So those tests will fail until they're in there.
   > > 
   > > 
   > > Is it available, so we can merge?
   > 
   > Not yet -- I will try to add them later today.
   
   I updated the test data to include the model files.


[GitHub] [opennlp] jzonthemtn commented on a diff in pull request #523: OPENNLP-1442: Sentence transformers

Posted by "jzonthemtn (via GitHub)" <gi...@apache.org>.
jzonthemtn commented on code in PR #523:
URL: https://github.com/apache/opennlp/pull/523#discussion_r1154812833


##########
opennlp-dl/src/main/java/opennlp/dl/vectors/SentenceVectorsDL.java:
##########
@@ -0,0 +1,112 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.dl.vectors;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.LongBuffer;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+
+import ai.onnxruntime.OnnxTensor;
+import ai.onnxruntime.OrtEnvironment;
+import ai.onnxruntime.OrtException;
+import ai.onnxruntime.OrtSession;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import opennlp.dl.AbstractDL;
+import opennlp.dl.Tokens;
+import opennlp.tools.tokenize.Tokenizer;
+import opennlp.tools.tokenize.WordpieceTokenizer;
+
+/**
+ * Facilitates the generation of sentence vectors using
+ * a sentence-transformers model converted to ONNX.
+ */
+public class SentenceVectorsDL extends AbstractDL {
+
+  private static final Logger logger = LoggerFactory.getLogger(SentenceVectorsDL.class);
+
+  /**
+   * Creates an instance of the class.
+   * @param model The file name of a sentence vectors ONNX model.
+   * @param vocabulary The file name of the vocabulary file for the model.
+   * @throws OrtException Thrown if the model cannot be loaded.
+   * @throws IOException Thrown if the vocabulary file cannot be loaded.
+   */
+  public SentenceVectorsDL(final File model, final File vocabulary)
+      throws OrtException, IOException {
+
+    env = OrtEnvironment.getEnvironment();
+    session = env.createSession(model.getPath(), new OrtSession.SessionOptions());
+    vocab = loadVocab(new File(vocabulary.getPath()));
+    tokenizer = new WordpieceTokenizer(vocab.keySet());

Review Comment:
   Not a stupid question at all -- it does not. It uses a WordPiece-based tokenizer. We still need an implementation of `Tokenizer` that can use a pre-trained model. The sentence transformers tokenizer [models](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/blob/main/tokenizer_config.json) were made using `BertTokenizer`, which is based on WordPiece. When I tested this branch I did a simplified vector search by generating vectors for multiple sentences. I then used those vectors to find the most related sentences. Visually, the results made sense, but that's not to say there aren't lots of ways to improve.
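
   For illustration only, the kind of simplified vector search described above could be sketched as below. This is not the code used in the test: the class and method names are hypothetical, the input vectors are toy placeholders rather than real model output, and cosine similarity is assumed as the relatedness measure.

```java
// Sketch only: hypothetical names, not part of OpenNLP or this PR.
final class VectorSearchSketch {

  /** Cosine similarity of two equal-length vectors; 0 if either has zero norm. */
  static double cosine(float[] a, float[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("Vector lengths differ");
    }
    double dot = 0;
    double normA = 0;
    double normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    if (normA == 0 || normB == 0) {
      return 0;
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  /** Index of the candidate most similar to the query, or -1 if there are none. */
  static int mostSimilar(float[] query, float[][] candidates) {
    int best = -1;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (int i = 0; i < candidates.length; i++) {
      double score = cosine(query, candidates[i]);
      if (score > bestScore) {
        bestScore = score;
        best = i;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    // Placeholder vectors standing in for sentence embeddings.
    float[][] sentences = { {1f, 0f, 0f}, {0f, 1f, 0f}, {0.9f, 0.1f, 0f} };
    float[] query = {0.8f, 0.2f, 0f};
    System.out.println("Most related sentence index: " + mostSimilar(query, sentences));
  }
}
```

   In practice each `float[]` would come from the sentence vectors model, one per sentence, and the query vector would be generated the same way.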



[GitHub] [opennlp] jzonthemtn commented on a diff in pull request #523: OPENNLP-1442: Sentence transformers

Posted by "jzonthemtn (via GitHub)" <gi...@apache.org>.
jzonthemtn commented on code in PR #523:
URL: https://github.com/apache/opennlp/pull/523#discussion_r1150597785


##########
opennlp-dl/README.md:
##########
@@ -4,44 +4,50 @@ This module provides OpenNLP interface implementations for ONNX models using the
 
 **Important**: This does not provide the ability to train models. Model training is done outside of OpenNLP. This code provides the ability to use ONNX models from OpenNLP.
 
-To build with example models, download the models to the `/src/test/resources` directory. (These are the exported models described below.)
+Models used in the tests are available in the opennlp evaluation test data.
 
-```
-
-export OPENNLP_DATA=/tmp/
-mkdir /tmp/dl-doccat /tmp/dl-namefinder
+## NameFinderDL
 
-# Document categorizer model
-wget https://www.dropbox.com/s/n9uzs8r4xm9rhxb/model.onnx?dl=0 -O $OPENNLP_DATA/dl-doccat/model.onnx
-wget https://www.dropbox.com/s/aw6yjc68jw0jts6/vocab.txt?dl=0 -O $OPENNLP_DATA/dl-doccat/vocab.txt
+* Export a Huggingface NER model to ONNX, e.g.:
 
-# Namefinder model
-wget https://www.dropbox.com/s/zgogq65gs9tyfm1/model.onnx?dl=0 -O $OPENNLP_DATA/dl-namefinder/model.onnx
-wget https://www.dropbox.com/s/3byt1jggly1dg98/vocab.txt?dl=0 -O $OPENNLP_DATA/dl-/namefinder/vocab.txt
+```
+python -m transformers.onnx --model=dslim/bert-base-NER --feature token-classification exported
 ```
 
-## TokenNameFinder
+## DocumentCategorizerDL
 
-* Export a Huggingface NER model to ONNX, e.g.:
+* Export a Huggingface classification (e.g. sentiment) model to ONNX, e.g.:
 
 ```
-python -m transformers.onnx --model=dslim/bert-base-NER --feature token-classification exported
+python -m transformers.onnx --model=nlptown/bert-base-multilingual-uncased-sentiment --feature sequence-classification exported
 ```
 
-* Copy the exported model to `src/test/resources/namefinder/model.onnx`.
-* Copy the model's [vocab.txt](https://huggingface.co/dslim/bert-base-NER/tree/main) to `src/test/resources/namefinder/vocab.txt`.
+## SentenceVectors
 
-Now you can run the tests in `NameFinderDLTest`.
+* Convert a sentence vectors model to ONNX, e.g.:
 
-## DocumentCategorizer
-
-* Export a Huggingface classification (e.g. sentiment) model to ONNX, e.g.:
+Install dependencies:
 
 ```
-python -m transformers.onnx --model=nlptown/bert-base-multilingual-uncased-sentiment --feature sequence-classification exported
+python3 -m pip install optimum onnx onnxruntime
+```
+
+Convert the model:
+
 ```

Review Comment:
   I added those.



[GitHub] [opennlp] kinow commented on a diff in pull request #523: OPENNLP-1442: Sentence transformers

Posted by "kinow (via GitHub)" <gi...@apache.org>.
kinow commented on code in PR #523:
URL: https://github.com/apache/opennlp/pull/523#discussion_r1149281568


##########
opennlp-dl/src/main/java/opennlp/dl/doccat/DocumentCategorizerDL.java:
##########
@@ -223,41 +214,14 @@ private int getKey(String value) {
 
   }
 
-  /**
-   * Loads a vocabulary file from disk.
-   * @param vocab The vocabulary file.
-   * @return A map of vocabulary words to integer IDs.
-   * @throws IOException Thrown if the vocabulary file cannot be opened and read.
-   */
-  private Map<String, Integer> loadVocab(File vocab) throws IOException {
-
-    final Map<String, Integer> v = new HashMap<>();
-
-    BufferedReader br = new BufferedReader(new FileReader(vocab.getPath()));
-    String line = br.readLine();
-    int x = 0;
-
-    while (line != null) {
-
-      line = br.readLine();
-      x++;
-
-      v.put(line, x);
-
-    }
-
-    return v;
-
-  }
-

Review Comment:
   Nice simplification :+1: !



##########
opennlp-dl/README.md:
##########
@@ -4,44 +4,50 @@ This module provides OpenNLP interface implementations for ONNX models using the
 
 **Important**: This does not provide the ability to train models. Model training is done outside of OpenNLP. This code provides the ability to use ONNX models from OpenNLP.
 
-To build with example models, download the models to the `/src/test/resources` directory. (These are the exported models described below.)
+Models used in the tests are available in the opennlp evaluation test data.
 
-```
-
-export OPENNLP_DATA=/tmp/
-mkdir /tmp/dl-doccat /tmp/dl-namefinder
+## NameFinderDL
 
-# Document categorizer model
-wget https://www.dropbox.com/s/n9uzs8r4xm9rhxb/model.onnx?dl=0 -O $OPENNLP_DATA/dl-doccat/model.onnx
-wget https://www.dropbox.com/s/aw6yjc68jw0jts6/vocab.txt?dl=0 -O $OPENNLP_DATA/dl-doccat/vocab.txt
+* Export a Huggingface NER model to ONNX, e.g.:
 
-# Namefinder model
-wget https://www.dropbox.com/s/zgogq65gs9tyfm1/model.onnx?dl=0 -O $OPENNLP_DATA/dl-namefinder/model.onnx
-wget https://www.dropbox.com/s/3byt1jggly1dg98/vocab.txt?dl=0 -O $OPENNLP_DATA/dl-/namefinder/vocab.txt
+```
+python -m transformers.onnx --model=dslim/bert-base-NER --feature token-classification exported
 ```
 
-## TokenNameFinder
+## DocumentCategorizerDL
 
-* Export a Huggingface NER model to ONNX, e.g.:
+* Export a Huggingface classification (e.g. sentiment) model to ONNX, e.g.:
 
 ```
-python -m transformers.onnx --model=dslim/bert-base-NER --feature token-classification exported
+python -m transformers.onnx --model=nlptown/bert-base-multilingual-uncased-sentiment --feature sequence-classification exported
 ```
 
-* Copy the exported model to `src/test/resources/namefinder/model.onnx`.
-* Copy the model's [vocab.txt](https://huggingface.co/dslim/bert-base-NER/tree/main) to `src/test/resources/namefinder/vocab.txt`.
+## SentenceVectors
 
-Now you can run the tests in `NameFinderDLTest`.
+* Convert a sentence vectors model to ONNX, e.g.:

Review Comment:
   Maybe the GitHub UI is confusing me, but was the `* ` intentional here? I'm seeing an H2, then this list item, but then after that I see paragraphs with "Install dependencies:", "Convert the model"... or were those supposed to be list items too?



[GitHub] [opennlp] jzonthemtn commented on a diff in pull request #523: OPENNLP-1442: Sentence transformers

Posted by "jzonthemtn (via GitHub)" <gi...@apache.org>.
jzonthemtn commented on code in PR #523:
URL: https://github.com/apache/opennlp/pull/523#discussion_r1148536600


##########
opennlp-dl/src/main/java/opennlp/dl/AbstractDL.java:
##########
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.dl;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.stream.Stream;
+
+import ai.onnxruntime.OrtEnvironment;
+import ai.onnxruntime.OrtSession;
+
+import opennlp.tools.tokenize.Tokenizer;
+
+/**
+ * Base class for OpenNLP deep-learning classes using ONNX Runtime.
+ */
+public abstract class AbstractDL {
+
+  public static final String INPUT_IDS = "input_ids";
+  public static final String ATTENTION_MASK = "attention_mask";
+  public static final String TOKEN_TYPE_IDS = "token_type_ids";
+
+  protected OrtEnvironment env;
+  protected OrtSession session;
+  protected Tokenizer tokenizer;
+  protected Map<String, Integer> vocab;
+
+  /**
+   * Loads a vocabulary file from disk.
+   * @param vocabFile The vocabulary file.
+   * @return A map of vocabulary words to integer IDs.
+   * @throws IOException Thrown if the vocabulary file cannot be opened and read.
+   */
+  public Map<String, Integer> loadVocab(final File vocabFile) throws IOException {
+
+    final Map<String, Integer> vocab = new HashMap<>();
+
+    int counter = 0;
+
+    try (Stream<String> lines = Files.lines(Path.of(vocabFile.getPath()))) {
+
+      lines.forEach(line -> {
+        vocab.put(line, counter);

Review Comment:
   Thanks for catching that.
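
   The remark being acknowledged is not quoted in this message. One likely concern with the hunk above is the local `counter`: a Java lambda can only capture effectively final locals, so `counter` cannot be incremented inside `forEach`, and if it is left unchanged every token maps to 0. A minimal sketch of one way around this, assuming the zero-based line index is the token ID (as in BERT-style `vocab.txt` files). The class name is hypothetical and this is not necessarily the fix that was committed.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: hypothetical class, not the code merged in this PR.
final class VocabLoaderSketch {

  /** Maps each line of the vocabulary file to its zero-based line index. */
  static Map<String, Integer> loadVocab(final File vocabFile) throws IOException {
    final List<String> lines =
        Files.readAllLines(vocabFile.toPath(), StandardCharsets.UTF_8);
    final Map<String, Integer> vocab = new HashMap<>(lines.size() * 2);
    // A plain indexed loop avoids mutating a captured variable in a lambda.
    for (int i = 0; i < lines.size(); i++) {
      vocab.put(lines.get(i), i);
    }
    return vocab;
  }
}
```

   Reading the whole file into a list is fine for vocabularies of a few tens of thousands of lines; a streaming variant would need an `AtomicInteger` or similar holder for the counter.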



[GitHub] [opennlp] rzo1 commented on pull request #523: OPENNLP-1442: Sentence transformers

Posted by "rzo1 (via GitHub)" <gi...@apache.org>.
rzo1 commented on PR #523:
URL: https://github.com/apache/opennlp/pull/523#issuecomment-1491916036

   > I have not yet put those model files in the eval test data. So those tests will fail until they're in there.
   
   Is it available, so we can merge?


[GitHub] [opennlp] jzonthemtn commented on pull request #523: OPENNLP-1442: Sentence transformers

Posted by "jzonthemtn (via GitHub)" <gi...@apache.org>.
jzonthemtn commented on PR #523:
URL: https://github.com/apache/opennlp/pull/523#issuecomment-1482773945

   I have not yet put those model files in the eval test data. So those tests will fail until they're in there.

