Posted to commits@opennlp.apache.org by jo...@apache.org on 2018/12/12 15:55:44 UTC

[opennlp-sandbox] branch word_dropout created (now 2105a05)

This is an automated email from the ASF dual-hosted git repository.

joern pushed a change to branch word_dropout
in repository https://gitbox.apache.org/repos/asf/opennlp-sandbox.git.


      at 2105a05  Add word dropout, tokens are replaced with __UNK__ token

This branch includes the following new commits:

     new 2105a05  Add word dropout, tokens are replaced with __UNK__ token

The 1 revision listed above as "new" is entirely new to this
repository and will be described in a separate email.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.



[opennlp-sandbox] 01/01: Add word dropout, tokens are replaced with __UNK__ token

Posted by jo...@apache.org.
joern pushed a commit to branch word_dropout
in repository https://gitbox.apache.org/repos/asf/opennlp-sandbox.git

commit 2105a0509eaf0e17069a551ab48e14b62f92b095
Author: Jörn Kottmann <jo...@apache.org>
AuthorDate: Wed Dec 12 16:55:21 2018 +0100

    Add word dropout, tokens are replaced with __UNK__ token
---
 tf-ner-poc/src/main/python/namefinder/namefinder.py | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/tf-ner-poc/src/main/python/namefinder/namefinder.py b/tf-ner-poc/src/main/python/namefinder/namefinder.py
index 9150bd1..f41fd7d 100644
--- a/tf-ner-poc/src/main/python/namefinder/namefinder.py
+++ b/tf-ner-poc/src/main/python/namefinder/namefinder.py
@@ -19,7 +19,7 @@
 
 # This poc is based on source code taken from:
 # https://github.com/guillaumegenthial/sequence_tagging
-
+import random
 import sys
 from math import floor
 import tensorflow as tf
@@ -396,6 +396,12 @@ def main():
                 sentences_batch, chars_batch, word_length_batch, labels_batch, lengths = \
                     name_finder.mini_batch(rev_word_dict, char_dict, sentences, labels, batch_size, batch_index)
 
+                # TODO: Add a parameter to enable/disable word dropout
+                for batch_row in range(batch_size):
+                    for token_index in range(lengths[batch_row]):
+                        if random.uniform(0, 1) <= 0.05:
+                            sentences_batch[batch_row][token_index] = word_dict['__UNK__']
+
                 feed_dict = {token_ids_ph:  sentences_batch, char_ids_ph: chars_batch, word_lengths_ph: word_length_batch, sequence_lengths_ph: lengths,
                              labels_ph: labels_batch, dropout_keep_prob: 0.5}
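
The word dropout step added in this commit can be sketched as a standalone helper, which may be one way to address the TODO about making it configurable. This is only an illustration under stated assumptions, not the code in namefinder.py: the function name apply_word_dropout and the parameters unk_id, rate and rng are hypothetical, and the sketch returns a copy of the batch, whereas the commit overwrites sentences_batch in place. It also loops over the rows actually present instead of range(batch_size), and draws from the half-open interval [0, 1) so that a rate of 0 disables dropout entirely.

```python
import random


def apply_word_dropout(sentences_batch, lengths, unk_id, rate=0.05, rng=random):
    """Return a copy of the batch with tokens randomly replaced by unk_id.

    sentences_batch: list of rows of token ids, padded to a common length.
    lengths: number of real (non-padding) tokens in each row.
    unk_id: id of the unknown-word token (word_dict['__UNK__'] in the commit).
    rate: probability of replacing a token; 0 disables word dropout.
    rng: object with a random() method; hypothetical hook for seeding.
    """
    dropped = [list(row) for row in sentences_batch]
    # Pair each row with its true length so padding is never touched and a
    # final batch with fewer rows than batch_size does not raise an error.
    for row, length in zip(dropped, lengths):
        for token_index in range(length):
            # rng.random() is in [0, 1), so rate=0 never fires and rate=1 always does.
            if rng.random() < rate:
                row[token_index] = unk_id
    return dropped


if __name__ == "__main__":
    batch = [[5, 6, 7, 0], [8, 9, 0, 0]]  # 0 is padding in this toy example
    print(apply_word_dropout(batch, [3, 2], unk_id=1, rate=0.5, rng=random.Random(42)))
```

In the training loop, the result would be passed as the token_ids_ph entry of feed_dict. Evaluation batches would normally skip the call or use rate=0.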