Posted to commits@commons.apache.org by ah...@apache.org on 2019/03/08 00:12:42 UTC

[commons-text] branch master updated (0ada5fa -> 19df20d)

This is an automated email from the ASF dual-hosted git repository.

aherbert pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/commons-text.git.


    from 0ada5fa  TEXT-153: Make prefixSet a BitSet. (#108)
     new f40607a  TEXT-156: Fix the RegexTokenizer to use a static Pattern.
     new 4fa483c  Merge branch 'improvement-TEXT-156'
     new 19df20d  TEXT-156: Update changes.xml

The 3 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 src/changes/changes.xml                                            | 1 +
 .../java/org/apache/commons/text/similarity/RegexTokenizer.java    | 7 ++++---
 2 files changed, 5 insertions(+), 3 deletions(-)

[commons-text] 02/03: Merge branch 'improvement-TEXT-156'

Posted by ah...@apache.org.

aherbert pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/commons-text.git

commit 4fa483c5de0f053301947201a6aa49b13cb1dd0d
Merge: 0ada5fa f40607a
Author: Alex Herbert <ah...@apache.org>
AuthorDate: Fri Mar 8 00:09:38 2019 +0000

    Merge branch 'improvement-TEXT-156'
    
    Closes #109

 .../java/org/apache/commons/text/similarity/RegexTokenizer.java    | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

[commons-text] 03/03: TEXT-156: Update changes.xml

Posted by ah...@apache.org.

aherbert pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/commons-text.git

commit 19df20d10051d1204f321b8706ae91ccce73f1c6
Author: Alex Herbert <ah...@apache.org>
AuthorDate: Fri Mar 8 00:12:40 2019 +0000

    TEXT-156: Update changes.xml
---
 src/changes/changes.xml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/changes/changes.xml b/src/changes/changes.xml
index 410c5ab..1cb5391 100644
--- a/src/changes/changes.xml
+++ b/src/changes/changes.xml
@@ -53,6 +53,7 @@ The <action> type attribute can be add,update,fix,remove.
     <action issue="TEXT-138" type="add" dev="ggregory" due-to="Neal Johnson, Don Jeba">TextStringBuilder append sub-sequence not consistent with Appendable.</action>
     <action issue="TEXT-152" type="add" dev="" due-to="@CAPS50">Fix possible infinite loop in WordUtils.wrap for a regex pattern that would trigger on a match of 0 length</action>
     <action issue="TEXT-153" type="update" dev="" due-to="amirhadadi">Make prefixSet in LookupTranslator a BitSet</action>
+    <action issue="TEXT-156" type="update" dev="aherbert">Fix the RegexTokenizer to use a static Pattern</action>
   </release>
 
   <release version="1.6" date="2018-10-12" description="Release 1.6">

[commons-text] 01/03: TEXT-156: Fix the RegexTokenizer to use a static Pattern.

Posted by ah...@apache.org.

aherbert pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/commons-text.git

commit f40607a1689b96831212e73a3588187778e6dc2a
Author: Alex Herbert <ah...@apache.org>
AuthorDate: Thu Mar 7 23:13:49 2019 +0000

    TEXT-156: Fix the RegexTokenizer to use a static Pattern.
    
    Remove the use of CharSequence.toString() to pass to the
    matcher(CharSequence) method.
    
    Fix the javadoc header @code tag.
---
 .../java/org/apache/commons/text/similarity/RegexTokenizer.java    | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)
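
Note (illustration only, not part of the commit): a minimal sketch of the
technique this change applies, assuming a hypothetical class named
WordTokenizerSketch rather than the real RegexTokenizer. The regex is
compiled once into a static final Pattern, and the CharSequence is passed
directly to Pattern.matcher(CharSequence) without a toString() copy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration; this is not the commons-text class.
final class WordTokenizerSketch {

    // Compiled once at class initialization. Pattern instances are immutable
    // and safe to share between threads. Matches runs of word characters.
    private static final Pattern WORDS = Pattern.compile("(\\w)+");

    private WordTokenizerSketch() {
    }

    // Returns every run of word characters found in the given text.
    static List<String> tokenize(final CharSequence text) {
        // matcher(CharSequence) accepts any CharSequence, so no toString() is needed.
        // Matcher instances are not thread-safe, so a new one is created per call.
        final Matcher matcher = WORDS.matcher(text);
        final List<String> tokens = new ArrayList<>();
        while (matcher.find()) {
            tokens.add(matcher.group(0));
        }
        return tokens;
    }
}
```

For example, WordTokenizerSketch.tokenize(new StringBuilder("hello, static pattern"))
yields the tokens hello, static and pattern.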

diff --git a/src/main/java/org/apache/commons/text/similarity/RegexTokenizer.java b/src/main/java/org/apache/commons/text/similarity/RegexTokenizer.java
index cc009ef..f650c0c 100644
--- a/src/main/java/org/apache/commons/text/similarity/RegexTokenizer.java
+++ b/src/main/java/org/apache/commons/text/similarity/RegexTokenizer.java
@@ -26,12 +26,14 @@ import org.apache.commons.lang3.Validate;
 
 /**
  * A simple word tokenizer that utilizes regex to find words. It applies a regex
- * {@code}(\w)+{@code} over the input text to extract words from a given character
+ * {@code (\w)+} over the input text to extract words from a given character
  * sequence.
  *
  * @since 1.0
  */
 class RegexTokenizer implements Tokenizer<CharSequence> {
+    /** The whitespace pattern. */
+    private static final Pattern PATTERN = Pattern.compile("(\\w)+");
 
     /**
      * {@inheritDoc}
@@ -41,8 +43,7 @@ class RegexTokenizer implements Tokenizer<CharSequence> {
     @Override
     public CharSequence[] tokenize(final CharSequence text) {
         Validate.isTrue(StringUtils.isNotBlank(text), "Invalid text");
-        final Pattern pattern = Pattern.compile("(\\w)+");
-        final Matcher matcher = pattern.matcher(text.toString());
+        final Matcher matcher = PATTERN.matcher(text);
         final List<String> tokens = new ArrayList<>();
         while (matcher.find()) {
             tokens.add(matcher.group(0));