Posted to dev@lucene.apache.org by "Uwe Schindler (Created) (JIRA)" <ji...@apache.org> on 2012/04/14 12:27:17 UTC

[jira] [Created] (LUCENE-3983) HTMLCharacterEntities.jflex uses String.toUpperCase without Locale

HTMLCharacterEntities.jflex uses String.toUpperCase without Locale
------------------------------------------------------------------

                 Key: LUCENE-3983
                 URL: https://issues.apache.org/jira/browse/LUCENE-3983
             Project: Lucene - Java
          Issue Type: Bug
            Reporter: Uwe Schindler
            Assignee: Steven Rowe


Is this expected?

{code:java}
      "xi", "\u03BE", "yacute", "\u00FD", "yen", "\u00A5", "yuml", "\u00FF",
      "zeta", "\u03B6", "zwj", "\u200D", "zwnj", "\u200C"
    };
    for (int i = 0 ; i < entities.length ; i += 2) {
      Character value = entities[i + 1].charAt(0);
      entityValues.put(entities[i], value);
      if (upperCaseVariantsAccepted.contains(entities[i])) {
        entityValues.put(entities[i].toUpperCase(), value);
      }
    }
{code}

In my opinion, this should look like:

{code:java}
      "xi", "\u03BE", "yacute", "\u00FD", "yen", "\u00A5", "yuml", "\u00FF",
      "zeta", "\u03B6", "zwj", "\u200D", "zwnj", "\u200C"
    };
    for (int i = 0 ; i < entities.length ; i += 2) {
      Character value = entities[i + 1].charAt(0);
      entityValues.put(entities[i], value);
      if (upperCaseVariantsAccepted.contains(entities[i])) {
        entityValues.put(entities[i].toUpperCase(Locale.ENGLISH), value);
      }
    }
{code}

(Otherwise, when the default locale is Turkish, entities containing "i" (like "xi" -> '\u03BE') will not be detected correctly, because "i" uppercases to the dotted capital "İ" (U+0130) rather than to "I".)
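
The locale dependence described above can be reproduced with a minimal sketch like the following. The class name TurkishUpperCaseDemo and the helper upper are made up for this illustration and are not part of Lucene; the sketch passes the locale explicitly, which has the same effect as the no-argument toUpperCase() running on a JVM whose default locale is Turkish.

```java
import java.util.Locale;

// Hypothetical demo class; not part of Lucene.
class TurkishUpperCaseDemo {

  // Uppercases s using the case-mapping rules of the given locale.
  static String upper(String s, Locale locale) {
    return s.toUpperCase(locale);
  }

  public static void main(String[] args) {
    Locale turkish = Locale.forLanguageTag("tr-TR");

    // Turkish rules map 'i' to U+0130 (capital I with dot above).
    String tr = upper("xi", turkish);
    // English rules map 'i' to the plain ASCII 'I'.
    String en = upper("xi", Locale.ENGLISH);

    System.out.println(tr.equals("X\u0130")); // prints true
    System.out.println(en.equals("XI"));      // prints true
    System.out.println(tr.equals(en));        // prints false
  }
}
```

A map keyed by the Turkish result would therefore never match the ASCII entity name "XI" found in a document.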

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Resolved] (LUCENE-3983) HTMLCharacterEntities.jflex uses String.toUpperCase without Locale

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe resolved LUCENE-3983.
---------------------------------

       Resolution: Fixed
    Fix Version/s: 4.0
    Lucene Fields: New,Patch Available  (was: New)

Committed to trunk.

I don't think it's worth backporting to the 3.6 branch. The only danger here was that, if the set of recognized uppercase variants of HTML character entities ever grew, one of them might contain an "i". Since branch 3.6 is bugfix-only, though, that set will never grow there.
                


[jira] [Commented] (LUCENE-3983) HTMLCharacterEntities.jflex uses String.toUpperCase without Locale

Posted by "Uwe Schindler (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13254161#comment-13254161 ] 

Uwe Schindler commented on LUCENE-3983:
---------------------------------------

I have no preference; I just noticed the missing Locale, and that alarmed me. We should avoid locale-less case conversion from the start to prevent bugs.
I would simply add Locale.ENGLISH, commit that, and leave the rest unchanged. I only assigned the issue to you because I have no up-to-date JFlex installed to regenerate the Java files; otherwise I would have heavy-committed it myself :-)
                


[jira] [Updated] (LUCENE-3983) HTMLCharacterEntities.jflex uses String.toUpperCase without Locale

Posted by "Steven Rowe (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-3983:
--------------------------------

    Priority: Minor  (was: Major)
    


[jira] [Commented] (LUCENE-3983) HTMLCharacterEntities.jflex uses String.toUpperCase without Locale

Posted by "Steven Rowe (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13254144#comment-13254144 ] 

Steven Rowe commented on LUCENE-3983:
-------------------------------------

Maybe a better idea would be to convert {{upperCaseVariantsAccepted}} into a map, so that runtime uppercasing isn't required:

{code:java}
  private static final Map<String,String> upperCaseVariantsAccepted
      = new HashMap<String,String>();
  static {
    upperCaseVariantsAccepted.put("quot", "QUOT");
    upperCaseVariantsAccepted.put("copy", "COPY");
    upperCaseVariantsAccepted.put("gt", "GT");
    upperCaseVariantsAccepted.put("lt", "LT");
    upperCaseVariantsAccepted.put("reg", "REG");
    upperCaseVariantsAccepted.put("amp", "AMP");
  }
[...]
  for (int i = 0 ; i < entities.length ; i += 2) {
    Character value = entities[i + 1].charAt(0);
    entityValues.put(entities[i], value);
    String upperCaseVariant = upperCaseVariantsAccepted.get(entities[i]);
    if (upperCaseVariant != null) {
      entityValues.put(upperCaseVariant, value);
    }
  }
{code}
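
For readers who want to try the idea in isolation, here is a self-contained sketch of the map-based approach. It assumes a cut-down entity table of three entries; the class name EntityTableSketch, the constant names, and the method buildEntityValues are invented for this example and do not appear in the generated Lucene scanner.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, cut-down stand-in for the scanner's static setup code.
class EntityTableSketch {

  // Accepted uppercase variants, spelled out literally so that
  // no locale-sensitive uppercasing happens at runtime.
  static final Map<String, String> UPPER_CASE_VARIANTS = new HashMap<>();
  static {
    UPPER_CASE_VARIANTS.put("quot", "QUOT");
    UPPER_CASE_VARIANTS.put("copy", "COPY");
    UPPER_CASE_VARIANTS.put("gt", "GT");
    UPPER_CASE_VARIANTS.put("lt", "LT");
    UPPER_CASE_VARIANTS.put("reg", "REG");
    UPPER_CASE_VARIANTS.put("amp", "AMP");
  }

  // Illustrative subset only: alternating entity name and value.
  static final String[] ENTITIES = {
    "amp", "&", "lt", "<", "xi", "\u03BE"
  };

  // Builds the lookup table, adding the uppercase key only where one is listed.
  static Map<String, Character> buildEntityValues() {
    Map<String, Character> entityValues = new HashMap<>();
    for (int i = 0; i < ENTITIES.length; i += 2) {
      Character value = ENTITIES[i + 1].charAt(0);
      entityValues.put(ENTITIES[i], value);
      String upperCaseVariant = UPPER_CASE_VARIANTS.get(ENTITIES[i]);
      if (upperCaseVariant != null) {
        entityValues.put(upperCaseVariant, value);
      }
    }
    return entityValues;
  }

  public static void main(String[] args) {
    Map<String, Character> values = buildEntityValues();
    System.out.println(values.get("AMP"));        // prints &
    System.out.println(values.containsKey("XI")); // prints false
  }
}
```

Because every accepted uppercase form is a string literal, the resulting table is the same regardless of the JVM's default locale.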
                


[jira] [Updated] (LUCENE-3983) HTMLCharacterEntities.jflex uses String.toUpperCase without Locale

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-3983:
--------------------------------

    Attachment: LUCENE-3983.patch

Patch removing runtime uppercasing, as in my previous comment.

Committing shortly.
                


[jira] [Commented] (LUCENE-3983) HTMLCharacterEntities.jflex uses String.toUpperCase without Locale

Posted by "Steven Rowe (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13254143#comment-13254143 ] 

Steven Rowe commented on LUCENE-3983:
-------------------------------------

Since the {{upperCaseVariantsAccepted}} entries don't include an "i", and this set will likely never grow, is this really a problem?
{code:java}
private static final Set<String> upperCaseVariantsAccepted
    = new HashSet<String>(Arrays.asList("quot","copy","gt","lt","reg","amp"));
{code}

However, it's definitely a good idea in general.

+1
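
The observation that none of the six accepted names is affected can be checked with a small sketch along these lines. The class name UpperCaseVariantCheck and the method localeSafe are hypothetical and exist only for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical check; not part of Lucene.
class UpperCaseVariantCheck {

  // The six entity names listed in upperCaseVariantsAccepted.
  static final List<String> NAMES =
      Arrays.asList("quot", "copy", "gt", "lt", "reg", "amp");

  // True if the name uppercases identically under Turkish and English rules.
  static boolean localeSafe(String name) {
    Locale turkish = Locale.forLanguageTag("tr-TR");
    return name.toUpperCase(turkish).equals(name.toUpperCase(Locale.ENGLISH));
  }

  public static void main(String[] args) {
    for (String name : NAMES) {
      System.out.println(name + " -> " + localeSafe(name));
    }
    // A name containing "i" is where the two locales disagree.
    System.out.println("xi -> " + localeSafe("xi"));
  }
}
```

All six names report true, while "xi" reports false, so the unqualified toUpperCase() call would only misbehave if a name containing "i" were ever added to the set.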
                
