Posted to user@commons.apache.org by paul womack <pw...@papermule.co.uk> on 2010/06/25 11:38:31 UTC
Re: [lang] collapsing unicode white space
Scott Wilson wrote:
> Well after a bit of research I finally found a solution to this problem,
> and though StringUtils and CharSetUtils play a role, there was still a
> bit of a gap.
>
> Here is the code:
>
> private static String normalize(String in, boolean includeWhitespace){
>     if (in == null) return "";
>     String out = "";
>     for (int x=0;x<in.length();x++){
>         String s = in.substring(x, x+1);
>         char ch = s.charAt(0);
>         if (Character.isSpaceChar(ch) || (Character.isWhitespace(ch) && includeWhitespace)){
>             s = " ";
>         }
>         out = out + s;
>     }
>     out = CharSetUtils.squeeze(out, " ");
>     out = StringUtils.strip(out);
>     return out;
> }
>
> Interestingly enough there is no "normalize unicode white space/space
> chars" method in any of the libs that I tested (e.g. jdom, dom4j).
Surely a simple regex does it?
Sujit posted:
> s = s.replaceAll("\\s+", " ");
>
> or since you are doing unicode:
>
> String s = "This\u0200\u0200is\u0200a\u0200\u0200test";
> System.out.println("before=" + s);
> s = s.replaceAll("\u0200+", "\u0200");
> System.out.println("after=" + s);
But (reading the regexp documentation), there's
\p{javaWhitespace} Equivalent to java.lang.Character.isWhitespace()
which appears to do just what's wanted.
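A minimal sketch of that idea, assuming the goal is only to collapse runs of Character.isWhitespace() characters to one space and drop the space left at either end. The class and method names are made up for illustration and are not part of Commons Lang. Two things to keep in mind: in Java's regex engine "\\s" matches only ASCII whitespace by default, and U+0200 in the snippet quoted above is a Latin letter rather than a space character, so neither covers the Unicode case.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class WhitespaceCollapse {
    // One or more characters for which Character.isWhitespace() returns true.
    private static final Pattern WS_RUN = Pattern.compile("\\p{javaWhitespace}+");
    // After collapsing, at most one plain space can remain at each end.
    private static final Pattern EDGE_SPACE = Pattern.compile("^ | \\z");

    static String collapse(String in) {
        if (in == null) return "";
        String collapsed = WS_RUN.matcher(in).replaceAll(" ");
        return EDGE_SPACE.matcher(collapsed).replaceAll("");
    }

    public static void main(String[] args) {
        // Tab, EM SPACE (U+2003) and newline are all isWhitespace() characters.
        System.out.println("[" + collapse("  a\t\u2003 b\n") + "]");
        // NO-BREAK SPACE (U+00A0) is not isWhitespace(), so it is left alone.
        System.out.println("[" + collapse("a\u00A0b") + "]");
    }
}
```

Note that non-breaking spaces (U+00A0, U+2007, U+202F) are excluded by Character.isWhitespace(), so this alone does not reproduce the isSpaceChar() branch of the original method.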
BugBear
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@commons.apache.org
For additional commands, e-mail: user-help@commons.apache.org
Re: [lang] collapsing unicode white space
Posted by Scott Wilson <sc...@gmail.com>.
On 25 Jun 2010, at 10:38, paul womack wrote:
> Scott Wilson wrote:
>> Well after a bit of research I finally found a solution to this problem, and though StringUtils and CharSetUtils play a role, there was still a bit of a gap.
>> Here is the code:
>> private static String normalize(String in, boolean includeWhitespace){
>>     if (in == null) return "";
>>     String out = "";
>>     for (int x=0;x<in.length();x++){
>>         String s = in.substring(x, x+1);
>>         char ch = s.charAt(0);
>>         if (Character.isSpaceChar(ch) || (Character.isWhitespace(ch) && includeWhitespace)){
>>             s = " ";
>>         }
>>         out = out + s;
>>     }
>>     out = CharSetUtils.squeeze(out, " ");
>>     out = StringUtils.strip(out);
>>     return out;
>> }
>> Interestingly enough there is no "normalize unicode white space/space chars" method in any of the libs that I tested (e.g. jdom, dom4j).
>
> Surely a simple regex does it?
>
> Sujit posted:
>> s = s.replaceAll("\\s+", " ");
>> or since you are doing unicode:
>> String s = "This\u0200\u0200is\u0200a\u0200\u0200test";
>> System.out.println("before=" + s);
>> s = s.replaceAll("\u0200+", "\u0200");
>> System.out.println("after=" + s);
>
>
> But (reading the regexp documentation), there's
> \p{javaWhitespace} Equivalent to java.lang.Character.isWhitespace()
>
> which appears to do just what's wanted.
Certainly possible - thanks! If there is also an equivalent for isSpaceChar then that may cover it.
The current code is here:
https://svn.apache.org/repos/asf/incubator/wookie/trunk/parser/java/src/org/apache/wookie/w3c/util/UnicodeUtils.java
The algorithms it is required to conform to are:
http://www.w3.org/TR/widgets/#rule-for-getting-text-content-with-norma
http://www.w3.org/TR/widgets/#rule-for-getting-a-single-attribute-valu
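On the isSpaceChar question: java.util.regex exposes the non-deprecated Character.isXxx() methods as \p{javaXxx} properties, so \p{javaSpaceChar} should be usable in the same way as \p{javaWhitespace}. Below is a rough sketch of how the normalize method quoted above might be expressed with regexes. It is an illustration only, not the Wookie code, and it has not been checked against the W3C rules linked above. The class name and pattern names are invented for the example.

```java
import java.util.regex.Pattern;

// Hypothetical regex-based variant of the quoted normalize() method.
class UnicodeSpaceNormalizer {
    // Runs of Character.isSpaceChar() characters (Unicode Zs, Zl, Zp).
    private static final Pattern SPACE_CHARS =
            Pattern.compile("\\p{javaSpaceChar}+");
    // Runs of characters matching isSpaceChar() or isWhitespace().
    private static final Pattern SPACE_OR_WS =
            Pattern.compile("[\\p{javaSpaceChar}\\p{javaWhitespace}]+");
    // Leading or trailing isWhitespace() characters, mirroring
    // StringUtils.strip(String) in the quoted code.
    private static final Pattern EDGES =
            Pattern.compile("^\\p{javaWhitespace}+|\\p{javaWhitespace}+\\z");

    static String normalize(String in, boolean includeWhitespace) {
        if (in == null) return "";
        Pattern runs = includeWhitespace ? SPACE_OR_WS : SPACE_CHARS;
        // Replace each run with a single space, then strip both ends.
        String collapsed = runs.matcher(in).replaceAll(" ");
        return EDGES.matcher(collapsed).replaceAll("");
    }

    public static void main(String[] args) {
        String sample = "  a\u00A0\u00A0b\t\tc ";
        System.out.println("[" + normalize(sample, false) + "]");
        System.out.println("[" + normalize(sample, true) + "]");
    }
}
```

With includeWhitespace set to false, tabs and newlines inside the string are left as they are, as in the loop version; only the ends are stripped of them.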
>
> BugBear