Posted to user@commons.apache.org by Vitor Costa <fv...@yahoo.com.br> on 2010/08/25 23:50:45 UTC

[lang] StringEscapeUtils.unescapeHtml("&nbsp;") doesn't return a space

Hi,

I am writing a crawler to extract information from web pages, and I am using
Commons Lang to unescape the HTML.
I was having problems with my regular expressions until I realized that the
following prints false:

System.out.println(" ".equals(StringEscapeUtils.unescapeHtml("&nbsp;")));

Is this a bug? Or is it the expected behavior of the unescape method when 
dealing with escaped space characters?


Also, if I unescape 'sbrubbles&nbsp;' and then trim() it, the space still
appears at the end of the string.
Visually, unescaping '&nbsp;' returns a space, but programmatically the
result is not recognized as a space character.

Thanks in advance,
Vitor.


      

Re: [lang] StringEscapeUtils.unescapeHtml("&nbsp;") doesn't return a space

Posted by "E. Michael Akerman" <mi...@exchange.uark.edu>.
I'm not certain how StringEscapeUtils handles it, but in HTML, &nbsp; should be equal to character 160 (U+00A0, the non-breaking space) instead of 32 (the ordinary space). It has a
different meaning than a regular space.
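
A minimal sketch of the difference, in plain Java with no Commons Lang dependency. It assumes the unescaped result contains U+00A0, which is hard-coded here as a stand-in for the output of unescapeHtml. The class name NbspDemo and the helper normalizeSpaces are hypothetical, made up for illustration only.

```java
public class NbspDemo {
    // U+00A0 (decimal 160): the non-breaking space that &nbsp; stands for in HTML.
    public static final char NBSP = '\u00A0';

    // Hypothetical helper: turn each non-breaking space into an ordinary
    // space (character 32), then trim as usual.
    public static String normalizeSpaces(String s) {
        return s.replace(NBSP, ' ').trim();
    }

    public static void main(String[] args) {
        // Assumed stand-in for the unescaped form of "sbrubbles&nbsp;".
        String unescaped = "sbrubbles" + NBSP;

        System.out.println((int) NBSP);                           // 160
        System.out.println(" ".equals(String.valueOf(NBSP)));     // false
        // trim() only strips characters up to U+0020, so U+00A0 survives.
        System.out.println(unescaped.trim().length());            // 10
        // By default, \s in java.util.regex does not match U+00A0 either.
        System.out.println(String.valueOf(NBSP).matches("\\s"));  // false
        System.out.println(normalizeSpaces(unescaped).length());  // 9
    }
}
```

If that assumption holds, replacing U+00A0 with an ordinary space before calling trim() or applying a regex should give the behavior the original code expected. Whether that is appropriate depends on whether the crawler needs to tell the two characters apart.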

Michael Akerman
Systems Analyst
University IT Services
