Posted to user@commons.apache.org by "Oliver Meyn (GBIF)" <om...@gbif.org> on 2010/12/22 09:54:59 UTC

[digester] utf-8 problems

Hi all,

I'm trying to read UTF-8 encoded files and pass them to a Digester for parsing.  The parsed results subtly break the encoding, so that some characters are not represented properly.  My test case so far has been u umlaut, i.e. ü.

I'm confident the file is UTF-8 because I created it like this (so, without a BOM):

      Writer out = new OutputStreamWriter(new FileOutputStream(path), "UTF-8");
      out.write(longXmlStringContainingUmlaut);
      out.close();

My Digester code looks like this:

      FileInputStream fis = new FileInputStream(file);

      Digester digester = new Digester();
      digester.setNamespaceAware(true);
      digester.setValidating(false);
      digester.push(targetObject);

      // a bunch of digester rules

      InputSource inputSource = new InputSource(fis);
      inputSource.setEncoding("UTF-8");
      digester.parse(inputSource);

If I dump the contents of fis before digesting, using an InputStreamReader set to UTF-8, I see the umlaut.  After digesting, the result is ü.  I've tried all of the parse signatures, including the stream version where I wrap the input in an InputStreamReader set to UTF-8.  I've also tried the InputSource method (above) without calling setEncoding.  All cases produce the same result.
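
For reference, the "ü" symptom is what appears when the two UTF-8 bytes of ü (0xC3 0xBC) are each decoded as a separate single-byte character. The following is a minimal sketch of that effect, assuming the wrong charset in play is ISO-8859-1; the class and method names are made up for illustration and are not part of Digester or Xerces.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    /** Encodes text as UTF-8, then (wrongly) decodes those bytes as ISO-8859-1. */
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        // Assumption: the mismatched decoder is ISO-8859-1 (one char per byte).
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String garbled = misdecode("\u00FC"); // u umlaut
        // Print the code point of each resulting character.
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

If a single ü comes out as two characters like this, some step is probably decoding UTF-8 bytes with a single-byte charset, which narrows the search to places where bytes become characters without an explicit encoding.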

I suspect this may be a Xerces problem, but maybe it's Digester, and ideally it's me being dumb in some way.  Any and all help is appreciated.

Thanks,
Oliver
--
Oliver Meyn
Software Developer
Global Biodiversity Information Facility (GBIF)
+45 35 32 15 12
http://www.gbif.org


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@commons.apache.org
For additional commands, e-mail: user-help@commons.apache.org


Re: [digester] utf-8 problems

Posted by "Oliver Meyn (GBIF)" <om...@gbif.org>.
Bah - please disregard.  Of course the answer was me being dumb.  It turns out the data goes through two Digester passes, the second taking its input from the output of the first.  The second pass wasn't setting the encoding on its InputStream.
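
A rough sketch of what an encoding-safe hand-off between two passes might look like, for anyone who lands here with the same symptom. This is not the actual GBIF code: the class and helper names are hypothetical, and the JDK's SAX parser stands in for Digester (which sits on top of SAX) so the example runs on its own. It relies on the documented InputSource behaviour that a character stream, when present, is read directly and no encoding is guessed.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SecondPassEncoding {
    /** Hypothetical hand-off: serialize the first pass's XML output as UTF-8 bytes. */
    static InputStream toUtf8Stream(String xml) {
        // Explicit charset; a bare getBytes() would use the platform default.
        return new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
    }

    /** Wraps the bytes in a Reader with an explicit charset for the second pass. */
    static InputSource utf8Source(InputStream in) {
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        return new InputSource(reader);
    }

    /** Parses the source with the JDK SAX parser and returns all character data. */
    static String parseText(InputSource source) {
        final StringBuilder text = new StringBuilder();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(source, new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }
            });
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return text.toString();
    }
}
```

The InputSource built by utf8Source could equally be handed to digester.parse(inputSource) as in the original post; the point is only that every byte-to-character conversion between the passes names its charset.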

Apologies for the distraction,
Oliver


