Posted to users@cocoon.apache.org by Reinhard Haller <re...@interactive-net.de> on 2007/11/21 10:39:42 UTC

2.1.10: charset & nekohtml

Hi,

I have a problem with umlauts in nekohtml with the following URL:

http://www.heise.de/security/news/meldung/99281/

The HTML document doesn't contain any charset declaration, and neko gets
the charset wrong (the charset of the HTTP response is utf-8).

My sitemap snippet:

        <map:generator label="content" logger="sitemap.generator.html"
                       name="nekohtml" src="org.apache.cocoon.generation.NekoHTMLGenerator"/>

            <map:match pattern="**/*.neko">
                <map:generate type="nekohtml" src="{request-param:serv}" label="debug1x"/>
                <map:serialize type="xml"/>
            </map:match>

Any suggestions, parameters to set?

Thanks
Reinhard



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@cocoon.apache.org
For additional commands, e-mail: users-help@cocoon.apache.org


Re: 2.1.10: charset & nekohtml

Posted by Bertrand Delacretaz <bd...@apache.org>.
On Nov 22, 2007 11:51 AM, Reinhard Haller
<re...@interactive-net.de> wrote:

>  Bertrand Delacretaz wrote:
>  ...<map:transform type="nekohtml">
>  <map:parameter name="input-encoding" value="iso-8859-1" />
>  </map:transform>...
>
> ... I'm not convinced, the parameter changes anything as you can see in the
> following sitemap (I tried also iso-8859-1 and utf-8)....

Right, sorry - I double-checked, and this was using a slightly
customized version of the NekoHTMLTransformer, where we have added
this parameter.

Basically, you want this line in NekoHTMLTransformer:

           ByteArrayInputStream bais =
                new ByteArrayInputStream(text.getBytes());

to use a specific encoding, like

           ByteArrayInputStream bais =
                new ByteArrayInputStream(text.getBytes(inputEncoding));

and you can make this configurable by reading the parameter in the
setup() method:

     inputEncoding = par.getParameter("input-encoding", DEFAULT_INPUT_ENCODING);

after declaring these class members:

   /** Encoding to use to convert input text for reading by Neko */
   static final String DEFAULT_INPUT_ENCODING = "iso-8859-1";
   private String inputEncoding = DEFAULT_INPUT_ENCODING;

I don't have time to prepare a patch ATM, but if you want to do it, that
should be simple enough.
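
To illustrate the change described above, here is a minimal standalone
sketch. It is not the actual NekoHTMLTransformer code: the class
InputEncodingSketch and its helpers encode and toStream are hypothetical
names, and it uses Charset.forName with getBytes(Charset) instead of
getBytes(String) only to avoid the checked exception. It shows that the
bytes handed to the parser depend on the encoding passed to getBytes.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class InputEncodingSketch {
    /** Same default value as in the message above. */
    static final String DEFAULT_INPUT_ENCODING = "iso-8859-1";

    /** Hypothetical helper: encode text with an explicit charset, not the platform default. */
    static byte[] encode(String text, String inputEncoding) {
        String name = (inputEncoding == null) ? DEFAULT_INPUT_ENCODING : inputEncoding;
        // Charset.forName throws an unchecked exception for unknown charset names.
        return text.getBytes(Charset.forName(name));
    }

    /** Wrap the encoded bytes in a stream, as a parser would expect to read them. */
    static ByteArrayInputStream toStream(String text, String inputEncoding) {
        return new ByteArrayInputStream(encode(text, inputEncoding));
    }

    public static void main(String[] args) {
        String umlaut = "\u00fc"; // u with diaeresis

        System.out.println(encode(umlaut, "iso-8859-1").length); // prints 1
        System.out.println(encode(umlaut, "utf-8").length);      // prints 2

        // Reading UTF-8 bytes as ISO-8859-1 turns one character into two,
        // which is the typical garbled-umlaut symptom.
        String garbled = new String(encode(umlaut, "utf-8"), StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length()); // prints 2
    }
}
```

If the bytes are produced with one encoding and the parser assumes
another, umlauts come out garbled, so the parameter value has to match
what neko expects to read.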

-Bertrand



Re: 2.1.10: charset & nekohtml

Posted by Reinhard Haller <re...@interactive-net.de>.
Joerg Heinicke wrote:
> On 22.11.2007 5:51, Reinhard Haller wrote:
>
>>>> ...The html-document doesn't contain any charset spec and neko has a
>>>> charset problem (the charset of the http response is utf-8)....
>>>>     
>>>
>>> I've had to use the input-encoding parameter for neko to work
>>> correctly, for example:
>>>
>>>         <map:transform type="nekohtml">
>>>           <map:parameter name="input-encoding" value="iso-8859-1" />
>>>         </map:transform>
>>>
>>>   
>> I'm not convinced, the parameter changes anything as you can see in 
>> the following sitemap (I tried also iso-8859-1 and utf-8).
>>
>>            <map:match pattern="**/*.neko">
>>                <map:generate type="nekohtml" 
>> src="{request-param:serv}"  label="debug1x" >
>>            <parameter name="input-encoding" value="1"/>
>
> Is it something as trivial as the correct namespace prefix: 
> map:parameter??
>
No, that was only a typo.

Reinhard




Re: 2.1.10: charset & nekohtml

Posted by Joerg Heinicke <jo...@gmx.de>.
On 22.11.2007 5:51, Reinhard Haller wrote:

>>> ...The html-document doesn't contain any charset spec and neko has a
>>> charset problem (the charset of the http response is utf-8)....
>>>     
>>
>> I've had to use the input-encoding parameter for neko to work
>> correctly, for example:
>>
>>         <map:transform type="nekohtml">
>>           <map:parameter name="input-encoding" value="iso-8859-1" />
>>         </map:transform>
>>
>>   
> I'm not convinced, the parameter changes anything as you can see in the 
> following sitemap (I tried also iso-8859-1 and utf-8).
> 
>            <map:match pattern="**/*.neko">
>                <map:generate type="nekohtml" src="{request-param:serv}"  
> label="debug1x" >
>            <parameter name="input-encoding" value="1"/>

Is it something as trivial as the correct namespace prefix: map:parameter??

Joerg



Re: 2.1.10: charset & nekohtml

Posted by Reinhard Haller <re...@interactive-net.de>.
Hi Bertrand,

Bertrand Delacretaz wrote:
> On Nov 21, 2007 10:39 AM, Reinhard Haller
> <re...@interactive-net.de> wrote:
>
>   
>> ...The html-document doesn't contain any charset spec and neko has a
>> charset problem (the charset of the http response is utf-8)....
>>     
>
> I've had to use the input-encoding parameter for neko to work
> correctly, for example:
>
>         <map:transform type="nekohtml">
>           <map:parameter name="input-encoding" value="iso-8859-1" />
>         </map:transform>
>
>   
I'm not convinced the parameter changes anything, as you can see in the
following sitemap (I also tried iso-8859-1 and utf-8).

            <map:match pattern="**/*.neko">
                <map:generate type="nekohtml" src="{request-param:serv}" label="debug1x">
                    <parameter name="input-encoding" value="1"/>
                </map:generate>
                <map:serialize type="xml"/>
            </map:match>

Greetings
Reinhard


Re: 2.1.10: charset & nekohtml

Posted by Bertrand Delacretaz <bd...@apache.org>.
On Nov 21, 2007 10:39 AM, Reinhard Haller
<re...@interactive-net.de> wrote:

> ...The html-document doesn't contain any charset spec and neko has a
> charset problem (the charset of the http response is utf-8)....

I've had to use the input-encoding parameter for neko to work
correctly, for example:

        <map:transform type="nekohtml">
          <map:parameter name="input-encoding" value="iso-8859-1" />
        </map:transform>

Haven't investigated exactly what's happening there.

-Bertrand



Re: 2.1.10: charset & nekohtml

Posted by Reinhard Haller <re...@interactive-net.de>.
Hi, Ignacio,

listas@carmenynacho.com wrote:
>> From: Reinhard Haller [mailto:reinhard.haller@interactive-net.de] 
>> Sent: Wednesday, November 21, 2007 10:40 AM
>>     
>
> AFAIK, to configure neko you need to pass a properties file (excerpt from
> the default cocoon.xconf):
>
>       <map:generator label="content" logger="sitemap.generator.html"
>         name="nekohtml" pool-max="${nekohtml-generator.pool-max}"
>         src="org.apache.cocoon.generation.NekoHTMLGenerator">
>           <neko-config>context://WEB-INF/neko.properties</neko-config>
>       </map:generator>
>
>
> You can then tweak the properties file pointed to by neko-config. There is a
> neko.properties in the default install; buried inside it is:
>
> http\://cyberneko.org/html/properties/default-encoding=Windows-1252
>   

I knew that neko-html works in much the same way as the old Tidy HTML generator.

I'm not sure that setting the default encoding really solves my
problem. If you analyze the HTTP response for

http://www.heise.de/security/news/meldung/99281/

you can see the charset is defined as utf-8. So I changed the neko 
default-encoding property to

http\://cyberneko.org/html/properties/default-encoding=utf-8


The resulting neko output shows the same umlaut errors as all my
other attempts.
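
As an illustration of the point above, the charset declared by the
server can be read from the Content-Type header value. The sketch below
is plain string handling with hypothetical names
(ContentTypeCharsetSketch, charsetOf). It is not something Cocoon or
NekoHTML is known to do here, and the header values are made up for the
example.

```java
import java.util.Locale;

class ContentTypeCharsetSketch {
    /**
     * Hypothetical helper: return the charset parameter of a Content-Type
     * header value, or the given fallback if none is declared.
     */
    static String charsetOf(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        // Parameters follow the media type, separated by semicolons.
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // The value may be quoted, e.g. charset="utf-8".
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? fallback : value;
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Declared charset wins over the fallback.
        System.out.println(charsetOf("text/html; charset=utf-8", "Windows-1252")); // prints utf-8
        // No charset parameter: the fallback is used.
        System.out.println(charsetOf("text/html", "Windows-1252")); // prints Windows-1252
    }
}
```

A default-encoding setting only plays the role of the fallback in this
sketch; it helps only if the parser never sees a different encoding
before that point.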

Any suggestions?

Thanks
Reinhard



RE: 2.1.10: charset & nekohtml

Posted by li...@carmenynacho.com.
> From: Reinhard Haller [mailto:reinhard.haller@interactive-net.de] 
> Sent: Wednesday, November 21, 2007 10:40 AM

AFAIK, to configure neko you need to pass a properties file (excerpt from
the default cocoon.xconf):

      <map:generator label="content" logger="sitemap.generator.html"
        name="nekohtml" pool-max="${nekohtml-generator.pool-max}"
        src="org.apache.cocoon.generation.NekoHTMLGenerator">
          <neko-config>context://WEB-INF/neko.properties</neko-config>
      </map:generator>


You can then tweak the properties file pointed to by neko-config. There is a
neko.properties in the default install; buried inside it is:

http\://cyberneko.org/html/properties/default-encoding=Windows-1252
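
A note on the backslash in that line: in the java.util.Properties file
format an unescaped colon ends the key, so it has to be escaped when the
key is a URL. The following sketch only uses java.util.Properties; the
class name NekoPropertiesSketch and the helper parse are hypothetical,
and it assumes neko.properties is read with the standard Properties
format, which the escaped colon suggests.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class NekoPropertiesSketch {
    /** The property name as the parser sees it, without the escaping backslash. */
    static final String DEFAULT_ENCODING_KEY =
            "http://cyberneko.org/html/properties/default-encoding";

    /** Hypothetical helper: parse properties-file text held in a string. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            // Not expected for an in-memory reader; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // "\\:" in Java source is "\:" in the file, i.e. an escaped colon.
        String escaped = "http\\://cyberneko.org/html/properties/default-encoding=Windows-1252\n";
        System.out.println(parse(escaped).getProperty(DEFAULT_ENCODING_KEY)); // prints Windows-1252

        // Without the backslash the key stops at the first colon.
        String unescaped = "http://cyberneko.org/html/properties/default-encoding=Windows-1252\n";
        System.out.println(parse(unescaped).getProperty(DEFAULT_ENCODING_KEY)); // prints null
    }
}
```

So when editing the value, keep the backslash before the colon, or the
setting is silently stored under the key "http" instead.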

HTH

Experience is the mother of science ;)

Saludos,
Ignacio J. Ortega
 


