
Posted to user@commons.apache.org by Otis Gospodnetic <ot...@yahoo.com> on 2009/04/16 00:06:28 UTC

[Digester] HTML entity decoding?

Hello,

I'm using Digester 2.0 and trying to process XML that
may include HTML entities and trying to get Digester to decode them
when parsing.

For example, my XML contains:
  <name><![CDATA[Gr&uuml;ber]]></name>

Currently, Digester parses this as:  Gr&uuml;ber

But what I am really after is "Grüber", so I am looking for a way to get this &uuml; entity decoded by Digester.
How do I tell Digester to decode HTML entities?

Also, if I don't use CDATA, like this:
  <name>Gr&uuml;ber</name>

Digester gives me: Grber

Any help would be much appreciated.  Thanks,

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@commons.apache.org
For additional commands, e-mail: user-help@commons.apache.org

Re: [Digester] HTML entity decoding?

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Hi,

Thanks Paul.  I'm getting closer, but still not there.  More inlined comments/questions.



----- Original Message ----
> From: Paul Libbrecht <pa...@activemath.org>
> 
> On 22 Apr 2009, at 06:06, Otis Gospodnetic wrote:
> > XML files I'm trying to parse do have "links" to DTDs in the "header" 
> > (sometimes with a full http://... URL, and sometimes with just a local file 
> > name), but there are no actual DTD files there.  Is the first step, then, making 
> > sure that the referenced DTD files really exist at locations pointed to in the 
> > "header" of the XML?
> 
> The short answer is yes.
> The long answer is yes except if you manage to configure xml catalogs (I think 
> that, in the case of Xerces, something such as the XmlResolver is used) which 
> associate "public-ids" to local files. That's best for performance but long to 
> configure.

OK, I get this, although I don't know yet how to tell Digester to do this.

> I suppose this is going to live in something that is not command-line, so DTDs 
> should be cached. At worst, make sure the property for this in the parser is set.

Actually, I do run this XML parsing tool from the command-line.

> >> Here's a text pointing to such a DTD:
> >> http://www.w3.org/TR/xhtml-modularization/dtd_module_defs.html#a_xhtml_character_entities
> > 
> > So does this mean I would have to ensure that the DTD files contain things 
> > like:
> > 
> > <!ENTITY nbsp   "&#160;" >
> > <!ENTITY iexcl  "&#161;" >
> > <!ENTITY uuml   "&#252;" >
> > ...
> > and so on?
> > And if my DTD had this, are you saying Digester would decode my:
> >  <name><![CDATA[Gr&uuml;ber]]></name>
> 
> no
> but
> <name>Gr&uuml;ber</name>
> (the other form is exactly an escape which is equivalent to
>   <name>Gr&amp;uuml;ber</name> not what you want!)
> 
> > to Grüber?  Or to &#252; ?
> 
> (Both of the above are equivalent in XML-compliant parsers. A method reading 
> that XML would only receive Grüber.)

Hm, still not sure what would get me Grüber, what exactly I'd need to do to make Digester or the underlying parser go from:
  <name><![CDATA[Gr&uuml;ber]]></name>
to
   Grüber

> > My end goal is to index this data with Lucene/Solr, so I need it to be 
> > "Grüber" before I send it to Lucene/Solr.
> > In other words, if I end up with &#252;, this is still no good for me, as I 
> > still wouldn't have Grüber.
> 
> You could also insert the DTDs inside the solr document.

Uh, this would be very very complicated.  I just need to parse out that Grüber and store/index it as such.

> >> Note that opening the file with a validating parser will certainly grumble 
> >> about all sorts of undeclared elements, this is ok, it does not prevent
> >> parsing but is, indeed, a validation error.
> > 
> > Uh, I'm lost here.  Which file are you referring to?  DTD or the XML file?  
> > Sounds like XML.  And why would I get complaints about undeclared elements?
> 
> the DTD has the double function of declaring elements and attributes as well as 
> entities.
> DTD validation will fail if you have just defined entities in your DTD but not 
> the relevant elements.
> XML parsing will fail if you use entities that you have not defined.

OK.

> >> However you get the entity-expansion.
> > 
> > How?  If I make the XML parser validating?
> 
> If you use a conforming parser.

So, it sounds like the following may be the recipe:
- make sure the referenced DTD files really exist and that the parser can get to them
- make sure the DTD files include entities used in the XML document
- turn XML validation on (?)
- run XML parser

... and now, because the parser has access to the DTD files and those DTD files declare the entities, the XML parser turns, say, &uuml; into ü.

Is this correct?

> > This is what I do to my Digester instance as soon as I create it:
> >        dig.setValidating(false);
> 
> This prevents validation failures (such as undeclared attributes or 
> elements) from stopping processing; it is good.

It is good to turn validation off?

> >        dig.setEntityResolver(new NoOpEntityResolver());
> > And that NoOpEntityResolver is my custom class that implements the 
> > resolveEntity method:
> 
> I believe that is definitely the problem! ;-)

OK.  Perhaps the errors I was seeing were there because the DTD files were missing.

> Please note that most DTD files that people refer to are easy to get publicly 
> and are often bundled with software.

OK, so I really need to find those DTDs.

> What kind of files are these that you are reading with Digester?
> Do you have samples?


> You seem to be lacking control of the DTDs in the same fancy way HTML files are 
> done. I would consider NekoHtml tools then.

Yes, I don't have DTDs, but who knows, I may be able to find them.  Why do you suggest NekoHtml?

> >> Note that using the first form, which contains an *escaped* entity, there's
> >> nothing to do! You'd have to match them manually ("re-entrantly") into a 
> >> parser that parses entities properly.
> > 
> > Uh, what does this mean? :)
> > Are you saying "&uuml;" is the "escaped" form of the entity?  (what would be 
> > the unescaped form of it?)
> 
> I was saying <![CDATA[Gr&uuml;ber]]> or Gr&amp;uuml;ber is the escaped form, 
> which you can only fix by applying regexps (which might break other things).

Hm, then, if I understand you correctly, you are saying I will *not* be able to get Grüber from <![CDATA[Gr&uuml;ber]]> because <![CDATA[Gr&uuml;ber]]> is the escaped form (of Grüber?) and the XML parser will not be able to "unescape" it to get me Grüber even if I follow the above steps?
And you are saying the only way for me to fix this is to manually replace &uuml; with ü ==> s/&uuml;/ü/ type of thing?
If that's what you are saying, then this is a completely manual process, nothing that the XML parser can help me with?  That doesn't sound right, so I'm probably misunderstanding you.

> > And what do you mean by there is nothing to do?  (I was hoping the parser 
> > would do the work and convert "&uuml;" to "ü")
> > I don't understand the last sentence.... so I'm not even sure how to ask any 
> > questions about it.... but it sounds like you are saying that some parsers may 
> > simply do what I need, just not Digester?  I'm not sure what you mean by manual 
> > matching?
> 
> Digester is not a parser, it uses the JAXP-available parsers.
> By default in JDK >= 1.5, this is a xerces copy (under com.sun packages).
> If you have other parsers in the classpath, these may be picked up instead 
> (something in META-INF can be used, I think).
> 
> Xerces does a good job so it's definitely possible to work with it. E.g. DTD 
> caching can be configured for it as well as catalogs.
> 
> Digester is there to make the interface between xml-parsing and java objects.
> If you're just producing XML outside, there may be alternatives, indeed.

Right.  I'm trying to stick with Digester because the XML I'm parsing would be a pain to parse with straight Xerces.

Thanks,
Otis

Re: [Digester] HTML entity decoding?

Posted by Paul Libbrecht <pa...@activemath.org>.

On 22 Apr 2009, at 06:06, Otis Gospodnetic wrote:
> I'm no XML guru, so some of this stuff is fuzzy.  Please see my  
> comments/questions below.

I'm happy to help ;-)

> XML files I'm trying to parse do have "links" to DTDs in the  
> "header" (sometimes with a full http://... URL, and sometimes with  
> just a local file name), but there are no actual DTD files there.   
> Is the first step, then, making sure that the referenced DTD files  
> really exist at locations pointed to in the "header" of the XML?

The short answer is yes.
The long answer is yes except if you manage to configure xml catalogs  
(I think that, in the case of Xerces, something such as the  
XmlResolver is used) which associate "public-ids" to local files.  
That's best for performance but long to configure.
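
As a rough sketch of that idea (not a real XML catalog, and not Digester-specific), an org.xml.sax.EntityResolver can hand the parser a local copy of a DTD whenever a known system ID is requested. The class name LocalDtdResolver, its register method, and the one-entity DTD text are made up for illustration; the system ID is the example one used elsewhere in this thread. Only plain JAXP/SAX from the JDK is used.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class LocalDtdResolver implements EntityResolver {

    // Hypothetical mapping from system IDs to DTD text kept locally.
    private final Map<String, String> dtdBySystemId = new HashMap<>();

    public LocalDtdResolver register(String systemId, String dtdText) {
        dtdBySystemId.put(systemId, dtdText);
        return this;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String dtd = dtdBySystemId.get(systemId);
        if (dtd == null) {
            return null; // unknown ID: let the parser do its default lookup
        }
        InputSource source = new InputSource(new StringReader(dtd));
        source.setSystemId(systemId);
        return source;
    }

    /** Parses xml with the given resolver and returns all character data. */
    public static String parseText(String xml, EntityResolver resolver) {
        StringBuilder text = new StringBuilder();
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            XMLReader reader = parser.getXMLReader();
            reader.setEntityResolver(resolver);
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        LocalDtdResolver resolver = new LocalDtdResolver()
                .register("http://example.com/dtd/foo-1.2.dtd",
                          "<!ENTITY uuml \"&#252;\">");
        String xml = "<!DOCTYPE name SYSTEM \"http://example.com/dtd/foo-1.2.dtd\">"
                   + "<name>Gr&uuml;ber</name>";
        System.out.println(parseText(xml, resolver));
    }
}
```

A resolver of this shape could presumably be passed to dig.setEntityResolver(...) as well, but that is not tested here. A real catalog would usually key on public IDs too.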

I suppose this is going to live in something that is not command-line,  
so DTDs should be cached. At worst, make sure the property for this in  
the parser is set.

>> Here's a text pointing to such a DTD:
>> http://www.w3.org/TR/xhtml-modularization/dtd_module_defs.html#a_xhtml_character_entities
>
> So does this mean I would have to ensure that the DTD files contain  
> things like:
> <!ENTITY nbsp   "&#160;" >
> <!ENTITY iexcl  "&#161;" >
> <!ENTITY uuml   "&#252;" >
> ...
> and so on?
> And if my DTD had this, are you saying Digester would decode my:
> <name><![CDATA[Gr&uuml;ber]]></name>

no
but
<name>Gr&uuml;ber</name>
(the other form is exactly an escape which is equivalent to
   <name>Gr&amp;uuml;ber</name> not what you want!)

> to Grüber?  Or to &#252; ?

(Both of the above are equivalent in XML-compliant parsers. A method  
reading that XML would only receive Grüber.)
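
To make the difference concrete, here is a small sketch using the JDK's DOM parser with an internal DTD subset that declares only uuml. The class and method names are invented for the example, and the inline DOCTYPE merely stands in for whatever DTD the real files reference.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class EntityVersusCdata {

    // Internal DTD subset declaring the one entity the sample uses.
    private static final String DOCTYPE =
            "<!DOCTYPE name [ <!ENTITY uuml \"&#252;\"> ]>";

    /** Parses the XML and returns the text content of the root element. */
    public static String rootText(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(false); // entity expansion does not need validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample", e);
        }
    }

    public static void main(String[] args) {
        String plain = DOCTYPE + "<name>Gr&uuml;ber</name>";
        String cdata = DOCTYPE + "<name><![CDATA[Gr&uuml;ber]]></name>";
        System.out.println(rootText(plain)); // entity reference is expanded
        System.out.println(rootText(cdata)); // CDATA content is passed through as-is
    }
}
```

The parser treats everything inside a CDATA section as literal character data, so the declaration of uuml has no effect there.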


> My end goal is to index this data with Lucene/Solr, so I need it to  
> be "Grüber" before I send it to Lucene/Solr.
> In other words, if I end up with &#252;, this is still no good for  
> me, as I still wouldn't have Grüber.

You could also insert the DTDs inside the solr document.

>> Note that opening the file with a validating parser will certainly  
>> grumble about
>> all sorts of undeclared elements, this is ok, it does not prevent  
>> parsing but
>> is, indeed, a validation error.
>
> Uh, I'm lost here.  Which file are you referring to?  DTD or the XML  
> file?  Sounds like XML.  And why would I get complaints about  
> undeclared elements?

the DTD has the double function of declaring elements and attributes  
as well as entities.
DTD validation will fail if you have just defined entities in your DTD  
but not the relevant elements.
XML parsing will fail if you use entities that you have not defined.

>> However you get the entity-expansion.
>
> How?  If I make the XML parser validating?

If you use a conforming parser.

> This is what I do to my Digester instance as soon as I create it:
>        dig.setValidating(false);

This prevents validation failures (such as undeclared attributes or  
elements) from stopping processing; it is good.

>        dig.setEntityResolver(new NoOpEntityResolver());
> And that NoOpEntityResolver is my custom class that implements the  
> resolveEntity method:

I believe that is definitely the problem! ;-)
Please note that most DTD files that people refer to are easy to get  
publicly and are often bundled with software.

What kind of files are these that you are reading with Digester?
Do you have samples?
You seem to be lacking control of the DTDs in the same fancy way HTML  
files are done. I would consider NekoHtml tools then.

>> Note that using the first form, which contains an *escaped* entity,  
>> there's
>> nothing to do! You'd have to match them manually ("re-entrantly")  
>> into a parser
>> that parses entities properly.
>
> Uh, what does this mean? :)
> Are you saying "&uuml;" is the "escaped" form of the entity?  (what  
> would be the unescaped form of it?)

I was saying <![CDATA[Gr&uuml;ber]]> or Gr&amp;uuml;ber is the escaped  
form, which you can only fix by applying regexps (which might break  
other things).
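
If the escaped form cannot be avoided, the manual fix might look roughly like the following. This is only a sketch: the class name and the three-entry table are illustrative, and a real table would need the full list of HTML character entities. Unknown names, including the XML built-ins such as &amp;, are deliberately left alone.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedEntityDecoder {

    private static final Pattern ENTITY = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    // Deliberately tiny table for illustration only.
    private static final Map<String, String> TABLE = new HashMap<>();
    static {
        TABLE.put("nbsp", "\u00a0");
        TABLE.put("iexcl", "\u00a1");
        TABLE.put("uuml", "\u00fc");
    }

    /** Replaces known named entities; unknown ones are left as they are. */
    public static String decode(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = TABLE.get(m.group(1));
            if (replacement == null) {
                replacement = m.group(); // keep unknown entities untouched
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("Gr&uuml;ber &amp; Co"));
    }
}
```

Running this over text that legitimately contains strings such as "&uuml;" (for instance, documentation about entities) would change it too, which is the kind of breakage meant above.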

> And what do you mean by there is nothing to do?  (I was hoping the  
> parser would do the work and convert "&uuml;" to "ü")
> I don't understand the last sentence.... so I'm not even sure how to  
> ask any questions about it.... but it sounds like you are saying  
> that some parsers may simply do what I need, just not Digester?  I'm  
> not sure what you mean by manual matching?

Digester is not a parser, it uses the JAXP-available parsers.
By default in JDK >= 1.5, this is a xerces copy (under com.sun  
packages).
If you have other parsers in the classpath, these may be picked up  
instead (something in META-INF can be used, I think).
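
One way to check which parser JAXP actually picks up is to print the implementation class, as in this small sketch (the class name WhichParser is invented for the example). The lookup can be overridden, for instance with a META-INF/services/javax.xml.parsers.SAXParserFactory file on the classpath or the javax.xml.parsers.SAXParserFactory system property.

```java
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

public class WhichParser {

    /** Returns the class name of the SAX parser JAXP hands out by default. */
    public static String defaultSaxParserClass() {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            return parser.getClass().getName();
        } catch (Exception e) {
            throw new IllegalStateException("no SAX parser available", e);
        }
    }

    public static void main(String[] args) {
        // The printed name depends on the JDK and on what is on the classpath.
        System.out.println(defaultSaxParserClass());
    }
}
```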

Xerces does a good job so it's definitely possible to work with it.  
E.g. DTD caching can be configured for it as well as catalogs.

Digester is there to make the interface between xml-parsing and java  
objects.
If you're just producing XML outside, there may be alternatives, indeed.

paul

Re: [Digester] HTML entity decoding?

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Hi Paul,

I'm no XML guru, so some of this stuff is fuzzy.  Please see my comments/questions below.



----- Original Message ----
> From: Paul Libbrecht <pa...@activemath.org>
> To: Commons Users List <us...@commons.apache.org>
> Sent: Wednesday, April 15, 2009 6:24:05 PM
> Subject: Re: [Digester] HTML entity decoding?
> 
> Hello Otis,
> 
> For the second form you'll need to hook up a DTD to do so: a DTD declaration in 
> your header pointing to a DTD which defines these entities. I am no expert in 
> Digester, but I believe that it is the only way to do so, at least according to 
> the XML specs.

XML files I'm trying to parse do have "links" to DTDs in the "header" (sometimes with a full http://... URL, and sometimes with just a local file name), but there are no actual DTD files there.  Is the first step, then, making sure that the referenced DTD files really exist at locations pointed to in the "header" of the XML?

> Here's a text pointing to such a DTD:
>   
> http://www.w3.org/TR/xhtml-modularization/dtd_module_defs.html#a_xhtml_character_entities

So does this mean I would have to ensure that the DTD files contain things like:

<!ENTITY nbsp   "&#160;" >
<!ENTITY iexcl  "&#161;" >
<!ENTITY uuml   "&#252;" >
...
and so on?
And if my DTD had this, are you saying Digester would decode my:
<name><![CDATA[Gr&uuml;ber]]></name>

to Grüber?  Or to &#252; ?

My end goal is to index this data with Lucene/Solr, so I need it to be "Grüber" before I send it to Lucene/Solr.  In other words, if I end up with &#252;, this is still no good for me, as I still wouldn't have Grüber.

> Note that opening the file with a validating parser will certainly grumble about 
> all sorts of undeclared elements, this is ok, it does not prevent parsing but 
> is, indeed, a validation error.

Uh, I'm lost here.  Which file are you referring to?  DTD or the XML file?  Sounds like XML.  And why would I get complaints about undeclared elements?

> However you get the entity-expansion.

How?  If I make the XML parser validating?  This is what I do to my Digester instance as soon as I create it:
        dig.setValidating(false);
        dig.setEntityResolver(new NoOpEntityResolver());

And that NoOpEntityResolver is my custom class that implements the resolveEntity method:

import java.io.IOException;
import java.io.Reader;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

public class NoOpEntityResolver implements EntityResolver {
    public InputSource resolveEntity(String publicId, String systemId) {
        // For the two DTDs that do not exist, hand the parser an empty
        // stream so that it does not fail while trying to fetch them.
        if ("file:///tmp/dtd/foo-1.2.dtd".equals(systemId)
                || "http://example.com/dtd/foo-1.2.dtd".equals(systemId)) {
            BlankReader reader = new BlankReader();
            // return a special input source
            return new InputSource(reader);
        } else {
            // use the default behaviour
            return null;
        }
    }

    // A Reader that is always at end of stream.
    class BlankReader extends Reader {
        @Override
        public void close() throws IOException {}
        @Override
        public int read(char[] arg0, int arg1, int arg2) throws IOException {
            return -1;
        }
    }
}

Could this be a problem?  I had to add this class to stop Digester from breaking, if I recall correctly, because those .dtd files don't actually exist.

> Note that using the first form, which contains an *escaped* entity, there's 
> nothing to do! You'd have to match them manually ("re-entrantly") into a parser 
> that parses entities properly.

Uh, what does this mean? :)
Are you saying "&uuml;" is the "escaped" form of the entity?  (what would be the unescaped form of it?)
And what do you mean by there is nothing to do?  (I was hoping the parser would do the work and convert "&uuml;" to "ü")
I don't understand the last sentence.... so I'm not even sure how to ask any questions about it.... but it sounds like you are saying that some parsers may simply do what I need, just not Digester?  I'm not sure what you mean by manual matching?

Any further help would be greatly appreciated.

Thanks,
Otis

> paul
> 
> PS: I would feel lucky that XML parsing did not blow up in the second case, as 
> it would with a normal XML parser: a missing entity declaration means unparseable 
> XML, while a missing element declaration is much less dangerous.
> 
> On 16 Apr 2009, at 00:06, Otis Gospodnetic wrote:
> 
> > 
> > Hello,
> > 
> > I'm using Digester 2.0 and trying to process XML that
> > may include HTML entities and trying to get Digester to decode them
> > when parsing.
> > 
> > For example, my XML contains:
> >  <name><![CDATA[Gr&uuml;ber]]></name>
> > 
> > Currently, Digester parses this as:  Gr&uuml;ber
> > 
> > But what I am really after is "Grüber", so I am looking for a way to get this 
> > &uuml; entity decoded by Digester.
> > How do I tell Digester to decode HTML entities?
> > 
> > Also, if I don't use CDATA, like this:
> >  <name>Gr&uuml;ber</name>
> > 
> > Digester gives me: Grber


Re: [Digester] HTML entity decoding?

Posted by Paul Libbrecht <pa...@activemath.org>.

Hello Otis,

For the second form you'll need to hook up a DTD to do so: a DTD  
declaration in your header pointing to a DTD which defines these  
entities. I am no expert in Digester, but I believe that it is the only  
way to do so, at least according to the XML specs.

Here's a text pointing to such a DTD:
   http://www.w3.org/TR/xhtml-modularization/dtd_module_defs.html#a_xhtml_character_entities

Note that opening the file with a validating parser will certainly  
grumble about all sorts of undeclared elements, this is ok, it does  
not prevent parsing but is, indeed, a validation error.
However you get the entity-expansion.

Note that using the first form, which contains an *escaped* entity,  
there's nothing to do! You'd have to match them manually  
("re-entrantly") into a parser that parses entities properly.

paul

PS: I would feel lucky that XML parsing did not blow up in the second  
case, as it would with a normal XML parser: a missing entity  
declaration means unparseable XML, while a missing element declaration  
is much less dangerous.

On 16 Apr 2009, at 00:06, Otis Gospodnetic wrote:

>
> Hello,
>
> I'm using Digester 2.0 and trying to process XML that
> may include HTML entities and trying to get Digester to decode them
> when parsing.
>
> For example, my XML contains:
>  <name><![CDATA[Gr&uuml;ber]]></name>
>
> Currently, Digester parses this as:  Gr&uuml;ber
>
> But what I am really after is "Grüber", so I am looking for a way to  
> get this &uuml; entity decoded by Digester.
> How do I tell Digester to decode HTML entities?
>
> Also, if I don't use CDATA, like this:
>  <name>Gr&uuml;ber</name>
>
> Digester gives me: Grber