Posted to j-dev@xerces.apache.org by Lars Oppermann <la...@kurtius-looft.com> on 2000/07/05 15:51:59 UTC

Expanding of character entities in serialized documents

Hi everybody,

I'm working on a tool that extracts chunks of text from XML documents
for indexing in a search engine (Lucene).
What I'm doing is getting the relevant subtrees from the document and
using a DOM text serializer (org.apache.xml.serialize.TextSerializer) to
convert them to text for submission to the index.
The problem I'm experiencing is that character entities used in
the documents do not get resolved. The entities are declared like this
in my documents:

<!DOCTYPE page [
  <!ENTITY % HTMLlat1 SYSTEM "../dtd/xhtml-lat1.ent">
  %HTMLlat1;
]>
<page>
...
</page>

What do I have to do to get the TextSerializer to expand the references
to the appropriate characters?

Thanks,
Lars
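
For illustration only, here is a minimal sketch of one way to get text with
entity references expanded. It does not use TextSerializer at all. It assumes
a JAXP parser with DOM Level 3 support (both postdate this thread), and the
class and method names (EntityTextExtractor, extractText) are hypothetical.
The sample below declares the entity in the internal subset; with an external
entity file such as xhtml-lat1.ent, the parser would also need to be able to
resolve that file.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Xerces-J or Lucene.
public class EntityTextExtractor {

    // Parses the XML and returns the text content of the root element,
    // with entity references replaced by their expansion.
    static String extractText(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Ask the parser to replace entity reference nodes by their content
        // (this is the JAXP default, set here to make the intent explicit).
        factory.setExpandEntityReferences(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        // getTextContent() (DOM Level 3) concatenates all descendant text.
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml =
            "<!DOCTYPE page [ <!ENTITY auml \"&#228;\"> ]>"
            + "<page>M&auml;rz</page>";
        System.out.println(extractText(xml));
    }
}
```

To index only a subtree, the same getTextContent() call can be made on the
relevant Element instead of the document element.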

get the Encoding of a DOM

Posted by Stefan Rauch <sr...@uos.de>.

Hi!

I'm sorry if this has already been asked.
Is there any way in Xerces-J to get the specified encoding from a
DocumentImpl?
Oracle's Document implementation (from Oracle's XML Parser for Java 2.0.2.x)
has the methods getEncoding() and setEncoding(String encoding). Is there any
similar way in Xerces-J to obtain and set the encoding of a Document
representing an XML document?

Thanks.

Stefan Rauch.
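
For illustration only, a minimal sketch of how the encoding can be read and
chosen with the standard APIs. It assumes a DOM Level 3 implementation and
JAXP, both of which postdate this thread, so it says nothing about the
DocumentImpl of the time. The class and method names (DocumentEncoding,
declaredEncoding, writeWithEncoding) are hypothetical. DOM Level 3 offers
Document.getXmlEncoding() for the encoding named in the XML declaration, but
no matching setter; the output encoding is instead chosen when serializing.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

// Hypothetical helper, not part of Xerces-J.
public class DocumentEncoding {

    // Parses raw bytes so the parser sees the XML declaration itself.
    static Document parse(byte[] xmlBytes) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlBytes));
    }

    // Returns the encoding from the XML declaration, or null if none is given.
    static String declaredEncoding(byte[] xmlBytes) throws Exception {
        return parse(xmlBytes).getXmlEncoding();
    }

    // Serializes the document using the requested output encoding.
    static byte[] writeWithEncoding(Document doc, String encoding) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, encoding);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }
}
```

Parsing from a byte stream rather than a String matters here: a String has
already been decoded, so the declared encoding no longer describes the input.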