Posted to j-users@xerces.apache.org by Takumi Fujiwara <tr...@yahoo.com> on 2002/08/20 22:05:35 UTC

Entity Reference in NekoHTML parser

I am using NekoHTML Parser, I think it translates
entity references during parsing, is it possible to
turn OFF that feature (e.g. if it sees &nbsp; in the
text node, leave it as &nbsp;)?

Thanks for any help.



---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-user-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-user-help@xml.apache.org


Re: Entity Reference in NekoHTML parser

Posted by Andy Clark <an...@apache.org>.
Takumi Fujiwara wrote:
 > Thanks for your response. But I don't want it to convert, I don't
 > want it to create a DOM node either. I just want it to leave it as it
 > is.

You can use a filter for this purpose. Here is an
example:

   import org.cyberneko.html.filters.DefaultFilter;

   import org.apache.xerces.util.XMLStringBuffer;
   import org.apache.xerces.xni.Augmentations;
   import org.apache.xerces.xni.XMLLocator;
   import org.apache.xerces.xni.XMLResourceIdentifier;
   import org.apache.xerces.xni.XMLString;
   import org.apache.xerces.xni.XNIException;

   public class Entity2Text extends DefaultFilter {

     boolean inEntityRef;
     XMLStringBuffer buffer = new XMLStringBuffer();

     public void startDocument(XMLLocator locator,
                               String encoding,
                               Augmentations augs)
       throws XNIException {
       super.startDocument(locator, encoding, augs);
       inEntityRef = false;
     }

     public void characters(XMLString text,
                            Augmentations augs)
       throws XNIException {
       if (!inEntityRef) {
         super.characters(text, augs);
       }
     }

     public void startGeneralEntity(String name,
                                    XMLResourceIdentifier id,
                                    String encoding,
                                    Augmentations augs)
       throws XNIException {
       inEntityRef = true;
       buffer.clear();
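       // Re-emit the reference as written in the source; this assumes
       // "name" is the bare entity name (e.g. "nbsp").
       buffer.append('&');
       buffer.append(name);
       buffer.append(';');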
       super.characters(buffer, augs);
     }

     public void endGeneralEntity(String name,
                                  Augmentations augs)
       throws XNIException {
       inEntityRef = false;
     }

   }
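
The filter's logic can be tried out without a parser. The following is a minimal sketch, not NekoHTML or XNI code: the Handler interface, the Collector class, and the replayed events are hypothetical stand-ins for what the scanner would report when the notify-builtin-refs feature is on. It only illustrates the state machine: text reported between the start and end of an entity is dropped, and the literal reference is passed on instead.

```java
public class EntityPassThroughSketch {

    /** Hypothetical stand-in for the parser's event callbacks (not the XNI API). */
    interface Handler {
        void characters(String text);
        void startGeneralEntity(String name);
        void endGeneralEntity(String name);
    }

    /** Collects the text that reaches the end of the pipeline. */
    static class Collector implements Handler {
        final StringBuilder out = new StringBuilder();
        public void characters(String text) { out.append(text); }
        public void startGeneralEntity(String name) { }
        public void endGeneralEntity(String name) { }
    }

    /** Same state machine as the filter: drop expanded text, emit the literal reference. */
    static class Entity2TextSketch implements Handler {
        private final Handler next;
        private boolean inEntityRef = false;

        Entity2TextSketch(Handler next) { this.next = next; }

        public void characters(String text) {
            // Text reported inside an entity is its expansion; skip it.
            if (!inEntityRef) {
                next.characters(text);
            }
        }

        public void startGeneralEntity(String name) {
            inEntityRef = true;
            // Assumes "name" is the bare entity name, e.g. "nbsp".
            next.characters("&" + name + ";");
        }

        public void endGeneralEntity(String name) {
            inEntityRef = false;
        }
    }

    /** Replays the events a scanner might report for the source text "a&nbsp;b". */
    static String demo() {
        Collector sink = new Collector();
        Handler filter = new Entity2TextSketch(sink);
        filter.characters("a");
        filter.startGeneralEntity("nbsp");
        filter.characters("\u00A0"); // expanded value, suppressed by the filter
        filter.endGeneralEntity("nbsp");
        filter.characters("b");
        return sink.out.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints a&nbsp;b
    }
}
```

Unlike this sketch, the real filter extends DefaultFilter and forwards events through its superclass.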

Then you just turn on the notification of entity refs
and append the filter to the parsing pipeline. Like so:

   XMLParserConfiguration config = new HTMLConfiguration();
   config.setFeature(
     "http://cyberneko.org/html/features/scanner/notify-builtin-refs",
     true);
   XMLDocumentFilter[] filters = { new Entity2Text() };
   config.setProperty(
     "http://cyberneko.org/html/properties/filters",
     filters);

The "setFeature" and "setProperty" methods can also
be called on the parser classes.

Does this work for you?

P.S. I wrote the code from memory, so it may contain
      mistakes. But this should get you started.

-- 
Andy Clark * andyc@apache.org




Re: Entity Reference in NekoHTML parser

Posted by Takumi Fujiwara <tr...@yahoo.com>.
Andy, 

Thanks for your response. But I don't want it to
convert, I don't want it to create a DOM node either.
I just want it to leave it as it is.

For example, the source is this:
<A> This is some text with &amp; </a>

I want the parser to create an HTMLAnchor element and
then a Text node with node value " This is some
text with &amp; ".

Is this possible?

Thank you.

--- Andy Clark <an...@apache.org> wrote:
> Takumi Fujiwara wrote:
> > I am using NekoHTML Parser, I think it translates
> > entity references during parsing, is it possible to
> > turn OFF that feature (e.g. if it sees &nbsp; in the
> > text node, leave it as &nbsp;)?
> 
> Currently, it reports *all* character content which
> means converting &nbsp; to its character equivalent.
> However, you can set a feature so that the standard
> HTML entity boundaries are reported[1]. This would,
> for example, create a DOM entity ref node for "nbsp"
> and the others.
> 
> Would this help your problem?
> 
> [1] http://www.apache.org/~andyc/neko/doc/html/settings.html
>     "http://cyberneko.org/html/features/scanner/notify-builtin-refs"
> 
> -- 
> Andy Clark * andyc@apache.org




Re: Entity Reference in NekoHTML parser

Posted by Andy Clark <an...@apache.org>.
Takumi Fujiwara wrote:
> I am using NekoHTML Parser, I think it translates
> entity references during parsing, is it possible to
> turn OFF that feature (e.g. if it sees &nbsp; in the
> text node, leave it as &nbsp;)?

Currently, it reports *all* character content which
means converting &nbsp; to its character equivalent.
However, you can set a feature so that the standard
HTML entity boundaries are reported[1]. This would,
for example, create a DOM entity ref node for "nbsp"
and the others.

Would this help your problem?

[1] http://www.apache.org/~andyc/neko/doc/html/settings.html
     "http://cyberneko.org/html/features/scanner/notify-builtin-refs"

-- 
Andy Clark * andyc@apache.org

