Posted to j-users@xerces.apache.org by st...@cyberspace.org on 2002/04/11 13:12:04 UTC

HTMLDocument howto,..(urgent)


How do we get an HTMLDocument for an HTML file? The following doesn't seem to work:

 DOMParser domp = new DOMParser();
 domp.parse(new InputSource(htmlfile));
 Document d = domp.getDocument();
 HTMLDocument hd = (HTMLDocument)d;
 System.out.println ("HTML TITLE: "+hd.getTitle());
 System.out.println ("HTML body: "+ hd.getBody().getNodeValue());

   on this HTML file:

   <html>
   <head>
   <title>title text</title>
   </head>
   <body>body text </body>
   </html>


   Thanks, st.

---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-user-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-user-help@xml.apache.org

HTMLCollection id vs name

Posted by st...@cyberspace.org.

The apiDoc for HTMLCollection says:

public Node namedItem(java.lang.String name)
This method retrieves a Node using a name. It first searches for a Node with 
a matching id attribute. If it doesn't find one, it then searches for a Node with
a matching name attribute, but only on those elements that are allowed a name attribute.


But I found that with Xerces 1.4.4, this method returns null for
<form name="myf" />
but works correctly for
<form id="myf" />

Is this a bug, or am I doing something wrong?

st.


Re: HTMLDocument howto,..(urgent)

Posted by Andy Clark <an...@apache.org>.

step1b@cyberspace.org wrote:
> How do we get HTMLDocument for a html, the following doesnt seem to work:
> 
>  DOMParser domp = new DOMParser();
>  domp.parse(new InputSource(htmlfile));
>  Document d = domp.getDocument();

The parser does not assume you want an HTML document instance
just because the root element is "html". It builds a generic DOM
document unless told otherwise, which is why the cast to
HTMLDocument fails. If you know that you are parsing HTML
documents that are well-formed according to the XML
specification, set the following property *before* calling
"parse":

domp.setProperty("http://apache.org/xml/properties/dom/document-class-name",
                   "org.apache.html.dom.HTMLDocumentImpl");
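
Put together with the code from the question, the whole sequence might
look like the sketch below. It assumes Xerces-J is on the classpath and
that the input is well-formed XML; the class name HtmlDocumentSketch and
the helper methods parse and bodyText are made up for illustration. Note
that getNodeValue() on an element is always null under the DOM
specification, so the sketch reads the body's first child (the text node)
instead.

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.Node;
import org.w3c.dom.html.HTMLDocument;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of Xerces.
class HtmlDocumentSketch {

    // Parses well-formed HTML and returns it as an HTMLDocument.
    static HTMLDocument parse(String wellFormedHtml) throws SAXException, IOException {
        DOMParser parser = new DOMParser();
        // Must be set before parse(), otherwise a generic document is built
        // and the cast below fails with a ClassCastException.
        parser.setProperty(
            "http://apache.org/xml/properties/dom/document-class-name",
            "org.apache.html.dom.HTMLDocumentImpl");
        parser.parse(new InputSource(new StringReader(wellFormedHtml)));
        return (HTMLDocument) parser.getDocument();
    }

    // Returns the value of the body's first child node, or null if there is none.
    // Calling getNodeValue() on the body element itself would always give null.
    static String bodyText(HTMLDocument doc) {
        Node first = doc.getBody() == null ? null : doc.getBody().getFirstChild();
        return first == null ? null : first.getNodeValue();
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><head><title>title text</title></head>"
                    + "<body>body text </body></html>";
        HTMLDocument doc = parse(html);
        System.out.println("HTML TITLE: " + doc.getTitle());
        System.out.println("HTML body: " + bodyText(doc));
    }
}
```

To parse a file rather than a string, pass new InputSource(htmlfile) as
in the original code.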

However, if your documents are *not* well-formed XML docs
(and most HTML documents are not) then you need to "tidy"
them before parsing them with Xerces. You can use JTidy to
do the job or NekoHTML (and there are probably other tools
available as well). Here are the links:

  http://www.sourceforge.net/projects/jtidy
  http://www.apache.org/~andyc/

-- 
Andy Clark * andyc@apache.org
