Posted to dev@abdera.apache.org by James M Snell <ja...@gmail.com> on 2008/01/14 21:26:55 UTC

HTML Parser

All,

I have some code based on Henri Sivonen's HTML5 parser that adds HTML 
parsing capabilities to the Abdera API.  For instance,

   URL url = new URL("http://www.snellspace.com");
   Abdera abdera = Abdera.getInstance();
   Parser parser = abdera.getParserFactory().getParser("html");
   Document<Element> doc = parser.parse(url.openStream());
   doc.writeTo(System.out);

The parser repairs broken markup and exposes the result through 
the Abdera Element objects.  The two cases where this becomes 
particularly useful are:

a) Performing autodiscovery of feeds and atompub service docs
b) Converting HTML content to XHTML content and protecting feeds against
    accidental breakage.

For example,

   List<Element> list =
     HtmlHelper.discoverLinks(
       "http://www.snellspace.com/wp",
       "application/atom+xml",
       "alternate");
   for (Element el : list) {
     String href = el.getAttributeValue("href");
     String title = el.getAttributeValue("title");
     String type = el.getAttributeValue("type");
     System.out.println(type + ", " + title + ", " + href);
   }
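
To make the autodiscovery idea concrete, here is a minimal sketch of the filtering step using only the JDK's XPath support. It is not the HtmlHelper implementation: the class LinkDiscoverySketch and the method discoverHrefs are hypothetical names, and the input is assumed to be already well-formed XHTML (that is, the output of the repairing parser, not raw HTML).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not part of Abdera.
class LinkDiscoverySketch {

  // Returns the href of every <link> whose type matches and whose
  // whitespace-separated rel value contains the requested relation.
  // Assumes the input is well-formed XHTML without a namespace declaration.
  static List<String> discoverHrefs(String xhtml, String type, String rel) {
    NodeList links;
    try {
      links = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
          "//link[@href]",
          new InputSource(new StringReader(xhtml)),
          XPathConstants.NODESET);
    } catch (XPathExpressionException e) {
      throw new IllegalArgumentException("could not evaluate XPath over input", e);
    }

    List<String> hrefs = new ArrayList<>();
    for (int i = 0; i < links.getLength(); i++) {
      Element link = (Element) links.item(i);
      boolean typeMatches =
          type.equalsIgnoreCase(link.getAttribute("type").trim());
      // rel may carry several tokens, e.g. "alternate home"
      List<String> relTokens = Arrays.asList(
          link.getAttribute("rel").trim().toLowerCase(Locale.ROOT).split("\\s+"));
      if (typeMatches && relTokens.contains(rel.toLowerCase(Locale.ROOT))) {
        hrefs.add(link.getAttribute("href"));
      }
    }
    return hrefs;
  }

  public static void main(String[] args) {
    String xhtml =
        "<html><head><title>t</title>"
      + "<link rel='alternate' type='application/atom+xml' href='/feed/atom'/>"
      + "<link rel='stylesheet' type='text/css' href='/s.css'/>"
      + "<link rel='alternate home' type='application/atom+xml' href='/feed2'/>"
      + "</head><body/></html>";
    // prints [/feed/atom, /feed2]
    System.out.println(discoverHrefs(xhtml, "application/atom+xml", "alternate"));
  }
}
```

The rel value is split on whitespace because a single link element can declare more than one relation.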

And another:

   Abdera abdera = Abdera.getInstance();
   Entry entry = abdera.newEntry();
   entry.setContentAsXhtml(HtmlCleaner.parse("<p>test<br>foo"));
   System.out.println(entry);

Which outputs:

   <entry xmlns="http://www.w3.org/2005/Atom">
     <content type="xhtml">
       <div xmlns="http://www.w3.org/1999/xhtml">
         <p>test<br />foo</p>
       </div>
     </content>
   </entry>

Note that the HTML fragment is repaired by the HtmlCleaner: the unclosed 
<p> is closed and <br> becomes <br />.

I could commit this but doing so means adding two new optional 
dependency jars.  I think the function is valuable enough to justify the 
addition but I wanted to run it past the rest of you first.

- James

Re: HTML Parser

Posted by James M Snell <ja...@gmail.com>.
Checked in.

Btw, this code lets us do some other cool things...

   URL url = new URL("http://www.snellspace.com/wp");
   Abdera abdera = Abdera.getInstance();
   Document<Element> doc =
     abdera.getParserFactory().
     getParser("html").parse(
       url.openStream());
   XPath xpath = abdera.getXPath();

   // enumerate all links in the html doc
   List<Element> nodes = xpath.selectNodes("//a", doc);
   for (Element node : nodes)
     System.out.println(node);

   // enumerate all hCards in the html doc
   List<Element> vcards =
     xpath.selectNodes(
       "//*[@class='vcard']", doc);
   for (Element node : vcards)
     System.out.println(node);
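
One caveat on the hCard query: an exact comparison against the class attribute only matches elements whose class is exactly "vcard", so something like class="vcard author" would be skipped. A common XPath 1.0 idiom tests for the token instead. The sketch below shows the difference using the JDK's own XPath engine rather than Abdera's; the class ClassTokenXPathSketch and its members are hypothetical names, and whether the same expression behaves identically through abdera.getXPath() is an assumption.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not part of Abdera.
class ClassTokenXPathSketch {

  // Matches only when the whole class attribute equals "vcard".
  static final String EXACT = "//*[@class='vcard']";

  // Matches when "vcard" is one of the whitespace-separated class tokens.
  static final String TOKEN =
      "//*[contains(concat(' ', normalize-space(@class), ' '), ' vcard ')]";

  // Evaluates expr over the XML string and returns the id attribute of
  // each matching element, in document order.
  static List<String> selectIds(String expr, String xml) {
    NodeList nodes;
    try {
      nodes = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
          expr,
          new InputSource(new StringReader(xml)),
          XPathConstants.NODESET);
    } catch (XPathExpressionException e) {
      throw new IllegalArgumentException("could not evaluate: " + expr, e);
    }
    List<String> ids = new ArrayList<>();
    for (int i = 0; i < nodes.getLength(); i++) {
      ids.add(((Element) nodes.item(i)).getAttribute("id"));
    }
    return ids;
  }

  public static void main(String[] args) {
    String xml =
        "<div>"
      + "<div id='a' class='vcard'/>"
      + "<div id='b' class='vcard author'/>"
      + "<div id='c' class='vcards'/>"
      + "</div>";
    System.out.println(selectIds(EXACT, xml)); // prints [a]
    System.out.println(selectIds(TOKEN, xml)); // prints [a, b]
  }
}
```

Padding both sides with spaces is what keeps "vcards" from matching as a substring.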

- James

Brian Moseley wrote:
> On Jan 14, 2008 12:26 PM, James M Snell <ja...@gmail.com> wrote:
> 
>> I could commit this but doing so means adding two new optional
>> dependency jars.  I think the function is valuable enough to justify the
>> addition but I wanted to run it past the rest of you first.
> 
> great addition. +1
> 

Re: HTML Parser

Posted by Brian Moseley <bc...@maz.org>.
On Jan 14, 2008 12:26 PM, James M Snell <ja...@gmail.com> wrote:

> I could commit this but doing so means adding two new optional
> dependency jars.  I think the function is valuable enough to justify the
> addition but I wanted to run it past the rest of you first.

great addition. +1