Posted to dev@abdera.apache.org by James M Snell <ja...@gmail.com> on 2008/01/14 21:26:55 UTC
HTML Parser
All,
I have some code based on Henri Sivonen's HTML5 parser that adds HTML
parsing capabilities to the Abdera API. For instance,
  URL url = new URL("http://www.snellspace.com");
  Abdera abdera = Abdera.getInstance();
  // "html" selects the HTML parser rather than the default XML one
  Parser parser = abdera.getParserFactory().getParser("html");
  Document<Element> doc = parser.parse(url.openStream());
  doc.writeTo(System.out);
The parser will repair broken markup and allow it to be accessed using
the Abdera Element objects. The two cases where this is particularly
useful are:
a) Performing autodiscovery of feeds and atompub service docs
b) Converting HTML content to XHTML content and protecting feeds against
accidental breakage.
For example,
  List<Element> list =
    HtmlHelper.discoverLinks(
      "http://www.snellspace.com/wp",
      "application/atom+xml",
      "alternate");
  for (Element el : list) {
    String href = el.getAttributeValue("href");
    String title = el.getAttributeValue("title");
    String type = el.getAttributeValue("type");
    System.out.println(type + ", " + title + ", " + href);
  }
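For anyone who wants to see the selection step in isolation, here is a rough sketch using only the JDK. It is not how HtmlHelper is implemented. It assumes the markup has already been repaired into well-formed XML (which is the part the HTML parser takes care of), and the names LinkDiscoverySketch and findLinks are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration class; not part of Abdera.
class LinkDiscoverySketch {

    // Returns the href of every <link> whose rel contains the given token
    // and whose type matches. The input must already be well-formed XML.
    static List<String> findLinks(String wellFormedHtml, String type, String rel) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(wellFormedHtml)));
            NodeList links = doc.getElementsByTagName("link");
            List<String> hrefs = new ArrayList<>();
            for (int i = 0; i < links.getLength(); i++) {
                Element link = (Element) links.item(i);
                // rel is a space-separated list of tokens, e.g. "alternate home"
                List<String> rels =
                    Arrays.asList(link.getAttribute("rel").trim().split("\\s+"));
                if (rels.contains(rel)
                        && type.equalsIgnoreCase(link.getAttribute("type"))) {
                    hrefs.add(link.getAttribute("href"));
                }
            }
            return hrefs;
        } catch (Exception e) {
            // Wrap checked parser exceptions to keep the sketch short
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String html =
            "<html><head>"
            + "<link rel=\"alternate\" type=\"application/atom+xml\" href=\"/feed\"/>"
            + "<link rel=\"stylesheet\" type=\"text/css\" href=\"/s.css\"/>"
            + "</head><body/></html>";
        System.out.println(findLinks(html, "application/atom+xml", "alternate"));
    }
}
```

The sketch returns bare href strings, whereas discoverLinks hands back Abdera Element objects so that title and type can be read as well.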
And another:
Abdera abdera = Abdera.getInstance();
Entry entry = abdera.newEntry();
entry.setContentAsXhtml(HtmlCleaner.parse("<p>test<br>foo"));
System.out.println(entry);
Which outputs:
  <entry xmlns="http://www.w3.org/2005/Atom">
    <content type="xhtml">
      <div xmlns="http://www.w3.org/1999/xhtml">
        <p>test<br />foo</p>
      </div>
    </content>
  </entry>
Note that the HtmlCleaner repairs the HTML fragment: the unclosed <p> is
closed and <br> becomes <br />.
I could commit this but doing so means adding two new optional
dependency jars. I think the function is valuable enough to justify the
addition but I wanted to run it past the rest of you first.
- James
Re: HTML Parser
Posted by James M Snell <ja...@gmail.com>.
Checked in.
Btw, this code lets us do some other cool things...
  URL url = new URL("http://www.snellspace.com/wp");
  Abdera abdera = Abdera.getInstance();
  Document<Element> doc =
    abdera.getParserFactory().
      getParser("html").parse(
        url.openStream());
  XPath xpath = abdera.getXPath();

  // enumerate all links in the html doc
  List<Element> nodes = xpath.selectNodes("//a", doc);
  for (Element node : nodes)
    System.out.println(node);

  // enumerate all hCards in the html doc
  List<Element> vcards =
    xpath.selectNodes(
      "//*[@class='vcard']", doc);
  for (Element node : vcards)
    System.out.println(node);
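One caveat on the hCard query: @class='vcard' only matches elements whose class attribute is exactly "vcard", while hCard markup often carries several classes (class="vcard author", say). The usual XPath 1.0 idiom pads the attribute with spaces and tests for the token. Below is a small sketch of the difference using the JDK's javax.xml.xpath rather than Abdera's XPath, run against an already well-formed snippet. ClassTokenSketch and count are made-up names, and I have not checked whether the expression needs namespace adjustments to run against a document parsed by Abdera.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration class; not part of Abdera.
class ClassTokenSketch {

    // Matches only when the class attribute is exactly "vcard"
    static final String EXACT = "//*[@class='vcard']";

    // Matches when "vcard" is one of the space-separated class tokens
    static final String TOKEN =
        "//*[contains(concat(' ', normalize-space(@class), ' '), ' vcard ')]";

    // Counts the nodes selected by expr in the given well-formed XML.
    static int count(String xml, String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(expr, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            // Wrap checked exceptions to keep the sketch short
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml =
            "<div>"
            + "<p class=\"vcard\">a</p>"
            + "<p class=\"vcard author\">b</p>"
            + "<p class=\"vcards\">c</p>"
            + "</div>";
        System.out.println("exact: " + count(xml, EXACT));
        System.out.println("token: " + count(xml, TOKEN));
    }
}
```

The token form picks up the multi-class element without also matching look-alike names such as "vcards".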
- James
Brian Moseley wrote:
> On Jan 14, 2008 12:26 PM, James M Snell <ja...@gmail.com> wrote:
>
>> I could commit this but doing so means adding two new optional
>> dependency jars. I think the function is valuable enough to justify the
>> addition but I wanted to run it past the rest of you first.
>
> great addition. +1
>
Re: HTML Parser
Posted by Brian Moseley <bc...@maz.org>.
On Jan 14, 2008 12:26 PM, James M Snell <ja...@gmail.com> wrote:
> I could commit this but doing so means adding two new optional
> dependency jars. I think the function is valuable enough to justify the
> addition but I wanted to run it past the rest of you first.
great addition. +1