Posted to j-users@xalan.apache.org by "McGibbney, Lewis John" <Le...@gcu.ac.uk> on 2012/01/07 18:39:24 UTC

Basic XSL HTML-> XML query

Hi Everyone,

I'm currently using Xalan-J within Yax, the Java-based XProc implementation, and have a really basic query. I have lots of legal documents in HTML, and these include lots and lots of presentation mark-up which I would like to strip before getting down to the Xalan-J XSL stuff. The reasoning behind this is that at the end of my pipeline I am looking to have an RDF/XML dataset based upon the source HTML, so presentation is not important: the dataset's primary function will be to be queried rather than viewed through a browser. The question I have is whether this must be done via an XSL implementation, or whether there is some kind of convenience/util interface which can be extended to do this type of thing?

Thank you very much for any feedback, and apologies for the boring question; I'm new to XSL in general.

Lewis

Glasgow Caledonian University is a registered Scottish charity, number SC021474

Winner: Times Higher Education’s Widening Participation Initiative of the Year 2009 and Herald Society’s Education Initiative of the Year 2009.
http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html

Winner: Times Higher Education’s Outstanding Support for Early Career Researchers of the Year 2010, GCU as a lead with Universities Scotland partners.
http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,15691,en.html

RE: Basic XSL HTML-> XML query

Posted by "McGibbney, Lewis John" <Le...@gcu.ac.uk>.
Thanks for all of the input on this. I was aware of the problems parsing HTML with vanilla Xalan-J, but just discovered that, by some stroke of luck, the source docs are actually XHTML anyway [1].

Again thank you for the help.

Lewis

[1] https://github.com/lewismc/yax/blob/master/sts/0.1.html

________________________________________
From: Timothy Jones [Timothy.Jones@syniverse.com]
Sent: 08 January 2012 15:24
To: milu71@gmx.de; xalan-j-users@xml.apache.org
Subject: Re: Basic XSL HTML-> XML query

I recall using a small Java library called Neko to parse HTML into an XML DOM.  It did the trick!   Just wanted to add it to the list.


tlj


Re: Basic XSL HTML-> XML query

Posted by Timothy Jones <Ti...@syniverse.com>.
I recall using a small Java library called Neko to parse HTML into an XML DOM.  It did the trick!   Just wanted to add it to the list.


tlj

----- Original Message -----
From: Michael Ludwig [mailto:milu71@gmx.de]
Sent: Sunday, January 08, 2012 04:57 AM
To: xalan-j-users@xml.apache.org <xa...@xml.apache.org>
Subject: Re: Basic XSL HTML-> XML query

keshlam@us.ibm.com wrote on 07.01.2012 at 21:18 (-0500):
> The problem is, HTML is not an XML-based language, so unless you've
> deliberately written your input document as XHTML, odds are that no
> XML parser will accept it.

Sure, but as you're saying:

> There are HTML parsers available which produce SAX or DOM (XML)
> output. You could get one of those, use it to read the input document,
> and route its output to Xalan for processing.
> 
> Or you could look for a tool which rewrites HTML as XHTML. I believe
> the W3C's "tidy" tool can be configured to do that. Then you'd run the
> resulting XHTML document (which _is_ XML) through Xalan.

Choices include:

* tidy
* libxml2
* tagsoup

-- 
Michael Ludwig

Re: Basic XSL HTML-> XML query

Posted by Michael Ludwig <mi...@gmx.de>.
keshlam@us.ibm.com wrote on 07.01.2012 at 21:18 (-0500):
> The problem is, HTML is not an XML-based language, so unless you've
> deliberately written your input document as XHTML, odds are that no
> XML parser will accept it.

Sure, but as you're saying:

> There are HTML parsers available which produce SAX or DOM (XML)
> output. You could get one of those, use it to read the input document,
> and route its output to Xalan for processing.
> 
> Or you could look for a tool which rewrites HTML as XHTML. I believe
> the W3C's "tidy" tool can be configured to do that. Then you'd run the
> resulting XHTML document (which _is_ XML) through Xalan.

Choices include:

* tidy
* libxml2
* tagsoup

-- 
Michael Ludwig

Re: Basic XSL HTML-> XML query

Posted by ke...@us.ibm.com.
The problem is, HTML is not an XML-based language, so unless you've 
deliberately written your input document as XHTML, odds are that no XML 
parser will accept it.

There are HTML parsers available which produce SAX or DOM (XML) output. 
You could get one of those, use it to read the input document, and route 
its output to Xalan for processing.

Or you could look for a tool which rewrites HTML as XHTML. I believe the 
W3C's "tidy" tool can be configured to do that. Then you'd run the 
resulting XHTML document (which _is_ XML) through Xalan.
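
To illustrate the first point, here is a minimal sketch that uses only the JDK's built-in JAXP parser. The class name WellFormedCheck, the method isWellFormedXml and the sample strings are made up for illustration; they are not part of Xalan or of the original poster's pipeline. It shows that typical HTML with an unclosed <br> is rejected by an XML parser, while the XHTML-style equivalent is accepted.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper, for illustration only.
final class WellFormedCheck {

    /** Returns true if the markup parses as well-formed XML, false otherwise. */
    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Rethrow parse errors instead of letting the default handler print them.
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                @Override public void error(SAXParseException e) throws SAXException { throw e; }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the input: it is not well-formed XML.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // HTML style: <br> is never closed, which XML does not allow.
        String html = "<html><body><p>one<br>two</p></body></html>";
        // XHTML style: every element is closed and the XHTML namespace is declared.
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                     + "<body><p>one<br/>two</p></body></html>";
        System.out.println("HTML accepted:  " + isWellFormedXml(html));
        System.out.println("XHTML accepted: " + isWellFormedXml(xhtml));
    }
}
```

A check like this can be a cheap first step to decide whether a given source file needs to go through an HTML parser or a tidy-style rewrite before it reaches Xalan.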


______________________________________
"You build world of steel and stone
I build worlds of words alone
Skilled tradespeople, long years taught:
You shape matter; I shape thought."
(http://www.songworm.com/lyrics/songworm-parody/ShapesofShadow.html)

Re: Basic XSL HTML-> XML query

Posted by Michael Ludwig <mi...@gmx.de>.
McGibbney, Lewis John wrote on 07.01.2012 at 17:39 (+0000):

> My situation is that I have lots of legal documents which exist in
> HTML, these in turn include lots and lots of presentation mark-up
> which I would like to strip before getting down to the Xalan-j XSL
> stuff. […] The question I have is whether this must be done via an
> XSL implementation or whether there is some kind of convenience/util
> interface which can be extended to do this type of thing?

I don't know of any convenience utility to do the job, but note that it
isn't hard using XSLT. You start with an identity transform and then add
rules to drop all attributes you don't want and to skip all elements you
don't want while keeping their content. And that's it.

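  <!-- Drop unwanted presentational attributes. -->
  <xsl:template match="@style | @border | @bgcolor | @whatnot" />

  <!-- Skip unwanted elements but keep their content. Their own attributes
       are not selected, so they cannot leak onto the parent element. -->
  <xsl:template match="font | div | whatnot">
    <xsl:apply-templates select="node()"/>
  </xsl:template>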
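
Put together with an identity template, the rules above could be run from Java roughly as follows. This is only a sketch under some assumptions: the class name StripPresentation, the method strip and the sample input are invented for illustration, the "whatnot" placeholders are left out, and the code goes through the standard JAXP API (javax.xml.transform), which should normally pick up Xalan-J when it is on the classpath and otherwise falls back to the JDK's built-in processor.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, for illustration only.
final class StripPresentation {

    // Identity transform plus two overriding rules: one drops presentational
    // attributes, the other unwraps presentational elements.
    private static final String XSLT =
          "<xsl:stylesheet version='1.0'"
        + "    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "  <xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "  <xsl:template match='@*|node()'>"
        + "    <xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "  </xsl:template>"
        + "  <xsl:template match='@style | @border | @bgcolor'/>"
        + "  <xsl:template match='font | div'>"
        + "    <xsl:apply-templates select='node()'/>"
        + "  </xsl:template>"
        + "</xsl:stylesheet>";

    /** Applies the stylesheet to well-formed XML input and returns the result. */
    static String strip(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample; the elements are in no namespace.
        String input = "<body bgcolor='white'><div style='margin:0'>"
                     + "<p>Section 1 <font color='red'>applies</font>.</p>"
                     + "</div></body>";
        System.out.println(strip(input));
    }
}
```

The div and font wrappers and the bgcolor and style attributes should be gone from the output, while the paragraph and its text survive. One caveat: the sample input is in no namespace. Real XHTML documents put their elements in the http://www.w3.org/1999/xhtml namespace, and in XSLT 1.0 an unprefixed pattern such as "font" does not match those. In that case the stylesheet would need to bind a prefix to that namespace (say xmlns:h) and match "h:font | h:div" instead.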

-- 
Michael Ludwig