Posted to dev@tika.apache.org by Pablo Queixalos <pa...@polyspot.com> on 2011/09/22 11:34:19 UTC
HSLFExtractor & POI : Looking for better XHTML
Hi,
The XHTML output of the HSLFExtractor parser is not proper XHTML: it simply inserts the full text into a single <p> (paragraph) tag, including raw carriage returns that have no meaning in HTML. This behavior stems from the limited capabilities of the POI PowerPointExtractor.
Based on the PowerPointExtractor implementation, I rewrote the HSLFExtractor parser. The new implementation produces better XHTML but uses the org.apache.poi.hslf POI model.
- What is the philosophy of Tika parser implementations with respect to their dependencies? I mean, must the HSLFExtractor contain only the strict minimum of code needed to integrate the top-level POI API, or is it OK to do it the way I did?
- Are there conventions for the XHTML produced by the parsers: global formatting (e.g., a <div> per page, <h1> for headers) and related CSS classes?
I am quite new to tika-parsers, so any feedback on the rewritten HSLFExtractor would be appreciated.
Thanks,
Pablo QUEIXALOS
Developer
79, rue du Faubourg Poissonnière
75009 Paris - France
RE: HSLFExtractor & POI : Looking for better XHTML
Posted by Pablo Queixalos <pa...@polyspot.com>.
Oops, attachment was dropped.
Here it is: http://dl.free.fr/mJ2N9wIBh/HSLFExtractor.java
RE: HSLFExtractor & POI : Looking for better XHTML
Posted by Pablo Queixalos <pa...@polyspot.com>.
Thank you for your answers.
I created the related JIRA entry: https://issues.apache.org/jira/browse/TIKA-727
Pablo.
Re: HSLFExtractor & POI : Looking for better XHTML
Posted by Nick Burch <ni...@alfresco.com>.
On Thu, 22 Sep 2011, Pablo Queixalos wrote:
> Based on the PowerPointExtractor implementation, I rewrote the
> HSLFExtractor parser. The new implementation produces better XHTML but
> uses the org.apache.poi.hslf POI model.
If you wouldn't mind, please create a new JIRA entry for this, and upload
your patch.
> - What is the philosophy of Tika parser implementations with respect to
> their dependencies? I mean, must the HSLFExtractor contain only the
> strict minimum of code needed to integrate the top-level POI API, or is
> it OK to do it the way I did?
It's fine to use other parts of the API as needed. If you look at some of
the other office parsers, you'll see that they all do that too.
> - Are there conventions for the XHTML produced by the parsers: global
> formatting (e.g., a <div> per page, <h1> for headers) and related CSS
> classes?
We try to keep the XHTML simple and clean, with sensible tags, and we try
to keep it similar between different formats of the same type. Ideally, if
you take the same presentation and save it as .ppt, .pptx and .odp, then
Tika will give you quite similar XHTML back again.
(We don't try to re-create the exact layout and formatting of the original
document, however.)
Nick
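To make the "simple and clean, with sensible tags" idea from this thread concrete, here is a minimal sketch of the kind of output discussed: one <div> per slide, an <h1> for the slide title, and one <p> per line of text instead of a single paragraph containing raw carriage returns. It uses only the JDK. The Slide class, the "slide" CSS class name, and the render/paragraphs/escape helpers are all hypothetical names chosen for illustration; they are not part of Tika or POI, and this is not the code from the attached HSLFExtractor.java or from TIKA-727.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only: renders slide text as simple structured XHTML. */
final class SlideXhtmlSketch {

    /** Hypothetical stand-in for a slide; NOT a POI or Tika type. */
    static final class Slide {
        final String title;
        final String body;

        Slide(String title, String body) {
            this.title = title;
            this.body = body;
        }
    }

    /** Escapes the characters that are significant in XHTML text content. */
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Splits raw text on CRLF, CR or LF and drops blank lines. */
    static List<String> paragraphs(String raw) {
        List<String> out = new ArrayList<>();
        if (raw == null) {
            return out;
        }
        for (String line : raw.split("\\r\\n|\\r|\\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                out.add(trimmed);
            }
        }
        return out;
    }

    /** Emits one div per slide, an h1 for the title, and one p per text line. */
    static String render(List<Slide> slides) {
        StringBuilder sb = new StringBuilder();
        for (Slide slide : slides) {
            // The "slide" class name is an assumption, not a Tika convention.
            sb.append("<div class=\"slide\">");
            if (slide.title != null && !slide.title.trim().isEmpty()) {
                sb.append("<h1>").append(escape(slide.title.trim())).append("</h1>");
            }
            for (String p : paragraphs(slide.body)) {
                sb.append("<p>").append(escape(p)).append("</p>");
            }
            sb.append("</div>");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Slide> deck = new ArrayList<>();
        deck.add(new Slide("Agenda", "Intro\rDemo & Q<A"));
        System.out.println(render(deck));
    }
}
```

In a real Tika parser the markup would normally be emitted as SAX events through the content handler rather than concatenated into a string; the string builder here only keeps the sketch self-contained.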