Posted to dev@tika.apache.org by Pablo Queixalos <pa...@polyspot.com> on 2011/09/22 11:34:19 UTC

HSLFExtractor & POI : Looking for better XHTML

Hi,

The XHTML output of the HSLFExtractor parser is not proper XHTML: it simply inserts the full text into a single P[aragraph] tag, including raw carriage returns that have no meaning in HTML. This behavior stems from the limited capabilities of the POI PowerPointExtractor.

Based on the PowerPointExtractor implementation, I rewrote the HSLFExtractor parser. The new implementation produces better XHTML, but it works directly with the org.apache.poi.hslf POI model.
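
To illustrate the carriage-return problem described above, here is a minimal, self-contained sketch of splitting raw slide text into one <p> element per paragraph instead of a single <p> holding everything. It is not the attached HSLFExtractor and it uses neither Tika nor POI. The class name ParagraphSplitter and its methods are hypothetical, and the sketch assumes paragraphs are separated by \r, \n or \r\n.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Tika or POI.
class ParagraphSplitter {

    // Escape the characters that are unsafe in XHTML text content.
    static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    // Split raw text on line terminators (assumed separators), dropping blank lines.
    static List<String> splitParagraphs(String rawText) {
        List<String> paragraphs = new ArrayList<>();
        for (String line : rawText.split("\r\n|\r|\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                paragraphs.add(trimmed);
            }
        }
        return paragraphs;
    }

    // Emit one <p> element per paragraph, each on its own line.
    static String toXhtml(String rawText) {
        StringBuilder out = new StringBuilder();
        for (String paragraph : splitParagraphs(rawText)) {
            out.append("<p>").append(escape(paragraph)).append("</p>\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Three paragraphs separated by carriage returns.
        System.out.print(toXhtml("Agenda\rIntro & goals\rQ<A>"));
    }
}
```

A real parser would more likely send SAX events to a content handler than build a string; the string form is used here only to keep the example dependency-free.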

 

- What is the philosophy of Tika parser implementations regarding their dependencies? I mean, must the HSLFExtractor contain only the strict minimum of code needed to integrate the top-level POI API, or is it OK to do it the way I did?

- Are there conventions for the XHTML produced by the parsers: global formatting (e.g., a <div> per page, <h1> for headers) and related CSS classes?
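
As an illustration of the kind of layout the second question has in mind, the sketch below renders each slide as a <div> containing an <h1> title and one <p> per paragraph. The Slide class, the SlideXhtmlSketch name and the "slide" CSS class are assumptions made up for this example; they are not an established Tika convention, and the code uses neither Tika nor POI.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model and class names; not Tika's actual output conventions.
class SlideXhtmlSketch {

    // Minimal stand-in for a slide: an optional title plus body paragraphs.
    static final class Slide {
        final String title;
        final List<String> paragraphs;

        Slide(String title, List<String> paragraphs) {
            this.title = title;
            this.paragraphs = paragraphs;
        }
    }

    // Escape the characters that are unsafe in XHTML text content.
    static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    // Render one <div class="slide"> per slide, with <h1> for the title.
    static String render(List<Slide> slides) {
        StringBuilder out = new StringBuilder();
        for (Slide slide : slides) {
            out.append("<div class=\"slide\">\n");
            if (slide.title != null && !slide.title.isEmpty()) {
                out.append("<h1>").append(escape(slide.title)).append("</h1>\n");
            }
            for (String paragraph : slide.paragraphs) {
                out.append("<p>").append(escape(paragraph)).append("</p>\n");
            }
            out.append("</div>\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Slide> deck = Arrays.asList(
            new Slide("Roadmap", Arrays.asList("Q3: parser rewrite", "Q4: tests")),
            new Slide(null, Arrays.asList("Untitled slide body")));
        System.out.print(render(deck));
    }
}
```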

 

 

I am quite new to tika-parsers, so any feedback on the rewritten HSLFExtractor would be appreciated.

Thanks,

Pablo QUEIXALOS
Developer
79, rue du Faubourg Poissonnière
75009 Paris - France

RE: HSLFExtractor & POI : Looking for better XHTML

Posted by Pablo Queixalos <pa...@polyspot.com>.

Oops, the attachment was dropped.

Here it is: http://dl.free.fr/mJ2N9wIBh/HSLFExtractor.java

RE: HSLFExtractor & POI : Looking for better XHTML

Posted by Pablo Queixalos <pa...@polyspot.com>.

Thank you for your answers.

I created the related JIRA entry: https://issues.apache.org/jira/browse/TIKA-727

Pablo.

Re: HSLFExtractor & POI : Looking for better XHTML

Posted by Nick Burch <ni...@alfresco.com>.

On Thu, 22 Sep 2011, Pablo Queixalos wrote:
> Based on the PowerPointExtractor implementation, I rewrote the 
> HSLFExtractor parser. This new impl produces a better XHTML but uses the 
> org.apache.poi.hslf POI model.

If you wouldn't mind, please create a new JIRA entry for this, and upload 
your patch.

> - What is the philosophy of Tika parsers implementations against their 
> dependencies ? I mean, must the HSLFExtractor implement the strict 
> minimal code to integrate the top POI API, or it is ok to do it the way 
> I did ?

It's fine to use other parts of the API as needed. If you look at some of 
the other office parsers, you'll see that they all do that too.

> - Is there conventions for the XHTML produced by the parsers : global 
> formatting (ie, a <div> per page, <h1> for headers) and related CSS 
> classes ?

We try to keep the XHTML simple and clean, with sensible tags, and we try 
to keep it similar between different formats of the same type. Ideally, if 
you take the same presentation and save it as .ppt, .pptx and .odp, then 
Tika will give you quite similar XHTML back.

(However, we don't try to re-create the exact layout and formatting of the 
original document.)

Nick