Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2016/12/21 14:05:58 UTC

[jira] [Comment Edited] (TIKA-2211) ePub formatting instructions appear in plain text output

    [ https://issues.apache.org/jira/browse/TIKA-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15767125#comment-15767125 ] 

Tim Allison edited comment on TIKA-2211 at 12/21/16 2:05 PM:
-------------------------------------------------------------

That does make sense.  Unless we find that epub xhtml/html is as nasty as real html, I'd prefer to leave that out.

After some kicking of tires, the solution appears to be simpler.  The EPubContentParser was adding a new XHTMLContentHandler for each chapter.  I _think_ this prevented the BodyContentHandler from working properly.  That handler is a filter that passes on only the content inside <body></body> elements, which keeps the <style> and <script> types of things that show up in <head> from entering the "content" section.

Once I removed the xhtml content handler from the EPubContentParser, all seems to work, and only body elements are being added to the overall output.
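
To illustrate the body-only filtering idea, here is a minimal sketch that uses only the JDK's SAX classes.  It is not Tika's BodyContentHandler or the EPubContentParser change; the class and method names (BodyOnlyTextDemo, BodyOnlyHandler, extractBodyText) are hypothetical and exist only for this example.  It shows how a handler that tracks whether it is inside <body> keeps <style> text in <head> out of the collected output.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; a stand-in for the filtering idea, not Tika code.
public class BodyOnlyTextDemo {

    /** Collects character data only while the parser is inside a body element. */
    static class BodyOnlyHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        // 0 means outside <body>; 1 means directly inside it; higher means nested.
        private int bodyDepth = 0;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (bodyDepth > 0 || "body".equals(localName)) {
                bodyDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (bodyDepth > 0) {
                bodyDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text in <head> (for example CSS in <style>) arrives with bodyDepth == 0
            // and is dropped here.
            if (bodyDepth > 0) {
                text.append(ch, start, length);
            }
        }

        String text() {
            return text.toString().trim();
        }
    }

    /** Parses a well-formed XHTML string and returns only the text found inside body. */
    static String extractBodyText(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // so localName is "body" even with the XHTML namespace
        BodyOnlyHandler handler = new BodyOnlyHandler();
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.text();
    }

    public static void main(String[] args) throws Exception {
        // Made-up chapter markup, loosely modeled on the CSS shown in the issue report.
        String chapter = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><style>p.sgc-1 {text-align: justify;}</style></head>"
                + "<body><p class=\"sgc-1\">Al duque</p></body></html>";
        System.out.println(extractBodyText(chapter));
    }
}
```

In this sketch the CSS rule never reaches the output because it arrives while the handler is still outside <body>.  If that body-tracking state is lost or bypassed per chapter, which is what I suspect was happening, the <head> content leaks through.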

What I can't figure out is why no one has complained that the ToXML option didn't appear to work...at least on our one test file.  That now does work.

Also, I turned on some tests for the iBooksParser.  There's a comment in the test code that it didn't use to work, but it seems to be working now even before I made the change...not sure what was going on there.



> ePub formatting instructions appear in plain text output
> --------------------------------------------------------
>
>                 Key: TIKA-2211
>                 URL: https://issues.apache.org/jira/browse/TIKA-2211
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.14
>         Environment: I tested this on Mac OSX 10.11.6 with Oracle JDK 1.8.0_112.  The Tika stand-alone application was launched as follows:
> {code}
> java -jar tika-app-1.14.jar
> {code}
>            Reporter: Adam Carroll
>
> For some ePub files, format information appears in the plain text output produced by Apache Tika.  For example, the Tika stand-alone application shows the following text for the file “Don Quijote de la Mancha - Miguel de Cervantes.epub” (downloaded [here|http://www.literanda.com/don-quijote-de-la-mancha--miguel-de-cervantes--epub]):
> {code}
> /**/
>   p.sgc-2 {font-style: italic; text-align: right}
>   p.sgc-1 {text-align: justify;}
>   h3.sgc-3 {text-align: center;}
>   /**/
> Al duque de Béjar
> Marqués de Gibraleón, conde de Benalcázar y Bañares, vizconde de La Puebla de Alcocer, señor de las villas de Capilla, Curiel y Burguillos
> En fe del buen acogimiento y honra que hace Vuestra Excelencia a toda suerte de libros, como príncipe tan inclinado a favorecer las buenas artes, mayormente las que por su nobleza no se abaten al servicio y granjerías del vulgo, he determinado de sacar a luz El ingenioso hidalgo don Quijote de la Mancha, al abrigo del clarísimo nombre de Vuestra Excelencia, a quien, con el acatamiento que debo a tanta grandeza, suplico le reciba agradablemente en su protección, para que a su sombra, aunque desnudo de aquel precioso ornamento de elegancia y erudición de que suelen andar vestidas las obras que se componen en las casas de los hombres que saben, ose parecer seguramente en el juicio de algunos que, conteniéndose en los límites de su ignorancia, suelen condenar con más rigor y menos justicia los trabajos ajenos; que, poniendo los ojos la prudencia de Vuestra Excelencia en mi buen deseo, fío que no desdeñará la cortedad de tan humilde servicio.
> {code}
> To reproduce this problem, run the stand-alone version of Tika and open an affected ePub file such as the one mentioned above.  Then go to View -> Plain Text.  You should see the problem there.
> By the way, thanks for making Apache Tika a really useful library.  Keep up the good work!


