Posted to dev@tika.apache.org by "Hudson (JIRA)" <ji...@apache.org> on 2019/03/21 21:49:00 UTC

[jira] [Commented] (TIKA-2310) Try to order chapters in epub correctly

    [ https://issues.apache.org/jira/browse/TIKA-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16798470#comment-16798470 ] 

Hudson commented on TIKA-2310:
------------------------------

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #393 (See [https://builds.apache.org/job/tika-2.x-windows/393/])
TIKA-2841 - focusing on epub, but also fixing TIKA-2310, and handling (tallison: rev 4131c6e30f2e0eb1feb85e0f7576531d4e830468)
* (edit) tika-core/src/test/java/org/apache/tika/TikaTest.java
* (add) tika-parsers/src/main/java/org/apache/tika/parser/utils/ZipSalvager.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/epub/EpubParserTest.java
* (add) tika-parsers/src/test/resources/org/apache/tika/parser/epub/tika-config.xml
* (edit) tika-parsers/src/test/resources/test-documents/testEPUB.epub
* (edit) tika-core/src/main/java/org/apache/tika/utils/XMLReaderUtils.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLExtractorFactory.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/TruncatedOOXMLTest.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/dbf/DBFParserTest.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/epub/EpubParser.java


> Try to order chapters in epub correctly
> ---------------------------------------
>
>                 Key: TIKA-2310
>                 URL: https://issues.apache.org/jira/browse/TIKA-2310
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Minor
>             Fix For: 1.21
>
>
> [~johanvanderknijff] recently pointed out on Twitter that our EPUB parser doesn't emit chapters in the right order.  We should try to fix our parser so that the output follows the book's reading order.
> EPUB is new to me, but it looks like we can read the order from content.opf(?).
> This would require dumping the stream to a ZipFile for direct access to zip entries, but we already require that for OOXML...
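
For illustration only, here is a minimal sketch of the idea in the description: read the OPF package document and return the content hrefs in spine order rather than zip-entry order. It is not the code from the commit above. The class name SpineOrderSketch and the method spineHrefs are hypothetical, and it assumes the OPF follows the usual layout (manifest/item with id and href, spine/itemref with idref).

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika: orders EPUB content by the OPF spine.
public class SpineOrderSketch {

    /** Returns the manifest hrefs in the reading order listed by the spine. */
    public static List<String> spineHrefs(InputStream opf) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Ask the parser to apply its secure-processing limits.
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            doc = factory.newDocumentBuilder().parse(opf);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse OPF document", e);
        }

        // manifest: maps each item id to the href of the file it names.
        Map<String, String> hrefById = new HashMap<>();
        NodeList items = doc.getElementsByTagNameNS("*", "item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            hrefById.put(item.getAttribute("id"), item.getAttribute("href"));
        }

        // spine: itemref elements appear in reading order and point at item ids.
        List<String> ordered = new ArrayList<>();
        NodeList refs = doc.getElementsByTagNameNS("*", "itemref");
        for (int i = 0; i < refs.getLength(); i++) {
            Element ref = (Element) refs.item(i);
            String href = hrefById.get(ref.getAttribute("idref"));
            if (href != null) {          // skip idrefs with no manifest entry
                ordered.add(href);
            }
        }
        return ordered;
    }

    public static void main(String[] args) {
        // The manifest lists ch2 before ch1; the spine gives the real order.
        String opf = "<package xmlns=\"http://www.idpf.org/2007/opf\">"
                + "<manifest>"
                + "<item id=\"ch2\" href=\"ch2.xhtml\"/>"
                + "<item id=\"ch1\" href=\"ch1.xhtml\"/>"
                + "</manifest>"
                + "<spine><itemref idref=\"ch1\"/><itemref idref=\"ch2\"/></spine>"
                + "</package>";
        InputStream in = new ByteArrayInputStream(opf.getBytes(StandardCharsets.UTF_8));
        System.out.println(spineHrefs(in)); // prints [ch1.xhtml, ch2.xhtml]
    }
}
```

Once the stream has been spooled to disk, each returned href (resolved relative to the OPF file's directory) could be looked up with java.util.zip.ZipFile via getEntry and getInputStream, which is the direct access to zip entries that the description mentions.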



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)