Posted to dev@tika.apache.org by Tim Allison <ta...@apache.org> on 2018/09/19 19:13:20 UTC
1.19.1?
The mp3 regression is bad. In hindsight, the tika-eval reports were fairly
clear on this, but I did some hand-waving to explain away the
numbers... I shouldn’t have.
I want to add some new reports to tika-eval so that this never happens
again.
How long should we wait for 1.19.1 or 1.20?
Best,
Tim
On Wed, Sep 19, 2018 at 2:29 PM Hudson (JIRA) <ji...@apache.org> wrote:
>
> [ https://issues.apache.org/jira/browse/TIKA-2730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621008#comment-16621008 ]
>
> Hudson commented on TIKA-2730:
> ------------------------------
>
> SUCCESS: Integrated in Jenkins build tika-branch-1x #94 (See [https://builds.apache.org/job/tika-branch-1x/94/])
> TIKA-2730 -- allow last frame to be truncated w/o throwing an EOF
> (tallison: [https://github.com/apache/tika/commit/80cfd6d4a4270f8f3697c6dc083b3dedfc36c86a])
> * (edit) tika-parsers/src/main/java/org/apache/tika/parser/mp3/MpegStream.java
> * (edit) tika-parsers/src/test/java/org/apache/tika/parser/mp3/Mp3ParserTest.java
> * (add) tika-parsers/src/test/resources/test-documents/testMP3i18n_truncated.mp3
> * (edit) tika-parsers/src/main/java/org/apache/tika/parser/mp3/Mp3Parser.java
>
>
> > parseToString fails for a simple mp3
> > ------------------------------------
> >
> > Key: TIKA-2730
> > URL: https://issues.apache.org/jira/browse/TIKA-2730
> > Project: Tika
> > Issue Type: Bug
> > Affects Versions: 1.19
> > Reporter: Boris Petrov
> > Assignee: Tim Allison
> > Priority: Major
> > Fix For: 2.0.0, 1.20
> >
> > Attachments: demo.mp3
> >
> >
> > This is a regression from 1.18. I've attached the mp3 that fails. The
> > exception I get is:
> > {noformat}
> > org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.mp3.Mp3Parser@cefe6c6
> >     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
> >     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> >     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
> >     at org.apache.tika.Tika.parseToString(Tika.java:527)
> >     at com.company.TextExtractor.getText(TextExtractor.java:39)
> > Caused by: java.io.EOFException: EOF: tried to skip 361 but could only skip 247
> >     at org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:166)
> >     at org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:204)
> >     at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
> >     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> >     ... 5 more
> > {noformat}
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v7.6.3#76005)
>
Re: 1.19.1?
Posted by Chris Mattmann <ma...@apache.org>.
Sounds great!
> From: Tim Allison <ta...@apache.org>
> Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
> Date: Tuesday, September 25, 2018 at 9:40 AM
> To: "dev@tika.apache.org" <de...@tika.apache.org>
> Subject: Re: 1.19.1?
>
> Given the mp3 issue and some other items, let's go with 1.19.1 rc1
> today or tomorrow?
>
> On Mon, Sep 24, 2018 at 3:07 PM Nick Burch <ap...@gagravarr.org> wrote:
> > On Mon, 24 Sep 2018, Tim Allison wrote:
> > > Aside from the problem with users and non-standard XML parsers, were
> > > there any other show-stoppers in POI 4.0.0? Is there a reason to wait
> > > for POI 4.0.1?
> > I think, in terms of Tika-affecting bugs, it was the XML parser stuff, and
> > commons-compress missing from the POM.
> >
> > Nick
Re: 1.19.1?
Posted by Tim Allison <ta...@apache.org>.
Given the mp3 issue and some other items, let's go with 1.19.1 rc1
today or tomorrow?
On Mon, Sep 24, 2018 at 3:07 PM Nick Burch <ap...@gagravarr.org> wrote:
>
> On Mon, 24 Sep 2018, Tim Allison wrote:
> > Aside from the problem with users and non-standard XML parsers, were
> > there any other show-stoppers in POI 4.0.0? Is there a reason to wait
> > for POI 4.0.1?
>
> I think, in terms of Tika-affecting bugs, it was the XML parser stuff, and
> commons-compress missing from the POM.
>
> Nick
Re: 1.19.1?
Posted by Nick Burch <ap...@gagravarr.org>.
On Mon, 24 Sep 2018, Tim Allison wrote:
> Aside from the problem with users and non-standard XML parsers, were
> there any other show-stoppers in POI 4.0.0? Is there a reason to wait
> for POI 4.0.1?
I think, in terms of Tika-affecting bugs, it was the XML parser stuff, and
commons-compress missing from the POM.
Nick
Re: 1.19.1?
Posted by Tim Allison <ta...@apache.org>.
Nick,
Aside from the problem with users and non-standard XML parsers, were
there any other show-stoppers in POI 4.0.0? Is there a reason to wait
for POI 4.0.1?
On Fri, Sep 21, 2018 at 12:48 PM Chris Mattmann <ma...@apache.org> wrote:
>
> Let’s roll it….
Re: 1.19.1?
Posted by Chris Mattmann <ma...@apache.org>.
Let’s roll it….
Re: 1.19.1?
Posted by Tim Allison <ta...@apache.org>.
Yes, and I think I duplicated that bug when I copied/pasted from POI to
Tika, so that's a good reminder to fix it in Tika ASAP, as well as
potentially waiting for POI 4.0.1. Thank you!
On Wed, Sep 19, 2018 at 4:53 PM Nick Burch <ap...@gagravarr.org> wrote:
>
> On Wed, 19 Sep 2018, Tim Allison wrote:
> > The mp3 regression is bad. In hindsight, the tika-eval reports were
> > fairly clear on this, but I did some hand-waving to explain away the
> > numbers... I shouldn’t have.
> >
> > I want to add some new reports to tika-eval so that this never happens
> > again.
> >
> > How long should we wait for 1.19.1 or 1.20?
>
> There's a POI XML bug on certain older platforms (POI tries too hard to
> lock down the XML settings even if the XML parser doesn't do that...), so
> it may be worth trying to get a POI 4.0.1 out, then doing a Tika 1.19.1 or
> 1.20 (depending on how many other bugs we spot in the POI wait!)
>
> Nick
Re: 1.19.1?
Posted by Nick Burch <ap...@gagravarr.org>.
On Wed, 19 Sep 2018, Tim Allison wrote:
> The mp3 regression is bad. In hindsight, the tika-eval reports were
> fairly clear on this, but I did some hand-waving to explain away the
> numbers... I shouldn’t have.
>
> I want to add some new reports to tika-eval so that this never happens
> again.
>
> How long should we wait for 1.19.1 or 1.20?
There's a POI XML bug on certain older platforms (POI tries too hard to
lock down the XML settings even if the XML parser doesn't do that...), so
it may be worth trying to get a POI 4.0.1 out, then doing a Tika 1.19.1 or
1.20 (depending on how many other bugs we spot in the POI wait!)
Nick