Posted to dev@tika.apache.org by André Ricardo <an...@gmail.com> on 2010/08/05 17:52:44 UTC

Tika parsing corrupt mp3

Hello,

I was trying Tika on some of the mp3 samples from Nutch 0.9/1.0, and the one
described as "A corrupt MP3 file that has been truncated half way through the
ID3v2 frames" returned this:

$ java -jar tika-app-0.7.jar -v -m ~/nutch-0.9/src/plugin/parse-mp3/sample/test.mp3
Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.mp3.Mp3Parser@1bf3d87
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:138)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:99)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:169)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:62)
Caused by: java.io.IOException: Tried to read 259186 bytes, but only 65526 bytes present
    at org.apache.tika.parser.mp3.ID3v2Frame.readFully(ID3v2Frame.java:160)
    at org.apache.tika.parser.mp3.ID3v2Frame.<init>(ID3v2Frame.java:110)
    at org.apache.tika.parser.mp3.ID3v2Frame.createFrameIfPresent(ID3v2Frame.java:81)
    at org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:128)
    at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:64)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:132)
    ... 3 more

I also tried the latest trunk from GitHub, which reproduces the problem:

$ java -jar tika-app-0.8-SNAPSHOT.jar -v -m ~/nutch-0.9/src/plugin/parse-mp3/sample/test.mp3
Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.mp3.Mp3Parser@e79839
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:169)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:110)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:193)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:72)
Caused by: java.io.IOException: Tried to read 259186 bytes, but only 65526 bytes present
    at org.apache.tika.parser.mp3.ID3v2Frame.readFully(ID3v2Frame.java:160)
    at org.apache.tika.parser.mp3.ID3v2Frame.<init>(ID3v2Frame.java:110)
    at org.apache.tika.parser.mp3.ID3v2Frame.createFrameIfPresent(ID3v2Frame.java:81)
    at org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:133)
    at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:64)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:163)
    ... 3 more

The mp3 is here:
http://github.com/apache/nutch/raw/tags/release-1.0/src/plugin/parse-mp3/sample/test.mp3

Tika parsed all the other mp3 samples without errors.
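
To illustrate what the "Caused by" line seems to show: something asks for
259186 bytes of frame data, only 65526 bytes are left in the file, and the
short read is treated as fatal. Below is a minimal sketch in plain Java of a
more tolerant read. It is not Tika's actual code. The names
TruncatedFrameSketch, FrameData and readUpTo are hypothetical, and the idea of
returning the partial data with a flag instead of throwing is only an
assumption about how a parser might cope with a truncated file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class TruncatedFrameSketch {

    /** Hypothetical holder: the bytes actually read, and whether the stream ended early. */
    static final class FrameData {
        final byte[] bytes;
        final boolean truncated;

        FrameData(byte[] bytes, boolean truncated) {
            this.bytes = bytes;
            this.truncated = truncated;
        }
    }

    /**
     * Reads at most declaredLength bytes. If the stream ends first, the
     * bytes read so far are returned with truncated == true instead of an
     * IOException being thrown for the short read.
     */
    static FrameData readUpTo(InputStream in, int declaredLength) throws IOException {
        // Read in small chunks so a bogus declared length from a corrupt
        // header does not force one huge allocation up front.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int remaining = declaredLength;
        while (remaining > 0) {
            int n = in.read(chunk, 0, Math.min(chunk.length, remaining));
            if (n < 0) {
                break; // end of stream before the declared length was reached
            }
            out.write(chunk, 0, n);
            remaining -= n;
        }
        return new FrameData(out.toByteArray(), remaining > 0);
    }

    public static void main(String[] args) throws IOException {
        // Sizes taken from the exception message: 259186 requested, 65526 present.
        InputStream in = new ByteArrayInputStream(new byte[65526]);
        FrameData data = readUpTo(in, 259186);
        System.out.println("read " + data.bytes.length + " bytes, truncated=" + data.truncated);
        // prints: read 65526 bytes, truncated=true
    }
}
```

With something along these lines, a caller could keep whatever tags were
already read and skip the incomplete frame, though whether that is the right
behaviour for Tika is for the developers to decide.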

Should I open an issue in Jira? And if so, would you consider this a bug or
an improvement?

André Ricardo

Re: Tika parsing corrupt mp3

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
You did great, thanks André!

Cheers,
Chris


On 8/6/10 8:13 AM, "André Ricardo" <an...@gmail.com> wrote:

Hello Chris,

I just opened the issue. I hope I did everything OK, since it is the first
time I'm opening an issue in JIRA.

Thank you for your answer,
André Ricardo


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: Tika parsing corrupt mp3

Posted by André Ricardo <an...@gmail.com>.
Hello Chris,

I just opened the issue. I hope I did everything OK, since it is the first
time I'm opening an issue in JIRA.

Thank you for your answer,
André Ricardo


On Thu, Aug 5, 2010 at 10:19 PM, Mattmann, Chris A (388J) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> Hi André,
>
> Yes, please, file an issue in JIRA and point at the mp3 file and the test
> case that failed. Thanks so much!
>
> Cheers,
> Chris

Re: Tika parsing corrupt mp3

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi André,

Yes, please, file an issue in JIRA and point at the mp3 file and the test case that failed. Thanks so much!

Cheers,
Chris


