Posted to dev@tika.apache.org by André Ricardo <an...@gmail.com> on 2010/08/05 17:52:44 UTC
Tika parsing corrupt mp3
Hello,
I was trying some of the MP3 samples from Nutch 0.9/1.0 in Tika, and the one
described as "A corrupt MP3 file that has been truncated half way through the
ID3v2 frames" returned this:
$ java -jar tika-app-0.7.jar -v -m ~/nutch-0.9/src/plugin/parse-mp3/sample/test.mp3
Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.mp3.Mp3Parser@1bf3d87
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:138)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:99)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:169)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:62)
Caused by: java.io.IOException: Tried to read 259186 bytes, but only 65526 bytes present
    at org.apache.tika.parser.mp3.ID3v2Frame.readFully(ID3v2Frame.java:160)
    at org.apache.tika.parser.mp3.ID3v2Frame.<init>(ID3v2Frame.java:110)
    at org.apache.tika.parser.mp3.ID3v2Frame.createFrameIfPresent(ID3v2Frame.java:81)
    at org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:128)
    at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:64)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:132)
    ... 3 more
I also tried the latest trunk from GitHub, which reproduces the problem:
$ java -jar tika-app-0.8-SNAPSHOT.jar -v -m ~/nutch-0.9/src/plugin/parse-mp3/sample/test.mp3
Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.mp3.Mp3Parser@e79839
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:169)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:110)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:193)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:72)
Caused by: java.io.IOException: Tried to read 259186 bytes, but only 65526 bytes present
    at org.apache.tika.parser.mp3.ID3v2Frame.readFully(ID3v2Frame.java:160)
    at org.apache.tika.parser.mp3.ID3v2Frame.<init>(ID3v2Frame.java:110)
    at org.apache.tika.parser.mp3.ID3v2Frame.createFrameIfPresent(ID3v2Frame.java:81)
    at org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:133)
    at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:64)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:163)
    ... 3 more
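For context on the "Tried to read 259186 bytes, but only 65526 bytes present" message in both traces: the frame declares a body larger than what is left in the file. Below is a minimal Java sketch of one way a reader could detect that case and stop quietly instead of throwing. It is not Tika's actual ID3v2Frame code. The names TruncatedFrameSketch, readAvailable and readFrameBodyOrNull are hypothetical, and the in-memory buffer only stands in for the truncated file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class TruncatedFrameSketch {

    /** Reads up to length bytes; returns a shorter array if the stream ends early. */
    static byte[] readAvailable(InputStream in, int length) throws IOException {
        byte[] buffer = new byte[length];
        int total = 0;
        while (total < length) {
            int n = in.read(buffer, total, length - total);
            if (n == -1) {
                break; // end of stream before the declared size was reached
            }
            total += n;
        }
        return total == length ? buffer : Arrays.copyOf(buffer, total);
    }

    /** Returns the frame body, or null when the stream is truncated (assumed policy). */
    static byte[] readFrameBodyOrNull(InputStream in, int declaredSize) throws IOException {
        byte[] body = readAvailable(in, declaredSize);
        return body.length == declaredSize ? body : null;
    }

    public static void main(String[] args) throws IOException {
        // 65526 bytes available, as in the exception message above.
        byte[] data = new byte[65526];

        byte[] complete = readFrameBodyOrNull(new ByteArrayInputStream(data), 1024);
        byte[] truncated = readFrameBodyOrNull(new ByteArrayInputStream(data), 259186);

        System.out.println("complete frame read: " + (complete != null));
        System.out.println("truncated frame read: " + (truncated != null));
    }
}
```

Whether a truncated frame should be skipped silently, logged, or reported as a parse error is a design choice for the parser; the sketch only shows the detection step.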
The mp3 is here:
http://github.com/apache/nutch/raw/tags/release-1.0/src/plugin/parse-mp3/sample/test.mp3
All the other mp3 samples were parsed well by Tika.
Should I open an issue in Jira? And if so, would you consider this a bug or
an improvement?
André Ricardo
Re: Tika parsing corrupt mp3
Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
You did great, thanks André!
Cheers,
Chris
On 8/6/10 8:13 AM, "André Ricardo" <an...@gmail.com> wrote:
Hello Chris,
Just opened the issue, I hope I did everything ok since it is the first time
I'm opening an issue in JIRA.
Thank you for your answer,
André Ricardo
On Thu, Aug 5, 2010 at 10:19 PM, Mattmann, Chris A (388J) <
chris.a.mattmann@jpl.nasa.gov> wrote:
> Hi André,
>
> Yes, please, file an issue in JIRA and point at the mp3 file and the test
> case that failed. Thanks so much!
>
> Cheers,
> Chris
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Re: Tika parsing corrupt mp3
Posted by André Ricardo <an...@gmail.com>.
Hello Chris,
I just opened the issue. I hope I did everything OK, since it is the first
time I have opened an issue in JIRA.
Thank you for your answer,
André Ricardo
On Thu, Aug 5, 2010 at 10:19 PM, Mattmann, Chris A (388J) <
chris.a.mattmann@jpl.nasa.gov> wrote:
> Hi André,
>
> Yes, please, file an issue in JIRA and point at the mp3 file and the test
> case that failed. Thanks so much!
>
> Cheers,
> Chris
Re: Tika parsing corrupt mp3
Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi André,
Yes, please, file an issue in JIRA and point at the mp3 file and the test case that failed. Thanks so much!
Cheers,
Chris