Posted to dev@tika.apache.org by "André Ricardo (JIRA)" <ji...@apache.org> on 2010/08/06 16:43:16 UTC
[jira] Created: (TIKA-474) Tika parsing corrupt mp3
Tika parsing corrupt mp3
------------------------
Key: TIKA-474
URL: https://issues.apache.org/jira/browse/TIKA-474
Project: Tika
Issue Type: Improvement
Components: cli
Affects Versions: 0.7
Environment: Linux Mandriva 2010 based OS (Linux Caixa Mágica 15)
Reporter: André Ricardo
I was trying some MP3s from the Nutch 0.9/1.0 samples with the tika-app CLI. The sample described as "A corrupt MP3 file that has been truncated half way through the ID3v2 frames" returned this:
$ java -jar tika-app-0.7.jar -v -m ~/nutch-0.9/src/plugin/parse-mp3/sample/test.mp3
Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.mp3.Mp3Parser@1bf3d87
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:138)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:99)
at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:169)
at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:62)
Caused by: java.io.IOException: Tried to read 259186 bytes, but only 65526 bytes present
at org.apache.tika.parser.mp3.ID3v2Frame.readFully(ID3v2Frame.java:160)
at org.apache.tika.parser.mp3.ID3v2Frame.<init>(ID3v2Frame.java:110)
at org.apache.tika.parser.mp3.ID3v2Frame.createFrameIfPresent(ID3v2Frame.java:81)
at org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:128)
at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:64)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:132)
... 3 more
I also tried the latest trunk from GitHub, which reproduces the problem:
$ java -jar tika-app-0.8-SNAPSHOT.jar -v -m ~/nutch-0.9/src/plugin/parse-mp3/sample/test.mp3
Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.mp3.Mp3Parser@e79839
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:169)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:110)
at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:193)
at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:72)
Caused by: java.io.IOException: Tried to read 259186 bytes, but only 65526 bytes present
at org.apache.tika.parser.mp3.ID3v2Frame.readFully(ID3v2Frame.java:160)
at org.apache.tika.parser.mp3.ID3v2Frame.<init>(ID3v2Frame.java:110)
at org.apache.tika.parser.mp3.ID3v2Frame.createFrameIfPresent(ID3v2Frame.java:81)
at org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:133)
at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:64)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:163)
... 3 more
The mp3 is here: http://github.com/apache/nutch/raw/tags/release-1.0/src/plugin/parse-mp3/sample/test.mp3
Tika parsed all the other MP3 samples correctly.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (TIKA-474) Tika parsing corrupt mp3
Posted by "André Ricardo (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
André Ricardo updated TIKA-474:
-------------------------------
Attachment: test.mp3
This is "A corrupt MP3 file that has been truncated half way through the ID3v2 frames" from the Nutch 0.9/1.0 sample MP3 files.
[jira] Commented: (TIKA-474) Tika parsing corrupt mp3
Posted by "André Ricardo (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896583#action_12896583 ]
André Ricardo commented on TIKA-474:
------------------------------------
Thank you Nick!
I just looked at the diff to learn how you fixed it, and it was nice to see that you also wrote a test case!
Built the latest version from svn and it works now:
$ java -jar tika-app-0.8-SNAPSHOT.jar -v -m ~/nutch-0.9/src/plugin/parse-mp3/sample/test.mp3
Author: The White Stripes
Content-Length: 65536
Content-Type: audio/mpeg
resourceName: test.mp3
title: Girl you have no faith in medicine
xmpDM:album: Elephant
xmpDM:artist: The White Stripes
xmpDM:composer: null
xmpDM:genre: null
xmpDM:logComment: eng
xmpDM:releaseDate: 2003
xmpDM:trackNumber: 13
[jira] Updated: (TIKA-474) Tika parsing corrupt mp3
Posted by "André Ricardo (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
André Ricardo updated TIKA-474:
-------------------------------
Environment:
Linux Mandriva 2010 based OS (Linux Caixa Mágica 15)
java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
was:Linux Mandriva 2010 based OS (Linux Caixa Mágica 15)
[jira] Resolved: (TIKA-474) Tika parsing corrupt mp3
Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nick Burch resolved TIKA-474.
-----------------------------
Assignee: Nick Burch
Fix Version/s: 0.8
Resolution: Fixed
Fixed in r983661.
The ID3v2 header parsing now does its best when not all of the data for the header is present.
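For readers who want to see what a "do its best" read can look like, here is a minimal sketch. It is not the code from r983661; the class name TolerantRead and the helper readUpTo are hypothetical and exist only for illustration. The idea is to read as many bytes as the stream actually holds and return a shorter array, instead of throwing when the declared size exceeds what is present. The sizes in main are the two numbers from the exception message above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class TolerantRead {

    /**
     * Hypothetical helper (not part of Tika): reads up to {@code length}
     * bytes and returns only the bytes that were actually available,
     * rather than failing when the stream ends early.
     */
    static byte[] readUpTo(InputStream in, int length) throws IOException {
        byte[] buffer = new byte[length];
        int total = 0;
        while (total < length) {
            int n = in.read(buffer, total, length - total);
            if (n == -1) {
                break; // stream ended before the declared size was reached
            }
            total += n;
        }
        // Trim the buffer when the data was truncated.
        return total == length ? buffer : Arrays.copyOf(buffer, total);
    }

    public static void main(String[] args) throws IOException {
        // Simulate the report: 259186 bytes declared, 65526 bytes present.
        InputStream truncated = new ByteArrayInputStream(new byte[65526]);
        byte[] data = readUpTo(truncated, 259186);
        System.out.println("Declared 259186 bytes, got " + data.length);
    }
}
```

A caller would then parse whatever frames fit in the returned bytes. Note that this sketch allocates the full declared size up front, so real code would probably also want to cap that size before trusting a value read from a corrupt file.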