Posted to dev@tika.apache.org by "William Seemann (Created) (JIRA)" <ji...@apache.org> on 2011/11/27 09:46:39 UTC
[jira] [Created] (TIKA-793) Invalid ASCII character (65533) when
retrieving MP3 metadata
Invalid ASCII character (65533) when retrieving MP3 metadata
------------------------------------------------------------
Key: TIKA-793
URL: https://issues.apache.org/jira/browse/TIKA-793
Project: Tika
Issue Type: Bug
Components: metadata, parser
Affects Versions: 1.0
Environment: Ubuntu 10.04 (x64), Android (2.2 +)
Reporter: William Seemann
Priority: Minor
When extracting metadata from certain MP3s (the ID3 version appears to be 2.4), I'm seeing invalid characters at the end of the parsed fields. For example:
American M�
which should be:
American Me
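For reference, 65533 is the decimal value of U+FFFD, the Unicode replacement character that Java's String constructors substitute for malformed input. The sketch below is only an illustration and not Tika code: it assumes the field is stored as UTF-16LE and shows one way the reported symptom can arise, namely decoding a byte array whose final code unit has lost a byte. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ReplacementCharDemo {
    // Decodes UTF-16LE bytes. new String(byte[], Charset) replaces
    // malformed input (such as a dangling odd byte) with U+FFFD.
    static String decodeUtf16Le(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        byte[] full = "American Me".getBytes(StandardCharsets.UTF_16LE);
        // Drop the last byte so that only half of the final 'e' remains.
        byte[] truncated = Arrays.copyOf(full, full.length - 1);
        String decoded = decodeUtf16Le(truncated);
        System.out.println(decoded);
        // Numeric value of the last character in the decoded string.
        System.out.println((int) decoded.charAt(decoded.length() - 1));
    }
}
```

If the last character of a parsed field has the value 65533, the bytes handed to the decoder were probably cut at the wrong boundary, rather than the tag itself containing a bad character.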
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (TIKA-793) Invalid ASCII character (65533) when
retrieving MP3 metadata
Posted by "William Seemann (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176864#comment-13176864 ]
William Seemann commented on TIKA-793:
--------------------------------------
Nick, thanks for the prompt fix. Keep up the good work.
[jira] [Updated] (TIKA-793) Invalid ASCII character (65533) when
retrieving MP3 metadata
Posted by "William Seemann (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
William Seemann updated TIKA-793:
---------------------------------
Attachment: TikaTest.java
The code I'm using to test
[jira] [Commented] (TIKA-793) Invalid ASCII character (65533) when
retrieving MP3 metadata
Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176070#comment-13176070 ]
Nick Burch commented on TIKA-793:
---------------------------------
I've tracked this to two bugs. Both relate to the handling of UTF-16 encoded strings.
I've fixed the first in r1224865; it was a problem in the null termination stripping.
The second is the handling of the COMM (Comment) tag, which contains both a language and text. We don't currently support the language being encoded differently from the text. That remains to be fixed (and really needs a test file too).
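As a rough illustration of encoding-aware null termination stripping, here is a minimal sketch. It is not the code from r1224865, and the class and method names are hypothetical. It assumes the ID3v2 text-encoding byte values (0 = ISO-8859-1, 1 = UTF-16 with BOM, 2 = UTF-16BE, 3 = UTF-8) and removes terminators in steps of two bytes for the UTF-16 variants, so that the zero high byte of the last real character is never stripped.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Id3TextSketch {
    // Maps the ID3v2 text-encoding byte to a Java charset.
    static Charset charsetFor(int encodingByte) {
        switch (encodingByte) {
            case 0: return StandardCharsets.ISO_8859_1;
            case 1: return StandardCharsets.UTF_16;   // byte order taken from the BOM
            case 2: return StandardCharsets.UTF_16BE; // ID3v2.4 only
            case 3: return StandardCharsets.UTF_8;    // ID3v2.4 only
            default:
                throw new IllegalArgumentException("Unknown encoding byte: " + encodingByte);
        }
    }

    // Decodes a text field after dropping trailing null terminators.
    static String decode(byte[] data, int encodingByte) {
        Charset charset = charsetFor(encodingByte);
        boolean twoByte = (encodingByte == 1 || encodingByte == 2);
        int end = data.length;
        if (twoByte) {
            // Strip only whole, aligned 0x00 0x00 pairs.
            while (end >= 2 && end % 2 == 0 && data[end - 1] == 0 && data[end - 2] == 0) {
                end -= 2;
            }
        } else {
            // Single-byte terminator for ISO-8859-1 and UTF-8.
            while (end >= 1 && data[end - 1] == 0) {
                end--;
            }
        }
        return new String(data, 0, end, charset);
    }
}
```

Stripping one zero byte at a time from little-endian UTF-16 data would also remove the high byte of a trailing Latin character, leaving the decoder with an odd byte that it replaces with U+FFFD.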
[jira] [Resolved] (TIKA-793) Invalid ASCII character (65533) when
retrieving MP3 metadata
Posted by "Nick Burch (Resolved) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nick Burch resolved TIKA-793.
-----------------------------
Resolution: Fixed
Fix Version/s: 1.1
[jira] [Commented] (TIKA-793) Invalid ASCII character (65533) when
retrieving MP3 metadata
Posted by "William Seemann (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157706#comment-13157706 ]
William Seemann commented on TIKA-793:
--------------------------------------
Also, it's worth noting that I see this issue in almost all of the MP3s I've downloaded from Amazon.com.
[jira] [Commented] (TIKA-793) Invalid ASCII character (65533) when
retrieving MP3 metadata
Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160810#comment-13160810 ]
Nick Burch commented on TIKA-793:
---------------------------------
I've managed to reproduce this on one of my Amazon MP3s. I'll use that to test a fix when I have a chance.
[jira] [Commented] (TIKA-793) Invalid ASCII character (65533) when
retrieving MP3 metadata
Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177081#comment-13177081 ]
Nick Burch commented on TIKA-793:
---------------------------------
Comment (COM/COMM) tag handling fixed in r1225480. It uses a different form from the other text tags, so it needs explicit, encoding-aware handling of its different parts.
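To make the "different parts" concrete, here is a minimal sketch of reading an ID3v2.3/2.4 COMM frame body. It is not the code from r1225480, and the class, fields, and helper are hypothetical. It assumes the layout from the ID3v2 frame specification: one text-encoding byte, a three-byte language code that is always single-byte characters, a short description terminated according to the encoding, then the comment text. Stripping of trailing terminators from the text is left out to keep the sketch short.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CommFrameSketch {
    final String language;
    final String description;
    final String text;

    private CommFrameSketch(String language, String description, String text) {
        this.language = language;
        this.description = description;
        this.text = text;
    }

    static CommFrameSketch parse(byte[] body) {
        if (body.length < 4) {
            throw new IllegalArgumentException("COMM body too short");
        }
        int encoding = body[0] & 0xFF;
        Charset charset;
        int width; // size of one null terminator in bytes
        switch (encoding) {
            case 0: charset = StandardCharsets.ISO_8859_1; width = 1; break;
            case 1: charset = StandardCharsets.UTF_16; width = 2; break;
            case 2: charset = StandardCharsets.UTF_16BE; width = 2; break;
            case 3: charset = StandardCharsets.UTF_8; width = 1; break;
            default:
                throw new IllegalArgumentException("Unknown encoding byte: " + encoding);
        }
        // The language code ignores the encoding byte: always 3 single-byte chars.
        String language = new String(body, 1, 3, StandardCharsets.ISO_8859_1);
        // The description runs up to the first aligned terminator.
        int descEnd = findTerminator(body, 4, width);
        String description = new String(body, 4, descEnd - 4, charset);
        // The comment text is whatever follows the terminator.
        int textStart = Math.min(descEnd + width, body.length);
        String text = new String(body, textStart, body.length - textStart, charset);
        return new CommFrameSketch(language, description, text);
    }

    // Returns the index of the first all-zero unit of the given width,
    // scanning in steps of that width, or the array length if none exists.
    static int findTerminator(byte[] data, int from, int width) {
        for (int i = from; i + width <= data.length; i += width) {
            boolean allZero = true;
            for (int j = 0; j < width; j++) {
                if (data[i + j] != 0) {
                    allZero = false;
                }
            }
            if (allZero) {
                return i;
            }
        }
        return data.length;
    }
}
```

Decoding the language with the frame's UTF-16 charset, or searching for a one-byte terminator inside UTF-16 data, would split the fields at the wrong offsets, which is why each part needs its own handling.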