Posted to dev@tika.apache.org by "William Seemann (Created) (JIRA)" <ji...@apache.org> on 2011/11/27 09:46:39 UTC

[jira] [Created] (TIKA-793) Invalid ASCII character (65533) when retriving MP3 metadata

Invalid ASCII character (65533) when retriving MP3 metadata
-----------------------------------------------------------

                 Key: TIKA-793
                 URL: https://issues.apache.org/jira/browse/TIKA-793
             Project: Tika
          Issue Type: Bug
          Components: metadata, parser
    Affects Versions: 1.0
         Environment: Ubuntu 10.04 (x64), Android (2.2 +)
            Reporter: William Seemann
            Priority: Minor


When extracting metadata from certain MP3s (the ID3 version appears to be 2.4), I'm seeing invalid characters at the end of the parsed fields. For example:

American M�

which should be:

American Me

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (TIKA-793) Invalid ASCII character (65533) when retriving MP3 metadata

Posted by "William Seemann (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176864#comment-13176864 ] 

William Seemann commented on TIKA-793:
--------------------------------------

Nick, thanks for the prompt fix. Keep up the good work.
                


       

[jira] [Updated] (TIKA-793) Invalid ASCII character (65533) when retriving MP3 metadata

Posted by "William Seemann (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

William Seemann updated TIKA-793:
---------------------------------

    Attachment: TikaTest.java

The code I'm using to test this.
                


       

[jira] [Commented] (TIKA-793) Invalid ASCII character (65533) when retriving MP3 metadata

Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176070#comment-13176070 ] 

Nick Burch commented on TIKA-793:
---------------------------------

I've tracked this down to two bugs, both related to the handling of UTF-16 encoded strings.

The first, a problem in the null-termination stripping, is fixed in r1224865.

The second is the handling of the COMM (Comment) tag, which contains both a language and text. We don't currently support the language being encoded differently to the text; that remains to be fixed (and really needs a test file too).
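
As an illustration of how a null-termination bug of this kind can produce the reported output, here is a minimal sketch. It is not the code changed in r1224865; the class and method names are hypothetical, and it assumes the field is little-endian UTF-16 with a two-byte terminator. Stripping trailing zero bytes one at a time also removes the high byte of the final "e", so the decoder is left with a dangling byte and substitutes U+FFFD (65533).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration; these names do not come from Tika.
class NullStripSketch {

    // Problematic approach: drops every trailing zero byte, one byte at a time.
    static byte[] stripBytewise(byte[] data) {
        int end = data.length;
        while (end > 0 && data[end - 1] == 0) {
            end--;
        }
        return Arrays.copyOf(data, end);
    }

    // UTF-16 aware approach: drops only whole 0x00 0x00 code units.
    // Assumes the input has an even length.
    static byte[] stripCodeUnits(byte[] data) {
        int end = data.length;
        while (end >= 2 && data[end - 1] == 0 && data[end - 2] == 0) {
            end -= 2;
        }
        return Arrays.copyOf(data, end);
    }

    static String decode(byte[] data) {
        return new String(data, StandardCharsets.UTF_16LE);
    }

    // "American Me" as UTF-16LE followed by a two-byte null terminator.
    static byte[] sample() {
        byte[] text = "American Me".getBytes(StandardCharsets.UTF_16LE);
        return Arrays.copyOf(text, text.length + 2); // copyOf pads with zeros
    }

    public static void main(String[] args) {
        String bad = decode(stripBytewise(sample()));
        String good = decode(stripCodeUnits(sample()));
        // Print the code point of the last character of the bytewise result.
        System.out.println((int) bad.charAt(bad.length() - 1));
        System.out.println(good);
    }
}
```

The same bytewise stripping is harmless for big-endian text ending in "e" (0x00 0x65), which may be why only some files show the problem.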
                


       

[jira] [Resolved] (TIKA-793) Invalid ASCII character (65533) when retriving MP3 metadata

Posted by "Nick Burch (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch resolved TIKA-793.
-----------------------------

       Resolution: Fixed
    Fix Version/s: 1.1
    


       

[jira] [Commented] (TIKA-793) Invalid ASCII character (65533) when retriving MP3 metadata

Posted by "William Seemann (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157706#comment-13157706 ] 

William Seemann commented on TIKA-793:
--------------------------------------

Also, it's worth noting that I see this issue in almost all of the MP3s I've downloaded from Amazon.com.
                


       

[jira] [Commented] (TIKA-793) Invalid ASCII character (65533) when retriving MP3 metadata

Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160810#comment-13160810 ] 

Nick Burch commented on TIKA-793:
---------------------------------

I've managed to reproduce this on one of my Amazon MP3s, and will use that to test a fix when I have a chance.
                


       

[jira] [Commented] (TIKA-793) Invalid ASCII character (65533) when retriving MP3 metadata

Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177081#comment-13177081 ] 

Nick Burch commented on TIKA-793:
---------------------------------

Comment (COM/COMM) tag handling is fixed in r1225480. The tag uses a different form to the other text tags, so it needs explicit, encoding-aware handling of its different parts.
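
For readers unfamiliar with the frame, here is a rough sketch of what encoding-aware handling of a comment frame body can look like. It is not the code from r1225480. It assumes the ID3v2 comment layout (one text-encoding byte, a three-byte language code that is always single-byte text, a null-terminated short description, then the comment text) and a well-formed body; `CommFrameSketch` and its members are hypothetical names. Encoding values 2 and 3 exist only in ID3v2.4.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; these names do not come from Tika.
class CommFrameSketch {

    // Simple holder for the three parts of a comment frame.
    static final class Comment {
        final String language;
        final String description;
        final String text;

        Comment(String language, String description, String text) {
            this.language = language;
            this.description = description;
            this.text = text;
        }
    }

    static Comment parse(byte[] body) {
        Charset charset;
        int width; // terminator size in bytes for the chosen encoding
        switch (body[0] & 0xFF) {
            case 1:  charset = StandardCharsets.UTF_16;     width = 2; break; // with BOM
            case 2:  charset = StandardCharsets.UTF_16BE;   width = 2; break;
            case 3:  charset = StandardCharsets.UTF_8;      width = 1; break;
            default: charset = StandardCharsets.ISO_8859_1; width = 1; break;
        }

        // The language code is three single-byte characters, whatever
        // encoding the description and text use.
        String language = new String(body, 1, 3, StandardCharsets.ISO_8859_1);

        // Scan for the description terminator in steps of one code unit.
        int descStart = 4;
        int descEnd = descStart;
        while (descEnd + width <= body.length && !isTerminator(body, descEnd, width)) {
            descEnd += width;
        }
        String description = new String(body, descStart, descEnd - descStart, charset);

        // The text runs to the end of the frame; drop trailing terminators
        // one whole code unit at a time.
        int textStart = Math.min(descEnd + width, body.length);
        int textEnd = body.length;
        while (textEnd - textStart >= width && isTerminator(body, textEnd - width, width)) {
            textEnd -= width;
        }
        String text = new String(body, textStart, textEnd - textStart, charset);

        return new Comment(language, description, text);
    }

    private static boolean isTerminator(byte[] data, int pos, int width) {
        for (int i = 0; i < width; i++) {
            if (data[pos + i] != 0) {
                return false;
            }
        }
        return true;
    }
}
```

The point of the sketch is that the language bytes are decoded separately from the description and text, and that terminators are matched at the width of the declared encoding rather than as single zero bytes.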
                
