Posted to dev@tika.apache.org by "Antoni Mylka (Created) (JIRA)" <ji...@apache.org> on 2011/12/13 19:12:30 UTC

[jira] [Created] (TIKA-813) Webarchive detection.

Webarchive detection.
---------------------

                 Key: TIKA-813
                 URL: https://issues.apache.org/jira/browse/TIKA-813
             Project: Tika
          Issue Type: Improvement
          Components: mime
    Affects Versions: 1.1
            Reporter: Antoni Mylka
         Attachments: tika-webarchive-detection.patch

I'd like to be able to detect .webarchive files. They are a special case of the Apple Binary Property list format. They are generated by the Safari browser and contain all the files that comprise a web page within a single container file.

Can anyone supply an example file? All the ones I have are confidential and I don't have a Mac myself.
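
As an illustration of the description above, here is a minimal sketch of what detecting such a file could look like. It relies on the fact that binary property lists start with the ASCII bytes "bplist" followed by a version (commonly "00"). The class and method names are hypothetical and are not part of Tika; combining the magic with the .webarchive extension is an assumption made for the sketch, not necessarily what the attached patch does.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper for illustration only; not a Tika class.
final class BplistSniffer {
    // Binary property lists begin with the ASCII bytes "bplist",
    // followed by a two-character version such as "00".
    private static final byte[] MAGIC = "bplist".getBytes(StandardCharsets.US_ASCII);

    // Returns true if the first bytes of the file carry the bplist magic.
    static boolean looksLikeBinaryPlist(byte[] head) {
        if (head == null || head.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (head[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // Assumption: a webarchive is treated here as a bplist whose
    // file name ends in ".webarchive" (case-insensitive).
    static boolean looksLikeWebarchive(String fileName, byte[] head) {
        return fileName != null
                && fileName.toLowerCase(Locale.ROOT).endsWith(".webarchive")
                && looksLikeBinaryPlist(head);
    }
}
```

Only the first few bytes of the file are needed, so a caller can read a small header buffer instead of the whole archive.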

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (TIKA-813) Webarchive detection.

Posted by "Antoni Mylka (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoni Mylka updated TIKA-813:
------------------------------

    Attachment:     (was: tika-webarchive-detection.patch)
    
> Webarchive detection.
> ---------------------
>
>                 Key: TIKA-813
>                 URL: https://issues.apache.org/jira/browse/TIKA-813
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime
>    Affects Versions: 1.1
>            Reporter: Antoni Mylka
>         Attachments: Apache_Tika.webarchive, testWEBARCHIVE.webarchive, tika-813.patch
>
>
> I'd like to be able to detect .webarchive files. They are a special case of the Apple Binary Property list format. They are generated by the Safari browser and contain all the files that comprise a web page within a single container file.
> Can anyone supply an example file? All the ones I have are confidential and I don't have a Mac myself.


        

[jira] [Updated] (TIKA-813) Webarchive detection.

Posted by "Antoni Mylka (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoni Mylka updated TIKA-813:
------------------------------

    Attachment: tika-webarchive-detection.patch

A patch which adds the appropriate rules to tika-mimetypes.xml.
                
> Webarchive detection.
> ---------------------
>
>                 Key: TIKA-813
>                 URL: https://issues.apache.org/jira/browse/TIKA-813
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime
>    Affects Versions: 1.1
>            Reporter: Antoni Mylka
>         Attachments: tika-webarchive-detection.patch
>
>
> I'd like to be able to detect .webarchive files. They are a special case of the Apple Binary Property list format. They are generated by the Safari browser and contain all the files that comprise a web page within a single container file.
> Can anyone supply an example file? All the ones I have are confidential and I don't have a Mac myself.


        

[jira] [Updated] (TIKA-813) Webarchive detection.

Posted by "Antoni Mylka (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoni Mylka updated TIKA-813:
------------------------------

    Attachment: testWEBARCHIVE.webarchive
                tika-813.patch

A second version of the patch which includes a unit test based on the file kindly provided by Andrzej. It turns out that the bplist magic had to be given higher priority to trump the (X)HTML magics, which occur later on in the file (it's a saved webpage after all).
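
To illustrate the priority issue described above, here is a minimal, self-contained sketch of priority-based magic matching. It is not Tika's implementation: the class names, the type names "application/x-bplist" and "text/html" as used here, the search windows, the priority values 50 and 60, and the tie-breaking rule (the earlier rule wins) are all assumptions chosen for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical demo of priority-based magic matching; not Tika code.
final class MagicPriorityDemo {

    // One magic rule: an ASCII pattern that may start anywhere in
    // [0, maxOffset], plus a priority used to rank competing matches.
    static final class Rule {
        final String type;
        final byte[] pattern;
        final int maxOffset;
        final int priority;

        Rule(String type, String pattern, int maxOffset, int priority) {
            this.type = type;
            this.pattern = pattern.getBytes(StandardCharsets.US_ASCII);
            this.maxOffset = maxOffset;
            this.priority = priority;
        }

        // True if the pattern occurs at some start offset within the window.
        boolean matches(byte[] data) {
            int last = Math.min(maxOffset, data.length - pattern.length);
            for (int start = 0; start <= last; start++) {
                int i = 0;
                while (i < pattern.length && data[start + i] == pattern[i]) {
                    i++;
                }
                if (i == pattern.length) {
                    return true;
                }
            }
            return false;
        }
    }

    // Returns the type of the matching rule with the highest priority.
    // Assumption: on equal priority, the rule listed first wins.
    static String detect(byte[] data, List<Rule> rules) {
        Rule best = null;
        for (Rule rule : rules) {
            if (rule.matches(data) && (best == null || rule.priority > best.priority)) {
                best = rule;
            }
        }
        return best == null ? "application/octet-stream" : best.type;
    }
}
```

With a buffer that starts with "bplist00" and contains "<html" a little further in, both rules match. If the two rules share a priority, the HTML rule can win in this sketch; raising the bplist rule's priority makes the container type win instead, which is the effect the second patch is after.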


                
> Webarchive detection.
> ---------------------
>
>                 Key: TIKA-813
>                 URL: https://issues.apache.org/jira/browse/TIKA-813
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime
>    Affects Versions: 1.1
>            Reporter: Antoni Mylka
>         Attachments: Apache_Tika.webarchive, testWEBARCHIVE.webarchive, tika-813.patch
>
>
> I'd like to be able to detect .webarchive files. They are a special case of the Apple Binary Property list format. They are generated by the Safari browser and contain all the files that comprise a web page within a single container file.
> Can anyone supply an example file? All the ones I have are confidential and I don't have a Mac myself.


        

[jira] [Updated] (TIKA-813) Webarchive detection.

Posted by "Andrzej Bialecki (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  updated TIKA-813:
-----------------------------------

    Attachment: Apache_Tika.webarchive

This file looks strangely appropriate...
                
> Webarchive detection.
> ---------------------
>
>                 Key: TIKA-813
>                 URL: https://issues.apache.org/jira/browse/TIKA-813
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime
>    Affects Versions: 1.1
>            Reporter: Antoni Mylka
>         Attachments: Apache_Tika.webarchive, tika-webarchive-detection.patch
>
>
> I'd like to be able to detect .webarchive files. They are a special case of the Apple Binary Property list format. They are generated by the Safari browser and contain all the files that comprise a web page within a single container file.
> Can anyone supply an example file? All the ones I have are confidential and I don't have a Mac myself.


        

[jira] [Closed] (TIKA-813) Webarchive detection.

Posted by "Antoni Mylka (Closed) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoni Mylka closed TIKA-813.
-----------------------------

       Resolution: Fixed
    Fix Version/s: 1.1

Committed the magics and the unit tests in r1220696. Thanks for the example file!
                
> Webarchive detection.
> ---------------------
>
>                 Key: TIKA-813
>                 URL: https://issues.apache.org/jira/browse/TIKA-813
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime
>    Affects Versions: 1.1
>            Reporter: Antoni Mylka
>             Fix For: 1.1
>
>         Attachments: Apache_Tika.webarchive, testWEBARCHIVE.webarchive, tika-813.patch
>
>
> I'd like to be able to detect .webarchive files. They are a special case of the Apple Binary Property list format. They are generated by the Safari browser and contain all the files that comprise a web page within a single container file.
> Can anyone supply an example file? All the ones I have are confidential and I don't have a Mac myself.
