Posted to dev@tika.apache.org by "Antoni Mylka (Created) (JIRA)" <ji...@apache.org> on 2011/12/21 00:07:31 UTC

[jira] [Created] (TIKA-823) Detect StarOffice files

Detect StarOffice files
-----------------------

                 Key: TIKA-823
                 URL: https://issues.apache.org/jira/browse/TIKA-823
             Project: Tika
          Issue Type: Improvement
    Affects Versions: 1.1
            Reporter: Antoni Mylka


I would like both MimeTypes and the POIFSContainerDetector to be able to detect files created with StarOffice Draw, Impress, Writer and Calc.

I started working on this, but stumbled upon a POI issue, which I posted to poi-user. 

http://thread.gmane.org/gmane.comp.jakarta.poi.user/17857

Nick? Yegor? I know you're on the Tika list as well. Could you take a look? How can I get the raw content of the CompObj entry?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (TIKA-823) Detect StarOffice files

Posted by "Alex Ott (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173940#comment-13173940 ] 

Alex Ott commented on TIKA-823:
-------------------------------

For .sdw and .sdc you can just look at the names of the streams in the root directory: they should be /StarWriterDocument and /StarCalcDocument. For .sda and .sdd it's more complicated: they both have /StarDrawDocument3 entries, so you'll need to parse CompObj as you suggested.
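
The name check described above could be sketched roughly as follows. This is only an illustration, not the code that was committed: the class name, the method name, the DRAW_OR_IMPRESS marker and the MIME type strings are all assumptions, and the entry names are written without the leading slash on the assumption that the container API reports them that way.

```java
import java.util.Set;

/** Sketch: guess a StarOffice type from the top-level entry names of an OLE2 container. */
final class StarOfficeNameCheck {

    // MIME type strings are assumptions for illustration; verify them against your type registry.
    static final String SDW = "application/vnd.stardivision.writer";
    static final String SDC = "application/vnd.stardivision.calc";

    /** Hypothetical marker: the names alone cannot separate Draw (.sda) from Impress (.sdd). */
    static final String DRAW_OR_IMPRESS = "draw-or-impress";

    /**
     * @param topLevelNames entry names of the root directory, without a leading slash
     * @return a type guess, DRAW_OR_IMPRESS if CompObj must be inspected, or null if nothing matched
     */
    static String guessFromNames(Set<String> topLevelNames) {
        if (topLevelNames.contains("StarWriterDocument")) {
            return SDW;
        }
        if (topLevelNames.contains("StarCalcDocument")) {
            return SDC;
        }
        if (topLevelNames.contains("StarDrawDocument3")) {
            // Both .sda and .sdd carry this entry, so a second check is needed.
            return DRAW_OR_IMPRESS;
        }
        return null;
    }
}
```

A caller would collect the root directory's entry names into a Set and fall back to inspecting CompObj only when DRAW_OR_IMPRESS comes back.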
                

[jira] [Commented] (TIKA-823) Detect StarOffice files

Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173820#comment-13173820 ] 

Nick Burch commented on TIKA-823:
---------------------------------

Note that it looks like the strings are prefixed with a 4-byte length field and are null-terminated. It looks like the first one may always start at the same place in the file; if so, you can probably skip forward to that, then use the POI utils to read the string from the DocumentInputStream.
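
A minimal sketch of reading one such string, assuming the stream content has already been copied into a byte array. The class and method names are hypothetical, the little-endian byte order and the ISO-8859-1 charset are assumptions, and the offset of the first string is left to the caller because the thread does not state it. The sketch uses only the JDK rather than the POI utilities mentioned above.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Sketch: read a string stored as a 4-byte length followed by null-terminated bytes. */
final class LengthPrefixedStrings {

    /**
     * @param data   raw stream content
     * @param offset position of the 4-byte length field (assumed little-endian)
     * @return the string without its terminator, or null if the data is too short
     *         or the length field is implausible
     */
    static String readAt(byte[] data, int offset) {
        if (offset < 0 || data.length - offset < 4) {
            return null;
        }
        int length = ByteBuffer.wrap(data, offset, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
        int start = offset + 4;
        if (length < 0 || length > data.length - start) {
            return null;
        }
        // Stop at the first null byte, whether or not the length counts the terminator.
        int end = start;
        while (end < start + length && data[end] != 0) {
            end++;
        }
        return new String(data, start, end - start, StandardCharsets.ISO_8859_1);
    }
}
```

Returning null on a bad length keeps a detector from throwing on truncated or unrelated files.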
                

[jira] [Closed] (TIKA-823) Detect StarOffice files

Posted by "Antoni Mylka (Closed) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoni Mylka closed TIKA-823.
-----------------------------

       Resolution: Fixed
    Fix Version/s: 1.1

Committed in r1221686. Thanks for the tip about DocumentInputStream. The commit also fixes the indentation in a few places, as Nick pointed out in a dev@tika email:

http://www.mail-archive.com/dev@tika.apache.org/msg03608.html


                

[jira] [Updated] (TIKA-823) Detect StarOffice files

Posted by "Antoni Mylka (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoni Mylka updated TIKA-823:
------------------------------

    Attachment: testStarOffice-5.2-write.sdw
                testStarOffice-5.2-impress.sdd
                testStarOffice-5.2-draw.sda
                testStarOffice-5.2-calc.sdc

These are the files I want to distinguish inside POIFSContainerDetector. Impress and Draw have the same set of top-level names. I'd like to distinguish them by strings contained in the raw content of the CompObj entry, but I don't know how to get that content via POI. Please have a look at my user@poi question.
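
Once the raw CompObj bytes are available, the string-based distinction could look something like the sketch below. The marker strings "StarImpress" and "StarDraw" are hypothetical placeholders, not values confirmed in this thread, and the class and method names are invented for illustration; check the attached sample files for the strings that actually appear.

```java
import java.nio.charset.StandardCharsets;

/** Sketch: tell Impress from Draw by searching raw CompObj bytes for a marker string. */
final class CompObjMarkers {

    // Hypothetical markers; replace with the strings found in real CompObj content.
    static final byte[] IMPRESS_MARKER = "StarImpress".getBytes(StandardCharsets.US_ASCII);
    static final byte[] DRAW_MARKER = "StarDraw".getBytes(StandardCharsets.US_ASCII);

    /** Plain byte-wise substring search. */
    static boolean contains(byte[] haystack, byte[] needle) {
        if (needle.length == 0) {
            return true;
        }
        outer:
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return true;
        }
        return false;
    }

    /** @return "impress", "draw", or null when neither marker is present */
    static String drawOrImpress(byte[] compObj) {
        if (contains(compObj, IMPRESS_MARKER)) {
            return "impress";
        }
        if (contains(compObj, DRAW_MARKER)) {
            return "draw";
        }
        return null;
    }
}
```

Searching the whole byte array avoids depending on a fixed offset, at the cost of being less strict than parsing the length-prefixed fields.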
                