Posted to dev@tika.apache.org by "Antoni Mylka (Created) (JIRA)" <ji...@apache.org> on 2011/12/21 00:07:31 UTC
[jira] [Created] (TIKA-823) Detect StarOffice files
Detect StarOffice files
-----------------------
Key: TIKA-823
URL: https://issues.apache.org/jira/browse/TIKA-823
Project: Tika
Issue Type: Improvement
Affects Versions: 1.1
Reporter: Antoni Mylka
I would like both MimeTypes and the POIFSContainerDetector to be able to detect files created with StarOffice Draw, Impress, Writer and Calc.
I started working on this, but stumbled upon a POI issue, which I posted to poi-user.
http://thread.gmane.org/gmane.comp.jakarta.poi.user/17857
Nick? Yegor? I know you're on the Tika list as well. Could you take a look? How can I get the raw content of the CompObj entry?
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (TIKA-823) Detect StarOffice files
Posted by "Alex Ott (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173940#comment-13173940 ]
Alex Ott commented on TIKA-823:
-------------------------------
For .sdw and .sdc you can just look at the names of the streams in the root directory: they should be /StarWriterDocument and /StarCalcDocument. For .sda and .sdd it's more complicated: they both have /StarDrawDocument3 entries, so you'll need to parse CompObj as you suggested.
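A minimal sketch of the name-based check described above, in plain Java with no POI dependency. The class and method names are hypothetical, the entry names are written without the leading slash on the assumption that this is how the container API reports them, and the vnd.stardivision media type strings are assumptions rather than something stated in this thread.

```java
import java.util.Set;

class StarOfficeNameCheck {
    // Hypothetical marker: top-level names alone cannot tell Draw from Impress.
    static final String AMBIGUOUS_DRAW_OR_IMPRESS = "ambiguous: draw or impress";

    /**
     * Guesses a media type from the top-level entry names of an OLE2 container.
     * Returns null when no StarOffice entry name is present.
     */
    static String detectByNames(Set<String> topLevelNames) {
        if (topLevelNames.contains("StarWriterDocument")) {
            return "application/vnd.stardivision.writer"; // assumed media type
        }
        if (topLevelNames.contains("StarCalcDocument")) {
            return "application/vnd.stardivision.calc"; // assumed media type
        }
        if (topLevelNames.contains("StarDrawDocument3")) {
            // Shared by .sda and .sdd, so the caller has to inspect CompObj.
            return AMBIGUOUS_DRAW_OR_IMPRESS;
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(detectByNames(Set.of("StarWriterDocument", "CompObj")));
    }
}
```

Returning a distinct "ambiguous" value keeps the cheap name lookup separate from the more expensive CompObj inspection, which is only needed for the Draw/Impress case.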
[jira] [Commented] (TIKA-823) Detect StarOffice files
Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173820#comment-13173820 ]
Nick Burch commented on TIKA-823:
---------------------------------
Note that it looks like the strings are prefixed with a 4-byte length field and are null terminated. It looks like the first one may always start at the same place in the file. If so, you can probably skip forward to that offset, then use the POI utils to read the string from the DocumentInputStream.
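To illustrate the layout described above, here is a small sketch that reads length-prefixed, null-terminated strings from a byte array using only java.nio, not the POI utilities. The class and method names are hypothetical. It assumes the length field is little-endian and counts the terminating null, and it leaves the start offset as a parameter because the thread does not give a value.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

class LengthPrefixedStrings {
    /** Wraps the raw stream bytes and skips forward to the given offset. */
    static ByteBuffer at(byte[] data, int offset) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN); // assumed byte order
        buf.position(offset);
        return buf;
    }

    /** Reads one string: a 4-byte length, then that many bytes ending in a null. */
    static String readString(ByteBuffer buf) {
        int length = buf.getInt();
        if (length < 0 || length > buf.remaining()) {
            throw new IllegalArgumentException("Implausible string length: " + length);
        }
        byte[] raw = new byte[length];
        buf.get(raw);
        int end = length;
        if (end > 0 && raw[end - 1] == 0) {
            end--; // drop the terminating null
        }
        return new String(raw, 0, end, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] data = {9, 9, 4, 0, 0, 0, 'a', 'b', 'c', 0};
        System.out.println(readString(at(data, 2)));
    }
}
```

Because readString advances the buffer position, calling it repeatedly walks through consecutive strings. The bounds check guards against a corrupt length field triggering a huge allocation.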
[jira] [Closed] (TIKA-823) Detect StarOffice files
Posted by "Antoni Mylka (Closed) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Antoni Mylka closed TIKA-823.
-----------------------------
Resolution: Fixed
Fix Version/s: 1.1
Committed in r1221686. Thanks for the tip about DocumentInputStream. The commit also fixes the indentation in a few places, as noticed by Nick in a dev@tika email:
http://www.mail-archive.com/dev@tika.apache.org/msg03608.html
[jira] [Updated] (TIKA-823) Detect StarOffice files
Posted by "Antoni Mylka (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Antoni Mylka updated TIKA-823:
------------------------------
Attachment: testStarOffice-5.2-write.sdw
testStarOffice-5.2-impress.sdd
testStarOffice-5.2-draw.sda
testStarOffice-5.2-calc.sdc
These are the files I want to distinguish inside POIFSContainerDetector. Impress and Draw have the same set of top-level names. I'd like to distinguish them by strings contained in the raw content of the CompObj entry, but I don't know how to get that content via POI. Please have a look at my user@poi question.
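As a rough sketch of the idea of telling Impress from Draw by strings in the raw CompObj bytes: the code below does a plain byte-level substring search. It assumes the raw bytes have already been obtained somehow. The marker strings "Impress" and "Draw", the class and method names, and the media type strings are all hypothetical placeholders, not values confirmed anywhere in this thread.

```java
import java.nio.charset.StandardCharsets;

class CompObjMarkers {
    /** Returns true if needle occurs anywhere in haystack (naive byte search). */
    static boolean contains(byte[] haystack, byte[] needle) {
        outer:
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    continue outer; // mismatch, try the next start position
                }
            }
            return true;
        }
        return false;
    }

    /** Classifies raw CompObj bytes, or returns null if no marker is found. */
    static String classify(byte[] compObj) {
        // Hypothetical markers; check the real test files before relying on them.
        byte[] impress = "Impress".getBytes(StandardCharsets.US_ASCII);
        byte[] draw = "Draw".getBytes(StandardCharsets.US_ASCII);
        if (contains(compObj, impress)) {
            return "application/vnd.stardivision.impress"; // assumed media type
        }
        if (contains(compObj, draw)) {
            return "application/vnd.stardivision.draw"; // assumed media type
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] sample = "xxStarImpress 5.0\0".getBytes(StandardCharsets.US_ASCII);
        System.out.println(classify(sample));
    }
}
```

Impress is checked first so that content mentioning both markers is not misreported as Draw. Whether that ordering matters for real files is something to verify against the attached samples.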