Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2018/04/17 12:01:00 UTC

[jira] [Issue Comment Deleted] (TIKA-2632) Analyze unknown govdocs files

     [ https://issues.apache.org/jira/browse/TIKA-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-2632:
------------------------------
    Comment: was deleted

(was: bq. Turned out that someone else already investigated this case a month ago...

And that someone else is none other than [~anjackson], a good friend of Tika. :))

> Analyze unknown govdocs files
> -----------------------------
>
>                 Key: TIKA-2632
>                 URL: https://issues.apache.org/jira/browse/TIKA-2632
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Andreas Meier
>            Priority: Minor
>
> I recently started analyzing random govdocs1 files that could not be recognized properly by TIKA.
>  
> This ticket should be used to identify problems with old or proprietary files and to extend TIKA step-by-step if needed.
>  
> Stumbled across the following filetypes/files:
>  
> 1. Old PowerPoint files (I expect Version 2.0 or 3.0) are not recognized properly:
> Found some mysterious files starting with 0xeddead0b and 0x0baddeed
> Turned out that someone else already investigated this case a month ago:
> http://anjackson.net/2018/03/15/story-of-a-bad-deed/
> The files are old PowerPoint. (PowerPoint 3.0 or 2.0)
> I think these magic strings should be added to tika-mimetypes.xml, along with another PowerPoint mime-type (maybe application/vnd.ms-powerpoint.2 or application/vnd.ms-powerpoint.3?).
> Example files in govdocs1: 
> 144/144504.unk
> 272/272490.unk
> 430/430427.unk
> (several more...)
> 2. Proprietary File Format: SigmaPlot Exchange File .jxf:
> Magic: 0x8888000c4a5846
> Example file in govdocs1:
> 975/975382.unk
> 975/975383.unk
>  (several more...)
> 3. There are two old Excel file types which are not recognized at the moment (application/vnd.ms-excel.sheet.2):
> 376/376222.unk and 622/62252.unk start with 0x0900040007001000 instead of 0x0900040000001000
> 224/224485.unk and 615/615187.unk start with  0x0900040002001000 instead of 0x0900040000001000
> The magic for application/vnd.ms-excel.sheet.2 should be extended. The variants
> 0x02001000
> and
> 0x07001000
> (in place of 0x00001000 after the leading 0x09000400) must be added.
> Furthermore we have to check whether the parser can be adapted to process all the mentioned files.
> (LibreOffice can open all of these files)
> 4. Special Header/Wrapper in front of application/vnd.ms-excel.sheet.3
> In file 611/611703.unk I found a 128-byte header in front of the Excel file.
> Therefore the file could not be recognized correctly by TIKA.
> After I cut the header, the file could be recognized and converted by TIKA.
> 5. SAS Data file
> Example file:
> 020/020505.unk
> 6. AirSar Data (Airborne Synthetic Aperture Radar)
> Example file:
> 348/349489.unk (several more...)
> 7. Advanced Data Format (ADF)
> Used in CGNS (CFD General Notation System .cgns)
> Example file:
> 363/363966.unk
> 8. Unknown Microsoft Word Document
> Example file:
> 202/202718.unk
> (Recognized as Microsoft Word Document by Linux Magic)
> 9. Unknown PowerPoint 3.0 file?
> Example file:
> 388/388212.unk
> 10. Microsoft Compound File Binary File Format?
> Example file:
> 857/857353.unk
> Let me know if I should open separate tickets for cases 1 and 3!
> If there is a better place (other than the mailing lists) to publish the analysis results, let me know.
>  
> Regards
>  
> Andreas
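
For triaging further .unk files, the magic values quoted in items 1 to 3 can be checked with a few lines of Python. This is only a rough sketch and not how Tika itself detects types: the table name SIGNATURES, the function identify and the descriptive labels are made up for illustration, and only the byte values come from the issue text. The optional offset parameter is a way to probe for a payload behind a fixed-length wrapper such as the 128-byte header in item 4. The Excel 3 magic is not given in the issue, so it is not included here.

```python
from typing import Optional

# Byte signatures copied from the issue text. The labels are informal
# descriptions chosen for this sketch, not registered MIME types.
SIGNATURES = {
    bytes.fromhex("eddead0b"): "old PowerPoint (2.0 or 3.0)",
    bytes.fromhex("0baddeed"): "old PowerPoint (2.0 or 3.0)",
    bytes.fromhex("8888000c4a5846"): "SigmaPlot Exchange File (.jxf)",
    bytes.fromhex("0900040000001000"): "Excel 2 sheet (expected magic)",
    bytes.fromhex("0900040002001000"): "Excel 2 sheet (variant 0x02001000)",
    bytes.fromhex("0900040007001000"): "Excel 2 sheet (variant 0x07001000)",
}


def identify(data: bytes, offset: int = 0) -> Optional[str]:
    """Return the label of the first signature found at `offset`, else None."""
    for magic, label in SIGNATURES.items():
        # bytes.startswith accepts a start position, so no slicing is needed.
        if data.startswith(magic, offset):
            return label
    return None
```

Reading the first few hundred bytes of a file and calling identify(data), then identify(data, offset=128) if the first call returns None, would cover both the plain and the wrapped case under these assumptions.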



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)