Posted to dev@tika.apache.org by "Andreas Meier (JIRA)" <ji...@apache.org> on 2018/04/18 07:47:00 UTC

[jira] [Comment Edited] (TIKA-2632) Analyze unknown govdocs files

    [ https://issues.apache.org/jira/browse/TIKA-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16442059#comment-16442059 ] 

Andreas Meier edited comment on TIKA-2632 at 4/18/18 7:46 AM:
--------------------------------------------------------------

Thanks for the link [~tallison@mitre.org]

Glad to see you joining the discussion [~anjackson].

In my opinion Tika should try to detect as many file types as possible.
I don't think this will confuse people too much, since there are already formats that Tika can't parse out of the box without additional libraries (e.g. GDAL).
To avoid confusion we can also log a warning for specific file types.


BTW: It seems that GDAL can read/parse the AirSar data files mentioned above: http://www.gdal.org/frmt_airsar.html





> Analyze unknown govdocs files
> -----------------------------
>
>                 Key: TIKA-2632
>                 URL: https://issues.apache.org/jira/browse/TIKA-2632
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Andreas Meier
>            Priority: Minor
>
> I recently started to analyze random govdocs1 files that TIKA could not recognize properly.
>  
> This ticket should be used to identify problems with old or proprietary files and to extend TIKA step-by-step if needed.
>  
> Stumbled across the following filetypes/files:
>  
> 1. Old PowerPoint files (I expect Version 2.0 or 3.0) are not recognized properly:
> Found some mysterious files starting with 0xeddead0b and 0x0baddeed
> Turned out that someone else already investigated this case a month ago:
> http://anjackson.net/2018/03/15/story-of-a-bad-deed/
> The files are old PowerPoint. (PowerPoint 3.0 or 2.0)
> I think these magic strings should be added to tika-mimetypes.xml, along with another PowerPoint mime-type (maybe application/vnd.ms-powerpoint.2 or application/vnd.ms-powerpoint.3?).
> Example files in govdocs1: 
> 144/144504.unk
> 272/272490.unk
> 430/430427.unk
> (several more...)
> 2. Proprietary File Format: SigmaPlot Exchange File .jxf:
> Magic: 0x8888000c4a5846
> Example file in govdocs1:
> 975/975382.unk
> 975/975383.unk
>  (several more...)
> 3. There are two old Excel file types which are not recognized at the moment (application/vnd.ms-excel.sheet.2):
> 376/376222.unk and 622/62252.unk start with 0x0900040007001000 instead of 0x0900040000001000
> 224/224485.unk and 615/615187.unk start with  0x0900040002001000 instead of 0x0900040000001000
> The magic for application/vnd.ms-excel.sheet.2 should be adapted:
> 0x02001000
> and
> 0x07001000
> must be added.
> Furthermore we have to check whether the parser can be adapted to process all the mentioned files.
> (LibreOffice can open all of these files)
> 4. 128-byte header in front of files 
> There are several files in the corpus that start with a 128-byte long header in front of the actual file.
> The header contains the filename and a specific filetype (TEXTXCEL for 4.1 and SLD3PPT3 for 4.2)
> 4.1 In file 611/611703.unk I found a 128-byte long header in front of the Excel file (application/vnd.ms-excel.sheet.3).
> Therefore the file could not be recognized correctly by TIKA.
> After I cut the header, the file could be recognized and converted by TIKA.
> 4.2 The following files are old PowerPoint files with a leading 128-byte header
> 388/388212.unk
> 775/775724.unk
> 790/790351.unk
> 5. SAS Data file
> Example file:
> 020/020505.unk
> 6. AirSar Data (Airborne Synthetic Aperture Radar)
> Example file:
> 348/349489.unk (several more...)
> 7. Advanced Data Format (ADF)
> Used in CGNS (CFD General Notation System .cgns)
> Example file:
> 363/363966.unk
> 8. Unknown (old?) Microsoft Word Document
> Example file:
> 202/202718.unk
> (Recognized as Microsoft Word Document by Linux Magic)
> 9. Raw weather data from NWS (NOAA)
> SXXX.. KWAL ...
> Example files:
> 136/136247.unk
> 400/400289.unk
> 10. Microsoft Compound File Binary File Format?
> Files of this type have already been handled by [~tallison@mitre.org] in TIKA-1813
> Example file
> 857/857353.unk
> Let me know if I should open separate tickets for cases 1 and 3.
> If there is a better place (other than the mailing lists) to publish the analysis results, let me know.
>  
> Regards
>  
> Andreas
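
For readers who want to try the checks from items 1 to 4 of the ticket on their own copy of the corpus, here is a minimal sketch in Python. It is not Tika code and not what tika-mimetypes.xml does internally; it only compares leading bytes against the magic values quoted above. The function names and the PowerPoint/SigmaPlot labels are made up for illustration. The ticket does not say where TEXTXCEL or SLD3PPT3 sits inside the 128-byte header, so the sketch simply assumes the marker appears somewhere in those first 128 bytes.

```python
# Magic prefixes quoted in the ticket. The PowerPoint and SigmaPlot labels
# are illustrative placeholders, not registered Tika MIME types.
MAGIC_PREFIXES = [
    (bytes.fromhex("eddead0b"), "old PowerPoint (2.0 or 3.0)"),
    (bytes.fromhex("0baddeed"), "old PowerPoint (2.0 or 3.0)"),
    (bytes.fromhex("8888000c4a5846"), "SigmaPlot Exchange File (.jxf)"),
    (bytes.fromhex("0900040000001000"), "application/vnd.ms-excel.sheet.2"),
    (bytes.fromhex("0900040002001000"), "application/vnd.ms-excel.sheet.2"),
    (bytes.fromhex("0900040007001000"), "application/vnd.ms-excel.sheet.2"),
]

HEADER_LEN = 128
# Type markers the ticket reports inside the 128-byte header (item 4).
HEADER_MARKERS = (b"TEXTXCEL", b"SLD3PPT3")


def sniff(data):
    """Return a label if data starts with one of the known magics, else None."""
    for prefix, label in MAGIC_PREFIXES:
        if data.startswith(prefix):
            return label
    return None


def strip_header(data):
    """Drop the leading 128 bytes if they contain one of the header markers.

    Assumption: the marker may sit anywhere in the first 128 bytes.
    """
    head = data[:HEADER_LEN]
    if len(data) > HEADER_LEN and any(m in head for m in HEADER_MARKERS):
        return data[HEADER_LEN:]
    return data


def sniff_with_header(data):
    """Try the raw bytes first, then retry after removing a wrapper header."""
    return sniff(data) or sniff(strip_header(data))
```

To check a file from the corpus, read its first few hundred bytes in binary mode and pass them to `sniff_with_header`. Formats from the ticket for which no magic was given (SAS, ADF, the raw weather data) return None.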



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)