Posted to dev@tika.apache.org by "Sameer (Jira)" <ji...@apache.org> on 2021/05/26 11:10:00 UTC

[jira] [Comment Edited] (TIKA-2689) *.ai type (Adobe illustrator ) files are not detected correctly.

    [ https://issues.apache.org/jira/browse/TIKA-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17351715#comment-17351715 ] 

Sameer edited comment on TIKA-2689 at 5/26/21, 11:09 AM:
---------------------------------------------------------

[~amitp7007] thanks, that worked for me as well.

Adding a few more details for anyone who needs a complete solution:

1. Create a file called custom-mimetypes.xml with the following content. Note the <mime-info> tags added around the XML provided by Amit.

<?xml version="1.0" encoding="UTF-8"?>
<mime-info>
  <mime-type type="application/illustrator">
    <alias type="application/vnd.adobe.illustrator"/>
    <acronym>AI</acronym>
    <_comment>Adobe Illustrator Artwork</_comment>
    <magic priority="50">
      <!-- Normally just %PDF- -->
      <match value="%PDF-" type="string" offset="0"/>
      <!-- Sometimes has a UTF-8 Byte Order Mark first -->
      <match value="\xef\xbb\xbf%PDF-" type="string" offset="0"/>
    </magic>
    <magic priority="20">
      <!-- Low priority match for %PDF-#.# near the start of the file -->
      <!-- Can trigger false positives, so set the priority rather low here -->
      <match value="%PDF-1." type="string" offset="1:512"/>
      <match value="%PDF-2." type="string" offset="1:512"/>
    </magic>
    <glob pattern="*.ai"/>
    <sub-class-of type="application/postscript"/>
  </mime-type>
</mime-info>

2. Put the file on your classpath as org/apache/tika/mime/custom-mimetypes.xml. That's it: Tika will pick it up and register the new matcher.
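
If detection does not change after step 2, the file may simply not be visible where Tika looks for it. The following is a minimal sketch of a sanity check, using only the JDK. The class name {{CustomMimeTypesCheck}} and the method {{locate}} are made up for illustration, and the resource path is my understanding of where Tika looks for custom-mimetypes.xml (the guide linked in the callouts describes the same location); treat it as an assumption and verify it against your Tika version.

```java
import java.net.URL;

// Hypothetical helper, not part of Tika: checks whether a resource is
// visible on the classpath of the current thread.
final class CustomMimeTypesCheck {

    // Assumed location of the custom definitions file (see lead-in).
    static final String EXPECTED_PATH = "org/apache/tika/mime/custom-mimetypes.xml";

    // Returns the URL of the resource, or null if it is not on the classpath.
    static URL locate(String resourcePath) {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        if (cl == null) {
            cl = ClassLoader.getSystemClassLoader();
        }
        return cl.getResource(resourcePath);
    }

    public static void main(String[] args) {
        URL url = locate(EXPECTED_PATH);
        if (url == null) {
            System.out.println("Not found on classpath: " + EXPECTED_PATH);
        } else {
            System.out.println("Found: " + url);
        }
    }
}
```

Run it with the same classpath as your application. If it reports that the file is not found, fix the packaging first before looking at the XML itself.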

 

*A couple of callouts:*
 # This is still not an accurate signature for AI files; it only lets files with the signature above be matched as AI as well.
 # You still need to give Tika a filename hint via {{tika.detect(inputStream, fileName)}}. Only then is the file resolved as AI; otherwise it matches both PDF and AI and is resolved as PDF.
 # You might find [this|https://tika.apache.org/1.8/parser_guide.html#Add_your_MIME-Type] guide helpful as well.
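
To make the second callout concrete, here is a simplified sketch of the behaviour described above, written with the JDK only. It is not Tika's actual algorithm: the class {{AiDetectSketch}} and its methods are hypothetical, and the resolution rule (magic match first, then the {{*.ai}} glob as a tie-breaker) is an assumption that only mirrors what this comment observes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical illustration only; Tika's real detector is more involved.
final class AiDetectSketch {
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PDF_1 = "%PDF-1.".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PDF_2 = "%PDF-2.".getBytes(StandardCharsets.US_ASCII);
    // UTF-8 byte order mark followed by "%PDF-"
    private static final byte[] BOM_PDF = {
        (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '%', 'P', 'D', 'F', '-'
    };

    // True if needle occurs in data starting at some offset in [from, to].
    static boolean matchesAt(byte[] data, byte[] needle, int from, int to) {
        for (int start = from; start <= to && start + needle.length <= data.length; start++) {
            boolean same = true;
            for (int i = 0; i < needle.length; i++) {
                if (data[start + i] != needle[i]) {
                    same = false;
                    break;
                }
            }
            if (same) {
                return true;
            }
        }
        return false;
    }

    // Mirrors the <magic> rules from the XML: offset 0, BOM at offset 0,
    // and the low-priority matches in the offset range 1:512.
    static boolean hasPdfMagic(byte[] head) {
        return matchesAt(head, PDF, 0, 0)
            || matchesAt(head, BOM_PDF, 0, 0)
            || matchesAt(head, PDF_1, 1, 512)
            || matchesAt(head, PDF_2, 1, 512);
    }

    // Both PDF and AI share the same magic, so only the file name
    // (the *.ai glob) can tip the result towards Illustrator.
    static String detect(byte[] head, String fileName) {
        if (!hasPdfMagic(head)) {
            return "application/octet-stream";
        }
        if (fileName != null && fileName.toLowerCase(Locale.ROOT).endsWith(".ai")) {
            return "application/illustrator";
        }
        return "application/pdf";
    }

    public static void main(String[] args) {
        byte[] head = "%PDF-1.5\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(head, "logo.ai"));
        System.out.println(detect(head, null));
    }
}
```

The point of the sketch is that the bytes alone cannot distinguish the two types with this signature, which is why passing the file name to {{detect}} matters.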

Since it has been quite some time since the last comment on this thread, has there been any progress on matching AI files better?

 



> *.ai type (Adobe illustrator ) files are not detected correctly.
> ----------------------------------------------------------------
>
>                 Key: TIKA-2689
>                 URL: https://issues.apache.org/jira/browse/TIKA-2689
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.16, 1.17, 1.18
>            Reporter: Amit Pandey
>            Priority: Major
>         Attachments: example.ai
>
>
> There is an inconsistency in detecting *.ai files across the overloaded detect methods. With _detect(String filename)_ I get the correct file type, "*application/illustrator*". With _detect(InputStream is, String filename)_ or _detect(File fileObj)_ I get "*application/pdf*".
> Here is the sample code I used.
>   [https://stackoverflow.com/questions/51359351/tika-detect-method-not-giving-same-exact-file-type|http://example.com/]
>  


