Posted to dev@chemistry.apache.org by Mark Streit <mc...@gmail.com> on 2017/01/23 17:20:04 UTC

MIME Types for BMP and XML files - diff between Chemistry Client API and Apache Tika

Chemistry folks,

Wondering if anyone has run into an issue similar to the one we are seeing.
We have a custom product built around Apache Chemistry 0.11.0, with the
following dependencies in our app:

<dependency>
<groupId>org.apache.chemistry.opencmis</groupId>
<artifactId>chemistry-opencmis-client-api</artifactId>
<version>0.11.0</version>
</dependency>
<dependency>
<groupId>org.apache.chemistry.opencmis</groupId>
<artifactId>chemistry-opencmis-client-bindings</artifactId>
<version>0.11.0</version>
</dependency>
<dependency>
<groupId>org.apache.chemistry.opencmis</groupId>
<artifactId>chemistry-opencmis-client-impl</artifactId>
<version>0.11.0</version>
</dependency>


During an upload (creation) of a CMIS Document, we set the
cmis:contentStreamMimeType property based on the value returned by
Apache Tika through its Detector interface.

Two cases are causing problems for us: files with the extensions .bmp and
.xml. In each case, Tika returns a value that does not match the one in
the class org.apache.chemistry.opencmis.commons.impl.MimeTypes.

For a file *widget.bmp*:

   - The MIME returned from Tika is *"image/x-ms-bmp"* and our application
   code successfully creates the cmis:Document object, setting the
   cmis:contentStreamMimeType to "*image/x-ms-bmp*".
   - If you create the content using the *Chemistry Workbench,* the content
   is created successfully as well, but the cmis:contentStreamMimeType is
   set to "*image/bmp*".

Likewise for a file *another_widget.xml*:

   - The MIME returned from Tika is *"application/xml"* and our application
   code successfully creates the cmis:Document object, setting the
   cmis:contentStreamMimeType to "*application/xml*".
   - If you create the content using the *Chemistry Workbench,* the content
   is created successfully as well, but the cmis:contentStreamMimeType is
   set to "*text/xml*".


As far as we can determine, the class
*org.apache.chemistry.opencmis.commons.impl.MimeTypes* includes the
following lines:

   MIME2EXT.put("text/xml", "xml");

   MIME2EXT.put("image/bmp", "bmp");
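
To illustrate why an extension-based table would yield "image/bmp" and "text/xml" for these two files, here is a minimal sketch. It is a hypothetical miniature, not the actual OpenCMIS class: the class name ExtensionMimeLookup, the reverse map EXT2MIME, the method mimeTypeForFileName, and the "application/octet-stream" fallback are all assumptions made for illustration. Only the two MIME2EXT entries come from the lines quoted above.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical miniature of an extension-based MIME table (NOT the real OpenCMIS MimeTypes class).
final class ExtensionMimeLookup {
    private static final Map<String, String> MIME2EXT = new HashMap<>();
    // Assumed reverse table, derived from MIME2EXT for this sketch.
    private static final Map<String, String> EXT2MIME = new HashMap<>();

    static {
        // The two entries quoted in the message.
        MIME2EXT.put("text/xml", "xml");
        MIME2EXT.put("image/bmp", "bmp");
        for (Map.Entry<String, String> e : MIME2EXT.entrySet()) {
            EXT2MIME.put(e.getValue(), e.getKey());
        }
    }

    // Looks only at the file extension; the file content is never inspected.
    static String mimeTypeForFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "application/octet-stream"; // assumed fallback for this sketch
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return EXT2MIME.getOrDefault(ext, "application/octet-stream");
    }

    public static void main(String[] args) {
        System.out.println(mimeTypeForFileName("widget.bmp"));         // prints image/bmp
        System.out.println(mimeTypeForFileName("another_widget.xml")); // prints text/xml
    }
}
```

A lookup of this kind depends only on the file name, which would explain why its result can differ from a content-based detector such as Tika.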


The reason this matters, at least in our case, is that our backend
CMIS service implementation is Alfresco Enterprise 5.1, and apparently
its "transformation" service, which automatically generates PDF
renditions of uploaded files, does not generate the PDF rendition when
the MIME values returned by Apache Tika are specified.

However, if the values from
*org.apache.chemistry.opencmis.commons.impl.MimeTypes* are used to set
the cmis:contentStreamMimeType property instead, the PDF rendition is
created successfully in both cases.
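
One possible workaround, sketched here under stated assumptions, is to normalize the detected type before it is used to set cmis:contentStreamMimeType. The class name MimeTypeNormalizer, the ALIASES table, and the normalize method are hypothetical names introduced for this example. The two mappings simply mirror the two cases described in this message, so they reflect what worked against our Alfresco backend and are not a general rule.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: maps detector results to the values the repository handled correctly.
final class MimeTypeNormalizer {
    private static final Map<String, String> ALIASES = new HashMap<>();

    static {
        // Only the two cases from this thread; extend as needed.
        ALIASES.put("image/x-ms-bmp", "image/bmp");
        ALIASES.put("application/xml", "text/xml");
    }

    // Returns the mapped type if one is known, otherwise the detected value unchanged.
    static String normalize(String detected) {
        if (detected == null) {
            return null;
        }
        String key = detected.trim().toLowerCase(Locale.ROOT);
        return ALIASES.getOrDefault(key, detected);
    }
}
```

The string returned by normalize would then be used wherever the application currently passes the raw Tika result as the MIME type of the content stream.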

Obviously this points more to the internal transformation services
provided by Alfresco (I believe ImageMagick for BMP and PDFBox for XML
files)... *but the broader question is more about the DIFFERENCE in
what Apache Tika returns vs what the CMIS Client API returns*. It
seems that the values Tika returns may cause downstream issues,
depending on what is being done to the contentStream of the
cmis:Document instance.


Note that our only reason for using Apache Tika was that we saw it
mentioned in the Manning book on CMIS and Chemistry:
https://www.manning.com/books/cmis-and-apache-chemistry-in-action
(an extremely helpful book BTW)


Thanks,


Mark Streit

Re: MIME Types for BMP and XML files - diff between Chemistry Client API and Apache Tika

Posted by Nick Burch <ap...@gagravarr.org>.
On Mon, 23 Jan 2017, Mark Streit wrote:
> For a file *widget.bmp*:
>
>   - The MIME returned from Tika is *"image/x-ms-bmp"* and our application
>   code successfully creates the cmis:Document object setting the
> cmis:contentStreamMimeType
>   to "*image/x-ms-bmp*".
>   - If you create the content using the *Chemistry Workbench,* the content
>   is created successfully as well, but the cmis:contentStreamMimeType is
>   set to "*image/bmp*".

See TIKA-2250. Chemistry uses the "common" but until recently unofficial
mimetype; Tika was giving the "correct" one, with the other as an alias.
However, as of a few months ago (see RFC 7903), the one that Chemistry
uses is now official, so Tika will switch.

> Likewise for a file *another_widget.xml*:
>
>   - The MIME returned from Tika is *"application/xml"* and our application
>   code successfully creates the cmis:Document object setting the
> cmis:contentStreamMimeType
>   to "*application/xml*".
>   - If you create the content using the *Chemistry Workbench,* the content
>   is created successfully as well, but the cmis:contentStreamMimeType is
>   set to "*text/xml*".

Generally text/xml is used for XML files which can be sensibly looked at 
by humans, application/xml for everything else.

Nick