Posted to dev@chemistry.apache.org by Tim Webster <ti...@gmail.com> on 2014/08/29 18:09:06 UTC

MIME types preventing addition of content

Hi,

I'm deriving the mime type of documents (using Apache Tika) and then adding
them to my CMIS repository using a Chemistry Java client.  For some reason,
certain mime types seem to prevent the content from being added.

The document gets added, but the mime type and content are empty.

The biggest offender seems to be application/msword, and there seem to be
others.

I've used the same code for the past couple of years to do this, and the
only thing I've changed is the value of the mime type in the ContentStream.
Previously, I just set everything to 'application/octet-stream'. If I
switch back to that, it works fine. The Tika libraries are doing their job
and returning the correct mime type.

If I add the same document through the workbench, the document gets added,
and the mime type and content are totally fine.

I've attached a screenshot of my CMIS workbench so you can see the effect.

Anyone have any ideas?

[image: Inline image 1]

Re: MIME types preventing addition of content

Posted by Jay Brown <ja...@us.ibm.com>.
If the mime type is not supplied, FileNet will try to guess it from the
filename on the content stream. If memory serves, supplying the mime type
should override its guess.
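
The precedence described above (an explicitly supplied type wins, otherwise guess from the file name) can be sketched roughly as follows. This is only an illustration in plain Java, not FileNet's or OpenCMIS's actual code: the class name MimeTypeResolver, the tiny extension table, and the octet-stream fallback are all assumptions made for the example.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper; not FileNet's or OpenCMIS's actual implementation. */
final class MimeTypeResolver {

    static final String FALLBACK = "application/octet-stream";

    // Small illustrative extension table (an assumption); real repositories use far larger ones.
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "doc", "application/msword",
            "pdf", "application/pdf",
            "txt", "text/plain");

    private MimeTypeResolver() {}

    /** Returns the supplied type if present; otherwise guesses from the file name. */
    static String resolve(String suppliedMimeType, String fileName) {
        if (suppliedMimeType != null && !suppliedMimeType.isBlank()) {
            return suppliedMimeType.trim();
        }
        return guessFromFileName(fileName);
    }

    /** Looks up the extension after the last dot; unknown or missing extensions give the fallback. */
    static String guessFromFileName(String fileName) {
        if (fileName == null) {
            return FALLBACK;
        }
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return FALLBACK;
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(extension, FALLBACK);
    }
}
```

With this sketch, a supplied "application/msword" is kept even when the file name ends in .txt, and a missing type on "report.DOC" is guessed as application/msword.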

Jay Brown
Senior Engineer, ECM Development
IBM Software Group
jay.brown@us.ibm.com
www.linkedin.com/in/parityerror/


From: Florian Müller <fm...@apache.org>
To: dev@chemistry.apache.org
Cc: tim.webster@gmail.com
Date: 08/30/2014 04:45 AM
Subject: Re: MIME types preventing addition of content

Hi Tim,

You are supposed to set the MIME type. Repositories should accept it
even if it doesn't make sense. (For example, a zero byte document is
never a valid Word file.)

Some repositories handle content with no MIME type or the MIME type
"application/octet-stream" differently and try to determine the correct
MIME type. Alfresco does it, the SAP Document Service does it, and
probably others too. But you cannot rely on that.
(@Jay: What does FileNet do?)

SharePoint does it completely differently (and is not spec compliant in
this regard). It ignores your MIME type. Instead it determines the MIME
type when you access the document based on the file extension. That is,
if you change the name (and extension) after you've uploaded the
document, you get a different MIME type. The worst part is that if
SharePoint doesn't know the file extension, it doesn't return any MIME
type.

I hope that helps.


- Florian




Re: MIME types preventing addition of content

Posted by Florian Müller <fm...@apache.org>.
Hi Tim,

You are supposed to set the MIME type. Repositories should accept it
even if it doesn't make sense. (For example, a zero byte document is
never a valid Word file.)

Some repositories handle content with no MIME type or the MIME type
"application/octet-stream" differently and try to determine the correct
MIME type. Alfresco does it, the SAP Document Service does it, and
probably others too. But you cannot rely on that.
(@Jay: What does FileNet do?)

SharePoint does it completely differently (and is not spec compliant in
this regard). It ignores your MIME type. Instead it determines the MIME
type when you access the document based on the file extension. That is,
if you change the name (and extension) after you've uploaded the
document, you get a different MIME type. The worst part is that if
SharePoint doesn't know the file extension, it doesn't return any MIME type.
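
To make the differences concrete, here is a toy model in plain Java of the three behaviours described above. It is not code from any of the named repositories; the class RepositoryMimeModel, the policy names, and the two-entry extension table are hypothetical and exist only for illustration.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Toy model of the behaviours described in this thread; all names are hypothetical. */
final class RepositoryMimeModel {

    enum Policy {
        AS_SUPPLIED,          // store whatever the client sent
        DETECT_WHEN_GENERIC,  // re-detect only if missing or application/octet-stream
        EXTENSION_ON_READ     // ignore the client; derive from the current file name
    }

    private static final String GENERIC = "application/octet-stream";

    // Illustrative table only (an assumption), standing in for a repository's extension list.
    private static final Map<String, String> KNOWN_EXTENSIONS = Map.of(
            "doc", "application/msword",
            "pdf", "application/pdf");

    private RepositoryMimeModel() {}

    /** Returns the MIME type a client would read back, or empty if none is reported. */
    static Optional<String> reportedMimeType(Policy policy, String supplied,
                                             String detectedByRepository, String currentName) {
        switch (policy) {
            case AS_SUPPLIED: {
                return Optional.ofNullable(supplied);
            }
            case DETECT_WHEN_GENERIC: {
                boolean generic = supplied == null || supplied.isBlank()
                        || GENERIC.equalsIgnoreCase(supplied.trim());
                return Optional.ofNullable(generic ? detectedByRepository : supplied);
            }
            case EXTENSION_ON_READ: {
                int dot = currentName == null ? -1 : currentName.lastIndexOf('.');
                if (dot < 0) {
                    return Optional.empty();
                }
                String extension = currentName.substring(dot + 1).toLowerCase(Locale.ROOT);
                return Optional.ofNullable(KNOWN_EXTENSIONS.get(extension));
            }
            default:
                throw new IllegalArgumentException("Unknown policy: " + policy);
        }
    }
}
```

Under EXTENSION_ON_READ, renaming a.doc to a.xyz changes the result from application/msword to no MIME type at all, which is the SharePoint effect described above.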

I hope that helps.


- Florian


Re: MIME types preventing addition of content

Posted by Tim Webster <ti...@gmail.com>.
Update - the problem I was having was against FileNet P8.  I tried the same
code against Alfresco 4.2f, and I also got strange behaviour (although
different).

With Alfresco, if you explicitly set the mime type when you add the
document, the mime type gets set OK, but the content doesn't seem to be
present (length of 0 in workbench).  Also with Alfresco, if you set the
mime type to 'application/octet-stream', it seems to override this with the
actual correct mime type and import the document successfully (content and
all).
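
The symptoms above (mime type stored but length 0, or mime type silently replaced) can also be checked in code after each upload instead of by eye in the workbench. Below is a minimal sketch in plain Java. The class UploadCheck and its method are hypothetical, and the stored values are passed in as plain parameters; with an OpenCMIS client they would presumably be read back from the created document (for example via getContentStreamMimeType() and getContentStreamLength()). A negative stored length is treated as "unknown" here, which is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical post-upload sanity check; compares what was sent with what the repository reports. */
final class UploadCheck {

    private UploadCheck() {}

    /** Returns a human-readable list of differences; an empty list means nothing suspicious. */
    static List<String> findProblems(String sentMimeType, long sentLength,
                                     String storedMimeType, long storedLength) {
        List<String> problems = new ArrayList<>();

        if (storedMimeType == null || storedMimeType.isBlank()) {
            problems.add("no MIME type stored");
        } else if (!storedMimeType.trim().equalsIgnoreCase(sentMimeType)) {
            problems.add("MIME type changed from " + sentMimeType + " to " + storedMimeType);
        }

        if (sentLength > 0 && storedLength == 0) {
            problems.add("content is empty");
        } else if (storedLength >= 0 && storedLength != sentLength) {
            // A negative stored length is treated as unknown and is not reported.
            problems.add("length changed from " + sentLength + " to " + storedLength);
        }
        return problems;
    }
}
```

A changed mime type is not necessarily an error (the octet-stream case above is a deliberate override by the repository), so a caller may want to log that entry rather than fail on it.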

So I'm a bit confused as to how all this is supposed to work.  Should we
not be setting the mime type ourselves?  Why do the repository
implementations seem to want to interfere with this?

Again, this is mostly with an 'old' Word format document (.doc) and the
application/msword mime type, which our customers' repositories are full
of.

Tim





