Posted to user@couchdb.apache.org by "Mclean, Adam" <ad...@rbc.com> on 2012/11/01 21:49:28 UTC

Storing text/plain attachments and digest

Hi list,

I'm working on using Couch as a script / configuration file repository,
with various hosts retrieving the contents as they are updated.

The digest produced by the file upload is key to this working for me, so
that I'm not replacing files that are already identical in Couch.  I've
been struggling with attachments that are uploaded with a content-type of
'text/plain': I haven't been able to match the MD5 that Couch shows me
under any circumstances.  The text/plain content I'm sending up is
US-ASCII, which, to my understanding, would produce the same md5sum when
represented as UTF-8.

When I upload a file with a content-type other than text/plain, the MD5
that Couch reports matches the md5sum of the file.

I've also noticed the headers when retrieving a text/plain attachment
show as:
HTTP/1.1 200 OK
Server: CouchDB/1.2.0 (Erlang OTP/R15B02)
ETag: "2DvSqQDg7SoM+Su+Ymkt4A=="
Date: Thu, 01 Nov 2012 20:36:37 GMT
Content-Type: text/plain
Content-Length: 8840
Cache-Control: must-revalidate
Accept-Ranges: none

Versus an attachment uploaded with a different content-type:
Server: CouchDB/1.2.0 (Erlang OTP/R15B02)
ETag: "tDK39M3il/9xLqnYLosWKg=="
Date: Thu, 01 Nov 2012 20:44:42 GMT
Content-Type: application/x-perl
Content-MD5: tDK39M3il/9xLqnYLosWKg==
Content-Length: 7080
Cache-Control: must-revalidate
Accept-Ranges: bytes

Which produces a 'Content-MD5' that I can verify / duplicate.

What am I missing?  Are md5s not generated for text/plain?

Thanks!
_______________________________________________________________________

This email may be privileged and/or confidential, and the
sender does not waive any related rights and obligations.
Any distribution, use or copying of this email or the
information it contains by other than an intended recipient
is unauthorized. If you received this email in error,
please advise the sender (by return email or otherwise)
immediately. You have consented to receive the attached
electronically at the above-noted email address; please retain a
copy of this confirmation for future reference.

Ce courriel est confidentiel et protégé. L'expéditeur ne renonce
pas aux droits et obligations qui s'y rapportent. Toute diffusion,
utilisation ou copie de ce courriel ou des renseignements qu'il
contient par une personne autre que le (les) destinataire(s)
désigné(s) est interdite. Si vous recevez ce courriel par erreur,
veuillez en aviser l'expéditeur immédiatement, par retour de courriel
ou par un autre moyen. Vous avez accepté de recevoir le(s) document(s)
ci-joint(s) par voie électronique à l'adresse courriel indiquée ci-dessus;
veuillez conserver une copie de cette confirmation pour les fins de reference future.

Re: Storing text/plain attachments and digest

Posted by Robert Newson <rn...@apache.org>.
CouchDB compresses some attachments (determined by content-type).
"text/plain" is one of them (the full list is in your default.ini file),
and so the MD5 returned is that of the compressed form.
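
To illustrate the point, here is a minimal Python sketch showing that the
MD5 of a gzip-compressed body differs from the MD5 of the raw bytes. The
file contents and the compression level of 8 are assumptions made for
illustration. The bytes CouchDB actually stores depend on its own zlib
settings, so this does not try to reproduce the exact digest CouchDB
reports; it only shows why md5sum on the local file will not match.

```python
import base64
import gzip
import hashlib


def md5_b64(data: bytes) -> str:
    """Base64-encoded MD5, the form seen in the ETag / Content-MD5 headers."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


# Hypothetical attachment contents, standing in for a us-ascii script.
raw = b"#!/usr/bin/perl\nprint \"hello\\n\";\n" * 50

# Compression level 8 is an assumption here, not a value confirmed in this thread.
compressed = gzip.compress(raw, compresslevel=8)

raw_digest = md5_b64(raw)                # corresponds to md5sum of the local file
compressed_digest = md5_b64(compressed)  # digest of a compressed form of the same data

print("raw:       ", raw_digest)
print("compressed:", compressed_digest)
```

The two printed values differ even though decompressing the second body
yields exactly the original bytes.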



Re: Storing text/plain attachments and digest

Posted by Keith Gable <zi...@ignition-project.com>.
Another option would be to send the files as application/octet-stream. I
don't think those are compressed.
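
A rough way to guess whether a given content-type would be compressed is
to match it against the compressible-types list from the configuration.
The sketch below is only an approximation: the pattern list is an assumed
value (check the [attachments] section of your own default.ini), and the
wildcard matching via fnmatch is a stand-in for whatever CouchDB does
internally.

```python
from fnmatch import fnmatchcase

# Assumed value of the compressible_types setting; verify it against the
# [attachments] section of your own default.ini / local.ini.
ASSUMED_COMPRESSIBLE_TYPES = (
    "text/*, application/javascript, application/json, application/xml"
)


def is_probably_compressed(
    content_type: str, patterns: str = ASSUMED_COMPRESSIBLE_TYPES
) -> bool:
    """Return True if content_type matches one of the comma-separated patterns."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return any(
        fnmatchcase(media_type, pattern.strip().lower())
        for pattern in patterns.split(",")
        if pattern.strip()
    )


for ct in ("text/plain", "application/x-perl", "application/octet-stream"):
    print(ct, "->", is_probably_compressed(ct))
```

Under the assumed list, text/plain matches while application/x-perl and
application/octet-stream do not, which is consistent with the headers
shown earlier in the thread.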


---
Keith Gable
A+ Certified Professional
Network+ Certified Professional
Storage+ Certified Professional
Mobile Application Developer / Web Developer



On Thu, Nov 1, 2012 at 9:09 PM, Mclean, Adam <ad...@rbc.com> wrote:

> Of course! Makes perfect sense considering the compression.  Adding my
> own property to the document sounds like the winner then.
>
> Thanks for the direction.
>

RE: Storing text/plain attachments and digest

Posted by "Mclean, Adam" <ad...@rbc.com>.
Of course! Makes perfect sense considering the compression.  Adding my
own property to the document sounds like the winner then.

Thanks for the direction.


Re: Storing text/plain attachments and digest

Posted by Robert Newson <rn...@apache.org>.
To be specific, the Content-MD5 is always the MD5 of the response body, but
this is not necessarily true for ETag. If you do want it to match, then
either use a content-type that is not going to be compressed, or remove the
content-type from couchdb's configuration.

It is appropriate (w.r.t. RFC 2616) to depend on the Content-MD5 header. If
you supply one when PUT'ting a standalone attachment, we'll even verify it
matches and return an error if it doesn't.
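
As a sketch of what supplying the header could look like from a client,
the Python below computes a Content-MD5 value (base64 of the raw MD5
digest, per RFC 2616) and builds a PUT request carrying it. The URL,
database name, document id, and revision are hypothetical placeholders,
and the request is only constructed here, not sent. A small helper also
converts the base64 value to the hex form that md5sum prints, which is
handy when comparing against a local file.

```python
import base64
import hashlib
import urllib.request


def content_md5(body: bytes) -> str:
    """Base64 of the 16-byte MD5 digest, as used by the Content-MD5 header."""
    return base64.b64encode(hashlib.md5(body).digest()).decode("ascii")


def md5_b64_to_hex(value: str) -> str:
    """Convert a base64 Content-MD5 value to the hex form printed by md5sum."""
    return base64.b64decode(value).hex()


def build_attachment_put(
    url: str, body: bytes, content_type: str
) -> urllib.request.Request:
    """Build (but do not send) a PUT request for a standalone attachment."""
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": content_type, "Content-MD5": content_md5(body)},
    )


# Hypothetical database, document id, attachment name, and revision.
req = build_attachment_put(
    "http://localhost:5984/scripts/deploy/deploy.pl?rev=1-abc",
    b"print \"hello\\n\";\n",
    "application/x-perl",
)
print(req.get_method(), req.full_url)
print(req.header_items())
```

Passing the request to urllib.request.urlopen would send it; a mismatch
between the header and the body should then surface as an HTTP error
rather than a silently corrupted attachment.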

Jens, I'm not familiar with that optimization but, if it exists, it came
after I exposed the MD5 in this manner. The only place I think the
replicator is involved is that, by emitting this information, the
replicator validates that attachments aren't corrupted in transit.


On 2 November 2012 00:05, Jens Alfke <je...@couchbase.com> wrote:

>
> On Nov 1, 2012, at 1:49 PM, "Mclean, Adam" <ad...@rbc.com> wrote:
>
> > The digest produced by the file upload is key to this working for me so
> > I'm not replacing files that are already the same in couch.  I've been
>
> IMHO you should not try to interpret the contents of the attachment
> ‘digest’ property. It’s mostly meant as an optimization for the replicator,
> not as a user feature. Don’t assume that it consists of the string “md5-”
> followed by a hex MD5 digest of the actual attachment contents. As you’ve
> seen, this isn’t true for compressed attachments. It’s even more untrue for
> attachments on TouchDB, which uses a SHA1 digest instead.
>
> If you need to track the identities of attachments using a digest, it
> would be safer to add your own digest property to the document, so that you
> have control over how it’s generated.
>
> —Jens

Re: Storing text/plain attachments and digest

Posted by Jens Alfke <je...@couchbase.com>.
On Nov 1, 2012, at 1:49 PM, "Mclean, Adam" <ad...@rbc.com> wrote:

> The digest produced by the file upload is key to this working for me so
> I'm not replacing files that are already the same in couch.  I've been

IMHO you should not try to interpret the contents of the attachment ‘digest’ property. It’s mostly meant as an optimization for the replicator, not as a user feature. Don’t assume that it consists of the string “md5-” followed by a hex MD5 digest of the actual attachment contents. As you’ve seen, this isn’t true for compressed attachments. It’s even more untrue for attachments on TouchDB, which uses a SHA1 digest instead.

If you need to track the identities of attachments using a digest, it would be safer to add your own digest property to the document, so that you have control over how it’s generated.
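
One possible shape for this, sketched in Python: keep a per-attachment digest in a field of the document and compare it against the local file before uploading. The field name content_sha256, the document id, and the choice of SHA-256 are all assumptions for illustration; nothing here is a CouchDB feature, it is just application-level bookkeeping on a plain dict.

```python
import hashlib

DIGEST_FIELD = "content_sha256"  # hypothetical field name, chosen by the application


def file_digest(data: bytes) -> str:
    """Hex SHA-256 of the uncompressed attachment bytes."""
    return hashlib.sha256(data).hexdigest()


def record_digest(doc: dict, attachment_name: str, data: bytes) -> dict:
    """Return a copy of doc with the digest of data recorded for attachment_name."""
    updated = dict(doc)
    digests = dict(updated.get(DIGEST_FIELD, {}))
    digests[attachment_name] = file_digest(data)
    updated[DIGEST_FIELD] = digests
    return updated


def needs_upload(doc: dict, attachment_name: str, data: bytes) -> bool:
    """True when the local bytes differ from what the document says is stored."""
    return doc.get(DIGEST_FIELD, {}).get(attachment_name) != file_digest(data)


doc = {"_id": "deploy-scripts"}  # hypothetical document
script = b"#!/usr/bin/perl\nprint \"hello\\n\";\n"

doc = record_digest(doc, "deploy.pl", script)
print(needs_upload(doc, "deploy.pl", script))              # False: unchanged file
print(needs_upload(doc, "deploy.pl", script + b"# edit"))  # True: modified file
```

Because the digest is computed over the original bytes by the client, it is unaffected by how the server chooses to store or compress the attachment.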

—Jens