Posted to dev@couchdb.apache.org by Robert Newson <ro...@gmail.com> on 2011/01/26 14:20:40 UTC

Next-generation attachment storage.

All,

Most of you know that I'm currently working on 'external attachments'.
I've spent quite some time reading and modifying the current code and
have tried several approaches to the problem. I've implemented one
version fairly completely
(https://github.com/rnewson/couchdb/tree/external_attachments) which
places any attachment over a threshold (defaulting to 256 KB) into a
separate file (as well as any attachment sent with chunked transfer
encoding). This branch works
for PUT/GET/DELETE, local and remote replication and compaction.
External attachments do not support compression or ranges yet.

At this point, I'd like to get some feedback. I don't believe
file-per-attachment is a solution that works for everyone but it was
necessary to make a choice in order to understand how to integrate any
kind of external attachment into couchdb.

So, here's my real proposal for CouchDB 1.2 (or 2.0?):

Attachments are stored contiguously in compound files following a
simplified form of Haystack
(http://www.facebook.com/note.php?note_id=76191543919). I won't
describe Haystack in detail as the article covers it, and it's not
exactly what we need (the indexes, for example, are pointless, given
we have a database). The basic idea is we have a small number of files
that we append to, the limit of concurrency being the number of files
(i.e., we will not interleave attachments in these files).
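
To make that concrete, here's a rough Erlang sketch of appending one
attachment ("needle", in Haystack terms) to such a file. The record and
function names are illustrative only, not code from the branch:

    %% Hypothetical sketch only -- record and function names are
    %% illustrative, not part of the external_attachments branch.
    -record(needle, {
        key,     % e.g. {DbName, DocId, AttName}
        cookie,  % random value checked on read (see the Haystack paper)
        data     % attachment bytes, written contiguously
    }).

    %% Append one attachment to the end of an open haystack file and
    %% return the offset the #att record would need to remember.
    append_needle(Fd, #needle{key = Key, cookie = Cookie, data = Data}) ->
        {ok, Offset} = file:position(Fd, eof),
        Header = term_to_binary({Key, Cookie, iolist_size(Data)}),
        ok = file:write(Fd, [<<(byte_size(Header)):32>>, Header, Data]),
        {ok, Offset}.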

There are several consequences to this:

Pro
1) we can remove the 4k blocking in .couch files.
2) .couch files are smaller, improving all i/o operations (especially
compaction).
3) we can use more efficient primitives (like sendfile) to fetch attachments.

Con
1) haystack files need compaction (though this involves no seeking so
should be far better than .couch compaction)
2) more file descriptors
3) .couch files are no longer self-contained (complicating backup
schemes, migration)

I had originally planned for each database to have exclusive access to
N haystack files (N is configurable, of course) since this aids with
backups. However, another compelling option is to have N haystack
files for all databases. This reduces the number of file descriptors
needed, but complicates backup (we'd probably have to write a tool to
extract matching attachments).

I've rushed through that rather breezily; I apologize. I've been
thinking about this for quite some time so I likely have answers to
most questions on this.

B.

Re: Next-generation attachment storage.

Posted by Paul Davis <pa...@gmail.com>.
On Wed, Jan 26, 2011 at 9:47 AM, Robert Newson <ro...@gmail.com> wrote:
> Luwak looks very interesting, thanks!
>
> As I noted originally, the harder part of the work is integrated in
> with couchdb and/or replacing the current attachment code entirely
> (which is my preference), so I went with the simplest approach to
> externalizing attachments (one attachment per file).
>
> The issue of synchronizing the data between the two storage systems
> needs some careful thought. My current approach is to put data into
> the attachment store (whether haystack, luwak or custom) with a
> 'provisional' marker. After we write_and_commit, we go back and mark
> it as final. We do something similar for removal ('provisionally
> removed' -> 'removed'). This will allow us, in most circumstances, to
> know the status of an item in the attachment store without
> cross-referencing it with couchdb. This will be important when
> compacting the attachment storage files (necessary in haystack, no
> clue yet for luwak).
>

I'm going to make a simplifying assumption of a haystack layout with a
single file, but this should extend easily to multiple files.

All we should need to do to keep the files in sync is ensure that
anything referenced by the main database has been synced to disk
before we sync the main db. So, if the database keeps a bit of state
(say, the last byte in the haystack file that it references, assuming
append-only attachments) and haystack keeps the last byte of the file
that it has run an fsync on, then each time we sync the main db we can
sync haystack efficiently (i.e., only when necessary). For multiple
files, it's just an integer per file.
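
A rough Erlang sketch of that bookkeeping (the names are made up, this
isn't existing code):

    %% Hypothetical sketch of the watermark idea; not existing couchdb
    %% code. 'synced' is the last haystack byte known to be on disk.
    -record(store, {fd, synced = 0}).

    %% ReferencedUpTo: last haystack byte referenced by the db header
    %% about to be committed. Only fsync the haystack file when the
    %% header would otherwise reference bytes that may not be durable.
    ensure_synced(#store{fd = Fd, synced = Synced} = St, ReferencedUpTo)
            when ReferencedUpTo > Synced ->
        ok = file:sync(Fd),
        {ok, St#store{synced = ReferencedUpTo}};
    ensure_synced(St, _ReferencedUpTo) ->
        {ok, St}.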

The part that I like about this is that it's almost identical to what
we're doing now, except that today the append-only strategy makes it
impossible to sync a db header that indirectly references data beyond
the end of the file we're writing the header to.

When we start talking about pre-sync/post-sync I start thinking
"failure permutation".

> B.
>
> On Wed, Jan 26, 2011 at 2:35 PM, Benoit Chesneau <bc...@gmail.com> wrote:
>> On Wed, Jan 26, 2011 at 2:20 PM, Robert Newson <ro...@gmail.com> wrote:
>>> All,
>>>
>>> Most of you know that I'm currently working on 'external attachments'.
>>> I've spent quite some time reading and modifying the current code and
>>> have tried several approaches to the problem. I've implemented one
>>> version fairly completely
>>> (https://github.com/rnewson/couchdb/tree/external_attachments) which
>>> places any attachment over a threshold (defaulting to 256 kb) into a
>>> separate file (and all files that are sent chunked). This branch works
>>> for PUT/GET/DELETE, local and remote replication and compaction.
>>> External attachments do not support compression or ranges yet.
>>>
>>> At this point, I'd like to get some feedback. I don't believe
>>> file-per-attachment is a solution that works for everyone but it was
>>> necessary to make a choice in order to understand how to integrate any
>>> kind of external attachment into couchdb.
>>>
>>> So, here's my real proposal for CouchDB 1.2 (or 2.0?);
>>>
>>> Attachments are stored contiguously in compound files following a
>>> simplified form of Haystack
>>> (http://www.facebook.com/note.php?note_id=76191543919). I won't
>>> describe Haystack in detail as the article covers it, and it's not
>>> exactly what we need (the indexes, for example, are pointless, given
>>> we have a database). The basic idea is we have a small number of files
>>> that we append to, the limit of concurrency being the number of files
>>> (i.e, we will not interleave attachments in these files).
>>>
>>> There are several consequences to this;
>>>
>>> Pro
>>> 1) we can remove the 4k blocking in .couch files.
>>> 2) .couch files are smaller, improving all i/o operations (especially
>>> compaction).
>>
>>> 3) we can use more efficient primitives (like sendfile) to fetch attachments.
>>>
>>> Con
>>> 1) haystack files need compaction (though this involves no seeking so
>>> should be far better than .couch compaction)
>>> 2) more file descriptors
>>> 3) .couch files are no longer self-contained (complicating backup
>>> schemes, migration)
>>>
>>> I had originally planned for each database to have exclusive access to
>>> N haystack files (N is configurable, of course) since this aids with
>>> backups. However, another compelling option is to have N haystack
>>> files for all databases. This reduces the number of file descriptors
>>> needed, but complicates backup (we'd probably have to write a tool to
>>> extract matching attachments).
>>>
>>
>> I would go for one file / db, so we could remove attachments in the
>> same time we delete a db.
>>
>> The CONS about that is that we can't share attachements between db if
>> their signatures are the same. Another way would be to maintain an
>> index of attachements / dbs so we could remove then if they don't
>> appear to any other db after one have been removed.
>>
>>
>>
>>
>>> I've rushed through that rather breezily, I apologize. I've been
>>> thinking about this for quite some time so I likely have answers to
>>> most questions on this.
>>>
>>> B.
>>>
>>
>> That's a good idea anyway. Also did you have a look in luwak from basho ?
>> https://github.com/basho/luwak
>>
>> I know that's the implementation is different but I like the idea to
>> reuse the db to put attachements / chunks. So we could imagine to
>> dispatch chunks as we do for docs on cluster solutions. We could also
>> imagine to handle metadatas.
>>
>> - benoit
>>
>

Re: Next-generation attachment storage.

Posted by Robert Newson <ro...@gmail.com>.
Luwak looks very interesting, thanks!

As I noted originally, the harder part of the work is integrating it
with couchdb and/or replacing the current attachment code entirely
(which is my preference), so I went with the simplest approach to
externalizing attachments (one attachment per file).

The issue of synchronizing the data between the two storage systems
needs some careful thought. My current approach is to put data into
the attachment store (whether haystack, luwak or custom) with a
'provisional' marker. After we write_and_commit, we go back and mark
it as final. We do something similar for removal ('provisionally
removed' -> 'removed'). This will allow us, in most circumstances, to
know the status of an item in the attachment store without
cross-referencing it with couchdb. This will be important when
compacting the attachment storage files (necessary in haystack, no
clue yet for luwak).
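
Sketched as a tiny state machine (the module and state names below are
illustrative only, nothing is committed), the life cycle described above
would be:

    %% Illustrative sketch of the provisional/final life cycle described
    %% above; module, function and state names are hypothetical.
    -module(att_lifecycle).
    -export([next_state/2]).

    next_state(provisional, commit)           -> final;
    next_state(final, remove)                 -> provisionally_removed;
    next_state(provisionally_removed, commit) -> removed.

    %% A compactor for the attachment store can then keep 'final'
    %% entries and drop 'removed' ones without cross-referencing
    %% couchdb, as described above.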

B.

On Wed, Jan 26, 2011 at 2:35 PM, Benoit Chesneau <bc...@gmail.com> wrote:
> On Wed, Jan 26, 2011 at 2:20 PM, Robert Newson <ro...@gmail.com> wrote:
>> All,
>>
>> Most of you know that I'm currently working on 'external attachments'.
>> I've spent quite some time reading and modifying the current code and
>> have tried several approaches to the problem. I've implemented one
>> version fairly completely
>> (https://github.com/rnewson/couchdb/tree/external_attachments) which
>> places any attachment over a threshold (defaulting to 256 kb) into a
>> separate file (and all files that are sent chunked). This branch works
>> for PUT/GET/DELETE, local and remote replication and compaction.
>> External attachments do not support compression or ranges yet.
>>
>> At this point, I'd like to get some feedback. I don't believe
>> file-per-attachment is a solution that works for everyone but it was
>> necessary to make a choice in order to understand how to integrate any
>> kind of external attachment into couchdb.
>>
>> So, here's my real proposal for CouchDB 1.2 (or 2.0?);
>>
>> Attachments are stored contiguously in compound files following a
>> simplified form of Haystack
>> (http://www.facebook.com/note.php?note_id=76191543919). I won't
>> describe Haystack in detail as the article covers it, and it's not
>> exactly what we need (the indexes, for example, are pointless, given
>> we have a database). The basic idea is we have a small number of files
>> that we append to, the limit of concurrency being the number of files
>> (i.e, we will not interleave attachments in these files).
>>
>> There are several consequences to this;
>>
>> Pro
>> 1) we can remove the 4k blocking in .couch files.
>> 2) .couch files are smaller, improving all i/o operations (especially
>> compaction).
>
>> 3) we can use more efficient primitives (like sendfile) to fetch attachments.
>>
>> Con
>> 1) haystack files need compaction (though this involves no seeking so
>> should be far better than .couch compaction)
>> 2) more file descriptors
>> 3) .couch files are no longer self-contained (complicating backup
>> schemes, migration)
>>
>> I had originally planned for each database to have exclusive access to
>> N haystack files (N is configurable, of course) since this aids with
>> backups. However, another compelling option is to have N haystack
>> files for all databases. This reduces the number of file descriptors
>> needed, but complicates backup (we'd probably have to write a tool to
>> extract matching attachments).
>>
>
> I would go for one file / db, so we could remove attachments in the
> same time we delete a db.
>
> The CONS about that is that we can't share attachements between db if
> their signatures are the same. Another way would be to maintain an
> index of attachements / dbs so we could remove then if they don't
> appear to any other db after one have been removed.
>
>
>
>
>> I've rushed through that rather breezily, I apologize. I've been
>> thinking about this for quite some time so I likely have answers to
>> most questions on this.
>>
>> B.
>>
>
> That's a good idea anyway. Also did you have a look in luwak from basho ?
> https://github.com/basho/luwak
>
> I know that's the implementation is different but I like the idea to
> reuse the db to put attachements / chunks. So we could imagine to
> dispatch chunks as we do for docs on cluster solutions. We could also
> imagine to handle metadatas.
>
> - benoit
>

Re: Next-generation attachment storage.

Posted by Paul Davis <pa...@gmail.com>.
On Wed, Jan 26, 2011 at 10:34 AM, Robert Newson <ro...@gmail.com> wrote:
> Agree completely that commingled attachment files would not be an
> appropriate default. However, managing a fixed number of very large
> (e.g, 200 Gib) files full of attachment data would work well in a
> hosted service. Obviously the code would have to be solid to prevent
> the kind of data disclosure problems you mention.

No, not just as a default; I'm saying "not in the release tarball or in
any method, shape, or form signaled as supported by Apache CouchDB".
If hosting groups want to write and implement this, I think that'd be
just fine.

> The haystack paper covers this btw. Each entry has a random cookie
> value stored with it, you need to present the same value for the read
> to succeed. The cookie could be stored in the #att record. Obviously
> it still requires the code to verify the cookie and restrict the read
> only to the bytes covered by that item, but that's a code quality
> thing and should be easy enough to review.
>

The issue here is that I just assume that there will be a bug in the
code that leaks information across databases. So the question is
whether we make the bet that we can prevent it from happening for the
next 15 years until some whippersnapper db comes along and replaces
us. The reason I'd be against including multi-tenant files is that I
see it as requiring the same amount of effort as if it were the only
supported option. It's just not ok for dbs to have that leakage as a
possible failure condition, IMO.

There's also the part about information leakage via timing attacks
and so forth that I don't see as surmountable.

> B.
>
> On Wed, Jan 26, 2011 at 3:23 PM, Paul Davis <pa...@gmail.com> wrote:
>> On Wed, Jan 26, 2011 at 9:35 AM, Benoit Chesneau <bc...@gmail.com> wrote:
>>> On Wed, Jan 26, 2011 at 2:20 PM, Robert Newson <ro...@gmail.com> wrote:
>>>> All,
>>>>
>>>> Most of you know that I'm currently working on 'external attachments'.
>>>> I've spent quite some time reading and modifying the current code and
>>>> have tried several approaches to the problem. I've implemented one
>>>> version fairly completely
>>>> (https://github.com/rnewson/couchdb/tree/external_attachments) which
>>>> places any attachment over a threshold (defaulting to 256 kb) into a
>>>> separate file (and all files that are sent chunked). This branch works
>>>> for PUT/GET/DELETE, local and remote replication and compaction.
>>>> External attachments do not support compression or ranges yet.
>>>>
>>>> At this point, I'd like to get some feedback. I don't believe
>>>> file-per-attachment is a solution that works for everyone but it was
>>>> necessary to make a choice in order to understand how to integrate any
>>>> kind of external attachment into couchdb.
>>>>
>>>> So, here's my real proposal for CouchDB 1.2 (or 2.0?);
>>>>
>>>> Attachments are stored contiguously in compound files following a
>>>> simplified form of Haystack
>>>> (http://www.facebook.com/note.php?note_id=76191543919). I won't
>>>> describe Haystack in detail as the article covers it, and it's not
>>>> exactly what we need (the indexes, for example, are pointless, given
>>>> we have a database). The basic idea is we have a small number of files
>>>> that we append to, the limit of concurrency being the number of files
>>>> (i.e, we will not interleave attachments in these files).
>>>>
>>>> There are several consequences to this;
>>>>
>>>> Pro
>>>> 1) we can remove the 4k blocking in .couch files.
>>>> 2) .couch files are smaller, improving all i/o operations (especially
>>>> compaction).
>>>
>>>> 3) we can use more efficient primitives (like sendfile) to fetch attachments.
>>>>
>>>> Con
>>>> 1) haystack files need compaction (though this involves no seeking so
>>>> should be far better than .couch compaction)
>>>> 2) more file descriptors
>>>> 3) .couch files are no longer self-contained (complicating backup
>>>> schemes, migration)
>>>>
>>>> I had originally planned for each database to have exclusive access to
>>>> N haystack files (N is configurable, of course) since this aids with
>>>> backups. However, another compelling option is to have N haystack
>>>> files for all databases. This reduces the number of file descriptors
>>>> needed, but complicates backup (we'd probably have to write a tool to
>>>> extract matching attachments).
>>>>
>>>
>>> I would go for one file / db, so we could remove attachments in the
>>> same time we delete a db.
>>>
>>> The CONS about that is that we can't share attachements between db if
>>> their signatures are the same. Another way would be to maintain an
>>> index of attachements / dbs so we could remove then if they don't
>>> appear to any other db after one have been removed.
>>>
>>>
>>>
>>>
>>>> I've rushed through that rather breezily, I apologize. I've been
>>>> thinking about this for quite some time so I likely have answers to
>>>> most questions on this.
>>>>
>>>> B.
>>>>
>>>
>>> That's a good idea anyway. Also did you have a look in luwak from basho ?
>>> https://github.com/basho/luwak
>>>
>>> I know that's the implementation is different but I like the idea to
>>> reuse the db to put attachements / chunks. So we could imagine to
>>> dispatch chunks as we do for docs on cluster solutions. We could also
>>> imagine to handle metadatas.
>>>
>>> - benoit
>>>
>>
>> Another bit that Bob2 didn't mention was the idea of making this a
>> pluggable API so that we can have a couple implementations that are
>> configurable. For instance, Benoit's idea for a single file of
>> interleaved attachments or the haystack approach with multiple files
>> that keep attachments in contiguous chunks.
>>
>> As to sharing attachments between db's, I would be hugely hugely
>> against releasing that as part of an actual release as there are a
>> *lot* of downsides in how that would open us up for bad failure
>> conditions. Ie, things like sending attachments from different db's by
>> accident or or what not. Also, in shared tenant situations it seems
>> like it'd be a prime suspect for information leakage and such forth.
>> But I digress.
>>
>

Re: Next-generation attachment storage.

Posted by Robert Newson <ro...@gmail.com>.
Agree completely that commingled attachment files would not be an
appropriate default. However, managing a fixed number of very large
(e.g., 200 GiB) files full of attachment data would work well in a
hosted service. Obviously the code would have to be solid to prevent
the kind of data disclosure problems you mention.

The haystack paper covers this, btw. Each entry has a random cookie
value stored with it; you need to present the same value for the read
to succeed. The cookie could be stored in the #att record. Obviously
it still requires the code to verify the cookie and restrict the read
to only the bytes covered by that item, but that's a code quality
thing and should be easy enough to review.
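
A minimal sketch of such a check (the function names are mine, not the
branch's):

    %% Hypothetical sketch of the cookie check; not code from the
    %% branch. The cookie is generated when the needle is written and a
    %% copy is kept in the #att record; a read must present the matching
    %% value and is limited to the needle's own byte range.
    new_cookie() ->
        crypto:strong_rand_bytes(16).

    read_needle(Fd, Offset, Size, StoredCookie, PresentedCookie)
            when PresentedCookie =:= StoredCookie ->
        file:pread(Fd, Offset, Size);
    read_needle(_Fd, _Offset, _Size, _StoredCookie, _PresentedCookie) ->
        {error, not_found}.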

B.

On Wed, Jan 26, 2011 at 3:23 PM, Paul Davis <pa...@gmail.com> wrote:
> On Wed, Jan 26, 2011 at 9:35 AM, Benoit Chesneau <bc...@gmail.com> wrote:
>> On Wed, Jan 26, 2011 at 2:20 PM, Robert Newson <ro...@gmail.com> wrote:
>>> All,
>>>
>>> Most of you know that I'm currently working on 'external attachments'.
>>> I've spent quite some time reading and modifying the current code and
>>> have tried several approaches to the problem. I've implemented one
>>> version fairly completely
>>> (https://github.com/rnewson/couchdb/tree/external_attachments) which
>>> places any attachment over a threshold (defaulting to 256 kb) into a
>>> separate file (and all files that are sent chunked). This branch works
>>> for PUT/GET/DELETE, local and remote replication and compaction.
>>> External attachments do not support compression or ranges yet.
>>>
>>> At this point, I'd like to get some feedback. I don't believe
>>> file-per-attachment is a solution that works for everyone but it was
>>> necessary to make a choice in order to understand how to integrate any
>>> kind of external attachment into couchdb.
>>>
>>> So, here's my real proposal for CouchDB 1.2 (or 2.0?);
>>>
>>> Attachments are stored contiguously in compound files following a
>>> simplified form of Haystack
>>> (http://www.facebook.com/note.php?note_id=76191543919). I won't
>>> describe Haystack in detail as the article covers it, and it's not
>>> exactly what we need (the indexes, for example, are pointless, given
>>> we have a database). The basic idea is we have a small number of files
>>> that we append to, the limit of concurrency being the number of files
>>> (i.e, we will not interleave attachments in these files).
>>>
>>> There are several consequences to this;
>>>
>>> Pro
>>> 1) we can remove the 4k blocking in .couch files.
>>> 2) .couch files are smaller, improving all i/o operations (especially
>>> compaction).
>>
>>> 3) we can use more efficient primitives (like sendfile) to fetch attachments.
>>>
>>> Con
>>> 1) haystack files need compaction (though this involves no seeking so
>>> should be far better than .couch compaction)
>>> 2) more file descriptors
>>> 3) .couch files are no longer self-contained (complicating backup
>>> schemes, migration)
>>>
>>> I had originally planned for each database to have exclusive access to
>>> N haystack files (N is configurable, of course) since this aids with
>>> backups. However, another compelling option is to have N haystack
>>> files for all databases. This reduces the number of file descriptors
>>> needed, but complicates backup (we'd probably have to write a tool to
>>> extract matching attachments).
>>>
>>
>> I would go for one file / db, so we could remove attachments in the
>> same time we delete a db.
>>
>> The CONS about that is that we can't share attachements between db if
>> their signatures are the same. Another way would be to maintain an
>> index of attachements / dbs so we could remove then if they don't
>> appear to any other db after one have been removed.
>>
>>
>>
>>
>>> I've rushed through that rather breezily, I apologize. I've been
>>> thinking about this for quite some time so I likely have answers to
>>> most questions on this.
>>>
>>> B.
>>>
>>
>> That's a good idea anyway. Also did you have a look in luwak from basho ?
>> https://github.com/basho/luwak
>>
>> I know that's the implementation is different but I like the idea to
>> reuse the db to put attachements / chunks. So we could imagine to
>> dispatch chunks as we do for docs on cluster solutions. We could also
>> imagine to handle metadatas.
>>
>> - benoit
>>
>
> Another bit that Bob2 didn't mention was the idea of making this a
> pluggable API so that we can have a couple implementations that are
> configurable. For instance, Benoit's idea for a single file of
> interleaved attachments or the haystack approach with multiple files
> that keep attachments in contiguous chunks.
>
> As to sharing attachments between db's, I would be hugely hugely
> against releasing that as part of an actual release as there are a
> *lot* of downsides in how that would open us up for bad failure
> conditions. Ie, things like sending attachments from different db's by
> accident or or what not. Also, in shared tenant situations it seems
> like it'd be a prime suspect for information leakage and such forth.
> But I digress.
>

Re: Next-generation attachment storage.

Posted by Paul Davis <pa...@gmail.com>.
On Wed, Jan 26, 2011 at 9:35 AM, Benoit Chesneau <bc...@gmail.com> wrote:
> On Wed, Jan 26, 2011 at 2:20 PM, Robert Newson <ro...@gmail.com> wrote:
>> All,
>>
>> Most of you know that I'm currently working on 'external attachments'.
>> I've spent quite some time reading and modifying the current code and
>> have tried several approaches to the problem. I've implemented one
>> version fairly completely
>> (https://github.com/rnewson/couchdb/tree/external_attachments) which
>> places any attachment over a threshold (defaulting to 256 kb) into a
>> separate file (and all files that are sent chunked). This branch works
>> for PUT/GET/DELETE, local and remote replication and compaction.
>> External attachments do not support compression or ranges yet.
>>
>> At this point, I'd like to get some feedback. I don't believe
>> file-per-attachment is a solution that works for everyone but it was
>> necessary to make a choice in order to understand how to integrate any
>> kind of external attachment into couchdb.
>>
>> So, here's my real proposal for CouchDB 1.2 (or 2.0?);
>>
>> Attachments are stored contiguously in compound files following a
>> simplified form of Haystack
>> (http://www.facebook.com/note.php?note_id=76191543919). I won't
>> describe Haystack in detail as the article covers it, and it's not
>> exactly what we need (the indexes, for example, are pointless, given
>> we have a database). The basic idea is we have a small number of files
>> that we append to, the limit of concurrency being the number of files
>> (i.e, we will not interleave attachments in these files).
>>
>> There are several consequences to this;
>>
>> Pro
>> 1) we can remove the 4k blocking in .couch files.
>> 2) .couch files are smaller, improving all i/o operations (especially
>> compaction).
>
>> 3) we can use more efficient primitives (like sendfile) to fetch attachments.
>>
>> Con
>> 1) haystack files need compaction (though this involves no seeking so
>> should be far better than .couch compaction)
>> 2) more file descriptors
>> 3) .couch files are no longer self-contained (complicating backup
>> schemes, migration)
>>
>> I had originally planned for each database to have exclusive access to
>> N haystack files (N is configurable, of course) since this aids with
>> backups. However, another compelling option is to have N haystack
>> files for all databases. This reduces the number of file descriptors
>> needed, but complicates backup (we'd probably have to write a tool to
>> extract matching attachments).
>>
>
> I would go for one file / db, so we could remove attachments in the
> same time we delete a db.
>
> The CONS about that is that we can't share attachements between db if
> their signatures are the same. Another way would be to maintain an
> index of attachements / dbs so we could remove then if they don't
> appear to any other db after one have been removed.
>
>
>
>
>> I've rushed through that rather breezily, I apologize. I've been
>> thinking about this for quite some time so I likely have answers to
>> most questions on this.
>>
>> B.
>>
>
> That's a good idea anyway. Also did you have a look in luwak from basho ?
> https://github.com/basho/luwak
>
> I know that's the implementation is different but I like the idea to
> reuse the db to put attachements / chunks. So we could imagine to
> dispatch chunks as we do for docs on cluster solutions. We could also
> imagine to handle metadatas.
>
> - benoit
>

Another bit that Bob2 didn't mention was the idea of making this a
pluggable API so that we can have a couple implementations that are
configurable. For instance, Benoit's idea for a single file of
interleaved attachments or the haystack approach with multiple files
that keep attachments in contiguous chunks.
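
For illustration, in today's Erlang a pluggable store could be declared
as a behaviour along these lines (the module and callback names are
made up, not an agreed API):

    %% Hypothetical sketch of a pluggable attachment-store behaviour;
    %% the module and callback names are not an agreed API.
    -module(couch_att_store).

    -callback open(DbName :: binary(), Options :: [term()]) ->
        {ok, Store :: term()} | {error, term()}.
    -callback write(Store :: term(), Att :: term(), Data :: iodata()) ->
        {ok, Pointer :: term()} | {error, term()}.
    -callback read(Store :: term(), Pointer :: term()) ->
        {ok, binary()} | {error, term()}.
    -callback delete(Store :: term(), Pointer :: term()) ->
        ok | {error, term()}.
    -callback compact(Store :: term()) ->
        ok | {error, term()}.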

As to sharing attachments between dbs, I would be hugely, hugely
against shipping that as part of an actual release, as there are a
*lot* of downsides in how it would open us up to bad failure
conditions, i.e. things like sending attachments from different dbs by
accident or whatnot. Also, in shared-tenant situations it seems like
it'd be a prime suspect for information leakage and so forth. But I
digress.

Re: Next-generation attachment storage.

Posted by Benoit Chesneau <bc...@gmail.com>.
On Wed, Jan 26, 2011 at 2:20 PM, Robert Newson <ro...@gmail.com> wrote:
> All,
>
> Most of you know that I'm currently working on 'external attachments'.
> I've spent quite some time reading and modifying the current code and
> have tried several approaches to the problem. I've implemented one
> version fairly completely
> (https://github.com/rnewson/couchdb/tree/external_attachments) which
> places any attachment over a threshold (defaulting to 256 kb) into a
> separate file (and all files that are sent chunked). This branch works
> for PUT/GET/DELETE, local and remote replication and compaction.
> External attachments do not support compression or ranges yet.
>
> At this point, I'd like to get some feedback. I don't believe
> file-per-attachment is a solution that works for everyone but it was
> necessary to make a choice in order to understand how to integrate any
> kind of external attachment into couchdb.
>
> So, here's my real proposal for CouchDB 1.2 (or 2.0?);
>
> Attachments are stored contiguously in compound files following a
> simplified form of Haystack
> (http://www.facebook.com/note.php?note_id=76191543919). I won't
> describe Haystack in detail as the article covers it, and it's not
> exactly what we need (the indexes, for example, are pointless, given
> we have a database). The basic idea is we have a small number of files
> that we append to, the limit of concurrency being the number of files
> (i.e, we will not interleave attachments in these files).
>
> There are several consequences to this;
>
> Pro
> 1) we can remove the 4k blocking in .couch files.
> 2) .couch files are smaller, improving all i/o operations (especially
> compaction).

> 3) we can use more efficient primitives (like sendfile) to fetch attachments.
>
> Con
> 1) haystack files need compaction (though this involves no seeking so
> should be far better than .couch compaction)
> 2) more file descriptors
> 3) .couch files are no longer self-contained (complicating backup
> schemes, migration)
>
> I had originally planned for each database to have exclusive access to
> N haystack files (N is configurable, of course) since this aids with
> backups. However, another compelling option is to have N haystack
> files for all databases. This reduces the number of file descriptors
> needed, but complicates backup (we'd probably have to write a tool to
> extract matching attachments).
>

I would go for one file per db, so we could remove the attachments at
the same time we delete a db.

The con is that we then can't share attachments between dbs when
their signatures are the same. Another way would be to maintain an
index of attachments per db, so we could remove an attachment once it
no longer appears in any other db after one of them has been removed.
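
A rough sketch of such an index (ETS-based; the module and table names
are made up):

    %% Hypothetical sketch of the attachment/db index described above;
    %% module and table names are made up. Each attachment signature is
    %% associated with the set of dbs that reference it, and the caller
    %% is told when the last reference disappears so the bytes can be
    %% reclaimed.
    -module(att_refs).
    -export([init/0, add/2, remove/2]).

    init() ->
        ets:new(att_refs, [named_table, set, public]).

    add(Sig, Db) ->
        true = ets:insert(att_refs, {{Sig, Db}}),
        ok.

    remove(Sig, Db) ->
        true = ets:delete(att_refs, {Sig, Db}),
        case ets:match(att_refs, {{Sig, '_'}}) of
            [] -> unreferenced;       %% safe to delete the bytes
            _  -> still_referenced
        end.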




> I've rushed through that rather breezily, I apologize. I've been
> thinking about this for quite some time so I likely have answers to
> most questions on this.
>
> B.
>

That's a good idea anyway. Also, did you have a look at luwak from basho?
https://github.com/basho/luwak

I know the implementation is different, but I like the idea of reusing
the db to store attachments / chunks. We could then imagine
dispatching chunks the way we do docs in cluster solutions. We could
also imagine handling metadata.

- benoit

Re: Next-generation attachment storage.

Posted by Paul Davis <pa...@gmail.com>.
On Wed, Jan 26, 2011 at 2:36 PM, Robert Newson <ro...@gmail.com> wrote:
> Yup, that's what prompted the post, actually. I had started from the
> point of view that external attachments would be optional. That brings
> quite a lot of complexity, so I'm basically asking if there's any
> objections to moving wholly over to this new strategy?
>

I object, my lord. Also, I fancy your scarf.

http://www.youtube.com/watch?v=aE3gMN97TKw

> B.
>
> On Wed, Jan 26, 2011 at 7:28 PM, Paul Davis <pa...@gmail.com> wrote:
>> On Wed, Jan 26, 2011 at 2:27 PM, Randall Leeds <ra...@gmail.com> wrote:
>>> Just one note below for now.
>>>
>>> On Wed, Jan 26, 2011 at 05:20, Robert Newson <ro...@gmail.com> wrote:
>>>> All,
>>>>
>>>> Most of you know that I'm currently working on 'external attachments'.
>>>> I've spent quite some time reading and modifying the current code and
>>>> have tried several approaches to the problem. I've implemented one
>>>> version fairly completely
>>>> (https://github.com/rnewson/couchdb/tree/external_attachments) which
>>>> places any attachment over a threshold (defaulting to 256 kb) into a
>>>> separate file (and all files that are sent chunked). This branch works
>>>> for PUT/GET/DELETE, local and remote replication and compaction.
>>>> External attachments do not support compression or ranges yet.
>>>
>>> [snip]
>>>
>>>> Pro
>>>> 1) we can remove the 4k blocking in .couch files.
>>>
>>> Not if we have a threshold for external storage. Only if *all*
>>> attachments are external.
>>> Yes?
>>>
>>> -Randall
>>>
>>
>> Yes.
>>
>

Re: Next-generation attachment storage.

Posted by Benoit Chesneau <bc...@gmail.com>.
On Wed, Jan 26, 2011 at 8:36 PM, Robert Newson <ro...@gmail.com> wrote:
> Yup, that's what prompted the post, actually. I had started from the
> point of view that external attachments would be optional. That brings
> quite a lot of complexity, so I'm basically asking if there's any
> objections to moving wholly over to this new strategy?
>
> B.
>

Not sure what the strategy is.

I think I'm +1 on first making the attachment system pluggable. Then
people could hack on it.

- benoit

Re: Next-generation attachment storage.

Posted by Robert Newson <ro...@gmail.com>.
Yup, that's what prompted the post, actually. I had started from the
point of view that external attachments would be optional. That brings
quite a lot of complexity, so I'm basically asking if there are any
objections to moving wholly over to this new strategy?

B.

On Wed, Jan 26, 2011 at 7:28 PM, Paul Davis <pa...@gmail.com> wrote:
> On Wed, Jan 26, 2011 at 2:27 PM, Randall Leeds <ra...@gmail.com> wrote:
>> Just one note below for now.
>>
>> On Wed, Jan 26, 2011 at 05:20, Robert Newson <ro...@gmail.com> wrote:
>>> All,
>>>
>>> Most of you know that I'm currently working on 'external attachments'.
>>> I've spent quite some time reading and modifying the current code and
>>> have tried several approaches to the problem. I've implemented one
>>> version fairly completely
>>> (https://github.com/rnewson/couchdb/tree/external_attachments) which
>>> places any attachment over a threshold (defaulting to 256 kb) into a
>>> separate file (and all files that are sent chunked). This branch works
>>> for PUT/GET/DELETE, local and remote replication and compaction.
>>> External attachments do not support compression or ranges yet.
>>
>> [snip]
>>
>>> Pro
>>> 1) we can remove the 4k blocking in .couch files.
>>
>> Not if we have a threshold for external storage. Only if *all*
>> attachments are external.
>> Yes?
>>
>> -Randall
>>
>
> Yes.
>

Re: Next-generation attachment storage.

Posted by Paul Davis <pa...@gmail.com>.
On Wed, Jan 26, 2011 at 2:27 PM, Randall Leeds <ra...@gmail.com> wrote:
> Just one note below for now.
>
> On Wed, Jan 26, 2011 at 05:20, Robert Newson <ro...@gmail.com> wrote:
>> All,
>>
>> Most of you know that I'm currently working on 'external attachments'.
>> I've spent quite some time reading and modifying the current code and
>> have tried several approaches to the problem. I've implemented one
>> version fairly completely
>> (https://github.com/rnewson/couchdb/tree/external_attachments) which
>> places any attachment over a threshold (defaulting to 256 kb) into a
>> separate file (and all files that are sent chunked). This branch works
>> for PUT/GET/DELETE, local and remote replication and compaction.
>> External attachments do not support compression or ranges yet.
>
> [snip]
>
>> Pro
>> 1) we can remove the 4k blocking in .couch files.
>
> Not if we have a threshold for external storage. Only if *all*
> attachments are external.
> Yes?
>
> -Randall
>

Yes.

Re: Next-generation attachment storage.

Posted by Randall Leeds <ra...@gmail.com>.
Just one note below for now.

On Wed, Jan 26, 2011 at 05:20, Robert Newson <ro...@gmail.com> wrote:
> All,
>
> Most of you know that I'm currently working on 'external attachments'.
> I've spent quite some time reading and modifying the current code and
> have tried several approaches to the problem. I've implemented one
> version fairly completely
> (https://github.com/rnewson/couchdb/tree/external_attachments) which
> places any attachment over a threshold (defaulting to 256 kb) into a
> separate file (and all files that are sent chunked). This branch works
> for PUT/GET/DELETE, local and remote replication and compaction.
> External attachments do not support compression or ranges yet.

[snip]

> Pro
> 1) we can remove the 4k blocking in .couch files.

Not if we have a threshold for external storage. Only if *all*
attachments are external.
Yes?

-Randall