Posted to user@couchdb.apache.org by svilen <az...@svilendobrev.com> on 2013/03/25 21:13:35 UTC

huge attachments - experience?

Hi,
I need some form of synchronised storage for audio files (with
some metadata). Something like 10-400 MB per attachment, 1-10
attachments per doc, overall about 1000 docs, 3000 attachments,
300 GB total. Now I'm using just the plain filesystem, but it's a
pain to maintain consistency across several copies.

As I don't really need more than 1 version back, I'm playing with the
idea of using CouchDB for that: either putting the files in as
attachments, or, if that's not possible, using it as filesystem-mimicking
synchronised metadata, with appropriate listeners reacting on changes
(like rename, mv, etc.).

Any experiences with this? How does CouchDB work in such a
big-files scenario?

ciao
svilen

Re: huge attachments - experience?

Posted by Dave Cottlehuber <dc...@jsonified.com>.
On 25 March 2013 22:44, Jens Alfke <je...@couchbase.com> wrote:
>
> On Mar 25, 2013, at 1:13 PM, svilen <az...@svilendobrev.com> wrote:
>
> As I don't really need more than 1 version back, I'm playing with the
> idea of using CouchDB for that: either putting the files in as
> attachments, or, if that's not possible, using it as filesystem-mimicking
> synchronised metadata, with appropriate listeners reacting on changes
> (like rename, mv, etc.).

+1 to all that Jens & Nils said, with 2 more points.

If you store only metadata in couch, using a hash such as the MD5 of the
data instead of the actual filename, and then use that hash to point to
the stored files on disk, it's quite attractive. Renames and moves are
all internal to CouchDB, as the data itself hasn't changed. It will also
deduplicate itself if you have multiple copies (e.g. revisions of docs).
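
As a rough sketch of how that metadata-plus-hash layout could look (purely
illustrative; the database name, field names and use of the Python requests
library are assumptions, not anything from this thread):

import hashlib
import os
import requests  # assumed HTTP client; any CouchDB library would do

COUCH = "http://localhost:5984/audio_meta"  # hypothetical metadata database

def md5_of(path, chunk_size=1 << 20):
    """Stream the file through MD5 so huge files never load into RAM."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def register_file(path, metadata):
    """Store only metadata in CouchDB; the blob stays on disk, keyed by its hash."""
    digest = md5_of(path)
    doc = {
        "_id": "file:" + digest,               # content hash, not the filename
        "store_path": "blobs/" + digest,       # assumed on-disk layout
        "size": os.path.getsize(path),
        "original_name": os.path.basename(path),
        "metadata": metadata,
    }
    r = requests.put(COUCH + "/" + doc["_id"], json=doc)
    if r.status_code not in (201, 409):        # 409 = same content already registered
        r.raise_for_status()
    return digest

# e.g. register_file("/music/album1/01.flac", {"artist": "...", "album": "..."})

Renaming a track is then just an update to the doc; the blob on disk keeps
its hash name, so there is nothing new to re-sync.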

The downside of putting stuff outside couch is that you need to
manage the things you otherwise get for free:

- easy replication model
- deletion handling (how many docs still reference this file? should I
delete the file now that the document attachment was deleted? etc. -
see the view sketch after this list)
- streaming of data from within CouchDB
- inbuilt compression
- keeping replication partners in sync ("I don't need this doc anymore
but the others don't yet have the updated copy" type problems,
especially in a mesh replication topology)
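
For the deletion-handling point above, one way to answer "can I delete this
blob yet?" is a view keyed by the content hash. Again an illustrative sketch:
the file_hash field and design-doc name are made up for the example:

import requests  # assumed HTTP client, as in the sketch above

COUCH = "http://localhost:5984/audio_meta"  # hypothetical metadata database

# One row per (hash -> referencing doc), reduced with the built-in _count.
design_doc = {
    "_id": "_design/files",
    "views": {
        "refs_by_hash": {
            "map": "function(doc) { if (doc.file_hash) emit(doc.file_hash, null); }",
            "reduce": "_count",
        }
    },
}
r = requests.put(COUCH + "/_design/files", json=design_doc)
if r.status_code not in (201, 409):   # 409 = design doc already present
    r.raise_for_status()

def reference_count(digest):
    """0 means no doc references the blob any more, so it could be garbage-collected."""
    r = requests.get(
        COUCH + "/_design/files/_view/refs_by_hash",
        params={"key": '"%s"' % digest, "group": "true"},
    )
    rows = r.json().get("rows", [])
    return rows[0]["value"] if rows else 0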

The other nasty thing about attachments in couch is that if replication
fails part-way through, it can't be restarted from that point. And as
they're stored directly on disk, we duplicate that waste both on the
network and in storage inside the DB file. This may or may not be a
problem for your use case.

A+
Dave

Re: huge attachments - experience?

Posted by Jens Alfke <je...@couchbase.com>.
On Mar 25, 2013, at 1:13 PM, svilen <az...@svilendobrev.com> wrote:

As I don't really need more than 1 version back, I'm playing with the
idea of using CouchDB for that: either putting the files in as
attachments, or, if that's not possible, using it as filesystem-mimicking
synchronised metadata, with appropriate listeners reacting on changes
(like rename, mv, etc.).

This may get slow. CouchDB stores attachments inside the database file, so every time the database is compacted, all the still-valid attachments have to be copied over to the new file. (And if you don’t compact the database, you end up using space for every version of every attachment.)

TouchDB stores attachments as separate files in the filesystem. This means they don’t get copied during compaction. It stores them as files named after their SHA-1 digests, which also means you get some deduplication: if the database has multiple attachments with identical bodies, only one copy is stored.
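
To make the digest-named-files idea concrete, here is a small
content-addressed-store sketch along those lines (an illustration of the
scheme, not TouchDB's actual code; the store path is an assumption):

import hashlib
import os
import shutil

STORE = "/var/lib/blobstore"  # hypothetical flat directory of digest-named files

def store_blob(path):
    """Copy a file into the store under its SHA-1 digest; identical content dedups."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    digest = h.hexdigest()
    dest = os.path.join(STORE, digest)
    if not os.path.exists(dest):      # same bytes already stored -> nothing to copy
        shutil.copy2(path, dest)
    return digest                     # record this digest in the metadata doc

def blob_path(digest):
    """Resolve a digest recorded in a doc back to the on-disk file."""
    return os.path.join(STORE, digest)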

—Jens

Re: huge attachments - experience?

Posted by Brad Rhoads <bd...@gmail.com>.
We're doing a mobile (Android) digital library system and storing large
files (audio/video) in CouchDB. There's a CouchDB doc with the metadata,
and the corresponding file is attached to it. It seems to be working OK
so far, but we're just getting started.
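
For reference, attaching a media file to such a metadata doc over CouchDB's
standalone-attachment API looks roughly like this (a sketch; the database
name, doc id and use of the requests library are assumptions):

import requests  # assumed HTTP client

DB = "http://localhost:5984/library"  # hypothetical database URL

def attach_media(doc_id, metadata, media_path, content_type="audio/flac"):
    """Create the metadata doc, then stream the media file onto it as an attachment."""
    r = requests.put(DB + "/" + doc_id, json=metadata)
    r.raise_for_status()
    rev = r.json()["rev"]

    with open(media_path, "rb") as f:   # streamed, not base64-inlined in the doc
        r = requests.put(
            DB + "/" + doc_id + "/media?rev=" + rev,
            data=f,
            headers={"Content-Type": content_type},
        )
    r.raise_for_status()
    return r.json()["rev"]

# e.g. attach_media("track-0001", {"title": "..."}, "/sdcard/audio/track01.flac")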

---------------------------
www.maf.org/rhoads
www.ontherhoads.org


On Mon, Mar 25, 2013 at 2:13 PM, svilen <az...@svilendobrev.com> wrote:

> Hi,
> I need some form of synchronised storage for audio files (with
> some metadata). Something like 10-400 MB per attachment, 1-10
> attachments per doc, overall about 1000 docs, 3000 attachments,
> 300 GB total. Now I'm using just the plain filesystem, but it's a
> pain to maintain consistency across several copies.
>
> As I don't really need more than 1 version back, I'm playing with the
> idea of using CouchDB for that: either putting the files in as
> attachments, or, if that's not possible, using it as filesystem-mimicking
> synchronised metadata, with appropriate listeners reacting on changes
> (like rename, mv, etc.).
>
> Any experiences with this? How does CouchDB work in such a
> big-files scenario?
>
> ciao
> svilen
>

Re: huge attachments - experience?

Posted by svilen <az...@svilendobrev.com>.
Nils Breunese <N....@vpro.nl> wrote:
> 
> What's the reason that using something like DRBD+OCFS2 (or network
> filesystems in general) is not an option?
General laziness / lack of time, and no extra partitions/disks for that.

> svilen wrote:
> 
> > Yeah, I checked all those. They are post-factum file synchronizers -
> > they cannot manage directories (rename a dir => copy the whole dir
> > again). The only thing that manages directories is Bazaar, but it
> > chokes after 8-9 GB of data (out of memory). Anyway, it seems I need
> > to separate filesystem metadata (filenames, directories, etc.) from
> > the actual raw data files, e.g. a fancy hierarchy of hardlinks
> > pointing to flat-dir files, and apply separate syncs to those two.
> > But I should think it out first.

Re: huge attachments - experience?

Posted by Nils Breunese <N....@vpro.nl>.
svilen wrote:

> Yeah, I checked all those. They are post-factum file synchronizers -
> they cannot manage directories (rename a dir => copy the whole dir
> again). The only thing that manages directories is Bazaar, but it
> chokes after 8-9 GB of data (out of memory). Anyway, it seems I need
> to separate filesystem metadata (filenames, directories, etc.) from
> the actual raw data files, e.g. a fancy hierarchy of hardlinks
> pointing to flat-dir files, and apply separate syncs to those two.
> But I should think it out first.

What's the reason that using something like DRBD+OCFS2 (or network filesystems in general) is not an option?

Nils.

Re: huge attachments - experience?

Posted by Ethan <et...@gmail.com>.
On Thu, Mar 28, 2013 at 7:23 AM, svilen <az...@svilendobrev.com> wrote:

> Yeah, I checked all those. They are post-factum file synchronizers -
> they cannot manage directories (rename a dir => copy the whole dir
> again). The only thing that manages directories is Bazaar, but it
> chokes after 8-9 GB of data (out of memory). Anyway, it seems I need
> to separate filesystem metadata (filenames, directories, etc.) from
> the actual raw data files, e.g. a fancy hierarchy of hardlinks
> pointing to flat-dir files, and apply separate syncs to those two.
> But I should think it out first.
>

Hi, you might look at git-annex. I use it to store large media files.
Basically, it keeps a directory tree of symlinks checked into git, but
the targets of those symlinks get copied around on your say-so. It makes
it really easy to answer "which hard drive is this file on", as well as
letting you rename files even when the files themselves aren't there.

Ethan

Re: huge attachments - experience?

Posted by svilen <az...@svilendobrev.com>.
Yeah, I checked all those. They are post-factum file synchronizers -
they cannot manage directories (rename a dir => copy the whole dir
again). The only thing that manages directories is Bazaar, but it chokes
after 8-9 GB of data (out of memory). Anyway, it seems I need to
separate filesystem metadata (filenames, directories, etc.) from the
actual raw data files, e.g. a fancy hierarchy of hardlinks pointing to
flat-dir files, and apply separate syncs to those two. But I should
think it out first.
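
A tiny sketch of that hardlink idea, purely to illustrate it (the two
directory paths are assumptions): keep the blobs in one flat
content-addressed dir and rebuild the human-readable tree as hardlinks,
so a rename or move never touches the data being synced:

import os

FLAT_STORE = "/data/blobs"   # hypothetical flat dir of content-named files
TREE_ROOT = "/data/music"    # human-readable hierarchy, rebuilt from metadata

def link_into_tree(digest, relative_path):
    """Expose the blob named <digest> at a nice path via a hardlink.

    Renaming or moving a track is then just dropping one hardlink and
    creating another; the blob itself (and its sync) is untouched.
    """
    src = os.path.join(FLAT_STORE, digest)
    dst = os.path.join(TREE_ROOT, relative_path)
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    if os.path.exists(dst):
        os.unlink(dst)
    os.link(src, dst)   # both paths must live on the same filesystem

# e.g. link_into_tree("3f2a...", "Some Artist/Some LP/01 - Track.flac")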

On Thu, 28 Mar 2013 10:29:30 +0100
Nils Breunese <N....@vpro.nl> wrote:

> svilen wrote:
> 
> >> Nothing CouchDB-related, but bup (https://github.com/bup/bup/) might
> >> be the solution for you.
> > Will check it out, thanks.
> > 
> >> What did you use before? rsync? 
> > rsync, lftp, manual cp -ru ...
> > 
> >> What's the nature of the problem?
> > Well, the usual computer-geek one: 3 machines with copies of the same
> > huge file hierarchy (500 GB) somewhere. Anything changed on any one
> > machine should sooner or later propagate to the others. The "worst"
> > kinds of changes are renaming directories, splitting one dir into
> > two, etc. - filesystem-level meta-changes. Nothing that I know of
> > tracks those (except Bazaar and maybe Mercurial), although it's
> > simple. No need for it to be fast or immediate; it's just me changing
> > things manually here or there.
> > Setting up special OCFS2 partitions is not an option.
> 
> If it were two machines, you could use Unison, but since you have
> three and apparently setting up a network filesystem is not an
> option, you might want to take a look at csync2:
> 
> http://linuxaria.com/howto/csync2-a-filesystem-syncronization-tool-for-linux?lang=en
> http://oss.linbit.com/csync2/
> 
> You could also use csync2 with lsyncd if you need more instant
> updates.
> 
> https://www.axivo.com/community/threads/lightning-fast-synch-with-csync2-and-lsyncd.121/
> 
> Nils.

Re: huge attachments - experience?

Posted by Nils Breunese <N....@vpro.nl>.
svilen wrote:

>> Nothing CouchDB-related, but bup (https://github.com/bup/bup/) might
>> be the solution for you.
> Will check it out, thanks.
> 
>> What did you use before? rsync? 
> rsync, lftp, manual cp -ru ...
> 
>> What's the nature of the problem?
> Well, the usual computer-geek one: 3 machines with copies of the same
> huge file hierarchy (500 GB) somewhere. Anything changed on any one
> machine should sooner or later propagate to the others. The "worst"
> kinds of changes are renaming directories, splitting one dir into two,
> etc. - filesystem-level meta-changes. Nothing that I know of tracks
> those (except Bazaar and maybe Mercurial), although it's simple. No
> need for it to be fast or immediate; it's just me changing things
> manually here or there.
> Setting up special OCFS2 partitions is not an option.

If it were two machines, you could use Unison, but since you have three and apparently setting up a network filesystem is not an option, you might want to take a look at csync2:

http://linuxaria.com/howto/csync2-a-filesystem-syncronization-tool-for-linux?lang=en
http://oss.linbit.com/csync2/

You could also use csync2 with lsyncd if you need more instant updates.

https://www.axivo.com/community/threads/lightning-fast-synch-with-csync2-and-lsyncd.121/

Nils.

Re: huge attachments - experience?

Posted by svilen <az...@svilendobrev.com>.
> Nothing CouchDB-related, but bup (https://github.com/bup/bup/) might
> be the solution for you.
Will check it out, thanks.

> What did you use before? rsync? 
rsync, lftp, manual cp -ru ...

> What's the nature of the problem?
Well, the usual computer-geek one: 3 machines with copies of the same
huge file hierarchy (500 GB) somewhere. Anything changed on any one
machine should sooner or later propagate to the others. The "worst"
kinds of changes are renaming directories, splitting one dir into two,
etc. - filesystem-level meta-changes. Nothing that I know of tracks
those (except Bazaar and maybe Mercurial), although it's simple. No
need for it to be fast or immediate; it's just me changing things
manually here or there.
Setting up special OCFS2 partitions is not an option.

Anyway, this went quite off-topic.

svilen

> On Tue, Mar 26, 2013 at 2:33 PM, Matthieu Rakotojaona
> <ma...@gmail.com> wrote:
> > On Tue, Mar 26, 2013 at 12:53 PM, svilen <az...@svilendobrev.com>
> > wrote:
> >> Jens, Nils, Dave, thanks for answering.
> >>
> >> It's all on a local network, or a nearly-local network. Speed
> >> doesn't matter, consistency does. Each copy can get changed, but
> >> it's usually different things that change. It's all separate files
> >> or dirs of files (eventually the metadata could go as docs into
> >> CouchDB itself, one day). Think of describing the contents of many
> >> LPs.
> >>
> >> The most troublesome part is renaming, moving and deleting stuff -
> >> files AND dirs - as I keep doing that all the time. I've been trying
> >> lftp, rsync, csync2, ocsync... all with varying success. None of
> >> them manages a dir rename. Now I'm trying Bazaar, as it does manage
> >> the renaming, but keeping .mp3/.flac files in a VCS is somewhat...
> >> too much.
> >
> > Nothing CouchDB-related, but bup (https://github.com/bup/bup/) might
> > be the solution for you.
> >
> > Another idea I had (this one is more 'because I can') was to use
> > BitTorrent to exchange the raw data, and CouchDB to exchange the
> > torrent files. I doubt it would scale, though.
> >
> >
> > --
> > Matthieu RAKOTOJAONA

Re: huge attachments - experience?

Posted by Albin Stigö <al...@gmail.com>.
What did you use before? rsync? What's the nature of the problem?


--Albin


On Tue, Mar 26, 2013 at 2:33 PM, Matthieu Rakotojaona
<ma...@gmail.com> wrote:
> On Tue, Mar 26, 2013 at 12:53 PM, svilen <az...@svilendobrev.com> wrote:
>> Jens, Nils, Dave, thanks for answering.
>>
>> It's all on a local network, or a nearly-local network. Speed doesn't
>> matter, consistency does. Each copy can get changed, but it's usually
>> different things that change. It's all separate files or dirs of files
>> (eventually the metadata could go as docs into CouchDB itself, one
>> day). Think of describing the contents of many LPs.
>>
>> The most troublesome part is renaming, moving and deleting stuff -
>> files AND dirs - as I keep doing that all the time. I've been trying
>> lftp, rsync, csync2, ocsync... all with varying success. None of them
>> manages a dir rename. Now I'm trying Bazaar, as it does manage the
>> renaming, but keeping .mp3/.flac files in a VCS is somewhat... too much.
>
> Nothing CouchDB-related, but bup (https://github.com/bup/bup/) might be
> the solution for you.
>
> Another idea I had (this one is more 'because I can') was to use
> BitTorrent to exchange the raw data, and CouchDB to exchange the
> torrent files. I doubt it would scale, though.
>
>
> --
> Matthieu RAKOTOJAONA

Re: huge attachments - experience?

Posted by Matthieu Rakotojaona <ma...@gmail.com>.
On Tue, Mar 26, 2013 at 12:53 PM, svilen <az...@svilendobrev.com> wrote:
> Jens, Nils, Dave, thanks for answering.
>
> It's all on a local network, or a nearly-local network. Speed doesn't
> matter, consistency does. Each copy can get changed, but it's usually
> different things that change. It's all separate files or dirs of files
> (eventually the metadata could go as docs into CouchDB itself, one
> day). Think of describing the contents of many LPs.
>
> The most troublesome part is renaming, moving and deleting stuff -
> files AND dirs - as I keep doing that all the time. I've been trying
> lftp, rsync, csync2, ocsync... all with varying success. None of them
> manages a dir rename. Now I'm trying Bazaar, as it does manage the
> renaming, but keeping .mp3/.flac files in a VCS is somewhat... too much.

Nothing CouchDB-related, but bup (https://github.com/bup/bup/) might be
the solution for you.

Another idea I had (this one is more 'because I can') was to use
BitTorrent to exchange the raw data, and CouchDB to exchange the
torrent files. I doubt it would scale, though.
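
A rough illustration of how that could be wired up (nothing tested; the
database name, attachment naming and watch directory are assumptions):
replicate small docs that carry the .torrent file as an attachment, and
have a listener on the _changes feed drop new torrents into a directory
that a BitTorrent client watches:

import os
import requests  # assumed HTTP client

DB = "http://localhost:5984/torrents"   # hypothetical replicated database
WATCH_DIR = "/var/lib/torrent-watch"    # a dir the BitTorrent client is set to watch

def follow_torrents(since=0):
    """Long-poll the _changes feed and materialise .torrent attachments."""
    while True:
        r = requests.get(
            DB + "/_changes",
            params={"feed": "longpoll", "since": since,
                    "include_docs": "true", "timeout": "60000"},
            timeout=90,
        )
        body = r.json()
        since = body["last_seq"]
        for change in body["results"]:
            doc = change.get("doc") or {}
            for name in doc.get("_attachments") or {}:
                if not name.endswith(".torrent"):
                    continue
                data = requests.get(DB + "/" + doc["_id"] + "/" + name).content
                target = os.path.join(WATCH_DIR, doc["_id"] + "-" + name)
                with open(target, "wb") as f:
                    f.write(data)   # the client picks it up and fetches the payload

# follow_torrents()  # run on each machine that should mirror the data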


-- 
Matthieu RAKOTOJAONA

Re: huge attachments - experience?

Posted by svilen <az...@svilendobrev.com>.
Jens, Nils, Dave, thanks for answering. 

It's all on a local network, or a nearly-local network. Speed doesn't
matter, consistency does. Each copy can get changed, but it's usually
different things that change. It's all separate files or dirs of files
(eventually the metadata could go as docs into CouchDB itself, one day).
Think of describing the contents of many LPs.

The most troublesome part is renaming, moving and deleting stuff - files
AND dirs - as I keep doing that all the time. I've been trying lftp,
rsync, csync2, ocsync... all with varying success. None of them manages
a dir rename. Now I'm trying Bazaar, as it does manage the renaming, but
keeping .mp3/.flac files in a VCS is somewhat... too much.

Hmm, maybe I need to somehow disconnect the data itself from the
dir/file naming.

Eventually, if I find the time, I may make a FUSE filesystem layer using
CouchDB as a replicated change-log, while the actual files stay just
plain files... but that's yet another todo.
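
Not a FUSE layer, but a minimal sketch of the change-log half of that idea
(the database name, doc fields and the two operations are invented for the
example): each machine records rename/delete operations as small docs,
CouchDB replication carries them around, and a listener on the _changes
feed replays them against the local tree:

import os
import requests  # assumed HTTP client

DB = "http://localhost:5984/fs_changelog"  # hypothetical replicated change-log db
ROOT = "/data/music"                       # local copy of the file hierarchy

def record(op, src, dst=None):
    """Log one filesystem operation; replication spreads it to the other machines."""
    requests.post(DB, json={"op": op, "src": src, "dst": dst}).raise_for_status()

def replay(since=0):
    """Follow _changes and apply remote rename/delete operations locally."""
    while True:
        r = requests.get(
            DB + "/_changes",
            params={"feed": "longpoll", "since": since,
                    "include_docs": "true", "timeout": "60000"},
            timeout=90,
        )
        body = r.json()
        since = body["last_seq"]
        for change in body["results"]:
            doc = change.get("doc") or {}
            src = os.path.join(ROOT, doc.get("src", ""))
            # On the machine that made the change, src is already gone: no-op.
            if doc.get("op") == "mv" and os.path.exists(src):
                dst = os.path.join(ROOT, doc["dst"])
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                os.rename(src, dst)   # works for files and whole dirs
            elif doc.get("op") == "rm" and os.path.exists(src):
                os.remove(src)

# e.g. record("mv", "Old Artist/Some LP", "New Artist/Some LP");
# replay() runs on each machine.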

ciao
svilen

On Tue, 26 Mar 2013 11:52:39 +0100
Nils Breunese <N....@vpro.nl> wrote:

> svilen wrote:
> 
> > I need some form of synchronised storage for audio files (with
> > some metadata). Something like 10-400 MB per attachment, 1-10
> > attachments per doc, overall about 1000 docs, 3000 attachments,
> > 300 GB total. Now I'm using just the plain filesystem, but it's a
> > pain to maintain consistency across several copies.
> 
> Do you have a master copy? Are all copies on a LAN or around the
> globe? How fast should changes propagate across all copies? Is the
> metadata stored in the audio files, or could it be? Or does the
> metadata need to be stored separately? Not that I don't like CouchDB,
> but it sounds like plain old rsync could be a reasonable solution. 
> 
> Nils.
> 

Re: huge attachments - experience?

Posted by Nils Breunese <N....@vpro.nl>.
svilen wrote:

> I need some form of synchronised storage for audio files (with
> some metadata). Something like 10-400 MB per attachment, 1-10
> attachments per doc, overall about 1000 docs, 3000 attachments,
> 300 GB total. Now I'm using just the plain filesystem, but it's a
> pain to maintain consistency across several copies.

Do you have a master copy? Are all copies on a LAN or around the globe? How fast should changes propagate across all copies? Is the metadata stored in the audio files, or could it be? Or does the metadata need to be stored separately? Not that I don't like CouchDB, but it sounds like plain old rsync could be a reasonable solution. 

Nils.