Posted to user@couchdb.apache.org by Dale Scott <da...@shaw.ca> on 2015/11/15 19:46:34 UTC

distributed use case

Hi, I've been lurking for a while and have a use case and architecture that
I'd appreciate comments on. I've never personally built anything like this
before.

 

Without intentionally obfuscating, I have 128GB of data collected from an
experiment, roughly equivalent to a large set of 640x480 PNG images. Images
are independent and analyzed image-by-image by an image recognition
algorithm. I was thinking of dividing the set of images into sub-sets with a
scheduler and having a new EC2 instance analyze each sub-set.

 

Are there any places in this scenario where couchdb would shine? Replicating
a master couchdb image recognition library to each new EC2 instance?
Replicating the analysis results from each EC2 instance to a master couchdb
database?
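
By way of illustration, pushing results from an instance up to a central
database is a single call to CouchDB's _replicate endpoint; the database
names, the host, and the use of a recent Node.js with its built-in fetch
below are placeholder assumptions, not part of any existing setup.

// One-shot replication of a local "results" database to a central CouchDB.
// Each EC2 instance could run this once its analysis batch is finished.
async function pushResults() {
    const res = await fetch('http://localhost:5984/_replicate', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
            source: 'results',                                // local db on the instance
            target: 'http://master.example.com:5984/results'  // central db (placeholder URL)
        })
    });
    return res.json();   // contains ok: true and a history entry when the run completes
}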

 

Thanks!

---

Dale R. Scott, P.Eng.

Transparency with Trust

 


Re: distributed use case

Posted by Dale Scott <da...@shaw.ca>.
Thanks Dave for the ideas. I will check the links.

I'm not sure whether my problem is about message passing or about sharing data. The sensor data is captured offline, but in real time, from ~128 I2C sensors, and it is convenient to think of it as a stream of images in time.

Dale
 

Re: distributed use case

Posted by Dave Cottlehuber <dc...@apache.org>.

Welcome Dale!

This sounds roughly like you have a message-passing workflow:

- Jobs are inserted into the system
- N workers process Y jobs
- The results are stored (or collated...)

For a pure couchdb approach, see https://github.com/iriscouch/cqs &
https://github.com/jo/couch-daemon; in particular, the links in the last
one may be very interesting for your obfuscated use case. The general
idea is to have workers actively pulling jobs off a couchdb, updating
the doc with a time-stamped reservation, and having a reaper process
ensure that slow workers' docs are returned to the queue for another,
hopefully faster, worker to pick up. Using this + attachments may work
well, or you may prefer to keep the queue separate from the raw data in
a different db.
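
As a minimal sketch of that reservation idea (not the actual cqs or
couch-daemon implementation): assume a "jobs" database, a design doc
_design/queue with a "pending" view, a claimed_at field on each job doc,
and a recent Node.js with the built-in fetch; all of those names are
illustrative assumptions.

// Map function for the assumed "pending" view: emit jobs that have no
// result yet and have not been claimed by a worker.
function (doc) {
    if (doc.type === 'job' && !doc.result && !doc.claimed_at) {
        emit(doc._id, null);
    }
}

// Worker-side claim step, talking to CouchDB's plain HTTP API.
const DB = 'http://localhost:5984/jobs';

async function claimOneJob() {
    const res = await fetch(DB + '/_design/queue/_view/pending?limit=1&include_docs=true');
    const view = await res.json();
    if (view.rows.length === 0) return null;      // nothing left to do
    const doc = view.rows[0].doc;
    doc.claimed_at = new Date().toISOString();    // the time-stamped reservation
    // Optimistic concurrency: if another worker claimed the doc first, our _rev
    // is stale, CouchDB answers 409 Conflict, and we simply move on to the next job.
    const put = await fetch(DB + '/' + doc._id, {
        method: 'PUT',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(doc)
    });
    return put.ok ? doc : null;
}

A reaper would then periodically look for docs whose claimed_at is older
than some timeout and delete that field, so the job reappears in the
pending view for a faster worker.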

However, you may find that something like rabbitmq, or even some hosted
cloud equivalent (maybe AWS Lambda), is easier here; but if you want to
keep the raw & generated attachments in related (or the same) docs, it
may be better to stay in couchdb.
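
On the attachment side, CouchDB's standalone attachment API is enough to
keep the raw image next to its job doc; a rough sketch (same Node.js/fetch
assumption as above, with a made-up doc id, attachment name, and file path):

// Attach a raw image to an existing job doc. The doc's current _rev must be
// supplied, and CouchDB returns a new _rev for the attachment write.
const fs = require('fs');

async function attachImage(db, docId, rev, path) {
    const png = fs.readFileSync(path);
    const res = await fetch(db + '/' + docId + '/raw.png?rev=' + rev, {
        method: 'PUT',
        headers: { 'Content-Type': 'image/png' },
        body: png
    });
    return res.json();   // { ok: true, id: ..., rev: ... } on success
}

The generated analysis output can later be stored the same way (or inline
in the doc), so the raw image and its result travel together through
replication.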

I know a number of people, e.g. jhs@, who have successfully (ab)used
couchdb as both a message queue and a backing store for this. It really
depends on whether you want to use couchdb for everything, or have some
other needs that are better served by a real message queue architecture
plus couchdb to store and transfer the potentially large image/data
attachments instead of bloating the message queue.

I think the tradeoff is largely around what else you need to do & how
much data you are sending around, and whether you need a full-blown
message queue system or can hack up the equivalents you need with
couchdb instead.

A+
Dave

Re: distributed use case

Posted by Dale Scott <da...@shaw.ca>.
Thanks James for the idea.

Re: distributed use case

Posted by James Dingwall <ja...@zynstra.com>.
You may find that replicating subsets of your data to the analyzing
instance is unnecessary.  If I want to process a set of documents in
parallel and it isn't important where they are processed, I write a view
function which assigns each document a random number from 1..n, e.g.:

function(doc) {
     // Assign each not-yet-analyzed doc a random partition key from 1..instances_count.
     var instances_count = 3;
     if(!doc.analyzer_result) {
         emit('' + (Math.floor(Math.random() * instances_count) + 1), null);
     }
}

For each analyzer, assign it a number which is the key it will process
from the database.  If the analysis time > the RTT of talking to CouchDB,
this should be OK as is.  You could buffer documents at the fetch stage
(query with include_docs=true and limit=200) and at the save stage (use a
bulk update) if the network time becomes significant in relation to the
processing.
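
A sketch of what that fetch/save loop could look like; the view path,
database name, and the analyze() stub are placeholders rather than part
of James's suggestion, and a recent Node.js with the built-in fetch is
assumed:

const DB = 'http://localhost:5984/images';

// Placeholder for the real image-recognition step.
function analyze(doc) {
    return { processed: true, doc_id: doc._id };
}

// Analyzer assigned the key `myKey` pulls its batch and saves results in bulk.
async function processBatch(myKey) {
    const url = DB + '/_design/work/_view/by_random_key'
              + '?key=' + encodeURIComponent('"' + myKey + '"')
              + '&include_docs=true&limit=200';
    const view = await (await fetch(url)).json();

    const updated = view.rows.map(function (row) {
        const doc = row.doc;
        doc.analyzer_result = analyze(doc);
        return doc;
    });

    // One round trip instead of 200: save everything through _bulk_docs.
    await fetch(DB + '/_bulk_docs', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ docs: updated })
    });
}

Because each saved doc now has analyzer_result set, it drops out of the
view on the next index update, so re-running the loop works through the
remaining documents.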

James