Posted to user@couchdb.apache.org by Steve-Mustafa Ismail Mustafa <m....@gmail.com> on 2010/04/16 02:16:11 UTC

CouchDB and Hadoop

I swear, I spent over an hour going through the mailing list trying to 
find an answer.

I know that CouchDB is a document oriented DB and I know that Hadoop is 
a File System and that both implement Map/Reduce.  But is it possible to 
have them stacked with Hadoop being the FS in use and CouchDB being the 
DB? This way, wouldn't you get the distributed/clustered FS abilities of 
Hadoop in addition to the powerful retrieval abilities of CouchDB?

If it's not possible, and I suspect that it is so, _why_? Don't they 
operate on two separate levels? Wouldn't CouchDB sort of replace HBase?

Thanks in advance for any and all replies

RE: CouchDB and Hadoop

Posted by Fredrik Widlund <fr...@qbrick.com>.

Yes, a distributed file system would just sync on a lower level. I'm not proposing this; I was just commenting on the Hadoop thread. For me, though, it would actually be relevant to at least consider using file system synchronization, if, against all odds, it would work. I'll check out couchdb-lounge in more detail for sure.

Kind regards,
Fredrik Widlund


Re: CouchDB and Hadoop

Posted by Adam Kocoloski <ko...@apache.org>.
Thanks Fredrik.  I think I have a pretty good handle on what's happening and have replied in detail in JIRA.  Best,

Adam



RE: CouchDB and Hadoop

Posted by Fredrik Widlund <fr...@qbrick.com>.

Hi,

https://issues.apache.org/jira/browse/COUCHDB-722

Thanks,
Fredrik




Re: CouchDB and Hadoop

Posted by Adam Kocoloski <ko...@apache.org>.
Hi Fredrik, thanks for the details.  The CPU utilization does not sound normal at all.  I have a node replicating 30-75 updates/sec (unique documents, diurnal fluctuations) for several months now and it almost never uses more than 50% of one core of a virtualized e5410 box with 1.7G of RAM.

I would definitely look into the crashes first and see if that resolves the giant fluctuations in CPU.  Is there a JIRA ticket I can follow? (I'm one of the developers of the replicator).  Best,
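(For illustration: one way to keep an eye on a running replication is the standard /_active_tasks endpoint, which returns a JSON array of in-progress tasks, replications included; the exact fields in the response vary by CouchDB version.)

    curl http://localhost:5984/_active_tasks

A watchdog script polling this endpoint is a simple way to detect and restart a broken replication.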

Adam



RE: CouchDB and Hadoop

Posted by Fredrik Widlund <fr...@qbrick.com>.

Hi,

The case I've tested so far uses Couch in the following setup (a small part of what would be a production-level setup for us):
- two bidirectionally synced nodes
- <50 writes/s to node A, each updating a unique doc
- <50 writes/s to node B, each updating a unique doc
- <50 reads/s from each node
- regular compaction of the database containing the docs

The two nodes run on quad-core (e5520) CPUs with 16G of RAM. CPU usage ramps up and down to 400% (i.e. full load on all cores) every few seconds. Couch 0.11.0 crashes regularly, which has been reported and is being worked on, from what I understand. Also, the replication tasks break and have to be restarted very often, probably due to the problem above.

Now, I've received a temporary patch as a possible work-around for the crashes. I haven't tested this case with the work-around yet, but I would assume it hopefully sorts out the crashes, though not the CPU load.
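For reference, a minimal sketch (not the actual test harness) of the write side of this load, assuming Python 3 and placeholder node URLs:

    #!/usr/bin/env python3
    # load.py -- approximate the "<50 unique-doc writes/s per node" scenario above.
    import json
    import time
    import uuid
    import urllib.request

    NODES = ["http://node-a:5984/testdb", "http://node-b:5984/testdb"]  # placeholders
    RATE = 50  # target writes per second, per node

    def put_unique_doc(base_url):
        # Each write creates a unique document, as in the test case.
        url = "%s/%s" % (base_url, uuid.uuid4().hex)
        body = json.dumps({"ts": time.time()}).encode()
        req = urllib.request.Request(url, data=body, method="PUT",
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req).read()

    while True:
        started = time.time()
        for node in NODES:
            put_unique_doc(node)
        # crude pacing toward RATE writes/s on each node
        time.sleep(max(0.0, 1.0 / RATE - (time.time() - started)))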

Kind regards,
Fredrik Widlund



Re: CouchDB and Hadoop

Posted by Randall Leeds <ra...@gmail.com>.
Hey Fredrik,

I'm one of the couchdb-lounge developers. I'd like to understand
better what your performance concerns are. Why are you concerned about
replicating a large number of changes? A distributed file system would
be doing the same thing but at a lower level. If such a system were to
work you'd be saving only HTTP and JSON overhead vs. replication. If
the replicator is too slow, that is something that can possibly be
improved. If you're concerned about the runtime impact of replication,
it is tunable via the [replicator] configuration section.
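For concreteness, a sketch of what such tuning can look like in local.ini; the option names below come from later CouchDB releases (1.2+), so whether each exists in 0.11 is an assumption to check against your version's docs:

    [replicator]
    ; worker processes per replication job (assumed name, from later releases)
    worker_processes = 4
    ; documents fetched/pushed per worker batch
    worker_batch_size = 500
    ; maximum concurrent HTTP connections per replication
    http_connections = 20
    ; per-connection timeout, in milliseconds
    connection_timeout = 30000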

couchdb-lounge already uses nginx for distributing simple GET and PUT
operations to documents and a python-twisted daemon to handle views.
The twisted daemon has configurable caching (with the one caveat that
the cache is currently unbounded, so the daemon needs to be restarted
periodically.... I should really fix this :-P). It should be possible
to chain any standard nginx caching modules in front of the lounge
proxy module.
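As a rough example of that chaining, a minimal nginx snippet (inside the http {} context) using the standard proxy cache, assuming the lounge proxy listens on 127.0.0.1:6984, which is a placeholder address:

    # define a cache zone; path and sizes are placeholders
    proxy_cache_path /var/cache/nginx/couch levels=1:2 keys_zone=couch_cache:10m max_size=1g;

    server {
        listen 80;
        location / {
            proxy_pass http://127.0.0.1:6984;  # lounge proxy (assumed address)
            proxy_cache couch_cache;
            proxy_cache_valid 200 10s;         # briefly cache successful GETs
        }
    }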

If you have other concerns or would like to investigate more, ping me
on irc (tilgovi) or join us over on
http://groups.google.com/group/couchdb-lounge

-Randall


RE: CouchDB and Hadoop

Posted by Fredrik Widlund <fr...@qbrick.com>.

Thanks, I will! We will actually use nginx for "dumb" caching, but add an API layer in between the cache and the Couch. Also, we actually need to mirror data to provide HA, and the performance issues we're having are more about constantly replicating a large number of changes than about accelerating reads. I'm not sure if couchdb-lounge would address this.

We did stumble upon a bug that's being addressed, and we were also provided with a temporary work-around, so it could be due to that, but with a quite modest load we periodically kept hitting the roof of an e5520 quad-core, so I'm a bit worried about the performance aspect.

Kind regards,
Fredrik Widlund



Re: CouchDB and Hadoop

Posted by David Coallier <da...@gmail.com>.

You should look into couchdb-lounge. It should resolve most of your
"sharding" replication issues :)

-- 
David Coallier

RE: CouchDB and Hadoop

Posted by Fredrik Widlund <fr...@qbrick.com>.

Well, we're building a solution on Couch and replication on a relatively large scale, and saying "it just works" doesn't really describe it for us. I really like the Couch design, but it's a bit of a challenge making it work for us. I can describe the case if you like.

Also, we already have a decentralized, distributed file system layer (which is often a natural part of a cloud solution, I suppose), so if we could run on top of that it would lessen the complexity of the overall solution.

In any case, it was a quick comment on the Hadoop question, and maybe it just wouldn't work that way. You could in general discuss atomic operations/locking and the performance implications of moving synchronization to a lower abstraction layer, I guess.

Kind regards,
Fredrik Widlund




Re: CouchDB and Hadoop

Posted by Sebastian Cohnen <se...@googlemail.com>.
Why would someone possibly do that? CouchDB can do many things really well, and replication is one of these things. It's dead simple to set up and just works...
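To make "dead simple" concrete: replication is a single HTTP call to the _replicate endpoint. A minimal example against a local node (database and host names are placeholders):

    curl -X POST http://localhost:5984/_replicate \
         -H "Content-Type: application/json" \
         -d '{"source": "http://other-host:5984/mydb", "target": "mydb", "continuous": true}'

Dropping "continuous" gives a one-shot replication instead.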



RE: CouchDB and Hadoop

Posted by Fredrik Widlund <fr...@qbrick.com>.

Are the files reopened for each write, etc.? If locking works, glusterfs, for example, could be a nice solution for the replication. Each write would be atomically written to all instances, and reads would be local (using AFR with preferred servers).
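For illustration, a two-node replicated (AFR) volume along those lines; note that glusterfs at the time was configured through translator volfiles, and the CLI shown here is from later 3.x releases, so take it as a sketch only (hostnames and paths are placeholders):

    # create and start a two-way replicated (AFR) volume
    gluster volume create couchvol replica 2 server1:/export/brick server2:/export/brick
    gluster volume start couchvol
    # mount it where CouchDB keeps its .couch files
    mount -t glusterfs server1:/couchvol /var/lib/couchdb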

Kind regards,
Fredrik Widlund




Re: CouchDB and Hadoop

Posted by "Jim R. Wilson" <wi...@gmail.com>.
Hi Steve,

Clarification points:  Hadoop is not a filesystem, it's an
implementation of MapReduce.  For sharing files around the cluster,
Hadoop uses HDFS (the Hadoop Distributed File System) by default, and
can also use other filesystems (I believe it supports Amazon S3
storage if the cluster is in EC2).

So, the questions become:

* Could the data file for a CouchDB database be stored in HDFS?
* Could the MapReduce tasks executed by CouchDB be offloaded to Hadoop?

I think the answers to both are "probably".  How much work it would
take to implement such a system is an open question.  I suspect
storing the data file on HDFS would be easier than offloading the
MapReduce tasks.

As far as handling the Java to Erlang/JavaScript mismatch, I think
that particular piece can be addressed by using Hadoop Streaming[1].
I have done a fair amount of work using Python to work on JSON objects
over Hadoop Streaming - Erlang/JavaScript should be no different.
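As a hedged illustration of that approach (not the code referred to above): a Hadoop Streaming mapper is just a program that reads input records on stdin and emits tab-separated key/value pairs on stdout, so JSON documents can be handled in any language. A minimal Python sketch, assuming one JSON document per input line with a hypothetical "type" field:

    #!/usr/bin/env python
    # mapper.py -- emit ("type", 1) pairs for a count-by-type job.
    # Hadoop Streaming supplies records on stdin and expects
    # tab-separated key/value output on stdout.
    import json
    import sys

    for line in sys.stdin:
        try:
            doc = json.loads(line)
        except ValueError:
            continue  # skip malformed records
        sys.stdout.write("%s\t1\n" % doc.get("type", "unknown"))

A matching reducer in the same style just sums the counts per key read from stdin.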

The real question in my mind is, "why do any of this?".  Both Hadoop
and CouchDB are fine systems with particular goals in mind.  I'm not
convinced there's significant value in Frankensteining them together.

Just my $0.02

[1] http://hadoop.apache.org/common/docs/r0.15.2/streaming.html

-- Jim R. Wilson (jimbojw)



Re: CouchDB and Hadoop

Posted by Suhail Ahmed <su...@gmail.com>.
Sure, it can be done, but for me the whole Java-to-Erlang layer would be a
mess since they are so different. The better way to go about doing this
would be to implement a distributed file system like Hadoop underneath Couch
for the same effect.
