Posted to common-user@hadoop.apache.org by Kiet Tran <kt...@gmail.com> on 2015/06/07 23:53:16 UTC

Advantage/disadvantage of dbm vs join vs HBase

Hi,

I have a roughly 5 GB file where each row is a key-value pair. I
would like to use this as a "hashmap" against another large set of
files. From searching around, one way to do it would be to turn it into
a dbm like DBD and put it into the distributed cache. Another is
joining the data. A third is putting it into HBase and using it for
lookups.
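
To make the first two options concrete, here is a minimal in-memory sketch in plain Python. It is not Hadoop code and not from the thread: the function names and sample rows are hypothetical, and the "shuffle" is simulated with a dictionary. It only shows where the lookup data lives in each approach.

```python
from collections import defaultdict


def map_side_join(records, lookup):
    """Option 1 (hypothetical sketch): each mapper holds the whole lookup
    table locally, as it would with a file shipped via the distributed cache."""
    return [(key, payload, lookup[key])
            for key, payload in records
            if key in lookup]


def reduce_side_join(records, lookup_rows):
    """Option 2 (hypothetical sketch): both inputs are grouped by key, which
    stands in for the shuffle, and each group is paired up in the 'reducer'."""
    groups = defaultdict(lambda: {"lookup": [], "record": []})
    for key, value in lookup_rows:
        groups[key]["lookup"].append(value)
    for key, payload in records:
        groups[key]["record"].append(payload)

    joined = []
    for key in sorted(groups):
        for payload in groups[key]["record"]:
            for value in groups[key]["lookup"]:
                joined.append((key, payload, value))
    return joined


# Made-up sample data: three records, two of which have a lookup entry.
records = [("a", 1), ("b", 2), ("c", 3)]
lookup_rows = [("a", "x"), ("b", "y")]

print(map_side_join(records, dict(lookup_rows)))
print(reduce_side_join(records, lookup_rows))
```

Both functions return the same joined rows. The difference is that the first needs the entire lookup table on every node, while the second moves both data sets through the grouping step.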

I'm more familiar with the first approach, so it seems simpler to me.
However, I have read that using a distributed cache for files beyond a
few megabytes is not recommended because the file is replicated across
all the data nodes. This doesn't seem that bad to me because I just
pay this overhead once at the beginning of the job, and then each node
gets a copy locally, right? If I were to go with a join, would it not
increase the workload (more entries) and create the same network
congestion issue? And wouldn't going with HBase mean making it a
bottleneck?
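
A rough back-of-envelope sketch of the network cost being asked about, in Python. This is an illustration under simplifying assumptions, not a measurement: it assumes the distributed cache copies the whole file once to every node that runs a task, and that a reduce-side join moves roughly both inputs across the network once. The node count and data set size below are made up.

```python
def distributed_cache_gb(lookup_gb, nodes):
    """Assumed cost model: one full copy of the lookup file per node."""
    return lookup_gb * nodes


def reduce_side_join_gb(lookup_gb, data_gb):
    """Assumed cost model: both inputs pass through the shuffle once."""
    return lookup_gb + data_gb


lookup_gb = 5    # size of the lookup file from the question
nodes = 20       # hypothetical cluster size
data_gb = 500    # hypothetical size of the other data set

print("distributed cache:", distributed_cache_gb(lookup_gb, nodes), "GB")
print("reduce-side join: ", reduce_side_join_gb(lookup_gb, data_gb), "GB")
```

Under this model the cache cost grows with the number of nodes and the join cost grows with the size of the other data set, so which one is cheaper depends on both. It ignores local disk space, memory for the lookup structure, and data locality during the shuffle.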

What are the advantages and disadvantages of going for one solution
over the others? What if, for example, that "hashmap" needs to come
from, say, a 40 GB file? How would my options change? At which point
would each option make sense?

Sincerely,
Kiet Tran

Re: Advantage/disadvantage of dbm vs join vs HBase

Posted by Kiet Tran <kt...@gmail.com>.
Nope. I have never used HBase before. I'm also new to Hadoop in
general. I'll be running the MapReduce job on EMR.

Disregarding what I'm familiar with, I'd also like to know when one
would do one thing vs another. Maybe it's something we can only tell
from experimenting around, but it sounds like a problem others have
run into before.

Sincerely,
Kiet Tran

On Sun, Jun 7, 2015 at 8:34 PM, Ted Yu <yu...@gmail.com> wrote:
> Do you have HBase running in your cluster?
>
> I ask this because bringing HBase as a new component into your deployment
> incurs operational overhead which you may not be familiar with.
>
> Cheers
>
> On Sun, Jun 7, 2015 at 2:53 PM, Kiet Tran <kt...@gmail.com> wrote:
>>
>> Hi,
>>
>> I have a roughly 5 GB file where each row is a key-value pair. I
>> would like to use this as a "hashmap" against another large set of
>> files. From searching around, one way to do it would be to turn it into
>> a dbm like DBD and put it into the distributed cache. Another is
>> joining the data. A third is putting it into HBase and using it for
>> lookups.
>>
>> I'm more familiar with the first approach, so it seems simpler to me.
>> However, I have read that using a distributed cache for files beyond a
>> few megabytes is not recommended because the file is replicated across
>> all the data nodes. This doesn't seem that bad to me because I just
>> pay this overhead once at the beginning of the job, and then each node
>> gets a copy locally, right? If I were to go with a join, would it not
>> increase the workload (more entries) and create the same network
>> congestion issue? And wouldn't going with HBase mean making it a
>> bottleneck?
>>
>> What are the advantages and disadvantages of going for one solution
>> over the others? What if, for example, that "hashmap" needs to come
>> from, say, a 40 GB file? How would my options change? At which point
>> would each option make sense?
>>
>> Sincerely,
>> Kiet Tran
>
>

Re: Advantage/disadvantage of dbm vs join vs HBase

Posted by Ted Yu <yu...@gmail.com>.
Do you have HBase running in your cluster?

I ask this because bringing HBase as a new component into your deployment
incurs operational overhead which you may not be familiar with.

Cheers

On Sun, Jun 7, 2015 at 2:53 PM, Kiet Tran <kt...@gmail.com> wrote:

> Hi,
>
> I have a roughly 5 GB file where each row is a key-value pair. I
> would like to use this as a "hashmap" against another large set of
> files. From searching around, one way to do it would be to turn it into
> a dbm like DBD and put it into the distributed cache. Another is
> joining the data. A third is putting it into HBase and using it for
> lookups.
>
> I'm more familiar with the first approach, so it seems simpler to me.
> However, I have read that using a distributed cache for files beyond a
> few megabytes is not recommended because the file is replicated across
> all the data nodes. This doesn't seem that bad to me because I just
> pay this overhead once at the beginning of the job, and then each node
> gets a copy locally, right? If I were to go with a join, would it not
> increase the workload (more entries) and create the same network
> congestion issue? And wouldn't going with HBase mean making it a
> bottleneck?
>
> What are the advantages and disadvantages of going for one solution
> over the others? What if, for example, that "hashmap" needs to come
> from, say, a 40 GB file? How would my options change? At which point
> would each option make sense?
>
> Sincerely,
> Kiet Tran
>
