Posted to user@hadoop.apache.org by David Parks <da...@yahoo.com> on 2012/10/27 05:07:16 UTC

Cluster wide atomic operations

How can we manage cluster-wide atomic operations, such as maintaining an
auto-increment counter?

Does Hadoop provide native support for these kinds of operations?

And in case the ultimate answer involves ZooKeeper, I'd love to work out how
to do this on AWS/EMR.


Re: Cluster wide atomic operations

Posted by Michael Katzenellenbogen <mi...@cloudera.com>.
Twitter's Snowflake may provide you with some inspiration:

https://github.com/twitter/snowflake

-Michael
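
For reference, a Snowflake-style ID packs a timestamp, a worker ID, and a
per-worker sequence number into one 64-bit long, so every worker can hand out
IDs with no coordination at all. A minimal sketch of the idea in Java; the bit
widths and the missing clock-drift handling are simplifications, not Twitter's
actual implementation:

    public class SnowflakeLikeIdGenerator {
        private final long workerId;     // must be unique per worker, 0..1023 here
        private long lastTimestamp = -1L;
        private long sequence = 0L;      // 12-bit counter within one millisecond

        public SnowflakeLikeIdGenerator(long workerId) {
            this.workerId = workerId;
        }

        public synchronized long nextId() {
            long ts = System.currentTimeMillis();
            if (ts == lastTimestamp) {
                sequence = (sequence + 1) & 0xFFF;            // wraps at 4096
                if (sequence == 0) {                          // millisecond exhausted, spin to the next one
                    while ((ts = System.currentTimeMillis()) <= lastTimestamp) { }
                }
            } else {
                sequence = 0;
            }
            lastTimestamp = ts;
            // 41 bits of time | 10 bits of worker ID | 12 bits of sequence
            // (a production version subtracts a custom epoch and handles the clock moving backwards)
            return (ts << 22) | (workerId << 12) | sequence;
        }
    }

The catch for this thread is that the resulting IDs use the full 64 bits,
nowhere near a one-to-one-million range.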


Re: Cluster wide atomic operations

Posted by Taeho Kang <tk...@gmail.com>.
Hello, David,

How about using something like Redis for that? http://redis.io

There are services like RedisToGo (https://redistogo.com/), which also run
on AWS and are very easy to get started with: sign up, a few clicks, and you
are set to go.
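
For reference, with a hosted Redis instance the counter is a single atomic
INCR call. A minimal sketch using the Jedis client; the hostname is a
placeholder for whatever RedisToGo (or your own Redis) gives you:

    import redis.clients.jedis.Jedis;

    public class RedisCounter {
        public static void main(String[] args) {
            Jedis jedis = new Jedis("my-redis-host.example.com", 6379);
            // INCR is atomic on the Redis server, so concurrent tasks never
            // receive the same value.
            long nextId = jedis.incr("item-id-counter");
            System.out.println("assigned id " + nextId);
            jedis.disconnect();
        }
    }

The usual caveat applies: this is one network round-trip per ID, so
jedis.incrBy("item-id-counter", blockSize) is a better fit if each task can
claim a whole block at once.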



Re: Cluster wide atomic operations

Posted by Steve Loughran <st...@hortonworks.com>.
On 29 October 2012 01:15, David Parks <da...@yahoo.com> wrote:

> I need a unique & permanent ID assigned to each new item encountered, which has
> a constraint that it is in the range of, let’s say for simple discussion,
> one to one million.
>

I'd go for UUID generation, which you can do in parallel, though it doesn't
meet your range requirements.
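
For reference, that is a one-liner with the JDK and needs no coordination
between tasks at all:

    import java.util.UUID;

    public class UuidExample {
        public static void main(String[] args) {
            // Type 4 (random) UUID: 128 bits, effectively collision-free,
            // but far outside a one-to-one-million range.
            UUID id = UUID.randomUUID();
            System.out.println(id);
        }
    }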

RE: Cluster wide atomic operations

Posted by David Parks <da...@yahoo.com>.
That's a very helpful discussion. Thank you.

 

I'd like to go with assigning blocks of IDs to each reducer. Snowflake
would require external changes that are a pain; I'd rather make my job fit
our current constraints.

Is there a way to get an index number for each reducer so that I can
identify which block of IDs to assign to each one?

 

Thanks,

David
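
For reference, the reduce task's 0-based index is available from the task
attempt ID in the new mapreduce API, so each reducer can claim a disjoint
block without talking to anything else. A minimal sketch; the block size and
the key/value types are placeholders for your job's own:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class BlockIdReducer extends Reducer<Text, Text, Text, LongWritable> {
        // Each reducer owns the range [base, base + BLOCK_SIZE); BLOCK_SIZE must
        // be at least the number of new items any single reducer can see.
        private static final long BLOCK_SIZE = 10000L;
        private long nextId;

        @Override
        protected void setup(Context context) {
            // 0-based index of this reduce task within the job.
            int partition = context.getTaskAttemptID().getTaskID().getId();
            nextId = 1L + partition * BLOCK_SIZE;
        }

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Hand out the next ID from this reducer's private block.
            context.write(key, new LongWritable(nextId++));
        }
    }

Speculative execution stays safe here: a duplicate attempt of the same task
regenerates the same block, and only one attempt's output is committed.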

 

 

Re: Cluster wide atomic operations

Posted by Ted Dunning <td...@maprtech.com>.
On Sun, Oct 28, 2012 at 9:15 PM, David Parks <da...@yahoo.com> wrote:

> I need a unique & permanent ID assigned to each new item encountered, which has
> a constraint that it is in the range of, let’s say for simple discussion,
> one to one million.
>

Having such a limited range may require that you have a central service to
generate IDs.  The use of a central service can be disastrous for
throughput.


>
> I suppose I could assign a range of usable IDs to each reduce task
> (where IDs are assigned) and keep those organized somehow at the end of
> the job, but this seems clunky too.
>

Yes.  Much better.


>  Since this is on AWS, ZooKeeper is not a good option. I thought it was
> part of the Hadoop cluster (and thus easy to access), but I guess I was wrong
> there.
>

No.  This is specifically not part of Hadoop for performance reasons.


> I would think that such a service would run most logically on the
> taskmaster server. I’m surprised this isn’t a common issue. I guess I could
> launch a separate job that runs such a sequence service perhaps. But that’s
> non-trivial in itself with failure concerns.
>

The problem is that a serial number service is a major loss of performance
in a parallel system.  Unless you relax the idea considerably (by allowing
blocks, or having lots of bits like Snowflake), you wind up with a round-trip
per ID and a critical section on the ID generator.  This is bad.

Look up Amdahl's Law.


> Perhaps there’s just a better way of thinking of this?
>

Yes.  Use lots of bits and be satisfied with uniqueness rather than perfect
ordering and limited range.

As the other respondent said, look up Snowflake.
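
For reference, Amdahl's law is the bound being invoked here: if a fraction p
of the job parallelizes and the rest (for example, a single serialized ID
counter) does not, then with n tasks

    S(n) = \frac{1}{(1 - p) + p/n} \le \frac{1}{1 - p}

so even a small serialized fraction caps the speedup no matter how many
reducers you add; claiming IDs in blocks shrinks that fraction.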

RE: Cluster wide atomic operations

Posted by David Parks <da...@yahoo.com>.
I need a unique & permanent ID assigned to each new item encountered, which has a
constraint that it is in the range of, let's say for simple discussion, one
to one million.

 

I suppose I could assign a range of usable IDs to each reduce task (where
IDs are assigned) and keep those organized somehow at the end of the job,
but this seems clunky too.

 

Since this is on AWS, ZooKeeper is not a good option. I thought it was part
of the Hadoop cluster (and thus easy to access), but I guess I was wrong
there.

 

I would think that such a service would run most logically on the taskmaster
server. I'm surprised this isn't a common issue. I guess I could launch a
separate job that runs such a sequence service perhaps. But that's
non-trivial in itself with failure concerns.

 

Perhaps there's just a better way of thinking of this?

 

 

Re: Cluster wide atomic operations

Posted by Ted Dunning <td...@maprtech.com>.
This is better asked on the Zookeeper lists.

The first answer is that global atomic operations are generally a bad idea.

The second answer is that if you can batch these operations up, then you can
cut the evilness of global atomicity by a substantial factor.

Are you sure you need a global counter?
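
For reference, if the answer does end up being ZooKeeper (which you can run
yourself on EC2 alongside an EMR cluster), the batching suggested above is
straightforward: keep the counter in a znode and claim whole blocks with a
compare-and-set update, so each task pays one round-trip per block rather
than per ID. A minimal sketch, assuming the znode already exists and holds
the counter as a decimal string; the path is a placeholder:

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class ZkBlockCounter {
        private final ZooKeeper zk;
        private final String path;   // e.g. "/counters/item-id"

        public ZkBlockCounter(ZooKeeper zk, String path) {
            this.zk = zk;
            this.path = path;
        }

        /** Atomically advances the counter by blockSize; returns the first ID of the claimed block. */
        public long claimBlock(long blockSize) throws KeeperException, InterruptedException {
            while (true) {
                Stat stat = new Stat();
                long current = Long.parseLong(new String(zk.getData(path, false, stat)));
                try {
                    // Conditional update: fails with BadVersionException if another
                    // task updated the node since we read it, in which case we retry.
                    zk.setData(path, Long.toString(current + blockSize).getBytes(), stat.getVersion());
                    return current;   // this task now owns [current, current + blockSize)
                } catch (KeeperException.BadVersionException e) {
                    // lost the race; loop and try again
                }
            }
        }
    }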
