Posted to mapreduce-user@hadoop.apache.org by yogesh dhari <yo...@live.com> on 2012/10/01 15:36:58 UTC

HADOOP in Production

Hi all,

I have a basic understanding of Hadoop and its ecosystem (Pig for ETL, Hive as a data warehouse, Sqoop as an import tool). So far I have worked and learned on a single-node cluster with demo data.

Since Hadoop is best suited to Unix platforms, please help me understand the requirements, from start to finish, for using Hadoop in production.

What is needed to use Hadoop on a real time project?

For example, automating Hadoop on Unix and alerting when a process fails.

Please shed some light on using Hadoop in real time and on which objectives are recommended.


Thanks & Regards
Yogesh Kumar


RE: HADOOP in Production

Posted by "Gauthier, Alexander" <Al...@Teradata.com>.
Owned.

From: Ted Dunning [mailto:tdunning@maprtech.com]
Sent: Tuesday, October 02, 2012 4:13 PM
To: user@hadoop.apache.org
Subject: Re: HADOOP in Production


On Tue, Oct 2, 2012 at 7:05 PM, Hank Cohen <ha...@altior.com> wrote:
> There is an important difference between real time and real fast.
>
> Real time means that system response must meet a fixed schedule.
> Real fast just means sooner is better.

Good thought, but real-time can also include a fixed schedule and a specified list of exceptional conditions which would prevent meeting the schedule.

It may also include a fixed schedule that must be met some fraction of the time (usually very near 100% of the time).

Without providing exceptions, you basically force the designer to lie about how reliable their system is.

> Real time systems always have hard schedules. The schedule could be in microseconds to control a laser for making masks for semiconductor manufacturing, milliseconds to control the ignition in your car or flight controls on an F-22, or seconds for even slower moving processes.  In a real time system, missing the schedule can mean that very bad things happen: planes fall from the sky, your laser printer fries its imaging drum, factories explode, etc.

It can mean that.  But if you specify the exceptional situations you can specifically mitigate for them.

> Most transaction processing is happy with real fast.
> The folks doing high velocity trading are pretty close to real time but they probably will be happy with real fast.
> If real fast systems miss a schedule then someone loses money.

Yeah.  And if you talk to these guys, they know the difference and ask for real-time.


> The reason that RTOS type operating systems are popular for real time applications is that they don't allow operations to spend indeterminate amounts of time in uninterruptible states.  Java will never qualify as a real time system because it has garbage collection and garbage collection can lock up a system for an indefinite amount of time while it goes through marking and counting.

You are behind the times on a few counts.

- Java's collectors don't "count".

- Java can be real-time:

http://rtjava.blogspot.co.uk/2009/07/real-time-java-vms.html

- Garbage collection can be deterministic and real-time:

http://www.ibm.com/developerworks/java/library/j-rtj4/index.html
http://docs.oracle.com/cd/E13221_01/wlrt/docs30/intro_wlrt/tuning.html


RE: HADOOP in Production

Posted by Hank Cohen <ha...@altior.com>.
Good points.

I'm not trying to be exhaustive in the discussion of real time systems.
My only intent was to point out the difference between real time and fast response.
There are lots of real time requirements that do not require a particularly fast response but the response needs to be on time.  Timing errors can introduce noise at a pretty astounding rate.
Real fast often uses the same technology as real time but the consequences of missing a deadline are different.

As for the high velocity traders, I'm an advocate of the Robin Hood Tax.
A little friction might help the world.

Hank Cohen



Re: HADOOP in Production

Posted by Ted Dunning <td...@maprtech.com>.
On Tue, Oct 2, 2012 at 7:05 PM, Hank Cohen <ha...@altior.com> wrote:

> There is an important difference between real time and real fast
>
> Real time means that system response must meet a fixed schedule.
> Real fast just means sooner is better.
>

Good thought, but real-time can also include a fixed schedule and a
specified list of exceptional conditions which would prevent meeting the
schedule.

It may also include a fixed schedule that must be met some fraction of the
time (usually very near 100% of the time).

Without providing exceptions, you basically force the designer to lie about
how reliable their system is.
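
As a rough illustration of the "met some fraction of the time" form of a requirement, here is a minimal Java sketch. The class and method names, the sample latencies, the 500 microsecond deadline and the 0.999 target are all made up for the example; nothing here comes from Hadoop or from any real-time library.

```java
import java.util.Arrays;

public class DeadlineFraction {
    /** Fraction of observed latencies that finished at or before the deadline. */
    static double fractionMet(long[] latenciesMicros, long deadlineMicros) {
        if (latenciesMicros.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long met = Arrays.stream(latenciesMicros)
                         .filter(l -> l <= deadlineMicros)
                         .count();
        return (double) met / latenciesMicros.length;
    }

    /** True when the schedule was met at least the required fraction of the time. */
    static boolean meetsSpec(long[] latenciesMicros, long deadlineMicros, double requiredFraction) {
        return fractionMet(latenciesMicros, deadlineMicros) >= requiredFraction;
    }

    public static void main(String[] args) {
        // Hypothetical measurements in microseconds; one sample (950) is late.
        long[] samples = {120, 180, 950, 140, 160, 130, 170, 150, 110, 190};
        long deadline = 500;
        System.out.println(fractionMet(samples, deadline));       // prints 0.9
        System.out.println(meetsSpec(samples, deadline, 0.999));  // prints false
    }
}
```

A specification written this way states both the deadline and the fraction, so a late response is counted against a stated budget rather than ignored.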


> Real time systems always have hard schedules. The schedule could be in
> microseconds to control a laser for making masks for semiconductor
> manufacturing, milliseconds to control the ignition in your car or flight
> controls on an F-22, or seconds for even slower moving processes.  In a real
> time system, missing the schedule can mean that very bad things happen:
> planes fall from the sky, your laser printer fries its imaging drum,
> factories explode, etc.
>

It can mean that.  But if you specify the exceptional situations you can
specifically mitigate for them.


> Most transaction processing is happy with real fast
> The folks doing high velocity trading are pretty close to real time but
> they probably will be happy with real fast.
> If real fast systems miss a schedule then someone loses money.
>

Yeah.  And if you talk to these guys, they know the difference and ask for
real-time.



> The reason that RTOS type operating systems are popular for real time
> applications is that they don't allow operations to spend indeterminate
> amounts of time in uninterruptible states.  Java will never qualify as a
> real time system because it has garbage collection and garbage collection
> can lock up a system for an indefinite amount of time while it goes through
> marking and counting.
>

You are behind the times on a few counts.

- Java's collectors don't "count".

- Java can be real-time:

http://rtjava.blogspot.co.uk/2009/07/real-time-java-vms.html

- Garbage collection can be deterministic and real-time:

http://www.ibm.com/developerworks/java/library/j-rtj4/index.html
http://docs.oracle.com/cd/E13221_01/wlrt/docs30/intro_wlrt/tuning.html
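
For anyone who wants to see how much time collection takes on an ordinary (non-real-time) JVM, here is a small sketch using the standard java.lang.management API. The class name and the allocation loop are invented for the example. The reported value is the collectors' accumulated approximate elapsed time, not a worst-case pause, so it says nothing on its own about whether a deadline can be guaranteed.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcTimeProbe {
    /** Sum of accumulated collection time (ms) over all collectors in this JVM. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long before = totalGcMillis();
        // Allocate short-lived garbage so the collectors have something to do.
        long checksum = 0;
        for (int i = 0; i < 200_000; i++) {
            byte[] junk = new byte[1024];
            junk[0] = (byte) i;
            checksum += junk[0];
        }
        long after = totalGcMillis();
        System.out.println("GC time during loop (ms): " + (after - before)
                + ", checksum " + checksum);
    }
}
```

The numbers will vary by JVM, collector and heap settings, so treat the output as an observation about one run rather than a bound.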


RE: HADOOP in Production

Posted by Hank Cohen <ha...@altior.com>.
There is an important difference between real time and real fast

Real time means that system response must meet a fixed schedule.  
Real fast just means sooner is better.

Real time systems always have hard schedules. The schedule could be in microseconds to control a laser for making masks for semiconductor manufacturing, milliseconds to control the ignition in your car or flight controls on an F-22, or seconds for even slower moving processes. In a real time system, missing the schedule can mean that very bad things happen: planes fall from the sky, your laser printer fries its imaging drum, factories explode, etc.

Most transaction processing is happy with real fast.
The folks doing high velocity trading are pretty close to real time, but they will probably be happy with real fast.
If real fast systems miss a schedule, then someone loses money.

The reason that RTOS type operating systems are popular for real time applications is that they don't allow operations to spend indeterminate amounts of time in uninterruptible states.  Java will never qualify as a real time system because it has garbage collection, and garbage collection can lock up a system for an indefinite amount of time while it goes through marking and counting.

Hank Cohen



Re: HADOOP in Production

Posted by Michael Segel <mi...@hotmail.com>.
Funny that the OP asks about 'real time'...

This comes up quite often and it's always misunderstood.

First, when we say 'real time' many take it to mean subjective real time.  Real 'real time' would require some sort of RTOS underneath. 

Second, Hadoop is a parallelized framework made up of several components: a distributed scheduler, a distributed disk, and tools to manipulate the data.

You can use Hadoop in subjective real time scenarios. 

One common pattern is to use M/R to process the data, and HBase to deliver ad-hoc access to records, returning a result with sub-second response time.

I think that there's an upcoming talk at Strata in NY on using Hadoop (HBase and SOLR) to provide real time access.

Outside of that, yeah, Tom White's book is a great start; however, some of the feedback I've heard is that it's a dry read.
But then again, most technical books are. :-) 


On Oct 2, 2012, at 6:47 AM, Ruslan Al-Fakikh <me...@gmail.com> wrote:

> Hi,
> 
> There are too many issues to discuss I guess. I would recommend
> reading Hadoop The Definitive Guide by Tom White. There are some
> chapters for the answers.
> Also what did you mean by 'real time'? Hadoop is not designed for
> giving real time results of queries. It is rather for offline data
> analysis, because each query can take minutes or hours to finish.
> AFAIK, HBase provides some real time functionality though.
> For Hadoop automation, you can try Oozie. We are using opswise in our company
> 
> Best Regards
> 
> On Mon, Oct 1, 2012 at 5:36 PM, yogesh dhari <yo...@live.com> wrote:
>> Hi all,
>> 
>> I have understood the Hadoop and Hadoop Ecosystem(Pig as ETL, Hive as
>> DataWare house, Sqoop as importing tool). I worked and learned on single
>> node cluster with demo data.
>> 
>> As Hadoop suits best on Unix platform. Please help me to understand the
>> requirement from start to finish to use Hadoop in production.
>> 
>> What would be the things to use Hadoop on real time project.
>> 
>> like Hadoop automation on Unix, alert of failure process.
>> 
>> Please put some light on using Hadoop on real time and what objectives are
>> recommended.
>> 
>> 
>> Thanks & Regards
>> Yogesh Kumar
>> 
> 


Re: HADOOP in Production

Posted by Ruslan Al-Fakikh <me...@gmail.com>.
Hi,

There are too many issues to discuss, I guess. I would recommend
reading Hadoop: The Definitive Guide by Tom White. There are some
chapters for the answers.
Also what did you mean by 'real time'? Hadoop is not designed for
giving real time results of queries. It is rather for offline data
analysis, because each query can take minutes or hours to finish.
AFAIK, HBase provides some real time functionality though.
For Hadoop automation, you can try Oozie. We are using opswise in our company.
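
[Editor's sketch, not part of the original post.] For the "alert of failure
process" part of the question, the simplest form of the idea is a wrapper
that runs a job command and raises an alert when the command exits with a
non-zero status. This is only a minimal illustration and not how Oozie or
opswise work internally; the names run_and_alert and print_alert are
hypothetical, and the command being run is a stand-in for a real job
submission.

```python
import subprocess
import sys


def run_and_alert(command, alert):
    """Run `command`; call `alert(message)` if it exits non-zero.

    Returns the command's exit code.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        alert(
            f"job failed (exit {result.returncode}): {' '.join(command)}\n"
            f"{result.stderr}"
        )
    return result.returncode


def print_alert(message):
    # Placeholder: a real deployment might send mail or page an operator.
    print("ALERT:", message, file=sys.stderr)


if __name__ == "__main__":
    # Stand-in for a real job submission; this child process simply
    # exits with status 3 so that the alert path is exercised.
    run_and_alert([sys.executable, "-c", "import sys; sys.exit(3)"], print_alert)
```

A wrapper like this could be started from cron for scheduled jobs. A
workflow scheduler adds things this sketch leaves out, such as dependencies
between jobs, retries, and a record of past runs.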

Best Regards

On Mon, Oct 1, 2012 at 5:36 PM, yogesh dhari <yo...@live.com> wrote:
> Hi all,
>
> I have understood the Hadoop and Hadoop Ecosystem(Pig as ETL, Hive as
> DataWare house, Sqoop as importing tool). I worked and learned on single
> node cluster with demo data.
>
> As Hadoop suits best on Unix platform. Please help me to understand the
> requirement from start to finish to use Hadoop in production.
>
> What would be the things to use Hadoop on real time project.
>
> like Hadoop automation on Unix, alert of failure process.
>
> Please put some light on using Hadoop on real time and what objectives are
> recommended.
>
>
> Thanks & Regards
> Yogesh Kumar
>
