Posted to user@hadoop.apache.org by Jason Yang <li...@gmail.com> on 2012/09/14 08:03:32 UTC

What's the basic idea of pseudo-distributed Hadoop ?

Hi, all

I have a question about how a pseudo-distributed Hadoop cluster works:

When many map tasks are submitted to a pseudo-distributed Hadoop cluster,
does Hadoop run the mappers one after another, or does it run them in
parallel, for example in different threads?

-- 
YANG, Lin

Re: What's the basic idea of pseudo-distributed Hadoop ?

Posted by Hemanth Yamijala <yh...@thoughtworks.com>.

One thing to be careful about is the paths of dependent libraries or
executables such as streaming binaries. In pseudo-distributed mode, all
processes run on the same machine, so they are likely to find paths that
really exist only on the machine the job is launched from. When you run
the same job in a truly distributed environment, it will start failing
unless these files are packaged and distributed to the cluster in some
way.
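
To make this concrete, here is a minimal sketch in plain Java (no Hadoop
dependency) of a fail-fast check a task could run before it relies on local
paths. The class name DependencyCheck, the method missing, and the example
path in the usage message are hypothetical and are not part of any Hadoop
API. The check only reports which paths are absent on the machine it runs
on; it does not ship anything to the cluster, so the files would still have
to be packaged with the job or distributed in some other way.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DependencyCheck {

    // Returns the subset of the given paths that do not exist on the
    // local filesystem of the machine this code is running on.
    public static List<String> missing(List<String> paths) {
        List<String> notFound = new ArrayList<>();
        for (String path : paths) {
            if (!new File(path).exists()) {
                notFound.add(path);
            }
        }
        return notFound;
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            // The path below is only an illustration, not a real default.
            System.out.println("usage: java DependencyCheck /path/to/binary ...");
            return;
        }
        List<String> notFound = missing(Arrays.asList(args));
        if (notFound.isEmpty()) {
            System.out.println("all dependencies found on this machine");
        } else {
            System.out.println("missing on this machine: " + notFound);
        }
    }
}
```

On a pseudo-distributed setup such a check passes everywhere, because there
is only one machine. Run from inside a task on a real cluster, it would name
the paths that exist only on the launching machine.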

Thanks
hemanth

On Fri, Sep 14, 2012 at 1:04 PM, Jason Yang <li...@gmail.com> wrote:

> All right, I got it.
>
> Thanks to all of you.
>
>
> 2012/9/14 Bertrand Dechoux <de...@gmail.com>
>
>> The only difference between pseudo-distributed and fully distributed
>> would be scale. You could say that code that runs fine on the former also
>> runs fine on the latter. But that does not necessarily mean that
>> performance will scale the same way (e.g. if you keep a list of elements
>> in memory, at a bigger scale you could get an OutOfMemoryError).
>>
>> Of course, as implied in previous answers, you can't say the same about
>> standalone mode. In that mode, you could rely on global mutable static
>> state and think it's fine, without caring about distribution between the
>> nodes. In that case, the same code launched on a pseudo-distributed
>> cluster will fail to reproduce the same results.
>>
>> Regards
>>
>> Bertrand
>>
>>
>> On Fri, Sep 14, 2012 at 9:24 AM, Harsh J <ha...@cloudera.com> wrote:
>>
>>> Hi Jason,
>>>
>>> I think you're confusing standalone mode with pseudo-distributed
>>> mode. The former is a limited mode of MR where no daemons need to be
>>> deployed and the tasks run in a single JVM (via threads).
>>>
>>> A pseudo-distributed cluster is a cluster where all daemons run on a
>>> single node. Hence it is not "distributed" in the multi-node sense
>>> (no network gear is involved), but the daemons communicate in the same
>>> way (RPC, etc.) as in a fully-distributed cluster.
>>>
>>> If an MR program works fine in pseudo-distributed mode, it "should"
>>> work fine (no guarantee) in fully-distributed mode, provided all nodes
>>> have the same arch/OS, the same JVM, and the same job-specific
>>> configurations. This is because tasks execute on various nodes and may
>>> be affected by a node whose behavior or setup differs from the others,
>>> and that's something you'd have to detect or know about if that node
>>> exhibits more failures than the others.
>>>
>>> On Fri, Sep 14, 2012 at 11:58 AM, Jason Yang <li...@gmail.com>
>>> wrote:
>>> > Hey, Kai
>>> >
>>> > Thanks for your reply.
>>> >
>>> > I was wondering what the difference is between pseudo-distributed and
>>> > fully-distributed Hadoop, apart from the maximum number of map/reduce
>>> > tasks.
>>> >
>>> > And if an MR program works fine on a pseudo-distributed cluster, will
>>> > it work just as well on a fully-distributed cluster?
>>> >
>>> >
>>> > 2012/9/14 Kai Voigt <k...@123.org>
>>> >>
>>> >> The default setting is that a tasktracker can run up to two map tasks
>>> >> and two reduce tasks in parallel (mapred.tasktracker.map.tasks.maximum
>>> >> and mapred.tasktracker.reduce.tasks.maximum), so you will actually see
>>> >> some concurrency on your one machine.
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > YANG, Lin
>>> >
>>>
>>>
>>>
>>> --
>>> Harsh J
>>>
>>
>>
>>
>> --
>> Bertrand Dechoux
>>
>
>
>
> --
> YANG, Lin
>
>

Re: What's the basic idea of pseudo-distributed Hadoop ?

Posted by Jason Yang <li...@gmail.com>.

All right, I got it.

Thanks to all of you.

2012/9/14 Bertrand Dechoux <de...@gmail.com>

> The only difference between pseudo-distributed and fully distributed would
> be scale. You could say that code that runs fine on the former also runs
> fine on the latter. But that does not necessarily mean that performance
> will scale the same way (e.g. if you keep a list of elements in memory, at
> a bigger scale you could get an OutOfMemoryError).
>
> Of course, as implied in previous answers, you can't say the same about
> standalone mode. In that mode, you could rely on global mutable static
> state and think it's fine, without caring about distribution between the
> nodes. In that case, the same code launched on a pseudo-distributed
> cluster will fail to reproduce the same results.
>
> Regards
>
> Bertrand
>
>
> On Fri, Sep 14, 2012 at 9:24 AM, Harsh J <ha...@cloudera.com> wrote:
>
>> Hi Jason,
>>
>> I think you're confusing standalone mode with pseudo-distributed
>> mode. The former is a limited mode of MR where no daemons need to be
>> deployed and the tasks run in a single JVM (via threads).
>>
>> A pseudo-distributed cluster is a cluster where all daemons run on a
>> single node. Hence it is not "distributed" in the multi-node sense
>> (no network gear is involved), but the daemons communicate in the same
>> way (RPC, etc.) as in a fully-distributed cluster.
>>
>> If an MR program works fine in pseudo-distributed mode, it "should"
>> work fine (no guarantee) in fully-distributed mode, provided all nodes
>> have the same arch/OS, the same JVM, and the same job-specific
>> configurations. This is because tasks execute on various nodes and may
>> be affected by a node whose behavior or setup differs from the others,
>> and that's something you'd have to detect or know about if that node
>> exhibits more failures than the others.
>>
>> On Fri, Sep 14, 2012 at 11:58 AM, Jason Yang <li...@gmail.com>
>> wrote:
>> > Hey, Kai
>> >
>> > Thanks for your reply.
>> >
>> > I was wondering what the difference is between pseudo-distributed and
>> > fully-distributed Hadoop, apart from the maximum number of map/reduce
>> > tasks.
>> >
>> > And if an MR program works fine on a pseudo-distributed cluster, will
>> > it work just as well on a fully-distributed cluster?
>> >
>> >
>> > 2012/9/14 Kai Voigt <k...@123.org>
>> >>
>> >> The default setting is that a tasktracker can run up to two map tasks
>> >> and two reduce tasks in parallel (mapred.tasktracker.map.tasks.maximum
>> >> and mapred.tasktracker.reduce.tasks.maximum), so you will actually see
>> >> some concurrency on your one machine.
>> >
>> >
>> >
>> >
>> > --
>> > YANG, Lin
>> >
>>
>>
>>
>> --
>> Harsh J
>>
>
>
>
> --
> Bertrand Dechoux
>



-- 
YANG, Lin

Re: What's the basic idea of pseudo-distributed Hadoop ?

Posted by Bertrand Dechoux <de...@gmail.com>.

The only difference between pseudo-distributed and fully distributed would
be scale. You could say that code that runs fine on the former also runs
fine on the latter. But that does not necessarily mean that performance
will scale the same way (e.g. if you keep a list of elements in memory, at
a bigger scale you could get an OutOfMemoryError).

Of course, as implied in previous answers, you can't say the same about
standalone mode. In that mode, you could rely on global mutable static
state and think it's fine, without caring about distribution between the
nodes. In that case, the same code launched on a pseudo-distributed
cluster will fail to reproduce the same results.
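
A minimal sketch of the static-state pitfall, in plain Java with no Hadoop
dependency. The class StaticStateSketch and its methods are hypothetical
stand-ins: each "task" here is just a thread that bumps a static counter,
which mimics standalone mode where all tasks share one JVM.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class StaticStateSketch {

    // Global mutable static state: there is exactly one copy per JVM.
    static final AtomicInteger RECORDS_SEEN = new AtomicInteger();

    // Hypothetical stand-in for a map task that counts its input records.
    static void runTask(int records) {
        for (int i = 0; i < records; i++) {
            RECORDS_SEEN.incrementAndGet();
        }
    }

    // Standalone-style run: every task is a thread inside this one JVM,
    // so all of them update the same static counter.
    public static int runTasksAsThreads(int tasks, int recordsPerTask) {
        RECORDS_SEEN.set(0);
        List<Thread> threads = new ArrayList<>();
        for (int t = 0; t < tasks; t++) {
            Thread thread = new Thread(() -> runTask(recordsPerTask));
            threads.add(thread);
            thread.start();
        }
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return RECORDS_SEEN.get();
    }

    public static void main(String[] args) {
        // Four tasks of 1000 records each, all in one JVM.
        System.out.println("total seen in one JVM: " + runTasksAsThreads(4, 1000));
    }
}
```

In a single JVM the counter ends up holding the total for all four tasks.
When tasks run in separate JVMs instead, as they do once real daemons are
involved, each process has its own copy of the static field, so no single
process would ever see that total and the job's results would differ.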

Regards

Bertrand

On Fri, Sep 14, 2012 at 9:24 AM, Harsh J <ha...@cloudera.com> wrote:

> Hi Jason,
>
> I think you're confusing standalone mode with pseudo-distributed
> mode. The former is a limited mode of MR where no daemons need to be
> deployed and the tasks run in a single JVM (via threads).
>
> A pseudo-distributed cluster is a cluster where all daemons run on a
> single node. Hence it is not "distributed" in the multi-node sense
> (no network gear is involved), but the daemons communicate in the same
> way (RPC, etc.) as in a fully-distributed cluster.
>
> If an MR program works fine in pseudo-distributed mode, it "should"
> work fine (no guarantee) in fully-distributed mode, provided all nodes
> have the same arch/OS, the same JVM, and the same job-specific
> configurations. This is because tasks execute on various nodes and may
> be affected by a node whose behavior or setup differs from the others,
> and that's something you'd have to detect or know about if that node
> exhibits more failures than the others.
>
> On Fri, Sep 14, 2012 at 11:58 AM, Jason Yang <li...@gmail.com>
> wrote:
> > Hey, Kai
> >
> > Thanks for your reply.
> >
> > I was wondering what the difference is between pseudo-distributed and
> > fully-distributed Hadoop, apart from the maximum number of map/reduce
> > tasks.
> >
> > And if an MR program works fine on a pseudo-distributed cluster, will
> > it work just as well on a fully-distributed cluster?
> >
> >
> > 2012/9/14 Kai Voigt <k...@123.org>
> >>
> >> The default setting is that a tasktracker can run up to two map tasks
> >> and two reduce tasks in parallel (mapred.tasktracker.map.tasks.maximum
> >> and mapred.tasktracker.reduce.tasks.maximum), so you will actually see
> >> some concurrency on your one machine.
> >
> >
> >
> >
> > --
> > YANG, Lin
> >
>
>
>
> --
> Harsh J
>



-- 
Bertrand Dechoux

Re: What's the basic idea of pseudo-distributed Hadoop ?

Posted by Bertrand Dechoux <de...@gmail.com>.

The only difference between pseudo-distributed and fully distributed would
be scale. You could say that code that runs fine on the former, runs fine
too on the latter. But it does not necessary mean that the performance will
scale the same way (ie if you keep a list of elements in memory, at bigger
scale you could receive OOME).

Of course, like it has been implied in previous answers, you can't say the
same with standalone. With this mode, you could use a global mutable static
state thinking it's fine without caring about distribution between the
nodes. In that case, the same code launched on pseudo-distributed will fail
to replicate the same results.

Regards

Bertrand

On Fri, Sep 14, 2012 at 9:24 AM, Harsh J <ha...@cloudera.com> wrote:

> Hi Jason,
>
> I think you're confusing the standalone mode with a pseudo-distributed
> mode. The former is a limited mode of MR where no daemons need to be
> deployed and the tasks run in a single JVM (via threads).
>
> A pseudo distributed cluster is a cluster where all daemons are
> running on one node itself. Hence, not "distributed" in the sense of
> multi-nodes (no use of an network gear) but works in the same way
> between nodes (RPC, etc.) as a fully-distributed one.
>
> If an MR program works fine in a pseudo-distributed mode, it "should"
> work (no guarantee) fine in a fully-distributed mode iff all nodes
> have the same arch/OS, same JVM, and job-specific configurations. This
> is because tasks execute on various nodes and may be affected by the
> node's behavior or setup that is different from others - and thats
> something you'd have to detect/know about if it exhibits failures more
> than others.
>
> On Fri, Sep 14, 2012 at 11:58 AM, Jason Yang <li...@gmail.com>
> wrote:
> > Hey, Kai
> >
> > Thanks for you reply.
> >
> > I was wondering what's difference btw the pseudo-distributed and
> > fully-distributed hadoop, except the maximum number of map/reduce.
> >
> > And if a MR program works fine in pseudo-distributed cluster, will it
> work
> > exactly fine in the fully-distributed cluster ?
> >
> >
> > 2012/9/14 Kai Voigt <k...@123.org>
> >>
> >> e default setting is that a tasktracker can run up to two map and reduce
> >> tasks in parallel (mapred.tasktracker.map.tasks.maximum and
> >> mapred.tasktracker.reduce.tasks.maximum), so you will actually see some
> >> concurrency on your one machine.
> >
> >
> >
> >
> > --
> > YANG, Lin
> >
>
>
>
> --
> Harsh J
>



-- 
Bertrand Dechoux

Re: What's the basic idea of pseudo-distributed Hadoop ?

Posted by Bertrand Dechoux <de...@gmail.com>.

The only difference between pseudo-distributed and fully distributed would
be scale. You could say that code that runs fine on the former, runs fine
too on the latter. But it does not necessary mean that the performance will
scale the same way (ie if you keep a list of elements in memory, at bigger
scale you could receive OOME).

Of course, like it has been implied in previous answers, you can't say the
same with standalone. With this mode, you could use a global mutable static
state thinking it's fine without caring about distribution between the
nodes. In that case, the same code launched on pseudo-distributed will fail
to replicate the same results.

Regards

Bertrand

On Fri, Sep 14, 2012 at 9:24 AM, Harsh J <ha...@cloudera.com> wrote:

> Hi Jason,
>
> I think you're confusing standalone mode with pseudo-distributed
> mode. The former is a limited mode of MR where no daemons need to be
> deployed and the tasks run in a single JVM (via threads).
>
> A pseudo-distributed cluster is a cluster where all daemons are
> running on a single node. Hence it is not "distributed" in the sense of
> multiple nodes (no network gear is used), but the daemons communicate in
> the same way (RPC, etc.) as in a fully-distributed one.
>
> If an MR program works fine in pseudo-distributed mode, it "should"
> work fine (no guarantee) in fully-distributed mode, provided all nodes
> have the same arch/OS, the same JVM, and the same job-specific
> configuration. This is because tasks execute on various nodes and may be
> affected by a node's behavior or setup that differs from the others, and
> that's something you'd have to detect/know about if a node exhibits more
> failures than others.
>
> On Fri, Sep 14, 2012 at 11:58 AM, Jason Yang <li...@gmail.com>
> wrote:
> > Hey, Kai
> >
> > Thanks for your reply.
> >
> > I was wondering what the difference is between pseudo-distributed and
> > fully-distributed Hadoop, apart from the maximum number of map/reduce
> > tasks.
> >
> > And if an MR program works fine in a pseudo-distributed cluster, will it
> > work just as well in a fully-distributed cluster?
> >
> >
> > 2012/9/14 Kai Voigt <k...@123.org>
> >>
> >> The default setting is that a tasktracker can run up to two map and reduce
> >> tasks in parallel (mapred.tasktracker.map.tasks.maximum and
> >> mapred.tasktracker.reduce.tasks.maximum), so you will actually see some
> >> concurrency on your one machine.
> >
> >
> >
> >
> > --
> > YANG, Lin
> >
>
>
>
> --
> Harsh J
>



-- 
Bertrand Dechoux

Re: What's the basic idea of pseudo-distributed Hadoop ?

Posted by Harsh J <ha...@cloudera.com>.

Hi Jason,

I think you're confusing standalone mode with pseudo-distributed
mode. The former is a limited mode of MR where no daemons need to be
deployed and the tasks run in a single JVM (via threads).

A pseudo-distributed cluster is a cluster where all daemons are
running on a single node. Hence it is not "distributed" in the sense of
multiple nodes (no network gear is used), but the daemons communicate in
the same way (RPC, etc.) as in a fully-distributed one.

If an MR program works fine in pseudo-distributed mode, it "should"
work fine (no guarantee) in fully-distributed mode, provided all nodes
have the same arch/OS, the same JVM, and the same job-specific
configuration. This is because tasks execute on various nodes and may be
affected by a node's behavior or setup that differs from the others, and
that's something you'd have to detect/know about if a node exhibits more
failures than others.
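
One rough way to compare nodes is to print the same few JVM system
properties on each of them and diff the output. This is only a sketch: the
class name is hypothetical, and it covers arch/OS/JVM version only, not
job-specific configuration:

```java
class NodeFingerprint {
    // Builds a one-line summary from standard JVM system properties.
    static String describe() {
        return String.join(" | ",
                "os.name=" + System.getProperty("os.name"),
                "os.arch=" + System.getProperty("os.arch"),
                "java.version=" + System.getProperty("java.version"));
    }

    public static void main(String[] args) {
        // Run on every node; lines that differ point at a node worth checking.
        System.out.println(describe());
    }
}
```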

On Fri, Sep 14, 2012 at 11:58 AM, Jason Yang <li...@gmail.com> wrote:
> Hey, Kai
>
> Thanks for your reply.
>
> I was wondering what the difference is between pseudo-distributed and
> fully-distributed Hadoop, apart from the maximum number of map/reduce
> tasks.
>
> And if an MR program works fine in a pseudo-distributed cluster, will it
> work just as well in a fully-distributed cluster?
>
>
> 2012/9/14 Kai Voigt <k...@123.org>
>>
>> The default setting is that a tasktracker can run up to two map and reduce
>> tasks in parallel (mapred.tasktracker.map.tasks.maximum and
>> mapred.tasktracker.reduce.tasks.maximum), so you will actually see some
>> concurrency on your one machine.
>
>
>
>
> --
> YANG, Lin
>

-- 
Harsh J

Re: What's the basic idea of pseudo-distributed Hadoop ?

Posted by Jason Yang <li...@gmail.com>.

Hey, Kai

Thanks for your reply.

I was wondering what the difference is between pseudo-distributed and
fully-distributed Hadoop, apart from the maximum number of map/reduce
tasks.

And if an MR program works fine in a pseudo-distributed cluster, will it
work just as well in a fully-distributed cluster?

2012/9/14 Kai Voigt <k...@123.org>

> The default setting is that a tasktracker can run up to two map and reduce
> tasks in parallel (mapred.tasktracker.map.tasks.maximum and
> mapred.tasktracker.reduce.tasks.maximum), so you will actually see some
> concurrency on your one machine.
>



-- 
YANG, Lin

Re: What's the basic idea of pseudo-distributed Hadoop ?

Posted by Kai Voigt <k...@123.org>.

Hello.

On 14.09.2012 at 08:03, Jason Yang <li...@gmail.com> wrote:

> I have a question about how does the pseudo-distributed Hadoop cluster work:
> 
> As many map tasks are submitted to the pseudo-distributed Hadoop cluster, does the hadoop run each mapper in sequence ? or does it run these mappers in different threads or something could be parallel?

Pseudo-distributed mode is a one-node cluster. You have a namenode, a jobtracker, and a single datanode and tasktracker running. You can verify this with the "jps" command.

The default setting is that a tasktracker can run up to two map tasks and two reduce tasks in parallel (mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum), so you will actually see some concurrency on your one machine.
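
To picture what a slot limit of two means, here is a toy model in plain Java. It is not how the tasktracker is implemented (the tasktracker launches tasks as separate child JVMs, not pool threads); the class name, the 20 ms sleep and the thread pool are assumptions made only for illustration:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class SlotSketch {
    // Runs `tasks` dummy tasks on `slots` worker threads and reports the
    // highest number of tasks that were observed running at the same time.
    static int peakConcurrency(int slots, int tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(slots);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                int now = running.incrementAndGet();
                peak.accumulateAndGet(now, Math::max);
                try {
                    Thread.sleep(20); // pretend to do some map work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                running.decrementAndGet();
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        // Six queued tasks, two slots: the peak never exceeds the slot count.
        System.out.println("peak concurrency: " + peakConcurrency(2, 6));
    }
}
```

However many tasks are queued, no more than two run at once; the rest wait for a free slot, which is the behaviour the two properties above control for map and reduce tasks respectively.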

Kai

-- 
Kai Voigt
k@123.org
