Posted to dev@hama.apache.org by Apurv Verma <da...@gmail.com> on 2011/11/11 13:43:46 UTC

Absolute Newbie

Hi all,


   1. I am an absolute newbie to Hama and Hadoop. Should I learn Hadoop
   first, before I begin contributing to this project?

   2. I don't exactly understand how Hama works or what it is. All I
   understand is that it's a graph library built on top of Hadoop's
   distributed architecture.
   Where can I learn the basics of Hama? And, as I asked above, do I also
   need to learn Hadoop?

   3. On the getting started page, the instructions use Maven and SVN.
   I have experience with git but not with these. I found the GitHub
   mirror repository, have forked it, and would like to work only through
   it. Is that OK?

   4. To begin with, I found this issue for newbies: HAMA-469.
   https://issues.apache.org/jira/browse/HAMA-469
   It says that the statusUpdate() method should be called finally. What
   I can see is that there is a call

   umbilical.statusUpdate(taskId, currentTaskStatus);

   which I will put in a finally block. I don't understand what this piece
   of code is meant to do. As far as I understand, there is a cyclic
   barrier kind of thing that creates a rendezvous for many threads: some
   messages are combined and the function returns. I am still lost in
   understanding the codebase.


   5. I also found that I can apply to Apache for a mentor. Here is my
   skill set [0]; I wish to become a long-term contributor to projects
   centred around Hadoop.
   [0] http://in.linkedin.com/in/apurv5
   I am really looking forward to becoming a full-fledged contributor
   within six months.



--
thanks and regards,

Apurv Verma
B. Tech.(CSE)
IIT- Ropar

Re: Absolute Newbie

Posted by Thomas Jungblut <th...@googlemail.com>.

Hey,

Great that you're making progress. I know it takes a lot of time, and I
would be very glad to help you along the way.

Let me just clarify one thing:

> Then there are slaves or GroomServers which do the tasks.

That is quite accurate, but the GroomServers don't really execute the
tasks themselves.
Tasks run in their own processes, so the Grooms are responsible for
launching those processes.
Each running task then sends its status updates to its own groom, which
runs on the local machine.
That is what HAMA-469 is all about: sending a correct status to the groom
(= umbilical).

> Here is my question:
> Normally when we parallelize an algorithm, we split it into many threads
> and then combine the answers returned by them in the master, so isn't
> ZooKeeper a part of the master only? Why is it separate here? Don't the
> GroomServers return the results to the BSPMaster, and the BSPMaster
> combines them? Where does ZooKeeper fit in here?


We do have the parallelization you speak of, but we don't need to
combine the results. Look at the wonderful schema picture on Wikipedia: do
the tasks combine their results? (Hint: no, they don't! They just
communicate with each other!)
BSP does not define a "result" or a "return", so we have no need to
combine anything, unlike MapReduce, which focuses much more on crunching
the input and getting a result from the two functions Map and Reduce.
If you know the difference between a function and a procedure: BSP is a
procedure; map and reduce are functions.

Instead of threads we use processes, which are more expensive than
threads but more fault-tolerant. (For example, what if one thread
bails out? It could tear down the whole computation on the host, because
the other threads would go down as well. Bad example, I know, but this is
largely the case. Processes can be restarted and don't damage the other
tasks/processes.)

ZooKeeper fits in as a synchronization service. It is basically a
synchronization helper for a big cluster. Think of semaphores or
mutual exclusion, but across multiple servers.
In our case we need barrier synchronization; the Java API equivalent is
CyclicBarrier or CountDownLatch. But those only coordinate threads inside
a single JVM, so for a distributed environment we use ZooKeeper ;)
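
For readers who have not used these classes, here is a minimal single-JVM
sketch of barrier synchronization with CyclicBarrier. It is not Hama code:
the class name BarrierSketch, the method runSupersteps and the
thread-per-peer setup are assumptions made for illustration. On a real
cluster the peers are separate processes on separate machines, which is
exactly where a service like ZooKeeper would take over the barrier's role.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical single-JVM illustration of a BSP-style barrier; not Hama code.
class BarrierSketch {

  // Runs `peers` threads through `supersteps` rounds and returns how many
  // times the barrier tripped (once per completed superstep).
  static int runSupersteps(int peers, int supersteps) {
    AtomicInteger completedSupersteps = new AtomicInteger();
    // The barrier action runs exactly once each time all peers have arrived.
    CyclicBarrier barrier =
        new CyclicBarrier(peers, completedSupersteps::incrementAndGet);

    Thread[] workers = new Thread[peers];
    for (int i = 0; i < peers; i++) {
      workers[i] = new Thread(() -> {
        try {
          for (int step = 0; step < supersteps; step++) {
            // Local computation and message sending would happen here.
            barrier.await(); // block until every peer has reached this point
          }
        } catch (InterruptedException | BrokenBarrierException e) {
          throw new IllegalStateException("barrier failed", e);
        }
      });
      workers[i].start();
    }
    for (Thread worker : workers) {
      try {
        worker.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("interrupted while waiting", e);
      }
    }
    return completedSupersteps.get();
  }

  public static void main(String[] args) {
    System.out.println(runSupersteps(3, 2));
  }
}
```

No peer can start superstep n+1 before every peer has finished superstep n,
which is the only guarantee the barrier gives; nothing is "combined" or
"returned" to a master.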

Hope this helps :)
Good luck with your studies, and come back if you have questions.

Regards,
Thomas


2011/11/12 Apurv Verma <da...@gmail.com>

> Hi,
> The Hama community is really very helpful. I just wanted to write back
> to let you know that reading and understanding all the links is taking
> some time.
> I have understood the basic overview of Hama and BSP.

-- 
Thomas Jungblut
Berlin <th...@gmail.com>

Re: Absolute Newbie

Posted by Apurv Verma <da...@gmail.com>.

Hi,
The Hama community is really very helpful. I just wanted to write back
to let you know that reading and understanding all the links is taking
some time.
I have understood the basic overview of Hama and BSP.
Basically, here is what I have understood:

There is a BSPMaster, which acts as the master and makes all the
decisions, such as how to schedule, etc.
Then there are slaves or GroomServers which do the tasks.
Then there is ZooKeeper, which does the barrier synchronization.

Here is my question:
Normally when we parallelize an algorithm, we split it into many threads
and then combine the answers returned by them in the master, so isn't
ZooKeeper a part of the master only? Why is it separate here? Don't the
GroomServers return the results to the BSPMaster, and the BSPMaster
combines them? Where does ZooKeeper fit in here?


Now I am trying to understand HDFS and how the parallel graph search
algorithm, which is given as an example in this presentation [0], works.
I will get back as soon as I have done these.

[0]
http://www.slideshare.net/guest20d395b/apache-hama-an-introduction-tobulk-synchronization-parallel-on-hadoop
--
thanks and regards,

Apurv Verma
B. Tech.(CSE)
IIT- Ropar






On Fri, Nov 11, 2011 at 7:31 PM, Thomas Jungblut <
thomas.jungblut@googlemail.com> wrote:

> Hey,
>
> thanks for your interest. It is currently a bit chaotic and not well
> documented, but that's open source ;))
> I'll answer your questions one by one.

Re: Absolute Newbie

Posted by Thomas Jungblut <th...@googlemail.com>.

Hey,

thanks for your interest. It is currently a bit chaotic and not well
documented, but that's open source ;))
I'll answer your questions one by one.

> 1. I am an absolute newbie to Hama and Hadoop. Should I learn Hadoop
>    first, before I begin contributing to this project?


We officially use only HDFS, so it is enough if you're familiar with the
FileSystem API [1].
This includes being familiar with the Writable interface [2], which
lets you serialize and deserialize objects.
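
To give an idea of what that serialization contract looks like, here is a
small sketch. So that it compiles without Hadoop on the classpath, it uses
a local stand-in interface (WritableLike) that mirrors the two methods of
Hadoop's Writable, write(DataOutput) and readFields(DataInput). The
VertexMessage type, its fields and the roundTrip helper are invented for
illustration and are not part of Hama or Hadoop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Local stand-in mirroring the two methods of org.apache.hadoop.io.Writable.
interface WritableLike {
  void write(DataOutput out) throws IOException;

  void readFields(DataInput in) throws IOException;
}

// Hypothetical message type; name and fields are assumptions for this sketch.
class VertexMessage implements WritableLike {
  int vertexId;
  double value;

  VertexMessage() {
    // A no-argument constructor lets a framework create an empty instance
    // and then fill it via readFields.
  }

  VertexMessage(int vertexId, double value) {
    this.vertexId = vertexId;
    this.value = value;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(vertexId);
    out.writeDouble(value);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    // Fields must be read in the same order in which they were written.
    vertexId = in.readInt();
    value = in.readDouble();
  }

  // Serializes to a byte array and deserializes into a fresh instance.
  static VertexMessage roundTrip(VertexMessage original) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      original.write(new DataOutputStream(bytes));
      VertexMessage copy = new VertexMessage();
      copy.readFields(
          new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
      return copy;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    VertexMessage copy = roundTrip(new VertexMessage(7, 0.25));
    System.out.println(copy.vertexId + " " + copy.value);
  }
}
```

With Hadoop available, the class would implement Writable directly instead
of the stand-in; the method bodies would stay the same.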

> 2. I don't exactly understand how Hama works or what it is. All I
>    understand is that it's a graph library built on top of Hadoop's
>    distributed architecture.
>    Where can I learn the basics of Hama? And, as I asked above, do I also
>    need to learn Hadoop?


It is not necessarily a graph library; we are a BSP (Bulk Synchronous
Parallel) framework. You can familiarize yourself with BSP by reading the
Wikipedia article [3].
However, you can solve graph problems with it, as well as matrix operations
or other fancy stuff like real-time stream processing.
As with the last question, you don't need to understand MapReduce (I guess
that's what you mean by Hadoop in this case) to understand BSP, but once
you understand BSP, you will understand MapReduce. Hope you get the
direction ;)

> 3. On the getting started page, the instructions use Maven and SVN.
>    I have experience with git but not with these. I found the GitHub
>    mirror repository, have forked it, and would like to work only through
>    it. Is that OK?


We work with patches (which are unified diffs); this also works with
git. Sadly you can't skip Maven; it is a must-have.
If you are aiming to be a long-term committer, no matter on which Apache
project, you will have to know how to use SVN.
The git repository is read-only and is constantly mirrored from SVN.
SVN is really easy, in my opinion easier than git, so this won't be a
problem.

> 4. To begin with, I found this issue for newbies: HAMA-469.
>    https://issues.apache.org/jira/browse/HAMA-469
>    It says that the statusUpdate() method should be called finally. What
>    I can see is that there is a call
>    umbilical.statusUpdate(taskId, currentTaskStatus);
>    which I will put in a finally block. I don't understand what this piece
>    of code is meant to do. As far as I understand, there is a cyclic
>    barrier kind of thing that creates a rendezvous for many threads: some
>    messages are combined and the function returns. I am still lost in
>    understanding the codebase.


Great that you've already found our issue tracker and the newbie issues.
Sadly the description does not cover everything, e.g. the motivation.
A quick explanation: if a failure occurs in the sync method, we want to
update the "umbilical" so that it knows the sync has failed.
Adding a finally block is not the right way; you should take a look at the
catch clause.
Currently there is only an error log, but we want to make the status
update in this clause and make the process fail (= throw a runtime
exception).

Once you have read the Wikipedia article, I hope you will know what the
sync method should do (send messages!).
This isn't the whole story yet, but I think you can explore the rest
yourself by debugging a bit.
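
One possible shape of the change described above, as a rough sketch only.
Every type here is a hypothetical stand-in: the Umbilical interface, the
String task id and status, the "FAILED" value and the Runnable used in
place of the real barrier are assumptions, not Hama's actual classes or
signatures. The point is only the control flow in the catch clause: report
the failure, then make the task fail.

```java
import java.util.ArrayList;
import java.util.List;

// All types are hypothetical stand-ins for illustration; not Hama's classes.
class SyncFailureSketch {

  // Stand-in for the task-to-groom channel mentioned in the thread.
  interface Umbilical {
    void statusUpdate(String taskId, String status);
  }

  // Records updates so the sketch can be run without a GroomServer.
  static class RecordingUmbilical implements Umbilical {
    final List<String> updates = new ArrayList<>();

    @Override
    public void statusUpdate(String taskId, String status) {
      updates.add(taskId + ":" + status);
    }
  }

  // `barrier` stands in for the real synchronization step; it may throw.
  static void sync(String taskId, Umbilical umbilical, Runnable barrier) {
    try {
      barrier.run();
    } catch (Exception e) {
      // Tell the groom that the sync failed instead of only logging it ...
      umbilical.statusUpdate(taskId, "FAILED");
      // ... and make the task process fail by rethrowing unchecked.
      throw new RuntimeException("sync failed for " + taskId, e);
    }
  }

  public static void main(String[] args) {
    RecordingUmbilical umbilical = new RecordingUmbilical();
    try {
      sync("task_0", umbilical, () -> {
        throw new IllegalStateException("barrier broken");
      });
    } catch (RuntimeException expected) {
      System.out.println(umbilical.updates);
    }
  }
}
```

A finally block would also send the update after a successful sync, which
is why the catch clause is the better place: the failure status is reported
only on the failure path.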

> 5. I also found that I can apply to Apache for a mentor. Here is my
>    skill set [0]; I wish to become a long-term contributor to projects
>    centred around Hadoop.
>    [0] http://in.linkedin.com/in/apurv5
>    I am really looking forward to becoming a full-fledged contributor
>    within six months.


Nice CV, but it is enough if you can code in Java and are creative in
finding solutions, and in actually making them run as well.
I'm not sure if I can mentor you, but I guess we are all able to help you
once you face a problem.
Just ask on the mailing list or mail me directly ;)

Hope I clarified a few things. Looking forward to hearing from you!

Thomas

[1] http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html
[2] http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Writable.html
[3] http://en.wikipedia.org/wiki/Bulk_synchronous_parallel


-- 
Thomas Jungblut
Berlin <th...@gmail.com>