Posted to user@hadoop.apache.org by Mike Wenzel <mw...@proheris.de> on 2016/06/09 09:15:27 UTC

Looking for documentation/guides on Hadoop 2.7.2

Hey everyone. I started learning about Hadoop a few weeks ago. My task is to understand the Hadoop ecosystem and be able to answer some questions about it. First I read the book "O'Reilly - Hadoop: The Definitive Guide". After reading it I had a first idea of how the components work together, but the book didn't help me understand what's actually going on. In my opinion it describes in-depth details about the various components, which didn't help me understand the Hadoop ecosystem as a whole.

Then I started working with it. I installed a VM (SUSE Leap 42.1) and followed the https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html guide.
After that I started working with files on it. I wrote my first simple mapper and reducer and analyzed my Apache log for testing. This worked well so far.

But here are my problems:
1) All I know about installing Hadoop right now is: unpack a .tar.gz. I ran some shell scripts and everything worked fine. But I have no idea which components are now installed on the VM, or where they are located.

2) Furthermore, I'm missing all kinds of information about setting those components up. At one point the Apache guide says "Now check that you can ssh to the localhost without a passphrase" and "If you cannot ssh to localhost without a passphrase, execute the following commands:". I'd like to know what I am actually doing here. WHY do I need ssh running on localhost, and WHY does it have to work without a passphrase? What other ways of configuring this exist?

3) Same with the next point: "The following instructions are to run a MapReduce job locally. If you want to execute a job on YARN, see YARN on Single Node." and "Format the filesystem: $ bin/hdfs namenode -format". I have no clue how HDFS works internally. To me, a filesystem is something where I can set up partitions mounted on folders. So how am I supposed to explain HDFS to someone else?
I understand the basics of how data is stored: files are split into blocks, the blocks are spread around the cluster, and metadata is stored separately. But if someone asks me "How can this be called a filesystem if you install it by unpacking a .tar.gz?", I simply can't answer.

So I'm now looking for documentation/a guide covering:
- What requirements do I have?
-- Do I have to use a specific filesystem? If yes/no, why, and what would you recommend?
-- How should I partition my VM?
-- On which partition should I install which components?
- Setting up a VM with Hadoop
- Configuring Hadoop step by step
- Setting up all kinds of daemons/nodes manually, with an explanation of where they are located (how they work) and how they should be configured

I'm currently reading https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/ClusterSetup.html, but from a first reading this guide tells you what to write in which configuration file, not why you should do it. After getting an idea of what Hadoop is, I feel left alone in the dark. I hope some of you can show me a way back onto the road.
For me it's very important not to just write some configuration somewhere. I need to understand what's going on, because once I have a running cluster I need to be sure I can handle all of this before going into production with it.

Best Regards
Mike

Re: Looking for documentation/guides on Hadoop 2.7.2

Posted by Mike Wenzel <mw...@proheris.de>.
Hi Anu and Johny,

first of all I want to thank both of you for replying. I didn't mean to come across as frustrated as it apparently seemed. I'm doing fine and my interest in Hadoop is huge :)

@Anu: I did a free online course by Cloudera / Udacity which helped me a lot to clear things up regarding the Hadoop ecosystem. At least I think it's enough for me right now. Taking a second look at the book afterwards also helped. Yes, I have the 4th edition and I will revisit the specific chapters. Thanks for the advice.

Regarding SSH: Yeah, I admit this is a gap in my Linux knowledge. I'm sorry for that. Today I don't even know why I didn't research SSH to get a basic understanding of the keygen usage. Anyway, you're totally right here.

Regarding YARN and HDFS: It seems I didn't clearly show what I'm missing here. I (think I) have a basic knowledge of YARN, which is fine for now; it will need to be extended and improved later anyway.

When I ran through the guide "Hadoop: Setting up a Single Node Cluster", it says "Format the filesystem: $ bin/hdfs namenode -format". After that, HDFS was formatted and almost ready to start working with.
> My thoughts: Where did the filesystem get installed?
> How can I change these settings to another location?
I assume that when I set up a real Hadoop cluster, I won't want HDFS living in /tmp.
"Make the HDFS directories required to execute MapReduce jobs:"
I thought:
> Why are those required to run MapReduce jobs? Do I only need them for this specific example/guide, or do I need them no matter how I install Hadoop?

I think answering all those possible questions inline would make the guide worse, because most people want a simple step-by-step guide. And that's totally fine; maybe it's just me who didn't find other documentation before doing this. But at the points shown above I had these questions and couldn't simply move on without keeping them in mind and thinking about them. I would have loved to see links there like "for further details, check the in-depth HDFS configuration guide", pointing to a specific guide.

This is my first time getting my hands on Hadoop; I ran all of this in a VM just for initial testing purposes. I know that VMs shouldn't be used for a real cluster.

To answer your question:
"Please let us know what is challenging for you in the current set of instructions. Are you able to setup single instance, pseudo instance and then progress to a cluster setup?."
I did all 3 steps in the guide "Hadoop: Setting up a Single Node Cluster" and everything worked fine. For me the challenge at this point is answering people's questions about the system. By following the guide I installed software on my PC while having no idea which components, how many of them, and where they got installed.

Maybe I asked too early. For now I'll try to set up multiple machines, each with its specific job, and ask/research my questions as they come up.

Best Regards,
Mike.



Re: Looking for documentation/guides on Hadoop 2.7.2

Posted by johny casanova <pc...@outlook.com>.
Mike,


Here is a guide on how to do some of the work, but it uses Ambari rather than just the tar.gz. It can help you understand how to piece certain things together: https://cwiki.apache.org/confluence/display/AMBARI/Start+Guide+Using+Centos+6.x  This helped me understand more when I was in the same position as you.



Re: Looking for documentation/guides on Hadoop 2.7.2

Posted by Anu Engineer <ae...@hortonworks.com>.
Hi Mike,

I am sorry your experience with setting up Hadoop has been frustrating and mysterious. I will try to give partial answers / pointers to where you should be looking. Please be patient with me.


> After reading the book I had a first idea of how the components work together, but the book didn't help me understand what's actually going on

I generally recommend this book to anyone starting off with Hadoop, and IMHO it is the best book for an overview of Hadoop.



> All I know about installing Hadoop right now is: unpack a .tar.gz. I ran some shell scripts and everything worked fine

I would have presumed that the book tells you about the various components – HDFS, MapReduce, YARN, etc. If you are using the 4th edition, please look at Chapters 2, 3, and 4.



> Furthermore, I'm missing all kinds of information about setting those components up. At one point the Apache guide says "Now check that you can ssh to the localhost without a passphrase"

Thank you for the feedback. Hadoop relies on an underlying operating system (for example, Hadoop generally runs on top of Linux), and we assume that you understand these underlying layers.

SSH is used extensively in the Linux world, and when you run into a problem like this, Google is your friend. I just typed SSH into Google and the first link was https://en.wikipedia.org/wiki/Secure_Shell; the section on public/private keys (another time to reach out to your friend Google if you don't understand how that works) explains how passwordless logins work. The reason the guide insists on it: the start-up scripts (start-dfs.sh and friends) use SSH to launch the daemons on each node, localhost included, and they must be able to log in non-interactively. I understand your frustration, but explaining SSH in our documentation would just frustrate most of our users.
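In case it is useful, here is a minimal sketch of the standard passwordless setup (nothing Hadoop-specific about it; the key type and paths are just the usual defaults):

# Generate a key pair with an empty passphrase (-P ''),
# then authorize it for logins to this same machine.
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
# Verify: this should now log you in without prompting.
$ ssh localhost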



> The following instructions are to run a MapReduce job locally. If you want to execute a job on YARN, see YARN on Single Node.

Please take a look at Chapter 2 of the "Definitive Guide", and for YARN please look at Chapter 4; they have excellent explanations of both. If you are saying that Apache's documentation is not as good as these external resources: yes, we are aware of that. Would you like to help us address it?



> "Format the filesystem: $ bin/hdfs namenode -format". I have no clue how HDFS works internally.

Isn't that the beauty of file systems and databases in general, that you don't have to master the intricate details of B+ trees or query optimization? Since we are open source, we encourage people like you who want to understand more to read the source.
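To your related question of where the data actually lands: in the single-node setup everything defaults to hadoop.tmp.dir, which is /tmp/hadoop-${user.name} unless you override it. A quick way to see it for yourself (paths assume those defaults):

$ bin/hdfs namenode -format
# With default settings the NameNode metadata ends up under
# /tmp/hadoop-<user>/dfs/name; the fsimage and edits files live there:
$ ls -R /tmp/hadoop-$USER/dfs/name
# To keep it somewhere permanent, set hadoop.tmp.dir (or the more
# specific dfs.namenode.name.dir) in etc/hadoop/core-site.xml /
# hdfs-site.xml before formatting.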



> For me a filesystem is where I can set up partitions mounted on folders. So how am I supposed to explain HDFS to someone else?

Not to offend you, but I feel that is a very limited view of file systems. There are a large number of file systems beyond the ones that live on partitions. If you are asking why HDFS deserves to be called a file system, the simplest answer is that it offers POSIX-like file system semantics; that is, it looks and acts like a file system.
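You can see that for yourself: the same operations you would run against a local filesystem work against HDFS, while fsck shows the blocks hiding underneath (the log file name here is just an example):

# Directories, copies and listings behave like any other filesystem:
$ bin/hdfs dfs -mkdir -p /user/$USER/logs
$ bin/hdfs dfs -put access.log /user/$USER/logs/
$ bin/hdfs dfs -ls /user/$USER/logs
# ...but underneath, the file is stored as replicated blocks; fsck
# reports the blocks and which datanodes hold them:
$ bin/hdfs fsck /user/$USER/logs/access.log -files -blocks -locations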



Ø   "How can this be called filesystem if you install it by unpacking a .tar.gz?" I simply can't answer this question in any way.

Again, Google is your friend: https://en.wikipedia.org/wiki/Clustered_file_system#Distributed_file_systems. Please take a look at different distributed file systems to get an expanded perspective. Like other distributed file systems, HDFS is implemented as ordinary user-space services, so "installing" it really is just unpacking binaries; the filesystem itself only comes into existence once you format and start it.




> So I'm now looking for documentation/a guide covering:
> - What requirements do I have?

It is very difficult to answer given the current lack of advancement in mind reading ☺ (Just kidding). You have a problem to solve, and most problems can be broken down into a programming pattern offered by the Hadoop ecosystem.

You might have a big data storage problem, in which case HDFS might be your solution. You might want to run computations on top of it, in which case MapReduce, Spark, etc. might be your solution. If you want a scalable key-value store, HBase might help you; if you want to run SQL-like queries, Hive might offer a solution. If you have a specific problem, post the question to the user group and someone will generally answer it.

-- Do I have to use a specific filesystem? If yes/no, why, and what would you recommend?
Any file system that can store a large number of files works well. We have seen HDFS run on top of all kinds of file systems: ext4, XFS, etc. In other words, use the physical file system you like, and if you run into any issues please report them here or in the dev group.

-- How should I partition my VM?
Generally, we do not recommend running in VMs. The book you were referring to has a section called Part 3, Hadoop Operations; the first few chapters deal with this. Alternatively, asking our friend Google for "hadoop cluster hardware" gives many links and recommendations. I would read the blogs from Cloudera and Hortonworks.

-- On which partition should I install which components?
Again, this is a very specific question; answering it requires us to understand your cluster configuration. The general answer is: keep your Hadoop binaries, conf, and logs separate from your data files.
Protect your data directories with physical file system permissions and run Hadoop under dedicated users such as hdfs, yarn, etc.
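As a rough illustration (the directory names, user, and group here are just one common convention, not a requirement), that separation might look like this:

# Hypothetical layout: binaries, logs and data disks kept apart.
$ sudo mkdir -p /opt/hadoop /var/log/hadoop /data/1/dfs /data/2/dfs
$ sudo chown -R hdfs:hadoop /data/1/dfs /data/2/dfs
$ sudo chmod 700 /data/1/dfs /data/2/dfs
# Then point HDFS at the data disks in etc/hadoop/hdfs-site.xml:
#   dfs.namenode.name.dir = file:///data/1/dfs/nn
#   dfs.datanode.data.dir = file:///data/1/dfs/dn,file:///data/2/dfs/dn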

- Setting up a VM with Hadoop
Apache does not recommend that you run it in VMs. If you really want to do this, you might want to look at the documentation provided by virtualization/cloud providers like VMware, Windows Azure, or Amazon EMR.

- Configuring Hadoop step by step
Please let us know what is challenging for you in the current set of instructions. Are you able to set up a single instance, then a pseudo-distributed instance, and then progress to a cluster setup?
I can sense a great deal of frustration, but I am not able to help unless I know specifically what is bothering you.
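For reference, the whole pseudo-distributed setup from the single-node guide boils down to two properties plus format-and-start; a condensed sketch:

# etc/hadoop/core-site.xml : fs.defaultFS    = hdfs://localhost:9000
# etc/hadoop/hdfs-site.xml : dfs.replication = 1
$ bin/hdfs namenode -format   # initialize the NameNode metadata
$ sbin/start-dfs.sh           # starts NameNode, DataNode, SecondaryNameNode
# The NameNode web UI is then reachable at http://localhost:50070/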

- Setting up all kinds of daemons/nodes manually, with an explanation of where they are located (how they work) and how they should be configured
Go to the sbin directory in your Hadoop installation and read the source of start-dfs.sh or start-all.sh; it gives you pointers to which services run where. Or run start-dfs.sh and then run "jps" (as the user the daemons run under) to see the running services.
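For the single-node case that looks roughly like this (the exact set of daemons depends on which scripts you start):

$ sbin/start-dfs.sh
$ jps
# Typically lists something like: NameNode, DataNode, SecondaryNameNode
$ sbin/start-yarn.sh
$ jps
# ...now additionally: ResourceManager, NodeManager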


Thanks
Anu