Posted to notifications@training.apache.org by GitBox <gi...@apache.org> on 2019/12/06 11:03:17 UTC

[GitHub] [incubator-training] RyanSkraba commented on a change in pull request #63: Training-27: Hadoop Training Slides

URL: https://github.com/apache/incubator-training/pull/63#discussion_r354771610
 
 

 ##########
 File path: content/Hadoop/src/main/asciidoc/index.adoc
 ##########
 @@ -0,0 +1,111 @@
+////
+
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+
+////
+:revealjs_progress: true
+:revealjs_slidenumber: true
+:sourcedir: ../java
+
+== What is Apache Hadoop?
+
+Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
+
+Two main layers:
+
+- Processing layer (MapReduce)
+- Storage layer (Hadoop Distributed File System)
+
+== MapReduce
+
+Hadoop MapReduce is a software framework for writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
+
+The MapReduce framework consists of:
+
+- a single master ResourceManager
+- one worker NodeManager per cluster node
+- one MRAppMaster per application
+
+== Hadoop Distributed File System
+
+Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. 
+
+- similar to other distributed file systems
+- highly fault-tolerant 
+- designed to be deployed on low-cost hardware
+- provides high throughput access to application data
+
+== MapReduce Inputs and Outputs
+
+The MapReduce framework operates exclusively on <key, value> pairs: it views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
+
+The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
+
+Input and Output types of a MapReduce job:
+
+(input) <k1, v1> -> map -> <k2, v2> ->
+combine -> <k2, v2> ->
+reduce -> <k3, v3> (output)
+
+== HDFS Architecture
+
+HDFS has a master/slave architecture. An HDFS cluster consists of:
+
+- a single NameNode, a master server that manages the file system namespace and regulates access to files by clients
+- one or more DataNodes, which store data and serve read and write requests from the file system’s clients
+
+== HDFS Useful Features
+
+New features and improvements are regularly implemented in HDFS. The following is a subset of useful features in HDFS:
+
+- File permissions and authentication.
+- Rack awareness: takes a node’s physical location into account when scheduling tasks and allocating storage.
+- Safemode: an administrative mode for maintenance.
+- fsck: a utility to diagnose the health of the file system and to find missing files or blocks.
+- fetchdt: a utility to fetch a DelegationToken and store it in a file on the local system.
+- Balancer: a tool to balance the cluster when data is unevenly distributed among DataNodes.
+- Upgrade and rollback: after a software upgrade, it is possible to roll back to the HDFS state from before the upgrade if unexpected problems occur.
+
+
+== HDFS Special Nodes
+
+The special nodes in HDFS are as follows:
+
+- Secondary NameNode
+- Checkpoint node
+- Backup node
+
+== HDFS Secondary NameNode
+
+The Secondary NameNode performs periodic checkpoints of the namespace.
+It helps keep the size of the file containing the log of HDFS modifications within certain limits at the NameNode.
+
+== HDFS Checkpoint node
+
+The Checkpoint node performs periodic checkpoints of the namespace.
+It helps minimize the size of the log stored at the NameNode that contains changes to HDFS.
+It replaces the role previously filled by the Secondary NameNode.
+The NameNode allows multiple Checkpoint nodes simultaneously, as long as no Backup nodes are registered with the system.
+
+
+== HDFS Backup node
+
+The Backup node is an extension of the Checkpoint node.
+In addition to checkpointing, it receives a stream of edits from the NameNode.
+It maintains its own in-memory copy of the namespace, which is always in sync with the active NameNode namespace state.
+Only one Backup node may be registered with the NameNode at a time.
+
+== HDFS Commands
 
 Review comment:
   Empty slide!
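
One possible starting point for this slide, offered only as a sketch and not as what the PR author intended: a handful of commands from the standard `hdfs` command-line client, covering basic file operations plus the Safemode and fsck features listed on the earlier slide. The `run` helper, the `DRY_RUN` variable, the `/user/demo` path and the `notes.txt` file are hypothetical names introduced here for illustration. By default the script only prints the commands, so it can be read or tried on a machine without a cluster.

```shell
#!/bin/sh
# Hypothetical dry-run helper (not part of Hadoop): prints the command by
# default; set DRY_RUN=0 on a machine with a configured HDFS client to run it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

DEMO_DIR=/user/demo   # hypothetical HDFS directory, adjust to your cluster

# Basic file operations through the file system shell.
run hdfs dfs -mkdir -p "$DEMO_DIR"          # create a directory (and parents)
run hdfs dfs -put notes.txt "$DEMO_DIR/"    # copy a local file into HDFS
run hdfs dfs -ls "$DEMO_DIR"                # list the directory
run hdfs dfs -cat "$DEMO_DIR/notes.txt"     # print a file's contents

# Administrative checks matching the features slide.
run hdfs dfsadmin -safemode get             # report whether Safemode is on
run hdfs fsck /                             # check file system health from the root
```

On the slide itself, the bare `hdfs ...` lines without the wrapper are probably enough; the wrapper only exists so the example is safe to execute anywhere.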
