Posted to common-user@hadoop.apache.org by panamamike <pa...@hotmail.com> on 2011/10/23 16:58:43 UTC

Need help understanding Hadoop Architecture

I'm new to Hadoop.  I've read a few articles and presentations aimed at
explaining what Hadoop is and how it works.  My current understanding is
that Hadoop is an MPP system which leverages a large block size to find
data quickly.  In theory, I understand how a large block size, an MPP
architecture, and what I take to be a massive indexing scheme via
MapReduce can be used to find data.

What I don't understand is how, after you identify the appropriate 64MB
block, you find the data you're specifically after.  Does this mean the
CPU has to search the entire 64MB block for the data of interest?  If so,
how does Hadoop know what data to retrieve from that block?

I'm assuming the block is probably composed of one or more files.  If
not, I'm assuming the user isn't looking for the entire 64MB block but
rather a portion of it.

Any pointers to documentation, books, or articles on the subject would be
much appreciated.

Regards,

Mike
-- 
View this message in context: http://old.nabble.com/Need-help-understanding-Hadoop-Architecture-tp32705405p32705405.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: Need help understanding Hadoop Architecture

Posted by George <ke...@gmail.com>.
To all,

I have been following this board for the past few weeks, and the
information has been great - so I appreciate the amount of sharing that
has been going on.

I am in the "newbie" category here - so there is something I need some
guidance on.  I think I have a basic understanding of HDFS and how data is
loaded into HDFS.

What I haven't figured out just yet is how you organize the "data".  I
know how you do it with a relational database - but I have read that Yahoo
has installations with more than 60 million files.

At the end of the day, you need SOME idea of what you are accessing, don't
you?  Anything that covers the organization of data in HDFS and the
approach to querying against it would be very helpful.

Thanks in advance!




On Mon, Oct 24, 2011 at 12:26 PM, Anupam Seth <an...@yahoo-inc.com> wrote:

> Hi Mike,
>
> This might help address your question:
>
> http://storageconference.org/2010/Papers/MSST/Shvachko.pdf
>
> Regards,
> Anupam
>
> -----Original Message-----
> From: panamamike [mailto:panamamike@hotmail.com]
> Sent: Sunday, October 23, 2011 9:59 AM
> To: core-user@hadoop.apache.org
> Subject: Need help understanding Hadoop Architecture
>
>
> I'm new to Hadoop.  I've read a few articles and presentations aimed at
> explaining what Hadoop is and how it works.  My current understanding is
> that Hadoop is an MPP system which leverages a large block size to find
> data quickly.  In theory, I understand how a large block size, an MPP
> architecture, and what I take to be a massive indexing scheme via
> MapReduce can be used to find data.
>
> What I don't understand is how, after you identify the appropriate 64MB
> block, you find the data you're specifically after.  Does this mean the
> CPU has to search the entire 64MB block for the data of interest?  If so,
> how does Hadoop know what data to retrieve from that block?
>
> I'm assuming the block is probably composed of one or more files.  If
> not, I'm assuming the user isn't looking for the entire 64MB block but
> rather a portion of it.
>
> Any pointers to documentation, books, or articles on the subject would be
> much appreciated.
>
> Regards,
>
> Mike
> --
> View this message in context:
> http://old.nabble.com/Need-help-understanding-Hadoop-Architecture-tp32705405p32705405.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>

RE: Need help understanding Hadoop Architecture

Posted by Anupam Seth <an...@yahoo-inc.com>.
Hi Mike,

This might help address your question:

http://storageconference.org/2010/Papers/MSST/Shvachko.pdf

Regards,
Anupam

-----Original Message-----
From: panamamike [mailto:panamamike@hotmail.com] 
Sent: Sunday, October 23, 2011 9:59 AM
To: core-user@hadoop.apache.org
Subject: Need help understanding Hadoop Architecture


I'm new to Hadoop.  I've read a few articles and presentations aimed at
explaining what Hadoop is and how it works.  My current understanding is
that Hadoop is an MPP system which leverages a large block size to find
data quickly.  In theory, I understand how a large block size, an MPP
architecture, and what I take to be a massive indexing scheme via
MapReduce can be used to find data.

What I don't understand is how, after you identify the appropriate 64MB
block, you find the data you're specifically after.  Does this mean the
CPU has to search the entire 64MB block for the data of interest?  If so,
how does Hadoop know what data to retrieve from that block?

I'm assuming the block is probably composed of one or more files.  If
not, I'm assuming the user isn't looking for the entire 64MB block but
rather a portion of it.

Any pointers to documentation, books, or articles on the subject would be
much appreciated.

Regards,

Mike
-- 
View this message in context: http://old.nabble.com/Need-help-understanding-Hadoop-Architecture-tp32705405p32705405.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: Need help understanding Hadoop Architecture

Posted by Uma Maheswara Rao G 72686 <ma...@huawei.com>.
Hi,

First of all, welcome to Hadoop.
----- Original Message -----
From: panamamike <pa...@hotmail.com>
Date: Sunday, October 23, 2011 8:29 pm
Subject: Need help understanding Hadoop Architecture
To: core-user@hadoop.apache.org

> 
> I'm new to Hadoop.  I've read a few articles and presentations aimed at
> explaining what Hadoop is and how it works.  My current understanding is
> that Hadoop is an MPP system which leverages a large block size to find
> data quickly.  In theory, I understand how a large block size, an MPP
> architecture, and what I take to be a massive indexing scheme via
> MapReduce can be used to find data.
> 
> What I don't understand is how, after you identify the appropriate 64MB
> block, you find the data you're specifically after.  Does this mean the
> CPU has to search the entire 64MB block for the data of interest?  If so,
> how does Hadoop know what data to retrieve from that block?
> 
> I'm assuming the block is probably composed of one or more files.  If
> not, I'm assuming the user isn't looking for the entire 64MB block but
> rather a portion of it.
> 
Let me give a brief overview of the file system here.

The distributed file system consists of a NameNode, DataNodes,
checkpointing nodes, and the DFSClient.

The NameNode maintains the metadata about the files and blocks.
DataNodes hold the actual data and send heartbeats to the NameNode, so
the NameNode knows each DataNode's status.

The DFSClient is the client-side logic.  To write a file, it first asks
the NameNode for a set of DataNodes to write to; the NameNode adds the
file's entries to its metadata and returns the DataNode list, and the
client then writes the data to those DataNodes directly.

When reading a file, the client likewise asks the NameNode for the block
locations and then connects to the DataNodes directly to read the data.
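
To make that flow concrete, here is a minimal sketch using the public
org.apache.hadoop.fs.FileSystem API (the class name and path below are
made up for illustration; this code is not from the original message).
The NameNode lookups and the direct DataNode transfers described above
all happen behind fs.create() and fs.open():

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteSketch {            // hypothetical class name
    public static void main(String[] args) throws Exception {
        // Picks up cluster settings from core-site.xml / hdfs-site.xml.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf); // backed by the DFSClient

        Path file = new Path("/tmp/example.txt"); // hypothetical path

        // Write: behind create(), the client asks the NameNode for a set
        // of DataNodes, then streams the bytes to those DataNodes directly.
        FSDataOutputStream out = fs.create(file);
        out.writeUTF("hello hdfs");
        out.close();

        // Read: behind open(), the client asks the NameNode for the block
        // locations, then reads from a DataNode directly.
        FSDataInputStream in = fs.open(file);
        System.out.println(in.readUTF());
        in.close();
    }
}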

There are many other concepts as well: replication, lease monitoring, etc.

I hope this gives you an initial understanding of HDFS.
Please go through the document below, which explains the architecture very
clearly, with diagrams.

> Any pointers to documentation, books, or articles on the subject would
> be much appreciated.
Here is a doc for Hadoop: http://db.trimtabs.com:2080/mindterm/ebooks/Hadoop_The_Definitive_Guide_Cr.pdf
> 
> Regards,
> 
> Mike
> -- 
> View this message in context:
> http://old.nabble.com/Need-help-understanding-Hadoop-Architecture-tp32705405p32705405.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> 
> 

Regards,
Uma

Re: Need help understanding Hadoop Architecture

Posted by "real great.." <gr...@gmail.com>.
Hey, that book doesn't include material about Hadoop MR2.  It would be
worth looking into some of Arun C Murthy's presentations.

On Tue, Nov 22, 2011 at 6:53 AM, hari708 <ha...@gmail.com> wrote:

>
> hello,
> Please help me with this.
> I have a big file consisting of XML data.  The XML is not represented as
> a single line in the file.  If we stream this file to a Hadoop directory
> using the ./hadoop dfs -put command, how does the distribution happen?
> Basically, in my MapReduce program I am expecting a complete XML document
> as my input.  I have a CustomReader (for XML) in my MapReduce job
> configuration.  My main confusion is this: if the NameNode distributes
> data to DataNodes, there is a chance that one part of the XML can go to
> one DataNode and the other half can go to another DataNode.  If that is
> the case, will my custom XMLReader in the MapReduce job be able to
> combine them (since MapReduce reads data locally only)?
> Please help me with this.
>
> oleksiy wrote:
> >
> > Hello,
> >
> > Sorry for the late answer (didn't have time).
> > So the first thing I would like to clarify is what you mean by
> > "unstructured data"?  Could you give me an example of this data?  You
> > should keep in mind that Hadoop is effective only for processing
> > particular types of tasks.  For instance, how would you compute a
> > median using Hadoop MapReduce?  That kind of problem is not a fit for
> > Hadoop.
> >
> > So, let me give you a small description of how Hadoop works regarding
> > what you wrote.  Let's look at a sample (the simple MapReduce word
> > count app from the Hadoop site):
> > We have a 1GB unstructured text file (let it be some book).  We save
> > this book to HDFS, which by default will divide the data into blocks
> > of 64MB and replicate each block to 3 different nodes.  So, right now
> > we have a 1GB file split into blocks and spread across the HDFS
> > cluster.
> >
> > When we run a MapReduce job, Hadoop automatically computes how many
> > tasks it needs to process the data.  Say Hadoop creates 16 tasks.
> > Then task 1 will process the first 64MB, which is located on node 1
> > (for instance), the second task processes the second 64MB, which is
> > located on machine 2, and so on.
> >
> > In this situation each map task processes its own piece of data (in
> > our case, 64MB).
> >
> > Also, one note regarding metadata: only the NameNode contains metadata
> > info.  So, in our example the NameNode knows that we have a 1GB file
> > split into 64MB blocks, and we have 16 pieces spread across the
> > cluster.  Beyond that, Hadoop need not know the real structure of the
> > data.  In our example we have a simple book, and by default Hadoop
> > uses "TextInputFormat" for processing simple text files.  In this
> > case, when Hadoop reads the data, the key is the position of the line
> > in the file and the value is the line itself.  Hadoop need not know
> > the format beyond that.
> >
> > that's it :)
> >
> >
> >
> >
> > panamamike wrote:
> >>
> >>
> >>
> >> oleksiy wrote:
> >>>
> >>> Hello,
> >>>
> >>> I would suggest you read at least this piece of info:
> >>> http://hadoop.apache.org/common/docs/r0.20.204.0/hdfs_design.html#NameNode+and+DataNodes
> >>> (HDFS Architecture)
> >>>
> >>> This is the main part of the HDFS architecture.  There you can find
> >>> some info on how the client reads data from different nodes.
> >>> I would also suggest the good book
> >>> http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449389732
> >>> (Tom White - Hadoop: The Definitive Guide, 2nd Edition, 2010).
> >>> There you will definitely find answers to all of your questions.
> >>>
> >>> Regards,
> >>> Oleksiy
> >>>
> >>>
> >>> panamamike wrote:
> >>>>
> >>>> I'm new to Hadoop.  I've read a few articles and presentations
> >>>> aimed at explaining what Hadoop is and how it works.  My current
> >>>> understanding is that Hadoop is an MPP system which leverages a
> >>>> large block size to find data quickly.  In theory, I understand how
> >>>> a large block size, an MPP architecture, and what I take to be a
> >>>> massive indexing scheme via MapReduce can be used to find data.
> >>>>
> >>>> What I don't understand is how, after you identify the appropriate
> >>>> 64MB block, you find the data you're specifically after.  Does this
> >>>> mean the CPU has to search the entire 64MB block for the data of
> >>>> interest?  If so, how does Hadoop know what data to retrieve from
> >>>> that block?
> >>>>
> >>>> I'm assuming the block is probably composed of one or more files.
> >>>> If not, I'm assuming the user isn't looking for the entire 64MB
> >>>> block but rather a portion of it.
> >>>>
> >>>> Any pointers to documentation, books, or articles on the subject
> >>>> would be much appreciated.
> >>>>
> >>>> Regards,
> >>>>
> >>>> Mike
> >>>>
> >>>
> >>>
> >>
> >> Oleksiy,
> >>
> >> Thank you for your input; I've actually read that section of the
> >> Hadoop documentation.  I think it does a good job of describing the
> >> general architecture of how Hadoop works.  The description reminds me
> >> of the Teradata MPP architecture.  The thing I'm missing is: how does
> >> Hadoop find things?
> >>
> >> I see how Hadoop can potentially narrow searches down by using
> >> metadata indexes to find the large 64MB blocks (I'm calling these
> >> large since typical block sizes are measured in bytes).  However,
> >> when it does find a block, how does it search within the block?  Does
> >> it then come down to a brute-force search of the 64MB, and are
> >> systems just fast enough these days that such a search isn't a big
> >> deal?
> >>
> >> Going back to my comparison to Teradata: Teradata had a weakness in
> >> that the speed of the MPP architecture was dependent on the quality
> >> of the data distribution index.  Meaning, there had to be a way for
> >> the system to determine how to store data across the commodity
> >> hardware in order to have an even distribution.  If the distribution
> >> isn't even, meaning that based on the defined index most data goes to
> >> one node in the system, you get something called hot amping, where
> >> the MPP advantage is lost because the majority of the work is
> >> directed to that one node.
> >>
> >> How does Hadoop tackle this particular issue?  Really, when it comes
> >> down to it, how does Hadoop distribute the data, balance the data
> >> load, and keep up the parallel performance?  This gets back to my
> >> question of how Hadoop finds things quickly.  I know in Teradata it's
> >> based on the design of the primary index.  My assumption is that
> >> Hadoop does something similar with the metadata, but that would mean
> >> unstructured data would have to be associated with some sort of
> >> metadata tags.
> >>
> >> Furthermore, that unstructured data could only be found if the
> >> correct metadata key values are searched.  Is this the way it works?
> >>
> >> Mike
> >>
> >
> >
>
> --
> View this message in context:
> http://old.nabble.com/Need-help-understanding-Hadoop-Architecture-tp32705405p32871905.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


-- 
Regards,
R.V.

Re: Need help understanding Hadoop Architecture

Posted by hari708 <ha...@gmail.com>.
hello,
Please help me with this.
I have a big file consisting of XML data.  The XML is not represented as a
single line in the file.  If we stream this file to a Hadoop directory
using the ./hadoop dfs -put command, how does the distribution happen?
Basically, in my MapReduce program I am expecting a complete XML document
as my input.  I have a CustomReader (for XML) in my MapReduce job
configuration.  My main confusion is this: if the NameNode distributes
data to DataNodes, there is a chance that one part of the XML can go to
one DataNode and the other half can go to another DataNode.  If that is
the case, will my custom XMLReader in the MapReduce job be able to combine
them (since MapReduce reads data locally only)?
Please help me with this.

oleksiy wrote:
> 
> Hello,
> 
> Sorry for the late answer (didn't have time).
> So the first thing I would like to clarify is what you mean by
> "unstructured data"?  Could you give me an example of this data?  You
> should keep in mind that Hadoop is effective only for processing
> particular types of tasks.  For instance, how would you compute a median
> using Hadoop MapReduce?  That kind of problem is not a fit for Hadoop.
>
> So, let me give you a small description of how Hadoop works regarding
> what you wrote.  Let's look at a sample (the simple MapReduce word count
> app from the Hadoop site):
> We have a 1GB unstructured text file (let it be some book).  We save this
> book to HDFS, which by default will divide the data into blocks of 64MB
> and replicate each block to 3 different nodes.  So, right now we have a
> 1GB file split into blocks and spread across the HDFS cluster.
>
> When we run a MapReduce job, Hadoop automatically computes how many tasks
> it needs to process the data.  Say Hadoop creates 16 tasks.  Then task 1
> will process the first 64MB, which is located on node 1 (for instance),
> the second task processes the second 64MB, which is located on machine 2,
> and so on.
>
> In this situation each map task processes its own piece of data (in our
> case, 64MB).
>
> Also, one note regarding metadata: only the NameNode contains metadata
> info.  So, in our example the NameNode knows that we have a 1GB file
> split into 64MB blocks, and we have 16 pieces spread across the cluster.
> Beyond that, Hadoop need not know the real structure of the data.  In our
> example we have a simple book, and by default Hadoop uses
> "TextInputFormat" for processing simple text files.  In this case, when
> Hadoop reads the data, the key is the position of the line in the file
> and the value is the line itself.  Hadoop need not know the format beyond
> that.
>
> that's it :)
> 
> 
> 
> 
> panamamike wrote:
>> 
>> 
>> 
>> oleksiy wrote:
>>> 
>>> Hello,
>>> 
>>> I would suggest you read at least this piece of info:
>>> http://hadoop.apache.org/common/docs/r0.20.204.0/hdfs_design.html#NameNode+and+DataNodes
>>> (HDFS Architecture)
>>>
>>> This is the main part of the HDFS architecture.  There you can find
>>> some info on how the client reads data from different nodes.
>>> I would also suggest the good book
>>> http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449389732
>>> (Tom White - Hadoop: The Definitive Guide, 2nd Edition, 2010).
>>> There you will definitely find answers to all of your questions.
>>> 
>>> Regards,
>>> Oleksiy
>>> 
>>> 
>>> panamamike wrote:
>>>> 
>>>> I'm new to Hadoop.  I've read a few articles and presentations aimed
>>>> at explaining what Hadoop is and how it works.  My current
>>>> understanding is that Hadoop is an MPP system which leverages a large
>>>> block size to find data quickly.  In theory, I understand how a large
>>>> block size, an MPP architecture, and what I take to be a massive
>>>> indexing scheme via MapReduce can be used to find data.
>>>>
>>>> What I don't understand is how, after you identify the appropriate
>>>> 64MB block, you find the data you're specifically after.  Does this
>>>> mean the CPU has to search the entire 64MB block for the data of
>>>> interest?  If so, how does Hadoop know what data to retrieve from that
>>>> block?
>>>>
>>>> I'm assuming the block is probably composed of one or more files.  If
>>>> not, I'm assuming the user isn't looking for the entire 64MB block but
>>>> rather a portion of it.
>>>>
>>>> Any pointers to documentation, books, or articles on the subject would
>>>> be much appreciated.
>>>> 
>>>> Regards,
>>>> 
>>>> Mike
>>>> 
>>> 
>>> 
>> 
>> Oleksiy,
>>
>> Thank you for your input; I've actually read that section of the Hadoop
>> documentation.  I think it does a good job of describing the general
>> architecture of how Hadoop works.  The description reminds me of the
>> Teradata MPP architecture.  The thing I'm missing is: how does Hadoop
>> find things?
>>
>> I see how Hadoop can potentially narrow searches down by using metadata
>> indexes to find the large 64MB blocks (I'm calling these large since
>> typical block sizes are measured in bytes).  However, when it does find
>> a block, how does it search within the block?  Does it then come down to
>> a brute-force search of the 64MB, and are systems just fast enough these
>> days that such a search isn't a big deal?
>>
>> Going back to my comparison to Teradata: Teradata had a weakness in that
>> the speed of the MPP architecture was dependent on the quality of the
>> data distribution index.  Meaning, there had to be a way for the system
>> to determine how to store data across the commodity hardware in order to
>> have an even distribution.  If the distribution isn't even, meaning that
>> based on the defined index most data goes to one node in the system, you
>> get something called hot amping, where the MPP advantage is lost because
>> the majority of the work is directed to that one node.
>>
>> How does Hadoop tackle this particular issue?  Really, when it comes
>> down to it, how does Hadoop distribute the data, balance the data load,
>> and keep up the parallel performance?  This gets back to my question of
>> how Hadoop finds things quickly.  I know in Teradata it's based on the
>> design of the primary index.  My assumption is that Hadoop does
>> something similar with the metadata, but that would mean unstructured
>> data would have to be associated with some sort of metadata tags.
>>
>> Furthermore, that unstructured data could only be found if the correct
>> metadata key values are searched.  Is this the way it works?
>>
>> Mike
>> 
> 
> 

-- 
View this message in context: http://old.nabble.com/Need-help-understanding-Hadoop-Architecture-tp32705405p32871905.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: Need help understanding Hadoop Architecture

Posted by oleksiy <ga...@mail.ru>.
Hello,

Sorry for the late answer (didn't have time).
So the first thing I would like to clarify is what you mean by
"unstructured data"?  Could you give me an example of this data?  You
should keep in mind that Hadoop is effective only for processing
particular types of tasks.  For instance, how would you compute a median
using Hadoop MapReduce?  That kind of problem is not a fit for Hadoop.

So, let me give you a small description of how Hadoop works regarding what
you wrote.  Let's look at a sample (the simple MapReduce word count app
from the Hadoop site):
We have a 1GB unstructured text file (let it be some book).  We save this
book to HDFS, which by default will divide the data into blocks of 64MB
and replicate each block to 3 different nodes.  So, right now we have a
1GB file split into blocks and spread across the HDFS cluster.

When we run a MapReduce job, Hadoop automatically computes how many tasks
it needs to process the data.  Say Hadoop creates 16 tasks.  Then task 1
will process the first 64MB, which is located on node 1 (for instance),
the second task processes the second 64MB, which is located on machine 2,
and so on.

In this situation each map task processes its own piece of data (in our
case, 64MB).

Also, one note regarding metadata: only the NameNode contains metadata
info.  So, in our example the NameNode knows that we have a 1GB file split
into 64MB blocks, and we have 16 pieces spread across the cluster.  Beyond
that, Hadoop need not know the real structure of the data.  In our example
we have a simple book, and by default Hadoop uses "TextInputFormat" for
processing simple text files.  In this case, when Hadoop reads the data,
the key is the position of the line in the file and the value is the line
itself.  Hadoop need not know the format beyond that.

that's it :)
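
As a concrete illustration of the word count example above, here is a
minimal mapper sketch against the standard org.apache.hadoop.mapreduce
API (the class name is mine; this code is not from the original message).
One detail worth noting: the key that TextInputFormat hands to the mapper
is the line's byte offset in the file, not its line number.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Each map task gets one input split, which by default corresponds to one
// 64MB block of the file, so the tasks run where the data already lives.
public class WordCountMapper                  // hypothetical class name
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // offset = byte position of this line in the file (the key that
        // TextInputFormat produces); line = the text of the line itself.
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // emit (word, 1); the reducer sums
        }
    }
}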




panamamike wrote:
> 
> 
> 
> oleksiy wrote:
>> 
>> Hello,
>> 
>> I would suggest you read at least this piece of info:
>> http://hadoop.apache.org/common/docs/r0.20.204.0/hdfs_design.html#NameNode+and+DataNodes
>> (HDFS Architecture)
>>
>> This is the main part of the HDFS architecture.  There you can find some
>> info on how the client reads data from different nodes.
>> I would also suggest the good book
>> http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449389732
>> (Tom White - Hadoop: The Definitive Guide, 2nd Edition, 2010).
>> There you will definitely find answers to all of your questions.
>> 
>> Regards,
>> Oleksiy
>> 
>> 
>> panamamike wrote:
>>> 
>>> I'm new to Hadoop.  I've read a few articles and presentations aimed at
>>> explaining what Hadoop is and how it works.  My current understanding
>>> is that Hadoop is an MPP system which leverages a large block size to
>>> find data quickly.  In theory, I understand how a large block size, an
>>> MPP architecture, and what I take to be a massive indexing scheme via
>>> MapReduce can be used to find data.
>>>
>>> What I don't understand is how, after you identify the appropriate 64MB
>>> block, you find the data you're specifically after.  Does this mean the
>>> CPU has to search the entire 64MB block for the data of interest?  If
>>> so, how does Hadoop know what data to retrieve from that block?
>>>
>>> I'm assuming the block is probably composed of one or more files.  If
>>> not, I'm assuming the user isn't looking for the entire 64MB block but
>>> rather a portion of it.
>>>
>>> Any pointers to documentation, books, or articles on the subject would
>>> be much appreciated.
>>> 
>>> Regards,
>>> 
>>> Mike
>>> 
>> 
>> 
> 
> Oleksiy,
>
> Thank you for your input; I've actually read that section of the Hadoop
> documentation.  I think it does a good job of describing the general
> architecture of how Hadoop works.  The description reminds me of the
> Teradata MPP architecture.  The thing I'm missing is: how does Hadoop
> find things?
>
> I see how Hadoop can potentially narrow searches down by using metadata
> indexes to find the large 64MB blocks (I'm calling these large since
> typical block sizes are measured in bytes).  However, when it does find a
> block, how does it search within the block?  Does it then come down to a
> brute-force search of the 64MB, and are systems just fast enough these
> days that such a search isn't a big deal?
>
> Going back to my comparison to Teradata: Teradata had a weakness in that
> the speed of the MPP architecture was dependent on the quality of the
> data distribution index.  Meaning, there had to be a way for the system
> to determine how to store data across the commodity hardware in order to
> have an even distribution.  If the distribution isn't even, meaning that
> based on the defined index most data goes to one node in the system, you
> get something called hot amping, where the MPP advantage is lost because
> the majority of the work is directed to that one node.
>
> How does Hadoop tackle this particular issue?  Really, when it comes down
> to it, how does Hadoop distribute the data, balance the data load, and
> keep up the parallel performance?  This gets back to my question of how
> Hadoop finds things quickly.  I know in Teradata it's based on the design
> of the primary index.  My assumption is that Hadoop does something
> similar with the metadata, but that would mean unstructured data would
> have to be associated with some sort of metadata tags.
>
> Furthermore, that unstructured data could only be found if the correct
> metadata key values are searched.  Is this the way it works?
>
> Mike
> 

-- 
View this message in context: http://old.nabble.com/Need-help-understanding-Hadoop-Architecture-tp32705405p32752983.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: Need help understanding Hadoop Architecture

Posted by panamamike <pa...@hotmail.com>.


oleksiy wrote:
> 
> Hello,
> 
> I would suggest you read at least this piece of info:
> http://hadoop.apache.org/common/docs/r0.20.204.0/hdfs_design.html#NameNode+and+DataNodes
> (HDFS Architecture)
>
> This is the main part of the HDFS architecture.  There you can find some
> info on how the client reads data from different nodes.
> I would also suggest the good book
> http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449389732
> (Tom White - Hadoop: The Definitive Guide, 2nd Edition, 2010).
> There you will definitely find answers to all of your questions.
> 
> Regards,
> Oleksiy
> 
> 
> panamamike wrote:
>> 
>> I'm new to Hadoop.  I've read a few articles and presentations aimed at
>> explaining what Hadoop is and how it works.  My current understanding is
>> that Hadoop is an MPP system which leverages a large block size to find
>> data quickly.  In theory, I understand how a large block size, an MPP
>> architecture, and what I take to be a massive indexing scheme via
>> MapReduce can be used to find data.
>>
>> What I don't understand is how, after you identify the appropriate 64MB
>> block, you find the data you're specifically after.  Does this mean the
>> CPU has to search the entire 64MB block for the data of interest?  If
>> so, how does Hadoop know what data to retrieve from that block?
>>
>> I'm assuming the block is probably composed of one or more files.  If
>> not, I'm assuming the user isn't looking for the entire 64MB block but
>> rather a portion of it.
>>
>> Any pointers to documentation, books, or articles on the subject would
>> be much appreciated.
>> 
>> Regards,
>> 
>> Mike
>> 
> 
> 

Oleksiy,

Thank you for your input; I've actually read that section of the Hadoop
documentation.  I think it does a good job of describing the general
architecture of how Hadoop works.  The description reminds me of the
Teradata MPP architecture.  The thing I'm missing is: how does Hadoop find
things?

I see how Hadoop can potentially narrow searches down by using metadata
indexes to find the large 64MB blocks (I'm calling these large since
typical block sizes are measured in bytes).  However, when it does find a
block, how does it search within the block?  Does it then come down to a
brute-force search of the 64MB, and are systems just fast enough these
days that such a search isn't a big deal?

Going back to my comparison to Teradata: Teradata had a weakness in that
the speed of the MPP architecture was dependent on the quality of the data
distribution index.  Meaning, there had to be a way for the system to
determine how to store data across the commodity hardware in order to have
an even distribution.  If the distribution isn't even, meaning that based
on the defined index most data goes to one node in the system, you get
something called hot amping, where the MPP advantage is lost because the
majority of the work is directed to that one node.

How does Hadoop tackle this particular issue?  Really, when it comes down
to it, how does Hadoop distribute the data, balance the data load, and
keep up the parallel performance?  This gets back to my question of how
Hadoop finds things quickly.  I know in Teradata it's based on the design
of the primary index.  My assumption is that Hadoop does something similar
with the metadata, but that would mean unstructured data would have to be
associated with some sort of metadata tags.

Furthermore, that unstructured data could only be found if the correct
metadata key values are searched.  Is this the way it works?

Mike
-- 
View this message in context: http://old.nabble.com/Need-help-understanding-Hadoop-Architecture-tp32705405p32724383.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: Need help understanding Hadoop Architecture

Posted by oleksiy <ga...@mail.ru>.
Hello,

I would suggest you read at least this piece of info:
http://hadoop.apache.org/common/docs/r0.20.204.0/hdfs_design.html#NameNode+and+DataNodes
(HDFS Architecture)

This is the main part of the HDFS architecture.  There you can find some
info on how the client reads data from different nodes.
I would also suggest the good book
http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449389732
(Tom White - Hadoop: The Definitive Guide, 2nd Edition, 2010).
There you will definitely find answers to all of your questions.

Regards,
Oleksiy


panamamike wrote:
> 
> I'm new to Hadoop.  I've read a few articles and presentations aimed at
> explaining what Hadoop is and how it works.  My current understanding is
> that Hadoop is an MPP system which leverages a large block size to find
> data quickly.  In theory, I understand how a large block size, an MPP
> architecture, and what I take to be a massive indexing scheme via
> MapReduce can be used to find data.
>
> What I don't understand is how, after you identify the appropriate 64MB
> block, you find the data you're specifically after.  Does this mean the
> CPU has to search the entire 64MB block for the data of interest?  If so,
> how does Hadoop know what data to retrieve from that block?
>
> I'm assuming the block is probably composed of one or more files.  If
> not, I'm assuming the user isn't looking for the entire 64MB block but
> rather a portion of it.
>
> Any pointers to documentation, books, or articles on the subject would be
> much appreciated.
> 
> Regards,
> 
> Mike
> 

-- 
View this message in context: http://old.nabble.com/Need-help-understanding-Hadoop-Architecture-tp32705405p32722610.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.