Posted to common-user@hadoop.apache.org by Sai Sai <sa...@yahoo.in> on 2013/04/12 10:10:28 UTC

Re: 10 TB of a data file.

In the real world, can a file really be as big as 10 TB?
Would the data be put into a txt file, or what kind of file?
If someone wanted to open such a big file to look at the content, would the OS support opening files that big?
If not, how is this kind of scenario handled?
Any input will be appreciated.
Thanks
Sai

Re: Will HDFS refer to the memory of NameNode & DataNode or is it a separate machine

Posted by Nitin Pawar <ni...@gmail.com>.
HDFS stands for Hadoop Distributed File System.
Since it is, as the name says, a file system, the first basic question to
ask is: do you need a separate process to run a file system?
Once you have answered that, the second question is: would a single
process even be enough for a distributed system, where sub-components of
the system may exist on different machines?

The namenode and the datanodes combined make up HDFS; it is the
combination of all of their processes that gives you HDFS, so there is no
separate HDFS process.

The namenode is the master of HDFS and keeps the file system image in
memory. When it starts, it loads the image into memory and serves all
requests from memory from then on. There are steps taken to persist the
FSImage to disk; you can read about this in detail in the HDFS
architecture documentation.

When you put a file into HDFS, it may or may not go to a single machine.
The namenode never stores the data files; it only stores the metadata for
HDFS. So when you load a file, the data goes to the datanodes and the
file information goes to the namenode. Depending on its size, the file is
split into multiple blocks, and those blocks may land on multiple
datanodes. If your file size is less than or exactly equal to the block
size, you can find out which datanode it is located on; otherwise, in
fully distributed mode, there is no guarantee that the file will be on a
single node.

PS: this is my understanding. Others may correct me as well.
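
If you want to check where the blocks of a particular file ended up, you
can ask the namenode directly. A minimal sketch using the standard fsck
tool (the path here is just a made-up example):

  # list the blocks of a file and the datanodes each block lives on
  hadoop fsck /user/sai/data.txt -files -blocks -locations

This is also one way to answer the last question below: the
block-to-datanode mapping is exactly part of the metadata the namenode
keeps.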


On Fri, Apr 12, 2013 at 2:00 PM, Sai Sai <sa...@yahoo.in> wrote:

> A few basic questions:
>
> Will HDFS refer to the memory of NameNode & DataNode or is it a separate
> machine.
>
> For NameNode, DataNode and others there is a process associated with each
> of em.
> But no process is for HDFS, wondering why? I understand that fsImage has
> the meta data of the HDFS, so when NameNode or DataNode or JobTracker/TT
> needs to get file info will they just look into the fsImage.
>
> When we put a file in HDFS is it possible to look/find in which node
> (NN/DN) it physically sits.
>
> Any help is appreciated.
> Thanks
> Sai
>



-- 
Nitin Pawar

Re: Will HDFS refer to the memory of NameNode & DataNode or is it a separate machine

Posted by Sai Sai <sa...@yahoo.in>.
A few basic questions:

Does HDFS refer to the memory of the NameNode & DataNode, or is it a
separate machine?

For the NameNode, DataNode and the other daemons there is a process
associated with each of them, but there is no process for HDFS itself;
I am wondering why. I understand that the fsImage has the metadata of
HDFS, so when the NameNode, DataNode or JobTracker/TT needs to get file
info, will they just look into the fsImage?

When we put a file in HDFS, is it possible to look/find on which node
(NN/DN) it physically sits?

Any help is appreciated.
Thanks
Sai

Re: How to find the num of Mappers

Posted by Nitin Pawar <ni...@gmail.com>.
Your question is answered here:
http://wiki.apache.org/hadoop/HowManyMapsAndReduces

To answer the first part of your question: it is not mandatory to run all
the maps of a given job at the same time. Maps are executed as and when
map slots become available on the tasktrackers.
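
As a rough back-of-the-envelope check, assuming the default 64 MB block
size and the usual one-map-per-block behaviour (the real number depends
on your InputFormat and split settings, as the wiki page explains):

  # 640 MB file at 64 MB per block -> 10 map tasks
  echo $(( 640 / 64 ))

  # 10 TB file at 64 MB per block -> 163840 map tasks
  echo $(( 10 * 1024 * 1024 / 64 ))

Those maps are then scheduled a few at a time, as slots free up on the
tasktrackers in your cluster.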


On Fri, Apr 12, 2013 at 1:51 PM, Sai Sai <sa...@yahoo.in> wrote:

> If we have a 640 MB data file and have 3 Data Nodes in a cluster.
> The file can be split into 10 Blocks and starts the Mappers M1, M2,  M3
> first.
> As each one completes the task M4 and so on will be run.
> It appears like it is not necessary to run all the 10 Map tasks in
> parallel at once.
> Just wondering if this is right assumption.
> What if we have 10 TB of data file with 3 Data Nodes, how to find the
> number of mappers that will be created.
> Thanks
> Sai
>



-- 
Nitin Pawar

Re: How to find the num of Mappers

Posted by Sai Sai <sa...@yahoo.in>.
Suppose we have a 640 MB data file and 3 DataNodes in a cluster.
The file can be split into 10 blocks, and the mappers M1, M2, M3 start first.
As each one completes its task, M4 and so on will be run.
It appears that it is not necessary to run all 10 map tasks in parallel at once.
Just wondering if this is the right assumption.
And if we have a 10 TB data file with 3 DataNodes, how do we find the number of mappers that will be created?
Thanks
Sai

Re: 10 TB of a data file.

Posted by Nitin Pawar <ni...@gmail.com>.
From Wikipedia: "The actual amount of disk space
<http://en.wikipedia.org/wiki/Computer_data_storage> consumed by the file
depends on the file system <http://en.wikipedia.org/wiki/File_system>.
The maximum file size a file system supports depends on the number of
bits <http://en.wikipedia.org/wiki/Bit> reserved to store size
information and the total size of the file system."

You can read more about it at http://en.wikipedia.org/wiki/File_size
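
In practice you would not try to open a 10 TB file in an editor at all;
on HDFS you would stream or sample it from the command line instead. A
small sketch with the standard Hadoop shell (the path is hypothetical):

  # peek at the first few lines of a huge file without reading it all
  hadoop fs -cat /data/big-input.txt | head -n 20

  # or look at the last kilobyte of the file
  hadoop fs -tail /data/big-input.txt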


On Fri, Apr 12, 2013 at 1:40 PM, Sai Sai <sa...@yahoo.in> wrote:

> In real world can a file be of this big size as 10 TB?
> Will the data be put into a txt file or what kind of a file?
> If someone would like to open such a big file to look at the content will
> OS support opening such big files?
> If not how to handle this kind of scenario?
> Any input will be appreciated.
> Thanks
> Sai
>



-- 
Nitin Pawar
