Posted to user@hadoop.apache.org by Ravi Prakash <ra...@ymail.com> on 2013/10/01 23:24:51 UTC

Re: Uploading a file to HDFS

Karim! 

Look at DFSOutputStream.java:DataStreamer

HTH
Ravi
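[Editor's note: DataStreamer is the background thread inside DFSOutputStream that drains a queue of client-side packets and ships them down the DN pipeline. The standalone Java sketch below imitates only that producer/consumer shape with hypothetical names; it is not Hadoop's actual code.]

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class PacketQueueSketch {
    static final int PACKET_SIZE = 64 * 1024; // 64KB, as discussed in this thread

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<byte[]> dataQueue = new LinkedBlockingQueue<>();

        // Producer side: the client's write path fills 64KB packets
        // from the user's data and enqueues them.
        byte[] data = new byte[200 * 1024]; // pretend the user wrote 200KB
        int off = 0;
        while (off < data.length) {
            int len = Math.min(PACKET_SIZE, data.length - off);
            byte[] packet = new byte[len];
            System.arraycopy(data, off, packet, 0, len);
            dataQueue.put(packet);
            off += len;
        }
        dataQueue.put(new byte[0]); // empty packet marks the end of the stream

        // Consumer side: a DataStreamer-like thread drains the queue.
        // The real DataStreamer writes each packet to the first DN in
        // the pipeline; here we just count the bytes.
        Thread streamer = new Thread(() -> {
            long sent = 0;
            try {
                while (true) {
                    byte[] p = dataQueue.take();
                    if (p.length == 0) break;
                    sent += p.length;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            System.out.println("streamed " + sent + " bytes");
        });
        streamer.start();
        streamer.join();
    }
}
```

Running this prints "streamed 204800 bytes": three full 64KB packets plus one 8KB tail, handed off asynchronously just as the write path hands packets to DataStreamer.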




________________________________
 From: Karim Awara <ka...@kaust.edu.sa>
To: user <us...@hadoop.apache.org> 
Sent: Thursday, September 26, 2013 7:51 AM
Subject: Re: Uploading a file to HDFS
 



Thanks for the reply. When the client buffers 64KB of data on its own side, do you know which major Java classes/files are responsible for that action?


--
Best Regards,
Karim Ahmed Awara


On Thu, Sep 26, 2013 at 2:25 PM, Jitendra Yadav <je...@gmail.com> wrote:

Case 2:
>
>
>
>While selecting a target DN for a write, the NN prefers the DN on the same
>machine the client is writing from (when the client runs on a DN). The NN
>ignores that DN when it has disk-space issues or other health problems; the
>rest of the process is the same.
>
>
>Thanks
>Jitendra
>
>
>
>On Thu, Sep 26, 2013 at 4:15 PM, Shekhar Sharma <sh...@gmail.com> wrote:
>
>>It's not the namenode that reads or splits the file.
>>When you run "hadoop fs -put <input> <output>", the "hadoop" script is the
>>default Hadoop client. When the client contacts the namenode to write, the
>>NN allocates a block ID and asks 3 DNs to host the block (replication
>>factor 3), and that information is sent back to the client.
>>
>>The client buffers 64KB of data (one packet) on its own side, pushes it to
>>the first DN, and the packet is forwarded along the pipeline. This repeats
>>until 64MB (one block) has been written; if the client wants to write more,
>>it asks the NN to allocate the next block, and the process continues.
>>
>>Search for "how writing happens in HDFS" for more detail.
>>
>>
>>Regards,
>>Som Shekhar Sharma
>>+91-8197243810
>>
>>
>>
>>On Thu, Sep 26, 2013 at 3:41 PM, Karim Awara <ka...@kaust.edu.sa> wrote:
>>> Hi,
>>>
>>> I have a couple of questions about the process of uploading a large file (>
>>> 10GB) to HDFS.
>>>
>>> To make sure my understanding is correct, assuming I have a cluster of N
>>> machines.
>>>
>>> What happens in the following:
>>>
>>>
>>> Case 1:
>>>                 assuming I want to upload a file (input.txt) of size K GB
>>> that resides on the local disk of machine 1 (which is the namenode only):
>>> if I run the command -put input.txt {some hdfs dir} from the namenode
>>> (assuming it does not also play the datanode role), will the namenode read
>>> the first 64MB into a temporary pipe and then transfer it to one of the
>>> cluster's datanodes once finished? Or does the namenode not read the file
>>> at all, and instead ask a certain datanode to read the 64MB window from
>>> the file remotely?
>>>
>>>
>>> Case 2:
>>>              assume machine 1 is the namenode, but I run the -put command
>>> from machine 3 (which is a datanode). Who will start reading the file?
>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Karim Ahmed Awara
>>>
>>> ________________________________
>>> This message and its contents, including attachments are intended solely for
>>> the original recipient. If you are not the intended recipient or have
>>> received this message in error, please notify me immediately and delete this
>>> message from your computer system. Any unauthorized use or distribution is
>>> prohibited. Please consider the environment before printing this email.
>>
>



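[Editor's note: the numbers in this thread work out as follows, assuming the 64MB default block size of that era's Hadoop and the 64KB client packet size. A quick arithmetic sketch:]

```java
public class BlockMath {
    public static void main(String[] args) {
        long fileSize   = 10L * 1024 * 1024 * 1024; // the 10GB file from the question
        long blockSize  = 64L * 1024 * 1024;        // 64MB default block size (pre-2.x)
        long packetSize = 64L * 1024;               // 64KB client-side packet

        // Blocks the NN must allocate, rounding up for a partial last block.
        long blocks = (fileSize + blockSize - 1) / blockSize;
        // Packets the client pushes through the pipeline per full block.
        long packetsPerBlock = blockSize / packetSize;

        System.out.println(blocks + " blocks, "
                + packetsPerBlock + " packets per full block");
    }
}
```

This prints "160 blocks, 1024 packets per full block", i.e. the client repeats the NN-allocate-then-stream cycle 160 times for the 10GB upload.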

Re: Uploading a file to HDFS

Posted by Jay Vyas <ja...@gmail.com>.
I've diagrammed the Hadoop HDFS write path here:

http://jayunit100.blogspot.com/2013/04/the-kv-pair-salmon-run-in-mapreduce-hdfs.html



-- 
Jay Vyas
http://jayunit100.blogspot.com
