Posted to hdfs-user@hadoop.apache.org by Mahmood Naderan <nt...@yahoo.com> on 2013/05/27 20:10:11 UTC

understanding source code structure

Hello

I am trying to understand the Hadoop source code, especially HDFS. I want to know exactly where to look in the source to see how HDFS distributes the data, and also how the MapReduce engine reads that data.

Any hint about where these are located in the source code is appreciated.

Regards,
Mahmood

Re: understanding source code structure

Posted by Arpit Agarwal <aa...@hortonworks.com>.

It can be overwhelming to jump into the HDFS code. Have you read the
architectural overview of HDFS
<https://hadoop.apache.org/docs/r1.0.4/hdfs_design.html>?

I found it easiest to start with the DFSClient class, which encapsulates
client operations.

The DFSClient communicates with the NameNode using ClientProtocol.
The server side of the ClientProtocol handling is in NameNodeRpcServer.
Client communication with the DataNode is encapsulated in
DataTransferProtocol.
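
To make that division of labour concrete, here is a rough toy model of a read. It is not Hadoop code: ReadPathSketch, ToyNameNode, ToyDataNode, BlockLocation and every method name are made up for illustration. The only thing it tries to show is the shape described above, where the client asks the NameNode side for metadata and then fetches the bytes from the DataNode side.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only. Every class and method here is hypothetical, not Hadoop's API.
class ReadPathSketch {

    /** One block of a file and the name of the DataNode that holds it. */
    static final class BlockLocation {
        final String blockId;
        final String dataNode;

        BlockLocation(String blockId, String dataNode) {
            this.blockId = blockId;
            this.dataNode = dataNode;
        }
    }

    /** NameNode role: knows which blocks make up a file, holds no file bytes. */
    static final class ToyNameNode {
        private final Map<String, List<BlockLocation>> files = new HashMap<>();

        void addBlock(String path, String blockId, String dataNode) {
            files.computeIfAbsent(path, p -> new ArrayList<>())
                 .add(new BlockLocation(blockId, dataNode));
        }

        List<BlockLocation> getBlockLocations(String path) {
            List<BlockLocation> blocks = files.get(path);
            if (blocks == null) {
                throw new IllegalArgumentException("no such file: " + path);
            }
            return blocks;
        }
    }

    /** DataNode role: stores block bytes by id, knows nothing about paths. */
    static final class ToyDataNode {
        private final Map<String, byte[]> blocks = new HashMap<>();

        void store(String blockId, byte[] data) { blocks.put(blockId, data); }

        byte[] readBlock(String blockId) { return blocks.get(blockId); }
    }

    /** Client role: metadata from the NameNode first, then bytes from DataNodes. */
    static String readFile(ToyNameNode nameNode,
                           Map<String, ToyDataNode> dataNodes,
                           String path) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (BlockLocation loc : nameNode.getBlockLocations(path)) {
            byte[] data = dataNodes.get(loc.dataNode).readBlock(loc.blockId);
            out.write(data, 0, data.length);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Builds a two-block file spread over two DataNodes and reads it back. */
    static String demo() {
        ToyDataNode dn1 = new ToyDataNode();
        ToyDataNode dn2 = new ToyDataNode();
        dn1.store("blk_1", "hello ".getBytes(StandardCharsets.UTF_8));
        dn2.store("blk_2", "world".getBytes(StandardCharsets.UTF_8));

        Map<String, ToyDataNode> dataNodes = new HashMap<>();
        dataNodes.put("dn1", dn1);
        dataNodes.put("dn2", dn2);

        ToyNameNode nameNode = new ToyNameNode();
        nameNode.addBlock("/demo.txt", "blk_1", "dn1");
        nameNode.addBlock("/demo.txt", "blk_2", "dn2");

        return readFile(nameNode, dataNodes, "/demo.txt");
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

When reading the real code, the thing to look for is the same split: calls that go through ClientProtocol only deal with metadata, while the block bytes travel over DataTransferProtocol.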

Feel free to ask more specific questions if you get stuck.

-Arpit

On Mon, May 27, 2013 at 11:22 AM, Jay Vyas <ja...@gmail.com> wrote:

> Hi! A few weeks ago I had the same question. I tried a first iteration at
> documenting this by going through the classes, starting with key/value
> pairs, in the blog post below.
>
>
> http://jayunit100.blogspot.com/2013/04/the-kv-pair-salmon-run-in-mapreduce-hdfs.html
>
> Note that it's not perfect yet, but I think it should provide some insight
> into things. The linchpin of it all is the DFSOutputStream and the
> DataStreamer classes. Anyway, feel free to borrow the contents and roll
> your own, comment on it and leave some feedback, or let me know if
> anything is missing.
>
> It would definitely be awesome to have a rock-solid view of the full write
> path.
>
> On May 27, 2013, at 2:10 PM, Mahmood Naderan <nt...@yahoo.com> wrote:
>
> Hello
>
> I am trying to understand the Hadoop source code, especially HDFS. I want
> to know exactly where to look in the source to see how HDFS distributes
> the data, and also how the MapReduce engine reads that data.
>
> Any hint about where these are located in the source code is appreciated.
>
> Regards,
> Mahmood
>
>

Re: understanding source code structure

Posted by Jay Vyas <ja...@gmail.com>.

Hi! A few weeks ago I had the same question. I tried a first iteration at documenting this by going through the classes, starting with key/value pairs, in the blog post below.

http://jayunit100.blogspot.com/2013/04/the-kv-pair-salmon-run-in-mapreduce-hdfs.html

Note that it's not perfect yet, but I think it should provide some insight into things. The linchpin of it all is the DFSOutputStream and the DataStreamer classes. Anyway, feel free to borrow the contents and roll your own, comment on it and leave some feedback, or let me know if anything is missing.

It would definitely be awesome to have a rock-solid view of the full write path.
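
As a very rough picture of why those two classes matter, here is a toy producer/consumer sketch. It is not Hadoop code: WritePathSketch, ToyOutputStream, ToyStreamer, the packet size and the replica lists are all invented for illustration. The assumption it illustrates is that the output stream side cuts the caller's bytes into packets and queues them, while a separate streamer thread drains the queue and ships the packets to the replicas.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Toy model only. Every class and method here is hypothetical, not Hadoop's API.
class WritePathSketch {

    /** Sentinel packet that tells the streamer thread to stop. */
    private static final byte[] END = new byte[0];

    /** Output stream role: cuts the caller's bytes into packets and queues them. */
    static final class ToyOutputStream {
        private final BlockingQueue<byte[]> queue;
        private final int packetSize;

        ToyOutputStream(BlockingQueue<byte[]> queue, int packetSize) {
            this.queue = queue;
            this.packetSize = packetSize;
        }

        void write(byte[] data) throws InterruptedException {
            for (int off = 0; off < data.length; off += packetSize) {
                int end = Math.min(off + packetSize, data.length);
                queue.put(Arrays.copyOfRange(data, off, end));
            }
        }

        void close() throws InterruptedException {
            queue.put(END);
        }
    }

    /** Streamer role: a background thread that drains the queue into the replicas. */
    static final class ToyStreamer extends Thread {
        private final BlockingQueue<byte[]> queue;
        private final List<List<byte[]>> replicas;

        ToyStreamer(BlockingQueue<byte[]> queue, List<List<byte[]>> replicas) {
            this.queue = queue;
            this.replicas = replicas;
        }

        @Override
        public void run() {
            try {
                for (byte[] p = queue.take(); p != END; p = queue.take()) {
                    // Simplification: the toy hands the packet to every replica
                    // itself instead of forwarding it from node to node.
                    for (List<byte[]> replica : replicas) {
                        replica.add(p);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    /** Writes 11 bytes in 4-byte packets to 3 replicas; returns packets per replica. */
    static int[] demo() {
        BlockingQueue<byte[]> queue = new LinkedBlockingQueue<>();
        List<List<byte[]>> replicas = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            replicas.add(new ArrayList<>());
        }
        ToyStreamer streamer = new ToyStreamer(queue, replicas);
        streamer.start();
        try {
            ToyOutputStream out = new ToyOutputStream(queue, 4);
            out.write("hello world".getBytes(StandardCharsets.UTF_8));
            out.close();
            streamer.join(); // after join, the replica lists are safe to read
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while writing", e);
        }
        int[] counts = new int[replicas.size()];
        for (int i = 0; i < counts.length; i++) {
            counts[i] = replicas.get(i).size();
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo()));
    }
}
```

The queue between the two roles is the point of the sketch: the writer never talks to a replica directly, so when tracing the real write path it helps to follow where packets are enqueued and where they are dequeued.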

On May 27, 2013, at 2:10 PM, Mahmood Naderan <nt...@yahoo.com> wrote:

> Hello
> 
> I am trying to understand the Hadoop source code, especially HDFS. I want to know exactly where to look in the source to see how HDFS distributes the data, and also how the MapReduce engine reads that data.
>
> Any hint about where these are located in the source code is appreciated.
>
> Regards,
> Mahmood
