Posted to mapreduce-user@hadoop.apache.org by RJ Nowling <rn...@gmail.com> on 2014/03/16 23:07:01 UTC

Data Locality and WebHDFS

Hi all,

I'm writing up a Google Summer of Code proposal to add HDFS support to
Disco, an Erlang MapReduce framework.

We're interested in using WebHDFS.  I have two questions:

1) Does WebHDFS allow querying data locality information?

2) If the data locality information is known, can data on specific data
nodes be accessed via Web HDFS?  Or do all Web HDFS requests have to go
through a single server?

Thanks,
RJ

-- 
em rnowling@gmail.com
c 954.496.2314

Re: Data Locality and WebHDFS

Posted by RJ Nowling <rn...@gmail.com>.

Thank you, Tsz.  That helps!
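
For anyone who wants to check the redirect behaviour described in the quoted reply below, here is a minimal sketch, not a tested client. It assumes an unsecured cluster and the documented WebHDFS behaviour of answering op=OPEN on the namenode with an HTTP 307 whose Location header names a datanode. The function names are hypothetical, and host, port and path are whatever your cluster uses.

```python
import http.client
from urllib.parse import quote, urlencode, urlsplit


def host_from_location(location):
    """Extract the hostname from a redirect Location header value."""
    return urlsplit(location).hostname


def datanode_for_offset(namenode_host, namenode_port, path, offset):
    """Send op=OPEN for `path` at `offset` to the namenode and return the
    datanode host it redirects to, without fetching any file data."""
    query = urlencode({"op": "OPEN", "offset": offset})
    conn = http.client.HTTPConnection(namenode_host, namenode_port, timeout=10)
    try:
        # http.client does not follow redirects, so the namenode's 307
        # response is handed back to us unchanged.
        conn.request("GET", f"/webhdfs/v1{quote(path)}?{query}")
        response = conn.getresponse()
        response.read()  # drain the response body before closing
        if response.status != 307:
            raise RuntimeError(f"expected a 307 redirect, got {response.status}")
        location = response.getheader("Location")
        if location is None:
            raise RuntimeError("redirect response had no Location header")
        return host_from_location(location)
    finally:
        conn.close()
```

Calling datanode_for_offset once per block start would give a scheduler one candidate host per block. Which replica holder the namenode picks is its own decision, so treat the result as a hint.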


On Mon, Mar 17, 2014 at 2:30 PM, Tsz Wo Sze <sz...@yahoo.com> wrote:

> The file offset is considered in WebHDFS redirection.  It redirects to a
> datanode with the first block the client is going to read, not the first
> block of the file.
>
> Hope it helps.
> Tsz-Wo
>
>
>   On Monday, March 17, 2014 10:09 AM, Alejandro Abdelnur <
> tucu@cloudera.com> wrote:
>
> actually, i am wrong, the webhdfs rest call has an offset.
>
> Alejandro
> (phone typing)
>
> On Mar 17, 2014, at 10:07, Alejandro Abdelnur <tu...@cloudera.com> wrote:
>
> don't recall how skips are handled in webhdfs, but i would assume that
> you'll get to the first block as usual, and the skip is handled by the DN
> serving the file (as webhdfs does not know at open that you'll skip)
>
> Alejandro
> (phone typing)
>
> On Mar 17, 2014, at 9:47, RJ Nowling <rn...@gmail.com> wrote:
>
> Hi Alejandro,
>
> The WebHDFS API allows specifying an offset and length for the request.
>  If I specify an offset that starts in the second block of a file (thus
> skipping the first block altogether), will the namenode still direct me
> to a datanode with the first block or will it direct me to a datanode with
> the second block?  I.e., am I assured data locality only on the first block
> of the file (as you're saying) or on the first block I am accessing?
>
> If it is as you say, then I may want to reach out to the WebHDFS developers
> and see if they would be interested in the additional functionality.
>
> Thank you,
> RJ
>
>
> On Mon, Mar 17, 2014 at 2:40 AM, Alejandro Abdelnur <tu...@cloudera.com> wrote:
>
> I may have expressed myself wrong. You don't need to do any test to see
> how locality works with files of multiple blocks. If you are accessing a
> file of more than one block over webhdfs, you only have assured locality
> for the first block of the file.
>
> Thanks.
>
>
> On Sun, Mar 16, 2014 at 9:18 PM, RJ Nowling <rn...@gmail.com> wrote:
>
> Thank you, Mingjiang and Alejandro.
>
> This is interesting.  Since we will use the data locality information for
> scheduling, we could "hack" this to get the data locality information, at
> least for the first block.  As Alejandro says, we'd have to test what
> happens for other data blocks -- e.g., what if, knowing the block sizes, we
> request the second or third block?
>
> Interesting food for thought!  I see some experiments in my future!
>
> Thanks!
>
>
> On Sun, Mar 16, 2014 at 10:14 PM, Alejandro Abdelnur <tu...@cloudera.com> wrote:
>
> well, this is for the first block of the file, the rest of the file
> (blocks being local or not) is streamed out by the same datanode. for
> small files (one block) you'll get locality, for large files only for the
> first block, and by chance if other blocks are local to that datanode.
>
>
> Alejandro
> (phone typing)
>
> On Mar 16, 2014, at 18:53, Mingjiang Shi <ms...@gopivotal.com> wrote:
>
> According to this page:
> http://hortonworks.com/blog/webhdfs-%E2%80%93-http-rest-access-to-hdfs/
>
> *Data Locality*: The file read and file write calls are redirected to the
> corresponding datanodes. It uses the full bandwidth of the Hadoop cluster
> for streaming data.
> *A HDFS Built-in Component*: WebHDFS is a first class built-in component
> of HDFS. It runs inside Namenodes and Datanodes, therefore, it can use all
> HDFS functionalities. It is a part of HDFS - there are no additional
> servers to install
>
>
> So it looks like data locality is built into webhdfs, and the client will
> be redirected to the data node automatically.
>
>
>
>
> On Mon, Mar 17, 2014 at 6:07 AM, RJ Nowling <rn...@gmail.com> wrote:
>
> Hi all,
>
> I'm writing up a Google Summer of Code proposal to add HDFS support to
> Disco, an Erlang MapReduce framework.
>
> We're interested in using WebHDFS.  I have two questions:
>
> 1) Does WebHDFS allow querying data locality information?
>
> 2) If the data locality information is known, can data on specific data
> nodes be accessed via Web HDFS?  Or do all Web HDFS requests have to go
> through a single server?
>
> Thanks,
> RJ
>
> --
> em rnowling@gmail.com
> c 954.496.2314
>
>
>
>
> --
> Cheers
> -MJ
>
>
>
>
> --
> em rnowling@gmail.com
> c 954.496.2314
>
>
>
>
> --
> Alejandro
>
>
>
>
> --
> em rnowling@gmail.com
> c 954.496.2314
>
>
>
>


-- 
em rnowling@gmail.com
c 954.496.2314

Re: Data Locality and WebHDFS

Posted by Tsz Wo Sze <sz...@yahoo.com>.

The file offset is considered in WebHDFS redirection.  It redirects to a datanode with the first block the client is going to read, not the first block of the file.

Hope it helps.
Tsz-Wo
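
A minimal sketch of how a client might use this, assuming the standard WebHDFS REST URL layout (op=OPEN with the offset and length parameters). The function name, the namenode address, the path and the sizes are illustrative values only. In practice the file length and block size would come from a GETFILESTATUS call.

```python
from urllib.parse import quote, urlencode


def block_read_urls(namenode, path, file_length, block_size):
    """Return one WebHDFS OPEN URL per HDFS block of a file.

    `namenode` is the "host:port" of the namenode HTTP endpoint;
    `file_length` and `block_size` are in bytes.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    urls = []
    # Step through the file one block at a time; each offset is a block start.
    for offset in range(0, file_length, block_size):
        # The last block may be shorter than block_size.
        length = min(block_size, file_length - offset)
        query = urlencode({"op": "OPEN", "offset": offset, "length": length})
        urls.append(f"http://{namenode}/webhdfs/v1{quote(path)}?{query}")
    return urls


if __name__ == "__main__":
    # Hypothetical 300 MB file stored with a 128 MB block size.
    for url in block_read_urls("namenode.example.com:50070", "/data/input.txt",
                               300 * 1024 * 1024, 128 * 1024 * 1024):
        print(url)
```

Each URL starts at a block boundary, so per the behaviour described above the namenode should redirect each request to a datanode holding that block rather than the first block of the file.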



On Monday, March 17, 2014 10:09 AM, Alejandro Abdelnur <tu...@cloudera.com> wrote:
 
actually, i am wrong, the webhdfs rest call has an offset. 
>
>Alejandro
>(phone typing)
>
>On Mar 17, 2014, at 10:07, Alejandro Abdelnur <tu...@cloudera.com> wrote:
>
>
>don't recall how skips are handled in webhdfs, but i would assume that you'll get to the first block as usual, and the skip is handled by the DN serving the file (as webhdfs does not know at open that you'll skip)
>
>Alejandro
>(phone typing)
>
>On Mar 17, 2014, at 9:47, RJ Nowling <rn...@gmail.com> wrote:
>
>
>Hi Alejandro,
>>
>>
>>The WebHDFS API allows specifying an offset and length for the request.  If I specify an offset that starts in the second block of a file (thus skipping the first block altogether), will the namenode still direct me to a datanode with the first block or will it direct me to a datanode with the second block?  I.e., am I assured data locality only on the first block of the file (as you're saying) or on the first block I am accessing?
>>
>>
>>If it is as you say, then I may want to reach out to the WebHDFS developers and see if they would be interested in the additional functionality.
>>
>>
>>Thank you,
>>RJ
>>
>>
>>
>>On Mon, Mar 17, 2014 at 2:40 AM, Alejandro Abdelnur <tu...@cloudera.com> wrote:
>>
>>I may have expressed myself wrong. You don't need to do any test to see how locality works with files of multiple blocks. If you are accessing a file of more than one block over webhdfs, you only have assured locality for the first block of the file.
>>>
>>>
>>>Thanks.
>>>
>>>
>>>
>>>On Sun, Mar 16, 2014 at 9:18 PM, RJ Nowling <rn...@gmail.com> wrote:
>>>
>>>Thank you, Mingjiang and Alejandro.
>>>>
>>>>
>>>>This is interesting.  Since we will use the data locality information for scheduling, we could "hack" this to get the data locality information, at least for the first block.  As Alejandro says, we'd have to test what happens for other data blocks -- e.g., what if, knowing the block sizes, we request the second or third block?
>>>>
>>>>
>>>>Interesting food for thought!  I see some experiments in my future!  
>>>>
>>>>
>>>>Thanks!
>>>>
>>>>
>>>>
>>>>On Sun, Mar 16, 2014 at 10:14 PM, Alejandro Abdelnur <tu...@cloudera.com> wrote:
>>>>
>>>>well, this is for the first block of the file, the rest of the file (blocks being local or not) is streamed out by the same datanode. for small files (one block) you'll get locality, for large files only for the first block, and by chance if other blocks are local to that datanode.
>>>>>
>>>>>
>>>>>
>>>>>Alejandro
>>>>>(phone typing)
>>>>>
>>>>>On Mar 16, 2014, at 18:53, Mingjiang Shi <ms...@gopivotal.com> wrote:
>>>>>
>>>>>
>>>>>According to this page: http://hortonworks.com/blog/webhdfs-%E2%80%93-http-rest-access-to-hdfs/
>>>>>>
>>>>>>Data Locality: The file read and file write calls are redirected to the corresponding datanodes. It uses the full bandwidth of the Hadoop cluster for streaming data.
>>>>>>A HDFS Built-in Component: WebHDFS is a first class built-in component of HDFS. It runs inside Namenodes and Datanodes, therefore, it can use all HDFS functionalities. It is a part of HDFS – there are no additional servers to install
>>>>>>
>>>>>>So it looks like data locality is built into webhdfs, and the client will be redirected to the data node automatically.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>On Mon, Mar 17, 2014 at 6:07 AM, RJ Nowling <rn...@gmail.com> wrote:
>>>>>>
>>>>>>Hi all,
>>>>>>>
>>>>>>>
>>>>>>>I'm writing up a Google Summer of Code proposal to add HDFS support to Disco, an Erlang MapReduce framework.  
>>>>>>>
>>>>>>>
>>>>>>>We're interested in using WebHDFS.  I have two questions:
>>>>>>>
>>>>>>>
>>>>>>>1) Does WebHDFS allow querying data locality information?
>>>>>>>
>>>>>>>
>>>>>>>2) If the data locality information is known, can data on specific data nodes be accessed via Web HDFS?  Or do all Web HDFS requests have to go through a single server?
>>>>>>>
>>>>>>>Thanks,
>>>>>>>RJ
>>>>>>>
>>>>>>>
>>>>>>>-- 
>>>>>>>em rnowling@gmail.com
>>>>>>>c 954.496.2314 
>>>>>>
>>>>>>
>>>>>>-- 
>>>>>>
>>>>>>Cheers
>>>>>>-MJ
>>>>>>
>>>>
>>>>
>>>>
>>>>-- 
>>>>em rnowling@gmail.com
>>>>c 954.496.2314 
>>>
>>>
>>>
>>>-- 
>>>Alejandro 
>>
>>
>>
>>-- 
>>em rnowling@gmail.com
>>c 954.496.2314 
>
>

Re: Data Locality and WebHDFS

Posted by Tsz Wo Sze <sz...@yahoo.com>.

The file offset is considered in WebHDFS redirection.  It redirects to a datanode with the first block the client is going to read, not the first block of the file.

Hope it helps.
Tsz-Wo
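
For anyone who wants to check this on their own cluster, here is a minimal Python sketch of the idea, not an official client. It sends an OPEN request with an offset to the namenode and reads the datanode host from the Location header of the 307 redirect, without following the redirect. The function names are made up for illustration, and the host names and ports in the usage note are assumptions that depend on your cluster configuration.

```python
import http.client
from urllib.parse import quote, urlencode, urlsplit


def build_open_url_path(hdfs_path, offset=0, length=None):
    """Build the path and query string of a WebHDFS OPEN request."""
    params = {"op": "OPEN", "offset": offset}
    if length is not None:
        params["length"] = length
    return "/webhdfs/v1" + quote(hdfs_path) + "?" + urlencode(params)


def host_from_location(location):
    """Extract the host name from a redirect Location header."""
    return urlsplit(location).hostname


def probe_datanode(namenode_host, namenode_port, hdfs_path, offset, length=None):
    """Return the datanode host the namenode redirects this read to.

    http.client never follows redirects on its own, so the 307 response
    comes back as-is and no file data is transferred.
    """
    conn = http.client.HTTPConnection(namenode_host, namenode_port, timeout=10)
    try:
        conn.request("GET", build_open_url_path(hdfs_path, offset, length))
        resp = conn.getresponse()
        resp.read()  # drain the (empty) redirect body
        if resp.status != 307:
            raise RuntimeError("expected a 307 redirect, got %d" % resp.status)
        return host_from_location(resp.getheader("Location"))
    finally:
        conn.close()
```

A call such as probe_datanode("namenode.example.com", 50070, "/data/part-00000", 134217728) would then report which datanode serves the block containing that offset. The host name, port, and path here are placeholders, and a cluster without security enabled may also need a user.name query parameter.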



On Monday, March 17, 2014 10:09 AM, Alejandro Abdelnur <tu...@cloudera.com> wrote:
 
>Actually, I am wrong: the WebHDFS REST call has an offset.
>
>Alejandro
>(phone typing)
>
>On Mar 17, 2014, at 10:07, Alejandro Abdelnur <tu...@cloudera.com> wrote:
>
>
>I don't recall how skips are handled in WebHDFS, but I would assume that you'll get to the first block as usual, and the skip is handled by the DN serving the file (as WebHDFS does not know at open that you'll skip).
>
>Alejandro
>(phone typing)
>
>On Mar 17, 2014, at 9:47, RJ Nowling <rn...@gmail.com> wrote:
>
>
>Hi Alejandro,
>>
>>
>>The WebHDFS API allows specifying an offset and length for the request.  If I specify an offset that starts in the second block of a file (thus skipping the first block altogether), will the namenode still direct me to a datanode with the first block or will it direct me to a datanode with the second block?  I.e., am I assured data locality only on the first block of the file (as you're saying) or on the first block I am accessing?
>>
>>
>>If it is as you say, then I may want to reach out to the WebHDFS developers and see if they would be interested in the additional functionality.
>>
>>
>>Thank you,
>>RJ
>>
>>
>>
>>On Mon, Mar 17, 2014 at 2:40 AM, Alejandro Abdelnur <tu...@cloudera.com> wrote:
>>
>>I may have expressed myself wrong. You don't need to do any test to see how locality works with files of multiple blocks. If you are accessing a file of more than one block over webhdfs, you only have assured locality for the first block of the file.
>>>
>>>
>>>Thanks.
>>>
>>>
>>>
>>>On Sun, Mar 16, 2014 at 9:18 PM, RJ Nowling <rn...@gmail.com> wrote:
>>>
>>>Thank you, Mingjiang and Alejandro.
>>>>
>>>>
>>>>This is interesting.  Since we will use the data locality information for scheduling, we could "hack" this to get the data locality information, at least for the first block.  As Alejandro says, we'd have to test what happens for other data blocks -- e.g., what if, knowing the block sizes, we request the second or third block?
>>>>
>>>>
>>>>Interesting food for thought!  I see some experiments in my future!  
>>>>
>>>>
>>>>Thanks!
>>>>
>>>>
>>>>
>>>>On Sun, Mar 16, 2014 at 10:14 PM, Alejandro Abdelnur <tu...@cloudera.com> wrote:
>>>>
>>>>Well, this is for the first block of the file; the rest of the file (blocks being local or not) is streamed out by the same datanode. For small files (one block) you'll get locality; for large files, only for the first block, and for other blocks only by chance, if they are local to that datanode.
>>>>>
>>>>>
>>>>>
>>>>>Alejandro
>>>>>(phone typing)
>>>>>
>>>>>On Mar 16, 2014, at 18:53, Mingjiang Shi <ms...@gopivotal.com> wrote:
>>>>>
>>>>>
>>>>>According to this page: http://hortonworks.com/blog/webhdfs-%E2%80%93-http-rest-access-to-hdfs/
>>>>>>
>>>>>>Data Locality: The file read and file write calls are redirected to the corresponding datanodes. It uses the full bandwidth of the Hadoop cluster for streaming data.
>>>>>>
>>>>>>A HDFS Built-in Component: WebHDFS is a first class built-in component of HDFS. It runs inside Namenodes and Datanodes, therefore, it can use all HDFS functionalities. It is a part of HDFS – there are no additional servers to install
>>>>>>
>>>>>>So it looks like data locality is built into WebHDFS: the client will be redirected to the datanode automatically.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>On Mon, Mar 17, 2014 at 6:07 AM, RJ Nowling <rn...@gmail.com> wrote:
>>>>>>
>>>>>>Hi all,
>>>>>>>
>>>>>>>
>>>>>>>I'm writing up a Google Summer of Code proposal to add HDFS support to Disco, an Erlang MapReduce framework.  
>>>>>>>
>>>>>>>
>>>>>>>We're interested in using WebHDFS.  I have two questions:
>>>>>>>
>>>>>>>
>>>>>>>1) Does WebHDFS allow querying data locality information?
>>>>>>>
>>>>>>>
>>>>>>>2) If the data locality information is known, can data on specific data nodes be accessed via Web HDFS?  Or do all Web HDFS requests have to go through a single server?
>>>>>>>
>>>>>>>Thanks,
>>>>>>>RJ
>>>>>>>
>>>>>>>
>>>>>>>-- 
>>>>>>>em rnowling@gmail.com
>>>>>>>c 954.496.2314 
>>>>>>
>>>>>>
>>>>>>-- 
>>>>>>
>>>>>>Cheers
>>>>>>-MJ
>>>>>>
>>>>
>>>>
>>>>
>>>>-- 
>>>>em rnowling@gmail.com
>>>>c 954.496.2314 
>>>
>>>
>>>
>>>-- 
>>>Alejandro 
>>
>>
>>
>>-- 
>>em rnowling@gmail.com
>>c 954.496.2314 
>
>

Re: Data Locality and WebHDFS

Posted by Alejandro Abdelnur <tu...@cloudera.com>.

Actually, I am wrong: the WebHDFS REST call has an offset.

Alejandro
(phone typing)
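
Since the OPEN call takes an offset (and optionally a length), a client that knows the file length and block size could issue one request per block. Below is a small Python sketch of that arithmetic. The helper name block_ranges is hypothetical, and it assumes the file length and block size were obtained beforehand, for example from the length and blockSize fields of a GETFILESTATUS response.

```python
def block_ranges(file_length, block_size):
    """Return a list of (offset, length) pairs, one per HDFS block."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    ranges = []
    offset = 0
    while offset < file_length:
        # The last block may be shorter than block_size.
        ranges.append((offset, min(block_size, file_length - offset)))
        offset += block_size
    return ranges
```

Each pair can be passed as the offset and length parameters of a separate OPEN request, so that every request starts at a block boundary.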

> On Mar 17, 2014, at 10:07, Alejandro Abdelnur <tu...@cloudera.com> wrote:
> 
> I don't recall how skips are handled in WebHDFS, but I would assume that you'll get to the first block as usual, and the skip is handled by the DN serving the file (as WebHDFS does not know at open that you'll skip).
> 
> Alejandro
> (phone typing)
> 
>> On Mar 17, 2014, at 9:47, RJ Nowling <rn...@gmail.com> wrote:
>> 
>> Hi Alejandro,
>> 
>> The WebHDFS API allows specifying an offset and length for the request.  If I specify an offset that starts in the second block of a file (thus skipping the first block altogether), will the namenode still direct me to a datanode with the first block or will it direct me to a datanode with the second block?  I.e., am I assured data locality only on the first block of the file (as you're saying) or on the first block I am accessing?
>> 
>> If it is as you say, then I may want to reach out to the WebHDFS developers and see if they would be interested in the additional functionality.
>> 
>> Thank you,
>> RJ
>> 
>> 
>>> On Mon, Mar 17, 2014 at 2:40 AM, Alejandro Abdelnur <tu...@cloudera.com> wrote:
>>> I may have expressed myself wrong. You don't need to do any test to see how locality works with files of multiple blocks. If you are accessing a file of more than one block over webhdfs, you only have assured locality for the first block of the file.
>>> 
>>> Thanks.
>>> 
>>> 
>>>> On Sun, Mar 16, 2014 at 9:18 PM, RJ Nowling <rn...@gmail.com> wrote:
>>>> Thank you, Mingjiang and Alejandro.
>>>> 
>>>> This is interesting.  Since we will use the data locality information for scheduling, we could "hack" this to get the data locality information, at least for the first block.  As Alejandro says, we'd have to test what happens for other data blocks -- e.g., what if, knowing the block sizes, we request the second or third block?
>>>> 
>>>> Interesting food for thought!  I see some experiments in my future!  
>>>> 
>>>> Thanks!
>>>> 
>>>> 
>>>>> On Sun, Mar 16, 2014 at 10:14 PM, Alejandro Abdelnur <tu...@cloudera.com> wrote:
>>>>> Well, this is for the first block of the file; the rest of the file (blocks being local or not) is streamed out by the same datanode. For small files (one block) you'll get locality; for large files, only for the first block, and for other blocks only by chance, if they are local to that datanode.
>>>>> 
>>>>> 
>>>>> Alejandro
>>>>> (phone typing)
>>>>> 
>>>>>> On Mar 16, 2014, at 18:53, Mingjiang Shi <ms...@gopivotal.com> wrote:
>>>>>> 
>>>>>> According to this page: http://hortonworks.com/blog/webhdfs-%E2%80%93-http-rest-access-to-hdfs/
>>>>>>> Data Locality: The file read and file write calls are redirected to the corresponding datanodes. It uses the full bandwidth of the Hadoop cluster for streaming data.
>>>>>>> 
>>>>>>> A HDFS Built-in Component: WebHDFS is a first class built-in component of HDFS. It runs inside Namenodes and Datanodes, therefore, it can use all HDFS functionalities. It is a part of HDFS – there are no additional servers to install
>>>>>>> 
>>>>>> 
>>>>>> So it looks like data locality is built into WebHDFS: the client will be redirected to the datanode automatically.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Mon, Mar 17, 2014 at 6:07 AM, RJ Nowling <rn...@gmail.com> wrote:
>>>>>>> Hi all,
>>>>>>> 
>>>>>>> I'm writing up a Google Summer of Code proposal to add HDFS support to Disco, an Erlang MapReduce framework.  
>>>>>>> 
>>>>>>> We're interested in using WebHDFS.  I have two questions:
>>>>>>> 
>>>>>>> 1) Does WebHDFS allow querying data locality information?
>>>>>>> 
>>>>>>> 2) If the data locality information is known, can data on specific data nodes be accessed via Web HDFS?  Or do all Web HDFS requests have to go through a single server?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> RJ
>>>>>>> 
>>>>>>> -- 
>>>>>>> em rnowling@gmail.com
>>>>>>> c 954.496.2314
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> -- 
>>>>>> Cheers
>>>>>> -MJ
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> em rnowling@gmail.com
>>>> c 954.496.2314
>>> 
>>> 
>>> 
>>> -- 
>>> Alejandro
>> 
>> 
>> 
>> -- 
>> em rnowling@gmail.com
>> c 954.496.2314

Re: Data Locality and WebHDFS

Posted by Alejandro Abdelnur <tu...@cloudera.com>.

I don't recall how skips are handled in WebHDFS, but I would assume that you'll get to the first block as usual, and the skip is handled by the DN serving the file (as WebHDFS does not know at open that you'll skip).

Alejandro
(phone typing)

> On Mar 17, 2014, at 9:47, RJ Nowling <rn...@gmail.com> wrote:
> 
> Hi Alejandro,
> 
> The WebHDFS API allows specifying an offset and length for the request.  If I specify an offset that starts in the second block of a file (thus skipping the first block altogether), will the namenode still direct me to a datanode with the first block or will it direct me to a datanode with the second block?  I.e., am I assured data locality only on the first block of the file (as you're saying) or on the first block I am accessing?
> 
> If it is as you say, then I may want to reach out to the WebHDFS developers and see if they would be interested in the additional functionality.
> 
> Thank you,
> RJ
> 
> 
>> On Mon, Mar 17, 2014 at 2:40 AM, Alejandro Abdelnur <tu...@cloudera.com> wrote:
>> I may have expressed myself wrong. You don't need to do any test to see how locality works with files of multiple blocks. If you are accessing a file of more than one block over webhdfs, you only have assured locality for the first block of the file.
>> 
>> Thanks.
>> 
>> 
>>> On Sun, Mar 16, 2014 at 9:18 PM, RJ Nowling <rn...@gmail.com> wrote:
>>> Thank you, Mingjiang and Alejandro.
>>> 
>>> This is interesting.  Since we will use the data locality information for scheduling, we could "hack" this to get the data locality information, at least for the first block.  As Alejandro says, we'd have to test what happens for other data blocks -- e.g., what if, knowing the block sizes, we request the second or third block?
>>> 
>>> Interesting food for thought!  I see some experiments in my future!  
>>> 
>>> Thanks!
>>> 
>>> 
>>>> On Sun, Mar 16, 2014 at 10:14 PM, Alejandro Abdelnur <tu...@cloudera.com> wrote:
>>>> Well, this is for the first block of the file; the rest of the file (blocks being local or not) is streamed out by the same datanode. For small files (one block) you'll get locality; for large files, only for the first block, and for other blocks only by chance, if they are local to that datanode.
>>>> 
>>>> 
>>>> Alejandro
>>>> (phone typing)
>>>> 
>>>>> On Mar 16, 2014, at 18:53, Mingjiang Shi <ms...@gopivotal.com> wrote:
>>>>> 
>>>>> According to this page: http://hortonworks.com/blog/webhdfs-%E2%80%93-http-rest-access-to-hdfs/
>>>>>> Data Locality: The file read and file write calls are redirected to the corresponding datanodes. It uses the full bandwidth of the Hadoop cluster for streaming data.
>>>>>> 
>>>>>> A HDFS Built-in Component: WebHDFS is a first class built-in component of HDFS. It runs inside Namenodes and Datanodes, therefore, it can use all HDFS functionalities. It is a part of HDFS – there are no additional servers to install
>>>>>> 
>>>>> 
>>>>> So it looks like data locality is built into WebHDFS: the client will be redirected to the datanode automatically.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Mon, Mar 17, 2014 at 6:07 AM, RJ Nowling <rn...@gmail.com> wrote:
>>>>>> Hi all,
>>>>>> 
>>>>>> I'm writing up a Google Summer of Code proposal to add HDFS support to Disco, an Erlang MapReduce framework.  
>>>>>> 
>>>>>> We're interested in using WebHDFS.  I have two questions:
>>>>>> 
>>>>>> 1) Does WebHDFS allow querying data locality information?
>>>>>> 
>>>>>> 2) If the data locality information is known, can data on specific data nodes be accessed via Web HDFS?  Or do all Web HDFS requests have to go through a single server?
>>>>>> 
>>>>>> Thanks,
>>>>>> RJ
>>>>>> 
>>>>>> -- 
>>>>>> em rnowling@gmail.com
>>>>>> c 954.496.2314
>>>>> 
>>>>> 
>>>>> 
>>>>> -- 
>>>>> Cheers
>>>>> -MJ
>>> 
>>> 
>>> 
>>> -- 
>>> em rnowling@gmail.com
>>> c 954.496.2314
>> 
>> 
>> 
>> -- 
>> Alejandro
> 
> 
> 
> -- 
> em rnowling@gmail.com
> c 954.496.2314

Re: Data Locality and WebHDFS

Posted by Alejandro Abdelnur <tu...@cloudera.com>.

don't recall how skips are handled in webhdfs, but i would assume that you'll get to the first block as usual, and the skip is handled by the DN serving the file (as webhdfs does not know at open that you'll skip)

Alejandro
(phone typing)

> On Mar 17, 2014, at 9:47, RJ Nowling <rn...@gmail.com> wrote:
> 
> Hi Alejandro,
> 
> The WebHDFS API allows specifying an offset and length for the request.  If I specify an offset that starts in the second block of a file (thus skipping the first block altogether), will the namenode still direct me to a datanode with the first block or will it direct me to a datanode with the second block?  I.e., am I assured data locality only on the first block of the file (as you're saying) or on the first block I am accessing?
> 
> If it is as you say, then I may want to reach out to the WebHDFS developers and see if they would be interested in the additional functionality.
> 
> Thank you,
> RJ
> 
> 
>> On Mon, Mar 17, 2014 at 2:40 AM, Alejandro Abdelnur <tu...@cloudera.com> wrote:
>> I may have expressed myself wrong. You don't need to do any test to see how locality works with files of multiple blocks. If you are accessing a file of more than one block over webhdfs, you only have assured locality for the first block of the file.
>> 
>> Thanks.
>> 
>> 
>>> On Sun, Mar 16, 2014 at 9:18 PM, RJ Nowling <rn...@gmail.com> wrote:
>>> Thank you, Mingjiang and Alejandro.
>>> 
>>> This is interesting.  Since we will use the data locality information for scheduling, we could "hack" this to get the data locality information, at least for the first block.  As Alejandro says, we'd have to test what happens for other data blocks -- e.g., what if, knowing the block sizes, we request the second or third block?
>>> 
>>> Interesting food for thought!  I see some experiments in my future!  
>>> 
>>> Thanks!
>>> 
>>> 
>>>> On Sun, Mar 16, 2014 at 10:14 PM, Alejandro Abdelnur <tu...@cloudera.com> wrote:
>>>> well, this is for the first block of the file, the rest of the file (blocks being local or not) are streamed out by the same datanode. for small files (one block) you'll get locality, for large files only the first block, and by chance if other blocks are local to that datanode. 
>>>> 
>>>> 
>>>> Alejandro
>>>> (phone typing)
>>>> 
>>>>> On Mar 16, 2014, at 18:53, Mingjiang Shi <ms...@gopivotal.com> wrote:
>>>>> 
>>>>> According to this page: http://hortonworks.com/blog/webhdfs-%E2%80%93-http-rest-access-to-hdfs/
>>>>>> Data Locality: The file read and file write calls are redirected to the corresponding datanodes. It uses the full bandwidth of the Hadoop cluster for streaming data.
>>>>>> 
>>>>>> A HDFS Built-in Component: WebHDFS is a first class built-in component of HDFS. It runs inside Namenodes and Datanodes, therefore, it can use all HDFS functionalities. It is a part of HDFS – there are no additional servers to install
>>>>>> 
>>>>> 
>>>>> So it looks like the data locality is built-into webhdfs, client will be redirected to the data node automatically. 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Mon, Mar 17, 2014 at 6:07 AM, RJ Nowling <rn...@gmail.com> wrote:
>>>>>> Hi all,
>>>>>> 
>>>>>> I'm writing up a Google Summer of Code proposal to add HDFS support to Disco, an Erlang MapReduce framework.  
>>>>>> 
>>>>>> We're interested in using WebHDFS.  I have two questions:
>>>>>> 
>>>>>> 1) Does WebHDFS allow querying data locality information?
>>>>>> 
>>>>>> 2) If the data locality information is known, can data on specific data nodes be accessed via Web HDFS?  Or do all Web HDFS requests have to go through a single server?
>>>>>> 
>>>>>> Thanks,
>>>>>> RJ
>>>>>> 
>>>>>> -- 
>>>>>> em rnowling@gmail.com
>>>>>> c 954.496.2314
>>>>> 
>>>>> 
>>>>> 
>>>>> -- 
>>>>> Cheers
>>>>> -MJ
>>> 
>>> 
>>> 
>>> -- 
>>> em rnowling@gmail.com
>>> c 954.496.2314
>> 
>> 
>> 
>> -- 
>> Alejandro
> 
> 
> 
> -- 
> em rnowling@gmail.com
> c 954.496.2314

Re: Data Locality and WebHDFS

Posted by RJ Nowling <rn...@gmail.com>.

Hi Alejandro,

The WebHDFS API allows specifying an offset and length for the request.  If
I specify an offset that starts in the second block of a file (thus
skipping the first block altogether), will the namenode still direct me
to a datanode with the first block or will it direct me to a datanode with
the second block?  I.e., am I assured data locality only on the first block
of the file (as you're saying) or on the first block I am accessing?

If it is as you say, then I may want to reach out to the WebHDFS developers
and see if they would be interested in the additional functionality.
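
To make the question concrete, here is a minimal Python sketch of the
request I have in mind.  It assumes the documented WebHDFS URL layout
(/webhdfs/v1/<path>?op=OPEN with optional offset and length parameters) and
that the namenode answers with an HTTP redirect whose Location header names
a datanode.  The host names, ports, and path are made up for illustration,
and the helper names open_url and redirect_host are hypothetical, not part
of any library.

```python
from urllib.parse import urlencode, urlsplit


def open_url(namenode, path, offset=0, length=None):
    """Build a WebHDFS OPEN URL. `namenode` is "host:port" (assumed value)."""
    params = {"op": "OPEN", "offset": offset}
    if length is not None:
        params["length"] = length
    return "http://%s/webhdfs/v1%s?%s" % (namenode, path, urlencode(params))


def redirect_host(location):
    """Return the host named in a redirect's Location header."""
    return urlsplit(location).hostname


# Hypothetical cluster and file: ask for 1 KiB starting at byte 134217728,
# which is the start of the second block if the block size is 128 MiB.
url = open_url("namenode.example.com:50070", "/data/big.txt",
               offset=134217728, length=1024)
print(url)

# A made-up Location header of the kind the namenode might send back.
print(redirect_host(
    "http://dn3.example.com:50075/webhdfs/v1/data/big.txt?op=OPEN&offset=134217728"))
```

Sending the request with redirects disabled and passing the Location
header to redirect_host would show which datanode was chosen for a given
offset.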

Thank you,
RJ


On Mon, Mar 17, 2014 at 2:40 AM, Alejandro Abdelnur <tu...@cloudera.com> wrote:

> I may have expressed myself wrong. You don't need to do any test to see
> how locality works with files of multiple blocks. If you are accessing a
> file of more than one block over webhdfs, you only have assured locality
> for the first block of the file.
>
> Thanks.
>
>
> On Sun, Mar 16, 2014 at 9:18 PM, RJ Nowling <rn...@gmail.com> wrote:
>
>> Thank you, Mingjiang and Alejandro.
>>
>> This is interesting.  Since we will use the data locality information for
>> scheduling, we could "hack" this to get the data locality information, at
>> least for the first block.  As Alejandro says, we'd have to test what
>> happens for other data blocks -- e.g., what if, knowing the block sizes, we
>> request the second or third block?
>>
>> Interesting food for thought!  I see some experiments in my future!
>>
>> Thanks!
>>
>>
>> On Sun, Mar 16, 2014 at 10:14 PM, Alejandro Abdelnur <tu...@cloudera.com> wrote:
>>
>>> well, this is for the first block of the file, the rest of the file
>>> (blocks being local or not) are streamed out by the same datanode. for
>>> small files (one block) you'll get locality, for large files only the first
>>> block, and by chance if other blocks are local to that datanode.
>>>
>>>
>>> Alejandro
>>> (phone typing)
>>>
>>> On Mar 16, 2014, at 18:53, Mingjiang Shi <ms...@gopivotal.com> wrote:
>>>
>>> According to this page:
>>> http://hortonworks.com/blog/webhdfs-%E2%80%93-http-rest-access-to-hdfs/
>>>
>>>> *Data Locality*: The file read and file write calls are redirected to
>>>> the corresponding datanodes. It uses the full bandwidth of the Hadoop
>>>> cluster for streaming data.
>>>>
>>>> *A HDFS Built-in Component*: WebHDFS is a first class built-in
>>>> component of HDFS. It runs inside Namenodes and Datanodes, therefore, it
>>>> can use all HDFS functionalities. It is a part of HDFS - there are no
>>>> additional servers to install
>>>>
>>>
>>> So it looks like the data locality is built-into webhdfs, client will be
>>> redirected to the data node automatically.
>>>
>>>
>>>
>>>
>>> On Mon, Mar 17, 2014 at 6:07 AM, RJ Nowling <rn...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'm writing up a Google Summer of Code proposal to add HDFS support to
>>>> Disco, an Erlang MapReduce framework.
>>>>
>>>> We're interested in using WebHDFS.  I have two questions:
>>>>
>>>> 1) Does WebHDFS allow querying data locality information?
>>>>
>>>> 2) If the data locality information is known, can data on specific data
>>>> nodes be accessed via Web HDFS?  Or do all Web HDFS requests have to go
>>>> through a single server?
>>>>
>>>> Thanks,
>>>> RJ
>>>>
>>>> --
>>>> em rnowling@gmail.com
>>>> c 954.496.2314
>>>>
>>>
>>>
>>>
>>> --
>>> Cheers
>>> -MJ
>>>
>>>
>>
>>
>> --
>> em rnowling@gmail.com
>> c 954.496.2314
>>
>
>
>
> --
> Alejandro
>



-- 
em rnowling@gmail.com
c 954.496.2314

Re: Data Locality and WebHDFS

Posted by Alejandro Abdelnur <tu...@cloudera.com>.

I may have expressed myself poorly. You don't need to run any test to see how
locality works with files of multiple blocks. If you are accessing a file
of more than one block over webhdfs, you are only assured locality for the
first block of the file.

Thanks.


On Sun, Mar 16, 2014 at 9:18 PM, RJ Nowling <rn...@gmail.com> wrote:

> Thank you, Mingjiang and Alejandro.
>
> This is interesting.  Since we will use the data locality information for
> scheduling, we could "hack" this to get the data locality information, at
> least for the first block.  As Alejandro says, we'd have to test what
> happens for other data blocks -- e.g., what if, knowing the block sizes, we
> request the second or third block?
>
> Interesting food for thought!  I see some experiments in my future!
>
> Thanks!
>
>
> On Sun, Mar 16, 2014 at 10:14 PM, Alejandro Abdelnur <tu...@cloudera.com> wrote:
>
>> well, this is for the first block of the file, the rest of the file
>> (blocks being local or not) are streamed out by the same datanode. for
>> small files (one block) you'll get locality, for large files only the first
>> block, and by chance if other blocks are local to that datanode.
>>
>>
>> Alejandro
>> (phone typing)
>>
>> On Mar 16, 2014, at 18:53, Mingjiang Shi <ms...@gopivotal.com> wrote:
>>
>> According to this page:
>> http://hortonworks.com/blog/webhdfs-%E2%80%93-http-rest-access-to-hdfs/
>>
>>> *Data Locality*: The file read and file write calls are redirected to
>>> the corresponding datanodes. It uses the full bandwidth of the Hadoop
>>> cluster for streaming data.
>>>
>>> *A HDFS Built-in Component*: WebHDFS is a first class built-in
>>> component of HDFS. It runs inside Namenodes and Datanodes, therefore, it
>>> can use all HDFS functionalities. It is a part of HDFS - there are no
>>> additional servers to install
>>>
>>
>> So it looks like the data locality is built-into webhdfs, client will be
>> redirected to the data node automatically.
>>
>>
>>
>>
>> On Mon, Mar 17, 2014 at 6:07 AM, RJ Nowling <rn...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I'm writing up a Google Summer of Code proposal to add HDFS support to
>>> Disco, an Erlang MapReduce framework.
>>>
>>> We're interested in using WebHDFS.  I have two questions:
>>>
>>> 1) Does WebHDFS allow querying data locality information?
>>>
>>> 2) If the data locality information is known, can data on specific data
>>> nodes be accessed via Web HDFS?  Or do all Web HDFS requests have to go
>>> through a single server?
>>>
>>> Thanks,
>>> RJ
>>>
>>> --
>>> em rnowling@gmail.com
>>> c 954.496.2314
>>>
>>
>>
>>
>> --
>> Cheers
>> -MJ
>>
>>
>
>
> --
> em rnowling@gmail.com
> c 954.496.2314
>



-- 
Alejandro

Re: Data Locality and WebHDFS

Posted by RJ Nowling <rn...@gmail.com>.

Thank you, Mingjiang and Alejandro.

This is interesting.  Since we will use the data locality information for
scheduling, we could "hack" this to get the data locality information, at
least for the first block.  As Alejandro says, we'd have to test what
happens for other data blocks -- e.g., what if, knowing the block sizes, we
request the second or third block?
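
For that experiment, a rough Python sketch of how the per-block offsets
could be computed.  It assumes every block of the file except possibly the
last has the same size, and that the file length and block size are already
known (for example from a file status query).  The 300 MiB file and 128 MiB
block size are hypothetical numbers, and block_offsets and block_index are
illustrative helper names.

```python
def block_offsets(file_length, block_size):
    """Start offset of every block in a file of `file_length` bytes."""
    return list(range(0, file_length, block_size))


def block_index(offset, block_size):
    """0-based index of the block that contains byte `offset`."""
    return offset // block_size


# Hypothetical file: 300 MiB stored in 128 MiB blocks, i.e. three blocks.
MIB = 1024 * 1024
starts = block_offsets(300 * MIB, 128 * MIB)
print(starts)  # [0, 134217728, 268435456]
```

Each start offset could then be used as the offset of a separate read
request, so the redirect target for the second and third blocks can be
compared with the one returned for the first.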

Interesting food for thought!  I see some experiments in my future!

Thanks!


On Sun, Mar 16, 2014 at 10:14 PM, Alejandro Abdelnur <tu...@cloudera.com> wrote:

> well, this is for the first block of the file, the rest of the file
> (blocks being local or not) are streamed out by the same datanode. for
> small files (one block) you'll get locality, for large files only the first
> block, and by chance if other blocks are local to that datanode.
>
>
> Alejandro
> (phone typing)
>
> On Mar 16, 2014, at 18:53, Mingjiang Shi <ms...@gopivotal.com> wrote:
>
> According to this page:
> http://hortonworks.com/blog/webhdfs-%E2%80%93-http-rest-access-to-hdfs/
>
>> *Data Locality*: The file read and file write calls are redirected to
>> the corresponding datanodes. It uses the full bandwidth of the Hadoop
>> cluster for streaming data.
>>
>> *A HDFS Built-in Component*: WebHDFS is a first class built-in component
>> of HDFS. It runs inside Namenodes and Datanodes, therefore, it can use all
>> HDFS functionalities. It is a part of HDFS - there are no additional
>> servers to install
>>
>
> So it looks like the data locality is built-into webhdfs, client will be
> redirected to the data node automatically.
>
>
>
>
> On Mon, Mar 17, 2014 at 6:07 AM, RJ Nowling <rn...@gmail.com> wrote:
>
>> Hi all,
>>
>> I'm writing up a Google Summer of Code proposal to add HDFS support to
>> Disco, an Erlang MapReduce framework.
>>
>> We're interested in using WebHDFS.  I have two questions:
>>
>> 1) Does WebHDFS allow querying data locality information?
>>
>> 2) If the data locality information is known, can data on specific data
>> nodes be accessed via Web HDFS?  Or do all Web HDFS requests have to go
>> through a single server?
>>
>> Thanks,
>> RJ
>>
>> --
>> em rnowling@gmail.com
>> c 954.496.2314
>>
>
>
>
> --
> Cheers
> -MJ
>
>


-- 
em rnowling@gmail.com
c 954.496.2314

Re: Data Locality and WebHDFS

Posted by Alejandro Abdelnur <tu...@cloudera.com>.

Well, this is for the first block of the file; the rest of the file (whether its blocks are local or not) is streamed out by the same datanode. For small files (one block) you'll get locality; for large files you get it only for the first block, and for other blocks only by chance, if they happen to be local to that datanode.


Alejandro
(phone typing)

> On Mar 16, 2014, at 18:53, Mingjiang Shi <ms...@gopivotal.com> wrote:
> 
> According to this page: http://hortonworks.com/blog/webhdfs-%E2%80%93-http-rest-access-to-hdfs/
>> Data Locality: The file read and file write calls are redirected to the corresponding datanodes. It uses the full bandwidth of the Hadoop cluster for streaming data.
>> 
>> A HDFS Built-in Component: WebHDFS is a first class built-in component of HDFS. It runs inside Namenodes and Datanodes, therefore, it can use all HDFS functionalities. It is a part of HDFS – there are no additional servers to install
>> 
> 
> So it looks like the data locality is built-into webhdfs, client will be redirected to the data node automatically. 
> 
> 
> 
> 
>> On Mon, Mar 17, 2014 at 6:07 AM, RJ Nowling <rn...@gmail.com> wrote:
>> Hi all,
>> 
>> I'm writing up a Google Summer of Code proposal to add HDFS support to Disco, an Erlang MapReduce framework.  
>> 
>> We're interested in using WebHDFS.  I have two questions:
>> 
>> 1) Does WebHDFS allow querying data locality information?
>> 
>> 2) If the data locality information is known, can data on specific data nodes be accessed via Web HDFS?  Or do all Web HDFS requests have to go through a single server?
>> 
>> Thanks,
>> RJ
>> 
>> -- 
>> em rnowling@gmail.com
>> c 954.496.2314
> 
> 
> 
> -- 
> Cheers
> -MJ

Re: Data Locality and WebHDFS

Posted by Mingjiang Shi <ms...@gopivotal.com>.

According to this page:
http://hortonworks.com/blog/webhdfs-%E2%80%93-http-rest-access-to-hdfs/

> *Data Locality*: The file read and file write calls are redirected to the
> corresponding datanodes. It uses the full bandwidth of the Hadoop cluster
> for streaming data.
>
> *A HDFS Built-in Component*: WebHDFS is a first class built-in component
> of HDFS. It runs inside Namenodes and Datanodes, therefore, it can use all
> HDFS functionalities. It is a part of HDFS - there are no additional
> servers to install
>

So it looks like data locality is built into WebHDFS: the client will be
redirected to the datanode automatically.
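
A rough sketch of how a client could observe that redirect, assuming an unsecured cluster where the namenode answers an OPEN request with a 307 response whose `Location` header points at a datanode. The function names are hypothetical; `http.client` is used because it does not follow redirects on its own, so the redirect target stays visible to the caller.

```python
import http.client
from urllib.parse import urlsplit


def datanode_from_location(location):
    """Extract the host name from a redirect Location header value."""
    return urlsplit(location).hostname


def first_hop_datanode(namenode_host, namenode_port, path, offset=0):
    """Send an OPEN request to the namenode and return the host it redirects to.

    Returns None if the response is not a 307 redirect with a Location header.
    """
    conn = http.client.HTTPConnection(namenode_host, namenode_port, timeout=10)
    try:
        conn.request("GET", f"/webhdfs/v1{path}?op=OPEN&offset={offset}")
        resp = conn.getresponse()
        location = resp.getheader("Location")
        if resp.status == 307 and location:
            return datanode_from_location(location)
        return None
    finally:
        conn.close()
```

A scheduler could call `first_hop_datanode` with the namenode's host and HTTP port and treat the returned host as a locality hint, without ever reading the data itself.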




On Mon, Mar 17, 2014 at 6:07 AM, RJ Nowling <rn...@gmail.com> wrote:

> Hi all,
>
> I'm writing up a Google Summer of Code proposal to add HDFS support to
> Disco, an Erlang MapReduce framework.
>
> We're interested in using WebHDFS.  I have two questions:
>
> 1) Does WebHDFS allow querying data locality information?
>
> 2) If the data locality information is known, can data on specific data
> nodes be accessed via Web HDFS?  Or do all Web HDFS requests have to go
> through a single server?
>
> Thanks,
> RJ
>
> --
> em rnowling@gmail.com
> c 954.496.2314
>



-- 
Cheers
-MJ
