Posted to user@hadoop.apache.org by Alberto Cordioli <co...@gmail.com> on 2013/03/23 15:53:22 UTC

DistributedCache - why not read directly from HDFS?

Hi all,

I was not able to find an answer to the following question. If the
question has already been answered please give me the pointer to the
right thread.

What are the actual differences between reading a file from HDFS in a
mapper and using DistributedCache?

I saw that with DistributedCache you can give an HDFS path and the
task nodes will get the data on their local file system. But what
advantages does this have compared with a simple HDFS read through the
FSDataInputStream returned by FileSystem.open()?

Thank you very much,
Alberto


--
Alberto Cordioli

Re: DistributedCache - why not read directly from HDFS?

Posted by Arun C Murthy <ac...@hortonworks.com>.
More importantly, the second and subsequent accesses of a file in the DistributedCache are guaranteed to be local disk I/O.
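
To make that point concrete, here is a minimal sketch in plain Java. It is not Hadoop code and does not use the DistributedCache API: a temporary directory stands in for HDFS, and the class name LocalizedCacheSketch and its methods are hypothetical names chosen for illustration. It only shows the general idea that the file is copied to the node once and every later access is served from the local copy.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: a local directory plays the role of the
// task node's cache directory, and another directory plays the role of HDFS.
final class LocalizedCacheSketch {
    private final Path localDir;
    private int remoteFetches = 0;

    public LocalizedCacheSketch(Path localDir) {
        this.localDir = localDir;
    }

    // Returns the local copy of the file, copying it from the "remote"
    // store only when no local copy exists yet.
    public Path localize(Path remote) throws IOException {
        Path local = localDir.resolve(remote.getFileName().toString());
        if (!Files.exists(local)) {
            Files.copy(remote, local);
            remoteFetches++;
        }
        return local;
    }

    // Number of times the file had to be copied from the "remote" store.
    public int remoteFetches() {
        return remoteFetches;
    }

    public static void main(String[] args) throws IOException {
        Path remoteStore = Files.createTempDirectory("remote-store");
        Path taskLocalDir = Files.createTempDirectory("task-local");
        Path remoteFile = remoteStore.resolve("lookup.txt");
        Files.write(remoteFile, Arrays.asList("a=1", "b=2"), StandardCharsets.UTF_8);

        LocalizedCacheSketch cache = new LocalizedCacheSketch(taskLocalDir);
        // Three "tasks" on the same node read the same side file.
        for (int task = 0; task < 3; task++) {
            Path local = cache.localize(remoteFile);
            List<String> lines = Files.readAllLines(local, StandardCharsets.UTF_8);
            System.out.println("task " + task + " read " + lines.size() + " lines from " + local);
        }
        System.out.println("remote fetches: " + cache.remoteFetches());
    }
}
```

In this sketch only the first task pays for the copy; the other two read the local file. A direct HDFS read in each task would instead open the remote file every time.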

On Mar 24, 2013, at 3:00 AM, Alberto Cordioli wrote:

> Thanks for your reply Harsh.
> So if I want to read a simple text file, choosing between
> DistributedCache and HDFS becomes just a matter of performance.
> 
> 
> Alberto
> 
> On 23 March 2013 16:17, Harsh J <ha...@cloudera.com> wrote:
>> A DistributedCache is not used just to distribute simple files but
>> also native libraries and such, which in some cases cannot be loaded
>> if they are on HDFS.
>> 
>> Also, reading the file from HDFS could be less performant, as non-local
>> reads could happen (depending on the file's replication factor).
>> 
>> On Sat, Mar 23, 2013 at 8:23 PM, Alberto Cordioli
>> <co...@gmail.com> wrote:
>>> Hi all,
>>> 
>>> I was not able to find an answer to the following question. If the
>>> question has already been answered please give me the pointer to the
>>> right thread.
>>> 
>>> What are the actual differences between reading a file from HDFS in a
>>> mapper and using DistributedCache?
>>> 
>>> I saw that with DistributedCache you can give an HDFS path and the
>>> task nodes will get the data on their local file system. But what
>>> advantages does this have compared with a simple HDFS read through the
>>> FSDataInputStream returned by FileSystem.open()?
>>> 
>>> Thank you very much,
>>> Alberto
>>> 
>>> 
>>> --
>>> Alberto Cordioli
>> 
>> 
>> 
>> --
>> Harsh J
> 
> 
> 
> -- 
> Alberto Cordioli

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/



Re: DistributedCache - why not read directly from HDFS?

Posted by Alberto Cordioli <co...@gmail.com>.
Thanks for your reply Harsh.
So if I want to read a simple text file, choosing between
DistributedCache and HDFS becomes just a matter of performance.


Alberto

On 23 March 2013 16:17, Harsh J <ha...@cloudera.com> wrote:
> A DistributedCache is not used just to distribute simple files but
> also native libraries and such, which in some cases cannot be loaded
> if they are on HDFS.
>
> Also, reading the file from HDFS could be less performant, as non-local
> reads could happen (depending on the file's replication factor).
>
> On Sat, Mar 23, 2013 at 8:23 PM, Alberto Cordioli
> <co...@gmail.com> wrote:
>> Hi all,
>>
>> I was not able to find an answer to the following question. If the
>> question has already been answered please give me the pointer to the
>> right thread.
>>
>> What are the actual differences between reading a file from HDFS in a
>> mapper and using DistributedCache?
>>
>> I saw that with DistributedCache you can give an HDFS path and the
>> task nodes will get the data on their local file system. But what
>> advantages does this have compared with a simple HDFS read through the
>> FSDataInputStream returned by FileSystem.open()?
>>
>> Thank you very much,
>> Alberto
>>
>>
>> --
>> Alberto Cordioli
>
>
>
> --
> Harsh J



-- 
Alberto Cordioli

Re: DistributedCache - why not read directly from HDFS?

Posted by Harsh J <ha...@cloudera.com>.
A DistributedCache is not used just to distribute simple files but
also native libraries and such, which in some cases cannot be loaded
if they are on HDFS.

Also, reading the file from HDFS could be less performant, as non-local
reads could happen (depending on the file's replication factor).

On Sat, Mar 23, 2013 at 8:23 PM, Alberto Cordioli
<co...@gmail.com> wrote:
> Hi all,
>
> I was not able to find an answer to the following question. If the
> question has already been answered please give me the pointer to the
> right thread.
>
> What are the actual differences between reading a file from HDFS in a
> mapper and using DistributedCache?
>
> I saw that with DistributedCache you can give an HDFS path and the
> task nodes will get the data on their local file system. But what
> advantages does this have compared with a simple HDFS read through the
> FSDataInputStream returned by FileSystem.open()?
>
> Thank you very much,
> Alberto
>
>
> --
> Alberto Cordioli



-- 
Harsh J
