Posted to user@hadoop.apache.org by Lin Ma <li...@gmail.com> on 2012/12/22 13:03:37 UTC

distributed cache

Hi guys,

I want to confirm that when a mapper or reducer on a task node accesses a
distributed cache file, the file resides on disk, not in memory. I just want
to make sure a distributed cache file is not fully loaded into memory, where
it would compete for memory with the mapper/reducer tasks. Is that correct?

thanks in advance,
Lin

Re: distributed cache

Posted by Lin Ma <li...@gmail.com>.
I have figured out the 2nd issue; I would appreciate it if anyone could
advise on the first issue.

regards,
Lin

On Sat, Dec 22, 2012 at 9:24 PM, Lin Ma <li...@gmail.com> wrote:

> Hi Kai,
>
> Smart answer! :-)
>
>    - Your assumption is that one distributed cache replica can serve only
>    one download session per tasktracker node at a time (this is why you get
>    concurrency n/r). The question is: why can't one distributed cache replica
>    serve multiple concurrent download sessions? For example, if a tasktracker
>    takes elapsed time t to download a file from a specific distributed cache
>    replica, couldn't 2 tasktrackers download from that replica in parallel in
>    elapsed time t as well, or 1.5t, which would be faster than the sequential
>    download time 2t you mentioned before?
>    - "In total, r+n/r concurrent operations. If you optimize r depending
>    on n, SQRT(n) is the optimal replication level." -- how do you derive
>    SQRT(n) as the minimizer of r+n/r? I would appreciate a pointer to more
>    details.
>
> regards,
> Lin
>
>
> On Sat, Dec 22, 2012 at 8:51 PM, Kai Voigt <k...@123.org> wrote:
>
>> Hi,
>>
>> Simple math: assume you have n TaskTrackers in your cluster that will
>> need to access the files in the distributed cache, and that r is the
>> replication level of those files.
>>
>> Copying the files into HDFS requires r copy operations over the network.
>> The n TaskTrackers then need to get their local copies from HDFS, so the n
>> TaskTrackers copy from r DataNodes, i.e. n/r concurrent operations. In
>> total, r+n/r concurrent operations. If you optimize r depending on n,
>> SQRT(n) is the optimal replication level. So 10 is a reasonable default
>> setting for most clusters that are not 500+ nodes.
>>
>> Kai
>>
>> On 22.12.2012 at 13:46, Lin Ma <li...@gmail.com> wrote:
>>
>> Thanks Kai, what is the purpose of using a higher replication count?
>>
>> regards,
>> Lin
>>
>> On Sat, Dec 22, 2012 at 8:44 PM, Kai Voigt <k...@123.org> wrote:
>>
>>> Hi,
>>>
>>> On 22.12.2012 at 13:03, Lin Ma <li...@gmail.com> wrote:
>>>
>>> > I want to confirm that when a mapper or reducer on a task node accesses
>>> > a distributed cache file, the file resides on disk, not in memory. Just
>>> > want to make sure a distributed cache file is not fully loaded into
>>> > memory, where it would compete for memory with the mapper/reducer tasks.
>>> > Is that correct?
>>>
>>>
>>> Yes, you are correct. The JobTracker will put files for the distributed
>>> cache into HDFS with a higher replication count (10 by default). Whenever a
>>> TaskTracker needs those files for a task it is launching locally, it will
>>> fetch a copy to its local disk. So it won't need to do this again for
>>> future tasks on this node. After a job is done, all local copies and the
>>> HDFS copies of files in the distributed cache are cleaned up.
>>>
>>> Kai
>>>
>>> --
>>> Kai Voigt
>>> k@123.org
>>>
>>>
>>>
>>>
>>>
>>
>>  --
>> Kai Voigt
>> k@123.org
>>
>>
>>
>>
>>
>
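
A minimal derivation of the SQRT(n) claim above (an editorial sketch, not
part of the original messages): minimize the total number of operations
f(r) = r + n/r over r > 0.

    f(r)   = r + n/r           (r uploads plus n/r concurrent downloads)
    f'(r)  = 1 - n/r^2 = 0  =>  r^2 = n  =>  r = sqrt(n)
    f''(r) = 2n/r^3 > 0, so r = sqrt(n) is indeed the minimum.

For example, n = 100 TaskTrackers gives r = 10 and 10 + 100/10 = 20 total
operations, versus 3 + 100/3 ~ 36 at the usual HDFS default of r = 3; this
is why 10 is a sensible default up to a few hundred nodes.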

Re: distributed cache

Posted by Lin Ma <li...@gmail.com>.
Thanks Harsh,

(1) "Thankfully, due to block sizes the latter isn't a problem for large
files on a proper DN, as the blocks are spread over the disks and across
the nodes." -- What do you mean DN?

(2) So, you mean concurrent read for small block will not degrade
performance, but concurrent read for large block will degrade performance
compared to single thread read for large block? Please feel free to correct
me if I am wrong. The results are interesting. Appreciate if you could
elaborate a bit more details why.

regards,
Lin

On Wed, Dec 26, 2012 at 8:19 PM, Harsh J <ha...@cloudera.com> wrote:

> Hi,
>
> Sorry for having been ambiguous. For (1) I meant a large block (if the
> block size is large). For (2) I meant multiple, concurrent threads.
>
> On Wed, Dec 26, 2012 at 5:36 PM, Lin Ma <li...@gmail.com> wrote:
> > Thanks Harsh,
> >
> > By a long read, do you mean reading a large contiguous part of a file,
> > rather than a small chunk of a file?
> > "gradually decreasing performance for long reads" -- do you mean parallel
> > multi-threaded long reads degrade performance, or that a single-threaded
> > exclusive long read degrades performance?
> >
> > regards,
> > Lin
> >
> >
> > On Wed, Dec 26, 2012 at 7:48 PM, Harsh J <ha...@cloudera.com> wrote:
> >>
> >> Hi Lin,
> >>
> >> It is comparable (and is also logically similar) to reading a file
> >> multiple times in parallel in a local filesystem - not too much of a
> >> performance hit for small reads (by virtue of OS caches, and quick
> >> completion per read, as is usually the case for distributed cache
> >> files), and gradually decreasing performance for long reads (due to
> >> frequent disk physical movement)? Thankfully, due to block sizes the
> >> latter isn't a problem for large files on a proper DN, as the blocks
> >> are spread over the disks and across the nodes.
> >>
> >> On Wed, Dec 26, 2012 at 4:13 PM, Lin Ma <li...@gmail.com> wrote:
> >> > Thanks Harsh, so multiple concurrent reads are generally faster, or not?
> >> >
> >> > regards,
> >> > Lin
> >> >
> >> >
> >> > On Wed, Dec 26, 2012 at 6:21 PM, Harsh J <ha...@cloudera.com> wrote:
> >> >>
> >> >> There is no limitation in HDFS that limits reads of a block to a
> >> >> single client at a time (no reason to do so) - so downloads can be as
> >> >> concurrent as possible.
> >> >>
> >> >> On Wed, Dec 26, 2012 at 3:41 PM, Lin Ma <li...@gmail.com> wrote:
> >> >> > Thanks Harsh,
> >> >> >
> >> >> > Supposing the DistributedCache is uploaded by the client: in
> >> >> > Hadoop's design, can each replica serve only one download session
> >> >> > (a download from a mapper or reducer which requires the
> >> >> > DistributedCache) at a time until the file download completes, or
> >> >> > can it serve multiple concurrent download sessions (downloads from
> >> >> > multiple mappers or reducers which require the DistributedCache)?
> >> >> >
> >> >> > regards,
> >> >> > Lin
> >> >> >
> >> >> >
> >> >> > On Wed, Dec 26, 2012 at 4:51 PM, Harsh J <ha...@cloudera.com>
> wrote:
> >> >> >>
> >> >> >> Hi Lin,
> >> >> >>
> >> >> >> DistributedCache files are stored onto the HDFS by the client
> >> >> >> first. The TaskTrackers download and localize it. Therefore, as
> >> >> >> with any other file on HDFS, "downloads" can be efficiently
> >> >> >> parallel with higher replicas.
> >> >> >>
> >> >> >> The point of having higher replication for these files is also
> >> >> >> tied to the concept of racks in a cluster - you would want more
> >> >> >> replicas spread across racks such that on task bootup the
> >> >> >> downloads happen with rack locality.
> >> >> >>
> >> >> >> On Sat, Dec 22, 2012 at 6:54 PM, Lin Ma <li...@gmail.com> wrote:
> >> >> >> > Hi Kai,
> >> >> >> >
> >> >> >> > Smart answer! :-)
> >> >> >> >
> >> >> >> > Your assumption is that one distributed cache replica can serve
> >> >> >> > only one download session per tasktracker node at a time (this
> >> >> >> > is why you get concurrency n/r). The question is: why can't one
> >> >> >> > distributed cache replica serve multiple concurrent download
> >> >> >> > sessions? For example, if a tasktracker takes elapsed time t to
> >> >> >> > download a file from a specific distributed cache replica,
> >> >> >> > couldn't 2 tasktrackers download from that replica in parallel
> >> >> >> > in elapsed time t as well, or 1.5t, which would be faster than
> >> >> >> > the sequential download time 2t you mentioned before?
> >> >> >> > "In total, r+n/r concurrent operations. If you optimize r
> >> >> >> > depending on n, SQRT(n) is the optimal replication level." --
> >> >> >> > how do you derive SQRT(n) as the minimizer of r+n/r? I would
> >> >> >> > appreciate a pointer to more details.
> >> >> >> >
> >> >> >> > regards,
> >> >> >> > Lin
> >> >> >> >
> >> >> >> >
> >> >> >> > On Sat, Dec 22, 2012 at 8:51 PM, Kai Voigt <k...@123.org> wrote:
> >> >> >> >>
> >> >> >> >> Hi,
> >> >> >> >>
> >> >> >> >> Simple math: assume you have n TaskTrackers in your cluster
> >> >> >> >> that will need to access the files in the distributed cache,
> >> >> >> >> and that r is the replication level of those files.
> >> >> >> >>
> >> >> >> >> Copying the files into HDFS requires r copy operations over
> >> >> >> >> the network. The n TaskTrackers then need to get their local
> >> >> >> >> copies from HDFS, so the n TaskTrackers copy from r DataNodes,
> >> >> >> >> i.e. n/r concurrent operations. In total, r+n/r concurrent
> >> >> >> >> operations. If you optimize r depending on n, SQRT(n) is the
> >> >> >> >> optimal replication level. So 10 is a reasonable default
> >> >> >> >> setting for most clusters that are not 500+ nodes.
> >> >> >> >>
> >> >> >> >> Kai
> >> >> >> >>
> >> >> >> >> On 22.12.2012 at 13:46, Lin Ma <li...@gmail.com> wrote:
> >> >> >> >>
> >> >> >> >> Thanks Kai, what is the purpose of using a higher replication count?
> >> >> >> >>
> >> >> >> >> regards,
> >> >> >> >> Lin
> >> >> >> >>
> >> >> >> >> On Sat, Dec 22, 2012 at 8:44 PM, Kai Voigt <k...@123.org> wrote:
> >> >> >> >>>
> >> >> >> >>> Hi,
> >> >> >> >>>
> >> >> >> >>> On 22.12.2012 at 13:03, Lin Ma <li...@gmail.com> wrote:
> >> >> >> >>>
> >> >> >> >>> > I want to confirm that when a mapper or reducer on a task
> >> >> >> >>> > node accesses a distributed cache file, the file resides on
> >> >> >> >>> > disk, not in memory. Just want to make sure a distributed
> >> >> >> >>> > cache file is not fully loaded into memory, where it would
> >> >> >> >>> > compete for memory with the mapper/reducer tasks. Is that
> >> >> >> >>> > correct?
> >> >> >> >>>
> >> >> >> >>>
> >> >> >> >>> Yes, you are correct. The JobTracker will put files for the
> >> >> >> >>> distributed cache into HDFS with a higher replication count
> >> >> >> >>> (10 by default). Whenever a TaskTracker needs those files for
> >> >> >> >>> a task it is launching locally, it will fetch a copy to its
> >> >> >> >>> local disk, so it won't need to do this again for future
> >> >> >> >>> tasks on this node. After a job is done, all local copies and
> >> >> >> >>> the HDFS copies of files in the distributed cache are cleaned
> >> >> >> >>> up.
> >> >> >> >>>
> >> >> >> >>> Kai
> >> >> >> >>>
> >> >> >> >>> --
> >> >> >> >>> Kai Voigt
> >> >> >> >>> k@123.org
> >> >> >> >>>
> >> >> >> >>>
> >> >> >> >>>
> >> >> >> >>>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> --
> >> >> >> >> Kai Voigt
> >> >> >> >> k@123.org
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> --
> >> >> >> Harsh J
> >> >> >
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Harsh J
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Harsh J
> >
> >
>
>
>
> --
> Harsh J
>
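
To make the answers above concrete, here is a minimal sketch (an editorial
addition, not from the original thread) of the classic pre-YARN
DistributedCache flow, using the old org.apache.hadoop.filecache API. The
path /user/lin/lookup.txt and the class names are illustrative:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CacheExample {

        // Driver side: point the cache at a file in HDFS. Files the job
        // client itself stages (job.jar, -files, -archives) are written
        // with a higher replication count (mapred.submit.replication,
        // 10 by default).
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "cache-example");
            DistributedCache.addCacheFile(new URI("/user/lin/lookup.txt"),
                    job.getConfiguration());
            // ... set mapper, input/output paths, then
            // job.waitForCompletion(true);
        }

        public static class CacheMapper
                extends Mapper<Object, Object, Object, Object> {
            @Override
            protected void setup(Context context) throws IOException {
                // The TaskTracker has already localized the file to its
                // local disk; the task reads it like any local file.
                // Nothing is pinned in memory unless the task itself
                // chooses to load it.
                Path[] local = DistributedCache.getLocalCacheFiles(
                        context.getConfiguration());
                BufferedReader reader = new BufferedReader(
                        new FileReader(local[0].toString()));
                // ... read as much or as little as the task needs
                reader.close();
            }
        }
    }

In the newer MapReduce API the same flow is exposed (as Job.addCacheFile and
context.getCacheFiles), with the same on-disk localization behavior.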

Re: distributed cache

Posted by Harsh J <ha...@cloudera.com>.
Hi,

Sorry for having been ambiguous. For (1) I meant a large block (if the
block size is large). For (2) I meant multiple, concurrent threads.

On Wed, Dec 26, 2012 at 5:36 PM, Lin Ma <li...@gmail.com> wrote:
> Thanks Harsh,
>
> By a long read, do you mean reading a large contiguous part of a file,
> rather than a small chunk of a file?
> "gradually decreasing performance for long reads" -- do you mean parallel
> multi-threaded long reads degrade performance, or that a single-threaded
> exclusive long read degrades performance?
>
> regards,
> Lin
>
>
> On Wed, Dec 26, 2012 at 7:48 PM, Harsh J <ha...@cloudera.com> wrote:
>>
>> Hi Lin,
>>
>> It is comparable (and is also logically similar) to reading a file
>> multiple times in parallel in a local filesystem - not too much of a
>> performance hit for small reads (by virtue of OS caches, and quick
>> completion per read, as is usually the case for distributed cache
>> files), and gradually decreasing performance for long reads (due to
>> frequent disk physical movement)? Thankfully, due to block sizes the
>> latter isn't a problem for large files on a proper DN, as the blocks
>> are spread over the disks and across the nodes.
>>
>> On Wed, Dec 26, 2012 at 4:13 PM, Lin Ma <li...@gmail.com> wrote:
>> > Thanks Harsh, so multiple concurrent reads are generally faster, or not?
>> >
>> > regards,
>> > Lin
>> >
>> >
>> > On Wed, Dec 26, 2012 at 6:21 PM, Harsh J <ha...@cloudera.com> wrote:
>> >>
>> >> There is no limitation in HDFS that limits reads of a block to a
>> >> single client at a time (no reason to do so) - so downloads can be as
>> >> concurrent as possible.
>> >>
>> >> On Wed, Dec 26, 2012 at 3:41 PM, Lin Ma <li...@gmail.com> wrote:
>> >> > Thanks Harsh,
>> >> >
>> >> > Supposing the DistributedCache is uploaded by the client: in Hadoop's
>> >> > design, can each replica serve only one download session (a download
>> >> > from a mapper or reducer which requires the DistributedCache) at a
>> >> > time until the file download completes, or can it serve multiple
>> >> > concurrent download sessions (downloads from multiple mappers or
>> >> > reducers which require the DistributedCache)?
>> >> >
>> >> > regards,
>> >> > Lin
>> >> >
>> >> >
>> >> > On Wed, Dec 26, 2012 at 4:51 PM, Harsh J <ha...@cloudera.com> wrote:
>> >> >>
>> >> >> Hi Lin,
>> >> >>
>> >> >> DistributedCache files are stored onto the HDFS by the client first.
>> >> >> The TaskTrackers download and localize it. Therefore, as with any
>> >> >> other file on HDFS, "downloads" can be efficiently parallel with
>> >> >> higher replicas.
>> >> >>
>> >> >> The point of having higher replication for these files is also tied
>> >> >> to the concept of racks in a cluster - you would want more replicas
>> >> >> spread across racks such that on task bootup the downloads happen
>> >> >> with rack locality.
>> >> >>
>> >> >> On Sat, Dec 22, 2012 at 6:54 PM, Lin Ma <li...@gmail.com> wrote:
>> >> >> > Hi Kai,
>> >> >> >
>> >> >> > Smart answer! :-)
>> >> >> >
>> >> >> > Your assumption is that one distributed cache replica can serve
>> >> >> > only one download session per tasktracker node at a time (this is
>> >> >> > why you get concurrency n/r). The question is: why can't one
>> >> >> > distributed cache replica serve multiple concurrent download
>> >> >> > sessions? For example, if a tasktracker takes elapsed time t to
>> >> >> > download a file from a specific distributed cache replica,
>> >> >> > couldn't 2 tasktrackers download from that replica in parallel in
>> >> >> > elapsed time t as well, or 1.5t, which would be faster than the
>> >> >> > sequential download time 2t you mentioned before?
>> >> >> > "In total, r+n/r concurrent operations. If you optimize r
>> >> >> > depending on n, SQRT(n) is the optimal replication level." -- how
>> >> >> > do you derive SQRT(n) as the minimizer of r+n/r? I would
>> >> >> > appreciate a pointer to more details.
>> >> >> >
>> >> >> > regards,
>> >> >> > Lin
>> >> >> >
>> >> >> >
>> >> >> > On Sat, Dec 22, 2012 at 8:51 PM, Kai Voigt <k...@123.org> wrote:
>> >> >> >>
>> >> >> >> Hi,
>> >> >> >>
>> >> >> >> Simple math: assume you have n TaskTrackers in your cluster
>> >> >> >> that will need to access the files in the distributed cache,
>> >> >> >> and that r is the replication level of those files.
>> >> >> >>
>> >> >> >> Copying the files into HDFS requires r copy operations over the
>> >> >> >> network. The n TaskTrackers then need to get their local copies
>> >> >> >> from HDFS, so the n TaskTrackers copy from r DataNodes, i.e. n/r
>> >> >> >> concurrent operations. In total, r+n/r concurrent operations. If
>> >> >> >> you optimize r depending on n, SQRT(n) is the optimal
>> >> >> >> replication level. So 10 is a reasonable default setting for
>> >> >> >> most clusters that are not 500+ nodes.
>> >> >> >>
>> >> >> >> Kai
>> >> >> >>
>> >> >> >> On 22.12.2012 at 13:46, Lin Ma <li...@gmail.com> wrote:
>> >> >> >>
>> >> >> >> Thanks Kai, what is the purpose of using a higher replication count?
>> >> >> >>
>> >> >> >> regards,
>> >> >> >> Lin
>> >> >> >>
>> >> >> >> On Sat, Dec 22, 2012 at 8:44 PM, Kai Voigt <k...@123.org> wrote:
>> >> >> >>>
>> >> >> >>> Hi,
>> >> >> >>>
>> >> >> >>> On 22.12.2012 at 13:03, Lin Ma <li...@gmail.com> wrote:
>> >> >> >>>
>> >> >> >>> > I want to confirm that when a mapper or reducer on a task
>> >> >> >>> > node accesses a distributed cache file, the file resides on
>> >> >> >>> > disk, not in memory. Just want to make sure a distributed
>> >> >> >>> > cache file is not fully loaded into memory, where it would
>> >> >> >>> > compete for memory with the mapper/reducer tasks. Is that
>> >> >> >>> > correct?
>> >> >> >>>
>> >> >> >>>
>> >> >> >>> Yes, you are correct. The JobTracker will put files for the
>> >> >> >>> distributed cache into HDFS with a higher replication count
>> >> >> >>> (10 by default). Whenever a TaskTracker needs those files for
>> >> >> >>> a task it is launching locally, it will fetch a copy to its
>> >> >> >>> local disk, so it won't need to do this again for future tasks
>> >> >> >>> on this node. After a job is done, all local copies and the
>> >> >> >>> HDFS copies of files in the distributed cache are cleaned up.
>> >> >> >>>
>> >> >> >>> Kai
>> >> >> >>>
>> >> >> >>> --
>> >> >> >>> Kai Voigt
>> >> >> >>> k@123.org
>> >> >> >>>
>> >> >> >>>
>> >> >> >>>
>> >> >> >>>
>> >> >> >>
>> >> >> >>
>> >> >> >> --
>> >> >> >> Kai Voigt
>> >> >> >> k@123.org
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Harsh J
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Harsh J
>> >
>> >
>>
>>
>>
>> --
>> Harsh J
>
>



-- 
Harsh J
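
One more editorial sketch, illustrating Harsh's point that HDFS does not
serialize readers of a block: several threads can open and read the same
file in parallel. The path is illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ConcurrentReaders {
        public static void main(String[] args) throws Exception {
            final Configuration conf = new Configuration();
            final Path cached = new Path("/user/lin/lookup.txt");
            Thread[] readers = new Thread[4];
            for (int i = 0; i < readers.length; i++) {
                readers[i] = new Thread(new Runnable() {
                    public void run() {
                        try {
                            // Each thread opens its own stream; HDFS lets
                            // them read the same blocks in parallel,
                            // possibly from different replicas.
                            FileSystem fs = FileSystem.get(conf);
                            FSDataInputStream in = fs.open(cached);
                            byte[] buf = new byte[64 * 1024];
                            while (in.read(buf) != -1) {
                                // consume the data
                            }
                            in.close();
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                });
                readers[i].start();
            }
            for (Thread t : readers) {
                t.join();
            }
        }
    }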

Re: distributed cache

Posted by Harsh J <ha...@cloudera.com>.
Hi,

Sorry for having been ambiguous. For (1) I meant a large block (if the
block size is large). For (2) I meant multiple, concurrent threads.

On Wed, Dec 26, 2012 at 5:36 PM, Lin Ma <li...@gmail.com> wrote:
> Thanks Harsh,
>
> For long read, you mean read a large continuous part of a file, other than a
> small chunk of a file?
> "gradually decreasing performance for long reads" -- you mean parallel
> multiple threads long read degrade performance? Or single thread exclusive
> long read degrade performance?
>
> regards,
> Lin
>
>
> On Wed, Dec 26, 2012 at 7:48 PM, Harsh J <ha...@cloudera.com> wrote:
>>
>> Hi Lin,
>>
>> It is comparable (and is also logically similar) to reading a file
>> multiple times in parallel in a local filesystem - not too much of a
>> performance hit for small reads (by virtue of OS caches, and quick
>> completion per read, as is usually the case for distributed cache
>> files), and gradually decreasing performance for long reads (due to
>> frequent disk physical movement)? Thankfully, due to block sizes the
>> latter isn't a problem for large files on a proper DN, as the blocks
>> are spread over the disks and across the nodes.
>>
>> On Wed, Dec 26, 2012 at 4:13 PM, Lin Ma <li...@gmail.com> wrote:
>> > Thanks Harsh, multiple concurrent read is generally faster or?
>> >
>> > regards,
>> > Lin
>> >
>> >
>> > On Wed, Dec 26, 2012 at 6:21 PM, Harsh J <ha...@cloudera.com> wrote:
>> >>
>> >> There is no limitation in HDFS that limits reads of a block to a
>> >> single client at a time (no reason to do so) - so downloads can be as
>> >> concurrent as possible.
>> >>
>> >> On Wed, Dec 26, 2012 at 3:41 PM, Lin Ma <li...@gmail.com> wrote:
>> >> > Thanks Harsh,
>> >> >
>> >> > Supposing DistributedCache is uploaded by client, for each replica,
>> >> > in
>> >> > Hadoop design, it could only serve one download session (download
>> >> > from a
>> >> > mapper or a reducer which requires the DistributedCache) at a time
>> >> > until
>> >> > DistributedCache file download is completed, or it could serve
>> >> > multiple
>> >> > concurrent parallel download session (download from multiple mappers
>> >> > or
>> >> > reducers which requires the DistributedCache).
>> >> >
>> >> > regards,
>> >> > Lin
>> >> >
>> >> >
>> >> > On Wed, Dec 26, 2012 at 4:51 PM, Harsh J <ha...@cloudera.com> wrote:
>> >> >>
>> >> >> Hi Lin,
>> >> >>
>> >> >> DistributedCache files are stored onto the HDFS by the client first.
>> >> >> The TaskTrackers download and localize it. Therefore, as with any
>> >> >> other file on HDFS, "downloads" can be efficiently parallel with
>> >> >> higher replicas.
>> >> >>
>> >> >> The point of having higher replication for these files is also tied
>> >> >> to
>> >> >> the concept of racks in a cluster - you would want more replicas
>> >> >> spread across racks such that on task bootup the downloads happen
>> >> >> with
>> >> >> rack locality.
>> >> >>
>> >> >> On Sat, Dec 22, 2012 at 6:54 PM, Lin Ma <li...@gmail.com> wrote:
>> >> >> > Hi Kai,
>> >> >> >
>> >> >> > Smart answer! :-)
>> >> >> >
>> >> >> > The assumption you have is one distributed cache replica could
>> >> >> > only
>> >> >> > serve
>> >> >> > one download session for tasktracker node (this is why you get
>> >> >> > concurrency
>> >> >> > n/r). The question is, why one distributed cache replica cannot
>> >> >> > serve
>> >> >> > multiple concurrent download session? For example, supposing a
>> >> >> > tasktracker
>> >> >> > use elapsed time t to download a file from a specific distributed
>> >> >> > cache
>> >> >> > replica, it is possible for 2 tasktrackers to download from the
>> >> >> > specific
>> >> >> > distributed cache replica in parallel using elapsed time t as
>> >> >> > well,
>> >> >> > or
>> >> >> > 1.5
>> >> >> > t, which is faster than sequential download time 2t you mentioned
>> >> >> > before?
>> >> >> > "In total, r+n/r concurrent operations. If you optimize r
>> >> >> > depending
>> >> >> > on
>> >> >> > n,
>> >> >> > SRQT(n) is the optimal replication level." -- how do you get
>> >> >> > SRQT(n)
>> >> >> > for
>> >> >> > minimize r+n/r? Appreciate if you could point me to more details.
>> >> >> >
>> >> >> > regards,
>> >> >> > Lin
>> >> >> >
>> >> >> >
>> >> >> > On Sat, Dec 22, 2012 at 8:51 PM, Kai Voigt <k...@123.org> wrote:
>> >> >> >>
>> >> >> >> Hi,
>> >> >> >>
>> >> >> >> simple math. Assuming you have n TaskTrackers in your cluster
>> >> >> >> that
>> >> >> >> will
>> >> >> >> need to access the files in the distributed cache. And r is the
>> >> >> >> replication
>> >> >> >> level of those files.
>> >> >> >>
>> >> >> >> Copying the files into HDFS requires r copy operations over the
>> >> >> >> network.
>> >> >> >> The n TaskTrackers need to get their local copies from HDFS, so
>> >> >> >> the
>> >> >> >> n
>> >> >> >> TaskTrackers copy from r DataNodes, so n/r concurrent operation.
>> >> >> >> In
>> >> >> >> total,
>> >> >> >> r+n/r concurrent operations. If you optimize r depending on n,
>> >> >> >> SRQT(n)
>> >> >> >> is
>> >> >> >> the optimal replication level. So 10 is a reasonable default
>> >> >> >> setting
>> >> >> >> for
>> >> >> >> most clusters that are not 500+ nodes big.
>> >> >> >>
>> >> >> >> Kai
>> >> >> >>
>> >> >> >> Am 22.12.2012 um 13:46 schrieb Lin Ma <li...@gmail.com>:
>> >> >> >>
>> >> >> >> Thanks Kai, using higher replication count for the purpose of?
>> >> >> >>
>> >> >> >> regards,
>> >> >> >> Lin
>> >> >> >>
>> >> >> >> On Sat, Dec 22, 2012 at 8:44 PM, Kai Voigt <k...@123.org> wrote:
>> >> >> >>>
>> >> >> >>> Hi,
>> >> >> >>>
>> >> >> >>> Am 22.12.2012 um 13:03 schrieb Lin Ma <li...@gmail.com>:
>> >> >> >>>
>> >> >> >>> > I want to confirm when on each task node either mapper or
>> >> >> >>> > reducer
>> >> >> >>> > access distributed cache file, it resides on disk, not resides
>> >> >> >>> > in
>> >> >> >>> > memory.
>> >> >> >>> > Just want to make sure distributed cache file does not fully
>> >> >> >>> > loaded
>> >> >> >>> > into
>> >> >> >>> > memory which compete memory consumption with mapper/reducer
>> >> >> >>> > tasks.
>> >> >> >>> > Is that
>> >> >> >>> > correct?
>> >> >> >>>
>> >> >> >>>
>> >> >> >>> Yes, you are correct. The JobTracker will put files for the
>> >> >> >>> distributed
>> >> >> >>> cache into HDFS with a higher replication count (10 by default).
>> >> >> >>> Whenever a
>> >> >> >>> TaskTracker needs those files for a task it is launching
>> >> >> >>> locally,
>> >> >> >>> it
>> >> >> >>> will
>> >> >> >>> fetch a copy to its local disk. So it won't need to do this
>> >> >> >>> again
>> >> >> >>> for
>> >> >> >>> future
>> >> >> >>> tasks on this node. After a job is done, all local copies and
>> >> >> >>> the
>> >> >> >>> HDFS
>> >> >> >>> copies of files in the distributed cache are cleaned up.
>> >> >> >>>
>> >> >> >>> Kai
>> >> >> >>>
>> >> >> >>> --
>> >> >> >>> Kai Voigt
>> >> >> >>> k@123.org
>> >> >> >>>
>> >> >> >>>
>> >> >> >>>
>> >> >> >>>
>> >> >> >>
>> >> >> >>
>> >> >> >> --
>> >> >> >> Kai Voigt
>> >> >> >> k@123.org
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Harsh J
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Harsh J
>> >
>> >
>>
>>
>>
>> --
>> Harsh J
>
>



-- 
Harsh J

Re: distributed cache

Posted by Harsh J <ha...@cloudera.com>.
Hi,

Sorry for having been ambiguous. For (1) I meant a large block (if the
block size is large). For (2) I meant multiple, concurrent threads.

On Wed, Dec 26, 2012 at 5:36 PM, Lin Ma <li...@gmail.com> wrote:
> Thanks Harsh,
>
> For long read, you mean read a large continuous part of a file, other than a
> small chunk of a file?
> "gradually decreasing performance for long reads" -- you mean parallel
> multiple threads long read degrade performance? Or single thread exclusive
> long read degrade performance?
>
> regards,
> Lin
>
>
> On Wed, Dec 26, 2012 at 7:48 PM, Harsh J <ha...@cloudera.com> wrote:
>>
>> Hi Lin,
>>
>> It is comparable (and is also logically similar) to reading a file
>> multiple times in parallel in a local filesystem - not too much of a
>> performance hit for small reads (by virtue of OS caches, and quick
>> completion per read, as is usually the case for distributed cache
>> files), and gradually decreasing performance for long reads (due to
>> frequent disk physical movement)? Thankfully, due to block sizes the
>> latter isn't a problem for large files on a proper DN, as the blocks
>> are spread over the disks and across the nodes.
>>
>> On Wed, Dec 26, 2012 at 4:13 PM, Lin Ma <li...@gmail.com> wrote:
>> > Thanks Harsh, multiple concurrent read is generally faster or?
>> >
>> > regards,
>> > Lin
>> >
>> >
>> > On Wed, Dec 26, 2012 at 6:21 PM, Harsh J <ha...@cloudera.com> wrote:
>> >>
>> >> There is no limitation in HDFS that limits reads of a block to a
>> >> single client at a time (no reason to do so) - so downloads can be as
>> >> concurrent as possible.
>> >>
>> >> On Wed, Dec 26, 2012 at 3:41 PM, Lin Ma <li...@gmail.com> wrote:
>> >> > Thanks Harsh,
>> >> >
>> >> > Supposing DistributedCache is uploaded by client, for each replica,
>> >> > in
>> >> > Hadoop design, it could only serve one download session (download
>> >> > from a
>> >> > mapper or a reducer which requires the DistributedCache) at a time
>> >> > until
>> >> > DistributedCache file download is completed, or it could serve
>> >> > multiple
>> >> > concurrent parallel download session (download from multiple mappers
>> >> > or
>> >> > reducers which requires the DistributedCache).
>> >> >
>> >> > regards,
>> >> > Lin
>> >> >
>> >> >
>> >> > On Wed, Dec 26, 2012 at 4:51 PM, Harsh J <ha...@cloudera.com> wrote:
>> >> >>
>> >> >> Hi Lin,
>> >> >>
>> >> >> DistributedCache files are stored onto the HDFS by the client first.
>> >> >> The TaskTrackers download and localize it. Therefore, as with any
>> >> >> other file on HDFS, "downloads" can be efficiently parallel with
>> >> >> higher replicas.
>> >> >>
>> >> >> The point of having higher replication for these files is also tied
>> >> >> to
>> >> >> the concept of racks in a cluster - you would want more replicas
>> >> >> spread across racks such that on task bootup the downloads happen
>> >> >> with
>> >> >> rack locality.
>> >> >>
>> >> >> On Sat, Dec 22, 2012 at 6:54 PM, Lin Ma <li...@gmail.com> wrote:
>> >> >> > Hi Kai,
>> >> >> >
>> >> >> > Smart answer! :-)
>> >> >> >
>> >> >> > The assumption you have is one distributed cache replica could
>> >> >> > only
>> >> >> > serve
>> >> >> > one download session for tasktracker node (this is why you get
>> >> >> > concurrency
>> >> >> > n/r). The question is, why one distributed cache replica cannot
>> >> >> > serve
>> >> >> > multiple concurrent download session? For example, supposing a
>> >> >> > tasktracker
>> >> >> > use elapsed time t to download a file from a specific distributed
>> >> >> > cache
>> >> >> > replica, it is possible for 2 tasktrackers to download from the
>> >> >> > specific
>> >> >> > distributed cache replica in parallel using elapsed time t as
>> >> >> > well,
>> >> >> > or
>> >> >> > 1.5
>> >> >> > t, which is faster than sequential download time 2t you mentioned
>> >> >> > before?
>> >> >> > "In total, r+n/r concurrent operations. If you optimize r
>> >> >> > depending
>> >> >> > on
>> >> >> > n,
>> >> >> > SRQT(n) is the optimal replication level." -- how do you get
>> >> >> > SRQT(n)
>> >> >> > for
>> >> >> > minimize r+n/r? Appreciate if you could point me to more details.
>> >> >> >
>> >> >> > regards,
>> >> >> > Lin
>> >> >> >
>> >> >> >
>> >> >> > On Sat, Dec 22, 2012 at 8:51 PM, Kai Voigt <k...@123.org> wrote:
>> >> >> >>
>> >> >> >> Hi,
>> >> >> >>
>> >> >> >> simple math. Assuming you have n TaskTrackers in your cluster
>> >> >> >> that
>> >> >> >> will
>> >> >> >> need to access the files in the distributed cache. And r is the
>> >> >> >> replication
>> >> >> >> level of those files.
>> >> >> >>
>> >> >> >> Copying the files into HDFS requires r copy operations over the
>> >> >> >> network.
>> >> >> >> The n TaskTrackers need to get their local copies from HDFS, so
>> >> >> >> the
>> >> >> >> n
>> >> >> >> TaskTrackers copy from r DataNodes, so n/r concurrent operation.
>> >> >> >> In
>> >> >> >> total,
>> >> >> >> r+n/r concurrent operations. If you optimize r depending on n,
>> >> >> >> SQRT(n)
>> >> >> >> is
>> >> >> >> the optimal replication level. So 10 is a reasonable default
>> >> >> >> setting
>> >> >> >> for
>> >> >> >> most clusters that are not 500+ nodes big.
>> >> >> >>
>> >> >> >> Kai
>> >> >> >>
>> >> >> >> Am 22.12.2012 um 13:46 schrieb Lin Ma <li...@gmail.com>:
>> >> >> >>
>> >> >> >> Thanks Kai, using higher replication count for the purpose of?
>> >> >> >>
>> >> >> >> regards,
>> >> >> >> Lin
>> >> >> >>
>> >> >> >> On Sat, Dec 22, 2012 at 8:44 PM, Kai Voigt <k...@123.org> wrote:
>> >> >> >>>
>> >> >> >>> Hi,
>> >> >> >>>
>> >> >> >>> Am 22.12.2012 um 13:03 schrieb Lin Ma <li...@gmail.com>:
>> >> >> >>>
>> >> >> >>> > I want to confirm when on each task node either mapper or
>> >> >> >>> > reducer
>> >> >> >>> > access distributed cache file, it resides on disk, not resides
>> >> >> >>> > in
>> >> >> >>> > memory.
>> >> >> >>> > Just want to make sure distributed cache file does not fully
>> >> >> >>> > loaded
>> >> >> >>> > into
>> >> >> >>> > memory which compete memory consumption with mapper/reducer
>> >> >> >>> > tasks.
>> >> >> >>> > Is that
>> >> >> >>> > correct?
>> >> >> >>>
>> >> >> >>>
>> >> >> >>> Yes, you are correct. The JobTracker will put files for the
>> >> >> >>> distributed
>> >> >> >>> cache into HDFS with a higher replication count (10 by default).
>> >> >> >>> Whenever a
>> >> >> >>> TaskTracker needs those files for a task it is launching
>> >> >> >>> locally,
>> >> >> >>> it
>> >> >> >>> will
>> >> >> >>> fetch a copy to its local disk. So it won't need to do this
>> >> >> >>> again
>> >> >> >>> for
>> >> >> >>> future
>> >> >> >>> tasks on this node. After a job is done, all local copies and
>> >> >> >>> the
>> >> >> >>> HDFS
>> >> >> >>> copies of files in the distributed cache are cleaned up.
>> >> >> >>>
>> >> >> >>> Kai
>> >> >> >>>
>> >> >> >>> --
>> >> >> >>> Kai Voigt
>> >> >> >>> k@123.org
>> >> >> >>>
>> >> >> >>>
>> >> >> >>>
>> >> >> >>>
>> >> >> >>
>> >> >> >>
>> >> >> >> --
>> >> >> >> Kai Voigt
>> >> >> >> k@123.org
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Harsh J
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Harsh J
>> >
>> >
>>
>>
>>
>> --
>> Harsh J
>
>



-- 
Harsh J
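
A note on the SQRT(n) claim Lin keeps asking about in the quotes above:
minimizing f(r) = r + n/r gives f'(r) = 1 - n/r^2 = 0, hence r^2 = n and
r = sqrt(n); at that point the r upload operations and the n/r downloads per
replica are balanced. A minimal standalone Java sketch that checks this
numerically (the value of n is an assumed example, and this is not Hadoop
code):

// Verify that r + n/r is minimized near r = sqrt(n).
// Calculus: f'(r) = 1 - n/r^2 = 0  =>  r = sqrt(n).
public class OptimalReplication {
    public static void main(String[] args) {
        int n = 100; // assumed number of TaskTrackers
        int bestR = 1;
        double bestCost = Double.MAX_VALUE;
        for (int r = 1; r <= n; r++) {
            // r network copies to create the replicas, then n/r concurrent
            // download sessions hitting each replica on average
            double cost = r + (double) n / r;
            if (cost < bestCost) {
                bestCost = cost;
                bestR = r;
            }
        }
        System.out.printf("n=%d: best r=%d, sqrt(n)=%.1f, cost=%.1f%n",
                n, bestR, Math.sqrt(n), bestCost);
    }
}

For n = 100 this prints best r = 10, which matches the default replication
level of 10 that Kai mentions for typical cluster sizes.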

Re: distributed cache

Posted by Harsh J <ha...@cloudera.com>.
Hi,

Sorry for having been ambiguous. For (1) I meant a large block (if the
block size is large). For (2) I meant multiple, concurrent threads.
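
What counts as a "large block" is fixed when a file is written: in Hadoop
1.x the per-file block size comes from the dfs.block.size property, 64 MB by
default. A minimal sketch, assuming you wanted 128 MB blocks for newly
written files:

import org.apache.hadoop.conf.Configuration;

public class BlockSizeExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Hadoop 1.x property name; files created through this conf get
        // 128 MB blocks. The value is an assumed example.
        conf.setLong("dfs.block.size", 128L * 1024 * 1024);
        System.out.println("block size: " + conf.getLong("dfs.block.size", 0));
    }
}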

On Wed, Dec 26, 2012 at 5:36 PM, Lin Ma <li...@gmail.com> wrote:
> Thanks Harsh,
>
> For a long read, do you mean reading a large contiguous part of a file,
> rather than a small chunk of it?
> "gradually decreasing performance for long reads" -- do you mean that
> multiple parallel long reads degrade performance, or that even a single
> thread doing an exclusive long read degrades performance?
>
> regards,
> Lin
>
>
> On Wed, Dec 26, 2012 at 7:48 PM, Harsh J <ha...@cloudera.com> wrote:
>>
>> Hi Lin,
>>
>> It is comparable (and is also logically similar) to reading a file
>> multiple times in parallel in a local filesystem - not too much of a
>> performance hit for small reads (by virtue of OS caches, and quick
>> completion per read, as is usually the case for distributed cache
>> files), and gradually decreasing performance for long reads (due to
>> frequent disk physical movement)? Thankfully, due to block sizes the
>> latter isn't a problem for large files on a proper DN, as the blocks
>> are spread over the disks and across the nodes.
>>
>> On Wed, Dec 26, 2012 at 4:13 PM, Lin Ma <li...@gmail.com> wrote:
>> > Thanks Harsh, so multiple concurrent reads are generally faster, right?
>> >
>> > regards,
>> > Lin
>> >
>> >
>> > On Wed, Dec 26, 2012 at 6:21 PM, Harsh J <ha...@cloudera.com> wrote:
>> >>
>> >> There is no limitation in HDFS that limits reads of a block to a
>> >> single client at a time (no reason to do so) - so downloads can be as
>> >> concurrent as possible.
>> >>
>> >> On Wed, Dec 26, 2012 at 3:41 PM, Lin Ma <li...@gmail.com> wrote:
>> >> > Thanks Harsh,
>> >> >
>> >> > Supposing the DistributedCache file is uploaded by the client: in
>> >> > Hadoop's design, can each replica serve only one download session at a
>> >> > time (a download from a mapper or reducer that requires the
>> >> > DistributedCache) until the file download completes, or can it serve
>> >> > multiple concurrent download sessions (downloads from multiple mappers
>> >> > or reducers that require the DistributedCache)?
>> >> >
>> >> > regards,
>> >> > Lin
>> >> >
>> >> >
>> >> > On Wed, Dec 26, 2012 at 4:51 PM, Harsh J <ha...@cloudera.com> wrote:
>> >> >>
>> >> >> Hi Lin,
>> >> >>
>> >> >> DistributedCache files are stored onto the HDFS by the client first.
>> >> >> The TaskTrackers download and localize it. Therefore, as with any
>> >> >> other file on HDFS, "downloads" can be efficiently parallel with
>> >> >> higher replicas.
>> >> >>
>> >> >> The point of having higher replication for these files is also tied
>> >> >> to
>> >> >> the concept of racks in a cluster - you would want more replicas
>> >> >> spread across racks such that on task bootup the downloads happen
>> >> >> with
>> >> >> rack locality.
>> >> >>
>> >> >> On Sat, Dec 22, 2012 at 6:54 PM, Lin Ma <li...@gmail.com> wrote:
>> >> >> > Hi Kai,
>> >> >> >
>> >> >> > Smart answer! :-)
>> >> >> >
>> >> >> > The assumption you have is one distributed cache replica could
>> >> >> > only
>> >> >> > serve
>> >> >> > one download session for tasktracker node (this is why you get
>> >> >> > concurrency
>> >> >> > n/r). The question is, why one distributed cache replica cannot
>> >> >> > serve
>> >> >> > multiple concurrent download session? For example, supposing a
>> >> >> > tasktracker
>> >> >> > use elapsed time t to download a file from a specific distributed
>> >> >> > cache
>> >> >> > replica, it is possible for 2 tasktrackers to download from the
>> >> >> > specific
>> >> >> > distributed cache replica in parallel using elapsed time t as
>> >> >> > well,
>> >> >> > or
>> >> >> > 1.5
>> >> >> > t, which is faster than sequential download time 2t you mentioned
>> >> >> > before?
>> >> >> > "In total, r+n/r concurrent operations. If you optimize r
>> >> >> > depending
>> >> >> > on
>> >> >> > n,
>> >> >> > SQRT(n) is the optimal replication level." -- how do you get
>> >> >> > SQRT(n)
>> >> >> > for
>> >> >> > minimize r+n/r? Appreciate if you could point me to more details.
>> >> >> >
>> >> >> > regards,
>> >> >> > Lin
>> >> >> >
>> >> >> >
>> >> >> > On Sat, Dec 22, 2012 at 8:51 PM, Kai Voigt <k...@123.org> wrote:
>> >> >> >>
>> >> >> >> Hi,
>> >> >> >>
>> >> >> >> simple math. Assuming you have n TaskTrackers in your cluster
>> >> >> >> that
>> >> >> >> will
>> >> >> >> need to access the files in the distributed cache. And r is the
>> >> >> >> replication
>> >> >> >> level of those files.
>> >> >> >>
>> >> >> >> Copying the files into HDFS requires r copy operations over the
>> >> >> >> network.
>> >> >> >> The n TaskTrackers need to get their local copies from HDFS, so
>> >> >> >> the
>> >> >> >> n
>> >> >> >> TaskTrackers copy from r DataNodes, so n/r concurrent operation.
>> >> >> >> In
>> >> >> >> total,
>> >> >> >> r+n/r concurrent operations. If you optimize r depending on n,
>> >> >> >> SQRT(n)
>> >> >> >> is
>> >> >> >> the optimal replication level. So 10 is a reasonable default
>> >> >> >> setting
>> >> >> >> for
>> >> >> >> most clusters that are not 500+ nodes big.
>> >> >> >>
>> >> >> >> Kai
>> >> >> >>
>> >> >> >> Am 22.12.2012 um 13:46 schrieb Lin Ma <li...@gmail.com>:
>> >> >> >>
>> >> >> >> Thanks Kai, using higher replication count for the purpose of?
>> >> >> >>
>> >> >> >> regards,
>> >> >> >> Lin
>> >> >> >>
>> >> >> >> On Sat, Dec 22, 2012 at 8:44 PM, Kai Voigt <k...@123.org> wrote:
>> >> >> >>>
>> >> >> >>> Hi,
>> >> >> >>>
>> >> >> >>> Am 22.12.2012 um 13:03 schrieb Lin Ma <li...@gmail.com>:
>> >> >> >>>
>> >> >> >>> > I want to confirm when on each task node either mapper or
>> >> >> >>> > reducer
>> >> >> >>> > access distributed cache file, it resides on disk, not resides
>> >> >> >>> > in
>> >> >> >>> > memory.
>> >> >> >>> > Just want to make sure distributed cache file does not fully
>> >> >> >>> > loaded
>> >> >> >>> > into
>> >> >> >>> > memory which compete memory consumption with mapper/reducer
>> >> >> >>> > tasks.
>> >> >> >>> > Is that
>> >> >> >>> > correct?
>> >> >> >>>
>> >> >> >>>
>> >> >> >>> Yes, you are correct. The JobTracker will put files for the
>> >> >> >>> distributed
>> >> >> >>> cache into HDFS with a higher replication count (10 by default).
>> >> >> >>> Whenever a
>> >> >> >>> TaskTracker needs those files for a task it is launching
>> >> >> >>> locally,
>> >> >> >>> it
>> >> >> >>> will
>> >> >> >>> fetch a copy to its local disk. So it won't need to do this
>> >> >> >>> again
>> >> >> >>> for
>> >> >> >>> future
>> >> >> >>> tasks on this node. After a job is done, all local copies and
>> >> >> >>> the
>> >> >> >>> HDFS
>> >> >> >>> copies of files in the distributed cache are cleaned up.
>> >> >> >>>
>> >> >> >>> Kai
>> >> >> >>>
>> >> >> >>> --
>> >> >> >>> Kai Voigt
>> >> >> >>> k@123.org
>> >> >> >>>
>> >> >> >>>
>> >> >> >>>
>> >> >> >>>
>> >> >> >>
>> >> >> >>
>> >> >> >> --
>> >> >> >> Kai Voigt
>> >> >> >> k@123.org
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Harsh J
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Harsh J
>> >
>> >
>>
>>
>>
>> --
>> Harsh J
>
>



-- 
Harsh J

Re: distributed cache

Posted by Lin Ma <li...@gmail.com>.
Thanks Harsh,

   1. For a long read, do you mean reading a large contiguous part of a file,
   rather than a small chunk of it?
   2. "gradually decreasing performance for long reads" -- do you mean that
   multiple parallel long reads degrade performance, or that even a single
   thread doing an exclusive long read degrades performance?

regards,
Lin

On Wed, Dec 26, 2012 at 7:48 PM, Harsh J <ha...@cloudera.com> wrote:

> Hi Lin,
>
> It is comparable (and is also logically similar) to reading a file
> multiple times in parallel in a local filesystem - not too much of a
> performance hit for small reads (by virtue of OS caches, and quick
> completion per read, as is usually the case for distributed cache
> files), and gradually decreasing performance for long reads (due to
> frequent disk physical movement)? Thankfully, due to block sizes the
> latter isn't a problem for large files on a proper DN, as the blocks
> are spread over the disks and across the nodes.
>
> On Wed, Dec 26, 2012 at 4:13 PM, Lin Ma <li...@gmail.com> wrote:
> > Thanks Harsh, so multiple concurrent reads are generally faster, right?
> >
> > regards,
> > Lin
> >
> >
> > On Wed, Dec 26, 2012 at 6:21 PM, Harsh J <ha...@cloudera.com> wrote:
> >>
> >> There is no limitation in HDFS that limits reads of a block to a
> >> single client at a time (no reason to do so) - so downloads can be as
> >> concurrent as possible.
> >>
> >> On Wed, Dec 26, 2012 at 3:41 PM, Lin Ma <li...@gmail.com> wrote:
> >> > Thanks Harsh,
> >> >
> >> > Supposing the DistributedCache file is uploaded by the client: in
> >> > Hadoop's design, can each replica serve only one download session at a
> >> > time (a download from a mapper or reducer that requires the
> >> > DistributedCache) until the file download completes, or can it serve
> >> > multiple concurrent download sessions (downloads from multiple mappers
> >> > or reducers that require the DistributedCache)?
> >> >
> >> > regards,
> >> > Lin
> >> >
> >> >
> >> > On Wed, Dec 26, 2012 at 4:51 PM, Harsh J <ha...@cloudera.com> wrote:
> >> >>
> >> >> Hi Lin,
> >> >>
> >> >> DistributedCache files are stored onto the HDFS by the client first.
> >> >> The TaskTrackers download and localize it. Therefore, as with any
> >> >> other file on HDFS, "downloads" can be efficiently parallel with
> >> >> higher replicas.
> >> >>
> >> >> The point of having higher replication for these files is also tied
> to
> >> >> the concept of racks in a cluster - you would want more replicas
> >> >> spread across racks such that on task bootup the downloads happen
> with
> >> >> rack locality.
> >> >>
> >> >> On Sat, Dec 22, 2012 at 6:54 PM, Lin Ma <li...@gmail.com> wrote:
> >> >> > Hi Kai,
> >> >> >
> >> >> > Smart answer! :-)
> >> >> >
> >> >> > The assumption you have is one distributed cache replica could only
> >> >> > serve
> >> >> > one download session for tasktracker node (this is why you get
> >> >> > concurrency
> >> >> > n/r). The question is, why one distributed cache replica cannot
> serve
> >> >> > multiple concurrent download session? For example, supposing a
> >> >> > tasktracker
> >> >> > use elapsed time t to download a file from a specific distributed
> >> >> > cache
> >> >> > replica, it is possible for 2 tasktrackers to download from the
> >> >> > specific
> >> >> > distributed cache replica in parallel using elapsed time t as well,
> >> >> > or
> >> >> > 1.5
> >> >> > t, which is faster than sequential download time 2t you mentioned
> >> >> > before?
> >> >> > "In total, r+n/r concurrent operations. If you optimize r depending
> >> >> > on
> >> >> > n,
> >> >> > SQRT(n) is the optimal replication level." -- how do you get
> SQRT(n)
> >> >> > for
> >> >> > minimize r+n/r? Appreciate if you could point me to more details.
> >> >> >
> >> >> > regards,
> >> >> > Lin
> >> >> >
> >> >> >
> >> >> > On Sat, Dec 22, 2012 at 8:51 PM, Kai Voigt <k...@123.org> wrote:
> >> >> >>
> >> >> >> Hi,
> >> >> >>
> >> >> >> simple math. Assuming you have n TaskTrackers in your cluster that
> >> >> >> will
> >> >> >> need to access the files in the distributed cache. And r is the
> >> >> >> replication
> >> >> >> level of those files.
> >> >> >>
> >> >> >> Copying the files into HDFS requires r copy operations over the
> >> >> >> network.
> >> >> >> The n TaskTrackers need to get their local copies from HDFS, so
> the
> >> >> >> n
> >> >> >> TaskTrackers copy from r DataNodes, so n/r concurrent operation.
> In
> >> >> >> total,
> >> >> >> r+n/r concurrent operations. If you optimize r depending on n,
> >> >> >> SQRT(n)
> >> >> >> is
> >> >> >> the optimal replication level. So 10 is a reasonable default
> setting
> >> >> >> for
> >> >> >> most clusters that are not 500+ nodes big.
> >> >> >>
> >> >> >> Kai
> >> >> >>
> >> >> >> Am 22.12.2012 um 13:46 schrieb Lin Ma <li...@gmail.com>:
> >> >> >>
> >> >> >> Thanks Kai, using higher replication count for the purpose of?
> >> >> >>
> >> >> >> regards,
> >> >> >> Lin
> >> >> >>
> >> >> >> On Sat, Dec 22, 2012 at 8:44 PM, Kai Voigt <k...@123.org> wrote:
> >> >> >>>
> >> >> >>> Hi,
> >> >> >>>
> >> >> >>> Am 22.12.2012 um 13:03 schrieb Lin Ma <li...@gmail.com>:
> >> >> >>>
> >> >> >>> > I want to confirm when on each task node either mapper or
> reducer
> >> >> >>> > access distributed cache file, it resides on disk, not resides
> in
> >> >> >>> > memory.
> >> >> >>> > Just want to make sure distributed cache file does not fully
> >> >> >>> > loaded
> >> >> >>> > into
> >> >> >>> > memory which compete memory consumption with mapper/reducer
> >> >> >>> > tasks.
> >> >> >>> > Is that
> >> >> >>> > correct?
> >> >> >>>
> >> >> >>>
> >> >> >>> Yes, you are correct. The JobTracker will put files for the
> >> >> >>> distributed
> >> >> >>> cache into HDFS with a higher replication count (10 by default).
> >> >> >>> Whenever a
> >> >> >>> TaskTracker needs those files for a task it is launching locally,
> >> >> >>> it
> >> >> >>> will
> >> >> >>> fetch a copy to its local disk. So it won't need to do this again
> >> >> >>> for
> >> >> >>> future
> >> >> >>> tasks on this node. After a job is done, all local copies and the
> >> >> >>> HDFS
> >> >> >>> copies of files in the distributed cache are cleaned up.
> >> >> >>>
> >> >> >>> Kai
> >> >> >>>
> >> >> >>> --
> >> >> >>> Kai Voigt
> >> >> >>> k@123.org
> >> >> >>>
> >> >> >>>
> >> >> >>>
> >> >> >>>
> >> >> >>
> >> >> >>
> >> >> >> --
> >> >> >> Kai Voigt
> >> >> >> k@123.org
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Harsh J
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Harsh J
> >
> >
>
>
>
> --
> Harsh J
>
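
To put an API face on the mechanics discussed in this thread: the client
registers a file, each TaskTracker localizes it to its own disk, and task
code reads the local copy (the framework never loads it into task memory on
its own). A minimal sketch against the Hadoop 1.x "mapred" API; the HDFS
path, the file format, and the surrounding job wiring are assumed for
illustration:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheExample {

    // Client side: register an HDFS file; the framework re-replicates it
    // and TaskTrackers localize it to their disks before tasks launch.
    public static void configureJob(JobConf conf) throws Exception {
        DistributedCache.addCacheFile(new URI("/apps/lookup/terms.txt"), conf);
    }

    // Task side (e.g., called from Mapper.configure()): the file is read
    // from the *local* filesystem, not from HDFS; loading it into memory,
    // as done here, is entirely the task code's own choice.
    public static Set<String> loadTerms(JobConf conf) throws IOException {
        Set<String> terms = new HashSet<String>();
        Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
        BufferedReader in = new BufferedReader(
                new FileReader(localFiles[0].toString()));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                terms.add(line.trim());
            }
        } finally {
            in.close();
        }
        return terms;
    }
}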

Re: distributed cache

Posted by Harsh J <ha...@cloudera.com>.
Hi Lin,

It is comparable (and is also logically similar) to reading a file
multiple times in parallel in a local filesystem - not too much of a
performance hit for small reads (by virtue of OS caches, and quick
completion per read, as is usually the case for distributed cache
files), and gradually decreasing performance for long reads (due to
frequent disk physical movement)? Thankfully, due to block sizes the
latter isn't a problem for large files on a proper DN, as the blocks
are spread over the disks and across the nodes.
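
A rough way to see this on an ordinary machine (a local-filesystem analogy,
not HDFS) is to let several threads scan the same file concurrently: once
the OS page cache is warm the extra readers cost little, while long scans of
a cold file on a single spindle would contend for disk seeks. A minimal
sketch; the file path and thread count are assumed example values:

import java.io.IOException;
import java.io.RandomAccessFile;

public class ConcurrentReadDemo {
    public static void main(String[] args) throws Exception {
        final String file = "/tmp/sample.dat"; // assumed to already exist
        Thread[] readers = new Thread[4];
        for (int i = 0; i < readers.length; i++) {
            readers[i] = new Thread(new Runnable() {
                public void run() {
                    byte[] buf = new byte[64 * 1024];
                    try {
                        RandomAccessFile raf = new RandomAccessFile(file, "r");
                        try {
                            long start = System.nanoTime();
                            // Each thread scans the whole file independently,
                            // mimicking concurrent block reads by clients.
                            while (raf.read(buf) != -1) { }
                            System.out.printf("%s took %.1f ms%n",
                                    Thread.currentThread().getName(),
                                    (System.nanoTime() - start) / 1e6);
                        } finally {
                            raf.close();
                        }
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        for (Thread t : readers) { t.start(); }
        for (Thread t : readers) { t.join(); }
    }
}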

On Wed, Dec 26, 2012 at 4:13 PM, Lin Ma <li...@gmail.com> wrote:
> Thanks Harsh, so multiple concurrent reads are generally faster, right?
>
> regards,
> Lin
>
>
> On Wed, Dec 26, 2012 at 6:21 PM, Harsh J <ha...@cloudera.com> wrote:
>>
>> There is no limitation in HDFS that limits reads of a block to a
>> single client at a time (no reason to do so) - so downloads can be as
>> concurrent as possible.
>>
>> On Wed, Dec 26, 2012 at 3:41 PM, Lin Ma <li...@gmail.com> wrote:
>> > Thanks Harsh,
>> >
>> > Supposing the DistributedCache file is uploaded by the client: in
>> > Hadoop's design, can each replica serve only one download session at a
>> > time (a download from a mapper or reducer that requires the
>> > DistributedCache) until the file download completes, or can it serve
>> > multiple concurrent download sessions (downloads from multiple mappers
>> > or reducers that require the DistributedCache)?
>> >
>> > regards,
>> > Lin
>> >
>> >
>> > On Wed, Dec 26, 2012 at 4:51 PM, Harsh J <ha...@cloudera.com> wrote:
>> >>
>> >> Hi Lin,
>> >>
>> >> DistributedCache files are stored onto the HDFS by the client first.
>> >> The TaskTrackers download and localize it. Therefore, as with any
>> >> other file on HDFS, "downloads" can be efficiently parallel with
>> >> higher replicas.
>> >>
>> >> The point of having higher replication for these files is also tied to
>> >> the concept of racks in a cluster - you would want more replicas
>> >> spread across racks such that on task bootup the downloads happen with
>> >> rack locality.
>> >>
>> >> On Sat, Dec 22, 2012 at 6:54 PM, Lin Ma <li...@gmail.com> wrote:
>> >> > Hi Kai,
>> >> >
>> >> > Smart answer! :-)
>> >> >
>> >> > The assumption you have is one distributed cache replica could only
>> >> > serve
>> >> > one download session for tasktracker node (this is why you get
>> >> > concurrency
>> >> > n/r). The question is, why one distributed cache replica cannot serve
>> >> > multiple concurrent download session? For example, supposing a
>> >> > tasktracker
>> >> > use elapsed time t to download a file from a specific distributed
>> >> > cache
>> >> > replica, it is possible for 2 tasktrackers to download from the
>> >> > specific
>> >> > distributed cache replica in parallel using elapsed time t as well,
>> >> > or
>> >> > 1.5
>> >> > t, which is faster than sequential download time 2t you mentioned
>> >> > before?
>> >> > "In total, r+n/r concurrent operations. If you optimize r depending
>> >> > on
>> >> > n,
>> >> > SQRT(n) is the optimal replication level." -- how do you get SQRT(n)
>> >> > for
>> >> > minimize r+n/r? Appreciate if you could point me to more details.
>> >> >
>> >> > regards,
>> >> > Lin
>> >> >
>> >> >
>> >> > On Sat, Dec 22, 2012 at 8:51 PM, Kai Voigt <k...@123.org> wrote:
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> simple math. Assuming you have n TaskTrackers in your cluster that
>> >> >> will
>> >> >> need to access the files in the distributed cache. And r is the
>> >> >> replication
>> >> >> level of those files.
>> >> >>
>> >> >> Copying the files into HDFS requires r copy operations over the
>> >> >> network.
>> >> >> The n TaskTrackers need to get their local copies from HDFS, so the
>> >> >> n
>> >> >> TaskTrackers copy from r DataNodes, so n/r concurrent operation. In
>> >> >> total,
>> >> >> r+n/r concurrent operations. If you optimize r depending on n,
>> >> >> SQRT(n)
>> >> >> is
>> >> >> the optimal replication level. So 10 is a reasonable default setting
>> >> >> for
>> >> >> most clusters that are not 500+ nodes big.
>> >> >>
>> >> >> Kai
>> >> >>
>> >> >> Am 22.12.2012 um 13:46 schrieb Lin Ma <li...@gmail.com>:
>> >> >>
>> >> >> Thanks Kai, using higher replication count for the purpose of?
>> >> >>
>> >> >> regards,
>> >> >> Lin
>> >> >>
>> >> >> On Sat, Dec 22, 2012 at 8:44 PM, Kai Voigt <k...@123.org> wrote:
>> >> >>>
>> >> >>> Hi,
>> >> >>>
>> >> >>> Am 22.12.2012 um 13:03 schrieb Lin Ma <li...@gmail.com>:
>> >> >>>
>> >> >>> > I want to confirm when on each task node either mapper or reducer
>> >> >>> > access distributed cache file, it resides on disk, not resides in
>> >> >>> > memory.
>> >> >>> > Just want to make sure distributed cache file does not fully
>> >> >>> > loaded
>> >> >>> > into
>> >> >>> > memory which compete memory consumption with mapper/reducer
>> >> >>> > tasks.
>> >> >>> > Is that
>> >> >>> > correct?
>> >> >>>
>> >> >>>
>> >> >>> Yes, you are correct. The JobTracker will put files for the
>> >> >>> distributed
>> >> >>> cache into HDFS with a higher replication count (10 by default).
>> >> >>> Whenever a
>> >> >>> TaskTracker needs those files for a task it is launching locally,
>> >> >>> it
>> >> >>> will
>> >> >>> fetch a copy to its local disk. So it won't need to do this again
>> >> >>> for
>> >> >>> future
>> >> >>> tasks on this node. After a job is done, all local copies and the
>> >> >>> HDFS
>> >> >>> copies of files in the distributed cache are cleaned up.
>> >> >>>
>> >> >>> Kai
>> >> >>>
>> >> >>> --
>> >> >>> Kai Voigt
>> >> >>> k@123.org
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Kai Voigt
>> >> >> k@123.org
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Harsh J
>> >
>> >
>>
>>
>>
>> --
>> Harsh J
>
>



-- 
Harsh J

Re: distributed cache

Posted by Harsh J <ha...@cloudera.com>.
Hi Lin,

It is comparable (and is also logically similar) to reading a file
multiple times in parallel in a local filesystem - not too much of a
performance hit for small reads (by virtue of OS caches, and quick
completion per read, as is usually the case for distributed cache
files), and gradually decreasing performance for long reads (due to
frequent disk physical movement)? Thankfully, due to block sizes the
latter isn't a problem for large files on a proper DN, as the blocks
are spread over the disks and across the nodes.

On Wed, Dec 26, 2012 at 4:13 PM, Lin Ma <li...@gmail.com> wrote:
> Thanks Harsh, multiple concurrent read is generally faster or?
>
> regards,
> Lin
>
>
> On Wed, Dec 26, 2012 at 6:21 PM, Harsh J <ha...@cloudera.com> wrote:
>>
>> There is no limitation in HDFS that limits reads of a block to a
>> single client at a time (no reason to do so) - so downloads can be as
>> concurrent as possible.
>>
>> On Wed, Dec 26, 2012 at 3:41 PM, Lin Ma <li...@gmail.com> wrote:
>> > Thanks Harsh,
>> >
>> > Supposing DistributedCache is uploaded by client, for each replica, in
>> > Hadoop design, it could only serve one download session (download from a
>> > mapper or a reducer which requires the DistributedCache) at a time until
>> > DistributedCache file download is completed, or it could serve multiple
>> > concurrent parallel download session (download from multiple mappers or
>> > reducers which requires the DistributedCache).
>> >
>> > regards,
>> > Lin
>> >
>> >
>> > On Wed, Dec 26, 2012 at 4:51 PM, Harsh J <ha...@cloudera.com> wrote:
>> >>
>> >> Hi Lin,
>> >>
>> >> DistributedCache files are stored onto the HDFS by the client first.
>> >> The TaskTrackers download and localize it. Therefore, as with any
>> >> other file on HDFS, "downloads" can be efficiently parallel with
>> >> higher replicas.
>> >>
>> >> The point of having higher replication for these files is also tied to
>> >> the concept of racks in a cluster - you would want more replicas
>> >> spread across racks such that on task bootup the downloads happen with
>> >> rack locality.
>> >>
>> >> On Sat, Dec 22, 2012 at 6:54 PM, Lin Ma <li...@gmail.com> wrote:
>> >> > Hi Kai,
>> >> >
>> >> > Smart answer! :-)
>> >> >
>> >> > The assumption you have is one distributed cache replica could only
>> >> > serve
>> >> > one download session for tasktracker node (this is why you get
>> >> > concurrency
>> >> > n/r). The question is, why one distributed cache replica cannot serve
>> >> > multiple concurrent download session? For example, supposing a
>> >> > tasktracker
>> >> > use elapsed time t to download a file from a specific distributed
>> >> > cache
>> >> > replica, it is possible for 2 tasktrackers to download from the
>> >> > specific
>> >> > distributed cache replica in parallel using elapsed time t as well,
>> >> > or
>> >> > 1.5
>> >> > t, which is faster than sequential download time 2t you mentioned
>> >> > before?
>> >> > "In total, r+n/r concurrent operations. If you optimize r depending
>> >> > on
>> >> > n,
>> >> > SRQT(n) is the optimal replication level." -- how do you get SRQT(n)
>> >> > for
>> >> > minimize r+n/r? Appreciate if you could point me to more details.
>> >> >
>> >> > regards,
>> >> > Lin
>> >> >
>> >> >
>> >> > On Sat, Dec 22, 2012 at 8:51 PM, Kai Voigt <k...@123.org> wrote:
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> simple math. Assuming you have n TaskTrackers in your cluster that
>> >> >> will
>> >> >> need to access the files in the distributed cache. And r is the
>> >> >> replication
>> >> >> level of those files.
>> >> >>
>> >> >> Copying the files into HDFS requires r copy operations over the
>> >> >> network.
>> >> >> The n TaskTrackers need to get their local copies from HDFS, so the
>> >> >> n
>> >> >> TaskTrackers copy from r DataNodes, so n/r concurrent operation. In
>> >> >> total,
>> >> >> r+n/r concurrent operations. If you optimize r depending on n,
>> >> >> SRQT(n)
>> >> >> is
>> >> >> the optimal replication level. So 10 is a reasonable default setting
>> >> >> for
>> >> >> most clusters that are not 500+ nodes big.
>> >> >>
>> >> >> Kai
>> >> >>
>> >> >> Am 22.12.2012 um 13:46 schrieb Lin Ma <li...@gmail.com>:
>> >> >>
>> >> >> Thanks Kai, using higher replication count for the purpose of?
>> >> >>
>> >> >> regards,
>> >> >> Lin
>> >> >>
>> >> >> On Sat, Dec 22, 2012 at 8:44 PM, Kai Voigt <k...@123.org> wrote:
>> >> >>>
>> >> >>> Hi,
>> >> >>>
>> >> >>> Am 22.12.2012 um 13:03 schrieb Lin Ma <li...@gmail.com>:
>> >> >>>
>> >> >>> > I want to confirm when on each task node either mapper or reducer
>> >> >>> > access distributed cache file, it resides on disk, not resides in
>> >> >>> > memory.
>> >> >>> > Just want to make sure distributed cache file does not fully
>> >> >>> > loaded
>> >> >>> > into
>> >> >>> > memory which compete memory consumption with mapper/reducer
>> >> >>> > tasks.
>> >> >>> > Is that
>> >> >>> > correct?
>> >> >>>
>> >> >>>
>> >> >>> Yes, you are correct. The JobTracker will put files for the
>> >> >>> distributed
>> >> >>> cache into HDFS with a higher replication count (10 by default).
>> >> >>> Whenever a
>> >> >>> TaskTracker needs those files for a task it is launching locally,
>> >> >>> it
>> >> >>> will
>> >> >>> fetch a copy to its local disk. So it won't need to do this again
>> >> >>> for
>> >> >>> future
>> >> >>> tasks on this node. After a job is done, all local copies and the
>> >> >>> HDFS
>> >> >>> copies of files in the distributed cache are cleaned up.
>> >> >>>
>> >> >>> Kai
>> >> >>>
>> >> >>> --
>> >> >>> Kai Voigt
>> >> >>> k@123.org
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Kai Voigt
>> >> >> k@123.org
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Harsh J
>> >
>> >
>>
>>
>>
>> --
>> Harsh J
>
>



-- 
Harsh J

Re: distributed cache

Posted by Harsh J <ha...@cloudera.com>.
Hi Lin,

It is comparable (and is also logically similar) to reading a file
multiple times in parallel in a local filesystem - not too much of a
performance hit for small reads (by virtue of OS caches, and quick
completion per read, as is usually the case for distributed cache
files), and gradually decreasing performance for long reads (due to
frequent disk physical movement)? Thankfully, due to block sizes the
latter isn't a problem for large files on a proper DN, as the blocks
are spread over the disks and across the nodes.

On Wed, Dec 26, 2012 at 4:13 PM, Lin Ma <li...@gmail.com> wrote:
> Thanks Harsh, multiple concurrent read is generally faster or?
>
> regards,
> Lin
>
>
> On Wed, Dec 26, 2012 at 6:21 PM, Harsh J <ha...@cloudera.com> wrote:
>>
>> There is no limitation in HDFS that limits reads of a block to a
>> single client at a time (no reason to do so) - so downloads can be as
>> concurrent as possible.
>>
>> On Wed, Dec 26, 2012 at 3:41 PM, Lin Ma <li...@gmail.com> wrote:
>> > Thanks Harsh,
>> >
>> > Supposing DistributedCache is uploaded by client, for each replica, in
>> > Hadoop design, it could only serve one download session (download from a
>> > mapper or a reducer which requires the DistributedCache) at a time until
>> > DistributedCache file download is completed, or it could serve multiple
>> > concurrent parallel download session (download from multiple mappers or
>> > reducers which requires the DistributedCache).
>> >
>> > regards,
>> > Lin
>> >
>> >
>> > On Wed, Dec 26, 2012 at 4:51 PM, Harsh J <ha...@cloudera.com> wrote:
>> >>
>> >> Hi Lin,
>> >>
>> >> DistributedCache files are stored onto the HDFS by the client first.
>> >> The TaskTrackers download and localize it. Therefore, as with any
>> >> other file on HDFS, "downloads" can be efficiently parallel with
>> >> higher replicas.
>> >>
>> >> The point of having higher replication for these files is also tied to
>> >> the concept of racks in a cluster - you would want more replicas
>> >> spread across racks such that on task bootup the downloads happen with
>> >> rack locality.
>> >>
>> >> On Sat, Dec 22, 2012 at 6:54 PM, Lin Ma <li...@gmail.com> wrote:
>> >> > Hi Kai,
>> >> >
>> >> > Smart answer! :-)
>> >> >
>> >> > The assumption you have is one distributed cache replica could only
>> >> > serve
>> >> > one download session for tasktracker node (this is why you get
>> >> > concurrency
>> >> > n/r). The question is, why one distributed cache replica cannot serve
>> >> > multiple concurrent download session? For example, supposing a
>> >> > tasktracker
>> >> > use elapsed time t to download a file from a specific distributed
>> >> > cache
>> >> > replica, it is possible for 2 tasktrackers to download from the
>> >> > specific
>> >> > distributed cache replica in parallel using elapsed time t as well,
>> >> > or
>> >> > 1.5
>> >> > t, which is faster than sequential download time 2t you mentioned
>> >> > before?
>> >> > "In total, r+n/r concurrent operations. If you optimize r depending
>> >> > on
>> >> > n,
>> >> > SRQT(n) is the optimal replication level." -- how do you get SRQT(n)
>> >> > for
>> >> > minimize r+n/r? Appreciate if you could point me to more details.
>> >> >
>> >> > regards,
>> >> > Lin
>> >> >
>> >> >
>> >> > On Sat, Dec 22, 2012 at 8:51 PM, Kai Voigt <k...@123.org> wrote:
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> simple math. Assuming you have n TaskTrackers in your cluster that
>> >> >> will
>> >> >> need to access the files in the distributed cache. And r is the
>> >> >> replication
>> >> >> level of those files.
>> >> >>
>> >> >> Copying the files into HDFS requires r copy operations over the
>> >> >> network.
>> >> >> The n TaskTrackers need to get their local copies from HDFS, so the
>> >> >> n
>> >> >> TaskTrackers copy from r DataNodes, so n/r concurrent operation. In
>> >> >> total,
>> >> >> r+n/r concurrent operations. If you optimize r depending on n,
>> >> >> SRQT(n)
>> >> >> is
>> >> >> the optimal replication level. So 10 is a reasonable default setting
>> >> >> for
>> >> >> most clusters that are not 500+ nodes big.
>> >> >>
>> >> >> Kai
>> >> >>
>> >> >> Am 22.12.2012 um 13:46 schrieb Lin Ma <li...@gmail.com>:
>> >> >>
>> >> >> Thanks Kai, using higher replication count for the purpose of?
>> >> >>
>> >> >> regards,
>> >> >> Lin
>> >> >>
>> >> >> On Sat, Dec 22, 2012 at 8:44 PM, Kai Voigt <k...@123.org> wrote:
>> >> >>>
>> >> >>> Hi,
>> >> >>>
>> >> >>> Am 22.12.2012 um 13:03 schrieb Lin Ma <li...@gmail.com>:
>> >> >>>
>> >> >>> > I want to confirm when on each task node either mapper or reducer
>> >> >>> > access distributed cache file, it resides on disk, not resides in
>> >> >>> > memory.
>> >> >>> > Just want to make sure distributed cache file does not fully
>> >> >>> > loaded
>> >> >>> > into
>> >> >>> > memory which compete memory consumption with mapper/reducer
>> >> >>> > tasks.
>> >> >>> > Is that
>> >> >>> > correct?
>> >> >>>
>> >> >>>
>> >> >>> Yes, you are correct. The JobTracker will put files for the
>> >> >>> distributed
>> >> >>> cache into HDFS with a higher replication count (10 by default).
>> >> >>> Whenever a
>> >> >>> TaskTracker needs those files for a task it is launching locally,
>> >> >>> it
>> >> >>> will
>> >> >>> fetch a copy to its local disk. So it won't need to do this again
>> >> >>> for
>> >> >>> future
>> >> >>> tasks on this node. After a job is done, all local copies and the
>> >> >>> HDFS
>> >> >>> copies of files in the distributed cache are cleaned up.
>> >> >>>
>> >> >>> Kai
>> >> >>>
>> >> >>> --
>> >> >>> Kai Voigt
>> >> >>> k@123.org
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Kai Voigt
>> >> >> k@123.org
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Harsh J
>> >
>> >
>>
>>
>>
>> --
>> Harsh J
>
>



-- 
Harsh J

Re: distributed cache

Posted by Harsh J <ha...@cloudera.com>.
Hi Lin,

It is comparable (and is also logically similar) to reading a file
multiple times in parallel in a local filesystem - not too much of a
performance hit for small reads (by virtue of OS caches, and quick
completion per read, as is usually the case for distributed cache
files), and gradually decreasing performance for long reads (due to
frequent disk physical movement)? Thankfully, due to block sizes the
latter isn't a problem for large files on a proper DN, as the blocks
are spread over the disks and across the nodes.

On Wed, Dec 26, 2012 at 4:13 PM, Lin Ma <li...@gmail.com> wrote:
> Thanks Harsh, multiple concurrent read is generally faster or?
>
> regards,
> Lin
>
>
> On Wed, Dec 26, 2012 at 6:21 PM, Harsh J <ha...@cloudera.com> wrote:
>>
>> There is no limitation in HDFS that limits reads of a block to a
>> single client at a time (no reason to do so) - so downloads can be as
>> concurrent as possible.
>>
>> On Wed, Dec 26, 2012 at 3:41 PM, Lin Ma <li...@gmail.com> wrote:
>> > Thanks Harsh,
>> >
>> > Supposing DistributedCache is uploaded by client, for each replica, in
>> > Hadoop design, it could only serve one download session (download from a
>> > mapper or a reducer which requires the DistributedCache) at a time until
>> > DistributedCache file download is completed, or it could serve multiple
>> > concurrent parallel download session (download from multiple mappers or
>> > reducers which requires the DistributedCache).
>> >
>> > regards,
>> > Lin
>> >
>> >
>> > On Wed, Dec 26, 2012 at 4:51 PM, Harsh J <ha...@cloudera.com> wrote:
>> >>
>> >> Hi Lin,
>> >>
>> >> DistributedCache files are stored onto the HDFS by the client first.
>> >> The TaskTrackers download and localize it. Therefore, as with any
>> >> other file on HDFS, "downloads" can be efficiently parallel with
>> >> higher replicas.
>> >>
>> >> The point of having higher replication for these files is also tied to
>> >> the concept of racks in a cluster - you would want more replicas
>> >> spread across racks such that on task bootup the downloads happen with
>> >> rack locality.
>> >>
>> >> On Sat, Dec 22, 2012 at 6:54 PM, Lin Ma <li...@gmail.com> wrote:
>> >> > Hi Kai,
>> >> >
>> >> > Smart answer! :-)
>> >> >
>> >> > The assumption you have is one distributed cache replica could only
>> >> > serve
>> >> > one download session for tasktracker node (this is why you get
>> >> > concurrency
>> >> > n/r). The question is, why one distributed cache replica cannot serve
>> >> > multiple concurrent download session? For example, supposing a
>> >> > tasktracker
>> >> > use elapsed time t to download a file from a specific distributed
>> >> > cache
>> >> > replica, it is possible for 2 tasktrackers to download from the
>> >> > specific
>> >> > distributed cache replica in parallel using elapsed time t as well,
>> >> > or
>> >> > 1.5
>> >> > t, which is faster than sequential download time 2t you mentioned
>> >> > before?
>> >> > "In total, r+n/r concurrent operations. If you optimize r depending
>> >> > on
>> >> > n,
>> >> > SQRT(n) is the optimal replication level." -- how do you get SQRT(n)
>> >> > for
>> >> > minimize r+n/r? Appreciate if you could point me to more details.
>> >> >
>> >> > regards,
>> >> > Lin
>> >> >
>> >> >
>> >> > On Sat, Dec 22, 2012 at 8:51 PM, Kai Voigt <k...@123.org> wrote:
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> simple math. Assuming you have n TaskTrackers in your cluster that
>> >> >> will
>> >> >> need to access the files in the distributed cache. And r is the
>> >> >> replication
>> >> >> level of those files.
>> >> >>
>> >> >> Copying the files into HDFS requires r copy operations over the
>> >> >> network.
>> >> >> The n TaskTrackers need to get their local copies from HDFS, so the
>> >> >> n
>> >> >> TaskTrackers copy from r DataNodes, so n/r concurrent operation. In
>> >> >> total,
>> >> >> r+n/r concurrent operations. If you optimize r depending on n,
>> >> >> SQRT(n)
>> >> >> is
>> >> >> the optimal replication level. So 10 is a reasonable default setting
>> >> >> for
>> >> >> most clusters that are not 500+ nodes big.
>> >> >>
>> >> >> Kai
>> >> >>
>> >> >> Am 22.12.2012 um 13:46 schrieb Lin Ma <li...@gmail.com>:
>> >> >>
>> >> >> Thanks Kai, using higher replication count for the purpose of?
>> >> >>
>> >> >> regards,
>> >> >> Lin
>> >> >>
>> >> >> On Sat, Dec 22, 2012 at 8:44 PM, Kai Voigt <k...@123.org> wrote:
>> >> >>>
>> >> >>> Hi,
>> >> >>>
>> >> >>> Am 22.12.2012 um 13:03 schrieb Lin Ma <li...@gmail.com>:
>> >> >>>
>> >> >>> > I want to confirm when on each task node either mapper or reducer
>> >> >>> > access distributed cache file, it resides on disk, not resides in
>> >> >>> > memory.
>> >> >>> > Just want to make sure distributed cache file does not fully
>> >> >>> > loaded
>> >> >>> > into
>> >> >>> > memory which compete memory consumption with mapper/reducer
>> >> >>> > tasks.
>> >> >>> > Is that
>> >> >>> > correct?
>> >> >>>
>> >> >>>
>> >> >>> Yes, you are correct. The JobTracker will put files for the
>> >> >>> distributed
>> >> >>> cache into HDFS with a higher replication count (10 by default).
>> >> >>> Whenever a
>> >> >>> TaskTracker needs those files for a task it is launching locally,
>> >> >>> it
>> >> >>> will
>> >> >>> fetch a copy to its local disk. So it won't need to do this again
>> >> >>> for
>> >> >>> future
>> >> >>> tasks on this node. After a job is done, all local copies and the
>> >> >>> HDFS
>> >> >>> copies of files in the distributed cache are cleaned up.
>> >> >>>
>> >> >>> Kai
>> >> >>>
>> >> >>> --
>> >> >>> Kai Voigt
>> >> >>> k@123.org
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Kai Voigt
>> >> >> k@123.org
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Harsh J
>> >
>> >
>>
>>
>>
>> --
>> Harsh J
>
>



-- 
Harsh J

Re: distributed cache

Posted by Lin Ma <li...@gmail.com>.
Thanks Harsh, so multiple concurrent reads are generally faster?

regards,
Lin

On Wed, Dec 26, 2012 at 6:21 PM, Harsh J <ha...@cloudera.com> wrote:

> There is no limitation in HDFS that limits reads of a block to a
> single client at a time (no reason to do so) - so downloads can be as
> concurrent as possible.
>
> On Wed, Dec 26, 2012 at 3:41 PM, Lin Ma <li...@gmail.com> wrote:
> > Thanks Harsh,
> >
> > Supposing DistributedCache is uploaded by client, for each replica, in
> > Hadoop design, it could only serve one download session (download from a
> > mapper or a reducer which requires the DistributedCache) at a time until
> > DistributedCache file download is completed, or it could serve multiple
> > concurrent parallel download session (download from multiple mappers or
> > reducers which requires the DistributedCache).
> >
> > regards,
> > Lin
> >
> >
> > On Wed, Dec 26, 2012 at 4:51 PM, Harsh J <ha...@cloudera.com> wrote:
> >>
> >> Hi Lin,
> >>
> >> DistributedCache files are stored onto the HDFS by the client first.
> >> The TaskTrackers download and localize it. Therefore, as with any
> >> other file on HDFS, "downloads" can be efficiently parallel with
> >> higher replicas.
> >>
> >> The point of having higher replication for these files is also tied to
> >> the concept of racks in a cluster - you would want more replicas
> >> spread across racks such that on task bootup the downloads happen with
> >> rack locality.
> >>
> >> On Sat, Dec 22, 2012 at 6:54 PM, Lin Ma <li...@gmail.com> wrote:
> >> > Hi Kai,
> >> >
> >> > Smart answer! :-)
> >> >
> >> > The assumption you have is one distributed cache replica could only
> >> > serve
> >> > one download session for tasktracker node (this is why you get
> >> > concurrency
> >> > n/r). The question is, why one distributed cache replica cannot serve
> >> > multiple concurrent download session? For example, supposing a
> >> > tasktracker
> >> > use elapsed time t to download a file from a specific distributed
> cache
> >> > replica, it is possible for 2 tasktrackers to download from the
> specific
> >> > distributed cache replica in parallel using elapsed time t as well, or
> >> > 1.5
> >> > t, which is faster than sequential download time 2t you mentioned
> >> > before?
> >> > "In total, r+n/r concurrent operations. If you optimize r depending on
> >> > n,
> >> > SQRT(n) is the optimal replication level." -- how do you get SQRT(n)
> for
> >> > minimize r+n/r? Appreciate if you could point me to more details.
> >> >
> >> > regards,
> >> > Lin
> >> >
> >> >
> >> > On Sat, Dec 22, 2012 at 8:51 PM, Kai Voigt <k...@123.org> wrote:
> >> >>
> >> >> Hi,
> >> >>
> >> >> simple math. Assuming you have n TaskTrackers in your cluster that
> will
> >> >> need to access the files in the distributed cache. And r is the
> >> >> replication
> >> >> level of those files.
> >> >>
> >> >> Copying the files into HDFS requires r copy operations over the
> >> >> network.
> >> >> The n TaskTrackers need to get their local copies from HDFS, so the n
> >> >> TaskTrackers copy from r DataNodes, so n/r concurrent operation. In
> >> >> total,
> >> >> r+n/r concurrent operations. If you optimize r depending on n,
> SQRT(n)
> >> >> is
> >> >> the optimal replication level. So 10 is a reasonable default setting
> >> >> for
> >> >> most clusters that are not 500+ nodes big.
> >> >>
> >> >> Kai
> >> >>
> >> >> Am 22.12.2012 um 13:46 schrieb Lin Ma <li...@gmail.com>:
> >> >>
> >> >> Thanks Kai, using higher replication count for the purpose of?
> >> >>
> >> >> regards,
> >> >> Lin
> >> >>
> >> >> On Sat, Dec 22, 2012 at 8:44 PM, Kai Voigt <k...@123.org> wrote:
> >> >>>
> >> >>> Hi,
> >> >>>
> >> >>> Am 22.12.2012 um 13:03 schrieb Lin Ma <li...@gmail.com>:
> >> >>>
> >> >>> > I want to confirm when on each task node either mapper or reducer
> >> >>> > access distributed cache file, it resides on disk, not resides in
> >> >>> > memory.
> >> >>> > Just want to make sure distributed cache file does not fully
> loaded
> >> >>> > into
> >> >>> > memory which compete memory consumption with mapper/reducer tasks.
> >> >>> > Is that
> >> >>> > correct?
> >> >>>
> >> >>>
> >> >>> Yes, you are correct. The JobTracker will put files for the
> >> >>> distributed
> >> >>> cache into HDFS with a higher replication count (10 by default).
> >> >>> Whenever a
> >> >>> TaskTracker needs those files for a task it is launching locally, it
> >> >>> will
> >> >>> fetch a copy to its local disk. So it won't need to do this again
> for
> >> >>> future
> >> >>> tasks on this node. After a job is done, all local copies and the
> HDFS
> >> >>> copies of files in the distributed cache are cleaned up.
> >> >>>
> >> >>> Kai
> >> >>>
> >> >>> --
> >> >>> Kai Voigt
> >> >>> k@123.org
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>
> >> >>
> >> >> --
> >> >> Kai Voigt
> >> >> k@123.org
> >> >>
> >> >>
> >> >>
> >> >>
> >> >
> >>
> >>
> >>
> >> --
> >> Harsh J
> >
> >
>
>
>
> --
> Harsh J
>

Re: distributed cache

Posted by Harsh J <ha...@cloudera.com>.
There is no limitation in HDFS that restricts reads of a block to a
single client at a time (there is no reason for one) - so downloads
can be as concurrent as needed.
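
One way to convince yourself is to open the same HDFS file from
several threads at once - each reader gets its own stream and they all
make progress in parallel. A minimal sketch (the path is a made-up
example, and the Configuration is assumed to point at your cluster):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ConcurrentReaders {
      public static void main(String[] args) throws Exception {
        final Configuration conf = new Configuration();
        final Path file = new Path("/cache/lookup.dat"); // hypothetical file
        Thread[] readers = new Thread[4];
        for (int i = 0; i < readers.length; i++) {
          readers[i] = new Thread(new Runnable() {
            public void run() {
              try {
                FileSystem fs = FileSystem.get(conf);
                // No per-block reader lock: all four streams proceed at once.
                FSDataInputStream in = fs.open(file);
                byte[] buf = new byte[64 * 1024];
                long total = 0;
                for (int n; (n = in.read(buf)) > 0; ) {
                  total += n;
                }
                in.close();
                System.out.println(Thread.currentThread().getName()
                    + " read " + total + " bytes");
              } catch (IOException e) {
                e.printStackTrace();
              }
            }
          });
          readers[i].start();
        }
        for (Thread t : readers) {
          t.join();
        }
      }
    }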

On Wed, Dec 26, 2012 at 3:41 PM, Lin Ma <li...@gmail.com> wrote:
> Thanks Harsh,
>
> Supposing DistributedCache is uploaded by client, for each replica, in
> Hadoop design, it could only serve one download session (download from a
> mapper or a reducer which requires the DistributedCache) at a time until
> DistributedCache file download is completed, or it could serve multiple
> concurrent parallel download session (download from multiple mappers or
> reducers which requires the DistributedCache).
>
> regards,
> Lin
>
>
> On Wed, Dec 26, 2012 at 4:51 PM, Harsh J <ha...@cloudera.com> wrote:
>>
>> Hi Lin,
>>
>> DistributedCache files are stored onto the HDFS by the client first.
>> The TaskTrackers download and localize it. Therefore, as with any
>> other file on HDFS, "downloads" can be efficiently parallel with
>> higher replicas.
>>
>> The point of having higher replication for these files is also tied to
>> the concept of racks in a cluster - you would want more replicas
>> spread across racks such that on task bootup the downloads happen with
>> rack locality.
>>
>> On Sat, Dec 22, 2012 at 6:54 PM, Lin Ma <li...@gmail.com> wrote:
>> > Hi Kai,
>> >
>> > Smart answer! :-)
>> >
>> > The assumption you have is one distributed cache replica could only
>> > serve
>> > one download session for tasktracker node (this is why you get
>> > concurrency
>> > n/r). The question is, why one distributed cache replica cannot serve
>> > multiple concurrent download session? For example, supposing a
>> > tasktracker
>> > use elapsed time t to download a file from a specific distributed cache
>> > replica, it is possible for 2 tasktrackers to download from the specific
>> > distributed cache replica in parallel using elapsed time t as well, or
>> > 1.5
>> > t, which is faster than sequential download time 2t you mentioned
>> > before?
>> > "In total, r+n/r concurrent operations. If you optimize r depending on
>> > n,
>> > SQRT(n) is the optimal replication level." -- how do you get SQRT(n) for
>> > minimize r+n/r? Appreciate if you could point me to more details.
>> >
>> > regards,
>> > Lin
>> >
>> >
>> > On Sat, Dec 22, 2012 at 8:51 PM, Kai Voigt <k...@123.org> wrote:
>> >>
>> >> Hi,
>> >>
>> >> simple math. Assuming you have n TaskTrackers in your cluster that will
>> >> need to access the files in the distributed cache. And r is the
>> >> replication
>> >> level of those files.
>> >>
>> >> Copying the files into HDFS requires r copy operations over the
>> >> network.
>> >> The n TaskTrackers need to get their local copies from HDFS, so the n
>> >> TaskTrackers copy from r DataNodes, so n/r concurrent operation. In
>> >> total,
>> >> r+n/r concurrent operations. If you optimize r depending on n, SQRT(n)
>> >> is
>> >> the optimal replication level. So 10 is a reasonable default setting
>> >> for
>> >> most clusters that are not 500+ nodes big.
>> >>
>> >> Kai
>> >>
>> >> Am 22.12.2012 um 13:46 schrieb Lin Ma <li...@gmail.com>:
>> >>
>> >> Thanks Kai, using higher replication count for the purpose of?
>> >>
>> >> regards,
>> >> Lin
>> >>
>> >> On Sat, Dec 22, 2012 at 8:44 PM, Kai Voigt <k...@123.org> wrote:
>> >>>
>> >>> Hi,
>> >>>
>> >>> Am 22.12.2012 um 13:03 schrieb Lin Ma <li...@gmail.com>:
>> >>>
>> >>> > I want to confirm when on each task node either mapper or reducer
>> >>> > access distributed cache file, it resides on disk, not resides in
>> >>> > memory.
>> >>> > Just want to make sure distributed cache file does not fully loaded
>> >>> > into
>> >>> > memory which compete memory consumption with mapper/reducer tasks.
>> >>> > Is that
>> >>> > correct?
>> >>>
>> >>>
>> >>> Yes, you are correct. The JobTracker will put files for the
>> >>> distributed
>> >>> cache into HDFS with a higher replication count (10 by default).
>> >>> Whenever a
>> >>> TaskTracker needs those files for a task it is launching locally, it
>> >>> will
>> >>> fetch a copy to its local disk. So it won't need to do this again for
>> >>> future
>> >>> tasks on this node. After a job is done, all local copies and the HDFS
>> >>> copies of files in the distributed cache are cleaned up.
>> >>>
>> >>> Kai
>> >>>
>> >>> --
>> >>> Kai Voigt
>> >>> k@123.org
>> >>>
>> >>>
>> >>>
>> >>>
>> >>
>> >>
>> >> --
>> >> Kai Voigt
>> >> k@123.org
>> >>
>> >>
>> >>
>> >>
>> >
>>
>>
>>
>> --
>> Harsh J
>
>



-- 
Harsh J

Re: distributed cache

Posted by Lin Ma <li...@gmail.com>.
Thanks Harsh,

Supposing the DistributedCache files are uploaded by the client: in
the Hadoop design, can each replica serve only one download session
at a time (a download from a mapper or reducer that requires the
DistributedCache) until that file transfer completes, or can it serve
multiple concurrent download sessions (downloads from multiple
mappers or reducers that require the DistributedCache)?

regards,
Lin

On Wed, Dec 26, 2012 at 4:51 PM, Harsh J <ha...@cloudera.com> wrote:

> Hi Lin,
>
> DistributedCache files are stored onto the HDFS by the client first.
> The TaskTrackers download and localize it. Therefore, as with any
> other file on HDFS, "downloads" can be efficiently parallel with
> higher replicas.
>
> The point of having higher replication for these files is also tied to
> the concept of racks in a cluster - you would want more replicas
> spread across racks such that on task bootup the downloads happen with
> rack locality.
>
> On Sat, Dec 22, 2012 at 6:54 PM, Lin Ma <li...@gmail.com> wrote:
> > Hi Kai,
> >
> > Smart answer! :-)
> >
> > The assumption you have is one distributed cache replica could only serve
> > one download session for tasktracker node (this is why you get
> concurrency
> > n/r). The question is, why one distributed cache replica cannot serve
> > multiple concurrent download session? For example, supposing a
> tasktracker
> > use elapsed time t to download a file from a specific distributed cache
> > replica, it is possible for 2 tasktrackers to download from the specific
> > distributed cache replica in parallel using elapsed time t as well, or
> 1.5
> > t, which is faster than sequential download time 2t you mentioned before?
> > "In total, r+n/r concurrent operations. If you optimize r depending on n,
> > SQRT(n) is the optimal replication level." -- how do you get SQRT(n) for
> > minimize r+n/r? Appreciate if you could point me to more details.
> >
> > regards,
> > Lin
> >
> >
> > On Sat, Dec 22, 2012 at 8:51 PM, Kai Voigt <k...@123.org> wrote:
> >>
> >> Hi,
> >>
> >> simple math. Assuming you have n TaskTrackers in your cluster that will
> >> need to access the files in the distributed cache. And r is the
> replication
> >> level of those files.
> >>
> >> Copying the files into HDFS requires r copy operations over the network.
> >> The n TaskTrackers need to get their local copies from HDFS, so the n
> >> TaskTrackers copy from r DataNodes, so n/r concurrent operation. In
> total,
> >> r+n/r concurrent operations. If you optimize r depending on n, SQRT(n)
> is
> >> the optimal replication level. So 10 is a reasonable default setting for
> >> most clusters that are not 500+ nodes big.
> >>
> >> Kai
> >>
> >> Am 22.12.2012 um 13:46 schrieb Lin Ma <li...@gmail.com>:
> >>
> >> Thanks Kai, using higher replication count for the purpose of?
> >>
> >> regards,
> >> Lin
> >>
> >> On Sat, Dec 22, 2012 at 8:44 PM, Kai Voigt <k...@123.org> wrote:
> >>>
> >>> Hi,
> >>>
> >>> Am 22.12.2012 um 13:03 schrieb Lin Ma <li...@gmail.com>:
> >>>
> >>> > I want to confirm when on each task node either mapper or reducer
> >>> > access distributed cache file, it resides on disk, not resides in
> memory.
> >>> > Just want to make sure distributed cache file does not fully loaded
> into
> >>> > memory which compete memory consumption with mapper/reducer tasks.
> Is that
> >>> > correct?
> >>>
> >>>
> >>> Yes, you are correct. The JobTracker will put files for the distributed
> >>> cache into HDFS with a higher replication count (10 by default).
> Whenever a
> >>> TaskTracker needs those files for a task it is launching locally, it
> will
> >>> fetch a copy to its local disk. So it won't need to do this again for
> future
> >>> tasks on this node. After a job is done, all local copies and the HDFS
> >>> copies of files in the distributed cache are cleaned up.
> >>>
> >>> Kai
> >>>
> >>> --
> >>> Kai Voigt
> >>> k@123.org
> >>>
> >>>
> >>>
> >>>
> >>
> >>
> >> --
> >> Kai Voigt
> >> k@123.org
> >>
> >>
> >>
> >>
> >
>
>
>
> --
> Harsh J
>

Re: distributed cache

Posted by Harsh J <ha...@cloudera.com>.
Hi Lin,

DistributedCache files are stored on HDFS by the client first. The
TaskTrackers then download and localize them. Therefore, as with any
other file on HDFS, the "downloads" can proceed efficiently in parallel
when there are more replicas.

The point of having higher replication for these files is also tied to
the concept of racks in a cluster - you would want more replicas spread
across racks so that on task bootup the downloads happen with rack
locality.
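
As a rough sketch of that flow with the old mapred-era API (a minimal,
untested example - the file path and class name here are made up for
illustration):

    import java.io.IOException;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;

    public class CacheSketch {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            // Client side: register a file that is already on HDFS; at
            // job submission it is stored with a higher replication count.
            DistributedCache.addCacheFile(
                URI.create("/user/lin/lookup.dat"), conf);

            // Task side (normally inside a Mapper/Reducer): the
            // TaskTracker has already localized the file, so these are
            // local-disk paths and reading them is plain local I/O,
            // not an HDFS read.
            Path[] localCopies = DistributedCache.getLocalCacheFiles(conf);
            if (localCopies != null) {
                System.out.println("local copy: " + localCopies[0]);
            }
        }
    }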

On Sat, Dec 22, 2012 at 6:54 PM, Lin Ma <li...@gmail.com> wrote:
> Hi Kai,
>
> Smart answer! :-)
>
> The assumption you have is one distributed cache replica could only serve
> one download session for tasktracker node (this is why you get concurrency
> n/r). The question is, why one distributed cache replica cannot serve
> multiple concurrent download session? For example, supposing a tasktracker
> use elapsed time t to download a file from a specific distributed cache
> replica, it is possible for 2 tasktrackers to download from the specific
> distributed cache replica in parallel using elapsed time t as well, or 1.5
> t, which is faster than sequential download time 2t you mentioned before?
> "In total, r+n/r concurrent operations. If you optimize r depending on n,
> SRQT(n) is the optimal replication level." -- how do you get SRQT(n) for
> minimize r+n/r? Appreciate if you could point me to more details.
>
> regards,
> Lin
>
>
> On Sat, Dec 22, 2012 at 8:51 PM, Kai Voigt <k...@123.org> wrote:
>>
>> Hi,
>>
>> simple math. Assuming you have n TaskTrackers in your cluster that will
>> need to access the files in the distributed cache. And r is the replication
>> level of those files.
>>
>> Copying the files into HDFS requires r copy operations over the network.
>> The n TaskTrackers need to get their local copies from HDFS, so the n
>> TaskTrackers copy from r DataNodes, so n/r concurrent operation. In total,
>> r+n/r concurrent operations. If you optimize r depending on n, SRQT(n) is
>> the optimal replication level. So 10 is a reasonable default setting for
>> most clusters that are not 500+ nodes big.
>>
>> Kai
>>
>> Am 22.12.2012 um 13:46 schrieb Lin Ma <li...@gmail.com>:
>>
>> Thanks Kai, using higher replication count for the purpose of?
>>
>> regards,
>> Lin
>>
>> On Sat, Dec 22, 2012 at 8:44 PM, Kai Voigt <k...@123.org> wrote:
>>>
>>> Hi,
>>>
>>> Am 22.12.2012 um 13:03 schrieb Lin Ma <li...@gmail.com>:
>>>
>>> > I want to confirm when on each task node either mapper or reducer
>>> > access distributed cache file, it resides on disk, not resides in memory.
>>> > Just want to make sure distributed cache file does not fully loaded into
>>> > memory which compete memory consumption with mapper/reducer tasks. Is that
>>> > correct?
>>>
>>>
>>> Yes, you are correct. The JobTracker will put files for the distributed
>>> cache into HDFS with a higher replication count (10 by default). Whenever a
>>> TaskTracker needs those files for a task it is launching locally, it will
>>> fetch a copy to its local disk. So it won't need to do this again for future
>>> tasks on this node. After a job is done, all local copies and the HDFS
>>> copies of files in the distributed cache are cleaned up.
>>>
>>> Kai
>>>
>>> --
>>> Kai Voigt
>>> k@123.org
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Kai Voigt
>> k@123.org
>>
>>
>>
>>
>



-- 
Harsh J

Re: distributed cache

Posted by Lin Ma <li...@gmail.com>.
Hi Kai,

Smart answer! :-)

   - The assumption you have is that one distributed cache replica can
   only serve one download session for a TaskTracker node (which is why
   you get concurrency n/r). The question is, why can't one distributed
   cache replica serve multiple concurrent download sessions? For
   example, supposing a TaskTracker takes elapsed time t to download a
   file from a specific distributed cache replica, is it possible for 2
   TaskTrackers to download from that replica in parallel in elapsed
   time t as well, or 1.5t, which would be faster than the sequential
   download time 2t you mentioned before?
   - "In total, r+n/r concurrent operations. If you optimize r depending
   on n, SQRT(n) is the optimal replication level." -- how do you get
   SQRT(n) as the minimizer of r+n/r? I'd appreciate it if you could
   point me to more details.

regards,
Lin

On Sat, Dec 22, 2012 at 8:51 PM, Kai Voigt <k...@123.org> wrote:

> Hi,
>
> simple math. Assuming you have n TaskTrackers in your cluster that will
> need to access the files in the distributed cache. And r is the replication
> level of those files.
>
> Copying the files into HDFS requires r copy operations over the network.
> The n TaskTrackers need to get their local copies from HDFS, so the n
> TaskTrackers copy from r DataNodes, so n/r concurrent operation. In total,
> r+n/r concurrent operations. If you optimize r depending on n, SRQT(n) is
> the optimal replication level. So 10 is a reasonable default setting for
> most clusters that are not 500+ nodes big.
>
> Kai
>
> Am 22.12.2012 um 13:46 schrieb Lin Ma <li...@gmail.com>:
>
> Thanks Kai, using higher replication count for the purpose of?
>
> regards,
> Lin
>
> On Sat, Dec 22, 2012 at 8:44 PM, Kai Voigt <k...@123.org> wrote:
>
>> Hi,
>>
>> Am 22.12.2012 um 13:03 schrieb Lin Ma <li...@gmail.com>:
>>
>> > I want to confirm when on each task node either mapper or reducer
>> access distributed cache file, it resides on disk, not resides in memory.
>> Just want to make sure distributed cache file does not fully loaded into
>> memory which compete memory consumption with mapper/reducer tasks. Is that
>> correct?
>>
>>
>> Yes, you are correct. The JobTracker will put files for the distributed
>> cache into HDFS with a higher replication count (10 by default). Whenever a
>> TaskTracker needs those files for a task it is launching locally, it will
>> fetch a copy to its local disk. So it won't need to do this again for
>> future tasks on this node. After a job is done, all local copies and the
>> HDFS copies of files in the distributed cache are cleaned up.
>>
>> Kai
>>
>> --
>> Kai Voigt
>> k@123.org
>>
>>
>>
>>
>>
>
> --
> Kai Voigt
> k@123.org
>
>
>
>
>

Re: distributed cache

Posted by Kai Voigt <k...@123.org>.
Hi,

Simple math. Assume you have n TaskTrackers in your cluster that will need to access the files in the distributed cache, and that r is the replication level of those files.

Copying the files into HDFS requires r copy operations over the network. The n TaskTrackers then need to get their local copies from HDFS, so they copy from the r DataNodes, i.e. roughly n/r rounds of downloads per replica. In total, r + n/r operations. If you optimize r as a function of n (minimizing f(r) = r + n/r: setting f'(r) = 1 - n/r^2 = 0 gives r = SQRT(n)), SQRT(n) is the optimal replication level. So 10 is a reasonable default setting for most clusters that are not 500+ nodes big.
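
If you want to sanity-check that numerically, a throwaway Java snippet
(n is just an assumed cluster size):

    public class ReplicationCost {
        public static void main(String[] args) {
            int n = 100; // assumed number of TaskTrackers
            int bestR = 1;
            double bestCost = Double.MAX_VALUE;
            for (int r = 1; r <= n; r++) {
                // r copies to seed HDFS, then ~n/r waves of downloads
                double cost = r + (double) n / r;
                if (cost < bestCost) { bestCost = cost; bestR = r; }
            }
            // prints: optimal r = 10, sqrt(100) = 10.0
            System.out.println("optimal r = " + bestR
                + ", sqrt(" + n + ") = " + Math.sqrt(n));
        }
    }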

Kai

On 22.12.2012, at 13:46, Lin Ma <li...@gmail.com> wrote:

> Thanks Kai, using higher replication count for the purpose of?
> 
> regards,
> Lin
> 
> On Sat, Dec 22, 2012 at 8:44 PM, Kai Voigt <k...@123.org> wrote:
> Hi,
> 
> Am 22.12.2012 um 13:03 schrieb Lin Ma <li...@gmail.com>:
> 
> > I want to confirm when on each task node either mapper or reducer access distributed cache file, it resides on disk, not resides in memory. Just want to make sure distributed cache file does not fully loaded into memory which compete memory consumption with mapper/reducer tasks. Is that correct?
> 
> 
> Yes, you are correct. The JobTracker will put files for the distributed cache into HDFS with a higher replication count (10 by default). Whenever a TaskTracker needs those files for a task it is launching locally, it will fetch a copy to its local disk. So it won't need to do this again for future tasks on this node. After a job is done, all local copies and the HDFS copies of files in the distributed cache are cleaned up.
> 
> Kai
> 
> --
> Kai Voigt
> k@123.org
> 
> 
> 
> 
> 

-- 
Kai Voigt
k@123.org





Re: distributed cache

Posted by Lin Ma <li...@gmail.com>.
Thanks Kai - using a higher replication count for what purpose?

regards,
Lin

On Sat, Dec 22, 2012 at 8:44 PM, Kai Voigt <k...@123.org> wrote:

> Hi,
>
> Am 22.12.2012 um 13:03 schrieb Lin Ma <li...@gmail.com>:
>
> > I want to confirm when on each task node either mapper or reducer access
> distributed cache file, it resides on disk, not resides in memory. Just
> want to make sure distributed cache file does not fully loaded into memory
> which compete memory consumption with mapper/reducer tasks. Is that correct?
>
>
> Yes, you are correct. The JobTracker will put files for the distributed
> cache into HDFS with a higher replication count (10 by default). Whenever a
> TaskTracker needs those files for a task it is launching locally, it will
> fetch a copy to its local disk. So it won't need to do this again for
> future tasks on this node. After a job is done, all local copies and the
> HDFS copies of files in the distributed cache are cleaned up.
>
> Kai
>
> --
> Kai Voigt
> k@123.org
>
>
>
>
>

Re: distributed cache

Posted by Kai Voigt <k...@123.org>.
Hi,

On 22.12.2012, at 13:03, Lin Ma <li...@gmail.com> wrote:

> I want to confirm when on each task node either mapper or reducer access distributed cache file, it resides on disk, not resides in memory. Just want to make sure distributed cache file does not fully loaded into memory which compete memory consumption with mapper/reducer tasks. Is that correct?


Yes, you are correct. The JobTracker will put files for the distributed cache into HDFS with a higher replication count (10 by default). Whenever a TaskTracker needs those files for a task it is launching locally, it will fetch a copy to its local disk. So it won't need to do this again for future tasks on this node. After a job is done, all local copies and the HDFS copies of files in the distributed cache are cleaned up.
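
(Side note: I believe the 10 comes from the mapred.submit.replication
property, which covers the job submission files including the
distributed cache payload - a minimal sketch, please verify against
your release:)

    import org.apache.hadoop.mapred.JobConf;

    public class SubmitReplication {
        public static void main(String[] args) {
            JobConf conf = new JobConf();
            // Replication for job submission files; 10 is the default.
            conf.setInt("mapred.submit.replication", 10);
            System.out.println(conf.getInt("mapred.submit.replication", 10));
        }
    }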

Kai

-- 
Kai Voigt
k@123.org




