Posted to common-user@hadoop.apache.org by Tarandeep Singh <ta...@gmail.com> on 2008/11/11 19:56:41 UTC

Caching data selectively on slaves

Hi,

Is it possible to cache data selectively on slave machines?

Let's say I have data partitioned as D1, D2, and so on. D1 is required by
reducer R1, D2 by R2, and so on. I know this beforehand because
HashPartitioner.getPartition was used to partition the data.
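
For concreteness, the assignment of a key to a partition can be recomputed
outside the job with the same logic HashPartitioner applies; a minimal
sketch (old mapred API; the key value and reducer count are illustrative):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.HashPartitioner;

public class PartitionProbe {
    public static void main(String[] args) {
        HashPartitioner<Text, Text> partitioner = new HashPartitioner<Text, Text>();
        int numReduceTasks = 4; // assumed number of reducers
        Text key = new Text("some-key");
        // HashPartitioner computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks,
        // so the same call tells you which Di will reach which Ri.
        int partition = partitioner.getPartition(key, new Text(), numReduceTasks);
        System.out.println("key maps to partition/reducer " + partition);
    }
}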

If I put D1, D2, etc. in the distributed cache, the data is copied to all
machines. Is it possible to cache the data selectively, on specific machines only?
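
For reference, a minimal sketch of the DistributedCache call in question
(old mapred API; the /data/D1 and /data/D2 paths are illustrative). Every
URI added this way is localized on every node that runs a task for the job,
which is the all-machines copy described above:

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

public class CacheSetup {
    public static void addSideData(JobConf conf) throws Exception {
        DistributedCache.addCacheFile(new URI("/data/D1"), conf);
        DistributedCache.addCacheFile(new URI("/data/D2"), conf);
        // Both files are copied to every slave, regardless of which
        // reducer actually needs them.
    }
}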

Thanks,
Taran

Re: Caching data selectively on slaves

Posted by Tarandeep Singh <ta...@gmail.com>.
Hi Lohit,

I thought of keeping the data on DFS and reading it from there, but storing
the data on DFS turns out to be expensive:

1) The data is replicated across the cluster.
2) While reading, reducer Ri may not have data Di on the same machine, so a
remote DFS read will occur.

That was the reason I thought of selectively caching the data on the
respective machines.
Thanks also for the tip that I should keep the read time to a minimum or the
reducers might time out; I will keep that in mind.
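
Concretely, the DFS-read approach might look like the sketch below (old
mapred API). The /data/D<i> naming is an assumption, and the
"mapred.task.partition" property is used on the assumption that it holds
this task's partition number:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SelectiveReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    private FSDataInputStream sideData;

    public void configure(JobConf conf) {
        try {
            // This task's partition number is the i in Ri, so open only Di.
            int partition = conf.getInt("mapred.task.partition", -1);
            FileSystem fs = FileSystem.get(conf);
            sideData = fs.open(new Path("/data/D" + partition));
        } catch (IOException e) {
            throw new RuntimeException("could not open side data", e);
        }
    }

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Report progress during any long side-data read so the task
        // does not time out, per the warning above.
        reporter.progress();
        // ... combine sideData with the incoming values ...
    }
}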

-Taran

On Tue, Nov 11, 2008 at 12:33 PM, lohit <lo...@yahoo.com> wrote:

> DistributedCache would copy the cache data to all nodes. If you know the
> mapping of R* to D*, how about each reducer reading from DFS only the D it
> expects? The distributed cache only helps when the data is used by multiple
> tasks on the same node, in which case you avoid hitting DFS multiple times.
> If each 'D' is read by exactly one 'R', you are not buying much with
> DistributedCache. Also keep in mind that if your read takes a long time,
> your reducers might time out for failing to report status.
>
> Thanks,
> Lohit

Re: Caching data selectively on slaves

Posted by lohit <lo...@yahoo.com>.
DistributedCache would copy the cache data to all nodes. If you know the
mapping of R* to D*, how about each reducer reading from DFS only the D it
expects? The distributed cache only helps when the data is used by multiple
tasks on the same node, in which case you avoid hitting DFS multiple times.
If each 'D' is read by exactly one 'R', you are not buying much with
DistributedCache. Also keep in mind that if your read takes a long time,
your reducers might time out for failing to report status.
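
A small sketch of the case where the cache does pay off: several tasks on
the same node read one localized copy from local disk instead of each
hitting DFS (old mapred API; the file selection is simplified and
illustrative):

import java.io.IOException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class LocalCacheRead {
    public static Path firstCachedFile(JobConf conf) throws IOException {
        // These paths live on the TaskTracker's local filesystem; every
        // task on this node reads the same local copy.
        Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
        return (localFiles != null && localFiles.length > 0)
                ? localFiles[0] : null;
    }
}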

Thanks,
Lohit


