Posted to common-user@hadoop.apache.org by Vitaliy Semochkin <vi...@gmail.com> on 2010/07/22 13:56:40 UTC

Distributed Updateable Cache

Hi,

I need to do calculations that would benefit from storing information in
a distributed updateable cache.
What are the best practices for this in Hadoop?

PS
In case there is no good solution for my problem, here are the details and
the ideas I have.
I'm going to count the unique visitors of a site several times per day (every
5 minutes). For that I will need a distributed cache, accessible from all
mappers, that stores the visitors already counted.

My plan is:
- store the unique visitors in a file on HDFS
- each time a mapper JVM starts, load that file into a HashSet in that JVM
  (I use mapred.job.reuse.jvm.num.tasks=-1)
- after each map/reduce job, append the newly seen visitors to this file
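
The plan above can be sketched roughly as follows. This is only a minimal
sketch under stated assumptions, not a definitive implementation: the class
VisitorCache and its methods load, markIfNew and flush are hypothetical names,
and a local file read through java.nio stands in for the HDFS file (a real job
would presumably go through Hadoop's FileSystem API instead). The set is
static so that it survives across tasks sharing one reused JVM.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Hadoop. A local file stands in for HDFS.
class VisitorCache {
    // Static, so the set is shared by all tasks that run in one reused JVM.
    private static Set<String> seen;
    // Visitors first seen by this JVM since the last flush.
    private static final List<String> added = new ArrayList<>();

    // Load the known visitors once per JVM; later calls are no-ops.
    static synchronized void load(Path file) {
        if (seen != null) {
            return;
        }
        seen = new HashSet<>();
        if (Files.exists(file)) {
            try {
                seen.addAll(Files.readAllLines(file, StandardCharsets.UTF_8));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    // Returns true only the first time this JVM sees the given visitor.
    static synchronized boolean markIfNew(String visitorId) {
        if (seen == null) {
            throw new IllegalStateException("call load() first");
        }
        boolean isNew = seen.add(visitorId);
        if (isNew) {
            added.add(visitorId);
        }
        return isNew;
    }

    // Append the newly seen visitors to the file; returns how many were written.
    static synchronized int flush(Path file) {
        try {
            Files.write(file, added, StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        int written = added.size();
        added.clear();
        return written;
    }

    public static void main(String[] args) {
        Path file = Paths.get(System.getProperty("java.io.tmpdir"),
                "visitors-" + System.nanoTime() + ".txt");
        load(file);
        for (String id : new String[] {"alice", "bob", "alice"}) {
            markIfNew(id);
        }
        System.out.println("new visitors appended: " + flush(file));
    }
}
```

Note that in this sketch each JVM only knows about its own additions, so two
mappers on different nodes can both report the same new visitor within one
job. The results would still need to be deduplicated (for example in the
reducer) before the file is updated.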

Any criticism and advice are welcome :-)

Regards,
Vitaliy S

Re: Distributed Updateable Cache

Posted by Arun C Murthy <ac...@yahoo-inc.com>.

HBase? Memcached?

On Jul 22, 2010, at 4:56 AM, Vitaliy Semochkin wrote:

> Hi,
>
> I need to do calculations that would benefit from storing information in
> distributed updateable cache.
> What are best practices for such things in hadoop?
>
> PS
> In case there is no good solution for my problem, here are details and ideas
> I have.
> I'm going to count unique visitors of a site several times per day(each 5
> mins), for that I will need distributed cache that will be accessible from
> all mappers to store already counted visitors.
>
> My plan is:
> store unique visitors in a file on hdfs
> each time mapper jvm starts  store in HashSet in each jvm (I
> use mapred.job.reuse.jvm.num.tasks=-1)
> after each map/reduce job add additional data to this file
>
> any critics and advises are welcome :-)
>
> Regards,
> Vitaliy S