Posted to hdfs-user@hadoop.apache.org by Georgi Ivanov <iv...@vesseltracker.com> on 2014/09/19 10:17:15 UTC

Re-sampling time data with MR job. Ideas

Hello,
I have time-related data like this:
entity_id, timestamp, data

The resolution of the data is roughly 5 seconds.
I want to extract the data at 10-minute resolution.

So what I can do is:
Just emit everything in the mapper, as the data is not sorted there.
Emit only one record every 10 minutes from the reducer. The reducer receives
the data sorted by the (entity_id, timestamp) pair (secondary sort).

This will work fine, but it will take forever, since I have to process
TBs of data.
Also, the data sent to the reducers will be huge (as I am not filtering
in the map phase at all), and the number of reducers is much smaller than
the number of mappers.

Are there any better ideas for how to do this?

Georgi

Re: Re-sampling time data with MR job. Ideas

Posted by Mirko Kämpf <mi...@gmail.com>.

I would only change the time resolution:

1 , 2014-01-01 12:13:02
1 , 2014-01-01 12:13:03
...
2 , 2014-01-01 12:23:04
2 , 2014-01-01 12:24:05

==>

1 , 2014-01-01 12:10:00
1 , 2014-01-01 12:10:00
..
2 , 2014-01-01 12:20:00
2 , 2014-01-01 12:20:00

It is all about selecting the right (k,v) types going out of the mapper.
And this depends on what you really want to do. If transforming the
timestamp is the only task, then a map-only job will also work.

This is just a transformation of each individual data point. The resolution
goes from 5 s to 10 min. No ordering is needed in this case.
It works even if a data point with an earlier timestamp comes in later.
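
A minimal Python sketch of that per-record transformation, assuming the timestamp has already been parsed into a datetime. The names floor_to_bucket and bucket_minutes are made up for illustration and are not part of Hadoop; the bucket size is assumed to divide 60 evenly.

```python
from datetime import datetime, timedelta

def floor_to_bucket(ts: datetime, bucket_minutes: int = 10) -> datetime:
    """Round a timestamp down to the start of its bucket within the hour."""
    # Drop seconds and microseconds, then step back to the bucket boundary.
    truncated = ts.replace(second=0, microsecond=0)
    return truncated - timedelta(minutes=ts.minute % bucket_minutes)

# Each record is handled on its own, so input order does not matter.
print(floor_to_bucket(datetime(2014, 1, 1, 12, 13, 2)))  # 2014-01-01 12:10:00
print(floor_to_bucket(datetime(2014, 1, 1, 12, 24, 5)))  # 2014-01-01 12:20:00
```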

If aggregation is used, then it happens later in the reducer, or maybe
already in the combiner, but here you have to think about the right types
for the MapOutputKey.
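
One possible shape for such a key, sketched in plain Python rather than as Hadoop Writable classes. It assumes the wanted "aggregate" is the earliest record per entity and 10-minute bucket, which is only one reading of the question. BucketKey, keep_earliest and combine are hypothetical names. Because taking the minimum is associative and commutative, the same merge can run in a combiner and again in the reducer.

```python
from typing import NamedTuple

class BucketKey(NamedTuple):
    """Hypothetical composite map output key: entity plus 10-minute bucket."""
    entity_id: int
    bucket: str  # bucket start, e.g. "2014-01-01 12:10:00"

def keep_earliest(a, b):
    """Merge two (timestamp, data) values by keeping the earlier one."""
    return a if a <= b else b

def combine(pairs):
    """Apply keep_earliest per key, as a combiner or a reducer would."""
    out = {}
    for key, value in pairs:
        out[key] = keep_earliest(out[key], value) if key in out else value
    return out

k = BucketKey(1, "2014-01-01 12:10:00")
split_a = [(k, ("2014-01-01 12:13:05", "c")), (k, ("2014-01-01 12:13:02", "b"))]
split_b = [(k, ("2014-01-01 12:13:01", "a"))]

# Combine each split first, then merge the partial results.
partial = list(combine(split_a).items()) + list(combine(split_b).items())
result = combine(partial)
```

Timestamps are kept as "YYYY-MM-DD HH:MM:SS" strings here because that format sorts chronologically as plain text.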

Cheers,
Mirko




2014-09-19 10:06 GMT+01:00 Georgi Ivanov <iv...@vesseltracker.com>:

>  Hi Mirko,
> Thanks for the reply.
>
> Let's assume I have a record every second for each entity.
>
> entity_id | timestamp | data
>
> 1 , 2014-01-01 12:13:01 - I want this
> ... some more for different entities
> 1 , 2014-01-01 12:13:02
> 1 , 2014-01-01 12:13:03
> 1 , 2014-01-01 12:13:04
> 1 , 2014-01-01 12:13:05
> ........
> 1 , 2014-01-01 12:23:01 - I want this
> 1 , 2014-01-01 12:23:02
>
> The problem is that in reality the data does not arrive sorted by
> (entity_id, timestamp), so I can't filter in the mapper.
> Each mapper will get different entity_ids, depending on the input split.
>
>
>
> Georgi

Re: Re-sampling time data with MR job. Ideas

Posted by Georgi Ivanov <iv...@vesseltracker.com>.

Hi Mirko,
Thanks for the reply.

Let's assume I have a record every second for each entity.

entity_id | timestamp | data

1 , 2014-01-01 12:13:01 - I want this
... some more for different entities
1 , 2014-01-01 12:13:02
1 , 2014-01-01 12:13:03
1 , 2014-01-01 12:13:04
1 , 2014-01-01 12:13:05
........
1 , 2014-01-01 12:23:01 - I want this
1 , 2014-01-01 12:23:02

The problem is that in reality the data does not arrive sorted by
(entity_id, timestamp), so I can't filter in the mapper.
Each mapper will get different entity_ids, depending on the input split.
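
For what it is worth, unsorted input does not rule out filtering inside a mapper if the mapper keeps a little state. The sketch below mimics a mapper's map()/cleanup() lifecycle in plain Python. It assumes timestamps as epoch seconds, buckets aligned to fixed 10-minute boundaries (not measured from each entity's first record), and enough memory to hold one entry per (entity, bucket) seen in the split. ResampleMapper is a hypothetical name. A reducer would still have to merge the results, since one entity's bucket can be spread over several splits.

```python
class ResampleMapper:
    """Keeps only the earliest record per (entity, bucket) within one split."""

    def __init__(self, bucket_seconds=600):
        self.bucket_seconds = bucket_seconds
        self.earliest = {}  # (entity_id, bucket_index) -> (epoch_seconds, data)

    def map(self, entity_id, epoch_seconds, data):
        key = (entity_id, epoch_seconds // self.bucket_seconds)
        best = self.earliest.get(key)
        # Replace the stored record only if this one is strictly earlier.
        if best is None or epoch_seconds < best[0]:
            self.earliest[key] = (epoch_seconds, data)

    def cleanup(self):
        # Emit once per key after the whole split has been read.
        for (entity_id, _), (epoch_seconds, data) in sorted(self.earliest.items()):
            yield entity_id, epoch_seconds, data

def run(records):
    mapper = ResampleMapper()
    for record in records:
        mapper.map(*record)
    return list(mapper.cleanup())

# Records deliberately out of order and mixed across entities.
records = [(1, 785, "c"), (2, 30, "x"), (1, 781, "a"), (1, 1381, "d"), (1, 783, "b")]
print(run(records))  # [(1, 781, 'a'), (1, 1381, 'd'), (2, 30, 'x')]
```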



Georgi

On 19.09.2014 10:34, Mirko Kämpf wrote:
> Hi Georgi,
>
> I would already emit the new timestamp (with 10-minute resolution) in
> the mapper. This allows you to (pre)aggregate the data in the mapper,
> so you have less traffic during the shuffle & sort stage.
> Does changing the resolution mean that you have to aggregate the
> individual entities, or do you still need all individual entities and
> just want to translate the timestamp to another resolution (5 s => 10 min)?
>
> Cheers,
> Mirko

Re: Re-sampling time data with MR job. Ideas

Posted by Mirko Kämpf <mi...@gmail.com>.

Hi Georgi,

I would already emit the new timestamp (with 10-minute resolution) in the
mapper. This allows you to (pre)aggregate the data in the mapper, so
you have less traffic during the shuffle & sort stage. Does changing the
resolution mean that you have to aggregate the individual entities, or do
you still need all individual entities and just want to translate the
timestamp to another resolution (5 s => 10 min)?
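
A rough back-of-the-envelope sketch of how much shuffle traffic pre-aggregation could save, assuming each mapper emits at most one record per (entity, bucket) it sees. The function name and all the numbers (1000 entities, one day of data, 50 mappers) are invented for illustration and are not from the thread.

```python
def shuffle_records_upper_bound(input_records, entities, buckets, mappers):
    """Upper bound on map output records with one record per (entity, bucket) per mapper."""
    return min(input_records, entities * buckets * mappers)

# 10 minutes of 5-second samples collapse into one record in the best case.
samples_per_bucket = 600 // 5

# Hypothetical workload: 1000 entities, one day of 5-second data, 50 mappers.
entities = 1000
input_records = entities * (86400 // 5)
buckets = 86400 // 600

best_case = entities * buckets  # every key is confined to a single split
worst_case = shuffle_records_upper_bound(input_records, entities, buckets, 50)
print(input_records, best_case, worst_case)
```

How close a real job gets to the best case depends on how each entity's records are spread over the input splits.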

Cheers,
Mirko




2014-09-19 9:17 GMT+01:00 Georgi Ivanov <iv...@vesseltracker.com>:

> Hello,
> I have time-related data like this:
> entity_id, timestamp, data
>
> The resolution of the data is roughly 5 seconds.
> I want to extract the data at 10-minute resolution.
>
> So what I can do is:
> Just emit everything in the mapper, as the data is not sorted there.
> Emit only one record every 10 minutes from the reducer. The reducer receives
> the data sorted by the (entity_id, timestamp) pair (secondary sort).
>
> This will work fine, but it will take forever, since I have to process
> TBs of data.
> Also, the data sent to the reducers will be huge (as I am not filtering
> in the map phase at all), and the number of reducers is much smaller than
> the number of mappers.
>
> Are there any better ideas for how to do this?
>
> Georgi
>
