Posted to mapreduce-user@hadoop.apache.org by Marcos Ortiz <ml...@uci.cu> on 2012/02/26 15:10:27 UTC

Re: Query Regarding design MR job for Billing

Well, first, you can design 6 MR jobs:
1- for 5 mins interval
2- for 1 hour
3- for 1 day
4- for 1 month
5- for 1 year
6- and a last for any interval

If each interval needs a different calculation, this could be one
solution (at least I think so).
You can read the MapReduce "design patterns" proposed by Jimmy Lin and
Chris Dyer in their book "Data-Intensive Text Processing with
MapReduce".
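
To make the idea of one job per interval a bit more concrete, here is a rough sketch (plain Python, only to show the arithmetic, not Hadoop code) of how a usage span could be divided among the jobs, following the 40-day example quoted below. The interval names, the 30-day month and the 365-day year are my assumptions for illustration; your billing rules may define them differently.

```python
# Interval lengths in minutes, largest first. The 30-day month and 365-day
# year are assumptions for illustration; real billing calendars may differ.
INTERVALS = [
    ("year", 365 * 24 * 60),
    ("month", 30 * 24 * 60),
    ("day", 24 * 60),
    ("hour", 60),
    ("five_min", 5),
]

def decompose(total_minutes):
    """Greedily split a usage span into the largest billing intervals."""
    counts = {}
    remaining = total_minutes
    for name, length in INTERVALS:
        n, remaining = divmod(remaining, length)
        if n:
            counts[name] = n
    # Anything shorter than 5 minutes is left for the "any interval" job.
    if remaining:
        counts["other_minutes"] = remaining
    return counts

print(decompose(40 * 24 * 60))  # 40 days
print(decompose(2 * 60 + 10))   # 2 hours 10 minutes
```

Each key in the result would correspond to one of the jobs in the list above, and the count says how many units of that interval the job has to bill.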

Regards


On 02/27/2012 05:39 AM, Stuti Awasthi wrote:
> No. The data will be at either a 5-minute interval, a 1-hour interval, a 1-day interval, and so on.
> So suppose utilization is for 40 days; then I will charge 30 days with the month billing job and the remaining 10 days with the day billing job.
>
> -----Original Message-----
> From: Rohit Kelkar [mailto:rohitkelkar@gmail.com]
> Sent: Monday, February 27, 2012 4:06 PM
> To: mapreduce-user@hadoop.apache.org
> Subject: Re: Query Regarding design MR job for Billing
>
> Just trying to understand your use case:
> you need an hour job to run on data between 6:40 AM and 7:40 AM. Would it be like a moving window? For example, run hour jobs on
> 6:41 AM to 7:41 AM
> 6:42 AM to 7:42 AM
> and so on...
>
>
> On Mon, Feb 27, 2012 at 1:01 PM, Stuti Awasthi<st...@hcl.com>  wrote:
>> Hi all,
>>
>> I have to implement a BillingEngine using MR jobs. My use case is like this:
>> I will have data files of the format <TimeStamp> <Information for Billing>.
>> These data files will contain timestamps at either a minute interval, hour interval, day interval, month interval, or year interval. Every type of interval will have a different type of billing calculation, so basically there will be different jobs for every type of interval.
>>
>> Suppose I have a data file which contains minute-interval timestamps. My scenario is that if data is present for whole hours, then it should be processed by the hourly job, and the remainder should be processed by the minute job.
>>
>> Example :
>>
>> 2/10/12 6:40 AM <data for billing>
>> 2/10/12 6:40 AM <data for billing>
>> .
>> 2/10/12 6:45 AM <data for billing>
>> 2/10/12 6:45 AM <data for billing>
>> .
>> .
>> 2/10/12 7:40 AM <data for billing>
>> 2/10/12 7:40 AM <data for billing>
>> .
>> .
>> 2/10/12 7:45 AM <data for billing>
>> 2/10/12 7:45 AM <data for billing>
>> .
>>
>> Now I want the data between 2/10/12 6:40 AM and 2/10/12 7:40 AM to be processed by the Hourjob, and 2/10/12 7:45 AM to be processed by the MinuteJob.
>> Please suggest how to design my MR jobs to achieve this.
>>
>> Thanks
>> Stuti


-- 
Marcos Luis Ortíz Valmaseda
  Senior Software Engineer (UCI)
  http://marcosluis2186.posterous.com
  http://www.linkedin.com/in/marcosluis2186
  Twitter: @marcosluis2186



End the injustice, FREEDOM NOW FOR OUR FIVE COMPATRIOTS WHO ARE UNJUSTLY HELD IN US PRISONS!
http://www.antiterroristas.cu
http://justiciaparaloscinco.wordpress.com

Re: Query Regarding design MR job for Billing

Posted by Marcos Ortiz <ml...@uci.cu>.
On 02/27/2012 11:33 PM, Stuti Awasthi wrote:
> Hi Marcos,
>
> Thanks for the pointers. I am also thinking along similar lines.
> I have a doubt on one point:
>
> I will have separate data files for every interval. For example, suppose I have a 5-minute interval file which contains data for 2 hours and 10 minutes. In this scenario I want to process the 2 hours of data with the hours job and the 10 minutes of data with the mins job. Since I will provide my data file as input to the MR jobs, I think the original file needs to be split into 2 files: HourFile and MinsFile. HourFile will contain the data for 2 hours and MinsFile will contain the data for 10 minutes.
Well, you can do that with Oozie (http://yahoo.github.com/oozie/) or
Cascading (http://cascading.org), which are meant for complex workflow
programming.
1- For example, you can write a MapReduce job to split your data: one
output by hour, and one by minutes. In your case, the output would
simply be one data file containing your 2 hours of data, and another
data file for your 10 minutes. I think this job could be a Mapper-only
job using MultipleOutputFormat.

2- Then you can write the different jobs for each interval
(HourIntervalJob, MonthIntervalJob, etc.), splitting their outputs by
interval in HDFS.
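
As a rough sketch of the per-record decision in step 1 (plain Python, only to show the logic; in a real Mapper-only job this check would sit inside map() and choose the named output): I am assuming here that the first and last timestamps of the file are already known, for example passed in through the job configuration, that the timestamps look like "2/10/12 6:40 AM" in month/day/year order, and that the cutoff is inclusive as in your 6:40 AM to 7:40 AM example. The function names are hypothetical.

```python
from datetime import datetime, timedelta

TS_FORMAT = "%m/%d/%y %I:%M %p"  # assumed layout of "2/10/12 6:40 AM"

def parse_ts(text):
    """Parse a timestamp string in the assumed layout."""
    return datetime.strptime(text, TS_FORMAT)

def route(ts, start, end):
    """Return the output name a record with timestamp ts would go to.

    start and end are the first and last timestamps in the file
    (assumed to be known before the job runs).
    """
    whole_hours = (end - start) // timedelta(hours=1)
    if whole_hours == 0:
        # Less than one full hour of data: everything goes to the mins job.
        return "MinsFile"
    cutoff = start + timedelta(hours=whole_hours)
    # Inclusive cutoff, following the example in the original question.
    return "HourFile" if ts <= cutoff else "MinsFile"

start = parse_ts("2/10/12 6:40 AM")
end = parse_ts("2/10/12 7:45 AM")
for stamp in ["2/10/12 6:40 AM", "2/10/12 7:40 AM", "2/10/12 7:45 AM"]:
    print(stamp, "->", route(parse_ts(stamp), start, end))
```

Because each record is routed on its own, the file is read only once and no reducer is needed, which should help with the I/O concern. Finding the first and last timestamps may still need a cheap earlier pass if they are not known in advance.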

You can define your complete workflow and then evaluate Oozie or
Cascading to control it.
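
By "defining the workflow" I mean something like writing down which job depends on which. A minimal sketch in plain Python (this is not the Oozie or Cascading API, just an illustration of the dependency graph; SplitJob and MinsIntervalJob are hypothetical names, and graphlib needs Python 3.9 or later):

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs that must finish before it starts.
# SplitJob and MinsIntervalJob are hypothetical names for illustration.
WORKFLOW = {
    "SplitJob": set(),
    "HourIntervalJob": {"SplitJob"},
    "MinsIntervalJob": {"SplitJob"},
}

def run_order(workflow):
    """Return one valid execution order for the job graph."""
    return list(TopologicalSorter(workflow).static_order())

order = run_order(WORKFLOW)
print(order)
```

The split job always comes first; the two interval jobs do not depend on each other, so a workflow engine would be free to run them in parallel.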
Regards

Remember that all of these are suggestions. I'm not an MR expert.

>
> I have achieved the file splitting with a simple Java class, but I think there are too many I/O operations. If I can achieve this in MR or in some other efficient way, it will be good, because the original data files can be huge and the initial breaking of files will itself take too much time.
>
> Please suggest.
> Thanks

-- 
Marcos Luis Ortíz Valmaseda
  Sr. Software Engineer (UCI)
  http://marcosluis2186.posterous.com
  http://postgresql.uci.cu/blog/38




RE: Query Regarding design MR job for Billing

Posted by Stuti Awasthi <st...@hcl.com>.
Hi Marcos, 

Thanks for the pointers. I am also thinking along similar lines.
I have a doubt on one point:

I will have separate data files for every interval. For example, suppose I have a 5-minute interval file which contains data for 2 hours and 10 minutes. In this scenario I want to process the 2 hours of data with the hours job and the 10 minutes of data with the mins job. Since I will provide my data file as input to the MR jobs, I think the original file needs to be split into 2 files: HourFile and MinsFile. HourFile will contain the data for 2 hours and MinsFile will contain the data for 10 minutes.

I have achieved the file splitting with a simple Java class, but I think there are too many I/O operations. If I can achieve this in MR or in some other efficient way, it will be good, because the original data files can be huge and the initial breaking of files will itself take too much time.

Please suggest.
Thanks
