Posted to mapreduce-user@hadoop.apache.org by MONTMORY Alain <al...@thalesgroup.com> on 2011/09/08 10:58:07 UTC

How to count records in FileInputFormat (MapFile, SequenceFile ?)

Hi everybody,

In my application, processing the whole dataset (which we call a CycleWorkflow) may take several weeks, so we must split each CycleWorkflow into multiple DayWorkflows.
The current system uses a traditional RDBMS approach and relies on SQL OFFSET/LIMIT to split the dataset (potentially 80 billion rows in one table) into smaller datasets. We are now (re)designing the application on Hadoop/Cascading/HDFS and are looking for a way to split an input MapFile (with 80 billion keys) by record count, for example:
CycleDataset = 10000 <key,value> records
Split1 = records 0 to 1000     -> dataset for DayWorkflow1
Split2 = records 1000 to 2000  -> dataset for DayWorkflow2
Split3 = records 2000 to 3000  -> dataset for DayWorkflow3
etc. (see the sketch just below)
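
For illustration, here is a minimal sketch of this kind of record-count split, done as one sequential pass over a SequenceFile (a MapFile's data file is itself a SequenceFile). The paths and the records-per-split constant are assumptions for the example only:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// Sketch: split one big SequenceFile into fixed-size record chunks.
public class SplitByRecordCount {
  static final long RECORDS_PER_SPLIT = 1000L; // illustrative value

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path input = new Path("/data/cycle.seq"); // hypothetical input path

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, input, conf);
    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);

    long n = 0;
    int split = 0;
    SequenceFile.Writer writer = null;
    while (reader.next(key, value)) {
      if (n % RECORDS_PER_SPLIT == 0) { // start the next day-sized chunk
        if (writer != null) writer.close();
        Path out = new Path("/data/splits/day-" + split++ + ".seq"); // hypothetical output
        writer = SequenceFile.createWriter(fs, conf, out,
            reader.getKeyClass(), reader.getValueClass());
      }
      writer.append(key, value);
      n++;
    }
    if (writer != null) writer.close();
    reader.close();
  }
}

A single pass like this avoids any OFFSET-style seeking, but it still reads the whole file once; for 80 billion records the rewrite itself would have to be distributed.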
We expect to use MapFile. Is it a good choice, or is there another existing file format (HFile?) more suitable for this usage?
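
For context on the format: a MapFile is a sorted SequenceFile plus a sparse index, designed for keyed random access rather than positional access. A minimal lookup sketch (the path and the Text key/value types are assumptions for the example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

// Sketch: keyed lookup in a MapFile (sorted data + index).
public class MapFileLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    MapFile.Reader reader = new MapFile.Reader(fs, "/data/cycle.map", conf); // hypothetical path
    Text value = new Text();
    if (reader.get(new Text("someKey"), value) != null) {
      System.out.println("found: " + value);
    }
    reader.close();
  }
}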
Progress monitoring (a bar graph in a GUI) for each DayWorkflowX is currently based on the number of records already processed by that workflow.

I can't find an API in the MapFile class to count records efficiently. Am I missing something? (Probably yes.)
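
For context: a MapFile directory holds a "data" SequenceFile plus an "index", and neither stores a total record count, so the generic fallback is a full scan of the data file. A minimal counting sketch (the input path is an assumption for the example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// Sketch: count the records in a MapFile by scanning its data file.
public class CountMapFileRecords {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path data = new Path("/data/cycle.map", MapFile.DATA_FILE_NAME); // hypothetical MapFile dir

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
    long count = 0;
    while (reader.next(key, value)) {
      count++; // one full O(n) pass; the file format offers no shortcut
    }
    reader.close();
    System.out.println("records: " + count);
  }
}

At 80 billion records a single-process scan like this is clearly impractical; a distributed alternative is a trivial MapReduce job over the file, reading the total from the framework's built-in "Map input records" counter.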
Which solution would you suggest for splitting the input dataset according to the number of <key,value> records?

Thank you for your responses.

Regards,

Alain
(We are using a Hadoop 0.20.xx version.)



RE: How to count records in FileInputFormat (MapFile, SequenceFile ?)

Posted by MONTMORY Alain <al...@thalesgroup.com>.
Does anyone have any advice? Or am I asking in the wrong place?
Or maybe my question is stupid...
Thank you

Alain


From: MONTMORY Alain [mailto:alain.montmory@thalesgroup.com]
Sent: Thursday, 8 September 2011 10:58
To: mapreduce-user@hadoop.apache.org
Subject: How to count records in FileInputFormat (MapFile, SequenceFile ?)
