Posted to mapreduce-dev@hadoop.apache.org by Suresh S <su...@gmail.com> on 2014/02/07 17:20:13 UTC
How to process part of a file in Hadoop?
Dear Friends,
I have a very large file in HDFS with 3000+ blocks.
I want to run a job with varying input sizes, using the same file as the
input each time. Usually the number of tasks equals the number of blocks/splits.
Suppose a job with 2 tasks needs to process any two randomly chosen blocks of
the given input file.
How can I give a random set of HDFS blocks as the input of a job?
Note: my aim is not to process the input file to produce some output.
I want to replicate individual blocks based on the load.
*Regards*
*S.Suresh,*
*Research Scholar,*
*Department of Computer Applications,*
*National Institute of Technology,*
*Tiruchirappalli - 620015.*
*+91-9941506562*
Re: How to process part of a file in Hadoop?
Posted by Harsh J <ha...@cloudera.com>.
You can write a custom InputFormat whose #getSplits(...) returns your
required InputSplit objects (with randomised offsets + lengths, etc.).
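The split-selection logic described above — carve the file into block-aligned (offset, length) ranges and pick a random subset — can be sketched in plain Java without the Hadoop classes. In a real InputFormat, each chosen range would be wrapped in a FileSplit and returned from getSplits(JobContext); the class and method names below are illustrative, not part of any Hadoop API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch of the split selection a custom InputFormat#getSplits(...)
// could perform: enumerate block-aligned (offset, length) ranges of
// the input file, shuffle them, and keep a random subset. In Hadoop,
// each Range would become a FileSplit(path, offset, length, hosts).
public class RandomSplitSketch {

    // One (offset, length) range within the input file.
    static final class Range {
        final long offset, length;
        Range(long offset, long length) { this.offset = offset; this.length = length; }
        @Override public String toString() { return offset + "+" + length; }
    }

    // All block-aligned ranges for a file of the given size;
    // the last range may be shorter than a full block.
    static List<Range> blockRanges(long fileLen, long blockSize) {
        List<Range> ranges = new ArrayList<>();
        for (long off = 0; off < fileLen; off += blockSize) {
            ranges.add(new Range(off, Math.min(blockSize, fileLen - off)));
        }
        return ranges;
    }

    // Pick numSplits ranges at random, as a randomised getSplits(...) would.
    static List<Range> randomSplits(long fileLen, long blockSize, int numSplits, long seed) {
        List<Range> ranges = blockRanges(fileLen, blockSize);
        Collections.shuffle(ranges, new Random(seed));
        return ranges.subList(0, Math.min(numSplits, ranges.size()));
    }

    public static void main(String[] args) {
        // e.g. a 1 GiB file with 128 MiB blocks -> 8 blocks, choose 2 at random.
        long mib = 1L << 20;
        for (Range r : randomSplits(1024 * mib, 128 * mib, 2, 42L)) {
            System.out.println(r);
        }
    }
}
```

Because the map task count follows the number of splits returned, returning only two ranges here yields a two-task job over the same large file, regardless of its total block count.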
On Fri, Feb 7, 2014 at 9:50 PM, Suresh S <su...@gmail.com> wrote:
--
Harsh J