Posted to mapreduce-dev@hadoop.apache.org by Suresh S <su...@gmail.com> on 2014/02/07 17:20:13 UTC

How to process part of a file in Hadoop?

Dear Friends,

          I have a very large file in HDFS with 3000+ blocks.

I want to run a job with various input sizes, using the same file as the
input each time. Usually the number of tasks equals the number of
blocks/splits. Suppose a job with 2 tasks needs to process any two
randomly chosen blocks of the given input file.

How can I give a random set of HDFS blocks as the input of a job?

Note: my aim is not to process the input file to produce some output.
I want to replicate individual blocks based on the load.

*Regards*
*S.Suresh,*
*Research Scholar,*
*Department of Computer Applications,*
*National Institute of Technology,*
*Tiruchirappalli - 620015.*
*+91-9941506562*

Re: How to process part of a file in Hadoop?

Posted by Harsh J <ha...@cloudera.com>.
You can write a custom InputFormat whose #getSplits(...) returns your
required InputSplit objects (with randomised offsets + lengths, etc.).
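
A minimal sketch of that approach (not from the thread; the class name
RandomSubsetInputFormat and the "random.split.count" property are
illustrative assumptions, not standard Hadoop names): subclass
TextInputFormat, let the parent compute the usual block-aligned splits,
then keep a random subset of them.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class RandomSubsetInputFormat extends TextInputFormat {

    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        // Start from the usual block-aligned splits computed by FileInputFormat.
        List<InputSplit> splits =
            new ArrayList<InputSplit>(super.getSplits(job));

        // Number of random splits to keep ("random.split.count" is an
        // illustrative property name, not a built-in Hadoop setting).
        int wanted = job.getConfiguration().getInt("random.split.count", 2);
        if (wanted >= splits.size()) {
            return splits;
        }

        // Shuffle and keep only the first 'wanted' splits.
        Collections.shuffle(splits, new Random());
        return new ArrayList<InputSplit>(splits.subList(0, wanted));
    }
}

You would then register it on the job, e.g.
job.setInputFormatClass(RandomSubsetInputFormat.class), and set
random.split.count to the number of blocks you want processed.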

-- 
Harsh J