Posted to user@hadoop.apache.org by rab ra <ra...@gmail.com> on 2014/08/20 07:46:38 UTC

Hadoop InputFormat - Processing large number of small files

Hello,

I have a use case in which I need to process a huge set of files stored in
HDFS. The files are non-splittable and each one needs to be processed as a
whole. I have the following questions, which I need answered in order to
proceed.

1. I wish to schedule each map task on a task tracker where its data is
already available. How can I do that? Currently, I have a file that
contains a list of filenames, and each map task gets one line of it via
NLineInputFormat. The map task then opens that file via FSDataInputStream
and works with it. Is there a way to ensure the map task runs on the node
where that file is stored?

2. The files are not large and would be called 'small' files by Hadoop
standards. I came across CombineFileInputFormat, which can process more
than one file in a single map task. What I need here is a format that
handles several files in a single map task but does not have to read the
files; instead, it should carry the filenames in either the key or the
value. In the map task I can then run a loop to process these files. Any
help?

3. Any other alternatives?



regards
rab

Re: Hadoop InputFormat - Processing large number of small files

Posted by rab ra <ra...@gmail.com>.
Hi,

I tried to use your CombineFileInputFormat implementation. However, I get
the following exception:

‘not org.apache.hadoop.mapred.InputFormat’

I am using Hadoop 2.4.1, and it looks like it expects the older interface,
since it does not accept
‘org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat’. May I know
which version of Hadoop you used?

It looks like I need to use the older
‘org.apache.hadoop.mapred.lib.CombineFileInputFormat’ instead?

Thanks and Regards
rab
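
For context, the ‘not org.apache.hadoop.mapred.InputFormat’ message usually
indicates that the two MapReduce APIs are being mixed: the job is configured
through the old org.apache.hadoop.mapred interfaces while the input format
extends the new org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.
A minimal new-API driver sketch follows; CFInputFormat and TestMapper stand
in for the classes from Felix's blog post and are assumed, not defined here.

// Sketch of a driver that stays on the new (mapreduce) API end to end.
// CFInputFormat and TestMapper are assumed to be the CombineFileInputFormat
// subclass and mapper from the blog post; they are not defined here.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallFilesDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "small files");
    job.setJarByClass(SmallFilesDriver.class);

    // New-API input format: must be set on a mapreduce.Job, not a mapred.JobConf.
    job.setInputFormatClass(CFInputFormat.class);
    job.setMapperClass(TestMapper.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

If the job really has to stay on the old JobConf API, then the
org.apache.hadoop.mapred.lib.CombineFileInputFormat variant is the one that
implements org.apache.hadoop.mapred.InputFormat.
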
On 20 Aug 2014 22:59, "Felix Chern" <id...@gmail.com> wrote:

> I wrote a post on how to use CombineInputFormat:
>
> http://www.idryman.org/blog/2013/09/22/process-small-files-on-hadoop-using-combinefileinputformat-1/
> In the RecordReader constructor, you can get the context of which file you
> are reading in.
> In my example, I created FileLineWritable to include the filename in the
> mapper input key.
> Then you can use the input key as:
>
>   public static class TestMapper extends Mapper<FileLineWritable, Text, Text, IntWritable>{
>     private Text txt = new Text();
>     private IntWritable count = new IntWritable(1);
>     public void map (FileLineWritable key, Text val, Context context) throws IOException, InterruptedException{
>       StringTokenizer st = new StringTokenizer(val.toString());
>       while (st.hasMoreTokens()){
>         txt.set(key.fileName + st.nextToken());
>         context.write(txt, count);
>       }
>     }
>   }
>
>
> Cheers,
> Felix
>
>
> On Aug 20, 2014, at 8:19 AM, rab ra <ra...@gmail.com> wrote:
>
> Thanks for the response.
>
> Yes, I know wholeFileInputFormat. But i am not sure filename comes to map
> process either as key or value. But, I think this file format reads the
> contents of the file. I wish to have a inputformat that just gives filename
> or list of filenames.
>
> Also, files are very small. The wholeFileInputFormat spans one map process
> per file and thus results huge number of map processes. I wish to span a
> single map process per group of files.
>
> I think I need to tweak CombineFileInputFormat's recordreader() so that it
> does not read the entire file but just filename.
>
>
> regards
> rab
>
> regards
> Bala
>
>
> On Wed, Aug 20, 2014 at 6:48 PM, Shahab Yunus <sh...@gmail.com>
> wrote:
>
>> Have you looked at the WholeFileInputFormat implementations? There are
>> quite a few if search for them...
>>
>>
>> http://hadoop-sandy.blogspot.com/2013/02/wholefileinputformat-in-java-hadoop.html
>>
>> https://github.com/tomwhite/hadoop-book/blob/master/ch07/src/main/java/WholeFileInputFormat.java
>>
>> Regards,
>> Shahab
>>
>>
>> On Wed, Aug 20, 2014 at 1:46 AM, rab ra <ra...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I have a use case wherein i need to process huge set of files stored in
>>> HDFS. Those files are non-splittable and they need to be processed as a
>>> whole. Here, I have the following question for which I need answers to
>>> proceed further in this.
>>>
>>> 1.  I wish to schedule the map process in task tracker where data is
>>> already available. How can I do it? Currently, I have a file that contains
>>> list of filenames. Each map get one line of it via NLineInputFormat. The
>>> map process then accesses the file via FSDataInputStream and work with it.
>>> Is there a way to ensure this map process is running on the node where the
>>> file is available?.
>>>
>>> 2.  Since the files are not large and it can be called as 'small' files
>>> by hadoop standard. Now, I came across CombineFileInputFormat that can
>>> process more than one file in a single map process.  What I need here is a
>>> format that can process more than one files in a single map but does not
>>> have to read the files, and either in key or value, it has the filenames.
>>> In map process then, I can run a loop to process these files. Any help?
>>>
>>> 3. Any othe alternatives?
>>>
>>>
>>>
>>> regards
>>>  rab
>>>
>>>
>>
>
>

RE: Hadoop InputFormat - Processing large number of small files

Posted by java8964 <ja...@hotmail.com>.
If you want to use NLineInputFormat, and also want each individual file to be processed in a map task that runs on the same node as its data, you need to implement and control that logic yourself.
Extend NLineInputFormat and override the getSplits() method: look up the block locations of the HDFS file named on each line, and assign those hosts to the corresponding split.
Keep in mind that NLineInputFormat sends one line (and therefore one file) to each mapper, and that locality is tracked per block rather than per file, so you also need to make sure each file is smaller than one block.
For details on how to control where mappers execute, you can refer to this book:
Professional Hadoop Solutions
http://www.amazon.com/Professional-Hadoop-Solutions-Boris-Lublinsky/dp/1118611934/ref=sr_1_1?s=books&ie=UTF8&qid=1408644908&sr=1-1&keywords=hadoop+solution
Chapter 4, "Customizing MapReduce Execution", describes some examples of how to do that.
Yong
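
A rough sketch of that approach is below. The class and helper names are
invented for illustration; it assumes one filename per line (that is,
mapreduce.input.lineinputformat.linespermap left at 1) and that each data
file fits in a single block.

// Sketch only: pin each NLineInputFormat split to the node(s) holding the
// HDFS file named on that line. Assumes one filename per line.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.util.LineReader;

public class LocalizedNLineInputFormat extends NLineInputFormat {

  @Override
  public List<InputSplit> getSplits(JobContext job) throws IOException {
    Configuration conf = job.getConfiguration();
    List<InputSplit> lineSplits = super.getSplits(job);        // one line per split
    List<InputSplit> localized = new ArrayList<InputSplit>(lineSplits.size());

    for (InputSplit s : lineSplits) {
      FileSplit lineSplit = (FileSplit) s;
      String dataFileName = readLineAt(conf, lineSplit);       // filename on that line
      String[] hosts = hostsOf(conf, new Path(dataFileName.trim()));
      // Same byte range in the listing file, but advertise the data file's hosts,
      // so the scheduler tries to place the map task next to the data file.
      localized.add(new FileSplit(lineSplit.getPath(), lineSplit.getStart(),
          lineSplit.getLength(), hosts));
    }
    return localized;
  }

  private String readLineAt(Configuration conf, FileSplit split) throws IOException {
    FileSystem fs = split.getPath().getFileSystem(conf);
    FSDataInputStream in = fs.open(split.getPath());
    try {
      in.seek(split.getStart());
      LineReader reader = new LineReader(in);
      Text line = new Text();
      if (split.getStart() != 0) {
        reader.readLine(line);   // skip the partial first line, as LineRecordReader does
      }
      reader.readLine(line);
      return line.toString();
    } finally {
      in.close();
    }
  }

  private String[] hostsOf(Configuration conf, Path dataFile) throws IOException {
    FileSystem fs = dataFile.getFileSystem(conf);
    FileStatus status = fs.getFileStatus(dataFile);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    // Small files fit in one block, so the first block's hosts are enough here.
    return blocks.length > 0 ? blocks[0].getHosts() : new String[0];
  }
}

Even with the hosts set on the splits, locality is only a scheduling
preference: YARN will try for node-local or rack-local placement but does
not guarantee it.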

Date: Thu, 21 Aug 2014 22:26:12 +0530
Subject: Re: Hadoop InputFormat - Processing large number of small files
From: rabmdu@gmail.com
To: user@hadoop.apache.org

Hello,

This means that a file with names of all the files that need to be
processed is fed to hadoop with NLineInputFormat?

If this is the case, then, how can we ensure that map processes are
scheduled in the node where blocks containing the files are stored already?

regards
rab

On Thu, Aug 21, 2014 at 9:07 PM, Felix Chern <id...@gmail.com> wrote:

If I were you, I’ll first generate a file with those file name:

hadoop fs -ls > term_file

Then run the normal map reduce job

Felix

On Aug 21, 2014, at 1:38 AM, rab ra <ra...@gmail.com> wrote:

Thanks for the link. If it is not required for CFinputformat to have
contents of the files in the map process but only the filename, what
changes need to be done in the code?

rab.

On 20 Aug 2014 22:59, "Felix Chern" <id...@gmail.com> wrote:

I wrote a post on how to use CombineInputFormat:
http://www.idryman.org/blog/2013/09/22/process-small-files-on-hadoop-using-combinefileinputformat-1/

In the RecordReader constructor, you can get the context of which file you
are reading in. In my example, I created FileLineWritable to include the
filename in the mapper input key. Then you can use the input key as:

  public static class TestMapper extends Mapper<FileLineWritable, Text, Text, IntWritable>{
    private Text txt = new Text();
    private IntWritable count = new IntWritable(1);
    public void map (FileLineWritable key, Text val, Context context) throws IOException, InterruptedException{
      StringTokenizer st = new StringTokenizer(val.toString());
      while (st.hasMoreTokens()){
        txt.set(key.fileName + st.nextToken());
        context.write(txt, count);
      }
    }
  }

Cheers,
Felix

On Aug 20, 2014, at 8:19 AM, rab ra <ra...@gmail.com> wrote:

Thanks for the response.

Yes, I know wholeFileInputFormat. But i am not sure filename comes to map
process either as key or value. But, I think this file format reads the
contents of the file. I wish to have a inputformat that just gives filename
or list of filenames.

Also, files are very small. The wholeFileInputFormat spans one map process
per file and thus results huge number of map processes. I wish to span a
single map process per group of files.

I think I need to tweak CombineFileInputFormat's recordreader() so that it
does not read the entire file but just filename.

regards
rab

regards
Bala

On Wed, Aug 20, 2014 at 6:48 PM, Shahab Yunus <sh...@gmail.com> wrote:

Have you looked at the WholeFileInputFormat implementations? There are
quite a few if search for them...

http://hadoop-sandy.blogspot.com/2013/02/wholefileinputformat-in-java-hadoop.html
https://github.com/tomwhite/hadoop-book/blob/master/ch07/src/main/java/WholeFileInputFormat.java

Regards,
Shahab

On Wed, Aug 20, 2014 at 1:46 AM, rab ra <ra...@gmail.com> wrote:

Hello,

I have a use case wherein i need to process huge set of files stored in
HDFS. Those files are non-splittable and they need to be processed as a
whole. Here, I have the following question for which I need answers to
proceed further in this.

1.  I wish to schedule the map process in task tracker where data is
already available. How can I do it? Currently, I have a file that contains
list of filenames. Each map get one line of it via NLineInputFormat. The
map process then accesses the file via FSDataInputStream and work with it.
Is there a way to ensure this map process is running on the node where the
file is available?.

2.  Since the files are not large and it can be called as 'small' files by
hadoop standard. Now, I came across CombineFileInputFormat that can process
more than one file in a single map process.  What I need here is a format
that can process more than one files in a single map but does not have to
read the files, and either in key or value, it has the filenames. In map
process then, I can run a loop to process these files. Any help?

3. Any othe alternatives?

regards
rab

Re: Hadoop InputFormat - Processing large number of small files

Posted by rab ra <ra...@gmail.com>.
Hello,

So this means that a file with the names of all the files that need to be
processed is fed to Hadoop with NLineInputFormat?

If this is the case, then how can we ensure that the map processes are
scheduled on the nodes where the blocks containing the files are already stored?

regards
rab


On Thu, Aug 21, 2014 at 9:07 PM, Felix Chern <id...@gmail.com> wrote:

> If I were you, I’ll first generate a file with those file name:
>
> hadoop fs -ls > term_file
>
> Then run the normal map reduce job
>
> Felix
>
> On Aug 21, 2014, at 1:38 AM, rab ra <ra...@gmail.com> wrote:
>
> Thanks for the link. If it is not required for CFinputformat to have
> contents of the files in the map process but only the filename, what
> changes need to be done in the code?
>
> rab.
> On 20 Aug 2014 22:59, "Felix Chern" <id...@gmail.com> wrote:
>
>> I wrote a post on how to use CombineInputFormat:
>>
>> http://www.idryman.org/blog/2013/09/22/process-small-files-on-hadoop-using-combinefileinputformat-1/
>> In the RecordReader constructor, you can get the context of which file
>> you are reading in.
>> In my example, I created FileLineWritable to include the filename in the
>> mapper input key.
>> Then you can use the input key as:
>>
>>   public static class TestMapper extends Mapper<FileLineWritable, Text, Text, IntWritable>{
>>     private Text txt = new Text();
>>     private IntWritable count = new IntWritable(1);
>>     public void map (FileLineWritable key, Text val, Context context) throws IOException, InterruptedException{
>>       StringTokenizer st = new StringTokenizer(val.toString());
>>       while (st.hasMoreTokens()){
>>         txt.set(key.fileName + st.nextToken());
>>         context.write(txt, count);
>>       }
>>     }
>>   }
>>
>>
>> Cheers,
>> Felix
>>
>>
>> On Aug 20, 2014, at 8:19 AM, rab ra <ra...@gmail.com> wrote:
>>
>> Thanks for the response.
>>
>> Yes, I know wholeFileInputFormat. But i am not sure filename comes to map
>> process either as key or value. But, I think this file format reads the
>> contents of the file. I wish to have a inputformat that just gives filename
>> or list of filenames.
>>
>> Also, files are very small. The wholeFileInputFormat spans one map
>> process per file and thus results huge number of map processes. I wish to
>> span a single map process per group of files.
>>
>> I think I need to tweak CombineFileInputFormat's recordreader() so that
>> it does not read the entire file but just filename.
>>
>>
>> regards
>> rab
>>
>> regards
>> Bala
>>
>>
>> On Wed, Aug 20, 2014 at 6:48 PM, Shahab Yunus <sh...@gmail.com>
>> wrote:
>>
>>> Have you looked at the WholeFileInputFormat implementations? There are
>>> quite a few if search for them...
>>>
>>>
>>> http://hadoop-sandy.blogspot.com/2013/02/wholefileinputformat-in-java-hadoop.html
>>>
>>> https://github.com/tomwhite/hadoop-book/blob/master/ch07/src/main/java/WholeFileInputFormat.java
>>>
>>> Regards,
>>> Shahab
>>>
>>>
>>> On Wed, Aug 20, 2014 at 1:46 AM, rab ra <ra...@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> I have a use case wherein i need to process huge set of files stored in
>>>> HDFS. Those files are non-splittable and they need to be processed as a
>>>> whole. Here, I have the following question for which I need answers to
>>>> proceed further in this.
>>>>
>>>> 1.  I wish to schedule the map process in task tracker where data is
>>>> already available. How can I do it? Currently, I have a file that contains
>>>> list of filenames. Each map get one line of it via NLineInputFormat. The
>>>> map process then accesses the file via FSDataInputStream and work with it.
>>>> Is there a way to ensure this map process is running on the node where the
>>>> file is available?.
>>>>
>>>> 2.  Since the files are not large and it can be called as 'small' files
>>>> by hadoop standard. Now, I came across CombineFileInputFormat that can
>>>> process more than one file in a single map process.  What I need here is a
>>>> format that can process more than one files in a single map but does not
>>>> have to read the files, and either in key or value, it has the filenames.
>>>> In map process then, I can run a loop to process these files. Any help?
>>>>
>>>> 3. Any othe alternatives?
>>>>
>>>>
>>>>
>>>> regards
>>>>  rab
>>>>
>>>>
>>>
>>
>>
>

Re: Hadoop InputFormat - Processing large number of small files

Posted by Felix Chern <id...@gmail.com>.
If I were you, I’d first generate a file with those file names:

hadoop fs -ls > term_file

Then run the normal map reduce job.

Felix

On Aug 21, 2014, at 1:38 AM, rab ra <ra...@gmail.com> wrote:

> Thanks for the link. If it is not required for CFinputformat to have contents of the files in the map process but only the filename, what changes need to be done in the code?
> 
> rab.
> 
> On 20 Aug 2014 22:59, "Felix Chern" <id...@gmail.com> wrote:
> I wrote a post on how to use CombineInputFormat:
> http://www.idryman.org/blog/2013/09/22/process-small-files-on-hadoop-using-combinefileinputformat-1/
> In the RecordReader constructor, you can get the context of which file you are reading in.
> In my example, I created FileLineWritable to include the filename in the mapper input key.
> Then you can use the input key as:
> 
>   
>   public static class TestMapper extends Mapper<FileLineWritable, Text, Text, IntWritable>{
>     private Text txt = new Text();
>     private IntWritable count = new IntWritable(1);
>     public void map (FileLineWritable key, Text val, Context context) throws IOException, InterruptedException{
>       StringTokenizer st = new StringTokenizer(val.toString());
>         while (st.hasMoreTokens()){
>           txt.set(key.fileName + st.nextToken());          
>           context.write(txt, count);
>         }
>     }
>   }
> 
> 
> Cheers,
> Felix
> 
> 
> On Aug 20, 2014, at 8:19 AM, rab ra <ra...@gmail.com> wrote:
> 
>> Thanks for the response.
>> 
>> Yes, I know wholeFileInputFormat. But i am not sure filename comes to map process either as key or value. But, I think this file format reads the contents of the file. I wish to have a inputformat that just gives filename or list of filenames.
>> 
>> Also, files are very small. The wholeFileInputFormat spans one map process per file and thus results huge number of map processes. I wish to span a single map process per group of files. 
>> 
>> I think I need to tweak CombineFileInputFormat's recordreader() so that it does not read the entire file but just filename.
>> 
>> 
>> regards
>> rab
>> 
>> regards
>> Bala
>> 
>> 
>> On Wed, Aug 20, 2014 at 6:48 PM, Shahab Yunus <sh...@gmail.com> wrote:
>> Have you looked at the WholeFileInputFormat implementations? There are quite a few if search for them...
>> 
>> http://hadoop-sandy.blogspot.com/2013/02/wholefileinputformat-in-java-hadoop.html
>> https://github.com/tomwhite/hadoop-book/blob/master/ch07/src/main/java/WholeFileInputFormat.java
>> 
>> Regards,
>> Shahab
>> 
>> 
>> On Wed, Aug 20, 2014 at 1:46 AM, rab ra <ra...@gmail.com> wrote:
>> Hello,
>> 
>> I have a use case wherein i need to process huge set of files stored in HDFS. Those files are non-splittable and they need to be processed as a whole. Here, I have the following question for which I need answers to proceed further in this.
>> 
>> 1.  I wish to schedule the map process in task tracker where data is already available. How can I do it? Currently, I have a file that contains list of filenames. Each map get one line of it via NLineInputFormat. The map process then accesses the file via FSDataInputStream and work with it. Is there a way to ensure this map process is running on the node where the file is available?. 
>> 
>> 2.  Since the files are not large and it can be called as 'small' files by hadoop standard. Now, I came across CombineFileInputFormat that can process more than one file in a single map process.  What I need here is a format that can process more than one files in a single map but does not have to read the files, and either in key or value, it has the filenames. In map process then, I can run a loop to process these files. Any help?
>> 
>> 3. Any othe alternatives?
>> 
>> 
>> 
>> regards
>> rab
>> 
>> 
>> 
> 
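
For reference, the TestMapper quoted above keys on a FileLineWritable that
carries the source filename. A minimal Writable of that shape might look
like the following; this is only an illustration, not necessarily the
implementation from the blog post.

// Hedged sketch of a FileLineWritable-style key: the filename plus the
// line offset within that file. Field names are assumed from the mapper's
// use of key.fileName above.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class FileLineWritable implements WritableComparable<FileLineWritable> {
  public String fileName;   // used by the mapper as key.fileName
  public long offset;

  @Override
  public void write(DataOutput out) throws IOException {
    Text.writeString(out, fileName);
    out.writeLong(offset);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    fileName = Text.readString(in);
    offset = in.readLong();
  }

  @Override
  public int compareTo(FileLineWritable other) {
    int cmp = fileName.compareTo(other.fileName);
    return cmp != 0 ? cmp : Long.compare(offset, other.offset);
  }

  @Override
  public int hashCode() {
    return 31 * fileName.hashCode() + (int) (offset ^ (offset >>> 32));
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof FileLineWritable)) return false;
    FileLineWritable w = (FileLineWritable) o;
    return fileName.equals(w.fileName) && offset == w.offset;
  }
}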



Re: Hadoop InputFormat - Processing large number of small files

Posted by rab ra <ra...@gmail.com>.
Thanks for the link. If CombineFileInputFormat is not required to hand
the contents of the files to the map process, but only the filenames,
what changes need to be made in the code?

rab.
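
As a rough sketch of what such a change could look like (my own sketch,
assuming the new org.apache.hadoop.mapreduce API; the class names
FileNameOnlyInputFormat and FileNameRecordReader are made up), the
per-file RecordReader can skip opening the file and emit only its path:

  import java.io.IOException;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.InputSplit;
  import org.apache.hadoop.mapreduce.JobContext;
  import org.apache.hadoop.mapreduce.RecordReader;
  import org.apache.hadoop.mapreduce.TaskAttemptContext;
  import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
  import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
  import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

  // Sketch: combines many small files into one split, but the per-file
  // reader emits only the file path and never opens the file itself.
  public class FileNameOnlyInputFormat extends CombineFileInputFormat<Text, NullWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
      return false;  // the files are non-splittable
    }

    @Override
    public RecordReader<Text, NullWritable> createRecordReader(
        InputSplit split, TaskAttemptContext context) throws IOException {
      return new CombineFileRecordReader<Text, NullWritable>(
          (CombineFileSplit) split, context, FileNameRecordReader.class);
    }

    // CombineFileRecordReader creates one of these per file in the split.
    public static class FileNameRecordReader extends RecordReader<Text, NullWritable> {
      private final Text fileName;
      private boolean emitted = false;

      public FileNameRecordReader(CombineFileSplit split, TaskAttemptContext context,
                                  Integer index) {
        fileName = new Text(split.getPath(index).toString());
      }

      @Override public void initialize(InputSplit split, TaskAttemptContext context) { }
      @Override public boolean nextKeyValue() {
        if (emitted) return false;
        emitted = true;  // exactly one record per file: its path
        return true;
      }
      @Override public Text getCurrentKey() { return fileName; }
      @Override public NullWritable getCurrentValue() { return NullWritable.get(); }
      @Override public float getProgress() { return emitted ? 1.0f : 0.0f; }
      @Override public void close() { }
    }
  }

The mapper would then receive each path as its key (Mapper<Text,
NullWritable, ...>), and the driver would set
job.setInputFormatClass(FileNameOnlyInputFormat.class) and cap the
combined split size so each map task gets a manageable group of files.
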
On 20 Aug 2014 22:59, "Felix Chern" <id...@gmail.com> wrote:

> I wrote a post on how to use CombineInputFormat:
>
> http://www.idryman.org/blog/2013/09/22/process-small-files-on-hadoop-using-combinefileinputformat-1/
> In the RecordReader constructor, you can get the context of which file you
> are reading in.
> In my example, I created FileLineWritable to include the filename in the
> mapper input key.
> Then you can use the input key as:
>
>   public static class TestMapper extends Mapper<FileLineWritable, Text, Text, IntWritable>{
>     private Text txt = new Text();
>     private IntWritable count = new IntWritable(1);
>     public void map (FileLineWritable key, Text val, Context context) throws IOException, InterruptedException{
>       StringTokenizer st = new StringTokenizer(val.toString());
>       while (st.hasMoreTokens()){
>         txt.set(key.fileName + st.nextToken());
>         context.write(txt, count);
>       }
>     }
>   }
>
>
> Cheers,
> Felix
>
>
> On Aug 20, 2014, at 8:19 AM, rab ra <ra...@gmail.com> wrote:
>
> Thanks for the response.
>
> Yes, I know wholeFileInputFormat. But i am not sure filename comes to map
> process either as key or value. But, I think this file format reads the
> contents of the file. I wish to have a inputformat that just gives filename
> or list of filenames.
>
> Also, files are very small. The wholeFileInputFormat spans one map process
> per file and thus results huge number of map processes. I wish to span a
> single map process per group of files.
>
> I think I need to tweak CombineFileInputFormat's recordreader() so that it
> does not read the entire file but just filename.
>
>
> regards
> rab
>
> regards
> Bala
>
>
> On Wed, Aug 20, 2014 at 6:48 PM, Shahab Yunus <sh...@gmail.com>
> wrote:
>
>> Have you looked at the WholeFileInputFormat implementations? There are
>> quite a few if search for them...
>>
>>
>> http://hadoop-sandy.blogspot.com/2013/02/wholefileinputformat-in-java-hadoop.html
>>
>> https://github.com/tomwhite/hadoop-book/blob/master/ch07/src/main/java/WholeFileInputFormat.java
>>
>> Regards,
>> Shahab
>>
>>
>> On Wed, Aug 20, 2014 at 1:46 AM, rab ra <ra...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I have a use case wherein i need to process huge set of files stored in
>>> HDFS. Those files are non-splittable and they need to be processed as a
>>> whole. Here, I have the following question for which I need answers to
>>> proceed further in this.
>>>
>>> 1.  I wish to schedule the map process in task tracker where data is
>>> already available. How can I do it? Currently, I have a file that contains
>>> list of filenames. Each map get one line of it via NLineInputFormat. The
>>> map process then accesses the file via FSDataInputStream and work with it.
>>> Is there a way to ensure this map process is running on the node where the
>>> file is available?.
>>>
>>> 2.  Since the files are not large and it can be called as 'small' files
>>> by hadoop standard. Now, I came across CombineFileInputFormat that can
>>> process more than one file in a single map process.  What I need here is a
>>> format that can process more than one files in a single map but does not
>>> have to read the files, and either in key or value, it has the filenames.
>>> In map process then, I can run a loop to process these files. Any help?
>>>
>>> 3. Any othe alternatives?
>>>
>>>
>>>
>>> regards
>>>  rab
>>>
>>>
>>
>
>

Re: Hadoop InputFormat - Processing large number of small files

Posted by rab ra <ra...@gmail.com>.
Hi,

Is it not a good idea to model the key as a Text type?

I have a large number of sequence files, each holding a bunch of key-value
pairs. I will read these seq files inside the map, so my map needs only the
filenames. I believe that with CombineFileInputFormat the map will run on
nodes where the data is already available, and hence my explicit HDFS read
will be faster.
I do not want the file contents in the map, since not all of the key-value
pairs are needed.

Regards
Rab
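
A small sketch of the read I have in mind inside the map, assuming the
files are standard SequenceFiles; the class name SeqFilePeek and the
filtering comment are only illustrative:

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.util.ReflectionUtils;

  // Hypothetical helper: open one sequence file by name inside the map
  // and pick out only the key/value pairs that are actually needed.
  public class SeqFilePeek {
    public static void readNeededPairs(Configuration conf, String fileName)
        throws IOException {
      Path path = new Path(fileName);
      try (SequenceFile.Reader reader =
               new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))) {
        Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        Writable val = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
        while (reader.next(key, val)) {
          // keep or skip each pair here; only the needed ones get processed
        }
      }
    }
  }
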
On 20 Aug 2014 22:59, "Felix Chern" <id...@gmail.com> wrote:

> I wrote a post on how to use CombineInputFormat:
>
> http://www.idryman.org/blog/2013/09/22/process-small-files-on-hadoop-using-combinefileinputformat-1/
> In the RecordReader constructor, you can get the context of which file you
> are reading in.
> In my example, I created FileLineWritable to include the filename in the
> mapper input key.
> Then you can use the input key as:
>
>   public static class TestMapper extends Mapper<FileLineWritable, Text, Text, IntWritable>{
>     private Text txt = new Text();
>     private IntWritable count = new IntWritable(1);
>     public void map (FileLineWritable key, Text val, Context context) throws IOException, InterruptedException{
>       StringTokenizer st = new StringTokenizer(val.toString());
>       while (st.hasMoreTokens()){
>         txt.set(key.fileName + st.nextToken());
>         context.write(txt, count);
>       }
>     }
>   }
>
>
> Cheers,
> Felix
>
>
> On Aug 20, 2014, at 8:19 AM, rab ra <ra...@gmail.com> wrote:
>
> Thanks for the response.
>
> Yes, I know wholeFileInputFormat. But i am not sure filename comes to map
> process either as key or value. But, I think this file format reads the
> contents of the file. I wish to have a inputformat that just gives filename
> or list of filenames.
>
> Also, files are very small. The wholeFileInputFormat spans one map process
> per file and thus results huge number of map processes. I wish to span a
> single map process per group of files.
>
> I think I need to tweak CombineFileInputFormat's recordreader() so that it
> does not read the entire file but just filename.
>
>
> regards
> rab
>
> regards
> Bala
>
>
> On Wed, Aug 20, 2014 at 6:48 PM, Shahab Yunus <sh...@gmail.com>
> wrote:
>
>> Have you looked at the WholeFileInputFormat implementations? There are
>> quite a few if search for them...
>>
>>
>> http://hadoop-sandy.blogspot.com/2013/02/wholefileinputformat-in-java-hadoop.html
>>
>> https://github.com/tomwhite/hadoop-book/blob/master/ch07/src/main/java/WholeFileInputFormat.java
>>
>> Regards,
>> Shahab
>>
>>
>> On Wed, Aug 20, 2014 at 1:46 AM, rab ra <ra...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I have a use case wherein i need to process huge set of files stored in
>>> HDFS. Those files are non-splittable and they need to be processed as a
>>> whole. Here, I have the following question for which I need answers to
>>> proceed further in this.
>>>
>>> 1.  I wish to schedule the map process in task tracker where data is
>>> already available. How can I do it? Currently, I have a file that contains
>>> list of filenames. Each map get one line of it via NLineInputFormat. The
>>> map process then accesses the file via FSDataInputStream and work with it.
>>> Is there a way to ensure this map process is running on the node where the
>>> file is available?.
>>>
>>> 2.  Since the files are not large and it can be called as 'small' files
>>> by hadoop standard. Now, I came across CombineFileInputFormat that can
>>> process more than one file in a single map process.  What I need here is a
>>> format that can process more than one files in a single map but does not
>>> have to read the files, and either in key or value, it has the filenames.
>>> In map process then, I can run a loop to process these files. Any help?
>>>
>>> 3. Any othe alternatives?
>>>
>>>
>>>
>>> regards
>>>  rab
>>>
>>>
>>
>
>

Re: Hadoop InputFormat - Processing large number of small files

Posted by Felix Chern <id...@gmail.com>.
I wrote a post on how to use CombineInputFormat:
http://www.idryman.org/blog/2013/09/22/process-small-files-on-hadoop-using-combinefileinputformat-1/
In the RecordReader constructor, you can get the context of which file you are reading in.
In my example, I created FileLineWritable to include the filename in the mapper input key.
Then you can use the input key as:

  
  public static class TestMapper extends Mapper<FileLineWritable, Text, Text, IntWritable>{
    private Text txt = new Text();
    private IntWritable count = new IntWritable(1);
    public void map (FileLineWritable key, Text val, Context context) throws IOException, InterruptedException{
      StringTokenizer st = new StringTokenizer(val.toString());
      while (st.hasMoreTokens()){
        txt.set(key.fileName + st.nextToken());
        context.write(txt, count);
      }
    }
  }


Cheers,
Felix


On Aug 20, 2014, at 8:19 AM, rab ra <ra...@gmail.com> wrote:

> Thanks for the response.
> 
> Yes, I know wholeFileInputFormat. But i am not sure filename comes to map process either as key or value. But, I think this file format reads the contents of the file. I wish to have a inputformat that just gives filename or list of filenames.
> 
> Also, files are very small. The wholeFileInputFormat spans one map process per file and thus results huge number of map processes. I wish to span a single map process per group of files. 
> 
> I think I need to tweak CombineFileInputFormat's recordreader() so that it does not read the entire file but just filename.
> 
> 
> regards
> rab
> 
> regards
> Bala
> 
> 
> On Wed, Aug 20, 2014 at 6:48 PM, Shahab Yunus <sh...@gmail.com> wrote:
> Have you looked at the WholeFileInputFormat implementations? There are quite a few if search for them...
> 
> http://hadoop-sandy.blogspot.com/2013/02/wholefileinputformat-in-java-hadoop.html
> https://github.com/tomwhite/hadoop-book/blob/master/ch07/src/main/java/WholeFileInputFormat.java
> 
> Regards,
> Shahab
> 
> 
> On Wed, Aug 20, 2014 at 1:46 AM, rab ra <ra...@gmail.com> wrote:
> Hello,
> 
> I have a use case wherein i need to process huge set of files stored in HDFS. Those files are non-splittable and they need to be processed as a whole. Here, I have the following question for which I need answers to proceed further in this.
> 
> 1.  I wish to schedule the map process in task tracker where data is already available. How can I do it? Currently, I have a file that contains list of filenames. Each map get one line of it via NLineInputFormat. The map process then accesses the file via FSDataInputStream and work with it. Is there a way to ensure this map process is running on the node where the file is available?. 
> 
> 2.  Since the files are not large and it can be called as 'small' files by hadoop standard. Now, I came across CombineFileInputFormat that can process more than one file in a single map process.  What I need here is a format that can process more than one files in a single map but does not have to read the files, and either in key or value, it has the filenames. In map process then, I can run a loop to process these files. Any help?
> 
> 3. Any othe alternatives?
> 
> 
> 
> regards
> rab
> 
> 
> 


Re: Hadoop InputFormat - Processing large number of small files

Posted by rab ra <ra...@gmail.com>.
Thanks for the response.

Yes, I know WholeFileInputFormat, but I am not sure the filename comes to the
map process as either key or value. I think this format reads the contents of
the file; I wish to have an input format that just gives the filename or a
list of filenames.

Also, the files are very small. WholeFileInputFormat spawns one map process
per file and thus results in a huge number of map processes. I wish to spawn
a single map process per group of files.

I think I need to tweak CombineFileInputFormat's RecordReader so that it does
not read the entire file but just returns the filename.
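
A rough sketch of that idea (class names are illustrative, not an existing implementation): a CombineFileInputFormat whose per-file record reader emits a single record containing only the file path, so the file contents are never read by the input format.

  import java.io.IOException;

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.InputSplit;
  import org.apache.hadoop.mapreduce.JobContext;
  import org.apache.hadoop.mapreduce.RecordReader;
  import org.apache.hadoop.mapreduce.TaskAttemptContext;
  import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
  import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
  import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

  // Groups many small files into each split; the key passed to map() is the
  // file path and the value is empty, so file contents are never opened here.
  public class FileNameOnlyInputFormat extends CombineFileInputFormat<Text, NullWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
      return false; // each file stays whole
    }

    @Override
    public RecordReader<Text, NullWritable> createRecordReader(
        InputSplit split, TaskAttemptContext context) throws IOException {
      return new CombineFileRecordReader<Text, NullWritable>(
          (CombineFileSplit) split, context, FileNameRecordReader.class);
    }

    // Emits exactly one (path, NULL) record per file in the combined split.
    public static class FileNameRecordReader extends RecordReader<Text, NullWritable> {
      private final Text key = new Text();
      private boolean done = false;

      public FileNameRecordReader(CombineFileSplit split, TaskAttemptContext context,
                                  Integer index) {
        key.set(split.getPath(index).toString());
      }

      @Override public void initialize(InputSplit split, TaskAttemptContext context) { }
      @Override public boolean nextKeyValue() { if (done) return false; done = true; return true; }
      @Override public Text getCurrentKey() { return key; }
      @Override public NullWritable getCurrentValue() { return NullWritable.get(); }
      @Override public float getProgress() { return done ? 1.0f : 0.0f; }
      @Override public void close() { }
    }
  }

With this, map() receives one filename per record while many files are still grouped into each map task; setMaxSplitSize() on the format (or mapreduce.input.fileinputformat.split.maxsize) caps the total bytes, and hence the number of files, per split.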


regards
rab

regards
Bala


On Wed, Aug 20, 2014 at 6:48 PM, Shahab Yunus <sh...@gmail.com>
wrote:

> Have you looked at the WholeFileInputFormat implementations? There are
> quite a few if you search for them...
>
>
> http://hadoop-sandy.blogspot.com/2013/02/wholefileinputformat-in-java-hadoop.html
>
> https://github.com/tomwhite/hadoop-book/blob/master/ch07/src/main/java/WholeFileInputFormat.java
>
> Regards,
> Shahab
>
>
> On Wed, Aug 20, 2014 at 1:46 AM, rab ra <ra...@gmail.com> wrote:
>
>> Hello,
>>
>> I have a use case wherein I need to process a huge set of files stored in
>> HDFS. Those files are non-splittable and need to be processed as a whole.
>> Here, I have the following questions, for which I need answers to proceed
>> further.
>>
>> 1.  I wish to schedule the map process on the task tracker where the data
>> is already available. How can I do it? Currently, I have a file that
>> contains a list of filenames. Each map gets one line of it via
>> NLineInputFormat. The map process then accesses the file via
>> FSDataInputStream and works with it. Is there a way to ensure this map
>> process runs on the node where the file is available?
>>
>> 2.  The files are not large and can be called 'small' files by Hadoop
>> standards. Now, I came across CombineFileInputFormat, which can process
>> more than one file in a single map process.  What I need here is a format
>> that can process more than one file in a single map but does not have to
>> read the files; it should carry the filenames as either the key or the
>> value. In the map process, I can then run a loop to process these files.
>> Any help?
>>
>> 3. Any other alternatives?
>>
>>
>>
>> regards
>>  rab
>>
>>
>

Re: Hadoop InputFormat - Processing large number of small files

Posted by Shahab Yunus <sh...@gmail.com>.
Have you looked at the WholeFileInputFormat implementations? There are
quite a few if you search for them...

http://hadoop-sandy.blogspot.com/2013/02/wholefileinputformat-in-java-hadoop.html
https://github.com/tomwhite/hadoop-book/blob/master/ch07/src/main/java/WholeFileInputFormat.java
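
Both links follow the same general pattern (details differ between the implementations): mark each file as non-splittable and have the record reader hand the whole file to map() as a single value. A minimal sketch of that pattern:

  import java.io.IOException;

  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.IOUtils;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.mapreduce.InputSplit;
  import org.apache.hadoop.mapreduce.JobContext;
  import org.apache.hadoop.mapreduce.RecordReader;
  import org.apache.hadoop.mapreduce.TaskAttemptContext;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.input.FileSplit;

  public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
      return false; // one map task per file, never split
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
        InputSplit split, TaskAttemptContext context) {
      return new WholeFileRecordReader();
    }

    // Reads the entire file into a single BytesWritable value.
    public static class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
      private FileSplit split;
      private TaskAttemptContext context;
      private final BytesWritable value = new BytesWritable();
      private boolean processed = false;

      @Override
      public void initialize(InputSplit split, TaskAttemptContext context) {
        this.split = (FileSplit) split;
        this.context = context;
      }

      @Override
      public boolean nextKeyValue() throws IOException {
        if (processed) return false;
        byte[] contents = new byte[(int) split.getLength()];
        Path file = split.getPath();
        FileSystem fs = file.getFileSystem(context.getConfiguration());
        FSDataInputStream in = fs.open(file);
        try {
          IOUtils.readFully(in, contents, 0, contents.length);
          value.set(contents, 0, contents.length);
        } finally {
          IOUtils.closeStream(in);
        }
        processed = true;
        return true;
      }

      @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
      @Override public BytesWritable getCurrentValue() { return value; }
      @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
      @Override public void close() { }
    }
  }

Note that this still gives one map task per file; grouping many small files into a single task is what CombineFileInputFormat adds on top.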

Regards,
Shahab


On Wed, Aug 20, 2014 at 1:46 AM, rab ra <ra...@gmail.com> wrote:

> Hello,
>
> I have a use case wherein I need to process a huge set of files stored in
> HDFS. Those files are non-splittable and need to be processed as a whole.
> Here, I have the following questions, for which I need answers to proceed
> further.
>
> 1.  I wish to schedule the map process on the task tracker where the data is
> already available. How can I do it? Currently, I have a file that contains a
> list of filenames. Each map gets one line of it via NLineInputFormat. The
> map process then accesses the file via FSDataInputStream and works with it.
> Is there a way to ensure this map process runs on the node where the file is
> available?
>
> 2.  The files are not large and can be called 'small' files by Hadoop
> standards. Now, I came across CombineFileInputFormat, which can process more
> than one file in a single map process.  What I need here is a format that
> can process more than one file in a single map but does not have to read the
> files; it should carry the filenames as either the key or the value. In the
> map process, I can then run a loop to process these files. Any help?
>
> 3. Any other alternatives?
>
>
>
> regards
> rab
>
>
