Posted to hdfs-user@hadoop.apache.org by unmesha sreeveni <un...@gmail.com> on 2014/06/28 09:26:50 UTC

WholeFileInputFormat in hadoop

Hi

  A small clarification:

 Does WholeFileInputFormat take the entire input file as a single record,
or does it treat each input split as one record?

-- 
*Thanks & Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/

Re: WholeFileInputFormat in hadoop

Posted by Ryan Tabora <ra...@gmail.com>.
Try reading this blog here, I think it's a pretty good overview.

http://hadoopi.wordpress.com/2013/05/27/understand-recordreader-inputsplit/


If you set a whole file's contents to be either the key or the value in the
mapper, yes you will load the whole file in memory. This is why it is up to
the user to define what a key/value pair is in the input format. You could
always set the key value pair to some metadata about the file (file path,
file length) if you don't want to load the whole thing in the mapper.
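
As a sketch of that metadata alternative (the class name here is illustrative, not a shipped Hadoop API), a record reader paired with a non-splittable input format can emit one (path, length) pair per file and never buffer the contents:

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Hypothetical record reader: one record per file, with the file path as
// the key and the file length as the value, so nothing large is loaded.
public class FileMetadataRecordReader extends RecordReader<Text, LongWritable> {
    private FileSplit split;
    private boolean processed = false;
    private final Text key = new Text();
    private final LongWritable value = new LongWritable();

    @Override
    public void initialize(InputSplit genericSplit, TaskAttemptContext context) {
        this.split = (FileSplit) genericSplit;
    }

    @Override
    public boolean nextKeyValue() {
        if (processed) {
            return false;                     // only one record per file
        }
        key.set(split.getPath().toString());  // file path as the key
        value.set(split.getLength());         // file length as the value
        processed = true;
        return true;
    }

    @Override public Text getCurrentKey() { return key; }
    @Override public LongWritable getCurrentValue() { return value; }
    @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
    @Override public void close() { }
}
```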

Regards,
Ryan Tabora
http://ryantabora.com


On Sun, Jun 29, 2014 at 9:28 AM, unmesha sreeveni <un...@gmail.com>
wrote:

> But how is that different from normal parallel MapReduce execution?
> MapReduce is a parallel execution framework, yet the input to each map
> call is still a single record.
>
> If WholeFileInputFormat handed each map an entire input split instead of
> the entire file, it would be more useful, right? If it is the whole file,
> it can cause a heap space error.
>
> Please correct me if I am wrong.
>
> --
> *Thanks & Regards *
>
>
> *Unmesha Sreeveni U.B*
> *Hadoop, Bigdata Developer*
> *Center for Cyber Security | Amrita Vishwa Vidyapeetham*
> http://www.unmeshasreeveni.blogspot.in/
>
>
>

Re: WholeFileInputFormat in hadoop

Posted by Mohit Anchlia <mo...@gmail.com>.
Have you looked at this post:

http://stackoverflow.com/questions/15863566/need-assistance-with-implementing-dbscan-on-map-reduce/15863699#15863699

On Sun, Jun 29, 2014 at 9:01 PM, unmesha sreeveni <un...@gmail.com>
wrote:

> I am trying to implement the DBSCAN algorithm. I referred to the algorithm
> in "Data Mining - Concepts and Techniques (3rd Ed)", chapter 10, page 474.
> In this algorithm we need to find the distance between each pair of points.
> Say my sample input is
> 5,6
> 8,2
> 4,5
> 4,6
>
> In DBSCAN we have to pick one element and then find its distance to all
> the others.
>
> While implementing this I will not be able to get the whole file in the
> map in order to find the distances.
> I tried some approaches:
> 1. Used WholeFileInput and did the entire algorithm in the map itself - I
> don't think this is a good approach (and it ended with a heap space error).
> 2. This one is not implemented, as I thought it was not feasible:
>  - Read one line of the input data set in the driver and write it to a
> new file (say, centroid).
>  - This centroid can be read in setup(); the map calculates the distance
> and emits the data that satisfies the DBSCAN condition as
> (id, epsilonneighbour), and in the reducer we can aggregate all the
> epsilon neighbours of (5,6) that come from different maps and find the
> neighbours of each epsilon neighbour.
>  - The next iteration would do the same again: read the input file and
> find a node which has not been visited...
> If the input is a 1 GB file, the MR job executes as many times as there
> are records.
>
> Can anyone suggest a better way to do this?
>
> Hope the use case is understandable; otherwise please tell me and I will
> explain further.
>
>
> --
> *Thanks & Regards *
>
>
> *Unmesha Sreeveni U.B*
> *Hadoop, Bigdata Developer*
> *Center for Cyber Security | Amrita Vishwa Vidyapeetham*
> http://www.unmeshasreeveni.blogspot.in/
>
>
>

Re: WholeFileInputFormat in hadoop

Posted by unmesha sreeveni <un...@gmail.com>.
I am trying to implement the DBSCAN algorithm. I referred to the algorithm in
"Data Mining - Concepts and Techniques (3rd Ed)", chapter 10, page 474.
In this algorithm we need to find the distance between each pair of points.
Say my sample input is
5,6
8,2
4,5
4,6

In DBSCAN we have to pick one element and then find its distance to all the
others.

While implementing this I will not be able to get the whole file in the map
in order to find the distances.
I tried some approaches:
1. Used WholeFileInput and did the entire algorithm in the map itself - I
don't think this is a good approach (and it ended with a heap space error).
2. This one is not implemented, as I thought it was not feasible:
 - Read one line of the input data set in the driver and write it to a new
file (say, centroid).
 - This centroid can be read in setup(); the map calculates the distance and
emits the data that satisfies the DBSCAN condition as (id, epsilonneighbour),
and in the reducer we can aggregate all the epsilon neighbours of (5,6) that
come from different maps and find the neighbours of each epsilon neighbour.
 - The next iteration would do the same again: read the input file and find
a node which has not been visited...
If the input is a 1 GB file, the MR job executes as many times as there are
records.


Can anyone suggest a better way to do this?

Hope the use case is understandable; otherwise please tell me and I will
explain further.
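
The distance and epsilon-neighbourhood step described above can be sketched in plain Java (the eps value and the class/method names here are illustrative, not part of any Hadoop API):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EpsilonNeighbors {
    // Euclidean distance between two points of equal dimension.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // The map-side step described above: given the "centroid" point read in
    // setup(), keep only the records within eps of it.
    static List<double[]> epsilonNeighbors(double[] center,
                                           List<double[]> points, double eps) {
        List<double[]> neighbors = new ArrayList<>();
        for (double[] p : points) {
            if (distance(center, p) <= eps) {
                neighbors.add(p);
            }
        }
        return neighbors;
    }

    public static void main(String[] args) {
        List<double[]> points = Arrays.asList(
                new double[]{5, 6}, new double[]{8, 2},
                new double[]{4, 5}, new double[]{4, 6});
        // With center (5,6) and eps = 2: (5,6), (4,5) and (4,6) qualify.
        System.out.println(
                epsilonNeighbors(new double[]{5, 6}, points, 2.0).size());
        // prints 3
    }
}
```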


-- 
*Thanks & Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/

Re: WholeFileInputFormat in hadoop

Posted by Mohit Anchlia <mo...@gmail.com>.
I think it will be easier if you give your use case. You really would load the whole file if you don't want it to split, but there are many ways to solve the issue, which is why understanding the use case is helpful.

Sent from my iPhone

> On Jun 29, 2014, at 9:28 AM, unmesha sreeveni <un...@gmail.com> wrote:
> 
> But how is that different from normal parallel MapReduce execution? MapReduce is a parallel execution framework, yet the input to each map call is still a single record.
> 
> If WholeFileInputFormat handed each map an entire input split instead of the entire file, it would be more useful, right? If it is the whole file, it can cause a heap space error.
> 
> Please correct me if I am wrong.
> 
> -- 
> Thanks & Regards
> 
> Unmesha Sreeveni U.B
> Hadoop, Bigdata Developer
> Center for Cyber Security | Amrita Vishwa Vidyapeetham
> http://www.unmeshasreeveni.blogspot.in/
> 
> 

Re: WholeFileInputFormat in hadoop

Posted by unmesha sreeveni <un...@gmail.com>.
But how is that different from normal parallel MapReduce execution?
MapReduce is a parallel execution framework, yet the input to each map call
is still a single record.

If WholeFileInputFormat handed each map an entire input split instead of
the entire file, it would be more useful, right? If it is the whole file,
it can cause a heap space error.

Please correct me if I am wrong.

-- 
*Thanks & Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/

Re: WholeFileInputFormat in hadoop

Posted by Mohit Anchlia <mo...@gmail.com>.
It takes the entire file as input. The input format overrides the
isSplitable() method to return false; that method determines whether a
file can be split into multiple chunks.
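
A minimal sketch of that pattern (a WholeFileInputFormat is user code rather than a class shipped with Hadoop; the WholeFileRecordReader referenced below is an assumed companion class that reads the whole file into a single value):

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Sketch: because isSplitable() returns false, the framework builds exactly
// one split per input file, so each map task sees one whole file.
public class WholeFileInputFormat
        extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;  // never split this file into multiple chunks
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        // WholeFileRecordReader (assumed) reads the file into one BytesWritable.
        return new WholeFileRecordReader();
    }
}
```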

On Sat, Jun 28, 2014 at 5:38 AM, Shahab Yunus <sh...@gmail.com>
wrote:

> I think it takes the entire file as input. Otherwise it wouldn't be any
> different from the normal line/record-based input format.
>
> Regards,
> Shahab
> On Jun 28, 2014 3:28 AM, "unmesha sreeveni" <un...@gmail.com> wrote:
>
>> Hi
>>
>>   A small clarification:
>>
>>  Does WholeFileInputFormat take the entire input file as a single
>> record, or does it treat each input split as one record?
>>
>> --
>> *Thanks & Regards *
>>
>>
>> *Unmesha Sreeveni U.B*
>> *Hadoop, Bigdata Developer*
>> *Center for Cyber Security | Amrita Vishwa Vidyapeetham*
>> http://www.unmeshasreeveni.blogspot.in/
>>
>>
>>

Re: WholeFileInputFormat in hadoop

Posted by Shahab Yunus <sh...@gmail.com>.
I think it takes the entire file as input. Otherwise it wouldn't be any
different from the normal line/record-based input format.

Regards,
Shahab
On Jun 28, 2014 3:28 AM, "unmesha sreeveni" <un...@gmail.com> wrote:

> Hi
>
>   A small clarification:
>
>  Does WholeFileInputFormat take the entire input file as a single record,
> or does it treat each input split as one record?
>
> --
> *Thanks & Regards *
>
>
> *Unmesha Sreeveni U.B*
> *Hadoop, Bigdata Developer*
> *Center for Cyber Security | Amrita Vishwa Vidyapeetham*
> http://www.unmeshasreeveni.blogspot.in/
>
>
>
