Posted to mapreduce-user@hadoop.apache.org by unmesha sreeveni <un...@gmail.com> on 2014/12/12 10:30:27 UTC

Split files into 80% and 20% for building model and prediction

I am trying to divide my HDFS file into two parts/files,
80% and 20%, for a classification algorithm (80% for modelling and 20% for
prediction).
Please provide suggestions for the same.
To take the 80% and 20% into two separate files we need to know the exact
number of records in the data set,
and that is only known if we go through the data set once.
So we need to write one MapReduce job just for counting the number of
records, and
a second MapReduce job for separating the 80% and 20% into two files using
Multiple Inputs.


Am I on the right track, or is there an alternative for the same?
One small confusion remains: how do I check whether the reducer gets filled
with 80% of the data?


-- 
*Thanks & Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/

Re: Split files into 80% and 20% for building model and prediction

Posted by Andre Kelpe <ak...@concurrentinc.com>.
Try Cascading multitool: http://docs.cascading.org/multitool/2.6/

- André

On Fri, Dec 12, 2014 at 10:30 AM, unmesha sreeveni <un...@gmail.com>
wrote:

> I am trying to divide my HDFS file into two parts/files,
> 80% and 20%, for a classification algorithm (80% for modelling and 20% for
> prediction).
> Please provide suggestions for the same.
> To take the 80% and 20% into two separate files we need to know the exact
> number of records in the data set,
> and that is only known if we go through the data set once.
> So we need to write one MapReduce job just for counting the number of
> records, and
> a second MapReduce job for separating the 80% and 20% into two files using
> Multiple Inputs.
>
>
> Am I on the right track, or is there an alternative for the same?
> One small confusion remains: how do I check whether the reducer gets
> filled with 80% of the data?
>
>
> --
> *Thanks & Regards *
>
>
> *Unmesha Sreeveni U.B*
> *Hadoop, Bigdata Developer*
> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
> http://www.unmeshasreeveni.blogspot.in/
>
>
>


-- 
André Kelpe
andre@concurrentinc.com
http://concurrentinc.com

Re: Split files into 80% and 20% for building model and prediction

Posted by Peyman Mohajerian <mo...@gmail.com>.
You don't have to copy the data to local to do a count:

% hdfs dfs -cat file1 | wc -l

will do the job.
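
A minimal Java sketch of the same count done through the HDFS FileSystem
API, so nothing is copied to local disk (the class name and argument
handling are illustrative assumptions, not from the thread):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsLineCount {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        long count = 0;
        // Stream the file straight from HDFS; nothing lands on local disk.
        try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(fs.open(new Path(args[0]))))) {
          while (reader.readLine() != null) {
            count++;
          }
        }
        System.out.println(count);
      }
    }

Run it via the hadoop launcher so the cluster configuration is picked up.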

On Fri, Dec 12, 2014 at 1:58 AM, Susheel Kumar Gadalay <sk...@gmail.com>
wrote:
>
> A simple solution:
>
> Copy the HDFS file to local and use OS commands to count the number of lines,
>
> cat file1 | wc -l
>
> and cut it at the right line number.
>
>
> On 12/12/14, unmesha sreeveni <un...@gmail.com> wrote:
> > I am trying to divide my HDFS file into two parts/files,
> > 80% and 20%, for a classification algorithm (80% for modelling and 20%
> > for prediction).
> > Please provide suggestions for the same.
> > To take the 80% and 20% into two separate files we need to know the
> > exact number of records in the data set,
> > and that is only known if we go through the data set once.
> > So we need to write one MapReduce job just for counting the number of
> > records, and
> > a second MapReduce job for separating the 80% and 20% into two files
> > using Multiple Inputs.
> >
> >
> > Am I on the right track, or is there an alternative for the same?
> > One small confusion remains: how do I check whether the reducer gets
> > filled with 80% of the data?
> >
> >
> > --
> > *Thanks & Regards *
> >
> >
> > *Unmesha Sreeveni U.B*
> > *Hadoop, Bigdata Developer*
> > *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
> > http://www.unmeshasreeveni.blogspot.in/
> >
>

Re: Split files into 80% and 20% for building model and prediction

Posted by Wilm Schumacher <wi...@gmail.com>.
Hi,

from a machine learning perspective I would recommend this approach too,
if there is no other information available for splitting the data set; it
depends on the data you are processing.

And I would split the data persistently, i.e. not use the training data
directly, but write it to a file with the approach Mikael suggested and use
that file for training. This is of course more computational effort, but in
my experience it's totally worth it. If you are not able to reproduce your
training on the same split, you will never find problems/enhancements in
your preprocessing/training/modelling etc.

And as you use Hadoop you have a lot of data, so this approach should work
quite well*.

Furthermore, you could do something like an n-fold cross-validation, as
sketched below; for that method you would need the split n times and would
have to persist the data.

Best wishes

Wilm

*) For a small number of data points (<10 or so) you could run into
singularity problems with the random split approach, e.g. empty training
data or unbalanced buckets. But for N -> infinity this should work very
well.
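
A rough sketch of the fold assignment for such an n-fold split (the class
name and the fold count are assumptions for illustration, not from the
thread):

    import java.io.IOException;
    import java.util.Random;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Tags every record with a random fold id in 0..N_FOLDS-1, so a
    // follow-up step can persist one file per fold and the exact split
    // can be reproduced across training runs.
    public class FoldAssignMapper
        extends Mapper<LongWritable, Text, IntWritable, Text> {
      private static final int N_FOLDS = 5; // assumed 5-fold cross-validation
      private final Random random = new Random();
      private final IntWritable fold = new IntWritable();

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        fold.set(random.nextInt(N_FOLDS));
        context.write(fold, value);
      }
    }

With job.setNumReduceTasks(N_FOLDS) and the default reducer, each fold then
lands in its own part file and can be kept for later runs.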

On 12.12.2014 at 11:24, Mikael Sitruk wrote:
> I would use a different approach. For each row, the mapper would invoke
> random.Next(); if the generated number is below 0.8, the row would go to
> the training key, otherwise to the test key.
> Mikael.s
> ------------------------------------------------------------------------
> From: Susheel Kumar Gadalay <ma...@gmail.com>
> Sent: ‎12/‎12/‎2014 12:00
> To: user@hadoop.apache.org <ma...@hadoop.apache.org>
> Subject: Re: Split files into 80% and 20% for building model and
> prediction
>
> A simple solution:
>
> Copy the HDFS file to local and use OS commands to count the number of lines,
>
> cat file1 | wc -l
>
> and cut it at the right line number.
>
>
> On 12/12/14, unmesha sreeveni <un...@gmail.com> wrote:
> > I am trying to divide my HDFS file into two parts/files,
> > 80% and 20%, for a classification algorithm (80% for modelling and 20%
> > for prediction).
> > Please provide suggestions for the same.
> > To take the 80% and 20% into two separate files we need to know the
> > exact number of records in the data set,
> > and that is only known if we go through the data set once.
> > So we need to write one MapReduce job just for counting the number of
> > records, and
> > a second MapReduce job for separating the 80% and 20% into two files
> > using Multiple Inputs.
> >
> >
> > Am I on the right track, or is there an alternative for the same?
> > One small confusion remains: how do I check whether the reducer gets
> > filled with 80% of the data?
> >
> >
> > --
> > *Thanks & Regards *
> >
> >
> > *Unmesha Sreeveni U.B*
> > *Hadoop, Bigdata Developer*
> > *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
> > http://www.unmeshasreeveni.blogspot.in/
> >


RE: Split files into 80% and 20% for building model and prediction

Posted by Mikael Sitruk <mi...@gmail.com>.
Hi Unmesha,
With the random approach you don't need to write the MR job for counting.

Mikael.s

-----Original Message-----
From: "Hitarth" <t....@gmail.com>
Sent: ‎12/‎12/‎2014 15:20
To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject: Re: Split files into 80% and 20% for building model and prediction

Hi Unmesha, 


If you use the approach suggested by Mikael, taking a random 80% of the data for training and the rest for testing, then you will have a good distribution for generating your predictive model.

Thanks,
Hitarth

On Dec 12, 2014, at 6:00 AM, unmesha sreeveni <un...@gmail.com> wrote:


Hi Mikael,
 So you won't write an MR job for counting the number of records in that file to find the 80% and 20%?


On Fri, Dec 12, 2014 at 3:54 PM, Mikael Sitruk <mi...@gmail.com> wrote:
I would use a different approach. For each row, the mapper would invoke random.Next(); if the generated number is below 0.8, the row would go to the training key, otherwise to the test key.
Mikael.s


From: Susheel Kumar Gadalay
Sent: ‎12/‎12/‎2014 12:00
To: user@hadoop.apache.org
Subject: Re: Split files into 80% and 20% for building model and prediction


A simple solution:

Copy the HDFS file to local and use OS commands to count the number of lines,

cat file1 | wc -l

and cut it at the right line number.


On 12/12/14, unmesha sreeveni <un...@gmail.com> wrote:
> I am trying to divide my HDFS file into two parts/files,
> 80% and 20%, for a classification algorithm (80% for modelling and 20% for
> prediction).
> Please provide suggestions for the same.
> To take the 80% and 20% into two separate files we need to know the exact
> number of records in the data set,
> and that is only known if we go through the data set once.
> So we need to write one MapReduce job just for counting the number of
> records, and
> a second MapReduce job for separating the 80% and 20% into two files using
> Multiple Inputs.
>
>
> Am I on the right track, or is there an alternative for the same?
> One small confusion remains: how do I check whether the reducer gets
> filled with 80% of the data?
>
>
> --
> *Thanks & Regards *
>
>
> *Unmesha Sreeveni U.B*
> *Hadoop, Bigdata Developer*
> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
> http://www.unmeshasreeveni.blogspot.in/
>





-- 

Thanks & Regards 


Unmesha Sreeveni U.B

Hadoop, Bigdata Developer
Centre for Cyber Security | Amrita Vishwa Vidyapeetham

http://www.unmeshasreeveni.blogspot.in/

Re: Split files into 80% and 20% for building model and prediction

Posted by Hitarth <t....@gmail.com>.
Hi Unmesha, 

If you use the approach suggested by Mikael, taking a random 80% of the data for training and the rest for testing, then you will have a good distribution for generating your predictive model.

Thanks,
Hitarth

> On Dec 12, 2014, at 6:00 AM, unmesha sreeveni <un...@gmail.com> wrote:
> 
> Hi Mikael,
>  So you won't write an MR job for counting the number of records in that file to find the 80% and 20%?
> 
>> On Fri, Dec 12, 2014 at 3:54 PM, Mikael Sitruk <mi...@gmail.com> wrote:
>> I would use a different approach. For each row, the mapper would invoke random.Next(); if the generated number is below 0.8, the row would go to the training key, otherwise to the test key.
>> Mikael.s
>> From: Susheel Kumar Gadalay
>> Sent: ‎12/‎12/‎2014 12:00
>> To: user@hadoop.apache.org
>> Subject: Re: Split files into 80% and 20% for building model and prediction
>> 
>> A simple solution:
>>
>> Copy the HDFS file to local and use OS commands to count the number of lines,
>>
>> cat file1 | wc -l
>>
>> and cut it at the right line number.
>> 
>> 
>> On 12/12/14, unmesha sreeveni <un...@gmail.com> wrote:
>> > I am trying to divide my HDFS file into two parts/files,
>> > 80% and 20%, for a classification algorithm (80% for modelling and 20%
>> > for prediction).
>> > Please provide suggestions for the same.
>> > To take the 80% and 20% into two separate files we need to know the
>> > exact number of records in the data set,
>> > and that is only known if we go through the data set once.
>> > So we need to write one MapReduce job just for counting the number of
>> > records, and
>> > a second MapReduce job for separating the 80% and 20% into two files
>> > using Multiple Inputs.
>> >
>> >
>> > Am I on the right track, or is there an alternative for the same?
>> > One small confusion remains: how do I check whether the reducer gets
>> > filled with 80% of the data?
>> >
>> >
>> > --
>> > *Thanks & Regards *
>> >
>> >
>> > *Unmesha Sreeveni U.B*
>> > *Hadoop, Bigdata Developer*
>> > *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
>> > http://www.unmeshasreeveni.blogspot.in/
>> >
> 
> 
> -- 
> Thanks & Regards
> 
> Unmesha Sreeveni U.B
> Hadoop, Bigdata Developer
> Centre for Cyber Security | Amrita Vishwa Vidyapeetham
> http://www.unmeshasreeveni.blogspot.in/
> 
> 

Re: Split files into 80% and 20% for building model and prediction

Posted by unmesha sreeveni <un...@gmail.com>.
Hi Mikael,
 So you won't write an MR job for counting the number of records in that
file to find the 80% and 20%?

On Fri, Dec 12, 2014 at 3:54 PM, Mikael Sitruk <mi...@gmail.com>
wrote:
>
> I would use a different approach. For each row, the mapper would invoke
> random.Next(); if the generated number is below 0.8, the row would go to
> the training key, otherwise to the test key.
> Mikael.s
> ------------------------------
> From: Susheel Kumar Gadalay <sk...@gmail.com>
> Sent: ‎12/‎12/‎2014 12:00
> To: user@hadoop.apache.org
> Subject: Re: Split files into 80% and 20% for building model and
> prediction
>
> A simple solution:
>
> Copy the HDFS file to local and use OS commands to count the number of lines,
>
> cat file1 | wc -l
>
> and cut it at the right line number.
>
>
> On 12/12/14, unmesha sreeveni <un...@gmail.com> wrote:
> > I am trying to divide my HDFS file into two parts/files,
> > 80% and 20%, for a classification algorithm (80% for modelling and 20%
> > for prediction).
> > Please provide suggestions for the same.
> > To take the 80% and 20% into two separate files we need to know the
> > exact number of records in the data set,
> > and that is only known if we go through the data set once.
> > So we need to write one MapReduce job just for counting the number of
> > records, and
> > a second MapReduce job for separating the 80% and 20% into two files
> > using Multiple Inputs.
> >
> >
> > Am I on the right track, or is there an alternative for the same?
> > One small confusion remains: how do I check whether the reducer gets
> > filled with 80% of the data?
> >
> >
> > --
> > *Thanks & Regards *
> >
> >
> > *Unmesha Sreeveni U.B*
> > *Hadoop, Bigdata Developer*
> > *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
> > http://www.unmeshasreeveni.blogspot.in/
> >
>


-- 
*Thanks & Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/

RE: Split files into 80% and 20% for building model and prediction

Posted by Mikael Sitruk <mi...@gmail.com>.
I would use a different approach. For each row, the mapper would invoke random.Next(); if the generated number is below 0.8, the row would go to the training key, otherwise to the test key.
Mikael.s
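
A minimal map-only sketch of that idea (the class name and the "train" and
"test" output names are illustrative; the driver would also have to register
both names with MultipleOutputs.addNamedOutput(...) and call
job.setNumReduceTasks(0)):

    import java.io.IOException;
    import java.util.Random;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    // Map-only job: each record is routed to the "train" or the "test"
    // named output by one random draw, so no record count is needed.
    public class RandomSplitMapper
        extends Mapper<LongWritable, Text, NullWritable, Text> {
      private MultipleOutputs<NullWritable, Text> out;
      private final Random random = new Random();

      @Override
      protected void setup(Context context) {
        out = new MultipleOutputs<NullWritable, Text>(context);
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String name = random.nextDouble() < 0.8 ? "train" : "test";
        out.write(name, NullWritable.get(), value, name + "/part");
      }

      @Override
      protected void cleanup(Context context)
          throws IOException, InterruptedException {
        out.close();
      }
    }

In expectation this writes roughly 80% of the records under train/ and 20%
under test/ in a single pass, with no counting job.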

-----Original Message-----
From: "Susheel Kumar Gadalay" <sk...@gmail.com>
Sent: ‎12/‎12/‎2014 12:00
To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject: Re: Split files into 80% and 20% for building model and prediction

A simple solution:

Copy the HDFS file to local and use OS commands to count the number of lines,

cat file1 | wc -l

and cut it at the right line number.


On 12/12/14, unmesha sreeveni <un...@gmail.com> wrote:
> I am trying to divide my HDFS file into two parts/files,
> 80% and 20%, for a classification algorithm (80% for modelling and 20% for
> prediction).
> Please provide suggestions for the same.
> To take the 80% and 20% into two separate files we need to know the exact
> number of records in the data set,
> and that is only known if we go through the data set once.
> So we need to write one MapReduce job just for counting the number of
> records, and
> a second MapReduce job for separating the 80% and 20% into two files using
> Multiple Inputs.
>
>
> Am I on the right track, or is there an alternative for the same?
> One small confusion remains: how do I check whether the reducer gets
> filled with 80% of the data?
>
>
> --
> *Thanks & Regards *
>
>
> *Unmesha Sreeveni U.B*
> *Hadoop, Bigdata Developer*
> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
> http://www.unmeshasreeveni.blogspot.in/
>

Re: Split files into 80% and 20% for building model and prediction

Posted by Susheel Kumar Gadalay <sk...@gmail.com>.
A simple solution:

Copy the HDFS file to local and use OS commands to count the number of lines,

cat file1 | wc -l

and cut it at the right line number.
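
Once wc -l has given the total, the cut itself could look like this small
local-filesystem sketch (the file names and argument convention are
placeholders, not from the thread):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.PrintWriter;

    // Splits a local file at a given line number: lines 1..cut go to
    // train.txt, the rest to test.txt.
    public class CutByLine {
      public static void main(String[] args) throws Exception {
        long cut = Long.parseLong(args[1]); // e.g. 80% of the wc -l total
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]));
             PrintWriter train = new PrintWriter("train.txt");
             PrintWriter test = new PrintWriter("test.txt")) {
          String line;
          long n = 0;
          while ((line = in.readLine()) != null) {
            (++n <= cut ? train : test).println(line);
          }
        }
      }
    }

For a 1,000-line file, cut = 800 sends the first 800 lines to train.txt and
the remaining 200 to test.txt.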


On 12/12/14, unmesha sreeveni <un...@gmail.com> wrote:
> I am trying to divide my HDFS file into two parts/files,
> 80% and 20%, for a classification algorithm (80% for modelling and 20% for
> prediction).
> Please provide suggestions for the same.
> To take the 80% and 20% into two separate files we need to know the exact
> number of records in the data set,
> and that is only known if we go through the data set once.
> So we need to write one MapReduce job just for counting the number of
> records, and
> a second MapReduce job for separating the 80% and 20% into two files using
> Multiple Inputs.
>
>
> Am I on the right track, or is there an alternative for the same?
> One small confusion remains: how do I check whether the reducer gets
> filled with 80% of the data?
>
>
> --
> *Thanks & Regards *
>
>
> *Unmesha Sreeveni U.B*
> *Hadoop, Bigdata Developer*
> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
> http://www.unmeshasreeveni.blogspot.in/
>

Re: Split files into 80% and 20% for building model and prediction

Posted by Chris Mawata <ch...@gmail.com>.
How about doing something along the lines of bucketing: pick a field that
is unique for each record, and if the hash of the field mod 10 is less than
8 (i.e. 8 of the 10 buckets, about 80%) the record goes into one bin,
otherwise into the other.
Cheers
Chris
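
A tiny self-contained sketch of that rule (the class name and the choice of
unique field are assumptions); since the bucket is a pure function of the
record itself, the same record always lands in the same bucket, so the
split is also reproducible across runs:

    // Deterministic 80/20 bucketing: hash a unique field, take it mod 10,
    // and send buckets 0-7 (8 of 10, ~80%) to training and buckets 8-9
    // (~20%) to test.
    public final class HashBucket {
      private HashBucket() {}

      public static boolean isTraining(String uniqueField) {
        int bucket = (uniqueField.hashCode() & Integer.MAX_VALUE) % 10;
        return bucket < 8;
      }

      public static void main(String[] args) {
        // Same input, same bucket, on every run:
        System.out.println(isTraining("record-42"));
      }
    }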
On Dec 12, 2014 1:32 AM, "unmesha sreeveni" <un...@gmail.com> wrote:

> I am trying to divide my HDFS file into two parts/files,
> 80% and 20%, for a classification algorithm (80% for modelling and 20% for
> prediction).
> Please provide suggestions for the same.
> To take the 80% and 20% into two separate files we need to know the exact
> number of records in the data set,
> and that is only known if we go through the data set once.
> So we need to write one MapReduce job just for counting the number of
> records, and
> a second MapReduce job for separating the 80% and 20% into two files using
> Multiple Inputs.
>
>
> Am I on the right track, or is there an alternative for the same?
> One small confusion remains: how do I check whether the reducer gets
> filled with 80% of the data?
>
>
> --
> *Thanks & Regards *
>
>
> *Unmesha Sreeveni U.B*
> *Hadoop, Bigdata Developer*
> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
> http://www.unmeshasreeveni.blogspot.in/
>
>
>

in one bin, otherwise into the other one.
Cheers
Chris
On Dec 12, 2014 1:32 AM, "unmesha sreeveni" <un...@gmail.com> wrote:

> I am trying to divide my HDFS file into 2 parts/files
> 80% and 20% for classification algorithm(80% for modelling and 20% for
> prediction)
> Please provide suggestion for the same.
> To take 80% and 20% to 2 seperate files we need to know the exact number
> of record in the data set
> And it is only known if we go through the data set once.
> so we need to write 1 MapReduce Job for just counting the number of
> records and
> 2 nd Mapreduce Job for separating 80% and 20% into 2 files using Multiple
> Inputs.
>
>
> Am I in the right track or there is any alternative for the same.
> But again a small confusion how to check if the reducer get filled with
> 80% data.
>
>
> --
> *Thanks & Regards *
>
>
> *Unmesha Sreeveni U.B*
> *Hadoop, Bigdata Developer*
> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
> http://www.unmeshasreeveni.blogspot.in/
>
>
>

Re: Split files into 80% and 20% for building model and prediction

Posted by Susheel Kumar Gadalay <sk...@gmail.com>.
Simple solution..

Copy the HDFS file to local and use OS commands to count no of lines

cat file1 | wc -l

and cut it based on line number.


On 12/12/14, unmesha sreeveni <un...@gmail.com> wrote:
> I am trying to divide my HDFS file into 2 parts/files
> 80% and 20% for classification algorithm(80% for modelling and 20% for
> prediction)
> Please provide suggestion for the same.
> To take 80% and 20% to 2 seperate files we need to know the exact number of
> record in the data set
> And it is only known if we go through the data set once.
> so we need to write 1 MapReduce Job for just counting the number of records
> and
> 2 nd Mapreduce Job for separating 80% and 20% into 2 files using Multiple
> Inputs.
>
>
> Am I in the right track or there is any alternative for the same.
> But again a small confusion how to check if the reducer get filled with 80%
> data.
>
>
> --
> *Thanks & Regards *
>
>
> *Unmesha Sreeveni U.B*
> *Hadoop, Bigdata Developer*
> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
> http://www.unmeshasreeveni.blogspot.in/
>

Re: Split files into 80% and 20% for building model and prediction

Posted by Chris Mawata <ch...@gmail.com>.
How about doing something on the lines of bucketing: Pick a field that is
unique for each record and if hash of the field mod 10 is 8 or less it goes
in one bin, otherwise into the other one.
Cheers
Chris
On Dec 12, 2014 1:32 AM, "unmesha sreeveni" <un...@gmail.com> wrote:

> I am trying to divide my HDFS file into 2 parts/files
> 80% and 20% for classification algorithm(80% for modelling and 20% for
> prediction)
> Please provide suggestion for the same.
> To take 80% and 20% to 2 seperate files we need to know the exact number
> of record in the data set
> And it is only known if we go through the data set once.
> so we need to write 1 MapReduce Job for just counting the number of
> records and
> 2 nd Mapreduce Job for separating 80% and 20% into 2 files using Multiple
> Inputs.
>
>
> Am I in the right track or there is any alternative for the same.
> But again a small confusion how to check if the reducer get filled with
> 80% data.
>
>
> --
> *Thanks & Regards *
>
>
> *Unmesha Sreeveni U.B*
> *Hadoop, Bigdata Developer*
> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
> http://www.unmeshasreeveni.blogspot.in/
>
>
>

Re: Split files into 80% and 20% for building model and prediction

Posted by Andre Kelpe <ak...@concurrentinc.com>.
Try Cascading multitool: http://docs.cascading.org/multitool/2.6/

- André

On Fri, Dec 12, 2014 at 10:30 AM, unmesha sreeveni <un...@gmail.com>
wrote:

> I am trying to divide my HDFS file into 2 parts/files
> 80% and 20% for classification algorithm(80% for modelling and 20% for
> prediction)
> Please provide suggestion for the same.
> To take 80% and 20% to 2 seperate files we need to know the exact number
> of record in the data set
> And it is only known if we go through the data set once.
> so we need to write 1 MapReduce Job for just counting the number of
> records and
> 2 nd Mapreduce Job for separating 80% and 20% into 2 files using Multiple
> Inputs.
>
>
> Am I in the right track or there is any alternative for the same.
> But again a small confusion how to check if the reducer get filled with
> 80% data.
>
>
> --
> *Thanks & Regards *
>
>
> *Unmesha Sreeveni U.B*
> *Hadoop, Bigdata Developer*
> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
> http://www.unmeshasreeveni.blogspot.in/
>
>
>


-- 
André Kelpe
andre@concurrentinc.com
http://concurrentinc.com

Re: Split files into 80% and 20% for building model and prediction

Posted by Andre Kelpe <ak...@concurrentinc.com>.
Try Cascading multitool: http://docs.cascading.org/multitool/2.6/

- André

On Fri, Dec 12, 2014 at 10:30 AM, unmesha sreeveni <un...@gmail.com>
wrote:

> I am trying to divide my HDFS file into 2 parts/files
> 80% and 20% for classification algorithm(80% for modelling and 20% for
> prediction)
> Please provide suggestion for the same.
> To take 80% and 20% to 2 seperate files we need to know the exact number
> of record in the data set
> And it is only known if we go through the data set once.
> so we need to write 1 MapReduce Job for just counting the number of
> records and
> 2 nd Mapreduce Job for separating 80% and 20% into 2 files using Multiple
> Inputs.
>
>
> Am I in the right track or there is any alternative for the same.
> But again a small confusion how to check if the reducer get filled with
> 80% data.
>
>
> --
> *Thanks & Regards *
>
>
> *Unmesha Sreeveni U.B*
> *Hadoop, Bigdata Developer*
> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
> http://www.unmeshasreeveni.blogspot.in/
>
>
>


-- 
André Kelpe
andre@concurrentinc.com
http://concurrentinc.com

Re: Split files into 80% and 20% for building model and prediction

Posted by Chris Mawata <ch...@gmail.com>.
How about doing something on the lines of bucketing: Pick a field that is
unique for each record and if hash of the field mod 10 is 8 or less it goes
in one bin, otherwise into the other one.
Cheers
Chris
On Dec 12, 2014 1:32 AM, "unmesha sreeveni" <un...@gmail.com> wrote:

> I am trying to divide my HDFS file into 2 parts/files
> 80% and 20% for classification algorithm(80% for modelling and 20% for
> prediction)
> Please provide suggestion for the same.
> To take 80% and 20% to 2 seperate files we need to know the exact number
> of record in the data set
> And it is only known if we go through the data set once.
> so we need to write 1 MapReduce Job for just counting the number of
> records and
> 2 nd Mapreduce Job for separating 80% and 20% into 2 files using Multiple
> Inputs.
>
>
> Am I in the right track or there is any alternative for the same.
> But again a small confusion how to check if the reducer get filled with
> 80% data.
>
>
> --
> *Thanks & Regards *
>
>
> *Unmesha Sreeveni U.B*
> *Hadoop, Bigdata Developer*
> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
> http://www.unmeshasreeveni.blogspot.in/
>
>
>