Posted to user@mahout.apache.org by Rahul Mishra <mi...@gmail.com> on 2012/09/19 08:17:14 UTC

Clustering large files using hadoop?

I have been able to cluster small csv files (containing only continuous
values) and generate results on a local system using Eclipse, and it works
smoothly.
In the process, I vectorize the data points and feed the K-means clustering
results as the initial centroids to Fuzzy K-means clustering.

However, in the end I can do this only for small files. For files with
2 million rows, it simply fails with an out-of-memory error.
Since Mahout is meant for large-scale machine learning, how do I convert my
code to use the power of Hadoop's map-reduce framework? [Info: I have
access to a 3-node Hadoop cluster.]
Can anyone suggest a step-by-step procedure?

I have also looked into the clustering chapters of the book "Mahout in
Action" but to my dismay did not find any clue.

-- 
Regards,
Rahul K Mishra,
www.ee.iitb.ac.in/student/~rahulkmishra

Re: csv2seq?

Posted by Paritosh Ranjan <pr...@xebia.com>.
I don't see a csv2seq command in the list of currently existing commands in
trunk, and I also don't see anything committed against it.
The parameters you listed look like the parameters of the trainlogistic
or trainnb command, since they ask for predictors and a target variable.

Please ask your questions in a separate thread; otherwise the main topic
of the previous thread sometimes gets hijacked.
That's why I have changed the subject of this mail.

On 19-09-2012 22:49, Rajesh Nikam wrote:
> csv2seq seems to be the MAHOUT-781 patch, which seems to address the issue
> of converting csv files to sequence file format.
> However, it is not clear whether it has been integrated and released; I am
> not able to see any documentation around its usage except some discussion
> on the mailing list about the following parameters.
>
> Has anyone used it? Any comments?
>
> input: the root HDFS directory containing csv files to convert
> output: the HDFS path of the target sequence file
> header: the HDFS path of a file containing the header of csv files
> predictors: columns to encode as vector
> types: data types of predictors, numeric, word, or text
> target: the name of the target variable
> categories: the number of target categories to be considered
> features: the number of internal hashed features to use
> key: the column to write as the Key of the target sequence file
>
>
>
> On Wed, Sep 19, 2012 at 4:15 PM, Paritosh Ranjan <pr...@xebia.com> wrote:
>
>> This code is putting everything in points ( which I think is some sort of
>> collection ). This will obviously throw OOM for large files.
>> The vectors should be added to a  sequence file and then the path to that
>> sequence file should be given as input to the clustering algorithm.
>>
>> Mahout in action has a code snippet which does it. Googling "writing into
>> a hdfs sequence file" would also help.
>>
>>
>> On 19-09-2012 16:06, Rahul Mishra wrote:
>>
>>> For small file it works absolutely fine. But, I get this error for large
>>> files :
>>>    Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit
>>> exceeded
>>>
>>> Initially, I am reading csv file using the following code and I presume,
>>> the issue is here. Kindly suggest better approach.
>>>
>>>                   CSVReader reader = new CSVReader(new
>>> FileReader(inputPath));
>>> double field = -1;
>>> int lineCount = 0;
>>>    String [] nextLine;
>>>    while ((nextLine = reader.readNext()) != null) {
>>>    lineCount++;
>>>    //ArrayList<Double> attributes =  new ArrayList<Double>();
>>>    double[] d_attrib = new double[4];
>>> for(int i=0;i<nextLine.length;i++)
>>>    {
>>> d_attrib[i] = Double.parseDouble(nextLine[i]);
>>> // attributes.add(Double.parseDouble(nextLine[i]));
>>>    }
>>> //Double[] d_attrib= attributes.toArray(new Double[attributes.size()]);
>>>    NamedVector vec = new NamedVector(new
>>> RandomAccessSparseVector(nextLine.length)," " + lineCount+" "); //name
>>> the
>>> vector with msisdn
>>>    vec.assign(d_attrib);
>>> points.add(vec);
>>> }
>>>
>>>
>>>
>>>
>>> On Wed, Sep 19, 2012 at 3:03 PM, Lance Norskog <go...@gmail.com> wrote:
>>>
>>>   If you have your Hadoop cluster in your environment variables, most
>>>> Mahout
>>>> jobs use the cluster by default. So, if you can run 'hadoop fs' and look
>>>> at
>>>> your hdfs cluster, Mahout should find your Hadoop cluster.
>>>>
>>>> Lance
>>>>
>>>> ----- Original Message -----
>>>> | From: "Paritosh Ranjan" <pr...@xebia.com>
>>>> | To: user@mahout.apache.org
>>>> | Sent: Tuesday, September 18, 2012 11:28:28 PM
>>>> | Subject: Re: Clustering large files using hadoop?
>>>> |
>>>> | KMeansDriver has a run method with a flag runSequential. When you
>>>> | will
>>>> | mark it to false, it will use the hadoop cluster to scale. kmeans
>>>> | command is also having this flag.
>>>> |
>>>> | "
>>>> |
>>>> | In the process, I have been able to vectorize the data points  and
>>>> | use the
>>>> | clustering results of K-means to feed it as the initial centroid to
>>>> | Fuzzy
>>>> | K-means clustering.
>>>> |
>>>> | "
>>>> | You can also use Canopy clustering for initial seeding, as its a
>>>> | single
>>>> | iteration clustering algorithm and produces good results if proper
>>>> | t1,t2
>>>> | values are provided.
>>>> | https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering
>>>> |
>>>> |
>>>> | On 19-09-2012 11:47, Rahul Mishra wrote:
>>>> | > I have been able to cluster and generate results for small csv
>>>> | > files(having
>>>> | > only continuous values) on a local system using eclipse and it
>>>> | > works
>>>> | > smoothly.
>>>> | > In the process, I have been able to vectorize the data points  and
>>>> | > use the
>>>> | > clustering results of K-means to feed it as the initial centroid to
>>>> | > Fuzzy
>>>> | > K-means clustering.
>>>> | >
>>>> | > But, in the end I am able to do it only for small files . For files
>>>> | > having
>>>> | > 2 million rows, it simply shows error out of memory.
>>>> | > But, since Mahout is for large scale machine learning , how do I
>>>> | > convert my
>>>> | > code to use the power of map-reduce framework of hadoop.[info: I
>>>> | > have
>>>> | > access to a 3-node Cluster having hadoop]
>>>> | > Can anyone suggest a step-by-step procedure?
>>>> | >
>>>> | > I have also looked into the clustering chapters of the book "Mahout
>>>> | > in
>>>> | > Action" but to my dismay did not find any clue.
>>>> | >
>>>> |
>>>> |
>>>> |
>>>>
>>>>
>>>
>>



Re: Clustering large files using hadoop?

Posted by Rajesh Nikam <ra...@gmail.com>.
csv2seq seems to be the MAHOUT-781 patch, which seems to address the issue
of converting csv files to sequence file format.
However, it is not clear whether it has been integrated and released; I am
not able to see any documentation around its usage except some discussion
on the mailing list about the following parameters.

Has anyone used it? Any comments?

input: the root HDFS directory containing csv files to convert
output: the HDFS path of the target sequence file
header: the HDFS path of a file containing the header of csv files
predictors: columns to encode as vector
types: data types of predictors, numeric, word, or text
target: the name of the target variable
categories: the number of target categories to be considered
features: the number of internal hashed features to use
key: the column to write as the Key of the target sequence file



On Wed, Sep 19, 2012 at 4:15 PM, Paritosh Ranjan <pr...@xebia.com> wrote:

> This code is putting everything in points ( which I think is some sort of
> collection ). This will obviously throw OOM for large files.
> The vectors should be added to a  sequence file and then the path to that
> sequence file should be given as input to the clustering algorithm.
>
> Mahout in action has a code snippet which does it. Googling "writing into
> a hdfs sequence file" would also help.
>
>
> On 19-09-2012 16:06, Rahul Mishra wrote:
>
>> For small file it works absolutely fine. But, I get this error for large
>> files :
>>   Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit
>> exceeded
>>
>> Initially, I am reading csv file using the following code and I presume,
>> the issue is here. Kindly suggest better approach.
>>
>>                  CSVReader reader = new CSVReader(new
>> FileReader(inputPath));
>> double field = -1;
>> int lineCount = 0;
>>   String [] nextLine;
>>   while ((nextLine = reader.readNext()) != null) {
>>   lineCount++;
>>   //ArrayList<Double> attributes =  new ArrayList<Double>();
>>   double[] d_attrib = new double[4];
>> for(int i=0;i<nextLine.length;i++)
>>   {
>> d_attrib[i] = Double.parseDouble(nextLine[i]);
>> // attributes.add(Double.parseDouble(nextLine[i]));
>>   }
>> //Double[] d_attrib= attributes.toArray(new Double[attributes.size()]);
>>   NamedVector vec = new NamedVector(new
>> RandomAccessSparseVector(nextLine.length)," " + lineCount+" "); //name
>> the
>> vector with msisdn
>>   vec.assign(d_attrib);
>> points.add(vec);
>> }
>>
>>
>>
>>
>> On Wed, Sep 19, 2012 at 3:03 PM, Lance Norskog <go...@gmail.com> wrote:
>>
>>  If you have your Hadoop cluster in your environment variables, most
>>> Mahout
>>> jobs use the cluster by default. So, if you can run 'hadoop fs' and look
>>> at
>>> your hdfs cluster, Mahout should find your Hadoop cluster.
>>>
>>> Lance
>>>
>>> ----- Original Message -----
>>> | From: "Paritosh Ranjan" <pr...@xebia.com>
>>> | To: user@mahout.apache.org
>>> | Sent: Tuesday, September 18, 2012 11:28:28 PM
>>> | Subject: Re: Clustering large files using hadoop?
>>> |
>>> | KMeansDriver has a run method with a flag runSequential. When you
>>> | will
>>> | mark it to false, it will use the hadoop cluster to scale. kmeans
>>> | command is also having this flag.
>>> |
>>> | "
>>> |
>>> | In the process, I have been able to vectorize the data points  and
>>> | use the
>>> | clustering results of K-means to feed it as the initial centroid to
>>> | Fuzzy
>>> | K-means clustering.
>>> |
>>> | "
>>> | You can also use Canopy clustering for initial seeding, as its a
>>> | single
>>> | iteration clustering algorithm and produces good results if proper
>>> | t1,t2
>>> | values are provided.
>>> | https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering
>>> |
>>> |
>>> | On 19-09-2012 11:47, Rahul Mishra wrote:
>>> | > I have been able to cluster and generate results for small csv
>>> | > files(having
>>> | > only continuous values) on a local system using eclipse and it
>>> | > works
>>> | > smoothly.
>>> | > In the process, I have been able to vectorize the data points  and
>>> | > use the
>>> | > clustering results of K-means to feed it as the initial centroid to
>>> | > Fuzzy
>>> | > K-means clustering.
>>> | >
>>> | > But, in the end I am able to do it only for small files . For files
>>> | > having
>>> | > 2 million rows, it simply shows error out of memory.
>>> | > But, since Mahout is for large scale machine learning , how do I
>>> | > convert my
>>> | > code to use the power of map-reduce framework of hadoop.[info: I
>>> | > have
>>> | > access to a 3-node Cluster having hadoop]
>>> | > Can anyone suggest a step-by-step procedure?
>>> | >
>>> | > I have also looked into the clustering chapters of the book "Mahout
>>> | > in
>>> | > Action" but to my dismay did not find any clue.
>>> | >
>>> |
>>> |
>>> |
>>>
>>>
>>
>>
>
>

Re: Clustering large files using hadoop?

Posted by Paritosh Ranjan <pr...@xebia.com>.
This code is putting everything into points (which I think is some sort
of in-memory collection). That will obviously throw an OOM error for
large files.
The vectors should instead be written to a sequence file, and the path to
that sequence file should then be given as input to the clustering algorithm.

Mahout in Action has a code snippet which does this. Googling "writing
into an HDFS sequence file" would also help.
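
For example, here is a minimal sketch of that approach (assuming the
Hadoop 1.x SequenceFile API, opencsv's CSVReader, and Mahout's
VectorWritable; the input and output paths are just placeholders):

  import java.io.FileReader;
  import java.io.IOException;
  import au.com.bytecode.opencsv.CSVReader;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;
  import org.apache.mahout.math.NamedVector;
  import org.apache.mahout.math.RandomAccessSparseVector;
  import org.apache.mahout.math.VectorWritable;

  public class CsvToSeq {
    public static void main(String[] args) throws IOException {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      Path vectorsPath = new Path("vectors/part-00000");   // placeholder output path
      // Key = vector name, Value = the vector itself, which is what the
      // clustering jobs read back in.
      SequenceFile.Writer writer =
          SequenceFile.createWriter(fs, conf, vectorsPath, Text.class, VectorWritable.class);
      CSVReader reader = new CSVReader(new FileReader("input.csv")); // placeholder input
      try {
        String[] row;
        int lineCount = 0;
        while ((row = reader.readNext()) != null) {
          lineCount++;
          double[] values = new double[row.length];
          for (int i = 0; i < row.length; i++) {
            values[i] = Double.parseDouble(row[i]);
          }
          NamedVector vec = new NamedVector(
              new RandomAccessSparseVector(row.length), String.valueOf(lineCount));
          vec.assign(values);
          // One record is appended per csv row; nothing accumulates in memory.
          writer.append(new Text(vec.getName()), new VectorWritable(vec));
        }
      } finally {
        reader.close();
        writer.close();
      }
    }
  }

The resulting sequence file path can then be passed directly as the input
to the k-means / fuzzy k-means drivers.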

On 19-09-2012 16:06, Rahul Mishra wrote:
> For small file it works absolutely fine. But, I get this error for large
> files :
>   Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit
> exceeded
>
> Initially, I am reading csv file using the following code and I presume,
> the issue is here. Kindly suggest better approach.
>
>                  CSVReader reader = new CSVReader(new FileReader(inputPath));
> double field = -1;
> int lineCount = 0;
>   String [] nextLine;
>   while ((nextLine = reader.readNext()) != null) {
>   lineCount++;
>   //ArrayList<Double> attributes =  new ArrayList<Double>();
>   double[] d_attrib = new double[4];
> for(int i=0;i<nextLine.length;i++)
>   {
> d_attrib[i] = Double.parseDouble(nextLine[i]);
> // attributes.add(Double.parseDouble(nextLine[i]));
>   }
> //Double[] d_attrib= attributes.toArray(new Double[attributes.size()]);
>   NamedVector vec = new NamedVector(new
> RandomAccessSparseVector(nextLine.length)," " + lineCount+" "); //name the
> vector with msisdn
>   vec.assign(d_attrib);
> points.add(vec);
> }
>
>
>
>
> On Wed, Sep 19, 2012 at 3:03 PM, Lance Norskog <go...@gmail.com> wrote:
>
>> If you have your Hadoop cluster in your environment variables, most Mahout
>> jobs use the cluster by default. So, if you can run 'hadoop fs' and look at
>> your hdfs cluster, Mahout should find your Hadoop cluster.
>>
>> Lance
>>
>> ----- Original Message -----
>> | From: "Paritosh Ranjan" <pr...@xebia.com>
>> | To: user@mahout.apache.org
>> | Sent: Tuesday, September 18, 2012 11:28:28 PM
>> | Subject: Re: Clustering large files using hadoop?
>> |
>> | KMeansDriver has a run method with a flag runSequential. When you
>> | will
>> | mark it to false, it will use the hadoop cluster to scale. kmeans
>> | command is also having this flag.
>> |
>> | "
>> |
>> | In the process, I have been able to vectorize the data points  and
>> | use the
>> | clustering results of K-means to feed it as the initial centroid to
>> | Fuzzy
>> | K-means clustering.
>> |
>> | "
>> | You can also use Canopy clustering for initial seeding, as its a
>> | single
>> | iteration clustering algorithm and produces good results if proper
>> | t1,t2
>> | values are provided.
>> | https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering
>> |
>> |
>> | On 19-09-2012 11:47, Rahul Mishra wrote:
>> | > I have been able to cluster and generate results for small csv
>> | > files(having
>> | > only continuous values) on a local system using eclipse and it
>> | > works
>> | > smoothly.
>> | > In the process, I have been able to vectorize the data points  and
>> | > use the
>> | > clustering results of K-means to feed it as the initial centroid to
>> | > Fuzzy
>> | > K-means clustering.
>> | >
>> | > But, in the end I am able to do it only for small files . For files
>> | > having
>> | > 2 million rows, it simply shows error out of memory.
>> | > But, since Mahout is for large scale machine learning , how do I
>> | > convert my
>> | > code to use the power of map-reduce framework of hadoop.[info: I
>> | > have
>> | > access to a 3-node Cluster having hadoop]
>> | > Can anyone suggest a step-by-step procedure?
>> | >
>> | > I have also looked into the clustering chapters of the book "Mahout
>> | > in
>> | > Action" but to my dismay did not find any clue.
>> | >
>> |
>> |
>> |
>>
>
>



Re: Clustering large files using hadoop?

Posted by Rahul Mishra <mi...@gmail.com>.
For small files it works absolutely fine, but I get this error for large
files:
 Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit
exceeded

Initially, I am reading the csv file with the following code, and I presume
the issue is here. Kindly suggest a better approach.

                CSVReader reader = new CSVReader(new FileReader(inputPath));
                int lineCount = 0;
                String[] nextLine;
                while ((nextLine = reader.readNext()) != null) {
                    lineCount++;
                    double[] attributes = new double[nextLine.length];
                    for (int i = 0; i < nextLine.length; i++) {
                        attributes[i] = Double.parseDouble(nextLine[i]);
                    }
                    // Name the vector with the msisdn (here, just the line number).
                    NamedVector vec = new NamedVector(
                        new RandomAccessSparseVector(nextLine.length), " " + lineCount + " ");
                    vec.assign(attributes);
                    points.add(vec); // everything accumulates in this in-memory collection
                }




On Wed, Sep 19, 2012 at 3:03 PM, Lance Norskog <go...@gmail.com> wrote:

> If you have your Hadoop cluster in your environment variables, most Mahout
> jobs use the cluster by default. So, if you can run 'hadoop fs' and look at
> your hdfs cluster, Mahout should find your Hadoop cluster.
>
> Lance
>
> ----- Original Message -----
> | From: "Paritosh Ranjan" <pr...@xebia.com>
> | To: user@mahout.apache.org
> | Sent: Tuesday, September 18, 2012 11:28:28 PM
> | Subject: Re: Clustering large files using hadoop?
> |
> | KMeansDriver has a run method with a flag runSequential. When you
> | will
> | mark it to false, it will use the hadoop cluster to scale. kmeans
> | command is also having this flag.
> |
> | "
> |
> | In the process, I have been able to vectorize the data points  and
> | use the
> | clustering results of K-means to feed it as the initial centroid to
> | Fuzzy
> | K-means clustering.
> |
> | "
> | You can also use Canopy clustering for initial seeding, as its a
> | single
> | iteration clustering algorithm and produces good results if proper
> | t1,t2
> | values are provided.
> | https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering
> |
> |
> | On 19-09-2012 11:47, Rahul Mishra wrote:
> | > I have been able to cluster and generate results for small csv
> | > files(having
> | > only continuous values) on a local system using eclipse and it
> | > works
> | > smoothly.
> | > In the process, I have been able to vectorize the data points  and
> | > use the
> | > clustering results of K-means to feed it as the initial centroid to
> | > Fuzzy
> | > K-means clustering.
> | >
> | > But, in the end I am able to do it only for small files . For files
> | > having
> | > 2 million rows, it simply shows error out of memory.
> | > But, since Mahout is for large scale machine learning , how do I
> | > convert my
> | > code to use the power of map-reduce framework of hadoop.[info: I
> | > have
> | > access to a 3-node Cluster having hadoop]
> | > Can anyone suggest a step-by-step procedure?
> | >
> | > I have also looked into the clustering chapters of the book "Mahout
> | > in
> | > Action" but to my dismay did not find any clue.
> | >
> |
> |
> |
>



-- 
Regards,
Rahul K Mishra,
www.ee.iitb.ac.in/student/~rahulkmishra

Re: Clustering large files using hadoop?

Posted by Lance Norskog <go...@gmail.com>.
If your Hadoop cluster settings are in your environment variables, most Mahout jobs use the cluster by default. So, if you can run 'hadoop fs' and see your HDFS cluster, Mahout should find your Hadoop cluster.

Lance
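
As a quick sanity check, here is a minimal sketch (assuming the Hadoop 1.x
client API and that your Hadoop conf directory is on the classpath) that
prints which filesystem the client configuration actually resolves to:

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;

  public class WhichFileSystem {
    public static void main(String[] args) throws IOException {
      Configuration conf = new Configuration(); // loads core-site.xml etc. from the classpath
      // Hadoop 1.x property name: hdfs://... means jobs should go to the cluster,
      // file:/// means you are still running against the local filesystem.
      System.out.println("fs.default.name = " + conf.get("fs.default.name"));
      System.out.println("FileSystem URI  = " + FileSystem.get(conf).getUri());
    }
  }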

----- Original Message -----
| From: "Paritosh Ranjan" <pr...@xebia.com>
| To: user@mahout.apache.org
| Sent: Tuesday, September 18, 2012 11:28:28 PM
| Subject: Re: Clustering large files using hadoop?
| 
| KMeansDriver has a run method with a flag runSequential. When you
| will
| mark it to false, it will use the hadoop cluster to scale. kmeans
| command is also having this flag.
| 
| "
| 
| In the process, I have been able to vectorize the data points  and
| use the
| clustering results of K-means to feed it as the initial centroid to
| Fuzzy
| K-means clustering.
| 
| "
| You can also use Canopy clustering for initial seeding, as its a
| single
| iteration clustering algorithm and produces good results if proper
| t1,t2
| values are provided.
| https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering
| 
| 
| On 19-09-2012 11:47, Rahul Mishra wrote:
| > I have been able to cluster and generate results for small csv
| > files(having
| > only continuous values) on a local system using eclipse and it
| > works
| > smoothly.
| > In the process, I have been able to vectorize the data points  and
| > use the
| > clustering results of K-means to feed it as the initial centroid to
| > Fuzzy
| > K-means clustering.
| >
| > But, in the end I am able to do it only for small files . For files
| > having
| > 2 million rows, it simply shows error out of memory.
| > But, since Mahout is for large scale machine learning , how do I
| > convert my
| > code to use the power of map-reduce framework of hadoop.[info: I
| > have
| > access to a 3-node Cluster having hadoop]
| > Can anyone suggest a step-by-step procedure?
| >
| > I have also looked into the clustering chapters of the book "Mahout
| > in
| > Action" but to my dismay did not find any clue.
| >
| 
| 
| 

Re: Clustering large files using hadoop?

Posted by Paritosh Ranjan <pr...@xebia.com>.
KMeansDriver has a run method with a runSequential flag. When you set it
to false, it will use the Hadoop cluster to scale. The kmeans command also
has this flag.

"

In the process, I have been able to vectorize the data points  and use the
clustering results of K-means to feed it as the initial centroid to Fuzzy
K-means clustering.

"
You can also use Canopy clustering for the initial seeding, as it is a
single-iteration clustering algorithm and produces good results if proper
t1, t2 values are provided.
https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering
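
For illustration, a rough sketch of chaining the two drivers with
runSequential = false. This assumes the Mahout in Action-era (roughly
0.5/0.6) run() signatures, which later releases extend (e.g. with a
classification threshold), so check the overloads in your version; the
paths, t1/t2 values, iteration count, and the "clusters-0" directory name
are placeholders:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.mahout.clustering.canopy.CanopyDriver;
  import org.apache.mahout.clustering.kmeans.KMeansDriver;
  import org.apache.mahout.common.distance.EuclideanDistanceMeasure;

  public class ClusterOnHadoop {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Path vectors = new Path("vectors");            // sequence file of vectors (placeholder)
      Path canopyOut = new Path("canopy-centroids"); // placeholder
      Path kmeansOut = new Path("kmeans-output");    // placeholder

      // One pass of canopy clustering to seed the centroids; t1/t2 are data-dependent guesses.
      CanopyDriver.run(conf, vectors, canopyOut,
          new EuclideanDistanceMeasure(), 3.0, 1.5,
          false,   // runClustering: only the canopy centers are needed here
          false);  // runSequential = false -> run as Hadoop map-reduce jobs

      // K-means seeded with the canopy centers ("clusters-0" is a typical output
      // directory name, but it can vary by Mahout version).
      KMeansDriver.run(conf, vectors, new Path(canopyOut, "clusters-0"), kmeansOut,
          new EuclideanDistanceMeasure(),
          0.001,   // convergence delta
          10,      // max iterations
          true,    // run the final clustering step over all points
          false);  // runSequential = false -> use the Hadoop cluster
    }
  }

If I recall correctly, the kmeans command line exposes the same choice
through its execution-method option (sequential vs. mapreduce).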


On 19-09-2012 11:47, Rahul Mishra wrote:
> I have been able to cluster and generate results for small csv files(having
> only continuous values) on a local system using eclipse and it works
> smoothly.
> In the process, I have been able to vectorize the data points  and use the
> clustering results of K-means to feed it as the initial centroid to Fuzzy
> K-means clustering.
>
> But, in the end I am able to do it only for small files . For files having
> 2 million rows, it simply shows error out of memory.
> But, since Mahout is for large scale machine learning , how do I convert my
> code to use the power of map-reduce framework of hadoop.[info: I have
> access to a 3-node Cluster having hadoop]
> Can anyone suggest a step-by-step procedure?
>
> I have also looked into the clustering chapters of the book "Mahout in
> Action" but to my dismay did not find any clue.
>