Posted to user@mahout.apache.org by jamal sasha <ja...@gmail.com> on 2014/05/23 07:31:46 UTC

Converting to sequence file in mahout

Hi,
   I have data where each row is a comma-separated vector,
and these are a bunch of text files, like:
0.123,01433,0.932
0.129,0.932,0.123
I want to run Mahout's rowIdSimilarity module on it, but I am guessing
the input requirement is different.
How do I convert these CSV vectors into the format consumed by Mahout's
rowIdSimilarity module?
Thanks

Re: Converting to sequence file in mahout

Posted by Mohit Singh <mo...@gmail.com>.
Hi Jamal,
  I can probably answer here, since I modified the same code to get
started.
In the code you pasted, change the following to adhere to the requirement:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class SequenceOutput {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration(true);
    FileSystem fs = FileSystem.get(conf);

    // The input file is on the local filesystem, not in HDFS
    BufferedReader reader = new BufferedReader(new FileReader(args[1]));
    Path filePath = new Path(args[2]);
    // Delete the previous output if it exists
    if (fs.exists(filePath))
      fs.delete(filePath, true);
    // RowSimilarityJob expects IntWritable keys, not Text
    SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
        filePath, IntWritable.class, VectorWritable.class);

    // Run through the input file
    String line;
    int label = 0; // running row counter used as the key
    while ((line = reader.readLine()) != null) {
      // The try/catch skips non-numeric lines such as a header
      try {
        // Split on the given separator (args[3])
        String[] c = line.split(args[3]);
        if (c.length > 1) {
          // Parse every column as a feature
          double[] d = new double[c.length];
          for (int i = 0; i < c.length; i++)
            d[i] = Double.parseDouble(c[i]);
          Vector vec = new DenseVector(d);
          VectorWritable writable = new VectorWritable(vec);
          // Write the (rowId, vector) pair to the sequence file
          writer.append(new IntWritable(label++), writable);
        }
      } catch (NumberFormatException e) {
        continue;
      }
    }
    writer.close();
    reader.close();
  }
}

Just figure out a way to assign a label (key) to each of your vectors; I
just used a running counter.
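Stripped of the Hadoop plumbing, the key idea is just a running counter over
the successfully parsed lines; here is a minimal stdlib-only sketch (the
CsvRows class and its parse method are hypothetical names for illustration,
not part of Mahout):

```java
import java.util.ArrayList;
import java.util.List;

public class CsvRows {
    // Parse CSV lines into feature arrays, skipping non-numeric rows
    // such as a header. A row's position in the returned list is the
    // integer key you would wrap in an IntWritable.
    public static List<double[]> parse(List<String> lines, String sep) {
        List<double[]> rows = new ArrayList<>();
        for (String line : lines) {
            // Note: String.split treats the separator as a regex, so
            // separators like "|" or "." must be escaped ("\\|", "\\.")
            String[] c = line.split(sep);
            try {
                double[] d = new double[c.length];
                for (int i = 0; i < c.length; i++)
                    d[i] = Double.parseDouble(c[i]);
                rows.add(d);
            } catch (NumberFormatException e) {
                // header or malformed line: skip it
            }
        }
        return rows;
    }
}
```

The skipped header never consumes a key, so the keys stay dense, which is
what the downstream job wants.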


On Fri, May 23, 2014 at 11:39 AM, Andrew Musselman <
andrew.musselman@gmail.com> wrote:




-- 
Mohit

"When you want success as badly as you want the air, then you will get it.
There is no other secret of success."
-Socrates

Re: Converting to sequence file in mahout

Posted by Andrew Musselman <an...@gmail.com>.
You could also look at using Pig with the elephant-bird package for creating
sequence files.

There's an example on the Readme at https://github.com/kevinweil/elephant-
bird/blob/master/Readme.md


On Fri, May 23, 2014 at 11:05 AM, jamal sasha <ja...@gmail.com> wrote:


Re: Converting to sequence file in mahout

Posted by jamal sasha <ja...@gmail.com>.
Hi,
  I tried to use one of the implementations; here is a copy-paste for
reference:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class SequenceOutput {
  public static void main(String[] args) throws IOException,
      InterruptedException, ClassNotFoundException {
    Configuration conf = new Configuration(true);
    FileSystem fs = FileSystem.get(conf);

    // The input file is not in HDFS
    BufferedReader reader = new BufferedReader(new FileReader(args[1]));
    Path filePath = new Path(args[2]);
    // Delete previous file if it exists
    if (fs.exists(filePath))
      fs.delete(filePath, true);
    SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
        filePath, Text.class, VectorWritable.class);

    // Run through the input file
    String line;
    System.out.println(args[3].length());
    while ((line = reader.readLine()) != null) {
      // The try/catch skips the exception thrown when the header is parsed
      try {
        // Split with the given separator
        String[] c = line.split(args[3]);
        if (c.length > 1) {
          double[] d = new double[c.length];
          // Get the feature set
          for (int i = 1; i < c.length; i++)
            d[i] = Double.parseDouble(c[i]);
          // Put it in a vector
          Vector vec = new DenseVector(c.length);
          vec.assign(d);
          VectorWritable writable = new VectorWritable();
          writable.set(vec);

          // Create a label with a / and the class label
          String label = c[0] + "/" + c[0];

          // Write it all to the seqfile
          writer.append(new Text(label), writable);
        }
      } catch (NumberFormatException e) {
        continue;
      }
    }
    writer.close();
    reader.close();
  }
}


It generates the output, but then the rowSimilarity job throws an error
when I run it:
14/05/23 11:01:02 INFO mapreduce.Job: Task Id :
attempt_1400790649200_0044_m_000000_1, Status : FAILED
Error: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be
cast to org.apache.hadoop.io.IntWritable
at org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$VectorNormMapper.map(RowSimilarityJob.java:184)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)

14/05/23 11:01:02 INFO mapreduce.Job: Task Id :
attempt_1400790649200_0044_m_000001_1, Status : FAILED
Error: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be
cast to org.apache.hadoop.io.IntWritable
at org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$VectorNormMapper.map(RowSimilarityJob.java:184)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)

Any clues?


On Fri, May 23, 2014 at 1:55 AM, Suneel Marthi <sm...@apache.org> wrote:


Re: Converting to sequence file in mahout

Posted by Suneel Marthi <sm...@apache.org>.
The input needs to be converted to a SequenceFile of vectors in order to be
processed by Mahout's pipeline. This has been asked a few times recently;
search the mail archives for Kevin Moulart's recent posts on doing this.

The converted vectors are then fed to RowIdJob, which outputs a matrix and
a docIndex; then feed the matrix (which is a DRM) to RowSimilarityJob.
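Concretely, with the standard Mahout 0.x CLI the chain looks roughly like
the sketch below; the paths are placeholders, and the flag names should be
checked against `mahout rowid --help` and `mahout rowsimilarity --help`
for your version:

```shell
# Assign integer row ids to the vectors, producing
# <rowid-out>/matrix (a DRM) and <rowid-out>/docIndex
mahout rowid \
  --input /path/to/vectors \
  --output /path/to/rowid-out

# Compute pairwise row similarities over the DRM
mahout rowsimilarity \
  --input /path/to/rowid-out/matrix \
  --output /path/to/similarity-out \
  --numberOfColumns 3 \
  --similarityClassname SIMILARITY_COSINE
```

`--numberOfColumns` is the vector dimensionality (3 for the example rows
in this thread), and the similarity measure is whatever you choose from
Mahout's VectorSimilarityMeasures.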




On Fri, May 23, 2014 at 1:31 AM, jamal sasha <ja...@gmail.com> wrote:
