Posted to user@mahout.apache.org by Sameer Tilak <ss...@live.com> on 2013/12/05 18:35:56 UTC

Elephant-Bird, Pig, and Mahout

Hi All,
I have some questions about using EB's VectorWritableConverter in my Pig script for data vectorization.
I am generating the tuples using a UDF; however, for simplicity I am loading the data from a file in the following code. My UDF returns tuples of the form (1,0,1,1...).

My map.dat file has the following format:

1,0,1,1
0,1,1,1,
0,0,1,1,
1,1,0,0,
.......
.......
........

I register the necessary jar files. 

%declare SEQFILE_LOADER 'com.twitter.elephantbird.pig.load.SequenceFileLoader';
%declare TEXT_CONVERTER 'com.twitter.elephantbird.pig.util.TextConverter';
%declare LONG_CONVERTER 'com.twitter.elephantbird.pig.util.LongWritableConverter';
%declare VECTOR_CONVERTER 'com.twitter.elephantbird.pig.mahout.VectorWritableConverter';

/* Loading from a file instead of UDF for simplicity */

A = LOAD 'map.dat';

/* I am not sure how to use VectorWritableConverter to convert the tuples in relation A into vectors. */
B = FOREACH A GENERATE $VECTOR_CONVERTER();

DUMP B;
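
For comparison, here is a rough, untested sketch of one way the wiring might look. It assumes four clean comma-separated columns per row, borrows the SEQFILE_STORAGE declare and the '-c' converter syntax from the reply below, and synthesizes a row key with RANK (available in Pig 0.11+), since SequenceFileStorage writes the first field of each tuple as the key and the second as the value:

%declare SEQFILE_STORAGE 'com.twitter.elephantbird.pig.store.SequenceFileStorage';

/* Typed load of the comma-separated rows; each field becomes one vector element. */
M = LOAD 'map.dat' USING PigStorage(',') AS (f1:int, f2:int, f3:int, f4:int);

/* RANK prepends a long row number ($0 below) that can serve as the key. */
R = RANK M;

/* Two fields per output tuple: the key, and a tuple holding the vector elements. */
KV = FOREACH R GENERATE $0 AS id, TOTUPLE($1, $2, $3, $4) AS v;

/* The key goes through LongWritableConverter and the tuple through
   VectorWritableConverter, producing a <LongWritable, VectorWritable>
   sequence file that Mahout can read. */
STORE KV INTO 'vectorized' USING $SEQFILE_STORAGE ('-c $LONG_CONVERTER', '-c $VECTOR_CONVERTER');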

RE: Elephant-Bird, Pig, and Mahout

Posted by Sameer Tilak <ss...@live.com>.
Hi Andrew et al.,
I have the following statements in my Pig script:
AU = FOREACH A GENERATE myparser.myUDF(param1, param2);
STORE AU into '/scratch/AU';
AU has the following format:
(userid, (item_view_history))
(27,(0,1,1,0,0))
(28,(0,0,1,0,0))
(29,(0,0,1,0,1))
(30,(1,0,1,0,1))
I will have at least a few hundred thousand numbers in (item_view_history); for readability I am showing only 5 here.

VectorizedInput = FOREACH AU GENERATE FLATTEN($0);
/* I am assuming the field userid will be used as the key and written using $INT_CONVERTER, and that the tuple will be written using $VECTOR_CONVERTER. Is this correct? */
STORE VectorizedInput into '/scratch/VectorizedInput' using $SEQFILE_STORAGE ('-c $INT_CONVERTER', '-c $VECTOR_CONVERTER');
I can see that /scratch/VectorizedInput contains part- files, but they are binary, so it is hard to tell whether the script is correct. Can anyone comment on whether my understanding of SEQFILE_STORAGE and VECTOR_CONVERTER is correct?
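
One way to sanity-check those binary part- files, as a rough sketch: read them back through Elephant-Bird's SequenceFileLoader with the same two converters and dump a few records. This is untested and assumes SequenceFileLoader takes the same '-c' arguments as SequenceFileStorage, and that INT_CONVERTER is declared as shown:

%declare SEQFILE_LOADER 'com.twitter.elephantbird.pig.load.SequenceFileLoader';
%declare INT_CONVERTER 'com.twitter.elephantbird.pig.util.IntWritableConverter';

/* Load the <IntWritable, VectorWritable> pairs back as (key, vector tuple). */
Check = LOAD '/scratch/VectorizedInput'
    USING $SEQFILE_LOADER ('-c $INT_CONVERTER', '-c $VECTOR_CONVERTER');
CheckSample = LIMIT Check 10;
DUMP CheckSample;

Alternatively, Mahout's sequence file dumper (something like: mahout seqdumper -i /scratch/VectorizedInput) should print the keys and vectors as text.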


> Date: Thu, 5 Dec 2013 10:21:16 -0800
> Subject: Re: Elephant-Bird, Pig, and Mahout
> From: andrew.musselman@gmail.com
> To: user@mahout.apache.org
> 
> There's an example on the Readme at
> https://github.com/kevinweil/elephant-bird/blob/master/Readme.md
> 
> Do you have a key to use for each vector?
> 
> I've done stuff like this, and I don't know off-hand if you need to have
> your records in a tuple to use VectorWritableConverter.
> 
> register path/to/lib/mahout/mahout-*.jar
> register path/to/elephant-bird-hadoop*.jar
> register path/to/elephant-bird-hadoop*.jar
> register path/to/elephant-bird-mahout*.jar
> register path/to/elephant-bird-pig*.jar
> %declare SEQFILE_STORAGE
> 'com.twitter.elephantbird.pig.store.SequenceFileStorage';
> %declare INT_CONVERTER
> 'com.twitter.elephantbird.pig.util.IntWritableConverter';
> %declare LONG_CONVERTER
> 'com.twitter.elephantbird.pig.util.LongWritableConverter';
> %declare VECTOR_CONVERTER
> 'com.twitter.elephantbird.pig.mahout.VectorWritableConverter';
> a = load 'input' as (
>   pid: long,
>   v: (
>     f1: int,
>     f2: int,
>     f3: int));
> 
> store a into 'output' using $SEQFILE_STORAGE ('-c $LONG_CONVERTER', '-c
> $VECTOR_CONVERTER');
> 
> 

Re: Elephant-Bird, Pig, and Mahout

Posted by Andrew Musselman <an...@gmail.com>.
There's an example on the Readme at
https://github.com/kevinweil/elephant-bird/blob/master/Readme.md

Do you have a key to use for each vector?

I've done stuff like this, and I don't know off-hand if you need to have
your records in a tuple to use VectorWritableConverter.

register path/to/lib/mahout/mahout-*.jar
register path/to/elephant-bird-hadoop*.jar
register path/to/elephant-bird-hadoop*.jar
register path/to/elephant-bird-mahout*.jar
register path/to/elephant-bird-pig*.jar
%declare SEQFILE_STORAGE
'com.twitter.elephantbird.pig.store.SequenceFileStorage';
%declare INT_CONVERTER
'com.twitter.elephantbird.pig.util.IntWritableConverter';
%declare LONG_CONVERTER
'com.twitter.elephantbird.pig.util.LongWritableConverter';
%declare VECTOR_CONVERTER
'com.twitter.elephantbird.pig.mahout.VectorWritableConverter';
a = load 'input' as (
  pid: long,
  v: (
    f1: int,
    f2: int,
    f3: int));

store a into 'output' using $SEQFILE_STORAGE ('-c $LONG_CONVERTER', '-c $VECTOR_CONVERTER');
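
A small usage note: if 'input' is read with Pig's default PigStorage loader under that AS schema, each line would need to be tab-separated, with the three vector fields written in Pig's tuple syntax, for example:

27	(0,1,1)
28	(0,0,1)

The store line should then produce a sequence file of <LongWritable, VectorWritable> pairs under 'output', which Mahout's vector-based jobs can consume.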

