Posted to user@mahout.apache.org by Sameer Tilak <ss...@live.com> on 2013/12/05 18:35:56 UTC
Elephant-Bird, Pig, and Mahout
Hi All,
I have a question about using Elephant-Bird's VectorWritableConverter in my Pig script for data vectorization.
I generate the tuples with a UDF; for simplicity, though, the code below
loads the data from a file instead. My
UDF returns tuples of the form (1,0,1,1...) etc.
My map.dat file has the following format:
1,0,1,1
0,1,1,1,
0,0,1,1,
1,1,0,0,
.......
.......
........
I register the necessary jar files.
%declare SEQFILE_LOADER 'com.twitter.elephantbird.pig.load.SequenceFileLoader';
%declare TEXT_CONVERTER 'com.twitter.elephantbird.pig.util.TextConverter';
%declare LONG_CONVERTER 'com.twitter.elephantbird.pig.util.LongWritableConverter';
%declare VECTOR_CONVERTER 'com.twitter.elephantbird.pig.mahout.VectorWritableConverter';
/* Loading from a file instead of UDF for simplicity */
A = LOAD 'map.dat';
/*
I am not sure how to use VectorWritableConverter to convert each tuple
in the relation A to a vector */
B = FOREACH A GENERATE $VECTOR_CONVERTER();
DUMP B;
RE: Elephant-Bird, Pig, and Mahout
Posted by Sameer Tilak <ss...@live.com>.
Hi Andrew et al.,
I have the following statement in my pig script.
AU = FOREACH A GENERATE myparser.myUDF(param1, param2);
STORE AU into '/scratch/AU';
AU has the following format:
(userid, (item_view_history))
(27,(0,1,1,0,0))
(28,(0,0,1,0,0))
(29,(0,0,1,0,1))
(30,(1,0,1,0,1))
I will have at least a few hundred thousand numbers in (item_view_history); for readability I am showing just 5 here.
VectorizedInput = FOREACH AU GENERATE FLATTEN($0);
/* I am assuming the field userid will be used as the key and written using $INT_CONVERTER, and the tuple will be written using $VECTOR_CONVERTER. Is this correct? */
STORE VectorizedInput into '/scratch/VectorizedInput' using $SEQFILE_STORAGE ('-c $INT_CONVERTER', '-c $VECTOR_CONVERTER');
I can see that /scratch/VectorizedInput contains part- files. They are binary, so it is hard to tell whether the script is correct. Can anyone please comment on whether my understanding of SEQFILE_STORAGE and VECTOR_CONVERTER is correct?
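One way to inspect the binary output, sketched here under assumptions rather than taken from the thread, is to read the part- files back through the SequenceFileLoader declared in the first message, with the same two converters used for the STORE, and look at a handful of records. The relation names Check and Few are made up for illustration, and the jars are assumed to be registered as in the storing script. No AS schema is given, so DESCRIBE prints whatever schema the loader reports.

```pig
%declare SEQFILE_LOADER 'com.twitter.elephantbird.pig.load.SequenceFileLoader';
%declare INT_CONVERTER 'com.twitter.elephantbird.pig.util.IntWritableConverter';
%declare VECTOR_CONVERTER 'com.twitter.elephantbird.pig.mahout.VectorWritableConverter';

-- Read the key/value pairs back with the converters that were used to write them.
Check = LOAD '/scratch/VectorizedInput' USING $SEQFILE_LOADER ('-c $INT_CONVERTER', '-c $VECTOR_CONVERTER');

-- Show the schema the loader reports, then print only a few records,
-- since each vector may hold several hundred thousand values.
DESCRIBE Check;
Few = LIMIT Check 5;
DUMP Few;
```

If the keys and vectors printed match the (userid, (item_view_history)) records in AU, the converters were applied as intended. If Mahout's command-line tools are installed, `mahout seqdumper -i /scratch/VectorizedInput` should give a similar view without going through Pig.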
Re: Elephant-Bird, Pig, and Mahout
Posted by Andrew Musselman <an...@gmail.com>.
There's an example on the Readme at
https://github.com/kevinweil/elephant-bird/blob/master/Readme.md
Do you have a key to use for each vector?
I've done stuff like this, and I don't know off-hand if you need to have
your records in a tuple to use VectorWritableConverter.
register path/to/lib/mahout/mahout-*.jar
register path/to/elephant-bird-hadoop*.jar
register path/to/elephant-bird-mahout*.jar
register path/to/elephant-bird-pig*.jar
%declare SEQFILE_STORAGE 'com.twitter.elephantbird.pig.store.SequenceFileStorage';
%declare INT_CONVERTER 'com.twitter.elephantbird.pig.util.IntWritableConverter';
%declare LONG_CONVERTER 'com.twitter.elephantbird.pig.util.LongWritableConverter';
%declare VECTOR_CONVERTER 'com.twitter.elephantbird.pig.mahout.VectorWritableConverter';
a = load 'input' as (
    pid: long,
    v: (
        f1: int,
        f2: int,
        f3: int));
store a into 'output' using $SEQFILE_STORAGE ('-c $LONG_CONVERTER', '-c $VECTOR_CONVERTER');
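Applied to the map.dat sample from the original question, which has no key column, the same pattern might look like the sketch below. It rests on several assumptions that are not from the thread: Pig 0.11 or later (for RANK), exactly four values per line as in the sample rows, the row number being an acceptable key, and the hypothetical output path 'map_vectors'. The jars are assumed to be registered as above.

```pig
%declare SEQFILE_STORAGE 'com.twitter.elephantbird.pig.store.SequenceFileStorage';
%declare LONG_CONVERTER 'com.twitter.elephantbird.pig.util.LongWritableConverter';
%declare VECTOR_CONVERTER 'com.twitter.elephantbird.pig.mahout.VectorWritableConverter';

-- Four comma-separated values per line, as in the sample rows (assumption).
A = LOAD 'map.dat' USING PigStorage(',') AS (f1: int, f2: int, f3: int, f4: int);

-- RANK prepends a long row number named rank_A; it stands in for a real key here.
R = RANK A;

-- Wrap the values in a tuple so that each record has the shape (key, vector).
KV = FOREACH R GENERATE rank_A AS pid, TOTUPLE(f1, f2, f3, f4) AS v;

-- 'map_vectors' is a hypothetical output path.
STORE KV INTO 'map_vectors' USING $SEQFILE_STORAGE ('-c $LONG_CONVERTER', '-c $VECTOR_CONVERTER');
```

The point of the sketch is that the converter is passed to the storage function rather than called inside FOREACH ... GENERATE, so the script only has to arrange each record as a key followed by a tuple of numbers.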