Posted to user@mahout.apache.org by chirag lakhani <ch...@gmail.com> on 2015/01/07 23:20:37 UTC
consistency of StaticWordValueEncoder
I am trying to vectorize text data for a Naive Bayes classifier that will be
trained in Hadoop; the resulting model will then be deployed in a Java app.
My basic approach is to tokenize a string of text data using Lucene and then
encode each token using a StaticWordValueEncoder. Here are the relevant code
snippets:
    private static FeatureVectorEncoder memoEncoder = new StaticWordValueEncoder("memo");

    Vector v = new RandomAccessSparseVector(FEATURES);
    StringReader reader = new StringReader(text);
    StandardTokenizer source = new StandardTokenizer(Version.LUCENE_46, reader);
    ShingleFilter sf = new ShingleFilter(source);
    sf.setOutputUnigrams(true);
    CharTermAttribute charTermAttribute = sf.addAttribute(CharTermAttribute.class);
    sf.reset();
    while (sf.incrementToken()) {
        memoEncoder.addToVector(charTermAttribute.toString(), 1, v);
    }
In the Mahout in Action book I got the impression that the term "memo" will
seed the random number generator and I wanted to confirm that means I will
have consistency if I deploy this vectorizer in both my Hadoop environment
as well as my Java app. In particular, I am fixing the vector size to be
of length FEATURES and I am using "memo" as the name of my encoder. Will
those two things guarantee consistency of my text vectorization?
Re: consistency of StaticWordValueEncoder
Posted by chirag lakhani <ch...@gmail.com>.
Thanks! Is that standard practice or do people typically serialize their
encoders and then load the binaries later?
On Wed, Jan 7, 2015 at 5:25 PM, Ted Dunning <te...@gmail.com> wrote:
> On Wed, Jan 7, 2015 at 2:20 PM, chirag lakhani <ch...@gmail.com>
> wrote:
>
> > In the Mahout in Action book I got the impression that the term "memo"
> > will seed the random number generator and I wanted to confirm that means
> > I will have consistency if I deploy this vectorizer in both my Hadoop
> > environment as well as my Java app. In particular, I am fixing the vector
> > size to be of length FEATURES and I am using "memo" as the name of my
> > encoder. Will those two things guarantee consistency of my text
> > vectorization?
> >
>
> It should do.
>
> Anything else would be a bug (which is, of course, possible)
>
Re: consistency of StaticWordValueEncoder
Posted by Ted Dunning <te...@gmail.com>.
On Wed, Jan 7, 2015 at 2:20 PM, chirag lakhani <ch...@gmail.com>
wrote:
> In the Mahout in Action book I got the impression that the term "memo" will
> seed the random number generator and I wanted to confirm that means I will
> have consistency if I deploy this vectorizer in both my Hadoop environment
> as well as my Java app. In particular, I am fixing the vector size to be
> of length FEATURES and I am using "memo" as the name of my encoder. Will
> those two things guarantee consistency of my text vectorization?
>
It should do.
Anything else would be a bug (which is, of course, possible)
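
For anyone who wants to see why a fixed encoder name plus a fixed vector size
should be enough, here is a minimal, self-contained sketch of the hashed
encoding idea. It is not Mahout's code: the class HashedEncoderSketch, its
methods, and the use of CRC32 are all hypothetical stand-ins chosen so that
the example runs on the plain JDK, and Mahout's own hash function and slot
assignment will differ. The only point is that the slot depends solely on the
encoder name, the token, and the vector size, so two independently constructed
encoders (say, one in the Hadoop job and one in the Java app) produce the same
vector.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32;

// Hypothetical stand-in for a hashed word encoder; NOT Mahout's implementation.
public class HashedEncoderSketch {
    private final String name;      // plays the role of the encoder name, e.g. "memo"
    private final int numFeatures;  // plays the role of FEATURES

    public HashedEncoderSketch(String name, int numFeatures) {
        this.name = name;
        this.numFeatures = numFeatures;
    }

    // Map (encoder name, token) to a slot with a fully specified hash (CRC32),
    // so the result does not depend on the JVM, the machine, or the run.
    public int slotFor(String token) {
        CRC32 crc = new CRC32();
        crc.update((name + ":" + token).getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % numFeatures);
    }

    // Add the token's weight to its hashed slot.
    public void addToVector(String token, double weight, double[] vector) {
        vector[slotFor(token)] += weight;
    }

    // Build a fresh encoder and encode the tokens with weight 1 each.
    public static double[] encode(String name, int numFeatures, String[] tokens) {
        HashedEncoderSketch encoder = new HashedEncoderSketch(name, numFeatures);
        double[] v = new double[numFeatures];
        for (String t : tokens) {
            encoder.addToVector(t, 1.0, v);
        }
        return v;
    }

    public static void main(String[] args) {
        String[] tokens = {"wire", "transfer", "wire transfer"};
        // Two separately constructed encoders, as in two separate deployments.
        double[] a = encode("memo", 1000, tokens);
        double[] b = encode("memo", 1000, tokens);
        System.out.println(Arrays.equals(a, b)); // prints "true"
    }
}
```

The same kind of check can be run against the real vectorizer: encode a few
known strings in both environments and compare the resulting vectors. Note
that the tokenizer settings (Lucene version, shingle configuration) have to
match as well, since they determine which tokens reach the encoder.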