Posted to user@mahout.apache.org by Frank Scholten <fr...@frankscholten.nl> on 2014/02/03 22:52:37 UTC

Annotation based vectorizer

Hi all,

I put together a utility which vectorizes plain old Java objects annotated
with @Feature and @Target via Mahout's vector encoders.

See my GitHub branch:
https://github.com/frankscholten/mahout/tree/annotation-based-vectorizer

and the unit test:
https://github.com/frankscholten/mahout/blob/annotation-based-vectorizer/core/src/test/java/org/apache/mahout/classifier/sgd/AnnotationBasedVectorizerTest.java

Use it like this:

class NewsgroupPost {

  @Target
  private String newsgroup;

  @Feature(encoder = TextValueEncoder.class)
  private String newsgroup;

  // Getters & setters

}

AnnotationBasedVectorizer<NewsgroupPost> vectorizer =
    new AnnotationBasedVectorizer<NewsgroupPost>(new TypeReference<NewsgroupPost>(){});

At this point the vectorizer scans the annotations on NewsgroupPost. After
that you can do this:

NewsgroupPost post = ...

Vector vector = vectorizer.vectorize(post);
int target = vectorizer.getTarget(post);
int numFeatures = vectorizer.getNumberOfFeatures();
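
To illustrate the annotation scanning described above, here is a minimal,
self-contained sketch using plain reflection. Everything in it is an
assumption made for illustration: the Feature and Target annotations are
simplified stand-ins (the real @Feature also names an encoder class),
ExamplePost and AnnotationScanSketch are hypothetical names, and a simple
hashed token count replaces Mahout's vector encoders. It is not the code on
the branch.

```java
import java.lang.annotation.Annotation;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the branch's annotations.
// RUNTIME retention is required so reflection can see them.
@Retention(RetentionPolicy.RUNTIME)
@interface Feature {}

@Retention(RetentionPolicy.RUNTIME)
@interface Target {}

// Hypothetical POJO mirroring the example in the post.
class ExamplePost {
  @Target
  private final String newsgroup;

  @Feature
  private final String bodyText;

  ExamplePost(String newsgroup, String bodyText) {
    this.newsgroup = newsgroup;
    this.bodyText = bodyText;
  }
}

class AnnotationScanSketch {

  // Returns the names of all declared fields that carry the given annotation.
  static List<String> annotatedFields(Class<?> type, Class<? extends Annotation> annotation) {
    List<String> names = new ArrayList<String>();
    for (Field field : type.getDeclaredFields()) {
      if (field.isAnnotationPresent(annotation)) {
        names.add(field.getName());
      }
    }
    return names;
  }

  // Hashes every whitespace-separated token of each @Feature field
  // into a fixed-size array (a stand-in for a real vector encoder).
  static double[] vectorize(Object pojo, int cardinality) {
    double[] vector = new double[cardinality];
    for (Field field : pojo.getClass().getDeclaredFields()) {
      if (!field.isAnnotationPresent(Feature.class)) {
        continue;
      }
      field.setAccessible(true); // the annotated fields are private
      Object value;
      try {
        value = field.get(pojo);
      } catch (IllegalAccessException e) {
        throw new IllegalStateException("Cannot read field " + field.getName(), e);
      }
      if (value == null) {
        continue;
      }
      for (String token : value.toString().trim().split("\\s+")) {
        if (token.isEmpty()) {
          continue;
        }
        vector[Math.floorMod(token.hashCode(), cardinality)] += 1.0;
      }
    }
    return vector;
  }
}
```

Calling AnnotationScanSketch.vectorize(new ExamplePost("comp.graphics",
"hello hello world"), 16) yields a 16-element array whose entries sum to 3,
one count per token of the @Feature field.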

Note that the vectorize() and getTarget() methods are generically typed;
because of the type token passed to the constructor, we can enforce that
only NewsgroupPost instances are accepted.
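
For readers unfamiliar with type tokens, the sketch below shows the common
"super type token" idiom in isolation. The TypeReference class here is a
hypothetical stand-in written for this example, and TypedVectorizerSketch is
an invented name; neither is taken from the branch. In this sketch the token
gives the constructor runtime access to the class, while the type parameter
T is what lets the compiler reject arguments of other types.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;

// Hypothetical type token: an anonymous subclass records its type argument,
// which stays available at runtime through the generic superclass.
abstract class TypeReference<T> {
  private final Type type;

  protected TypeReference() {
    Type superclass = getClass().getGenericSuperclass();
    if (!(superclass instanceof ParameterizedType)) {
      throw new IllegalArgumentException("TypeReference needs a type argument");
    }
    this.type = ((ParameterizedType) superclass).getActualTypeArguments()[0];
  }

  Type getType() {
    return type;
  }
}

// Hypothetical vectorizer shell that only resolves the POJO class.
class TypedVectorizerSketch<T> {
  private final Class<?> pojoClass;

  TypedVectorizerSketch(TypeReference<T> reference) {
    Type type = reference.getType();
    if (!(type instanceof Class)) {
      // Parameterized POJO types are out of scope for this sketch.
      throw new IllegalArgumentException("Only plain classes are supported: " + type);
    }
    this.pojoClass = (Class<?>) type;
  }

  // The class whose annotations a real implementation would scan.
  Class<?> pojoClass() {
    return pojoClass;
  }

  // Accepts only T, so passing any other type is a compile-time error.
  String describe(T pojo) {
    return pojoClass.getSimpleName() + ": " + pojo;
  }
}
```

With new TypedVectorizerSketch<String>(new TypeReference<String>() {}),
pojoClass() returns String.class, and describe(42) does not compile.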

The vectorizer uses a Dictionary for encoding the target.
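
As a rough illustration of what encoding the target through a dictionary
means, here is a minimal sketch that assigns each distinct label the next
free integer code. TargetDictionarySketch and its methods are hypothetical
names for this example; it is a stand-in and not Mahout's Dictionary class.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical label dictionary: maps each distinct label to a stable int code.
class TargetDictionarySketch {
  private final Map<String, Integer> codes = new HashMap<String, Integer>();
  private final List<String> labels = new ArrayList<String>();

  // Returns the existing code for a label, or assigns the next free one.
  int intern(String label) {
    Integer code = codes.get(label);
    if (code == null) {
      code = labels.size();
      codes.put(label, code);
      labels.add(label);
    }
    return code;
  }

  // Reverse lookup from code to label.
  String labelOf(int code) {
    return labels.get(code);
  }

  // Number of distinct labels seen so far.
  int size() {
    return labels.size();
  }
}
```

Because codes depend on the order in which labels are first seen, two data
sets only get compatible targets if they are encoded with the same
dictionary instance.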

Thoughts?

Cheers,

Frank

Re: Annotation based vectorizer

Posted by Ted Dunning <te...@gmail.com>.
Looks nice.

Where is the dictionary injected?

Would type inference of the sort used in Guava's Lists.newArrayList() help
with the verbosity?
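
For illustration, the Guava-style inference asked about above relies on a
generic static factory method, so the type argument is written only once.
The sketch below uses hypothetical names (VectorizerSketch, forClass) and a
plain Class token; it is not the API on the branch.

```java
// Hypothetical vectorizer shell with a static factory method.
class VectorizerSketch<T> {
  private final Class<T> type;

  private VectorizerSketch(Class<T> type) {
    this.type = type;
  }

  // T is inferred from the argument, so callers do not repeat it.
  static <T> VectorizerSketch<T> forClass(Class<T> type) {
    return new VectorizerSketch<T>(type);
  }

  Class<T> type() {
    return type;
  }
}
```

Usage then shrinks to a single mention of the type on the right-hand side:
VectorizerSketch<String> v = VectorizerSketch.forClass(String.class);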

What is the type reference used for?

What if the POJO has a Vector in it?  Is there a way to deal with that?

How can I vectorize a second (test) data set compatibly with the first?
(That is, how do I pass the Dictionary to the second case?)






Re: Annotation based vectorizer

Posted by Frank Scholten <fr...@frankscholten.nl>.
The second field of NewsgroupPost (the one annotated with @Feature) should
be called bodyText, of course.


