Posted to user@mahout.apache.org by tanzek <ta...@gmail.com> on 2012/01/14 10:04:09 UTC

Help in vectorizing features

I have a file that contains some features. Apart from the header, each row
is a record, so the file looks like a relational table. All features are
numeric except the last one, which is nominal. Now I need to vectorize them to
feed logistic regression or other classification algorithms. But after
reading chapters 13 to 16 of <Mahout in Action>, I am puzzled by
the feature encoder, especially when I use the ContinuousValueEncoder. The
following code is from my real program:

FeatureVectorEncoder enc = new ContinuousValueEncoder("test");
Vector v1 = new DenseVector(20);  // 19 features + 1 class
String[] ftStr = fileReader.getLine().split(",");
for(int i=0; i<19; ++i){
    enc.addToVector(ftStr[i], v1);
    // enc.addToVector((byte[])null, Double.parseDouble(ftStr[i]), v1);
}
System.out.println(v1);   // *** I can't get the result I am familiar with.

Should I use ContinuousValueEncoder for this job? The feature encoder, or
feature hashing, is hard for me to understand. I have also tried dropping
the feature encoder, as in this code:

Vector v1 = new DenseVector(20);  // 19 features + 1 class
String[] ftStr = fileReader.getLine().split(",");
for(int i=0; i<19; ++i){
    v1.set(i, Double.parseDouble(ftStr[i]));
}
System.out.println(v1);   // *** now I can understand my code

Is this the right way to use Vector?

So, in all I have three questions:
1. What is the relationship between Vector and Encoder?
2. Is the Encoder essential to vectorize my features?
3. Why does the encoder work in an unfamiliar way, and how does it work?

Any help, discussion, materials or papers would be highly appreciated.
Thank you!

Re: Help in vectorizing features

Posted by Ted Dunning <te...@gmail.com>.

On Tue, Jan 17, 2012 at 3:45 AM, tanzek <ta...@gmail.com> wrote:

> Oh, thank you very much. I think I understand the usage of Vector now.
> Is what you describe called "feature hashing"?

Yes.

> Are there any materials or papers that explain it? Is the approach in the
> paper <Feature hashing for large scale multitask learning> by Weinberger
> et al. similar to the "feature hashing" in your book?
>

Yes.  This is a good reference.

The feature hashing in Mahout differs slightly in that we have multiple
fields and support multiple probes so that we can use smaller vectors.

The basic idea remains the same, however.
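
A minimal sketch of hashing with multiple fields and multiple probes, to make the idea above concrete. It is not Mahout's implementation: the class and method names are hypothetical, CRC32 stands in for whatever hash Mahout uses, and a plain double[] stands in for a Vector. Each (field, value) pair is written to several slots, one per probe, so a collision in one slot can still be told apart through the other slot, which is the usual argument for why a smaller vector suffices.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

class MultiProbeHashingSketch {
    /** Picks a slot in [0, size) for one (field, value, probe) triple. CRC32 is a stand-in hash. */
    static int slot(String field, String value, int probe, int size) {
        CRC32 crc = new CRC32();
        crc.update((field + "|" + value + "|" + probe).getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % size);
    }

    /** Adds the weight to one slot per probe for this field/value pair. */
    static void add(String field, String value, double weight, int probes, double[] v) {
        for (int p = 0; p < probes; p++) {
            v[slot(field, value, p, v.length)] += weight;
        }
    }

    public static void main(String[] args) {
        double[] v = new double[32];
        add("color", "red", 1.0, 2, v); // one nominal field
        add("city", "red", 1.0, 2, v);  // same value in a different field hashes separately
        double sum = 0.0;
        for (double x : v) {
            sum += x;
        }
        System.out.println("total weight = " + sum); // 4.0: two probes for each of two features
    }
}
```

Including the field name in the hashed key is what keeps "red" as a color separate from "red" as a city name.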

Re: Help in vectorizing features

Posted by tanzek <ta...@gmail.com>.

Oh, thank you very much. I think I understand the usage of Vector now.
Is what you describe called "feature hashing"? Are there any
materials or papers that explain it? Is the approach in the paper
<Feature hashing for large scale multitask learning> by Weinberger
et al. similar to the "feature hashing" in your book?


2012/1/17 Ted Dunning <te...@gmail.com>

> What you have is almost correct.  Usually, however, you don't want to
> encode your class in a single slot, but rather allocate k slots for a
> nominal variable that can have k values and set one of those values to 1
> with all others set to 0.
>
> If you do this, then you don't need the *Encoder stuff at all.
>
> On the other hand, if k is very large or even not known or you have a bag
> of nominals with large or unknown k, then you need the Encoder framework.
>  In that case, you will need to have a large vector to encode into, but not
> as large as k (which is good since you don't even know how big that is).

Re: Recommender system - feedback request

Posted by Szymon Chojnacki <sa...@o2.pl>.

Thank you Ted,
Cheers
On 16 January 2012 at 22:36, Ted Dunning <ted.dunning@gmail.com> wrote:
Sounds like a natural application of nearest neighbor techniques. My guess is that if the size of your set A is moderate then the mahout recommendation engine will work with the addition of a specialized distance function. Data sets of 100,000 examples or more are probably just fine. 
Sent from my iPhone
--
Szymon Chojnacki http://www.ipipan.eu/~sch/

Re: Recommender system - feedback request

Posted by Ted Dunning <te...@gmail.com>.

Sounds like a natural application of nearest neighbor techniques. My guess is that if the size of your set A is moderate then the mahout recommendation engine will work with the addition of a specialized distance function.  Data sets of 100,000 examples or more are probably just fine. 
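
One possible reading of the nearest-neighbor suggestion, sketched without Mahout: find the known point in A closest to the query, then rank the generated candidates in B by their distance to the B point paired with that neighbor. Everything here is an assumption made for illustration, including the class and method names, the use of a single neighbor, and plain Euclidean distance as a placeholder for the specialized distance function mentioned above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class NearestNeighborRankingSketch {
    /** Euclidean distance; a placeholder for a problem-specific distance function. */
    static double distance(double[] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - y[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Returns the indices of the topN candidates, best first. */
    static List<Integer> bestCandidates(double[] queryA, double[][] trainA, double[][] trainB,
                                        double[][] candidatesB, int topN) {
        // Find the training example whose A point is closest to the query.
        int nearest = 0;
        for (int i = 1; i < trainA.length; i++) {
            if (distance(queryA, trainA[i]) < distance(queryA, trainA[nearest])) {
                nearest = i;
            }
        }
        double[] target = trainB[nearest];
        // Sort candidate indices by distance to the B point of that neighbor.
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < candidatesB.length; i++) {
            order.add(i);
        }
        order.sort(Comparator.comparingDouble(idx -> distance(candidatesB[idx], target)));
        return new ArrayList<>(order.subList(0, Math.min(topN, order.size())));
    }

    public static void main(String[] args) {
        double[][] trainA = {{0, 0}, {10, 10}};
        double[][] trainB = {{1, 1}, {5, 5}};
        double[][] candidates = {{0, 0}, {5, 6}, {4, 4}, {9, 9}};
        System.out.println(bestCandidates(new double[] {9, 9}, trainA, trainB, candidates, 2)); // [1, 2]
    }
}
```

In the setting described in the original question, the candidate list would hold the roughly 60 generated points and topN would be 5.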

Sent from my iPhone

On Jan 16, 2012, at 11:35, Szymon Chojnacki <sa...@o2.pl> wrote:

> Hi,
> 
> My request is not directly connected to the Mahout software. I would like to ask for feedback from ML practitioners in the Mahout community. I am looking for a recommender algorithm that could be used in the following situation:
> 
> 1. As input we have only positive examples mapping points from N-dimensional space A to another N-dimensional space B
> 2. We have a generator that creates plausible points in B (around 60) for any given point in A
> 3. We would like to select the best 5 points in B (from the generated ones; the points are usually unique)
> 
> The recommender system is used to automatically arrange office layouts with furniture. Some dimensions of A are: area, number of doors, area of windows. Some dimensions of B are: area occupied by desks / total area, price of all furniture, the variance of the distribution of mass centers.
> 
> Currently a simple algorithm is implemented in a tool called Reterio ( http://www.reterio.com ), but I am looking for any publications describing a general approach to this problem.
> 
> I will be grateful for any feedback.
> 
> Szymon Chojnacki
> 
> 
>

Recommender system - feedback request

Posted by Szymon Chojnacki <sa...@o2.pl>.

Hi,

My request is not directly connected to the Mahout software. I would like to ask for feedback from ML practitioners in the Mahout community. I am looking for a recommender algorithm that could be used in the following situation:

1. As input we have only positive examples mapping points from N-dimensional space A to another N-dimensional space B
2. We have a generator that creates plausible points in B (around 60) for any given point in A
3. We would like to select the best 5 points in B (from the generated ones; the points are usually unique)

The recommender system is used to automatically arrange office layouts with furniture. Some dimensions of A are: area, number of doors, area of windows. Some dimensions of B are: area occupied by desks / total area, price of all furniture, the variance of the distribution of mass centers.

Currently a simple algorithm is implemented in a tool called Reterio ( http://www.reterio.com ), but I am looking for any publications describing a general approach to this problem.

I will be grateful for any feedback.

Szymon Chojnacki

Re: Help in vectorizing features

Posted by Ted Dunning <te...@gmail.com>.

What you have is almost correct.  Usually, however, you don't want to
encode your class in a single slot, but rather allocate k slots for a
nominal variable that can have k values and set one of those slots to 1
with all others set to 0.

If you do this, then you don't need the *Encoder stuff at all.
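
A minimal sketch of the k-slot layout described above, written against a plain double[] rather than Mahout's Vector so that it runs without dependencies. The class name, the encode helper, the sample row and the category list are all hypothetical; the only assumption about the data is that the numeric columns come first and the nominal column comes last.

```java
import java.util.Arrays;
import java.util.List;

class OneHotSketch {
    /** Copies the numeric columns, then sets one of k slots to 1 for the nominal column. */
    static double[] encode(String line, int numNumeric, List<String> categories) {
        String[] fields = line.split(",");
        double[] v = new double[numNumeric + categories.size()];
        for (int i = 0; i < numNumeric; i++) {
            v[i] = Double.parseDouble(fields[i].trim());
        }
        String nominal = fields[numNumeric].trim();
        int slot = categories.indexOf(nominal);
        if (slot < 0) {
            throw new IllegalArgumentException("Unknown category: " + nominal);
        }
        v[numNumeric + slot] = 1.0; // all other category slots stay 0
        return v;
    }

    public static void main(String[] args) {
        List<String> categories = Arrays.asList("red", "green", "blue");
        double[] v = encode("1.5,2.0,green", 2, categories);
        System.out.println(Arrays.toString(v)); // [1.5, 2.0, 0.0, 1.0, 0.0]
    }
}
```

This only works when the full list of categories is known up front, which is exactly the condition under which the encoders are unnecessary.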

On the other hand, if k is very large or even not known or you have a bag
of nominals with large or unknown k, then you need the Encoder framework.
 In that case, you will need to have a large vector to encode into, but not
as large as k (which is good since you don't even know how big that is).
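
A rough sketch of the large-or-unknown-k case: each nominal value is hashed to a slot in a fixed-size vector, so no list of possible values is needed in advance. This is not Mahout's Encoder code. The names are hypothetical, CRC32 is a stand-in hash, and a double[] stands in for the Vector.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32;

class HashedNominalSketch {
    /** Maps a field/value pair to a slot in [0, size). CRC32 is a stand-in hash. */
    static int slotFor(String field, String value, int size) {
        CRC32 crc = new CRC32();
        crc.update((field + "=" + value).getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % size);
    }

    /** Adds one occurrence of the value to the vector; repeated values accumulate. */
    static void addNominal(String field, String value, double[] v) {
        v[slotFor(field, value, v.length)] += 1.0;
    }

    public static void main(String[] args) {
        // The vector size is fixed up front and does not depend on how many distinct words appear.
        double[] v = new double[16];
        for (String word : new String[] {"mahout", "vector", "mahout"}) {
            addNominal("word", word, v);
        }
        System.out.println(Arrays.toString(v));
    }
}
```

Two different values can land in the same slot. A larger vector makes such collisions rarer, which is the trade-off behind choosing a vector that is large but still smaller than k.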

On Mon, Jan 16, 2012 at 3:22 PM, tanzek <ta...@gmail.com> wrote:

> Hello Ted, I really need some help. Is there any problem with my questions?

Re: Help in vectorizing features

Posted by tanzek <ta...@gmail.com>.

Hello Ted, I really need some help. Is there any problem with my questions?
