Posted to user@mahout.apache.org by Stanley Xu <we...@gmail.com> on 2011/04/20 15:06:40 UTC

Will feature hashing collisions in SGD harm the performance of the algorithm?

Dear all,

As I understand it, feature hashing in SGD compresses the feature
dimensions into a fixed-length vector. Won't that make the training
result incorrect if hashing collisions happen? Won't two features hashed
to the same slot be treated as the same feature? Even if we use multiple
probes to reduce the total number of collisions, like a Bloom filter,
won't a slot with a collision look like a combination feature?

Thanks.

Best wishes,
Stanley Xu

Re: Will feature hashing collisions in SGD harm the performance of the algorithm?

Posted by Stanley Xu <we...@gmail.com>.
Thanks, Sebastian -- downloading. Need 20 hours...
Best wishes,
Stanley Xu



On Wed, Apr 20, 2011 at 9:09 PM, Sebastian Schelter <ss...@apache.org> wrote:

> Have a look at Ted's talk about Mahout's SGD classifier:
> http://vimeo.com/21273655
>
> As far as I remember he also covers the hashing issues you describe.
>
> --sebastian
>
>
> On 20.04.2011 15:06, Stanley Xu wrote:
>
>> Dear all,
>>
>> As I understand it, feature hashing in SGD compresses the feature
>> dimensions into a fixed-length vector. Won't that make the training
>> result incorrect if hashing collisions happen? Won't two features hashed
>> to the same slot be treated as the same feature? Even if we use multiple
>> probes to reduce the total number of collisions, like a Bloom filter,
>> won't a slot with a collision look like a combination feature?
>>
>> Thanks.
>>
>> Best wishes,
>> Stanley Xu
>>
>>
>

Re: Will feature hashing collisions in SGD harm the performance of the algorithm?

Posted by Sebastian Schelter <ss...@apache.org>.
Have a look at Ted's talk about Mahout's SGD classifier: 
http://vimeo.com/21273655

As far as I remember he also covers the hashing issues you describe.

--sebastian

On 20.04.2011 15:06, Stanley Xu wrote:
> Dear all,
>
> As I understand it, feature hashing in SGD compresses the feature
> dimensions into a fixed-length vector. Won't that make the training
> result incorrect if hashing collisions happen? Won't two features hashed
> to the same slot be treated as the same feature? Even if we use multiple
> probes to reduce the total number of collisions, like a Bloom filter,
> won't a slot with a collision look like a combination feature?
>
> Thanks.
>
> Best wishes,
> Stanley Xu
>


Re: Will feature hashing collisions in SGD harm the performance of the algorithm?

Posted by Ted Dunning <te...@gmail.com>.
The transformation aspects of Rapid Miner are likely similar to what I am
suggesting.

I haven't used Rapid Miner, though, so can't comment in detail.

On Mon, Apr 25, 2011 at 3:31 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Is Rapid Miner modelling closer to what you mean?
>

Re: Will feature hashing collisions in SGD harm the performance of the algorithm?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I see.

I guess you mean nested preprocessors vs. pipelined jobs.

There are some efforts, e.g. Rapid Miner, that allow doing more than
just input normalization in a formal model -- although I have not played
with it enough. But they do *something*; perhaps it could be a source of
inspiration. Is Rapid Miner modelling closer to what you mean?



On Mon, Apr 25, 2011 at 12:14 PM, Ted Dunning <te...@gmail.com> wrote:
> On Mon, Apr 25, 2011 at 12:04 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
>> I don't think stuff like pre-clustering or dimensionality reduction
>> should be included -- just the summarization, the hashing trick, and the
>> common strategies for parsing non-quantitative inputs covered in the book.
>>
>
> So you prefer the limited function option.
>
>
>> ...
>> But if there's pre-clustering and/or dimensionality reduction (PCA-like
>> stuff), that would be a pipeline, not just input processing? I don't
>> think of input processing as pipelined processing.
>>
>
> It isn't usually a pipeline as in map-reduce.  Yes, it is a set of pure
> functions applied to the input variables to produce the actual predictor
> variables.  Yes, these functions can be composed.
>
> If you are trying to do what Grant says (provide Mahout-as-a-service) then
> you need to provide some mechanism for adding these things.
>

Re: Will feature hashing collisions in SGD harm the performance of the algorithm?

Posted by Ted Dunning <te...@gmail.com>.
On Mon, Apr 25, 2011 at 12:04 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> I don't think stuff like pre-clustering or dimensionality reduction
> should be included -- just the summarization, the hashing trick, and the
> common strategies for parsing non-quantitative inputs covered in the book.
>

So you prefer the limited function option.


> ...
> But if there's pre-clustering and/or dimensionality reduction (PCA-like
> stuff), that would be a pipeline, not just input processing? I don't
> think of input processing as pipelined processing.
>

It isn't usually a pipeline as in map-reduce.  Yes, it is a set of pure
functions applied to the input variables to produce the actual predictor
variables.  Yes, these functions can be composed.

If you are trying to do what Grant says (provide Mahout-as-a-service) then
you need to provide some mechanism for adding these things.
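
As a minimal illustration of the composition idea above (a sketch only;
the names are hypothetical, not an existing Mahout API):

import java.util.function.Function;

// Feature preprocessing as composed pure functions: each step is a pure
// function of its input, and composition yields the predictor variable.
public class ComposedFeatures {
  public static void main(String[] args) {
    Function<String, String> parse = s -> s.trim().toLowerCase();   // parsing
    Function<String, Double> encode = s -> (double) s.length();     // basic encoding
    Function<Double, Double> standardize = x -> (x - 5.0) / 2.0;    // standardization

    // One composed input-to-predictor transformation.
    Function<String, Double> predictor = parse.andThen(encode).andThen(standardize);
    System.out.println(predictor.apply("  Some Raw Field  "));
  }
}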

Re: Will feature hashing collisions in SGD harm the performance of the algorithm?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I don't think stuff like pre-clustering or dimensionality reduction
should be included -- just the summarization, the hashing trick, and the
common strategies for parsing non-quantitative inputs covered in the
book. In addition we might leave space for writing custom strategies
(much as Hadoop leaves room for writing custom input formats).

But if there's pre-clustering and/or dimensionality reduction (PCA-like
stuff), that would be a pipeline, not just input processing? I don't
think of input processing as pipelined processing.

On Mon, Apr 25, 2011 at 11:16 AM, Ted Dunning <te...@gmail.com> wrote:
> The difficulty is that vectorization often incorporates various kinds of
> interpretation of the original data.  This can involve the nesting of field
> access, parsing, textual analysis as well as the basic vector encoding.  It
> may involve running a classifier (possibly derived by clustering) on some
> inputs to produce an input variable.
>
> How to specify this in full generality is a difficult problem.
>
> The complementary problem is how to restrict what you can do, but allow
> sufficient generality to meet most needs.  That is a hard problem as well.
>
> It may be that the solution is to just provide simple examples and tell
> people to write some Java (implements DataEncoder).  That isn't all bad.
>
> On Mon, Apr 25, 2011 at 10:56 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
>> I am not sure I see the difficulty, but it is possible we are talking
>> about slightly different things. Hadoop solves this through pluggable
>> strategies, such as InputFormat.
>>
>> Those strategies are parameterized (and also perhaps persisted) through
>> some form of declarative definitions (to keep the analogy with Hadoop,
>> it uses the Configuration mechanism to serialize something like that --
>> though of course property-based definitions are probably quite
>> underwhelming for this case). Similarly, Lucene defines Analyzer
>> preprocessing strategies. Surely we could define some strategies that
>> take rows of pre-standardized inputs and produce vectorized and
>> standardized inputs as a result.
>>
>> A somewhat bigger question is what to use for pre-vectorized inputs, as
>> Vector obviously won't handle various datatypes, esp. qualitative
>> inputs.
>>
>> But perhaps we already have some of this, I am not sure. I saw a fair
>> amount of classes that adapt various formats (what was it? TSV? ARFF?);
>> perhaps we could strategize those as well.
>>
>> On Fri, Apr 22, 2011 at 9:10 AM, Ted Dunning <te...@gmail.com>
>> wrote:
>> > Yes.
>> >
>> > But how do we specify the input?  And how do we specify the encodings?
>> >
>> > This is what has always held me back in the past.  Should we just allow
>> > classes to be specified on the command line?
>> >
>> > On Fri, Apr 22, 2011 at 8:47 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> >
>> >> Maybe there's space for an MR-based input conversion job indeed, as a
>> >> command-line routine? I was kind of thinking about the same. Maybe even
>> >> along with standardization of the values, and some formal definition of
>> >> the inputs being fed to it.
>> >>
>> >
>>
>

Re: Will feature hashing collisions in SGD harm the performance of the algorithm?

Posted by Ted Dunning <te...@gmail.com>.
The difficulty is that vectorization often incorporates various kinds of
interpretation of the original data.  This can involve the nesting of field
access, parsing, textual analysis as well as the basic vector encoding.  It
may involve running a classifier (possibly derived by clustering) on some
inputs to produce an input variable.

How to specify this in full generality is a difficult problem.

The complementary problem is how to restrict what you can do, but allow
sufficient generality to meet most needs.  That is a hard problem as well.

It may be that the solution is to just provide simple examples and tell
people to write some Java (implements DataEncoder).  That isn't all bad.
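
For a concrete picture, a minimal sketch of what such user-written Java
could look like (DataEncoder here is hypothetical, echoing the
parenthetical above; it is not an existing Mahout interface):

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

// Hypothetical interface, following "implements DataEncoder" above.
interface DataEncoder {
  Vector encode(String rawRecord);
}

// A user-written encoder: parse a record, interpret its fields, and emit
// the actual predictor variables as a vector.
class CsvAgeIncomeEncoder implements DataEncoder {
  private static final int CARDINALITY = 1000;

  @Override
  public Vector encode(String rawRecord) {
    String[] fields = rawRecord.split(",");              // field access / parsing
    Vector v = new RandomAccessSparseVector(CARDINALITY);
    v.set(0, Double.parseDouble(fields[0]));             // e.g. age, used directly
    v.set(1, Math.log1p(Double.parseDouble(fields[1]))); // e.g. income, transformed
    return v;
  }
}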

On Mon, Apr 25, 2011 at 10:56 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> I am not sure I see the difficulty, but it is possible we are talking
> about slightly different things. Hadoop solves this through pluggable
> strategies, such as InputFormat.
>
> Those strategies are parameterized (and also perhaps persisted) through
> some form of declarative definitions (to keep the analogy with Hadoop,
> it uses the Configuration mechanism to serialize something like that --
> though of course property-based definitions are probably quite
> underwhelming for this case). Similarly, Lucene defines Analyzer
> preprocessing strategies. Surely we could define some strategies that
> take rows of pre-standardized inputs and produce vectorized and
> standardized inputs as a result.
>
> A somewhat bigger question is what to use for pre-vectorized inputs, as
> Vector obviously won't handle various datatypes, esp. qualitative
> inputs.
>
> But perhaps we already have some of this, I am not sure. I saw a fair
> amount of classes that adapt various formats (what was it? TSV? ARFF?);
> perhaps we could strategize those as well.
>
> On Fri, Apr 22, 2011 at 9:10 AM, Ted Dunning <te...@gmail.com>
> wrote:
> > Yes.
> >
> > But how do we specify the input?  And how do we specify the encodings?
> >
> > This is what has always held me back in the past.  Should we just allow
> > classes to be specified on the command line?
> >
> > On Fri, Apr 22, 2011 at 8:47 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> >
> >> Maybe there's space for an MR-based input conversion job indeed, as a
> >> command-line routine? I was kind of thinking about the same. Maybe even
> >> along with standardization of the values, and some formal definition of
> >> the inputs being fed to it.
> >>
> >
>

Re: Will feature hashing collisions in SGD harm the performance of the algorithm?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I am not sure I see the difficulty, but it is possible we are talking
about slightly different things. Hadoop solves this through pluggable
strategies, such as InputFormat.

Those strategies are parameterized (and also perhaps persisted) through
some form of declarative definitions (to keep the analogy with Hadoop,
it uses the Configuration mechanism to serialize something like that --
though of course property-based definitions are probably quite
underwhelming for this case). Similarly, Lucene defines Analyzer
preprocessing strategies. Surely we could define some strategies that
take rows of pre-standardized inputs and produce vectorized and
standardized inputs as a result.

A somewhat bigger question is what to use for pre-vectorized inputs, as
Vector obviously won't handle various datatypes, esp. qualitative
inputs.

But perhaps we already have some of this, I am not sure. I saw a fair
amount of classes that adapt various formats (what was it? TSV? ARFF?);
perhaps we could strategize those as well.
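
To make the analogy concrete, such a pluggable, declaratively
parameterized strategy might be shaped like this (purely a sketch; none
of these names exist in Mahout):

import java.util.Properties;
import org.apache.mahout.math.Vector;

// A pluggable vectorization strategy, parameterized through a declarative
// definition, much as Hadoop InputFormats are configured through
// Configuration. All names here are hypothetical.
interface VectorizationStrategy {
  void configure(Properties props);           // declarative parameterization
  Vector vectorize(String[] standardizedRow); // pre-standardized row in, vector out
}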

On Fri, Apr 22, 2011 at 9:10 AM, Ted Dunning <te...@gmail.com> wrote:
> Yes.
>
> But how do we specify the input?  And how do we specify the encodings?
>
> This is what has always held me back in the past.  Should we just allow
> classes to be specified on the command line?
>
> On Fri, Apr 22, 2011 at 8:47 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
>> Maybe there's space for an MR-based input conversion job indeed, as a
>> command-line routine? I was kind of thinking about the same. Maybe even
>> along with standardization of the values, and some formal definition of
>> the inputs being fed to it.
>>
>

Re: Will feature hashing collisions in SGD harm the performance of the algorithm?

Posted by Ted Dunning <te...@gmail.com>.
Yes.

But how do we specify the input?  And how do we specify the encodings?

This is what has always held me back in the past.  Should we just allow
classes to be specified on the command line?
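
For what it's worth, the mechanics of specifying a class on the command
line are simple; a sketch, reusing the hypothetical DataEncoder interface
sketched earlier in the thread:

// Run as e.g.: java Driver com.example.CsvAgeIncomeEncoder
// DataEncoder is the hypothetical interface from the earlier sketch.
public class Driver {
  public static void main(String[] args) throws Exception {
    DataEncoder encoder =
        (DataEncoder) Class.forName(args[0]).getDeclaredConstructor().newInstance();
    System.out.println(encoder.encode("42,50000"));
  }
}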

On Fri, Apr 22, 2011 at 8:47 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Maybe there's space for an MR-based input conversion job indeed, as a
> command-line routine? I was kind of thinking about the same. Maybe even
> along with standardization of the values, and some formal definition of
> the inputs being fed to it.
>

Re: Will feature hashing collisions in SGD harm the performance of the algorithm?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Maybe there's space for an MR-based input conversion job indeed, as a
command-line routine? I was kind of thinking about the same. Maybe even
along with standardization of the values, and some formal definition of
the inputs being fed to it.

Apologies for brevity.

Sent from my android.
-Dmitriy
On Apr 21, 2011 3:05 PM, "Ted Dunning" <te...@gmail.com> wrote:
> It is definitely a reasonable idea to convert data to hashed feature
> vectors using map-reduce.
>
> And yes, you can pick a vector length that is long enough so that you
> don't have to worry about collisions. You need to examine your data to
> decide how large that needs to be, but it isn't hard to do. The encoding
> framework handles the placement of features in the vector for you. You
> don't have to worry about that.
>
> On Wed, Apr 20, 2011 at 8:03 PM, Stanley Xu <we...@gmail.com> wrote:
>
>> Thanks, Ted. Since SGD is a sequential method, the vector created for
>> each line could be very large without consuming too much memory. Could I
>> assume that if we have a limited number of features, or could use
>> map-reduce to pre-process the data to find out how many different values
>> a category could have, we could just create a long vector and put
>> different feature values into different slots to avoid possible feature
>> collisions?
>>
>> Thanks,
>> Stanley
>>
>>
>>
>> On Thu, Apr 21, 2011 at 12:24 AM, Ted Dunning <te...@gmail.com>
>> wrote:
>>
>> > Stanley,
>> >
>> > Yes. What you say is correct. Feature hashing can cause degradation.
>> >
>> > With multiple hashing, however, you do have a fairly strong guarantee
>> > that the feature hashing is very close to information preserving. This
>> > is related to the fact that the feature hashing operation is a random
>> > linear transformation. Since we are hashing to something that is still
>> > quite a high dimensional space, the information loss is likely to be
>> > minimal.
>> >
>> > On Wed, Apr 20, 2011 at 6:06 AM, Stanley Xu <we...@gmail.com> wrote:
>> >
>> > > Dear all,
>> > >
>> > > As I understand it, feature hashing in SGD compresses the feature
>> > > dimensions into a fixed-length vector. Won't that make the training
>> > > result incorrect if hashing collisions happen? Won't two features
>> > > hashed to the same slot be treated as the same feature? Even if we
>> > > use multiple probes to reduce the total number of collisions, like a
>> > > Bloom filter, won't a slot with a collision look like a combination
>> > > feature?
>> > >
>> > > Thanks.
>> > >
>> > > Best wishes,
>> > > Stanley Xu
>> > >
>> >
>>

Re: Will feature hashing collisions in SGD harm the performance of the algorithm?

Posted by Ted Dunning <te...@gmail.com>.
On Fri, Apr 22, 2011 at 6:39 AM, Stanley Xu <we...@gmail.com> wrote:

> One more question: I am also trying to test the MixedGradient, and it
> looks like the RankingGradient takes much more time than the
> DefaultGradient.
>

This is probably due to memory use.  You need to review which way you group
users.


>
> If I set alpha to 0.5, it takes 50 times as long as the DefaultGradient.
> I assumed that is because the RankingGradient does lots of ranking
> comparisons. I have also heard that the algorithm is not sensitive to
> alpha; what alpha would you suggest I choose? I haven't found much
> material or many suggestions about that.
>

The time should not change with alpha.

I have found that between .1 and .9, the results don't change a huge amount.
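
One way to see why the time need not depend on alpha: a mixed gradient
can be computed as a convex combination of the two component gradients,
in which case both components are evaluated for every example regardless
of alpha. A sketch of that idea (not the actual Mahout MixedGradient
implementation):

// Mixing two gradients with weight alpha; the cost is the same for any
// alpha in (0, 1) because both gradients are computed either way.
public class MixedGradientSketch {
  static double[] mix(double alpha, double[] rankingGrad, double[] defaultGrad) {
    double[] g = new double[defaultGrad.length];
    for (int i = 0; i < g.length; i++) {
      g[i] = alpha * rankingGrad[i] + (1 - alpha) * defaultGrad[i];
    }
    return g;
  }
}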

Re: Will feature hashing collisions in SGD harm the performance of the algorithm?

Posted by Stanley Xu <we...@gmail.com>.
Got it. Thanks so much, Ted.

One more question: I am also trying to test the MixedGradient, and it looks
like the RankingGradient takes much more time than the DefaultGradient.

If I set alpha to 0.5, it takes 50 times as long as the DefaultGradient. I
assumed that is because the RankingGradient does lots of ranking
comparisons. I have also heard that the algorithm is not sensitive to
alpha; what alpha would you suggest I choose? I haven't found much material
or many suggestions about that.

Best wishes,
Stanley Xu



On Fri, Apr 22, 2011 at 6:04 AM, Ted Dunning <te...@gmail.com> wrote:

> It is definitely a reasonable idea to convert data to hashed feature
> vectors using map-reduce.
>
> And yes, you can pick a vector length that is long enough so that you
> don't have to worry about collisions.  You need to examine your data to
> decide how large that needs to be, but it isn't hard to do.  The encoding
> framework handles the placement of features in the vector for you.  You
> don't have to worry about that.
>
>
> On Wed, Apr 20, 2011 at 8:03 PM, Stanley Xu <we...@gmail.com> wrote:
>
>> Thanks, Ted. Since SGD is a sequential method, the vector created for
>> each line could be very large without consuming too much memory. Could I
>> assume that if we have a limited number of features, or could use
>> map-reduce to pre-process the data to find out how many different values
>> a category could have, we could just create a long vector and put
>> different feature values into different slots to avoid possible feature
>> collisions?
>>
>> Thanks,
>> Stanley
>>
>>
>>
>> On Thu, Apr 21, 2011 at 12:24 AM, Ted Dunning <te...@gmail.com>
>> wrote:
>>
>> > Stanley,
>> >
>> > Yes.  What you say is correct.  Feature hashing can cause degradation.
>> >
>> > With multiple hashing, however, you do have a fairly strong guarantee
>> > that the feature hashing is very close to information preserving.  This
>> > is related to the fact that the feature hashing operation is a random
>> > linear transformation.  Since we are hashing to something that is still
>> > quite a high dimensional space, the information loss is likely to be
>> > minimal.
>> >
>> > On Wed, Apr 20, 2011 at 6:06 AM, Stanley Xu <we...@gmail.com> wrote:
>> >
>> > > Dear all,
>> > >
>> > > As I understand it, feature hashing in SGD compresses the feature
>> > > dimensions into a fixed-length vector. Won't that make the training
>> > > result incorrect if hashing collisions happen? Won't two features
>> > > hashed to the same slot be treated as the same feature? Even if we
>> > > use multiple probes to reduce the total number of collisions, like a
>> > > Bloom filter, won't a slot with a collision look like a combination
>> > > feature?
>> > >
>> > > Thanks.
>> > >
>> > > Best wishes,
>> > > Stanley Xu
>> > >
>> >
>>
>
>

Re: Will feature hashing collisions in SGD harm the performance of the algorithm?

Posted by Ted Dunning <te...@gmail.com>.
It is definitely a reasonable idea to convert data to hashed feature vectors
using map-reduce.

And yes, you can pick a vector length that is long enough so that you don't
have to worry about collisions.  You need to examine your data to decide how
large that needs to be, but it isn't hard to do.  The encoding framework
handles the placement of features in the vector for you.  You don't have to
worry about that.
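
Concretely, with Mahout's hashed encoders that looks roughly like the
following (class names as in 0.5-era Mahout; treat the details as a
sketch):

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class HashedEncoding {
  public static void main(String[] args) {
    // Pick a cardinality generously larger than the number of distinct
    // features you expect, so collisions stay rare.
    Vector v = new RandomAccessSparseVector(1 << 20);

    FeatureVectorEncoder encoder = new StaticWordValueEncoder("category");
    encoder.setProbes(2);  // multiple hash locations per feature

    // The encoder decides where each feature lands in the vector.
    encoder.addToVector("some-category-value", v);
  }
}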

On Wed, Apr 20, 2011 at 8:03 PM, Stanley Xu <we...@gmail.com> wrote:

> Thanks, Ted. Since SGD is a sequential method, the vector created for
> each line could be very large without consuming too much memory. Could I
> assume that if we have a limited number of features, or could use
> map-reduce to pre-process the data to find out how many different values
> a category could have, we could just create a long vector and put
> different feature values into different slots to avoid possible feature
> collisions?
>
> Thanks,
> Stanley
>
>
>
> On Thu, Apr 21, 2011 at 12:24 AM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > Stanley,
> >
> > Yes.  What you say is correct.  Feature hashing can cause degradation.
> >
> > With multiple hashing, however, you do have a fairly strong guarantee
> > that the feature hashing is very close to information preserving.  This
> > is related to the fact that the feature hashing operation is a random
> > linear transformation.  Since we are hashing to something that is still
> > quite a high dimensional space, the information loss is likely to be
> > minimal.
> >
> > On Wed, Apr 20, 2011 at 6:06 AM, Stanley Xu <we...@gmail.com> wrote:
> >
> > > Dear all,
> > >
> > > As I understand it, feature hashing in SGD compresses the feature
> > > dimensions into a fixed-length vector. Won't that make the training
> > > result incorrect if hashing collisions happen? Won't two features
> > > hashed to the same slot be treated as the same feature? Even if we
> > > use multiple probes to reduce the total number of collisions, like a
> > > Bloom filter, won't a slot with a collision look like a combination
> > > feature?
> > >
> > > Thanks.
> > >
> > > Best wishes,
> > > Stanley Xu
> > >
> >
>

Re: Will feature hashing collisions in SGD harm the performance of the algorithm?

Posted by Stanley Xu <we...@gmail.com>.
Thanks, Ted. Since SGD is a sequential method, the vector created for each
line could be very large without consuming too much memory. Could I assume
that if we have a limited number of features, or could use map-reduce to
pre-process the data to find out how many different values a category could
have, we could just create a long vector and put different feature values
into different slots to avoid possible feature collisions?

Thanks,
Stanley
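
For illustration, the collision-free scheme described above amounts to
dictionary encoding: every distinct value gets its own slot, with the
dictionary built in a pre-processing pass (a sketch; the names are made
up):

import java.util.HashMap;
import java.util.Map;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

// Dictionary (non-hashed) encoding: each distinct categorical value owns
// one slot, so collisions are impossible. In practice the dictionary
// would be built by a map-reduce pre-processing pass over the data.
public class DictionaryEncoding {
  private final Map<String, Integer> slots = new HashMap<>();

  int slotFor(String value) {
    // A real pipeline would fix the dictionary ahead of time; growing it
    // lazily here keeps the sketch self-contained.
    return slots.computeIfAbsent(value, v -> slots.size());
  }

  Vector encode(String[] categoricalValues, int cardinality) {
    Vector v = new RandomAccessSparseVector(cardinality);
    for (String value : categoricalValues) {
      v.set(slotFor(value), 1.0);
    }
    return v;
  }
}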



On Thu, Apr 21, 2011 at 12:24 AM, Ted Dunning <te...@gmail.com> wrote:

> Stanley,
>
> Yes.  What you say is correct.  Feature hashing can cause degradation.
>
> With multiple hashing, however, you do have a fairly strong guarantee that
> the feature hashing is very close to information preserving.  This is
> related to the fact that the feature hashing operation is a random linear
> transformation.  Since we are hashing to something that is still quite a
> high dimensional space, the information loss is likely to be minimal.
>
> On Wed, Apr 20, 2011 at 6:06 AM, Stanley Xu <we...@gmail.com> wrote:
>
> > Dear all,
> >
> > As I understand it, feature hashing in SGD compresses the feature
> > dimensions into a fixed-length vector. Won't that make the training
> > result incorrect if hashing collisions happen? Won't two features
> > hashed to the same slot be treated as the same feature? Even if we
> > use multiple probes to reduce the total number of collisions, like a
> > Bloom filter, won't a slot with a collision look like a combination
> > feature?
> >
> > Thanks.
> >
> > Best wishes,
> > Stanley Xu
> >
>

Re: Will feature hashing collisions in SGD harm the performance of the algorithm?

Posted by Ted Dunning <te...@gmail.com>.
Stanley,

Yes.  What you say is correct.  Feature hashing can cause degradation.

With multiple hashing, however, you do have a fairly strong guarantee that
the feature hashing is very close to information preserving.  This is
related to the fact that the feature hashing operation is a random linear
transformation.  Since we are hashing to something that is still quite a
high dimensional space, the information loss is likely to be minimal.
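
A rough back-of-the-envelope check on that claim: hashing n distinct
features into m slots produces about n(n-1)/(2m) colliding pairs in
expectation. For example, n = 10,000 features hashed into m = 2^20 (about
a million) slots gives roughly 10^8 / (2 * 10^6) = 50 colliding pairs, so
well under 1% of features share a slot at all; and with two probes each
feature has a second location, so its contribution is fully confounded
only when both locations collide.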

On Wed, Apr 20, 2011 at 6:06 AM, Stanley Xu <we...@gmail.com> wrote:

> Dear all,
>
> As I understand it, feature hashing in SGD compresses the feature
> dimensions into a fixed-length vector. Won't that make the training
> result incorrect if hashing collisions happen? Won't two features hashed
> to the same slot be treated as the same feature? Even if we use multiple
> probes to reduce the total number of collisions, like a Bloom filter,
> won't a slot with a collision look like a combination feature?
>
> Thanks.
>
> Best wishes,
> Stanley Xu
>