Posted to user@mahout.apache.org by Paul van Hoven <pa...@gmail.com> on 2013/11/29 10:14:58 UTC

RandomAccessSparseVector setting 1.0 in 2 dims for 1 feature value?

For an example program using Mahout, I use the donut.csv sample data
from the project (
https://svn.apache.org/repos/asf/mahout/trunk/examples/src/main/resources/donut.csv
). My code looks like this:

    import java.util.ArrayList;

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;
    import com.csvreader.CsvReader;

    public class Runner {

        // Set the path accordingly!
        public static final String csvInputDataPath = "/path/to/donut.csv";

        public static void main(String[] args) {

            FeatureVectorEncoder encoder = new StaticWordValueEncoder("features");
            ArrayList<RandomAccessSparseVector> featureVectors =
                new ArrayList<RandomAccessSparseVector>();
            try {
                CsvReader csvReader = new CsvReader(csvInputDataPath);
                csvReader.readHeaders();
                while (csvReader.readRecord()) {
                    // Indices 0..3 hold the raw numeric columns.
                    Vector featureVector = new RandomAccessSparseVector(30);
                    featureVector.set(0, Double.parseDouble(csvReader.get("x")));
                    featureVector.set(1, Double.parseDouble(csvReader.get("y")));
                    featureVector.set(2, Double.parseDouble(csvReader.get("c")));
                    featureVector.set(3, Integer.parseInt(csvReader.get("color")));
                    System.out.println("Before: " + featureVector.toString());
                    // Hash-encode the "shape" value into the same vector.
                    encoder.addToVector(csvReader.get("shape").getBytes(),
                        featureVector);
                    System.out.println(" After: " + featureVector.toString());
                    featureVectors.add((RandomAccessSparseVector) featureVector);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }

            System.out.println("Program is done.");
        }

    }


What confuses me is the following output (one sample):

    Before: {0:0.923307513352484,1:0.0135197141207755,2:0.644866125183976,3:2.0}
     After: {0:0.923307513352484,1:0.0135197141207755,2:0.644866125183976,3:2.0,29:1.0,25:1.0}

As you can see, I added just one value, "shape", to the vector. However,
two dimensions of this vector are set to 1.0. On the other hand,
for some other data I get the output

    Before: {0:0.711011884035543,1:0.909141522599384,2:0.46035073663368,3:2.0}
     After: {0:0.711011884035543,1:0.909141522599384,2:0.46035073663368,3:3.0,16:1.0}

Why? I would expect that _always_ exactly one dimension gets set to
1.0, as this is the standard case for categorical encoding. As it
stands, the behavior seems wrong.

Thanks in advance,
Paul

Re: RandomAccessSparseVector setting 1.0 in 2 dims for 1 feature value?

Posted by Ted Dunning <te...@gmail.com>.

If you always insert 1's for each element, then you can detect collisions
by inserting all your elements (or all elements in each document
separately) and looking for the max value in the vector.  If you see
something >1, you have a collision.
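
The check described above can be sketched in plain Java. This is a
standalone illustration, not Mahout code: the class name, the method
names and the hash in probeIndex are assumptions made up for the
example, and the hash is only a stand-in for whatever the real encoder
uses. It assumes every distinct value is inserted exactly once and that
the vector holds nothing but hashed 1's (raw numeric columns such as
"color" would have to be kept out of it).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class CollisionCheckSketch {

    // Stand-in seeded hash (an assumption, NOT Mahout's hash function).
    public static int probeIndex(String value, int probe, int size) {
        int h = 17 + 31 * probe;
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            h = 31 * h + b;
        }
        return Math.floorMod(h, size);
    }

    // Adds 1.0 at every probe location of every value.
    public static double[] encode(List<String> values, int size, int probes) {
        double[] v = new double[size];
        for (String value : values) {
            for (int p = 0; p < probes; p++) {
                v[probeIndex(value, p, size)] += 1.0;
            }
        }
        return v;
    }

    // Any entry above 1.0 means two insertions landed on the same slot.
    public static boolean hasCollision(double[] v) {
        return Arrays.stream(v).max().orElse(0.0) > 1.0;
    }

    public static void main(String[] args) {
        List<String> shapes = Arrays.asList("21", "22", "23", "24", "25");
        for (int size : new int[] {4, 30, 1000}) {
            double[] v = encode(shapes, size, 2);
            System.out.println("size " + size + ": collision = " + hasCollision(v));
        }
    }
}
```

Note that with more than one probe, the two probes of a single value
can also land on the same slot, and this check reports that case as a
collision too.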

But collisions are actually good.  The only way to completely avoid them is
to use a vector as large as your vocabulary, which is often painfully large.

You can also view multiple probes not so much as avoiding collisions, but
as making the linear transformation from the very large dimensional
representation of one dimension per word to the lower hashed representation
more likely to be nearly invertible in the sense that the Euclidean metric
will be approximately preserved.  Think Johnson-Lindenstrauss random
projections.




Re: RandomAccessSparseVector setting 1.0 in 2 dims for 1 feature value?

Posted by Paul van Hoven <pa...@gmail.com>.

Hi, thanks for your quick reply. So multiple probes are a protection
against collisions? After playing a little with the default length of
a RandomAccessSparseVector object I noticed that (of course)
collisions occur when the length is too short. Therefore, I'm
wondering whether there is a way to check if a collision occurred
after encoding a new value in the vector. This would tell a user
that the length of the chosen vector is too short. So far,
I did not find any method in the API to check for that.


Re: RandomAccessSparseVector setting 1.0 in 2 dims for 1 feature value?

Posted by Ted Dunning <te...@gmail.com>.

The default with the Mahout encoders is two probes.  This is unnecessary
with the intercept term, of course, if you protect the intercept term from
other updates, possibly by encoding other data using a view of the original
feature vector.

For each probe, a different hash is used so each value is put into multiple
locations.  Multiple probes are useful in general to decrease the effect of
the reduced dimensionality of the hashed representation.
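
A minimal sketch of the two ideas above, multiple probes and a
protected region, in plain Java. It is not Mahout's implementation:
the class, the addToView helper and the hash are hypothetical names
invented for illustration, and a plain double[] stands in for the
vector. Each value adds 1.0 at one hashed slot per probe, and the
hashing is restricted to the slots from offset onward so that the
leading slots (here 0..3, holding the raw columns) are never touched.
In Mahout itself, Vector.viewPart(offset, length) looks like the way
to obtain such a view, but treat that as an assumption to verify
against the API docs.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class TwoProbeSketch {

    // Stand-in seeded hash (an assumption, NOT Mahout's hash function).
    public static int probeIndex(String value, int probe, int size) {
        int h = 17 + 31 * probe;
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            h = 31 * h + b;
        }
        return Math.floorMod(h, size);
    }

    // Adds 1.0 at `probes` hashed slots inside v[offset .. offset + length - 1].
    public static void addToView(String value, double[] v,
                                 int offset, int length, int probes) {
        for (int p = 0; p < probes; p++) {
            v[offset + probeIndex(value, p, length)] += 1.0;
        }
    }

    public static void main(String[] args) {
        double[] v = new double[30];
        v[0] = 0.71; v[1] = 0.91; v[2] = 0.46; v[3] = 2.0; // raw columns

        // Hash only into slots 4..29, so slots 0..3 stay protected.
        addToView("21", v, 4, 26, 2);
        System.out.println(Arrays.toString(v));
    }
}
```

Hashing into the whole vector instead (offset 0, length 30) lets a
probe land on a slot that already holds a raw value. That would be
consistent with the second sample in the original question, where
index 3 changes from 2.0 to 3.0 and only one new index appears.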


