Posted to user@mahout.apache.org by Stanley Xu <we...@gmail.com> on 2011/04/20 15:00:29 UTC

Anyway to speedup the category feature parsing and encoding in the SGD algorithm?

Dear all,

For the SGD algorithm, I learned from Chapter 16.3.4 of <Mahout In Action>
that we should use hand-written parsing code to reduce the time spent on
parsing the input and putting it into the Vector. Per my test with
SimpleCsvExamples.java in the code base, this reduces the time spent on
parsing the input by 80%.

I am trying to use a similar approach to optimize the parsing of category
features, but it looks like it only reduces the time by about 30%, based on
code like the following:

    Vector v = new RandomAccessSparseVector(1000);

    // ... existing code omitted ...
    } else if ("--fast".equals(args[0])) {
      FastLineReader in = new FastLineReader(new FileInputStream(args[1]));
      try {
        FastLine line = in.read();
        while (line != null) {
          v.assign(0);
          for (int i = 0; i < FIELDS; i++) {
//            double z = line.getDouble(i);
//            s[i].add(z);
            byte[] category = line.getBytes(i);
            encoder[i].addToVector(category, 1, v);
          }
          line = in.read();
        }
      } finally {
        IOUtils.quietClose(in);
      }

  private static final class FastLine {

    public byte[] getBytes(int field) {
      int offset = start.get(field);
      int size = length.get(field);
      byte[] result = new byte[size];
      System.arraycopy(base.array(), offset, result, 0, size);
      return result;
    }
}

Would anyone be able to help me find a better solution? I found that about
80% of the time for SGD was spent on parsing the features and adding them to
the Vector. If I could optimize the performance for category features as
well, it would make the algorithm even faster, and it might be able to handle
100 million or even billions of lines of data on a single machine.

Thanks.

Best wishes,
Stanley Xu

Re: Anyway to speedup the category feature parsing and encoding in the SGD algorithm?

Posted by Ted Dunning <te...@gmail.com>.

Look at VectorWritable

On Fri, Apr 22, 2011 at 6:57 AM, Stanley Xu <we...@gmail.com> wrote:

> Hi Ted,
>
> Which class do you mean when you say to output the sparse vector as
> Writable?
>
> I checked the code, and neither RandomAccessSparseVector nor
> SequentialAccessSparseVector implements the Writable interface.
>
> Thanks.
>
> On Fri, Apr 22, 2011 at 12:49 PM, Ted Dunning <te...@gmail.com> wrote:
>
>> The binary format is already defined.  Just encode in map-reduce and
>> output sparse vector as Writable.  This is a natural use for a sequence
>> file.
>>
>
>

Re: Anyway to speedup the category feature parsing and encoding in the SGD algorithm?

Posted by Stanley Xu <we...@gmail.com>.

Hi Ted,

Which class do you mean when you say to output the sparse vector as Writable?

I checked the code, and neither RandomAccessSparseVector nor
SequentialAccessSparseVector implements the Writable interface.

Thanks.

On Fri, Apr 22, 2011 at 12:49 PM, Ted Dunning <te...@gmail.com> wrote:

> The binary format is already defined.  Just encode in map-reduce and output
> sparse vector as Writable.  This is a natural use for a sequence file.
>

Re: Anyway to speedup the category feature parsing and encoding in the SGD algorithm?

Posted by Ted Dunning <te...@gmail.com>.

On Thu, Apr 21, 2011 at 7:05 PM, Stanley Xu <we...@gmail.com> wrote:

> Hi Ted,
>
> I know I have to change the encoder and parse the field as a String (or byte
> array). But even when it is parsed as a byte array, feature hashing still
> costs a lot of time, and, both as you said and per our test, the time spent
> on feature hashing and parsing normally dominates the SGD training time.
>

Yes.  It will still cost a lot.


> I guess I should define a binary format for the input features to
> accelerate the feature parsing and hashing work.
>

The binary format is already defined.  Just encode in map-reduce and output
sparse vector as Writable.  This is a natural use for a sequence file.
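
To make the encode-once idea concrete, here is a minimal sketch in plain Java.
It is only an illustration under stated assumptions: the class name
SparseVectorFile, its helper methods, and the record layout (an entry count
followed by index/value pairs) are hypothetical stand-ins, not Mahout's
VectorWritable and not Hadoop's sequence file format. The point is just that
the expensive parsing and hashing happens once, and later training passes
read already-encoded vectors.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical stand-in for a pre-encoded vector store; NOT Mahout's VectorWritable format.
final class SparseVectorFile {

  // Record layout (an assumption of this sketch): int count, then count x (int index, double value).
  static void write(DataOutput out, SortedMap<Integer, Double> vector) throws IOException {
    out.writeInt(vector.size());
    for (Map.Entry<Integer, Double> e : vector.entrySet()) {
      out.writeInt(e.getKey());
      out.writeDouble(e.getValue());
    }
  }

  // Reads back one record written by write().
  static SortedMap<Integer, Double> read(DataInput in) throws IOException {
    int count = in.readInt();
    SortedMap<Integer, Double> vector = new TreeMap<>();
    for (int i = 0; i < count; i++) {
      int index = in.readInt();
      double value = in.readDouble();
      vector.put(index, value);
    }
    return vector;
  }

  // Convenience wrapper: encode one vector into an in-memory byte array.
  static byte[] toBytes(SortedMap<Integer, Double> vector) {
    try {
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      write(new DataOutputStream(buffer), vector);
      return buffer.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Convenience wrapper: decode one vector from a byte array.
  static SortedMap<Integer, Double> fromBytes(byte[] bytes) {
    try {
      return read(new DataInputStream(new ByteArrayInputStream(bytes)));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // Pretend these indexes came out of the feature hashing step.
    SortedMap<Integer, Double> encoded = new TreeMap<>();
    encoded.put(17, 1.0);
    encoded.put(402, 1.0);
    byte[] bytes = toBytes(encoded);
    System.out.println(bytes.length + " bytes, round trip ok: "
        + fromBytes(bytes).equals(encoded));
  }
}
```

In a real job the records would be written by the encoding pass and read
sequentially by the trainer, so no text parsing or hashing is repeated on
later passes over the data.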



>
> Thanks a lot.
>
> Best wishes,
> Stanley Xu
>
>
>
>

Re: Anyway to speedup the category feature parsing and encoding in the SGD algorithm?

Posted by Stanley Xu <we...@gmail.com>.

Hi Ted,

I know I have to change the encoder and parse the field as a String (or byte
array). But even when it is parsed as a byte array, feature hashing still
costs a lot of time, and, both as you said and per our test, the time spent
on feature hashing and parsing normally dominates the SGD training time.

I guess I should define a binary format for the input features to accelerate
the feature parsing and hashing work.

Thanks a lot.

Best wishes,
Stanley Xu




Re: Anyway to speedup the category feature parsing and encoding in the SGD algorithm?

Posted by Ted Dunning <te...@gmail.com>.

This code doesn't look right for category features.

Those features are usually described either as strings or as integers.
Either case can be handled as strings as long as you don't have any
surprises like leading 0's.

The best way to handle these features is to encode them using word encoders.
In your code, if you define the category encoder like this, it should work
fairly well:

  private static final FeatureVectorEncoder encoder = new StaticWordValueEncoder("category");
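
For readers unfamiliar with what a word encoder does, here is a rough sketch
of the general idea (feature hashing of a category value treated as a
string). It is a plain-Java illustration only: the class
HashedCategoryEncoder, its methods, the choice of FNV-1a as the hash, and the
use of a Map as the vector are all assumptions made for this example, and it
is not how StaticWordValueEncoder is actually implemented.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of hashed encoding for category values; NOT Mahout's implementation.
final class HashedCategoryEncoder {
  private final String name;
  private final int numFeatures;

  HashedCategoryEncoder(String name, int numFeatures) {
    this.name = name;
    this.numFeatures = numFeatures;
  }

  // Maps (encoder name, category value) to a stable index in [0, numFeatures).
  int indexFor(String value) {
    byte[] bytes = (name + ":" + value).getBytes(StandardCharsets.UTF_8);
    int h = 0x811c9dc5;            // FNV-1a 32-bit offset basis
    for (byte b : bytes) {
      h ^= (b & 0xff);
      h *= 0x01000193;             // FNV-1a 32-bit prime
    }
    return Math.floorMod(h, numFeatures);
  }

  // Adds the weight at the hashed index; colliding values simply accumulate.
  void addToVector(String value, double weight, Map<Integer, Double> vector) {
    vector.merge(indexFor(value), weight, Double::sum);
  }

  public static void main(String[] args) {
    HashedCategoryEncoder encoder = new HashedCategoryEncoder("category", 1000);
    Map<Integer, Double> vector = new HashMap<>();
    // "7" and "007" are different strings, so they are treated as different
    // categories; this is the leading-zero surprise mentioned above.
    encoder.addToVector("7", 1.0, vector);
    encoder.addToVector("007", 1.0, vector);
    System.out.println(vector);
  }
}
```

Because the value is hashed as a string, integer category codes need a
consistent textual form before encoding, otherwise the same category can land
in two different slots.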

