Posted to user@spark.apache.org by Yang <te...@gmail.com> on 2016/10/05 19:00:07 UTC

can mllib Logistic Regression package handle 10 million sparse features?

Has anybody had actual experience applying it to real problems at this scale?

thanks

Re: can mllib Logistic Regression package handle 10 million sparse features?

Posted by Yang <te...@gmail.com>.
In my case the model size is fairly small (100k training samples), though
the feature count is roughly 100k active out of 10 million possible features.

So it doesn't help me to distribute the training process, since the data
size is so small. I just need a good core solver to train each model
serially. On the other hand, I have to train millions of such models
independently, so I have plenty of load-balancing opportunity.
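The many-small-models setup above can be sketched in plain Python (a toy
solver with hypothetical names, not Spark code; in Spark the same idea
would be keying the data per model and mapping a local solver over the
groups):

```python
import math

def train_lr(samples, epochs=50, lr=0.5):
    """Serially train one small logistic regression model with plain
    gradient descent. Weights live in a dict, so only the ~100k active
    features (out of 10M possible) ever consume memory."""
    w = {}  # sparse weight vector: feature index -> weight
    for _ in range(epochs):
        for x, y in samples:          # x is {feature_index: value}
            margin = sum(w.get(i, 0.0) * v for i, v in x.items())
            p = 1.0 / (1.0 + math.exp(-margin))
            g = p - y                 # gradient of the log-loss
            for i, v in x.items():
                w[i] = w.get(i, 0.0) - lr * g * v
    return w

# Millions of such models can be trained independently, e.g. by keying
# the data per model and mapping this solver over the groups
# (in Spark, roughly: rdd.groupByKey().mapValues(train_lr)).
data = {
    "model_a": [({0: 1.0}, 1), ({1: 1.0}, 0)],
    "model_b": [({2: 1.0}, 0), ({3: 1.0}, 1)],
}
models = {k: train_lr(v) for k, v in data.items()}
```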

On Tue, Oct 11, 2016 at 3:09 AM, Nick Pentreath <ni...@gmail.com>
wrote:


Re: can mllib Logistic Regression package handle 10 million sparse features?

Posted by Nick Pentreath <ni...@gmail.com>.
That's a good point about shuffle data compression. Still, it would be good
to benchmark the ideas behind https://github.com/apache/spark/pull/12761 I
think.

For many datasets, even within one partition the gradient sums etc. can
remain very sparse. For example, the Criteo DAC data is extremely sparse,
with roughly 5% of features active per partition. However, you're correct
that as the coefficients (and intermediate stats counters) get aggregated
they will become more and more dense. But there is also the intermediate
memory overhead of the dense structures, though that only comes into play
in the 100s to 1000s of millions feature range.
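The per-partition sparsity and the densification under aggregation can be
sketched with a toy simulation (plain Python, illustrative dimensions only,
nowhere near Criteo's actual 34M features):

```python
import random

random.seed(0)
DIM = 1_000_000  # illustrative feature dimension

def partition_grad_sum(num_rows, active_per_row=20):
    """Gradient sum for one partition, kept sparse: only features that
    actually occur in the partition get an entry."""
    g = {}
    for _ in range(num_rows):
        for idx in random.sample(range(DIM), active_per_row):
            g[idx] = g.get(idx, 0.0) + 1.0
    return g

def merge(a, b):
    """Combiner used during aggregation; the union of two sparse sums
    is at least as dense as either input."""
    for k, v in b.items():
        a[k] = a.get(k, 0.0) + v
    return a

partials = [partition_grad_sum(500) for _ in range(8)]
total, densities = {}, []
for p in partials:
    total = merge(total, p)
    densities.append(len(total) / DIM)
print(densities)  # the fraction of non-zeros grows with every merge
```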

The situation in the PR above is actually different in that even the
coefficient vector itself is truly sparse (through some encoding they did,
IIRC). This is not an uncommon scenario however, as for high-dimensional
features users may want to use feature hashing, which may result in
genuinely sparse coefficient vectors. With hashing, the feature dimension
will often be chosen as a power of 2 and higher (in some cases
significantly higher) than the true feature dimension to reduce
collisions. So sparsity is critical here for storage efficiency.
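The hashing trick mentioned above can be sketched as follows (plain Python
with hypothetical feature names, not the actual Spark HashingTF code; the
power-of-2 bucket count lets the modulo be a bit-mask):

```python
import zlib

NUM_BUCKETS = 1 << 18  # power of 2, chosen well above the true feature count

def hash_features(raw):
    """Hashing trick: map raw string features into a fixed-size sparse
    vector. (A signed variant would use a second hash to reduce the
    bias from collisions; omitted here for brevity.)"""
    vec = {}
    for name, value in raw.items():
        # With a power-of-2 bucket count, the modulo is a bit-mask.
        idx = zlib.crc32(name.encode()) & (NUM_BUCKETS - 1)
        vec[idx] = vec.get(idx, 0.0) + value
    return vec

row = {"user=123": 1.0, "ad=987": 1.0, "site=news.example": 1.0}
vec = hash_features(row)
```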

Your result for the final stage does seem to indicate something can be
improved - perhaps it is due to some level of fetch parallelism - so more
partitions may fetch more data in parallel? Because with just default
settings for `treeAggregate` I was seeing much faster times for the final
stage with a 34 million feature dimension (though the final shuffle size
seems to be 50% of yours with 2x the features - this is with Spark 2.0.1;
I haven't tested out master yet with this data).

[image: Screen Shot 2016-10-11 at 12.03.55 PM.png]



On Fri, 7 Oct 2016 at 08:11 DB Tsai <db...@dbtsai.com> wrote:


Re: can mllib Logistic Regression package handle 10 million sparse features?

Posted by DB Tsai <db...@dbtsai.com>.
Hi Nick,

I'm also working on benchmarking linear models in Spark. :)

One thing I saw is that for sparse features (14 million features) with
multi-depth aggregation, the final aggregation to the driver is
extremely slow. See the attachment. The amount of data exchanged
between executors is significantly larger than what is collected into
the driver, yet collecting the data back to the driver takes 4 minutes
while the aggregation between executors only takes 20 seconds. It seems
the code path is different, and I suspect there may be something in
Spark core that we can optimize.

Regarding using a sparse data structure for aggregation, I'm not so sure
how much this will improve performance. After computing the gradient sum
for all the data in one partition, the vector will no longer be very
sparse. Even if it's sparse, after a couple of depths of aggregation it
will be very dense. Also, we perform compression in the shuffle phase,
so if there are many zeros, the vector should take around the same size
in a dense representation as in a sparse one. I could be wrong, since
I've never done a study on this, and I wonder how much performance we
can gain in practice by using sparse vectors for aggregating the
gradients.
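The compression argument above is easy to check with a toy experiment
(plain Python, with zlib standing in for Spark's shuffle codec; the
dimensions are illustrative, not from the benchmark):

```python
import struct
import zlib

DIM = 100_000   # feature dimension
STRIDE = 200    # one non-zero every 200 slots -> 0.5% density

# Dense representation: every slot serialized, zeros included.
dense = [0.0] * DIM
for i in range(0, DIM, STRIDE):
    dense[i] = 1.0
dense_bytes = struct.pack(f"={DIM}d", *dense)

# Sparse representation: (index, value) pairs for the non-zeros only.
sparse_bytes = b"".join(
    struct.pack("=id", i, v) for i, v in enumerate(dense) if v != 0.0
)

# Stand-in for the shuffle codec: the long zero runs compress away.
compressed_dense = zlib.compress(dense_bytes)
print(len(dense_bytes), len(sparse_bytes), len(compressed_dense))
```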

Sincerely,

DB Tsai
----------------------------------------------------------
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Thu, Oct 6, 2016 at 4:09 AM, Nick Pentreath <ni...@gmail.com> wrote:

Re: can mllib Logistic Regression package handle 10 million sparse features?

Posted by Nick Pentreath <ni...@gmail.com>.
I'm currently working on various performance tests for large, sparse
feature spaces.

For the Criteo DAC data - 45.8 million rows, 34.3 million features
(categorical, extremely sparse), the time per iteration for
ml.LogisticRegression is about 20-30s.

This is with 4x worker nodes, 48 cores & 120GB RAM each. I haven't yet
tuned the tree aggregation depth. But the number of partitions can make a
difference - generally fewer is better since the cost is mostly
communication of the gradient (the gradient computation is < 10% of the
per-iteration time).
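The tradeoff around the tree aggregation depth can be sketched with a toy
model of `treeAggregate` (plain Python; this mimics only the fan-in shape,
not Spark's actual implementation):

```python
import math
from functools import reduce

def tree_aggregate(partials, merge, depth=2):
    """Toy model of treeAggregate's fan-in shape: run depth-1
    executor-side combine rounds before the final driver-side merge,
    so the driver never has to merge every partition's result itself."""
    level = list(partials)
    fan_in = max(2, math.ceil(len(level) ** (1.0 / depth)))
    for _ in range(depth - 1):
        level = [reduce(merge, level[i:i + fan_in])
                 for i in range(0, len(level), fan_in)]
    return reduce(merge, level)  # final aggregation on the driver

def merge(a, b):
    out = dict(a)
    for k, v in b.items():
        out[k] = out.get(k, 0.0) + v
    return out

# 64 partition-level gradient sums, one distinct feature each.
partials = [{i: 1.0} for i in range(64)]
total2 = tree_aggregate(partials, merge, depth=2)
total3 = tree_aggregate(partials, merge, depth=3)
```

A deeper tree does more executor-side rounds, trading extra stages for a
lighter final merge; the result is the same either way.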

Note that the current impl forces dense arrays for intermediate data
structures, increasing the communication cost significantly. See this PR
for info: https://github.com/apache/spark/pull/12761. Once sparse data
structures are supported for this, the linear models will be orders of
magnitude more scalable for sparse data.


On Wed, 5 Oct 2016 at 23:37 DB Tsai <db...@dbtsai.com> wrote:


Re: can mllib Logistic Regression package handle 10 million sparse features?

Posted by DB Tsai <db...@dbtsai.com>.
With the latest code in the current master, we're successfully
training LOR using Spark ML's implementation with 14M sparse features.
You need to tune the depth of aggregation to make it efficient.

Sincerely,

DB Tsai
----------------------------------------------------------
Web: https://www.dbtsai.com
PGP Key ID: 0x9DCC1DBD7FC7BBB2


On Wed, Oct 5, 2016 at 12:00 PM, Yang <te...@gmail.com> wrote:

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org