Posted to dev@spark.apache.org by Asher Krim <ak...@hubspot.com> on 2017/01/13 17:23:04 UTC

Why are ml models repartition(1)'d in save methods?

Hi,

I'm curious why it's common for data to be repartitioned to 1 partition
when saving ml models:

sqlContext.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath)

This shows up in most ml models I've seen (Word2Vec
<https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala#L314>,
PCA
<https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala#L189>,
LDA
<https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala#L605>).
Am I missing some benefit of repartitioning like this?

Thanks,
-- 
Asher Krim
Senior Software Engineer

Re: Why are ml models repartition(1)'d in save methods?

Posted by Asher Krim <ak...@hubspot.com>.
Cool, thanks!

Jira: https://issues.apache.org/jira/browse/SPARK-19247
PR: https://github.com/apache/spark/pull/16607

I think the LDA model has exactly the same issue - currently the
`topicsMatrix` (which is on the order of numWords*k elements, roughly 4GB
for numWords=3m and k=1000) is saved as a single element in a case class.
We should probably address this in a separate issue.

On Fri, Jan 13, 2017 at 3:55 PM, Sean Owen <so...@cloudera.com> wrote:

> Yes, certainly debatable for word2vec. You have a good point that this
> could overrun the 2GB limit if the model is one big datum, for large but
> not crazy models. This model could probably easily be serialized as
> individual vectors in this case. It would introduce a
> backwards-compatibility issue but it's possible to read old and new
> formats, I believe.
>
> On Fri, Jan 13, 2017 at 8:16 PM Asher Krim <ak...@hubspot.com> wrote:
>
>> I guess it depends on the definition of "small". A Word2vec model with
>> vectorSize=300 and vocabulary=3m takes nearly 4gb. While it does fit on a
>> single machine (so isn't really "big" data), I don't see the benefit in
>> having the model stored in one file. On the contrary, it seems that we
>> would want the model to be distributed:
>> * avoids shuffling of data to one executor
>> * allows the whole cluster to participate in saving the model
>> * avoids rpc issues (http://stackoverflow.com/questions/40842736/spark-
>> word2vecmodel-exceeds-max-rpc-size-for-saving)
>> * "feature parity" with mllib (issues with one large model file already
>> solved for mllib in SPARK-11994
>> <https://issues.apache.org/jira/browse/SPARK-11994>)
>>
>>
>> On Fri, Jan 13, 2017 at 1:02 PM, Nick Pentreath <nick.pentreath@gmail.com
>> > wrote:
>>
>> Yup - it's because almost all model data in spark ML (model coefficients)
>> is "small" - i.e. Non distributed.
>>
>> If you look at ALS you'll see there is no repartitioning since the factor
>> dataframes can be large
>> On Fri, 13 Jan 2017 at 19:42, Sean Owen <so...@cloudera.com> wrote:
>>
>> You're referring to code that serializes models, which are quite small.
>> For example a PCA model consists of a few principal component vector. It's
>> a Dataset of just one element being saved here. It's re-using the code path
>> normally used to save big data sets, to output 1 file with 1 thing as
>> Parquet.
>>
>> On Fri, Jan 13, 2017 at 5:29 PM Asher Krim <ak...@hubspot.com> wrote:
>>
>> But why is that beneficial? The data is supposedly quite large,
>> distributing it across many partitions/files would seem to make sense.
>>
>> On Fri, Jan 13, 2017 at 12:25 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> That is usually so the result comes out in one file, not partitioned over
>> n files.
>>
>> On Fri, Jan 13, 2017 at 5:23 PM Asher Krim <ak...@hubspot.com> wrote:
>>
>> Hi,
>>
>> I'm curious why it's common for data to be repartitioned to 1 partition
>> when saving ml models:
>>
>> sqlContext.createDataFrame(Seq(data)).repartition(1).write.
>> parquet(dataPath)
>>
>> This shows up in most ml models I've seen (Word2Vec
>> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala#L314>,
>> PCA
>> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala#L189>,
>> LDA
>> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala#L605>).
>> Am I missing some benefit of repartitioning like this?
>>
>> Thanks,
>> --
>> Asher Krim
>> Senior Software Engineer
>>
>>
>>
>>
>> --
>> Asher Krim
>> Senior Software Engineer
>>
>>

Re: Why are ml models repartition(1)'d in save methods?

Posted by Sean Owen <so...@cloudera.com>.
Yes, certainly debatable for word2vec. You have a good point that this
could overrun the 2GB limit if the model is one big datum, for large but
not crazy models. This model could probably easily be serialized as
individual vectors in this case. It would introduce a
backwards-compatibility issue but it's possible to read old and new
formats, I believe.
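
[Editor's sketch] Serializing the model "as individual vectors", as suggested
above, might look roughly like the per-word layout below. This is only a
sketch under assumed names (`toRows`, a flat `Array[Float]` backing the
vectors); the actual schema adopted in Spark may differ:

```scala
// One (word, vector) row per vocabulary entry instead of a single giant
// datum. Each row stays far below any per-record size limit, so the
// resulting Dataset can be written without repartition(1).
object PerWordRows {
  // wordIndex maps each word to its row in the flat vector array.
  def toRows(wordIndex: Map[String, Int],
             flatVectors: Array[Float],
             vectorSize: Int): Seq[(String, Array[Float])] =
    wordIndex.toSeq.map { case (word, i) =>
      (word, flatVectors.slice(i * vectorSize, (i + 1) * vectorSize))
    }

  def main(args: Array[String]): Unit = {
    val vectorSize = 3
    val flat = Array(0.1f, 0.2f, 0.3f, 0.4f, 0.5f, 0.6f)
    val rows = toRows(Map("cat" -> 0, "dog" -> 1), flat, vectorSize)
    assert(rows.size == 2)
    assert(rows.forall(_._2.length == vectorSize))
    // In Spark this would then be something like:
    //   spark.createDataFrame(rows).write.parquet(dataPath)  // no repartition(1)
  }
}
```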


Re: Why are ml models repartition(1)'d in save methods?

Posted by Asher Krim <ak...@hubspot.com>.
I guess it depends on the definition of "small". A Word2vec model with
vectorSize=300 and vocabulary=3m takes nearly 4gb. While it does fit on a
single machine (so isn't really "big" data), I don't see the benefit in
having the model stored in one file. On the contrary, it seems that we
would want the model to be distributed:
* avoids shuffling of data to one executor
* allows the whole cluster to participate in saving the model
* avoids rpc issues (http://stackoverflow.com/questions/40842736/spark-word2vecmodel-exceeds-max-rpc-size-for-saving)
* "feature parity" with mllib (issues with one large model file already
solved for mllib in SPARK-11994
<https://issues.apache.org/jira/browse/SPARK-11994>)
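
[Editor's sketch] The "nearly 4gb" figure above checks out with quick
arithmetic (illustrative Scala; `bytesFor` is just a helper for this sketch,
not a Spark API):

```scala
// Back-of-envelope size of a flat Word2Vec vector array, using the numbers
// from this thread: 3M-word vocabulary, vectorSize=300, Float (4 bytes).
object ModelSize {
  def bytesFor(vocab: Long, vectorSize: Long, bytesPerElem: Long): Long =
    vocab * vectorSize * bytesPerElem

  def main(args: Array[String]): Unit = {
    val w2v = bytesFor(3000000L, 300L, 4L) // 3.6e9 bytes, i.e. "nearly 4gb"
    println(f"Word2Vec flat array: ${w2v / 1e9}%.1f GB")
    // A single datum of this size already exceeds the 2GB limit Sean
    // mentions, which is why one-big-record serialization breaks down.
    assert(w2v > 2L * 1024 * 1024 * 1024)
  }
}
```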



Re: Why are ml models repartition(1)'d in save methods?

Posted by Nick Pentreath <ni...@gmail.com>.
Yup - it's because almost all model data in Spark ML (model coefficients)
is "small" - i.e. non-distributed.

If you look at ALS you'll see there is no repartitioning, since the factor
DataFrames can be large.

Re: Why are ml models repartition(1)'d in save methods?

Posted by Sean Owen <so...@cloudera.com>.
You're referring to code that serializes models, which are quite small. For
example a PCA model consists of a few principal component vectors. It's a
Dataset of just one element being saved here. It's re-using the code path
normally used to save big data sets, to output 1 file with 1 thing as
Parquet.
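
[Editor's sketch] In miniature, the pattern being described looks like this
(the case-class fields are illustrative, not the actual Spark ML internals):

```scala
// The whole model is a single case-class instance, wrapped in a one-element
// Seq so the generic DataFrame writer can be reused.
case class Data(pc: Array[Double], explainedVariance: Array[Double])

object SingleElementSave {
  def main(args: Array[String]): Unit = {
    val data = Data(Array(0.7, 0.3), Array(0.9, 0.1))
    val rows = Seq(data) // a "Dataset" of exactly one element
    assert(rows.length == 1)
    // In Spark ML this becomes:
    //   sqlContext.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath)
    // where repartition(1) just makes the output a single Parquet part file.
  }
}
```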


Re: Why are ml models repartition(1)'d in save methods?

Posted by Asher Krim <ak...@hubspot.com>.
But why is that beneficial? The data is supposedly quite large; distributing
it across many partitions/files would seem to make sense.



-- 
Asher Krim
Senior Software Engineer

Re: Why are ml models repartition(1)'d in save methods?

Posted by Sean Owen <so...@cloudera.com>.
That is usually so the result comes out in one file, not partitioned over n
files.
