Posted to dev@spark.apache.org by Daniel Siegmann <da...@teamaol.com> on 2016/04/28 21:06:42 UTC

Re: Spark ML - Scaling logistic regression for many features

FYI: https://issues.apache.org/jira/browse/SPARK-14464

I have submitted a PR as well.

On Fri, Mar 18, 2016 at 7:15 AM, Nick Pentreath <ni...@gmail.com>
wrote:

> No, I didn't yet - feel free to create a JIRA.
>
>
>
> On Thu, 17 Mar 2016 at 22:55 Daniel Siegmann <da...@teamaol.com>
> wrote:
>
>> Hi Nick,
>>
>> Thanks again for your help with this. Did you create a ticket in JIRA for
>> investigating sparse models in LR and / or multivariate summariser? If so,
>> can you give me the issue key(s)? If not, would you like me to create these
>> tickets?
>>
>> I'm going to look into this some more and see if I can figure out how to
>> implement these fixes.
>>
>> ~Daniel Siegmann
>>
>> On Sat, Mar 12, 2016 at 5:53 AM, Nick Pentreath <nick.pentreath@gmail.com
>> > wrote:
>>
>>> Also adding dev list in case anyone else has ideas / views.
>>>
>>> On Sat, 12 Mar 2016 at 12:52, Nick Pentreath <ni...@gmail.com>
>>> wrote:
>>>
>>>> Thanks for the feedback.
>>>>
>>>> I think Spark can certainly meet your use case when your data size
>>>> scales up, as the actual model dimension is very small - you will need to
>>>> use those indexers or some other mapping mechanism.
>>>>
>>>> There is ongoing work for Spark 2.0 to make it easier to use models
>>>> outside of Spark - also see PMML export (I think mllib logistic regression
>>>> is supported, but I have to check that). That will help with using Spark
>>>> models in serving environments.
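>>>>
>>>> As a very rough sketch of what I mean (assuming the RDD-based mllib
>>>> LogisticRegressionModel is indeed PMMLExportable - that's the part I still
>>>> have to check - and run from a spark-shell where sc is available):
>>>>
>>>>   import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
>>>>   import org.apache.spark.mllib.linalg.Vectors
>>>>   import org.apache.spark.mllib.regression.LabeledPoint
>>>>
>>>>   // toy sparse training data, just to have something to fit
>>>>   val training = sc.parallelize(Seq(
>>>>     LabeledPoint(1.0, Vectors.sparse(100, Array(3, 17), Array(1.0, 1.0))),
>>>>     LabeledPoint(0.0, Vectors.sparse(100, Array(42), Array(1.0)))
>>>>   ))
>>>>
>>>>   val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(training)
>>>>
>>>>   // writes the fitted model out as PMML for scoring outside of Spark
>>>>   model.toPMML("/tmp/lr.pmml")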
>>>>
>>>> Finally, I will add a JIRA to investigate sparse models for LR - maybe
>>>> also a ticket for multivariate summariser (though I don't think in practice
>>>> there will be much to gain).
>>>>
>>>>
>>>> On Fri, 11 Mar 2016 at 21:35, Daniel Siegmann <
>>>> daniel.siegmann@teamaol.com> wrote:
>>>>
>>>>> Thanks for the pointer to those indexers; those are some good
>>>>> examples. That's a good way to go for the trainer and any scoring done in
>>>>> Spark. I will definitely have to deal with scoring in non-Spark systems,
>>>>> though.
>>>>>
>>>>> I think I will need to scale up beyond what single-node liblinear can
>>>>> practically provide. The system will need to handle much larger sub-samples
>>>>> of this data (and other projects might be larger still). Additionally, the
>>>>> system needs to train many models in parallel (hyper-parameter optimization
>>>>> with n-fold cross-validation, multiple algorithms, different sets of
>>>>> features).
>>>>>
>>>>> Still, I suppose we'll have to consider whether Spark is the best
>>>>> system for this. For now though, my job is to see what can be achieved with
>>>>> Spark.
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Mar 11, 2016 at 12:45 PM, Nick Pentreath <
>>>>> nick.pentreath@gmail.com> wrote:
>>>>>
>>>>>> Ok, I think I understand things better now.
>>>>>>
>>>>>> For Spark's current implementation, you would need to map those
>>>>>> features as you mention. You could also use, say, StringIndexer ->
>>>>>> OneHotEncoder, or VectorIndexer. You could create a Pipeline to deal with
>>>>>> the mapping and training (e.g.
>>>>>> http://spark.apache.org/docs/latest/ml-guide.html#example-pipeline).
>>>>>> Pipeline supports persistence.
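>>>>>>
>>>>>> Just to sketch the shape of it (assuming a DataFrame df with a single
>>>>>> categorical column "thingId" and a "label" column - your real data has
>>>>>> many Thing IDs per row, so treat this as illustrative only):
>>>>>>
>>>>>>   import org.apache.spark.ml.Pipeline
>>>>>>   import org.apache.spark.ml.classification.LogisticRegression
>>>>>>   import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
>>>>>>
>>>>>>   // index the raw IDs into a compact 0..N-1 range, then one-hot encode
>>>>>>   val indexer = new StringIndexer().setInputCol("thingId").setOutputCol("thingIndex")
>>>>>>   val encoder = new OneHotEncoder().setInputCol("thingIndex").setOutputCol("features")
>>>>>>   val lr = new LogisticRegression().setMaxIter(100).setRegParam(0.01)
>>>>>>
>>>>>>   // the fitted PipelineModel carries the ID mapping along with the LR model
>>>>>>   val pipeline = new Pipeline().setStages(Array(indexer, encoder, lr))
>>>>>>   val model = pipeline.fit(df)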
>>>>>>
>>>>>> But it depends on your scoring use case too - a Spark pipeline can be
>>>>>> saved and then reloaded, but you need all of Spark's dependencies in your
>>>>>> serving app, which is often not ideal. If you're doing bulk scoring
>>>>>> offline, then it may suit.
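>>>>>>
>>>>>> For the offline bulk-scoring case, the fitted pipeline above could be
>>>>>> saved and reloaded roughly like this (assuming all stages support ML
>>>>>> persistence in your Spark version; the path and newData are placeholders):
>>>>>>
>>>>>>   import org.apache.spark.ml.PipelineModel
>>>>>>
>>>>>>   // save the whole fitted pipeline (index mapping + encoder + LR) as one unit
>>>>>>   model.write.overwrite().save("/models/thing-lr")
>>>>>>
>>>>>>   // reload it later in a batch scoring job and apply it to new records
>>>>>>   val reloaded = PipelineModel.load("/models/thing-lr")
>>>>>>   val scored = reloaded.transform(newData)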
>>>>>>
>>>>>> Honestly though, for that data size I'd certainly go with something
>>>>>> like Liblinear :) Spark will ultimately scale better with the # of
>>>>>> training examples for very large scale problems. However, there are
>>>>>> definitely limitations on model dimension and sparse weight vectors
>>>>>> currently. There are potential solutions to these, but they haven't been
>>>>>> implemented yet.
>>>>>>
>>>>>> On Fri, 11 Mar 2016 at 18:35 Daniel Siegmann <
>>>>>> daniel.siegmann@teamaol.com> wrote:
>>>>>>
>>>>>>> On Fri, Mar 11, 2016 at 5:29 AM, Nick Pentreath <
>>>>>>> nick.pentreath@gmail.com> wrote:
>>>>>>>
>>>>>>>> Would you mind letting us know the # of training examples in the
>>>>>>>> datasets? Also, what do your features look like? Are they text,
>>>>>>>> categorical, etc.? You mention that most rows only have a few features,
>>>>>>>> and all rows together have a few tens of thousands of features, yet your
>>>>>>>> max feature value is 20 million. How are you constructing your feature
>>>>>>>> vectors to get a size of 20 million? The only realistic way I can see this
>>>>>>>> situation occurring in practice is with feature hashing (HashingTF).
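>>>>>>>>
>>>>>>>> By feature hashing I mean something along these lines (a sketch,
>>>>>>>> assuming a DataFrame df with an array-of-strings column "things"
>>>>>>>> holding each record's Thing IDs):
>>>>>>>>
>>>>>>>>   import org.apache.spark.ml.feature.HashingTF
>>>>>>>>
>>>>>>>>   // hashes each ID into a fixed-size vector, so the dimension is bounded
>>>>>>>>   // by numFeatures rather than by the largest raw ID value
>>>>>>>>   val hashingTF = new HashingTF()
>>>>>>>>     .setInputCol("things")
>>>>>>>>     .setOutputCol("features")
>>>>>>>>     .setNumFeatures(1 << 18)
>>>>>>>>   val hashed = hashingTF.transform(df)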
>>>>>>>>
>>>>>>>
>>>>>>> The sub-sample I'm currently training on is about 50K rows, so ...
>>>>>>> small.
>>>>>>>
>>>>>>> The features causing this issue are numeric (int) IDs for ... let's
>>>>>>> call it "Thing". For each Thing in the record, we set the feature
>>>>>>> Thing.id to a value of 1.0 in our vector (which is of course a
>>>>>>> SparseVector). I'm not sure how IDs are generated for Things, but
>>>>>>> they can be large numbers.
>>>>>>>
>>>>>>> The largest Thing ID is around 20 million, so that ends up being the
>>>>>>> size of the vector. But in fact there are fewer than 10,000 unique Thing
>>>>>>> IDs in this data. The mean number of features per record in what I'm
>>>>>>> currently training against is 41, while the maximum for any given record
>>>>>>> was 1754.
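>>>>>>>
>>>>>>> To make that concrete, the vectors are built essentially like this (an
>>>>>>> illustrative snippet, not our actual code):
>>>>>>>
>>>>>>>   import org.apache.spark.mllib.linalg.Vectors
>>>>>>>
>>>>>>>   // one record whose Things happen to have IDs 17, 5280 and 19999999;
>>>>>>>   // the vector size must cover the largest possible ID (~20 million),
>>>>>>>   // even though only a handful of entries are ever non-zero
>>>>>>>   val thingIds = Array(17, 5280, 19999999)
>>>>>>>   val features = Vectors.sparse(20000000, thingIds, Array.fill(thingIds.length)(1.0))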
>>>>>>>
>>>>>>> It is possible to map the features into a small set (just need to
>>>>>>> zipWithIndex), but this is undesirable because of the added complexity (not
>>>>>>> just for the training, but also for anything wanting to score against the
>>>>>>> model). It might be a little easier if this could be encapsulated within
>>>>>>> the model object itself (perhaps via composition), though I'm not sure how
>>>>>>> feasible that is.
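>>>>>>>
>>>>>>> The remapping itself would look roughly like this (a sketch; "records",
>>>>>>> "thingIds" and "label" are made-up names for however the parsed data is
>>>>>>> actually represented):
>>>>>>>
>>>>>>>   import org.apache.spark.mllib.linalg.Vectors
>>>>>>>
>>>>>>>   // build a dense 0..N-1 index over the < 10,000 Thing IDs actually seen
>>>>>>>   val idToIndex: Map[Int, Int] = records
>>>>>>>     .flatMap(_.thingIds)
>>>>>>>     .distinct()
>>>>>>>     .zipWithIndex()
>>>>>>>     .mapValues(_.toInt)
>>>>>>>     .collectAsMap()
>>>>>>>     .toMap
>>>>>>>
>>>>>>>   // re-encode each record against the compact index
>>>>>>>   val numFeatures = idToIndex.size
>>>>>>>   val remapped = records.map { r =>
>>>>>>>     val idx = r.thingIds.distinct.map(idToIndex).sorted
>>>>>>>     (r.label, Vectors.sparse(numFeatures, idx, Array.fill(idx.length)(1.0)))
>>>>>>>   }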
>>>>>>>
>>>>>>> But I'd rather not bother with dimensionality reduction at all -
>>>>>>> since we can train using liblinear in just a few minutes, it doesn't seem
>>>>>>> necessary.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> MultivariateOnlineSummarizer uses dense arrays, but it should be
>>>>>>>> possible to enable sparse data. Though in theory, the result will tend to
>>>>>>>> be dense anyway, unless you have very many entries in the input feature
>>>>>>>> vector that never occur and are actually zero throughout the data set
>>>>>>>> (which it seems is the case with your data?). So I doubt whether using
>>>>>>>> sparse vectors for the summarizer would improve performance in general.
>>>>>>>>
>>>>>>>
>>>>>>> Yes, that is exactly my case - the vast majority of entries in the
>>>>>>> input feature vector will *never* occur. Presumably that means most
>>>>>>> of the values in the aggregators' arrays will be zero.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> LR doesn't accept a sparse weight vector, as it uses dense vectors
>>>>>>>> for coefficients and gradients currently. When using L1 regularization, it
>>>>>>>> could support sparse weight vectors, but the current implementation doesn't
>>>>>>>> do that yet.
>>>>>>>>
>>>>>>>
>>>>>>> Good to know it is theoretically possible to implement. I'll have to
>>>>>>> give it some thought. In the meantime I guess I'll experiment with
>>>>>>> coalescing the data to minimize the communication overhead.
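>>>>>>>
>>>>>>> (By coalescing I just mean something like
>>>>>>>
>>>>>>>   val coalesced = trainingData.coalesce(16).cache()
>>>>>>>
>>>>>>> with "trainingData" standing in for whatever we feed to the trainer and
>>>>>>> the partition count something to tune - fewer partitions means fewer
>>>>>>> partial gradients to aggregate per iteration.)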
>>>>>>>
>>>>>>> Thanks again.
>>>>>>>
>>>>>>
>>>>>
>>