Posted to user@spark.apache.org by David Brooks <da...@whisk.co.uk> on 2016/01/26 00:06:28 UTC

MLlib OneVsRest causing intermittent exceptions

Hi,

I've run into an exception using MLlib OneVsRest with logistic regression
(v1.6.0, but also in previous versions).

The issue is intermittent.  When running multiclass classification with
K-fold cross validation, there are scenarios where the split does not
contain instances for every target label.  In such cases, an
ArrayIndexOutOfBoundsException is generated.
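
To illustrate why the failure is intermittent, here is a minimal plain-Scala sketch (not Spark code). The object and method names are hypothetical, and holding out row i in fold (i % k) is an assumed stand-in for Spark's random fold assignment. It reports which labels each fold's training rows lack:

```scala
// Plain-Scala model of K-fold splitting; illustrative only, not Spark's implementation.
object RareClassFolds {
  // Labels that the range 0 until numClasses expects but the training rows lack.
  def missingLabels(trainLabels: Seq[Int], numClasses: Int): Seq[Int] =
    (0 until numClasses).filterNot(trainLabels.toSet)

  // For each fold, list the labels absent from that fold's training rows.
  // Row i is held out in fold (i % k); Spark assigns rows to folds randomly,
  // so this fixed assignment is only an assumption made for reproducibility.
  def missingPerFold(labels: Seq[Int], k: Int, numClasses: Int): Map[Int, Seq[Int]] =
    (0 until k).map { fold =>
      val train = labels.zipWithIndex.collect { case (label, i) if i % k != fold => label }
      fold -> missingLabels(train, numClasses)
    }.toMap
}
```

With labels Vector(0, 0, 1, 2, 2, 3, 3, 0), k = 4 and numClasses = 4, only fold 2 holds out the single row with label 1, so only that fold's training rows are missing a label.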

I've tried to reproduce the problem in a simple SBT project here:

   https://github.com/junglebarry/SparkOneVsRestTest

I don't imagine this is typical - it first surfaced when running over a
dataset with some very rare classes.

I'm happy to look into patching the code, but I first wanted to confirm
that the problem was real, and that I wasn't somehow misunderstanding how I
should be using OneVsRest.

Any guidance would be appreciated - I'm new to the list.

Many thanks,
David

Re: MLlib OneVsRest causing intermittent exceptions

Posted by David Brooks <da...@whisk.co.uk>.
Hi Ram,

Yes, I completely agree.  An exception is a poor way to handle this case, and
training on a dataset containing only 0 labels and no 1 labels should simply
work without exceptions.
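
One way to read "should simply work" is that a binary classifier given a single-class training set could fall back to a constant prediction. The plain-Scala sketch below shows that idea only; it is not the code from the commit linked in this message, and every name in it is hypothetical:

```scala
// Illustrative fallback for single-class training data; not Spark's implementation.
object ConstantLabelFallback {
  final case class BinaryModel(weights: Vector[Double], intercept: Double) {
    // Logistic probability of the positive class (label 1).
    def probability(features: Vector[Double]): Double = {
      val margin = weights.zip(features).map { case (w, x) => w * x }.sum + intercept
      1.0 / (1.0 + math.exp(-margin))
    }
  }

  // If every label is identical, return a model that always predicts that label:
  // zero weights plus an infinite intercept. Otherwise return None, meaning the
  // caller should run ordinary training.
  def constantModel(labels: Seq[Int], numFeatures: Int): Option[BinaryModel] =
    labels.distinct match {
      case Seq(0) => Some(BinaryModel(Vector.fill(numFeatures)(0.0), Double.NegativeInfinity))
      case Seq(1) => Some(BinaryModel(Vector.fill(numFeatures)(0.0), Double.PositiveInfinity))
      case _      => None
    }
}
```

For an all-0 training set the returned model assigns probability 0.0 to label 1 for any finite feature vector, so no exception is needed.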

Fortunately, it looks like someone else has recently patched the problem
with LogisticRegression:


https://github.com/apache/spark/commit/2388de51912efccaceeb663ac56fc500a79d2ceb

This should resolve the issue I'm experiencing.  I'll get hold of a build
from source and try it out.

Thanks for all your help!

David

On Wed, Jan 27, 2016 at 12:51 AM Ram Sriharsha <sr...@gmail.com>
wrote:

> btw, OneVsRest is using the labels in the dataset that is fed to the fit
> method, in case the metadata is missing.
> So if the metadata contains a label, we expect that label to be present in
> the dataset passed to the fit method.
> If you want OneVsRest to compute the labels you can leave the label
> metadata empty in which case we first compute the # of
> labels in the training dataset.
>
> If the training dataset contains a given label, then logistic regression
> should work fine regardless of the rarity of that label (performance might
> be bad but it won't throw an exception afaik)
>
> if the training dataset does not contain a given label but the metadata
> does, then we do end up training classifiers which will never see that
> label.
> But even here, what gets passed to the underlying classifier is a dataset
> with, say, only 0 labels and no 1 labels.
> A classifier should be able to handle this... but if it cannot for some
> reason, we can have a check in OneVsRest that doesn't train that classifier
>
> On Tue, Jan 26, 2016 at 4:33 PM, Ram Sriharsha <sr...@gmail.com>
> wrote:
>
>> Hey David
>>
>> In your scenario, OneVsRest is training a classifier for 1 vs not 1...
>> and the input dataset for fit (or train) has labeled data for label 1
>>
>> But the underlying binary classifier (LogisticRegression) uses sampling
>> to select a subset of the data during each iteration, and it is possible
>> that this sample does not include any examples with label 1 (i.e.
>> numClasses = 1)
>>
>> So the examples it selects in that iteration only include 0 labeled data
>> and nothing with label 1.
>>
>> But why should it throw an exception? if it does, then i would think we
>> need to fix the issue in the underlying algorithm instead of the
>> reduction somehow knowing that the binary classifier is sampling from the
>> training dataset.
>>
>> Or am I misunderstanding the issue here?
>>
>> I'll take a look at the gist you linked when I get a chance, thanks!
>>
>> Ram
>>
>> On Tue, Jan 26, 2016 at 4:06 PM, David Brooks <da...@whisk.co.uk> wrote:
>>
>>> Hi Ram, Joseph,
>>>
>>> That's right, but I will clarify:
>>>
>>> (a) a random split can generate a training set that does not contain
>>> some rare class
>>> (b) when LogisticRegression is run over a dataframe where all instances
>>> have the same class label, it throws an ArrayIndexOutOfBoundsException.
>>>
>>> When (a) occurs, (b) is the consequence.  The rare class is missing from
>>> the training set, so you would not expect OneVsRest to train a binary
>>> classifier on it; however, because OneVsRest trains binary classifiers on
>>> all class labels in the range (0 until numClasses), it *will* train a
>>> binary classifier on the missing class, which leads to the exception from
>>> (b).
>>>
>>> A concrete example:
>>>
>>>    - class labels 0, 1, 2, 3 are present in dataset (*numClasses* = 4);
>>>    - 0, 2, 3 are in the training set after random split (no *1*);
>>>    - The range (0 until 4) is used to train binary classifiers on each of
>>>    0, *1*, 2, 3
>>>    - As soon as the classifier is trained on *1*, the exception is
>>>    thrown
>>>
>>> I'd suggest:
>>>
>>>    1. In LogisticRegression, where numClasses == 1, throw a more
>>>    meaningful validation exception (avoiding the more cryptic
>>>    ArrayIndexOutOfBoundsException)
>>>    2. Only run OneVsRest for class labels that appear in the dataframe,
>>>    rather than all labels in the Range(0, numClasses).
>>>
>>> I created a few simple test cases for running from SBT, like this one
>>> <https://github.com/junglebarry/SparkOneVsRestTest/blob/master/src/main/scala/SparkOneVsRestTest_2_Errors.scala>,
>>> but I've turned them into gists now for spark-shell:
>>>
>>>    - LogisticRegression throwing ArrayIndexOutOfBoundsException
>>>    <https://gist.github.com/junglebarry/a7cedce6eaf978d7b9ee>
>>>    - OneVsRest throwing ArrayIndexOutOfBoundsException
>>>    <https://gist.github.com/junglebarry/66234edfebaad6254ebe> (with a
>>>    simulated missing class from a Range)
>>>    - OneVsRest throwing ArrayIndexOutOfBoundsException with random split
>>>    <https://gist.github.com/junglebarry/6073aa474d89f3322063>.  This one
>>>    only throws in about 2/3 of runs, due to randomness.
>>>
>>> If these look good as test cases, I'll take a look at filing JIRAs and
>>> getting patches tomorrow morning.  It's late here!
>>>
>>> Thanks for the swift response,
>>> David
>>>
>>>
>>> On Tue, Jan 26, 2016 at 11:09 PM Ram Sriharsha <sr...@gmail.com>
>>> wrote:
>>>
>>>> Hi David
>>>>
>>>> If I am reading the email right, there are two problems here right?
>>>> a) for rare classes the random split will likely miss the rare class.
>>>> b) if it misses the rare class an exception is thrown
>>>>
>>>> I thought the exception stems from b), is that right?... i wouldn't
>>>> expect an exception to be thrown in the case the training dataset is
>>>> missing the rare class.
>>>> could you reproduce this in a simple snippet of code that we can
>>>> quickly test on the shell?
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Jan 26, 2016 at 3:02 PM, Ram Sriharsha <sriharsha.ram@gmail.com
>>>> > wrote:
>>>>
>>>>> Hey David, yeah, absolutely! Feel free to create a JIRA and attach
>>>>> your patch to it. We can help review it and pull in the fix... happy to
>>>>> accept contributions!
>>>>> CCing Joseph, who is one of the maintainers of MLlib as well. When
>>>>> creating the JIRA, can you attach a simple test case?
>>>>>
>>>>> On Tue, Jan 26, 2016 at 2:59 PM, David Brooks <da...@whisk.co.uk>
>>>>> wrote:
>>>>>
>>>>>> Hi again Ram,
>>>>>>
>>>>>> Sorry, I was too hasty in my previous response.  I've done a bit more
>>>>>> digging through the code, and StringIndexer does indeed provide metadata,
>>>>>> as a NominalAttribute with a known number of class labels.  I don't think
>>>>>> the issue is related to the use of metadata, however.
>>>>>>
>>>>>> It seems to me to be caused by the interaction between OneVsRest and
>>>>>> TrainValidationSplit.  For rare target classes under OneVsRest, it seems
>>>>>> quite possible for this random-split approach to select a training subset
>>>>>> where all items belong to non-target classes - all of which are given the
>>>>>> same class label by OneVsRest.  In this case, we start training
>>>>>> LogisticRegression on data of a single class, which seems odd.  The
>>>>>> exception stems from there.
>>>>>>
>>>>>> The cause looks to me to be that OneVsRest.fit runs binary
>>>>>> classifications from 0 to numClasses (OneVsRest.scala:209), and this seems
>>>>>> incompatible with the random split, which cannot guarantee training
>>>>>> examples for all labels in the range.  It might be preferable to iterate
>>>>>> over the observed labels in the training set, rather than all labels in the
>>>>>> range.  I don't know the performance effects of that change, but it does
>>>>>> look incompatible with using the label metadata as a shortcut.
>>>>>>
>>>>>> Do you agree that there is an issue here?  Would you accept
>>>>>> contributions to the code to remedy it?  I'd gladly take a look if I can be
>>>>>> of help.
>>>>>>
>>>>>> Many thanks,
>>>>>> David
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Jan 26, 2016 at 1:29 PM David Brooks <da...@whisk.co.uk>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Ram,
>>>>>>>
>>>>>>> I didn't include an explicit label column in my reproduction as I
>>>>>>> thought it superfluous.  However, in my original use-case, I was using a
>>>>>>> StringIndexer, where the labels were indexed across the entire dataset
>>>>>>> (training+validation+test).  The (indexed) label column was then explicitly
>>>>>>> provided to the OneVsRest instance.
>>>>>>>
>>>>>>> Here's the abridged version:
>>>>>>>
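>>>>>>> import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}
>>>>>>> import org.apache.spark.ml.feature.StringIndexer
>>>>>>> import org.apache.spark.sql.DataFrame
>>>>>>>
>>>>>>> val textDocuments: DataFrame = ??? // real data here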
>>>>>>>
>>>>>>> // Index labels, adding metadata to the label column.
>>>>>>> // Fit on whole dataset to include all labels in index.
>>>>>>> val labelIndexer = new StringIndexer()
>>>>>>>   .setInputCol("label")
>>>>>>>   .setOutputCol("labelIndexed")
>>>>>>>   .fit(textDocuments)
>>>>>>>
>>>>>>> val lrClassifier = new LogisticRegression()
>>>>>>>
>>>>>>> val classifier = new OneVsRest()
>>>>>>>   .setClassifier(lrClassifier)
>>>>>>>   .setLabelCol(labelIndexer.getOutputCol)
>>>>>>>
>>>>>>> // ...
>>>>>>>
>>>>>>>
>>>>>>> There's an explicit reference to the label column, and when created,
>>>>>>> that column contains all possible values of the label (it's `fit` over all
>>>>>>> data).  It looks to me like StringIndexer computes label metadata at that
>>>>>>> point (in `transform`) and attaches it to the column.  This way, I'd hope
>>>>>>> that even once TrainValidationSplit returns a subset dataframe -
>>>>>>> which may not contain all labels - the metadata on the column
>>>>>>> should still contain all labels.
>>>>>>>
>>>>>>> Does my use of StringIndexer count as "metadata", here?  If so, I
>>>>>>> still see the exception as before.
>>>>>>>
>>>>>>> I've pushed a new example using StringIndexer to my earlier repo, so
>>>>>>> you can see the code and issue.  I'm happy to try a simpler method for
>>>>>>> providing column metadata, if one is available.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> David
>>>>>>>
>>>>>>> On Mon, Jan 25, 2016 at 11:13 PM Ram Sriharsha <
>>>>>>> sriharsha.ram@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi David
>>>>>>>>
>>>>>>>> What happens if you provide the class labels via metadata instead
>>>>>>>> of letting OneVsRest determine the labels?
>>>>>>>>
>>>>>>>> Ram
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ram Sriharsha
>>>>>>>> Architect, Spark and Data Science
>>>>>>>> Hortonworks, 2550 Great America Way, 2nd Floor
>>>>>>>> Santa Clara, CA 95054
>>>>>>>> Ph: 408-510-8635
>>>>>>>> email: harsha@apache.org
>>>>>>>>
>>>>>>>> [image: https://www.linkedin.com/in/harsha340]
>>>>>>>> <https://www.linkedin.com/in/harsha340>
>>>>>>>> <https://twitter.com/halfabrane> <https://github.com/harsha2010/>
>>>>>>>>
>>>>>>>>

Re: MLlib OneVsRest causing intermittent exceptions

Posted by Ram Sriharsha <sr...@gmail.com>.
btw, OneVsRest is using the labels in the dataset that is fed to the fit
method, in case the metadata is missing.
So if the metadata contains a label, we expect that label to be present in
the dataset passed to the fit method.
If you want OneVsRest to compute the labels you can leave the label
metadata empty in which case we first compute the # of
labels in the training dataset.

If the training dataset contains a given label, then logistic regression
should work fine regardless of the rarity of that label (performance might
be bad but it won't throw an exception afaik)

if the training dataset does not contain a given label but the metadata
does, then we do end up training classifiers which will never see that
label.
But even here, what gets passed to the underlying classifier is a dataset
with, say, only 0 labels and no 1 labels.
A classifier should be able to handle this... but if it cannot for some
reason, we can have a check in OneVsRest that doesn't train that classifier
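
A check like the one mentioned above could look roughly like the following plain-Scala sketch. It is an illustration under assumptions, not OneVsRest's actual code, and the names are hypothetical. It relabels the data as "target vs rest" and skips any target that has no positive examples:

```scala
// Illustrative model of a "skip untrainable targets" check; not Spark's implementation.
object OneVsRestGuard {
  // Relabel for "target vs rest": 1 for the target class, 0 for everything else.
  def binarize(labels: Seq[Int], target: Int): Seq[Int] =
    labels.map(label => if (label == target) 1 else 0)

  // Split 0 until numClasses into (targets to train, targets to skip).
  // A target is skipped when its binarized labels contain no 1 at all.
  def plan(labels: Seq[Int], numClasses: Int): (Seq[Int], Seq[Int]) =
    (0 until numClasses).partition(target => binarize(labels, target).contains(1))
}
```

For training labels 0, 2, 3 with numClasses = 4 (taken from metadata), the plan trains targets 0, 2 and 3 and skips target 1, whose binarized dataset would contain only 0 labels.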




Re: MLlib OneVsRest causing intermittent exceptions

Posted by Ram Sriharsha <sr...@gmail.com>.
Hey David

In your scenario, OneVsRest is training a classifier for 1 vs not 1... and
the input dataset for fit (or train) has labeled data for label 1

But the underlying binary classifier (LogisticRegression) uses sampling to
select a subset of the data during each iteration, and it is possible that
this sample does not include any examples with label 1 (i.e.
numClasses = 1)

So the examples it selects in that iteration only include 0 labeled data
and nothing with label 1.

But why should it throw an exception? if it does, then i would think we
need to fix the issue in the underlying algorithm instead of the
reduction somehow knowing that the binary classifier is sampling from the
training dataset.

Or am I misunderstanding the issue here?

I'll take a look at the gist you linked when I get a chance, thanks!

Ram

On Tue, Jan 26, 2016 at 4:06 PM, David Brooks <da...@whisk.co.uk> wrote:

> Hi Ram, Joseph,
>
> That's right, but I will clarify:
>
> (a) a random split can generate a training set that does not contain some
> rare class
> (b) when LogisticRegression is run over a dataframe where all instances
> have the same class label, it throws an ArrayIndexOutOfBoundsException.
>
> When (a) occurs, (b) is the consequence.  The rare class is missing from
> the training set, so you would not expect OneVsRest to train a binary
> classifier on it; however, because OneVsRest trains binary classifiers on
> all class labels in the range (0 until numClasses), it *will* train a binary
> classifier on the missing class, which leads to the exception from (b).
>
> A concrete example:
>
>    - class labels 0, 1, 2, 3 are present in dataset (*numClasses* = 4);
>    - 0, 2, 3 are in the training set after random split (no *1*);
>    - The range (0 until 4) is used to train binary classifiers on each of 0,
>    *1*, 2, 3
>    - As soon as the classifier is trained on *1*, the exception is thrown
>
> I'd suggest:
>
>    1. In LogisticRegression, where numClasses == 1, throw a more
>    meaningful validation exception (avoiding the more cryptic
>    ArrayIndexOutOfBoundsException)
>    2. Only run OneVsRest for class labels that appear in the dataframe,
>    rather than all labels in the Range(0, numClasses).
>
> I created a few simple test cases for running from SBT, like this one
> <https://github.com/junglebarry/SparkOneVsRestTest/blob/master/src/main/scala/SparkOneVsRestTest_2_Errors.scala>,
> but I've turned them into gists now for spark-shell:
>
>    - LogisticRegression throwing ArrayIndexOutOfBoundsException
>    <https://gist.github.com/junglebarry/a7cedce6eaf978d7b9ee>
>    - OneVsRest throwing ArrayIndexOutOfBoundsException
>    <https://gist.github.com/junglebarry/66234edfebaad6254ebe> (with a
>    simulated missing class from a Range)
>    - OneVsRest throwing ArrayIndexOutOfBoundsException with random split
>    <https://gist.github.com/junglebarry/6073aa474d89f3322063>.  This one
>    only throws in about 2/3 of runs, due to randomness.
>
> If these look good as test cases, I'll take a look at filing JIRAs and
> getting patches tomorrow morning.  It's late here!
>
> Thanks for the swift response,
> David
>
>
> On Tue, Jan 26, 2016 at 11:09 PM Ram Sriharsha <sr...@gmail.com>
> wrote:
>
>> Hi David
>>
>> If I am reading the email right, there are two problems here right?
>> a) for rare classes the random split will likely miss the rare class.
>> b) if it misses the rare class an exception is thrown
>>
>> I thought the exception stems from b), is that right?... i wouldn't
>> expect an exception to be thrown in the case the training dataset is
>> missing the rare class.
>> could you reproduce this in a simple snippet of code that we can quickly
>> test on the shell?
>>
>>
>>
>>
>> On Tue, Jan 26, 2016 at 3:02 PM, Ram Sriharsha <sr...@gmail.com>
>> wrote:
>>
>>> Hey David, yeah, absolutely! Feel free to create a JIRA and attach your
>>> patch to it. We can help review it and pull in the fix... happy to accept
>>> contributions!
>>> CCing Joseph, who is one of the maintainers of MLlib as well. When
>>> creating the JIRA, can you attach a simple test case?
>>>
>>> On Tue, Jan 26, 2016 at 2:59 PM, David Brooks <da...@whisk.co.uk> wrote:
>>>
>>>> Hi again Ram,
>>>>
>>>> Sorry, I was too hasty in my previous response.  I've done a bit more
>>>> digging through the code, and StringIndexer does indeed provide metadata,
>>>> as a NominalAttribute with a known number of class labels.  I don't think
>>>> the issue is related to the use of metadata, however.
>>>>
>>>> It seems to me to be caused by the interaction between OneVsRest and
>>>> TrainValidationSplit.  For rare target classes under OneVsRest, it seems
>>>> quite possible for this random-split approach to select a training subset
>>>> where all items belong to non-target classes - all of which are given the
>>>> same class label by OneVsRest.  In this case, we start training
>>>> LogisticRegression on data of a single class, which seems odd.  The
>>>> exception stems from there.
>>>>
>>>> The cause looks to me to be that OneVsRest.fit runs binary
>>>> classifications from 0 to numClasses (OneVsRest.scala:209), and this seems
>>>> incompatible with the random split, which cannot guarantee training
>>>> examples for all labels in the range.  It might be preferable to iterate
>>>> over the observed labels in the training set, rather than all labels in the
>>>> range.  I don't know the performance effects of that change, but it does
>>>> look incompatible with using the label metadata as a shortcut.
>>>>
>>>> Do you agree that there is an issue here?  Would you accept
>>>> contributions to the code to remedy it?  I'd gladly take a look if I can be
>>>> of help.
>>>>
>>>> Many thanks,
>>>> David
>>>>
>>>>
>>>>
>>>> On Tue, Jan 26, 2016 at 1:29 PM David Brooks <da...@whisk.co.uk> wrote:
>>>>
>>>>> Hi Ram,
>>>>>
>>>>> I didn't include an explicit label column in my reproduction as I
>>>>> thought it superfluous.  However, in my original use-case, I was using a
>>>>> StringIndexer, where the labels were indexed across the entire dataset
>>>>> (training+validation+test).  The (indexed) label column was then explicitly
>>>>> provided to the OneVsRest instance.
>>>>>
>>>>> Here's the abridged version:
>>>>>
>>>>> val textDocuments = ??? // real data here
>>>>>
>>>>> // Index labels, adding metadata to the label column.
>>>>> // Fit on whole dataset to include all labels in index.
>>>>> val labelIndexer = new StringIndexer()
>>>>>   .setInputCol("label")
>>>>>   .setOutputCol("labelIndexed")
>>>>>   .fit(textDocuments)
>>>>>
>>>>> val lrClassifier = new LogisticRegression()
>>>>>
>>>>> val classifier = new OneVsRest()
>>>>>   .setClassifier(lrClassifier)
>>>>>   .setLabelCol(labelIndexer.getOutputCol)
>>>>>
>>>>> // ...
>>>>>
>>>>>
>>>>> There's an explicit reference to the label column, and when created,
>>>>> that column contains all possible values of the label (it's `fit` over all
>>>>> data).  It looks to me like StringIndexer computes label metadata at that
>>>>> point (in `transform`) and attaches it to the column.  This way, I'd hope
>>>>> that even once TrainValidationSplit returns a subset dataframe -
>>>>> which may not contain all labels - the metadata on the column should
>>>>> still contain all labels.
>>>>>
>>>>> Does my use of StringIndexer count as "metadata", here?  If so, I
>>>>> still see the exception as before.
>>>>>
>>>>> I've pushed a new example using StringIndexer to my earlier repo, so
>>>>> you can see the code and issue.  I'm happy to try a simpler method for
>>>>> providing column metadata, if one is available.
>>>>>
>>>>> Thanks,
>>>>> David
>>>>>
>>>>> On Mon, Jan 25, 2016 at 11:13 PM Ram Sriharsha <
>>>>> sriharsha.ram@gmail.com> wrote:
>>>>>
>>>>>> Hi David
>>>>>>
>>>>>> What happens if you provide the class labels via metadata instead of
>>>>>> letting OneVsRest determine the labels?
>>>>>>
>>>>>> Ram
>>>>>>
>>>>>> On Mon, Jan 25, 2016 at 3:06 PM, David Brooks <da...@whisk.co.uk>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I've run into an exception using MLlib OneVsRest with logistic
>>>>>>> regression (v1.6.0, but also in previous versions).
>>>>>>>
>>>>>>> The issue is intermittent.  When running multiclass classification
>>>>>>> with K-fold cross validation, there are scenarios where the split does not
>>>>>>> contain instances for every target label.  In such cases, an
>>>>>>> ArrayIndexOutOfBoundsException is generated.
>>>>>>>
>>>>>>> I've tried to reproduce the problem in a simple SBT project here:
>>>>>>>
>>>>>>>    https://github.com/junglebarry/SparkOneVsRestTest
>>>>>>>
>>>>>>> I don't imagine this is typical - it first surfaced when running
>>>>>>> over a dataset with some very rare classes.
>>>>>>>
>>>>>>> I'm happy to look into patching the code, but I first wanted to
>>>>>>> confirm that the problem was real, and that I wasn't somehow
>>>>>>> misunderstanding how I should be using OneVsRest.
>>>>>>>
>>>>>>> Any guidance would be appreciated - I'm new to the list.
>>>>>>>
>>>>>>> Many thanks,
>>>>>>> David
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ram Sriharsha
>>>>>> Architect, Spark and Data Science
>>>>>> Hortonworks, 2550 Great America Way, 2nd Floor
>>>>>> Santa Clara, CA 95054
>>>>>> Ph: 408-510-8635
>>>>>> email: harsha@apache.org
>>>>>>
>>>>>> [image: https://www.linkedin.com/in/harsha340]
>>>>>> <https://www.linkedin.com/in/harsha340>
>>>>>> <https://twitter.com/halfabrane> <https://github.com/harsha2010/>
>>>>>>
>>>>>>
>>>
>>>
>>> --
>>> Ram Sriharsha
>>> Architect, Spark and Data Science
>>> Hortonworks, 2550 Great America Way, 2nd Floor
>>> Santa Clara, CA 95054
>>> Ph: 408-510-8635
>>> email: harsha@apache.org
>>>
>>> [image: https://www.linkedin.com/in/harsha340]
>>> <https://www.linkedin.com/in/harsha340> <https://twitter.com/halfabrane>
>>> <https://github.com/harsha2010/>
>>>
>>>
>>
>>
>> --
>> Ram Sriharsha
>> Architect, Spark and Data Science
>> Hortonworks, 2550 Great America Way, 2nd Floor
>> Santa Clara, CA 95054
>> Ph: 408-510-8635
>> email: harsha@apache.org
>>
>> [image: https://www.linkedin.com/in/harsha340]
>> <https://www.linkedin.com/in/harsha340> <https://twitter.com/halfabrane>
>> <https://github.com/harsha2010/>
>>
>>


-- 
Ram Sriharsha
Architect, Spark and Data Science
Hortonworks, 2550 Great America Way, 2nd Floor
Santa Clara, CA 95054
Ph: 408-510-8635
email: harsha@apache.org

[image: https://www.linkedin.com/in/harsha340]
<https://www.linkedin.com/in/harsha340> <https://twitter.com/halfabrane>
<https://github.com/harsha2010/>

Re: MLlib OneVsRest causing intermittent exceptions

Posted by David Brooks <da...@whisk.co.uk>.
Hi Ram, Joseph,

That's right, but I will clarify:

(a) a random split can generate a training set that does not contain some
rare class
(b) when LogisticRegression is run over a dataframe where all instances
have the same class label, it throws an ArrayIndexOutOfBoundsException.

When (a) occurs, (b) is the consequence.  The rare class is missing from
the training set, so you would not expect OneVsRest to train a binary
classifier on it; however, because OneVsRest trains binary classifiers on
all class labels in the range (0 to numClasses), it *will* train a binary
classifier on the missing class, which leads to the exception from (b).

A concrete example:

   - class labels 0, 1, 2, 3 are present in dataset (*numClasses* = 4);
   - 0, 2, 3 are in the training set after random split (no *1*);
   - The range (0 until 4) is used to train binary classifiers on each of 0,
   *1*, 2, 3
   - As soon as the classifier is trained on *1*, the exception is thrown
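
The scenario above can be sketched in plain Scala, without Spark.  This is
only an illustration under stated assumptions: `binarize` and
`singleClassTargets` are hypothetical helpers, not OneVsRest internals, and
the label values are the ones from the example.

```scala
object MissingClassSketch {
  // One-vs-rest relabelling: 1.0 for the target class, 0.0 for everything else.
  def binarize(labels: Seq[Int], target: Int): Seq[Double] =
    labels.map(l => if (l == target) 1.0 else 0.0)

  // Targets in (0 until numClasses) whose binary problem contains
  // fewer than two distinct labels, i.e. the degenerate cases.
  def singleClassTargets(labels: Seq[Int], numClasses: Int): Seq[Int] =
    (0 until numClasses).filter(t => binarize(labels, t).distinct.size < 2)

  def main(args: Array[String]): Unit = {
    val numClasses = 4                          // labels 0, 1, 2, 3 exist in the full dataset
    val trainingLabels = Seq(0, 2, 3, 0, 2, 3)  // label 1 was lost in the random split
    println(singleClassTargets(trainingLabels, numClasses))
  }
}
```

With these inputs only target 1 is flagged: every training row is relabelled
0.0 for it, which is the single-class input described in (b).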

I'd suggest:

   1. In LogisticRegression, where numClasses == 1, throw a more
   meaningful validation exception (avoiding the more cryptic
   ArrayIndexOutOfBoundsException)
   2. Only run OneVsRest for class labels that appear in the dataframe,
   rather than all labels in the Range(0, numClasses).
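
As a rough sketch of both suggestions, again in plain Scala rather than
against the real Spark classes: `requireTwoClasses` and `targetsToTrain` are
hypothetical names used only for illustration, and the error message wording
is an assumption.

```scala
object ObservedLabelsSketch {
  // Suggestion 1: fail fast with a descriptive message when the
  // binary training data contains fewer than two distinct labels.
  def requireTwoClasses(binaryLabels: Seq[Double]): Unit = {
    val seen = binaryLabels.distinct.mkString(", ")
    require(binaryLabels.distinct.size >= 2,
      s"Binary training data must contain both classes, but found only: $seen")
  }

  // Suggestion 2: one binary problem per label actually observed
  // in the training data, instead of every index in the range.
  def targetsToTrain(trainingLabels: Seq[Int]): Seq[Int] =
    trainingLabels.distinct.sorted

  def main(args: Array[String]): Unit = {
    val trainingLabels = Seq(0, 2, 3, 0, 2, 3)  // label 1 is absent
    targetsToTrain(trainingLabels).foreach { target =>
      val binary = trainingLabels.map(l => if (l == target) 1.0 else 0.0)
      requireTwoClasses(binary)
      println(s"would train a binary classifier for label $target")
    }
  }
}
```

Here label 1 is never visited, so no degenerate binary problem is built; the
check in `requireTwoClasses` only fires if the whole training set holds a
single class.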

I created a few simple test cases for running from SBT, like this one
<https://github.com/junglebarry/SparkOneVsRestTest/blob/master/src/main/scala/SparkOneVsRestTest_2_Errors.scala>,
but I've turned them into gists now for spark-shell:

   - LogisticRegression throwing ArrayIndexOutOfBoundsException
   <https://gist.github.com/junglebarry/a7cedce6eaf978d7b9ee>
   - OneVsRest throwing ArrayIndexOutOfBoundsException
   <https://gist.github.com/junglebarry/66234edfebaad6254ebe> (with a
   simulated missing class from a Range)
   - OneVsRest throwing ArrayIndexOutOfBoundsException with random split
   <https://gist.github.com/junglebarry/6073aa474d89f3322063>.  This only
   throws in about 2/3 of runs, due to randomness.

If these look good as test cases, I'll take a look at filing JIRAs and
getting patches together tomorrow morning.  It's late here!

Thanks for the swift response,
David


On Tue, Jan 26, 2016 at 11:09 PM Ram Sriharsha <sr...@gmail.com>
wrote:

> Hi David
>
> If I am reading the email right, there are two problems here, right?
> a) For rare classes, the random split will likely miss the rare class.
> b) If it misses the rare class, an exception is thrown.
>
> I thought the exception stems from b), is that right? I wouldn't expect
> an exception to be thrown when the training dataset is missing the
> rare class.
> Could you reproduce this in a simple snippet of code that we can quickly
> test in the shell?

Re: MLlib OneVsRest causing intermittent exceptions

Posted by Ram Sriharsha <sr...@gmail.com>.
Hi David

If I am reading the email right, there are two problems here, right?
a) For rare classes, the random split will likely miss the rare class.
b) If it misses the rare class, an exception is thrown.

I thought the exception stems from b), is that right? I wouldn't expect
an exception to be thrown when the training dataset is missing the
rare class.
Could you reproduce this in a simple snippet of code that we can quickly
test in the shell?




On Tue, Jan 26, 2016 at 3:02 PM, Ram Sriharsha <sr...@gmail.com>
wrote:

> Hey David, yeah, absolutely! Feel free to create a JIRA and attach your
> patch to it. We can help review it and pull in the fix. Happy to accept
> contributions!
> CCing Joseph, who is one of the maintainers of MLlib as well. When
> creating the JIRA, can you attach a simple test case?



Re: MLlib OneVsRest causing intermittent exceptions

Posted by Ram Sriharsha <sr...@gmail.com>.
Hey David, yeah, absolutely! Feel free to create a JIRA and attach your
patch to it. We can help review it and pull in the fix. Happy to accept
contributions!
CCing Joseph, who is one of the maintainers of MLlib as well. When creating
the JIRA, can you attach a simple test case?

On Tue, Jan 26, 2016 at 2:59 PM, David Brooks <da...@whisk.co.uk> wrote:

> Hi again Ram,
>
> Sorry, I was too hasty in my previous response.  I've done a bit more
> digging through the code, and StringIndexer does indeed provide metadata,
> as a NominalAttribute with a known number of class labels.  I don't think
> the issue is related to the use of metadata, however.
>
> It seems to me to be caused by the interaction between OneVsRest and
> TrainValidationSplit.  For rare target classes under OneVsRest, it seems
> quite possible for this random-split approach to select a training subset
> where all items belong to non-target classes - all of which are given the
> same class label by OneVsRest.  In this case, we start training
> LogisticRegression on data of a single class, which seems odd.  The
> exception stems from there.
>
> The cause looks to me to be that OneVsRest.fit runs binary classifications
> from 0 to numClasses (OneVsRest.scala:209), and this seems incompatible
> with the random split, which cannot guarantee training examples for all
> labels in the range.  It might be preferable to iterate over the observed
> labels in the training set, rather than all labels in the range.  I don't
> know the performance effects of that change, but it does look incompatible
> with using the label metadata as a shortcut.
>
> Do you agree that there is an issue here?  Would you accept contributions
> to the code to remedy it?  I'd gladly take a look if I can be of help.
>
> Many thanks,
> David



Re: MLlib OneVsRest causing intermittent exceptions

Posted by David Brooks <da...@whisk.co.uk>.
Hi again Ram,

Sorry, I was too hasty in my previous response.  I've done a bit more
digging through the code, and StringIndexer does indeed provide metadata,
as a NominalAttribute with a known number of class labels.  I don't think
the issue is related to the use of metadata, however.

It seems to me to be caused by the interaction between OneVsRest and
TrainValidationSplit.  For rare target classes under OneVsRest, it seems
quite possible for this random-split approach to select a training subset
where all items belong to non-target classes - all of which are given the
same class label by OneVsRest.  In this case, we start training
LogisticRegression on data of a single class, which seems odd.  The
exception stems from there.
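
To get a feel for how often this can happen, here is a back-of-the-envelope
sketch.  It assumes each row lands in the training subset independently with
probability `trainFraction`, which is a simplification of how a random split
behaves; `probAbsent` is a hypothetical helper, not part of Spark.

```scala
object RareClassMissSketch {
  // Probability that none of the k instances of a class ends up in the
  // training subset, if each row is kept independently with trainFraction.
  def probAbsent(k: Int, trainFraction: Double): Double =
    math.pow(1.0 - trainFraction, k)

  def main(args: Array[String]): Unit = {
    for (k <- Seq(1, 2, 5)) {
      println(f"k=$k%d  P(absent)=${probAbsent(k, 0.75)}%.4f")
    }
  }
}
```

Under that assumption, a class with a single instance is missing from a 75%
training split a quarter of the time, and the chance drops off geometrically
as the class gets more instances.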

The cause looks to me to be that OneVsRest.fit runs binary classifications
from 0 to numClasses (OneVsRest.scala:209), and this seems incompatible
with the random split, which cannot guarantee training examples for all
labels in the range.  It might be preferable to iterate over the observed
labels in the training set, rather than all labels in the range.  I don't
know the performance effects of that change, but it does look incompatible
with using the label metadata as a shortcut.

Do you agree that there is an issue here?  Would you accept contributions
to the code to remedy it?  I'd gladly take a look if I can be of help.

Many thanks,
David



On Tue, Jan 26, 2016 at 1:29 PM David Brooks <da...@whisk.co.uk> wrote:

> Hi Ram,
>
> I didn't include an explicit label column in my reproduction as I thought
> it superfluous.  However, in my original use-case, I was using a
> StringIndexer, where the labels were indexed across the entire dataset
> (training+validation+test).  The (indexed) label column was then explicitly
> provided to the OneVsRest instance.
>
> Here's the abridged version:
>
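> import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}
> import org.apache.spark.ml.feature.StringIndexer
> import org.apache.spark.sql.DataFrame
>
> val textDocuments: DataFrame = ??? // real data here
>
> // Index labels, adding metadata to the label column.
> // Fit on whole dataset to include all labels in index.
> val labelIndexer = new StringIndexer()
>   .setInputCol("label")
>   .setOutputCol("labelIndexed")
>   .fit(textDocuments)
>
> // Base binary classifier that OneVsRest trains once per class.
> val lrClassifier = new LogisticRegression()
>
> val classifier = new OneVsRest()
>   .setClassifier(lrClassifier)
>   .setLabelCol(labelIndexer.getOutputCol)
>
> // ...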
>
>
> There's an explicit reference to the label column, and when created, that
> column contains all possible values of the label (it's `fit` over all
> data).  It looks to me like StringIndexer computes label metadata at that
> point (in `transform`) and attaches it to the column.  This way, I'd hope
> that even once TrainValidationSplit returns a subset dataframe - which
> may not contain all labels - the metadata on the column should still
> contain all labels.
>
> Does my use of StringIndexer count as "metadata", here?  If so, I still
> see the exception as before.
>
> I've pushed a new example using StringIndexer to my earlier repo, so you
> can see the code and issue.  I'm happy to try a simpler method for
> providing column metadata, if one is available.
>
> Thanks,
> David

Re: MLlib OneVsRest causing intermittent exceptions

Posted by David Brooks <da...@whisk.co.uk>.
Hi Ram,

I didn't include an explicit label column in my reproduction as I thought
it superfluous.  However, in my original use-case, I was using a
StringIndexer, where the labels were indexed across the entire dataset
(training+validation+test).  The (indexed) label column was then explicitly
provided to the OneVsRest instance.

Here's the abridged version:

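import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.DataFrame

val textDocuments: DataFrame = ??? // real data here

// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("labelIndexed")
  .fit(textDocuments)

// Base binary classifier that OneVsRest trains once per class.
val lrClassifier = new LogisticRegression()

// Point OneVsRest at the indexed label column produced above.
val classifier = new OneVsRest()
  .setClassifier(lrClassifier)
  .setLabelCol(labelIndexer.getOutputCol)

// ...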


OneVsRest refers explicitly to the label column, and because the indexer is
`fit` over all the data, that column's index covers every possible label
value.  It looks to me like StringIndexer computes label metadata at that
point (in `transform`) and attaches it to the column.  This way, I'd hope
that even once TrainValidationSplit returns a subset DataFrame - which may
not contain all labels - the metadata on the column should still contain
all labels.

Does my use of StringIndexer count as "metadata", here?  If so, I still see
the exception as before.
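
For readers following along, here is a minimal sketch of the failure mode
in plain Scala collections.  It is not Spark's implementation: the object
name, the helper functions, the sample fold and the histogram sizing rule
are all assumptions made for illustration.  It shows why a fold that lacks
a class leaves the one-vs-rest sub-problem for that class with only zero
labels, so any code that expects a bucket for label 1.0 has nothing to
index.

```scala
// Illustrative model only: plain Scala collections, not Spark's implementation.
object MissingLabelSketch {
  // One-vs-rest relabeling: 1.0 for the target class, 0.0 for everything else.
  def binarize(labels: Seq[Double], target: Double): Seq[Double] =
    labels.map(l => if (l == target) 1.0 else 0.0)

  // Histogram sized by the largest label actually observed (an assumption
  // made for illustration): an all-zero input yields a single bucket.
  def histogram(binaryLabels: Seq[Double]): Array[Long] = {
    val counts = new Array[Long](binaryLabels.max.toInt + 1)
    binaryLabels.foreach(l => counts(l.toInt) += 1)
    counts
  }

  // Hypothetical fold from which the rare class 2.0 happens to be absent.
  val foldLabels: Seq[Double] = Seq(0.0, 1.0, 0.0, 1.0, 1.0, 0.0)

  def main(args: Array[String]): Unit = {
    // Class 1.0 is present, so both buckets exist.
    val ok = histogram(binarize(foldLabels, 1.0))
    println(s"class 1.0: negatives=${ok(0)}, positives=${ok(1)}")

    // Class 2.0 is absent, so every binary label is 0.0 and only one bucket exists.
    val degenerate = histogram(binarize(foldLabels, 2.0))
    println(s"class 2.0: buckets=${degenerate.length}")
    // Reading degenerate(1) here would throw ArrayIndexOutOfBoundsException.
  }
}
```

If this model is roughly right, column metadata listing every label would
not help on its own: the per-class binary training step would still need
to tolerate a dataset with no positive examples.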

I've pushed a new example using StringIndexer to my earlier repo, so you
can see the code and issue.  I'm happy to try a simpler method for
providing column metadata, if one is available.

Thanks,
David

On Mon, Jan 25, 2016 at 11:13 PM Ram Sriharsha <sr...@gmail.com>
wrote:

> Hi David
>
> What happens if you provide the class labels via metadata instead of
> letting OneVsRest determine the labels?
>
> Ram

Re: MLlib OneVsRest causing intermittent exceptions

Posted by Ram Sriharsha <sr...@gmail.com>.
Hi David

What happens if you provide the class labels via metadata instead of
letting OneVsRest determine the labels?

Ram




-- 
Ram Sriharsha
Architect, Spark and Data Science
Hortonworks, 2550 Great America Way, 2nd Floor
Santa Clara, CA 95054
Ph: 408-510-8635
email: harsha@apache.org
