Posted to dev@mahout.apache.org by Ángel Martínez González <am...@gmail.com> on 2013/07/07 18:57:35 UTC

Re: [jira] [Commented] (MAHOUT-1179) GSOC 2013: Refactor and improve the classification APIs

Hi all,

I did not receive any feedback about this. I understand that now is a 
busy time with the work on version 0.8. Is there still interest in 
refactoring the classification APIs once 0.8 is released? Or should I 
just move on and look for another way to contribute? I think the 
changes proposed in the document may not be very exciting, but some 
homogenization of Mahout's algorithms is necessary.
If more detailed planning is needed, I could break the changes down into 
a list of tasks granular enough to be registered as individual JIRA 
issues. Would that help?

Regards,
Angel



On 26/05/2013 22:21, Ángel Martínez González wrote:
> Hi Ted and all,
>
> I've prepared a short document describing the current state of the 
> classification APIs and the proposed changes.
>
> https://docs.google.com/document/d/1Rqn-8aMgK6g9UZuKyD2fpMOY3FJ1xGzE7HUNPCaSd7I/edit?usp=sharing 
>
>
> I'm eager to hear any feedback about it!
>
> The document does not include anything about task ordering or planning. 
> In fact, I have a couple of questions about that: since we are talking 
> about refactoring, it would be quite natural to make the changes in 
> many small commits. Would that be possible, or will the work have to be 
> packed into a few big commits? Also, will some committer be able to 
> review the work periodically? And could the changes interfere with 
> the next version release?
>
> Thanks!
> Angel
>
>
> On 20/05/2013 10:17, Ángel Martínez González wrote:
>>
>> Hi,
>> I'm preparing a short text describing the current state of each 
>> algorithm and the needed changes (also including the data 
>> preprocessing and result evaluation modules). That will answer 
>> question c).
>> I'll try to answer the other two here:
>>
>> On 17/05/2013 9:37, Ted Dunning wrote:
>>> Please lay out a plan before coding. The key questions will be
>>>
>>> a) can you serialize a model efficiently?
>> That should not be a problem. These proposed changes only cover the 
>> input and output data formats, not the classification models 
>> themselves, so model serialization would work just as before. As for 
>> the input and output data, the formats are similar to the ones used 
>> for clustering, and feature hashing will also be supported.
>>> b) can you deal with the random forest and SGD models?
>> I've been looking into possible incompatibilities between classifiers, 
>> and I've found the following difficulties related to the input format:
>>
>> - The proposed input for trainers is SequenceFile<IntWritable, 
>> VectorWritable>, where the key would be an instance id and the target 
>> variable (class label) would be stored inside the vector (a minimal 
>> sketch of this format follows the list below). But if feature hashing 
>> is used, hash collisions may affect the target variable and make it 
>> impossible to recover.
>> - SGD and Naive Bayes need binarized categorical features, while 
>> Random Forests use categorical features encoded as integer levels. In 
>> Random Forests, any categorical feature can be used as the target 
>> variable. In SGD and Naive Bayes, the target variable is provided to 
>> the classifier outside the vector, and binarized features are not 
>> suitable as target variables.
>>
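>> To make the proposed trainer input concrete, here is a minimal, 
>> self-contained sketch of how such a file could be written. This is 
>> not part of any patch; the path, cardinality and reserved target 
>> slot are made-up values, just for illustration:
>>
>>   // Sketch of the proposed trainer input format:
>>   // SequenceFile<IntWritable, VectorWritable>, keyed by instance id,
>>   // with the target variable stored at an agreed position in the vector.
>>   import org.apache.hadoop.conf.Configuration;
>>   import org.apache.hadoop.fs.FileSystem;
>>   import org.apache.hadoop.fs.Path;
>>   import org.apache.hadoop.io.IntWritable;
>>   import org.apache.hadoop.io.SequenceFile;
>>   import org.apache.mahout.math.RandomAccessSparseVector;
>>   import org.apache.mahout.math.Vector;
>>   import org.apache.mahout.math.VectorWritable;
>>
>>   public class WriteTrainingData {
>>     public static void main(String[] args) throws Exception {
>>       Configuration conf = new Configuration();
>>       FileSystem fs = FileSystem.get(conf);
>>       Path path = new Path("train.seq");   // made-up output path
>>       int cardinality = 1000;              // made-up vector size
>>       int targetIndex = 0;                 // assumed reserved slot for the class label
>>       SequenceFile.Writer writer = SequenceFile.createWriter(
>>           fs, conf, path, IntWritable.class, VectorWritable.class);
>>       try {
>>         Vector v = new RandomAccessSparseVector(cardinality);
>>         v.set(targetIndex, 2.0);           // target variable as an integer level
>>         v.set(17, 0.5);                    // encoded features elsewhere in the vector
>>         writer.append(new IntWritable(42), new VectorWritable(v));
>>       } finally {
>>         writer.close();
>>       }
>>     }
>>   }
>>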
>> A possible solution for these two interrelated problems could be to 
>> treat binarized categorical features as numerical, while categorical 
>> variables would always be encoded as integer levels and, in SGD and 
>> Naive Bayes, would only be used as target variables (or ignored). The 
>> feature hashing framework would have to be modified so that 
>> categorical variables have reserved positions in the vector and no 
>> collisions involving them are possible (see the sketch below). I 
>> think this is quite similar to the case of "a few special fields 
>> (categories and such) and then a bunch of encoded data" that you 
>> mentioned in a previous mail.
>>
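>> To clarify what I mean by "reserved positions", here is a rough 
>> sketch of an encoder that keeps a fixed prefix of the vector for 
>> categorical variables (as integer levels) and hashes every other 
>> feature into the remaining positions. The class and method names are 
>> made up, and this is not how the current encoders work; it only 
>> illustrates the idea:
>>
>>   import org.apache.mahout.math.RandomAccessSparseVector;
>>   import org.apache.mahout.math.Vector;
>>
>>   // The first reservedSlots positions hold categorical variables as
>>   // integer levels, so hash collisions cannot touch them; all other
>>   // features are hashed into the remaining positions.
>>   public class ReservedSlotEncoder {
>>     private final int reservedSlots;   // one slot per categorical variable
>>     private final int cardinality;     // total vector size
>>
>>     public ReservedSlotEncoder(int reservedSlots, int cardinality) {
>>       this.reservedSlots = reservedSlots;
>>       this.cardinality = cardinality;
>>     }
>>
>>     public Vector newVector() {
>>       return new RandomAccessSparseVector(cardinality);
>>     }
>>
>>     // Store a categorical variable as an integer level in its own slot.
>>     public void setCategorical(Vector v, int slot, int level) {
>>       v.set(slot, level);
>>     }
>>
>>     // Hash any other feature into the non-reserved part of the vector.
>>     public void addHashedFeature(Vector v, String name, double value) {
>>       int hashed = reservedSlots
>>           + Math.abs(name.hashCode() % (cardinality - reservedSlots));
>>       v.set(hashed, v.get(hashed) + value);
>>     }
>>   }
>>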
>> How does it sound?
>>
>>> c) what are the real changes to the API needed?
>>>
>>>
>>>
>>>
>>> On Thu, May 16, 2013 at 10:51 AM, Angel Martinez Gonzalez (JIRA) <
>>> jira@apache.org> wrote:
>>>
>>>>      [
>>>> https://issues.apache.org/jira/browse/MAHOUT-1179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13659764#comment-13659764] 
>>>>
>>>>
>>>> Angel Martinez Gonzalez commented on MAHOUT-1179:
>>>> -------------------------------------------------
>>>>
>>>> Hi again,
>>>> With the goal of modifying all classifiers to use the formats proposed 
>>>> above, I've started to work with Naive Bayes. In particular, I've moved 
>>>> the code related to evaluation (summary statistics, confusion matrix) 
>>>> that was executed at the end of TestNaiveBayesDriver to a separate 
>>>> ClassifierEvaluationJob. The benefit of this is that 
>>>> ClassifierEvaluationJob should be able in the future to take input 
>>>> from any classifier tester.
>>>> The current state of the work may be reviewed here:
>>>> https://github.com/amartgon/mahout/commit/519ae529e9932d1e1d0803d0731a7396daaa603b 
>>>>
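>>>> Just to illustrate the kind of computation that was moved into 
>>>> ClassifierEvaluationJob, here is a standalone sketch of the 
>>>> confusion-matrix accumulation and accuracy (the label counts and 
>>>> test results are made up; the real code is in the commit above):
>>>>
>>>>   // counts[i][j] = number of instances with true label i classified as j
>>>>   public class ConfusionMatrixSketch {
>>>>     public static void main(String[] args) {
>>>>       int numLabels = 3;                          // made-up label count
>>>>       int[][] counts = new int[numLabels][numLabels];
>>>>       int[] trueLabels = {0, 1, 2, 1, 0};         // made-up test results
>>>>       int[] predicted  = {0, 1, 1, 1, 0};
>>>>       for (int i = 0; i < trueLabels.length; i++) {
>>>>         counts[trueLabels[i]][predicted[i]]++;
>>>>       }
>>>>       int correct = 0;
>>>>       for (int i = 0; i < numLabels; i++) {
>>>>         correct += counts[i][i];                  // diagonal = correct classifications
>>>>       }
>>>>       System.out.println("accuracy = " + (double) correct / trueLabels.length);
>>>>     }
>>>>   }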
>>>>
>>>> There are still modifications to be made on Naive Bayes, such as:
>>>> - Modifying the document id format from Text to IntWritable (a rough 
>>>>   conversion sketch follows this list).
>>>> - Moving the "label index" out of TrainNaiveBayesJob.
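>>>>
>>>> As a rough sketch of the first item (the paths and the sequential id 
>>>> assignment are just placeholders, not the final design), the 
>>>> conversion could look like this:
>>>>
>>>>   // Re-key an existing SequenceFile<Text, VectorWritable> to the
>>>>   // proposed SequenceFile<IntWritable, VectorWritable> by assigning
>>>>   // sequential instance ids. A real job would also keep the id mapping.
>>>>   import org.apache.hadoop.conf.Configuration;
>>>>   import org.apache.hadoop.fs.FileSystem;
>>>>   import org.apache.hadoop.fs.Path;
>>>>   import org.apache.hadoop.io.IntWritable;
>>>>   import org.apache.hadoop.io.SequenceFile;
>>>>   import org.apache.hadoop.io.Text;
>>>>   import org.apache.mahout.math.VectorWritable;
>>>>
>>>>   public class RekeyToIntWritable {
>>>>     public static void main(String[] args) throws Exception {
>>>>       Configuration conf = new Configuration();
>>>>       FileSystem fs = FileSystem.get(conf);
>>>>       SequenceFile.Reader reader = new SequenceFile.Reader(
>>>>           fs, new Path("vectors.seq"), conf);              // placeholder input
>>>>       SequenceFile.Writer writer = SequenceFile.createWriter(
>>>>           fs, conf, new Path("vectors-int.seq"),           // placeholder output
>>>>           IntWritable.class, VectorWritable.class);
>>>>       Text oldKey = new Text();
>>>>       VectorWritable value = new VectorWritable();
>>>>       int nextId = 0;
>>>>       try {
>>>>         while (reader.next(oldKey, value)) {
>>>>           writer.append(new IntWritable(nextId++), value);
>>>>         }
>>>>       } finally {
>>>>         reader.close();
>>>>         writer.close();
>>>>       }
>>>>     }
>>>>   }
>>>>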
>>>> Should I create a JIRA issue and submit this part? Or go on with the 
>>>> work at least till everything related to Naive Bayes is complete? I'd 
>>>> like to have some feedback before going on, to have an idea of whether 
>>>> there is agreement/interest in this before investing a lot of time 
>>>> into possibly useless work.
>>>>
>>>>
>>>>> GSOC 2013: Refactor and improve the classification APIs
>>>>> -------------------------------------------------------
>>>>>
>>>>>                  Key: MAHOUT-1179
>>>>>                  URL: https://issues.apache.org/jira/browse/MAHOUT-1179
>>>>>              Project: Mahout
>>>>>           Issue Type: New Feature
>>>>>             Reporter: Dan Filimon
>>>>>               Labels: gsoc2013, mentor
>>>>>
>>>>> [via Andy Twigg]
>>>>> Improve and unify the Mahout classification API. Also related to the
>>>> refactoring of the clustering APIs MAHOUT-1177.
>>>>> The two APIs should be roughly the same, at least in
>>>>> terms of input/output so that pipelining etc is easier. (cf
>>>>> scikit-learn clustering/classifier/regression API)
>>>>> Currently Mahout supports:
>>>>> - logistic regression
>>>>> - Naive Bayes
>>>>> - Random Forests
>>>> -- 
>>>> This message is automatically generated by JIRA.
>>>> If you think it was sent incorrectly, please contact your JIRA
>>>> administrators.
>>>> For more information on JIRA, see:
>>>> http://www.atlassian.com/software/jira
>>>>
>>
>


Re: [jira] [Commented] (MAHOUT-1179) GSOC 2013: Refactor and improve the classification APIs

Posted by Ted Dunning <te...@gmail.com>.
Yes. Both the classification and clustering APIs are in need of
homogenization.

