Posted to dev@mahout.apache.org by "Stuart Smith (Created) (JIRA)" <ji...@apache.org> on 2012/01/20 00:52:39 UTC

[jira] [Created] (MAHOUT-953) ArffVectorIterable does not gracefully handle duplicate attribute name

ArffVectorIterable does not gracefully handle duplicate attribute name
----------------------------------------------------------------------

                 Key: MAHOUT-953
                 URL: https://issues.apache.org/jira/browse/MAHOUT-953
             Project: Mahout
          Issue Type: Improvement
          Components: Integration
    Affects Versions: 0.6
            Reporter: Stuart Smith
            Priority: Trivial


If you have duplicate attribute names in your ARFF file, and you have non-sparse ARFF vectors, ARFFVectorIterable.computeNext will throw an ArrayIndexOutOfBoundsException, as it allocates a DenseVector with the size of your attribute labels (duplicates removed), but your ARFF vectors could have more values (if they reference the attribute at both indexes). This is a somewhat pathological ARFF file.
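To make the failure concrete, here is a minimal sketch that does not use Mahout at all. A plain double[] stands in for the DenseVector, and the class and method names (DuplicateAttributeOverflow, bindLabels, parseDenseRow) are made up for illustration. It assumes only what is described above: storage is sized by the number of distinct labels, while a dense data row carries one value per declared attribute.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DuplicateAttributeOverflow {

    // Hypothetical stand-in for the label map: attribute name -> column index.
    static Map<String, Integer> bindLabels(String[] attributeNames) {
        Map<String, Integer> labelBindings = new LinkedHashMap<>();
        for (int i = 0; i < attributeNames.length; i++) {
            // A duplicate name overwrites the earlier entry, so the map shrinks
            // relative to the number of @attribute lines.
            labelBindings.put(attributeNames[i], i);
        }
        return labelBindings;
    }

    // Mimics a dense row: storage is sized by the distinct label count,
    // but one value is written per field in the data line.
    static double[] parseDenseRow(String line, int labelCount) {
        double[] dense = new double[labelCount];
        String[] fields = line.split(",");
        for (int i = 0; i < fields.length; i++) {
            // Throws ArrayIndexOutOfBoundsException once i >= labelCount.
            dense[i] = Double.parseDouble(fields[i].trim());
        }
        return dense;
    }

    public static void main(String[] args) {
        Map<String, Integer> labels =
            bindLabels(new String[] {"my_attribute", "my_attribute"});
        try {
            parseDenseRow("1.0, 2.0", labels.size());
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("Row has more values than distinct labels: " + e.getMessage());
        }
    }
}
```

With two @attribute lines sharing one name, the label map holds a single entry, so the second value in the row has nowhere to go.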

Not sure if I should note the error (throw an exception) in computeNext() when it's out of bounds, or when someone tries to add a duplicate label to the MapBackedArffModel.

My first impulse would be to check in computeNext(), but addLabel() in MapBackedArffModel will do something rather pathological in the case of duplicate attributes: it overwrites the entry in the label map (labelBindings) with the new index, but the idxLabel map will hold a mapping from both indexes to the attribute name, so the two maps are out of sync. So it may be best to disallow duplicate attribute names altogether by throwing an IllegalArgumentException.

For example
@attribute my_attribute NUMERIC
@attribute my_attribute NUMERIC

addLabel()
addLabel()

labelBindings -> ('my_attribute', 1)
idxLabel -> (0, 'my_attribute'), (1, 'my_attribute')
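The bookkeeping above can be reproduced with two plain maps. This is only a sketch: OutOfSyncLabelMaps is a hypothetical simplified model, the field names are taken from the example above, and the addLabel(String, Integer) signature is an assumption rather than a copy of the Mahout class.

```java
import java.util.HashMap;
import java.util.Map;

class OutOfSyncLabelMaps {
    // Hypothetical simplified model; field names follow the example above.
    final Map<String, Integer> labelBindings = new HashMap<>();
    final Map<Integer, String> idxLabel = new HashMap<>();

    // Unchecked insert into both maps, as described in the report.
    void addLabel(String label, Integer idx) {
        labelBindings.put(label, idx); // same key again: the old index is replaced
        idxLabel.put(idx, label);      // new key: a second index -> name entry is added
    }

    public static void main(String[] args) {
        OutOfSyncLabelMaps model = new OutOfSyncLabelMaps();
        model.addLabel("my_attribute", 0);
        model.addLabel("my_attribute", 1);
        System.out.println("labelBindings = " + model.labelBindings);
        System.out.println("idxLabel      = " + model.idxLabel);
    }
}
```

After the two calls, labelBindings has one entry and idxLabel has two, which is the mismatch that later surfaces as the out-of-bounds write.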

I'll happily submit a patch; I'm just wondering whether it should go in computeNext() or addLabel().


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-953) ArffVectorIterable does not gracefully handle duplicate attribute name

Posted by "Stuart Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13404611#comment-13404611 ] 

Stuart Smith commented on MAHOUT-953:
-------------------------------------

Yup, let me take a look at this when I look into another bug I have (since I should be re-syncing & re-duplicating for both anyways). Thanks for checking these out!
                

        

[jira] [Commented] (MAHOUT-953) ArffVectorIterable does not gracefully handle duplicate attribute name

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13399277#comment-13399277 ] 

Sean Owen commented on MAHOUT-953:
----------------------------------

I agree that IllegalArgumentException is just fine at parsing time. Anything slightly more descriptive than the current exception would do. Care to make a patch?
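For reference, a parse-time check along these lines might look like the sketch below. It is not the actual patch: StrictLabelModel, getLabelSize() and the message wording are hypothetical, and the model is reduced to the two maps discussed in the issue.

```java
import java.util.HashMap;
import java.util.Map;

class StrictLabelModel {
    // Hypothetical simplified model holding only the two maps from the issue.
    private final Map<String, Integer> labelBindings = new HashMap<>();
    private final Map<Integer, String> idxLabel = new HashMap<>();

    // Rejects a repeated attribute name before either map is modified,
    // so the two maps can never disagree.
    void addLabel(String label, int idx) {
        Integer existing = labelBindings.get(label);
        if (existing != null) {
            throw new IllegalArgumentException(
                "Duplicate attribute name '" + label + "' at index " + idx
                    + " (already declared at index " + existing + ")");
        }
        labelBindings.put(label, idx);
        idxLabel.put(idx, label);
    }

    int getLabelSize() {
        return labelBindings.size();
    }

    public static void main(String[] args) {
        StrictLabelModel model = new StrictLabelModel();
        model.addLabel("my_attribute", 0);
        try {
            model.addLabel("my_attribute", 1);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Failing in addLabel() rather than in computeNext() reports the problem while the header is being read, with the offending name in the message, instead of on the first dense row.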
                