Posted to dev@mahout.apache.org by "Stuart Smith (Created) (JIRA)" <ji...@apache.org> on 2012/01/20 00:52:39 UTC
[jira] [Created] (MAHOUT-953) ArffVectorIterable does not gracefully handle duplicate attribute name
ArffVectorIterable does not gracefully handle duplicate attribute name
----------------------------------------------------------------------
Key: MAHOUT-953
URL: https://issues.apache.org/jira/browse/MAHOUT-953
Project: Mahout
Issue Type: Improvement
Components: Integration
Affects Versions: 0.6
Reporter: Stuart Smith
Priority: Trivial
If you have duplicate attribute names in your ARFF file, and you have non-sparse ARFF vectors, ARFFVectorIterable.computeNext will throw an ArrayIndexOutOfBoundsException, as it allocates a DenseVector sized to the number of attribute labels (duplicates removed), but your ARFF vectors could have more values (if they reference the attribute at both indexes). This is a somewhat pathological ARFF file.
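To make the failure concrete without pulling in Mahout, here is a minimal sketch in plain Java. It assumes a double[] standing in for the DenseVector and a HashMap standing in for the label bindings; the names DuplicateAttributeDemo, bindLabels and toDense are hypothetical and are not Mahout APIs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the ARFF parsing path; not Mahout code.
class DuplicateAttributeDemo {

  /** Mimics an unchecked addLabel(): a later binding silently replaces an earlier one. */
  static Map<String, Integer> bindLabels(String[] attributeNames) {
    Map<String, Integer> labelBindings = new HashMap<>();
    for (int i = 0; i < attributeNames.length; i++) {
      labelBindings.put(attributeNames[i], i);
    }
    return labelBindings;
  }

  /** Copies one dense data row into an array sized by the number of unique labels. */
  static double[] toDense(Map<String, Integer> labelBindings, String[] rowValues) {
    double[] dense = new double[labelBindings.size()];
    for (int i = 0; i < rowValues.length; i++) {
      // Throws ArrayIndexOutOfBoundsException once i reaches dense.length.
      dense[i] = Double.parseDouble(rowValues[i]);
    }
    return dense;
  }

  public static void main(String[] args) {
    // Two declarations with the same name collapse into a single binding.
    Map<String, Integer> bindings =
        bindLabels(new String[] {"my_attribute", "my_attribute"});
    try {
      // The data row still carries one value per declaration.
      toDense(bindings, "1.0,2.0".split(","));
    } catch (ArrayIndexOutOfBoundsException e) {
      System.out.println("Row has more values than unique labels: " + e.getMessage());
    }
  }
}
```

The array ends up with one slot because the map only counts unique names, while the row still supplies two values.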
Not sure if I should flag the error (throw an exception) in computeNext() when the index is out of bounds, or when someone tries to add a duplicate label to the MapBackedArffModel.
My first impulse would be to check in computeNext(), but addLabel() in MapBackedArffModel does something rather pathological in the case of duplicate attributes: it overwrites the entry in the label map (labelBindings) with the new index, but the idxLabel map holds a mapping from both indexes to the attribute name, so the two are out of sync. So it may be best to disallow duplicate attribute names altogether by throwing an IllegalArgumentException.
For example, these two declarations:
@attribute my_attribute NUMERIC
@attribute my_attribute NUMERIC
result in two calls:
addLabel()
addLabel()
which leave the maps as:
labelBindings -> ('my_attribute', 1)
idxLabel -> (0, 'my_attribute'), (1, 'my_attribute')
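One way the addLabel() option could look is sketched below. This is not the actual MapBackedArffModel; CheckedLabelModel, its addLabel(String, int) signature and getLabelCount() are assumptions made for illustration. Only the two map names come from the example above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the model class; not Mahout's implementation.
class CheckedLabelModel {
  private final Map<String, Integer> labelBindings = new HashMap<>();
  private final Map<Integer, String> idxLabel = new HashMap<>();

  /** Registers a label at the given index, rejecting names that are already bound. */
  void addLabel(String label, int idx) {
    Integer existing = labelBindings.get(label);
    if (existing != null) {
      // Fail at parse time, before either map is modified, so they stay in sync.
      throw new IllegalArgumentException(
          "Duplicate attribute name '" + label + "' at index " + idx
              + " (already bound to index " + existing + ")");
    }
    labelBindings.put(label, idx);
    idxLabel.put(idx, label);
  }

  /** Number of distinct labels registered so far. */
  int getLabelCount() {
    return labelBindings.size();
  }

  public static void main(String[] args) {
    CheckedLabelModel model = new CheckedLabelModel();
    model.addLabel("my_attribute", 0);
    try {
      model.addLabel("my_attribute", 1);
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Checking before either put() means a rejected duplicate leaves both maps untouched, and the message names the offending attribute and both indexes.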
I'll happily submit a patch; I'm just wondering whether the check should go in computeNext() or addLabel().
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAHOUT-953) ArffVectorIterable does not gracefully handle duplicate attribute name
Posted by "Stuart Smith (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13404611#comment-13404611 ]
Stuart Smith commented on MAHOUT-953:
-------------------------------------
Yup, let me take a look at this when I look into another bug I have (since I should be re-syncing & re-duplicating for both anyways). Thanks for checking these out!
[jira] [Commented] (MAHOUT-953) ArffVectorIterable does not gracefully handle duplicate attribute name
Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13399277#comment-13399277 ]
Sean Owen commented on MAHOUT-953:
----------------------------------
I agree that an IllegalArgumentException is just fine at parsing time, or anything slightly more descriptive. Care to make a patch?