Posted to dev@mahout.apache.org by "Andrew Harbick (Created) (JIRA)" <ji...@apache.org> on 2012/03/30 19:58:27 UTC
[jira] [Created] (MAHOUT-997) Make splitData smart enough to not
consider a CSV header to be part of the data
Make splitData smart enough to not consider a CSV header to be part of the data
-------------------------------------------------------------------------------
Key: MAHOUT-997
URL: https://issues.apache.org/jira/browse/MAHOUT-997
Project: Mahout
Issue Type: Improvement
Components: Integration
Affects Versions: 0.6
Environment: OS X
Reporter: Andrew Harbick
Priority: Minor
Fix For: 0.6
If you do something like:
MAHOUT_LOCAL=1 $MAHOUT_HOME/bin/mahout splitDataset --input all.csv --output split --trainingPercentage 0.9 --probePercentage 0.1
The header row from your CSV will land in your training data with 90% probability and in your evaluation data with 10% probability, but never in both. To use a tool like trainlogistic or runlogistic, the header row is needed in both files.
Perhaps add an argument to splitData to duplicate the header line?
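
For illustration only, the requested behaviour (copy the header line into both outputs, then split the remaining rows at random) could be sketched outside Mahout as below. This is not Mahout code or an existing splitDataset option; the function name split_with_header, its parameters, and the fixed seed are all assumptions made for the sketch.

```python
import csv
import random


def split_with_header(in_path, train_path, test_path,
                      training_fraction=0.9, seed=42):
    """Randomly split a CSV's data rows, copying the header into both outputs.

    Hypothetical helper, not part of Mahout.
    Returns (train_rows, test_rows), counting data rows only.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    counts = [0, 0]
    with open(in_path, newline="") as src, \
            open(train_path, "w", newline="") as train, \
            open(test_path, "w", newline="") as test:
        reader = csv.reader(src)
        writers = (csv.writer(train), csv.writer(test))
        header = next(reader, None)
        if header is None:
            return (0, 0)  # empty input: nothing to split
        for writer in writers:
            writer.writerow(header)  # the header goes to both files
        for row in reader:
            # Each data row goes to training with probability training_fraction.
            target = 0 if rng.random() < training_fraction else 1
            writers[target].writerow(row)
            counts[target] += 1
    return tuple(counts)
```

Calling split_with_header("all.csv", "train.csv", "test.csv") would leave both output files starting with the same header line, which is what trainlogistic and runlogistic expect according to the report above.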
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAHOUT-997) Make splitData smart enough to not
consider a CSV header to be part of the data
Posted by "Sean Owen (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated MAHOUT-997:
-----------------------------
Fix Version/s: (was: 0.6)
[jira] [Commented] (MAHOUT-997) Make splitData smart enough to not
consider a CSV header to be part of the data
Posted by "Lance Norskog (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242903#comment-13242903 ]
Lance Norskog commented on MAHOUT-997:
--------------------------------------
This is a general problem, not a splitData problem. I suggest you prestage your Mahout input files from your real files with a script that drops the first line. {{sed -n '2,$p'}} (or equivalently {{sed 1d}}) will do the trick. Most input data requires some kind of cleanup.
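
As a rough sketch of that prestaging step for anyone without sed at hand, the same header-dropping pass might look like this in Python. The function name strip_header and the idea of returning the dropped header are assumptions for illustration, not anything Mahout provides.

```python
def strip_header(in_path, out_path):
    """Copy in_path to out_path without its first line.

    Hypothetical helper, not part of Mahout. Returns the dropped header
    text (without its line ending) so the caller can reuse it later.
    """
    with open(in_path, newline="") as src, \
            open(out_path, "w", newline="") as dst:
        header = src.readline()  # consume the first line only
        for line in src:
            dst.write(line)  # remaining lines are copied unchanged
    return header.rstrip("\r\n")
```

The returned header could then be written back on top of each split output by hand if a downstream tool needs it.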
[jira] [Resolved] (MAHOUT-997) Make splitData smart enough to not
consider a CSV header to be part of the data
Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Schelter resolved MAHOUT-997.
---------------------------------------
Resolution: Not A Problem