Posted to dev@mahout.apache.org by "Drew Farris (JIRA)" <ji...@apache.org> on 2010/05/28 06:21:43 UTC
[jira] Created: (MAHOUT-402) NamedVectors are not readily identifiable in vectordumper output
NamedVectors are not readily identifiable in vectordumper output
----------------------------------------------------------------
Key: MAHOUT-402
URL: https://issues.apache.org/jira/browse/MAHOUT-402
Project: Mahout
Issue Type: Bug
Affects Versions: 0.4
Reporter: Drew Farris
Priority: Minor
When dumping a sequence file of Writable,NamedVector pairs using vectordumper, in either JSON or standard format, it is not apparent in the output that the vectors are indeed NamedVectors.
For example, after applying MAHOUT-401 to produce NamedVectors from seq2sparse, I run:
{code}
./bin/mahout vectordump -j -p -s ~/mahout/reuters-out-seqdir-sparse/tf-vectors/part-00000
{code}
And get:
{code}
Input Path: /home/drew/mahout/reuters-out-seqdir-sparse/tf-vectors/part-00000
/reut2-000.sgm-0.txt {"class":"org.apache.mahout.math.RandomAccessSparseVector","vector" [...]
{code}
or when removing the -j argument:
{code}
/reut2-000.sgm-0.txt elts: {1026:3.0, 16150:1.0, 3338:3.0, 16147:1.0, 3339:1.0, 12240:1.0, [...]
{code}
The first case, when dumping JSON, occurs because NamedVector simply calls its delegate's asFormatString method. Granted, the naive approach of implementing asFormatString in NamedVector also produces some nasty output:
{code}
/reut2-001.sgm-468.txt {"class":"org.apache.mahout.math.NamedVector","vector":"{\"delegate\":{\"class\":\"org.apache.mahout.math.RandomAccessSparseVector\" [...]
{code}
So a little more thought needs to be given to that approach.
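To illustrate the nasty output above: the escaped quotes look like what happens when a delegate's JSON text is serialized a second time as a plain string value. That cause is an assumption, not something confirmed in this issue. The sketch below uses a hypothetical helper class (JsonNestingSketch, with no Mahout or JSON library involved) to contrast re-quoting the delegate's JSON with embedding it as a nested object.

```java
// Hypothetical helper for illustration only; it does not use Mahout or any JSON library.
final class JsonNestingSketch {

    // Escapes backslashes and double quotes so the text can sit inside a JSON string literal.
    static String quote(String text) {
        return "\"" + text.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Naive: the delegate's JSON is treated as an ordinary string, so its quotes get escaped.
    static String naive(String name, String delegateJson) {
        return "{\"name\":" + quote(name) + ",\"vector\":" + quote(delegateJson) + "}";
    }

    // Alternative: the delegate's JSON is embedded as-is, as a nested object.
    static String nested(String name, String delegateJson) {
        return "{\"name\":" + quote(name) + ",\"vector\":" + delegateJson + "}";
    }
}
```

The "name" field used here is an assumed key; whichever form is chosen, any reader of the dumped output would need to agree on it.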
For the non-JSON format, VectorHelper.vectorToString(..) is the culprit. Would it be OK to do an instanceof NamedVector check here and emit the name?
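A minimal sketch of the instanceof idea, assuming simplified stand-in types rather than Mahout's real classes: SimpleVector, SparseVectorStub, NamedVectorStub and VectorToStringSketch are all hypothetical names, and the "name: " prefix is only one possible way to emit the name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration only; these are NOT Mahout's real classes.
interface SimpleVector {
    Map<Integer, Double> nonZeroElements();
}

final class SparseVectorStub implements SimpleVector {
    private final Map<Integer, Double> elements = new LinkedHashMap<>();

    SparseVectorStub set(int index, double value) {
        elements.put(index, value);
        return this;
    }

    @Override
    public Map<Integer, Double> nonZeroElements() {
        return elements;
    }
}

// Wraps another vector and carries a name, delegating element access.
final class NamedVectorStub implements SimpleVector {
    private final String name;
    private final SimpleVector delegate;

    NamedVectorStub(String name, SimpleVector delegate) {
        this.name = name;
        this.delegate = delegate;
    }

    String getName() {
        return name;
    }

    @Override
    public Map<Integer, Double> nonZeroElements() {
        return delegate.nonZeroElements();
    }
}

final class VectorToStringSketch {
    // Emits the name first when the vector is a named wrapper, then the elements.
    static String vectorToString(SimpleVector vector) {
        StringBuilder sb = new StringBuilder();
        if (vector instanceof NamedVectorStub) {
            sb.append("name: ").append(((NamedVectorStub) vector).getName()).append(' ');
        }
        sb.append("elts: {");
        boolean first = true;
        for (Map.Entry<Integer, Double> e : vector.nonZeroElements().entrySet()) {
            if (!first) {
                sb.append(", ");
            }
            sb.append(e.getKey()).append(':').append(e.getValue());
            first = false;
        }
        return sb.append('}').toString();
    }
}
```

Unnamed vectors print exactly as before, so only output for named vectors would change.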
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-402) NamedVectors are not readily identifiable in vectordumper output
Posted by "Drew Farris (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Drew Farris updated MAHOUT-402:
-------------------------------
Component/s: Utils