Posted to dev@mahout.apache.org by "Gokhan Capan (JIRA)" <ji...@apache.org> on 2013/04/11 16:49:15 UTC
[jira] [Updated] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gokhan Capan updated MAHOUT-1178:
---------------------------------
Attachment: MAHOUT-1178.patch
MAHOUT-1178-TEST.patch
Hi,
I am adding a Matrix implementation here that loads all of the data for one field of a Lucene index into an underlying SparseRowMatrix.
It delegates the index-reading logic to the existing LuceneIterator.
When I changed the LuceneIterator code slightly to make it support StringFields, it broke LuceneIteratorTest, so I am going to add a new version of LuceneIterator that supports StringFields later.
Also, there is an ongoing effort on another version of LuceneMatrix that lazy-loads from the index while iterating over the matrix. I am going to open a separate issue for that.
I put the code in the integration module; the tests and the actual code are in separate diff files.
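For readers without the patch at hand, here is a rough, dependency-free sketch of the general idea of eagerly loading one field into sparse rows. It is not the code in MAHOUT-1178.patch and it uses neither Lucene nor Mahout classes: the class FieldMatrixSketch, its methods, and the raw term-count weighting are all assumptions made for illustration. Each document becomes one sparse row, a dictionary assigns a column to each term, and a row-to-document list keeps the back-reference asked for in the issue description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a field-backed matrix; not a Mahout or Lucene class.
final class FieldMatrixSketch {
  // term -> column index, assigned in first-seen order
  private final Map<String, Integer> dictionary = new HashMap<>();
  // one sparse row per document: column index -> weight
  private final List<Map<Integer, Double>> rows = new ArrayList<>();
  // row index -> document id, so a row can be traced back to its document
  private final List<String> rowToDocId = new ArrayList<>();

  // Adds one document's terms as a new row; weights are raw term counts (an assumption).
  void addDocument(String docId, List<String> terms) {
    Map<Integer, Double> row = new TreeMap<>();
    for (String term : terms) {
      Integer column = dictionary.get(term);
      if (column == null) {
        column = dictionary.size();
        dictionary.put(term, column);
      }
      row.merge(column, 1.0, Double::sum);
    }
    rows.add(row);
    rowToDocId.add(docId);
  }

  // Returns the stored weight, or 0.0 for cells that were never set.
  double get(int row, int column) {
    return rows.get(row).getOrDefault(column, 0.0);
  }

  // Returns the column assigned to a term, or -1 if the term was never seen.
  int columnOf(String term) {
    Integer column = dictionary.get(term);
    return column == null ? -1 : column;
  }

  String docIdForRow(int row) {
    return rowToDocId.get(row);
  }

  int numRows() {
    return rows.size();
  }

  int numCols() {
    return dictionary.size();
  }

  public static void main(String[] args) {
    FieldMatrixSketch matrix = new FieldMatrixSketch();
    matrix.addDocument("doc-a", Arrays.asList("lucene", "index", "lucene"));
    matrix.addDocument("doc-b", Arrays.asList("matrix", "index"));
    System.out.println("rows=" + matrix.numRows() + " cols=" + matrix.numCols());
    System.out.println("row 1 came from " + matrix.docIdForRow(1));
  }
}
```

In a real implementation the terms would come from the index reader rather than from in-memory lists, and the rows would presumably be Mahout vectors; the sketch only shows the bookkeeping.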
> GSOC 2013: Improve Lucene support in Mahout
> -------------------------------------------
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
> Issue Type: New Feature
> Reporter: Dan Filimon
> Labels: gsoc2013, mentor
> Attachments: MAHOUT-1178.patch, MAHOUT-1178-TEST.patch
>
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix. This would
> require that we standardize on a way to convert documents to rows. There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document, one per line, and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields, that ought to work sensibly as well.
> Two options include dumping multiple matrices or converting the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document. This might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira