Posted to dev@mahout.apache.org by "Hudson (JIRA)" <ji...@apache.org> on 2014/09/04 20:08:52 UTC

[jira] [Commented] (MAHOUT-1541) Create CLI Driver for Spark Cooccurrence Analysis

    [ https://issues.apache.org/jira/browse/MAHOUT-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121681#comment-14121681 ] 

Hudson commented on MAHOUT-1541:
--------------------------------

SUCCESS: Integrated in Mahout-Quality #2779 (See [https://builds.apache.org/job/Mahout-Quality/2779/])
MAHOUT-1604, MAHOUT-1541: changes all references to position in the CLI to columns (pat: rev e24c4afb699c2930d372c701fe2de874a2a2f6c0)
* spark/src/main/scala/org/apache/mahout/drivers/Schema.scala
* spark/src/test/scala/org/apache/mahout/drivers/ItemSimilarityDriverSuite.scala
* spark/src/main/scala/org/apache/mahout/drivers/ItemSimilarityDriver.scala
* spark/src/main/scala/org/apache/mahout/drivers/MahoutOptionParser.scala
* spark/src/main/scala/org/apache/mahout/drivers/TextDelimitedReaderWriter.scala


> Create CLI Driver for Spark Cooccurrence Analysis
> -------------------------------------------------
>
>                 Key: MAHOUT-1541
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1541
>             Project: Mahout
>          Issue Type: New Feature
>          Components: CLI
>            Reporter: Pat Ferrel
>            Assignee: Pat Ferrel
>
> Create a CLI driver that imports data in a flexible manner, creates an IndexedDataset with BiMap ID translation dictionaries, calls the Spark CooccurrenceAnalysis with the appropriate parameters, and then writes the output with external IDs optionally reattached.
> Ultimately it should be able to read the same input as the legacy MapReduce (mr) code, but it will also support externally defined IDs and flexible formats. Output will be either in the legacy format or in text files of the user's specification, with item IDs reattached.
> Support for legacy formats is an open question, since users can always use the legacy code if they want them. Internal to the IndexedDataset is a Spark DRM, so pipelining can be accomplished without writing to an actual file, and the legacy sequence file output may not be needed.
> Opinions?
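
For illustration only, the flow described above (parse delimited text, translate external IDs to internal indices, compute cooccurrences, write output with external IDs reattached) can be sketched in plain Scala. This is a minimal in-memory sketch under stated assumptions, not the driver's implementation. Every name in it (CooccurrenceSketch, IdDictionary, run, userCol, itemCol) is hypothetical. A pair of immutable Maps stands in for the BiMap, and a raw count of users shared by two items stands in for the Spark CooccurrenceAnalysis, which this sketch does not call.

```scala
// Hypothetical, in-memory sketch; none of these names are Mahout APIs.
object CooccurrenceSketch {

  // Two-way dictionary between external string IDs and dense Int indices,
  // standing in for the BiMap mentioned in the issue description.
  final case class IdDictionary(toIndex: Map[String, Int]) {
    val toExternal: Map[Int, String] = toIndex.map(_.swap)
  }

  def buildDictionary(ids: Seq[String]): IdDictionary =
    IdDictionary(ids.distinct.zipWithIndex.toMap)

  // Parse delimited lines; the user and item columns are caller-specified.
  // The delimiter is quoted so it is treated literally, not as a regex.
  def parse(lines: Seq[String], delim: String, userCol: Int, itemCol: Int): Seq[(String, String)] =
    lines.map { line =>
      val fields = line.split(java.util.regex.Pattern.quote(delim))
      (fields(userCol), fields(itemCol))
    }

  // For each ordered pair of distinct items, count the users who touched both.
  // (Assumption: a plain count, not the scoring the real analysis applies.)
  def cooccurrences(pairs: Seq[(Int, Int)]): Map[(Int, Int), Int] = {
    val itemsByUser: Map[Int, Set[Int]] =
      pairs.groupBy(_._1).map { case (user, ps) => user -> ps.map(_._2).toSet }
    itemsByUser.values.toSeq
      .flatMap(items => for (a <- items.toSeq; b <- items.toSeq if a != b) yield (a, b))
      .groupBy(identity)
      .map { case (pair, occurrences) => pair -> occurrences.size }
  }

  // End to end: external IDs in, internal indices for the computation,
  // external IDs reattached on the way out.
  def run(lines: Seq[String], delim: String = ",", userCol: Int = 0, itemCol: Int = 1): Seq[String] = {
    val raw = parse(lines, delim, userCol, itemCol)
    val users = buildDictionary(raw.map(_._1))
    val items = buildDictionary(raw.map(_._2))
    val indexed = raw.map { case (u, i) => (users.toIndex(u), items.toIndex(i)) }
    cooccurrences(indexed).toSeq
      .map { case ((a, b), n) => (items.toExternal(a), items.toExternal(b), n) }
      .sorted
      .map { case (a, b, n) => s"$a$delim$b$delim$n" }
  }
}
```

Calling CooccurrenceSketch.run(Seq("u1,iphone", "u1,ipad", "u2,iphone", "u2,ipad", "u2,nexus")) returns one text line per ordered item pair, for example "ipad,iphone,2". The computation itself only ever sees integer indices, which is the point of keeping the ID dictionaries alongside the data.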



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)