Posted to issues@mahout.apache.org by "Stefan Goldener (Jira)" <ji...@apache.org> on 2020/03/25 08:03:00 UTC

[jira] [Updated] (MAHOUT-2101) Mahout local file distribution

     [ https://issues.apache.org/jira/browse/MAHOUT-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stefan Goldener updated MAHOUT-2101:
------------------------------------
    Description: 
At the moment Mahout depends heavily on HDFS. Although MAHOUT_LOCAL uses the local file system, it is not possible to combine MAHOUT_LOCAL=true with a Spark-only cluster.

My suggestion is to improve the Mahout code so that it supports local files and distributes them via Spark. There are multiple options for that, e.g. Spark SQL, DataFrames, Datasets or RDDs.

This would also allow Mahout to use the new Spark Kubernetes features and hence be highly scalable.

Probably the best improvement would be for Mahout to use the Spark context and simply read the files via sc.textFile("file:///path to the file/")

  was:
At the moment Mahout depends heavily on HDFS. Although MAHOUT_LOCAL uses the local file system, it is not possible to combine MAHOUT_LOCAL=true with a Spark-only cluster.

My suggestion is to improve the Mahout code so that it supports local files and distributes them via Spark. There are multiple options for that, e.g. Spark SQL, DataFrames, Datasets or RDDs.

This would also allow Mahout to use the new Spark Kubernetes features and hence be highly scalable.


> Mahout local file distribution
> ------------------------------
>
>                 Key: MAHOUT-2101
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-2101
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Stefan Goldener
>            Priority: Major
>
> At the moment Mahout depends heavily on HDFS. Although MAHOUT_LOCAL uses the local file system, it is not possible to combine MAHOUT_LOCAL=true with a Spark-only cluster.
> My suggestion is to improve the Mahout code so that it supports local files and distributes them via Spark. There are multiple options for that, e.g. Spark SQL, DataFrames, Datasets or RDDs.
> This would also allow Mahout to use the new Spark Kubernetes features and hence be highly scalable.
> Probably the best improvement would be for Mahout to use the Spark context and simply read the files via sc.textFile("file:///path to the file/")
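
The sc.textFile("file:///...") idea from the description can be sketched roughly as follows. This is only an illustration, not Mahout code: it uses PySpark for brevity (the Scala call has the same shape), and the names local_file_uri, demo and the app name are hypothetical. It also assumes pyspark is installed. On a real cluster, a file:// path has to be readable at the same location on every worker node, so the file must be copied to each node or placed on a shared mount.

```python
import posixpath


def local_file_uri(path: str) -> str:
    """Turn an absolute POSIX path into a file:// URI string.

    Hypothetical helper: it only normalizes the path and adds the scheme.
    Spaces are left as-is rather than percent-encoded.
    """
    if not posixpath.isabs(path):
        raise ValueError("expected an absolute path, got: " + path)
    return "file://" + posixpath.normpath(path)


def demo(path: str) -> int:
    """Read a local text file through Spark and return its line count.

    Assumes pyspark is installed; runs against a local master so no
    HDFS or cluster is needed to try it out.
    """
    from pyspark import SparkContext  # imported here so the helper above works without Spark

    sc = SparkContext(master="local[*]", appName="mahout-2101-sketch")
    try:
        lines = sc.textFile(local_file_uri(path))  # RDD with one element per line
        return lines.count()
    finally:
        sc.stop()
```

Calling demo("/data/input.txt") (a placeholder path) would count the lines of that file without touching HDFS. Pointing the master at a Spark standalone or Kubernetes cluster instead of local[*] is where the same-path-on-every-node requirement starts to matter.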
