Posted to dev@mahout.apache.org by Andrew Palumbo <ap...@outlook.com> on 2019/12/06 22:51:01 UTC

[MEETING NOTES] 10 AM Friday 6 Dec 2019. Google Hangouts

Mahout meeting notes  12.6.2019

==============================



A meeting was held today, Friday 6 Dec 2019, to discuss the current state of the project, planned releases and a general path forward.

Joe Olson, Andrew Palumbo and Trevor Grant met via Google Hangouts at 10:15 AM.


Early discussion was based around AP and TG’s loose, quickly put-together agenda and ideas. AP started the unofficial agenda doc <10 mins before the meeting started, so the agenda was quick and dirty.


An agreement was made early on by TG and AP to focus on the release, as the build is currently working and releases are deploying artifacts for Scala 2.11 and Scala 2.12, pegged to Java 1.8 and mvn 3.3.9.  A heavy refactoring effort was made to fix the build: after revamping some very old poms, reverting back to the parent Apache `pom.xml` `release` goal, and adding some new information to the release master’s `.m2/settings.xml`, we are able to release and deploy artifacts with Java 1.8, cross-compiled for Scala 2.11 and Scala 2.12.


https://repository.apache.org/#nexus-search;gav~org.apache.mahout~mahout-core_2.12~~~~kw,versionexpand

https://repository.apache.org/#nexus-search;gav~org.apache.mahout~mahout-hdfs_2.11~~~~kw,versionexpand


A release board was created:


https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=348


And some minor issues were added.


A decision was made to move the Docker files and certain planned AWS infrastructure-as-code (Terraform) slated for the 14.1 release off the central repository: the Docker files will move to Docker Hub under a newly created mahout namespace, and the infrastructure code will move to AP's GitHub. JO will be handling the task of creating the hub.docker.com<https://hub.docker.com/> "Mahout" organization and moving the Docker files to that space.  [MAHOUT-2074<https://issues.apache.org/jira/browse/MAHOUT-2074>]



AP will be creating a mahout-contrib repo on his personal page, to be merged in later, with some Terraform code, examples, etc., probably borrowing heavily from Pulsar and Spark: https://github.com/apache/pulsar/tree/master/deployment

AP has also begun putting some off-project time into this mahout-contrib package, or at least is keeping much of that work in the org.apache.mahout namespace.  Some work has already been done with NiFi and MiNiFi for an SDR project under the org.apache.mahout namespace, which will be available in the mahout-contrib package or another standalone package. [NOT DISCUSSED at the meeting; forgot to bring it up.]


AP will fix the `change-scala-version.sh` script [MAHOUT-2080] and will bump the Scala version in master over the weekend [MAHOUT-2082], at which point we will call a code freeze and attempt to release by next weekend (after cutting an RC).



TG was able to get Jenkins running again, building snapshots, fixing [MAHOUT-2073<https://issues.apache.org/jira/browse/MAHOUT-2073>].


JO will look into some other projects' build chains [MAHOUT-2076<https://issues.apache.org/jira/browse/MAHOUT-2076>] and consider a script to cut down on RC creation and release deployment time by having a single script with all release commands, similar to Apache Spark. (Pulsar was discussed as a reference, but the project is moving quickly and they’ve refactored their build since I (AP) last looked; in fact, deploying Pulsar is as simple as `mvn clean deploy`.)


https://github.com/apache/pulsar/blob/master/.test-infra/jenkins/job_pulsar_release_nightly_snapshot.groovy


Spark and Flink should have good examples, e.g.:

Spark:

https://github.com/apache/spark/blob/master/dev/make-distribution.sh

https://github.com/apache/spark/tree/master/dev/create-release

Flink:

https://github.com/apache/flink/tree/master/tools/releasing
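
As a rough sketch of the single-script idea (not anything agreed at the meeting), something like the following could wrap the per-Scala-version steps. The argument passed to `change-scala-version.sh`, the `DRY_RUN` and `SCALA_VERSIONS` variables, and the use of `mvn -B clean deploy` as a stand-in for the real release goals are all assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: prints the release commands unless DRY_RUN=false is set.
set -eu

# Hypothetical knobs; neither variable exists in the Mahout build today.
DRY_RUN="${DRY_RUN:-true}"
SCALA_VERSIONS="${SCALA_VERSIONS:-2.11 2.12}"

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "true" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# One pass per Scala version: switch the poms, then build and deploy.
release_all() {
  for scala in $SCALA_VERSIONS; do
    # Assumption: the script takes the target Scala version as its argument.
    run ./change-scala-version.sh "$scala"
    # Stand-in for the actual release goals, which these notes do not spell out.
    run mvn -B clean deploy
  done
}

release_all
```

With the defaults this only prints the commands it would run, so it is safe to try before wiring in the real release steps; setting `DRY_RUN=false` would execute them.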



TG will work on Zeppelin integration for some easy mahout-python-ggplot2 examples.


There was discussion of using the US Census API for data examples.


A long-running issue was resolved, [MAHOUT-2023<https://issues.apache.org/jira/browse/MAHOUT-2023>] Broken Scopt Classes:


*ISSUE*:  we have no way of testing this; we need @pat to take a look.  With the nightly snapshots being built, the current version in master is available in Nexus:


[JENKINS] Archiving /home/jenkins/jenkins-slave/workspace/mahout-nightly/community/spark-cli-drivers/target/mahout-spark-cli-drivers_2.11-14.1-SNAPSHOT.jar to org.apache.mahout/mahout-spark-cli-drivers_2.11/14.1-20191206.193308-1/mahout-spark-cli-drivers_2.11-14.1-20191206.193308-1.jar



We spoke quickly today, and these notes were compiled to the best of my recollection.  If I missed anything, please let me know.





Trevor’s Agenda:

  1.  Release # Addressed

     *   Path to release # addressed

     *   Steps -> jira tickets # addressed

     *   Code freeze date # addressed Monday, 9 Dec 2019.

  2.  Other misc. # discussed and addressed



Andy’s Agenda:


# RELEASE…



  1.  Fix Docker files # addressed: moving Dockerfiles to Docker Hub and IaC code to AP's GitHub

  2.  Create a release script for 14.1 # addressed: ticket + assigned

  3.  Fix `change-scala-version.sh` script # ticket + assigned

  4.  Add a Terraform script to examples for an ASG # addressed


# Wish list (post-14.1 release)



# In-core matrices backed by off-heap and/or shared memory; tighter coupling with GPU, native code, Python, TPUs, FPGAs.


  1.  Arrow-backed in-core

     *   Arrow is advertising sparse and dense tensors and CSR matrices, and already has vectors.

        *   Arrow’s general idea is to have off-heap shared memory between OS and GPU

     *   Been bit in the ass by them before. Not all packages are as complete as advertised.

     *   https://arrow.apache.org/docs/java/

        *   The `public final class SparseMatrixIndexCSR` could be used, as well as the `Tensor<T>` class.

  2.  Ability to stream data into in-core matrices' off-heap buffers from, e.g., NiFi.

  3.  https://github.com/apache/incubator-datasketches-memory

     *   http://datasketches.incubator.apache.org/docs/Memory/MemoryPackage.html

     *   Streaming sketch algos:

        *   https://github.com/DataSketches/DataSketches.github.io/blob/master/docs/pdf/Quantiles_KLL.pdf

        *   others

  4.  Tighter, simpler CUDA integration; if Arrow is mature enough we may have access to cuML, etc.

  5.  Working with off-heap memory also makes Python a more viable and not-so-distant possibility.

# ALGOS

  1.  GLMs

  2.  Evolutionary Algos with Spiking Neurons (FINALLY)

  3.  DrmLike[Complex128]

  4.  Currently working on a project for which I need basic streaming capabilities.

     *   To be done in https://github.com/andrewpalumbo/mahout-contrib or some such.

     *   Integrate Apache DataSketches-incubating for streaming sketching and analysis.

        *   Streaming SVD-type algorithms

        *   Find eigenvectors as data streams into in-core matrices, which are then further stacked into DRMs.




===========================================================

The next meeting will take place Friday 13 Dec 2019 at 10 AM PST.


All are welcome.


Please respond to dev@mahout.apache.org for an invite, and access to next week’s agenda.