You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "zhengruifeng (via GitHub)" <gi...@apache.org> on 2023/03/07 02:39:36 UTC

[GitHub] [spark] zhengruifeng commented on a diff in pull request #40297: [SPARK-42412][WIP] Initial PR of Spark connect ML

zhengruifeng commented on code in PR #40297:
URL: https://github.com/apache/spark/pull/40297#discussion_r1127232443


##########
connector/connect/common/src/main/protobuf/spark/connect/ml.proto:
##########
@@ -0,0 +1,136 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+syntax = 'proto3';
+
+package spark.connect;
+
+import "google/protobuf/any.proto";
+import "spark/connect/expressions.proto";
+import "spark/connect/types.proto";
+import "spark/connect/relations.proto";
+import "spark/connect/ml_common.proto";
+
+option java_multiple_files = true;
+option java_package = "org.apache.spark.connect.proto";
+
+
+message Evaluator {
+  string name = 1;
+  Params params = 2;
+  string uid = 3;
+}
+
+
+message MlCommand {
+  oneof ml_command_type {
+    Fit fit = 1;
+    ModelAttr model_attr = 2;
+    ModelSummaryAttr model_summary_attr = 3;
+    LoadModel load_model = 4;
+    SaveModel save_model = 5;
+    Evaluate evaluate = 6;
+  }
+
+  message Fit {
+    Stage estimator = 1;
+    Relation dataset = 2;
+  }
+
+  message Evaluate {
+    Evaluator evaluator = 1;
+  }
+
+  message LoadModel {
+    string name = 1;
+    string path = 2;
+  }
+
+  message SaveModel {
+    int64 model_ref_id = 1;
+    string path = 2; // saving path
+    bool overwrite = 3;
+    map<string, string> options = 4; // saving options
+  }
+
+  message ModelAttr {
+    int64 model_ref_id = 1;
+    string name = 2;
+  }
+
+  message ModelSummaryAttr {
+    int64 model_ref_id = 1;
+    string name = 2;
+    Params params = 3;
+
+    // Evaluation dataset that it uses to computes
+    // the summary attribute
+    // If not set, get attributes from
+    // model.summary (i.e. the summary on training dataset)
+    optional Relation evaluation_dataset = 4;
+  }
+}
+
+
+message MlCommandResponse {
+  oneof ml_command_response_type {
+    Expression.Literal literal = 1;
+    ModelInfo model_info = 2;
+    Vector vector = 3;
+    Matrix matrix = 4;
+  }
+  message ModelInfo {
+    int64 model_ref_id = 1;
+    string model_uid = 2;
+  }
+}
+
+
+message Vector {
+  oneof one_of {
+    Dense dense = 1;
+    Sparse sparse = 2;
+  }
+  message Dense {
+    repeated double values = 1;
+  }
+  message Sparse {
+    int32 size = 1;
+    repeated double indices = 2;
+    repeated double values = 3;
+  }
+}
+
+message Matrix {
+  oneof one_of {
+    Dense dense = 1;
+    Sparse sparse = 2;
+  }
+  message Dense {
+    int32 num_rows = 1;
+    int32 num_cols = 2;
+    repeated double values = 3;

Review Comment:
   nit, `DenseMatrix` also has `isTransposed`



##########
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/ml/AlgorithmRegisty.scala:
##########
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connect.ml
+
+import org.apache.spark.connect.proto
+import org.apache.spark.ml
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.classification.TrainingSummary
+import org.apache.spark.sql.DataFrame
+
+
+object AlgorithmRegistry {
+
+  def get(name: String): Algorithm = {
+    name match {
+      case "LogisticRegression" => new LogisticRegressionAlgorithm
+      case _ =>
+        throw new IllegalArgumentException()
+    }
+  }
+
+}
+
+
+abstract class Algorithm {
+
+  def initiateEstimator(uid: String): Estimator[_]
+
+  def getModelAttr(model: Model[_], name: String): Either[proto.MlCommandResponse, DataFrame]

Review Comment:
   why not making `DataFrame`/`Relation` an `oneof` item in `MlCommandResponse` ?



##########
connector/connect/common/src/main/protobuf/spark/connect/expressions.proto:
##########
@@ -172,6 +172,12 @@ message Expression {
       CalendarInterval calendar_interval = 19;
       int32 year_month_interval = 20;
       int64 day_time_interval = 21;
+
+      List list = 99;

Review Comment:
   https://github.com/apache/spark/commit/0def3de6ed1000efe72c8bbdd3b3804bb34ce620 just introduced the literal array type, we can reuse it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org