Posted to commits@pinot.apache.org by "GSharayu (via GitHub)" <gi...@apache.org> on 2023/03/09 16:04:56 UTC

[GitHub] [pinot] GSharayu commented on a diff in pull request #10394: Pinot Spark Connector for Spark3

GSharayu commented on code in PR #10394:
URL: https://github.com/apache/pinot/pull/10394#discussion_r1131253107


##########
pinot-connectors/pinot-spark-3-connector/documentation/read_model.md:
##########
@@ -0,0 +1,140 @@
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+-->
+# Read Model
+
+The connector can scan offline, realtime, and hybrid tables. Two base options, `table` and `tableType`, must be given as below:
+- For an offline table: `table: tbl`, `tableType: OFFLINE or offline`
+- For a realtime table: `table: tbl`, `tableType: REALTIME or realtime`
+- For a hybrid table: `table: tbl`, `tableType: HYBRID or hybrid`
+
+An example scan:
+
+```scala
+val df = spark.read
+      .format("pinot")
+      .option("table", "airlineStats")
+      .option("tableType", "offline")
+      .load()
+```
+
+A custom schema can be specified directly. If no schema is specified, the connector reads the table schema from the Pinot controller and converts it to a Spark schema.
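+As a sketch, a custom schema could be passed through Spark's standard `DataFrameReader.schema` method. The column names below are illustrative assumptions, not the actual `airlineStats` schema:
+
+```scala
+import org.apache.spark.sql.types._
+
+// Hypothetical columns for illustration; replace with your table's columns.
+val customSchema = StructType(Seq(
+  StructField("Carrier", StringType, nullable = true),
+  StructField("ArrDelay", IntegerType, nullable = true)
+))
+
+val df = spark.read
+      .format("pinot")
+      .option("table", "airlineStats")
+      .option("tableType", "offline")
+      .schema(customSchema) // overrides the schema fetched from the controller
+      .load()
+```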
+
+### Architecture
+
+The connector reads data directly from `Pinot Servers`. To do so, it first creates a query with the given columns and filters (if filter push-down is enabled), then fetches the routing table for that query. Based on the routing table and `segmentsPerSplit` (explained in detail below), it creates Pinot splits that contain **ONE PINOT SERVER and ONE OR MORE SEGMENTS per Spark partition**. Finally, each partition reads data from its assigned Pinot server in parallel.
+
+![Spark-Pinot Connector Architecture](images/spark-pinot-connector-executor-server-interaction.jpg)
+
+Each Spark partition opens a connection to a Pinot server and reads data. For example, assume the routing table for a given query looks like this:
+
+```
+- realtime ->
+   - realtimeServer1 -> (segment1, segment2, segment3)
+   - realtimeServer2 -> (segment4)
+- offline ->
+   - offlineServer10 -> (segment10, segment20)
+```
+
+If `segmentsPerSplit` is equal to 3, three Spark partitions will be created as below:
+
+| Spark Partition  | Queried Pinot Server/Segments |
+| ------------- | ------------- |
+| partition1  | realtimeServer1 / segment1, segment2, segment3  |
+| partition2  | realtimeServer2 / segment4  |
+| partition3  | offlineServer10 / segment10, segment20 |
+
+If `segmentsPerSplit` is equal to 1, six Spark partitions will be created:
+
+| Spark Partition  | Queried Pinot Server/Segments |
+| ------------- | ------------- |
+| partition1  | realtimeServer1 / segment1 |
+| partition2  | realtimeServer1 / segment2  |
+| partition3  | realtimeServer1 / segment3 |
+| partition4  | realtimeServer2 / segment4 |
+| partition5  | offlineServer10 / segment10 |
+| partition6  | offlineServer10 / segment20 |
+
+A lower `segmentsPerSplit` value means more parallelism, but also more connections opened to the Pinot servers and higher QPS on them.
+
+A higher `segmentsPerSplit` value means less parallelism; each Pinot server will scan more segments per request.
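+The split assignment shown in the tables above can be sketched in plain Scala. This is a simplified illustration of the grouping logic, not the connector's actual implementation:
+
+```scala
+// Routing table from the example above: server -> its segments.
+val routing = Seq(
+  "realtimeServer1" -> Seq("segment1", "segment2", "segment3"),
+  "realtimeServer2" -> Seq("segment4"),
+  "offlineServer10" -> Seq("segment10", "segment20")
+)
+
+// Each split holds one server and at most segmentsPerSplit of its segments;
+// each split becomes one Spark partition.
+def makeSplits(routing: Seq[(String, Seq[String])],
+               segmentsPerSplit: Int): Seq[(String, Seq[String])] =
+  routing.flatMap { case (server, segments) =>
+    segments.grouped(segmentsPerSplit).map(server -> _)
+  }
+
+// segmentsPerSplit = 3 yields 3 partitions; segmentsPerSplit = 1 yields 6.
+```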

Review Comment:
   (nit) typo segmentsPerSplit



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

