Posted to commits@pinot.apache.org by su...@apache.org on 2019/03/14 19:17:13 UTC
[incubator-pinot] branch master updated: Add documentation for
tuning scatter and gather (#3969)
This is an automated email from the ASF dual-hosted git repository.
sunithabeeram pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-pinot.git
The following commit(s) were added to refs/heads/master by this push:
new 5867b16 Add documentation for tuning scatter and gather (#3969)
5867b16 is described below
commit 5867b16f32145a80fa98de4d24ec6302b17bb95e
Author: Seunghyun Lee <sn...@linkedin.com>
AuthorDate: Thu Mar 14 12:17:09 2019 -0700
Add documentation for tuning scatter and gather (#3969)
---
docs/img/partitioning.png | Bin 0 -> 81303 bytes
docs/img/replica-group.png | Bin 0 -> 126306 bytes
docs/index_techniques.rst | 35 ++++---
docs/tuning_pinot.rst | 12 ++-
docs/tuning_scatter_and_gather.rst | 182 +++++++++++++++++++++++++++++++++++++
5 files changed, 215 insertions(+), 14 deletions(-)
diff --git a/docs/img/partitioning.png b/docs/img/partitioning.png
new file mode 100644
index 0000000..243e9f9
Binary files /dev/null and b/docs/img/partitioning.png differ
diff --git a/docs/img/replica-group.png b/docs/img/replica-group.png
new file mode 100644
index 0000000..186477d
Binary files /dev/null and b/docs/img/replica-group.png differ
diff --git a/docs/index_techniques.rst b/docs/index_techniques.rst
index a3ade2a..b5ab93a 100644
--- a/docs/index_techniques.rst
+++ b/docs/index_techniques.rst
@@ -23,6 +23,8 @@
Index Techniques
================
+.. contents:: Table of Contents
+
Pinot currently supports the following index techniques, where each of them have their own advantages in different query
scenarios. By default, Pinot will use ``dictionary-encoded forward index`` for each column.
@@ -68,11 +70,13 @@ Raw value forward index can be configured for a table by setting it in the table
.. code-block:: none
{
- "noDictionaryColumns": [
- "column_name",
- ...
- ],
- ...
+ "tableIndexConfig": {
+ "noDictionaryColumns": [
+ "column_name",
+ ...
+ ],
+ ...
+ }
}
@@ -94,10 +98,12 @@ Sorted index can be configured for a table by setting it in the table config as
.. code-block:: none
{
- "sortedColumn": [
- "memberId"
- ],
- ...
+ "tableIndexConfig": {
+ "sortedColumn": [
+ "column_name"
+ ],
+ ...
+ }
}
Realtime server will sort data on ``sortedColumn`` when generating segment internally. For offline push, input data
@@ -126,10 +132,13 @@ Inverted index can be configured for a table by setting it in the table config a
.. code-block:: none
{
- "invertedIndexColumns": [
- "column_name"
- ],
- ...
+ "tableIndexConfig": {
+ "invertedIndexColumns": [
+ "column_name",
+ ...
+ ],
+ ...
+ }
}
diff --git a/docs/tuning_pinot.rst b/docs/tuning_pinot.rst
index 4dd9364..3b003ad 100644
--- a/docs/tuning_pinot.rst
+++ b/docs/tuning_pinot.rst
@@ -21,9 +21,19 @@
Tuning Pinot
============
+
This section provides information on various options to tune Pinot cluster for storage and query efficiency.
+Unlike a key-value store, tuning Pinot can be tricky because the cost of a query varies with the
+workload and data characteristics.
+
+To improve query latency for your use case, refer to the ``Index Techniques`` section. If your
+use case still faces scalability issues after tuning indexes, refer to ``Optimizing Scatter and Gather`` to
+improve query throughput for the Pinot cluster. If you have identified a performance issue in a specific component
+(broker or server), refer to the ``Tuning Broker`` or ``Tuning Server`` section.
+
.. toctree::
:maxdepth: 1
-
+
index_techniques
+ tuning_scatter_and_gather
diff --git a/docs/tuning_scatter_and_gather.rst b/docs/tuning_scatter_and_gather.rst
new file mode 100644
index 0000000..fe4a723
--- /dev/null
+++ b/docs/tuning_scatter_and_gather.rst
@@ -0,0 +1,182 @@
+..
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+..
+.. http://www.apache.org/licenses/LICENSE-2.0
+..
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+..
+
+Optimizing Scatter and Gather
+=============================
+
+.. contents:: Table of Contents
+
+
+When a use case requires very high qps along with low latency (usually site-facing use cases),
+we need to consider optimizing the scatter-and-gather.
+
+The table below summarizes the two issues with the default behavior of Pinot.
+
+.. csv-table::
+ :header: "Problem", "Impact", "Solution"
+ :widths: 15, 15, 15
+
+ "Querying all servers", "Bad tail latency, not scalable", "Control the number of servers to fan out"
+ "Querying all segments", "More CPU work on servers", "Minimize the number of segments"
+
+Querying All Servers
+--------------------
+By default, Pinot uses ``BalanceNumSegmentAssignmentStrategy`` for segment assignment. This scheme distributes
+segments uniformly across all servers. When scatter-and-gathering a query request, the broker tries
+to distribute the workload uniformly among servers by assigning a balanced number of segments to each server. As
+a result, each query fans out to all servers under this scheme. This works well when qps is low and the cluster
+has a small number of servers. However, as more servers are added or qps grows, the probability of
+hitting a slow server (e.g. one in a GC pause) increases steeply and Pinot suffers from long tail latency.
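A back-of-the-envelope calculation illustrates the effect (the probabilities below are hypothetical, not measured from any cluster): if each server is independently slow with probability ``p``, e.g. due to a GC pause, a query that fans out to ``n`` servers is slow whenever any one of them is.

```python
# Probability that a query hits at least one slow server when it fans
# out to n servers, each independently slow with probability p.
def prob_slow_query(n: int, p: float) -> float:
    return 1.0 - (1.0 - p) ** n

# With a hypothetical 1% chance of a GC pause per server, fanning out
# to 4 servers versus all 100 makes a large difference in tail latency.
for n in (4, 12, 100):
    print(n, round(prob_slow_query(n, p=0.01), 3))
```

This is exactly why limiting fan-out helps: the chance of a slow query grows steeply with the number of servers touched.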
+
+To address this issue, we introduced the concept of a ``Replica Group``, which allows us to control the
+number of servers each query fans out to.
+
+
+Replica Group Segment Assignment and Query Routing
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+A ``Replica Group`` is a set of servers that contains a complete copy of all segments of a table. Once segments
+are assigned based on replica groups, each query can be answered by fanning out to a single replica group instead of all servers.
+
+.. image:: img/replica-group.png
+
+``Replica Group`` based segment assignment can be configured for a table by setting it in the table config. Note that
+``ReplicaGroupSegmentAssignmentStrategy`` needs to be used along with ``PartitionAwareOffline`` for routing and this is
+currently available for **offline table** only.
+
+.. code-block:: none
+
+ {
+ "segmentsConfig": {
+ ...
+ "replication": "3",
+ "segmentAssignmentStrategy": "ReplicaGroupSegmentAssignmentStrategy",
+ "replicaGroupStrategyConfig": {
+ "mirrorAssignmentAcrossReplicaGroups": true,
+ "numInstancesPerPartition": 4
+ }
+ }
+ ...
+ "routing": {
+ "routingTableBuilderName": "PartitionAwareOffline",
+ "routingTableBuilderOptions": {}
+ },
+ ...
+ }
+
+As seen above, ``replication`` and ``numInstancesPerPartition`` control the number of servers each query spans. For
+instance, say you have 12 servers in the cluster. The configuration above generates 3 replica groups (based on
+``replication=3``), and each replica group contains 4 servers (``numInstancesPerPartition=4``). In this example, each
+query spans a single replica group (4 servers).
+
+When choosing values for ``replication`` and ``numInstancesPerPartition``, consider the trade-off between
+throughput and latency. Given a fixed number of servers, increasing the ``replication`` factor while decreasing
+``numInstancesPerPartition`` gives you more throughput because each server processes fewer queries.
+However, each server must then process more segments per query, which increases overall latency. Conversely,
+decreasing ``replication`` while increasing ``numInstancesPerPartition`` makes each server process more
+queries, but fewer segments per query. These numbers should therefore be chosen based
+on the use case requirements.
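The trade-off can be made concrete with some quick arithmetic (a sketch with hypothetical numbers, not derived from any real cluster): with a fixed fleet, raising ``replication`` spreads the qps across more groups, but leaves fewer servers per group, so each server scans more segments per query.

```python
# Sketch of the replica-group trade-off for a fixed fleet size.
def replica_group_load(num_servers, replication, num_segments, total_qps):
    instances_per_group = num_servers // replication
    # Queries are spread across `replication` groups; every server in
    # the chosen group participates in each of that group's queries.
    qps_per_server = total_qps / replication
    segments_per_server_per_query = num_segments / instances_per_group
    return qps_per_server, segments_per_server_per_query

# 12 servers, 1000 segments, 300 qps in total (all hypothetical):
for r in (1, 3, 6):
    print(r, replica_group_load(12, r, 1000, 300))
```

As ``replication`` grows from 1 to 6, per-server qps drops from 300 to 50 while segments scanned per query per server rises from ~83 to 500, which is the throughput-versus-latency trade-off described above.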
+
+
+Querying All Segments
+---------------------
+
+By default, the Pinot broker distributes all segments for query processing, and segment pruning happens on the server.
+In other words, the server looks at segment metadata such as the min/max time value and discards a segment if it does
+not contain any data the query is asking for. Server-side pruning works well when qps is low; however,
+it becomes a bottleneck when qps is very high (hundreds to thousands of queries per second) because unnecessary segments
+still need to be scheduled for processing and consume CPU resources.
+
+Currently, we have two mechanisms to prune segments on the broker side, minimizing the number of segments
+processed before scatter-and-gather.
+
+Partitioning
+^^^^^^^^^^^^
+When the data is partitioned on a dimension, each segment contains only rows with the same partition value for
+the partitioning dimension. In this case, many segments can be pruned if a query only needs to look at a single
+partition to compute its result. The diagram below shows an example where the data is partitioned on member id and the query
+includes an equality filter on member id.
+
+
+.. image:: img/partitioning.png
+
+``Partitioning`` can be enabled by setting the following configuration in the table config.
+
+.. code-block:: none
+
+ {
+ "tableIndexConfig": {
+ "segmentPartitionConfig": {
+ "columnPartitionMap": {
+ "memberId": {
+ "functionName": "modulo",
+ "numPartitions": 4
+ }
+ }
+ }
+ }
+ ...
+ "routing": {
+ "routingTableBuilderName": "PartitionAwareOffline",
+ "routingTableBuilderOptions": {}
+ },
+ }
+
+Pinot currently supports the ``modulo`` and ``murmur`` hash functions. After setting the above config, the data needs to be
+partitioned with the same partition function and number of partitions before running the Pinot segment conversion and push job
+for an offline push. Realtime tables rely on Kafka for partitioning: when emitting an event to Kafka, the user needs to
+provide the partitioning key and partition function to the Kafka producer API.
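The pruning itself can be sketched as follows (a simplified model for illustration, not Pinot's actual code; the segment-to-partition map is hypothetical): the broker computes the partition of the value in an equality filter and keeps only segments belonging to that partition.

```python
# Simplified model of broker-side segment pruning with a modulo
# partition function on memberId.
NUM_PARTITIONS = 4

def partition_of(member_id: int) -> int:
    return member_id % NUM_PARTITIONS

# Each segment's metadata records the partition its rows belong to.
segments = {"seg_0": 0, "seg_1": 1, "seg_2": 2, "seg_3": 3}

def prune(segments, member_id):
    """Keep only segments that can match the filter `memberId = member_id`."""
    target = partition_of(member_id)
    return [name for name, part in segments.items() if part == target]

print(prune(segments, 7))  # memberId 7 -> partition 3 -> only seg_3 survives
```

With four partitions, a single-value equality filter lets the broker skip roughly three quarters of the segments before scatter-and-gather.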
+
+When applied correctly, partition information should be available in the segment metadata.
+
+.. code-block:: none
+
+ column.memberId.partitionFunction = Murmur
+ column.memberId.partitionValues = [9 9]
+
+
+Note that broker-side pruning for partitioning only works with the ``PartitionAwareOffline`` and ``PartitionAwareRealtime`` routing
+table builder strategies. Also note that the current implementation for partitioning only works for **EQUALITY** filters
+(e.g. ``memberId = xx``).
+
+
+Bloom Filter for Dictionary
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Dictionary encoding provides the array of unique values for a column. Pinot can create a bloom filter on these
+unique values for each column. A bloom filter can quickly determine whether a value exists in the segment.
+
+Bloom filters can be enabled by setting the following configuration in the table config.
+
+.. code-block:: none
+
+ {
+ "tableIndexConfig": {
+ "bloomFilterColumns": [
+ "column_name",
+ ...
+ ],
+ ...
+ }
+ }
+
+Our implementation limits the size of the bloom filter to less than 1MB per segment, with a maximum false positive rate of 5%, to
+avoid consuming too much memory. We recommend using a bloom filter on columns with ``less than 1 million cardinality``.
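The 1MB bound is consistent with the standard bloom filter sizing formula (shown here as a sanity check on the guidance above, not as Pinot's actual implementation): a filter for ``n`` values at false-positive rate ``p`` needs about ``-n * ln(p) / (ln 2)^2`` bits.

```python
import math

# Textbook bloom filter sizing: bits needed for n entries at
# false-positive rate p.
def bloom_bits(n: int, p: float) -> float:
    return -n * math.log(p) / math.log(2) ** 2

# At 5% false positives, one million unique values fits comfortably
# under the 1MB-per-segment limit mentioned above.
kb = bloom_bits(1_000_000, 0.05) / 8 / 1024
print(round(kb), "KB")
```

This is why the cardinality recommendation and the 1MB/5% limits fit together: past roughly a million unique values, the filter would exceed the size cap and its false-positive rate would degrade.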
+
+Note that the current bloom filter implementation also works only for **EQUALITY** filters.
\ No newline at end of file