Posted to commits@pinot.apache.org by mc...@apache.org on 2018/11/27 22:47:51 UTC

[incubator-pinot] branch doc-fixes created (now 8d54f8b)

This is an automated email from the ASF dual-hosted git repository.

mcvsubbu pushed a change to branch doc-fixes
in repository https://gitbox.apache.org/repos/asf/incubator-pinot.git.


      at 8d54f8b  Fixes to doc

This branch includes the following new commits:

     new 8d54f8b  Fixes to doc

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[incubator-pinot] 01/01: Fixes to doc

Posted by mc...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

mcvsubbu pushed a commit to branch doc-fixes
in repository https://gitbox.apache.org/repos/asf/incubator-pinot.git

commit 8d54f8b6c8314193b6faef5ae5e583a095e86319
Author: Subbu Subramaniam <ss...@linkedin.com>
AuthorDate: Tue Nov 27 14:47:07 2018 -0800

    Fixes to doc
---
 docs/architecture.rst            | 128 +++++++++++++++++++++++++--------------
 docs/creating_pinot_segments.rst |   2 +-
 docs/expressions_udf.rst         |   2 +-
 docs/intro.rst                   |  26 +++-----
 4 files changed, 94 insertions(+), 64 deletions(-)

diff --git a/docs/architecture.rst b/docs/architecture.rst
index 9b61f2a..27d583e 100644
--- a/docs/architecture.rst
+++ b/docs/architecture.rst
@@ -8,87 +8,125 @@ Architecture
 Terminology
 -----------
 
-* Table: A table is a logical abstraction to refer to a collection of related data. It consists of columns and rows (Document). Table Schema defines column names and their metadata.
-* Segment: Data in table is divided into shards referred to as segments.
+*Table*
+    A table is a logical abstraction to refer to a collection of related data. It consists of columns and rows (documents).
+*Segment*
+    Data in a table is divided into (horizontal) shards referred to as segments.
 
 Pinot Components
 ----------------
 
-* Pinot Controller: Manages other pinot components (brokers, servers) as well as controls assignment of tables/segments to servers.
-* Pinot Server: Hosts one or more segments and serves queries from those segments
-* Pinot Broker: Accepts queries from clients and routes them to one or more servers, and returns consolidated response to the server.
+*Pinot Controller*
+    Manages the other Pinot components (brokers, servers) and controls the assignment of tables/segments to servers.
+*Pinot Server*
+    Hosts one or more segments and serves queries from those segments.
+*Pinot Broker*
+    Accepts queries from clients, routes them to one or more servers, and returns a consolidated response to the client.
 
 Pinot leverages `Apache Helix <http://helix.apache.org>`_ for cluster management. 
-Apache Helix is a generic cluster management framework to manage partitions and replicas in a distributed system. See http://helix.apache.org for additional information.
+Helix is a cluster management framework to manage replicated, partitioned resources in a distributed system.
 Helix uses Zookeeper to store cluster state and metadata.
 
-Briefly, Helix divides nodes into 3 logical components based on their responsibilities:
+Briefly, Helix divides nodes into three logical components based on their responsibilities:
 
-*  **Participant**: The nodes that host distributed, partitioned resources
-*  **Spectator**: The nodes that observe the current state of each Participant and use that information to access the resources.
-   Spectators are notified of state changes in the cluster (state of a participant, or that of a partition in a participant).
-*  **Controller**: The node that observes and controls the Participant nodes. It is responsible for coordinating all transitions
-   in the cluster and ensuring that state constraints are satisfied while maintaining cluster stability
+*Participant*
+    The nodes that host distributed, partitioned resources
+*Spectator*
+    The nodes that observe the current state of each Participant and use that information to access the resources.
+    Spectators are notified of state changes in the cluster (state of a participant, or that of a partition in a participant).
+*Controller*
+    The node that observes and controls the Participant nodes. It is responsible for coordinating all transitions
+    in the cluster and ensuring that state constraints are satisfied while maintaining cluster stability
 
-Pinot Controller hosts Helix Controller, in addition to hosting APIs for Pinot cluster administration and data ingestion.
+Pinot Controller hosts Helix Controller, in addition to hosting REST APIs for Pinot cluster administration and data ingestion.
 There can be multiple instances of Pinot controller for redundancy. If there are multiple controllers, Pinot expects that all
-of them are configured with the same back-end storage system so that they have a common view of the segments (_e.g._ NFS).
+of them are configured with the same back-end storage system so that they have a common view of the segments (*e.g.* NFS).
 Pinot can use other storage systems such as HDFS or `ADLS <https://azure.microsoft.com/en-us/services/storage/data-lake-storage/>`_.
 
-Pinot Servers are modeled as Helix Participants, hosting Pinot tables (referred to as 'resources' in helix terminology).
+Pinot Servers are modeled as Helix Participants, hosting Pinot tables (referred to as *resources* in helix terminology).
 Segments of a table are modeled as Helix partitions (of a resource). Thus, a Pinot server hosts one or more helix partitions of one
-or more helix resources (_i.e._ one or more segments of one or more tables).
+or more helix resources (*i.e.* one or more segments of one or more tables).
+
+Pinot Brokers are modeled as Spectators. They need to know the location of each segment of a table (and each replica of the
+segments)
+and route requests to the
+appropriate server that hosts the segments of the table being queried. The broker ensures that all the rows of the table
+are queried exactly once so as to return correct, consistent results for a query. The brokers (or servers) may optimize
+to prune some of the segments as long as accuracy is not sacrificed. In case of hybrid tables, the brokers ensure that
+the overlap between realtime and offline segment data is queried exactly once.
+Helix provides the framework by which spectators can learn the location (*i.e.* participant) in which each partition
+of a resource resides. The brokers use this mechanism to learn the servers that host specific segments of a table.
+
+Pinot Tables
+------------
 
-Pinot Brokers are modeled as Spectators. They need to know the location of each segment of a table and route requests to the
-appropriate server that hosts the segments of the table being queried. In general, all segments must be queried exactly once
-in order for a query to return the correct response. There may be multilpe copies of a segment (for redundancy). Helix provides
-the framework by which spectators can learn the location (i.e. participant) in which each partition of a resource resides.
+Pinot supports realtime, or offline, or hybrid tables. Data in Pinot tables is contained in the segments
+belonging to that table. A Pinot table is modeled as a Helix resource.  Each segment of a table is modeled as a Helix partition.
 
-Pinot tables
-------------
+The table schema defines column names and their metadata. Table configuration and schema are stored in Zookeeper.
 
-Tables in Pinot can be configured to be offline only, or realtime only, or a hybrid of these two.
+Offline tables ingest pre-built Pinot segments from external data stores, whereas Realtime tables
+ingest data from streams (such as Kafka) and build segments.
 
-Segments for offline tables are constructed outside of Pinot, typically in Hadoop via map-reduce jobs. These segments are then ingested
-into Pinot via REST API provided by the Controller. The controller looks up the table's configuration and assigns the segment
-to the servers that host the table. It may assign multiple servers for each servers depending on the number of replicas 
-configured for that table.
+A hybrid Pinot table essentially has both realtime as well as offline tables. 
+In such a table, offline segments may be pushed periodically (say, once a day). The retention on the offline table
+can be set to a high value (say, a few years) since segments are coming in on a periodic basis, whereas the retention
+on the realtime part can be small (say, a few days). Once an offline segment is pushed to cover a recent time period,
+the brokers automatically switch to using the offline table for segments in *that* time period, and use realtime table
+only to cover later segments for which offline data may not be available yet.
+
+Note that the query does not know the existence of offline or realtime tables. It only specifies the table name
+in the query.
+
+
+Ingesting Offline data
+^^^^^^^^^^^^^^^^^^^^^^
+Segments for offline tables are constructed outside of Pinot, typically in Hadoop via map-reduce jobs
+and ingested into Pinot via REST API provided by the Controller.
 Pinot provides libraries to create Pinot segments out of input files in AVRO, JSON or CSV formats in a hadoop job, and push
 the constructed segments to the controlers via REST APIs.
 
+When an Offline segment is ingested, the controller looks up the table's configuration and assigns the segment
+to the servers that host the table. It may assign multiple servers for each segment depending on the number of replicas 
+configured for that table.
+
 Pinot supports different segment assignment strategies that are optimized for various use cases.
 
 Once segments are assigned, Pinot servers get notified via Helix to "host" the segment. The servers download the segments
 (as a cached local copy to serve queries) and load them into local memory. All segment data is maintained in memory as long
 as the server hosts that segment.
 
-Once the server has loaded the segment, brokers come to know of the availability of these segments and start include the new
-segments for queries. Brokers support different routing strategeies depending on the type of table, the segment assignment
+Once the server has loaded the segment, Helix notifies brokers of the availability of these segments. The brokers 
+start including the new
+segments for queries. Brokers support different routing strategies depending on the type of table, the segment assignment
 strategy and the use case.
 
-Realtime tables, on the other hand, ingest data directly from incoming data streams (such as Kafka). Multiple servers may
-ingest the same data for replication. The servers stop ingesting data after reaching a threshold and "build" a segment of
-the data ingested so far. Once that segment is loaded (just like the offline segments described earlier), they continue
-to consume the next set of events from the stream.
+Data in offline segments is immutable (rows cannot be added, deleted, or modified). However, segments may be replaced with modified data.
 
-Depending on the type of consumer configured, realtime segments may be held locally in the server, or pushed the controller.
+Ingesting Realtime Data
+^^^^^^^^^^^^^^^^^^^^^^^
+Segments for realtime tables are constructed by Pinot servers. The servers ingest rows from realtime streams (such as
+Kafka) until
+some completion threshold (such as a number of rows, or a time threshold) is reached, and then build a segment out of those rows. Depending
+on the type of ingestion mechanism used (stream or partition level), segments may be locally stored in the servers
+or in the controller's segment store.
 
-**TODO Add reference to the realtime section here**
+Multiple servers may ingest the same data to increase availability and share query load.
 
-A hybrid Pinot table essentially has both realtime as well as offline tables. 
-In such a table, offline segments may be pushed periodically (say, once a day). The retention on the offline table
-can be set to a high value (say, a few years) since segments are coming in on a periodic basis, whereas the retention
-on the realtime part can be small (say, a few days). Once an offline segment is pushed to cover a recent time period,
-the brokers automatically switch to using the offline table for segments in _that_ time period, and use realtime table
-only to cover later segments for which offline data may not be available yet.
+Once a realtime segment is built and loaded, the servers continue
+to consume from where they left off.
+
+Realtime segments are immutable once they are completed. While realtime segments are being consumed they are mutable,
+in the sense that new rows can be added to them. Rows cannot be deleted from segments.
 
-Note that the query does not know the existence of offline or realtime tables. It only specifies the table name
-in the query.
+
+See :doc:`realtime design <llc>` for details.
 
 
 Pinot Segments
 --------------
-As mentioned earlier, each Pinot segment is a horizontal shard of the Pinot table. The segment is laid out in a columnar format
+
+A segment is laid out in a columnar format
 so that it can be directly mapped into memory for serving queries. Columns may be single or multi-valued. Column types may be
 STRING, INT, LONG, FLOAT, DOUBLE or BYTES. Columns may be declared to be metric or dimension (or specifically as a time dimension)
 in the schema.
@@ -102,5 +140,3 @@ configured for any set of columns. Inverted indices, while taking up more storag
 
 Specialized indexes like StartTree index is also supported.
 
-**TODO Add/Link startree doc here**
-
diff --git a/docs/creating_pinot_segments.rst b/docs/creating_pinot_segments.rst
index 0425c70..5bae71e 100644
--- a/docs/creating_pinot_segments.rst
+++ b/docs/creating_pinot_segments.rst
@@ -5,7 +5,7 @@ This document describes steps required for creating Pinot2_0 segments from stand
 
 Compiling the code
 ------------------
-Follow the steps described in `trying_pinot`_ to build pinot. Locate ``pinot-admin.sh`` in ``pinot-tools/trget/pinot-tools=pkg/bin/pinot-admin.sh``.
+Follow the steps described in the section on :doc:`Demonstration <trying_pinot>` to build pinot. Locate ``pinot-admin.sh`` in ``pinot-tools/target/pinot-tools-pkg/bin/pinot-admin.sh``.
 
 
 Data Preparation
diff --git a/docs/expressions_udf.rst b/docs/expressions_udf.rst
index 7a42f0f..d373424 100644
--- a/docs/expressions_udf.rst
+++ b/docs/expressions_udf.rst
@@ -3,7 +3,7 @@ Expressions and UDFs
 
 Requirements
 ~~~~~~~~~~~~
-The query language for Pinot (pql_) currently only supports *selection*, *aggregation* & *group by* operations on columns, and moreover, do not support nested operations. There are a growing number of use-cases of Pinot that require some sort of transformation on the column values, before and/or after performing *selection*, *aggregation* & *group by*. One very common example is when we would want to aggregate *metrics* over different granularity of times, without needing to pre-aggregat [...]
+The query language for Pinot (:doc:`PQL <reference>`) currently only supports *selection*, *aggregation* & *group by* operations on columns, and moreover, does not support nested operations. There is a growing number of use cases of Pinot that require some sort of transformation on the column values, before and/or after performing *selection*, *aggregation* & *group by*. One very common example is when we would want to aggregate *metrics* over different granularity of times, without needi [...]
 
 The high level requirement here is to support *expressions* that represent a function on a set of columns in the queries, as opposed to just columns.
 
diff --git a/docs/intro.rst b/docs/intro.rst
index c989b04..b169b4a 100644
--- a/docs/intro.rst
+++ b/docs/intro.rst
@@ -29,25 +29,19 @@ Because of the design choices we made to achieve these goals, there are certain
 
 Pinot works very well for querying time series data with lots of Dimensions and Metrics. For example:
 
-::
+.. code-block:: sql
 
-    SELECT sum(clicks), sum(impressions) FROM AdAnalyticsTable WHERE ((daysSinceEpoch >= 17849 AND daysSinceEpoch <= 17856)) AND accountId IN (123456789) GROUP BY daysSinceEpoch TOP 15000
-    SELECT sum(impressions) FROM AdAnalyticsTable WHERE (daysSinceEpoch >= 17824 and daysSinceEpoch <= 17854) AND adveriserId = '1234356789' GROUP BY daysSinceEpoch,advertiserId TOP 1000
-    SELECT sum(cost) FROM AdAnalyticsTable GROUP BY advertiserId TOP 50
-
-
-Terminology
+    SELECT sum(clicks), sum(impressions) FROM AdAnalyticsTable
+      WHERE ((daysSinceEpoch >= 17849 AND daysSinceEpoch <= 17856)) AND accountId IN (123456789)
+      GROUP BY daysSinceEpoch TOP 100
 
-* Table: A table is a logical abstraction to refer to a collection of related data. It consists of columns and rows (Document). Table Schema defines column names and their metadata.
-* Segment: Data in table is divided into shards referred to as segments.
+.. code-block:: sql
 
-Pinot Components
+    SELECT sum(impressions) FROM AdAnalyticsTable
+      WHERE (daysSinceEpoch >= 17824 AND daysSinceEpoch <= 17854) AND advertiserId = '1234356789'
+      GROUP BY daysSinceEpoch,advertiserId TOP 100
 
-* Pinot Controller: Manages other pinot components (brokers, servers) as well as controls assignment of tables/segments to servers.
-* Pinot Server: Hosts one or more segments and serves queries from those segments
-* Pinot Broker: Accepts queries from clients and routes them to one or more servers, and returns consolidated response to the server.
+.. code-block:: sql
 
-Pinot leverages [Apache Helix](http://helix.apache.org) for cluster management. 
-
-For more information on Pinot Design and Architecture can be found [here](https://github.com/linkedin/pinot/wiki/Architecture)
+    SELECT sum(cost) FROM AdAnalyticsTable GROUP BY advertiserId TOP 50
 
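
The hybrid-table paragraph in docs/architecture.rst above says that brokers use the offline table for time periods covered by pushed offline segments and the realtime table only for later data, so that the overlap is queried exactly once. A minimal sketch of that idea follows. The names ``Segment`` and ``plan_hybrid_query`` and the boundary rule (the latest day covered by any offline segment) are assumptions made for illustration, not Pinot's actual classes or its exact boundary computation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Segment:
    """Hypothetical segment descriptor; day values are days since epoch, inclusive."""
    name: str
    start_day: int
    end_day: int


def plan_hybrid_query(offline: List[Segment], realtime: List[Segment]) -> Dict[str, dict]:
    """Split one logical query into an offline part and a realtime part.

    Assumed rule: the boundary is the latest day covered by any offline
    segment. Rows on or before the boundary come from the offline table,
    rows after it from the realtime table, so no day is counted twice.
    """
    boundary = max(seg.end_day for seg in offline)
    # Realtime segments that end on or before the boundary add nothing new.
    realtime_needed = [seg for seg in realtime if seg.end_day > boundary]
    return {
        "offline": {
            "segments": list(offline),
            "filter": f"daysSinceEpoch <= {boundary}",
        },
        "realtime": {
            "segments": realtime_needed,
            "filter": f"daysSinceEpoch > {boundary}",
        },
    }


if __name__ == "__main__":
    offline = [Segment("off_0", 17849, 17850), Segment("off_1", 17851, 17852)]
    realtime = [
        Segment("rt_0", 17850, 17851),
        Segment("rt_1", 17852, 17853),
        Segment("rt_2", 17854, 17855),
    ]
    plan = plan_hybrid_query(offline, realtime)
    for side, part in plan.items():
        print(side, part["filter"], [seg.name for seg in part["segments"]])
```

The client still names only the table; the split into two filtered sub-queries happens on the broker side, which matches the note in the doc that the query does not know about the offline or realtime tables.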

