Posted to commits@pinot.apache.org by ja...@apache.org on 2019/03/25 23:46:16 UTC

[incubator-pinot] branch master updated: Add docs for record reader (#4010)

This is an automated email from the ASF dual-hosted git repository.

jackie pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-pinot.git


The following commit(s) were added to refs/heads/master by this push:
     new a4a3571  Add docs for record reader (#4010)
a4a3571 is described below

commit a4a357179b61f6c96a70666c4533e17d7fb055ec
Author: Xiaotian (Jackie) Jiang <17...@users.noreply.github.com>
AuthorDate: Mon Mar 25 16:46:10 2019 -0700

    Add docs for record reader (#4010)
---
 docs/extensions.rst    |   3 +-
 docs/record_reader.rst | 104 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 106 insertions(+), 1 deletion(-)

diff --git a/docs/extensions.rst b/docs/extensions.rst
index 43872b8..678a924 100644
--- a/docs/extensions.rst
+++ b/docs/extensions.rst
@@ -26,4 +26,5 @@ This section provides an overview of options to extend Pinot code to make Pinot
 
    pluggable_streams
    segment_fetcher
-   pluggable_storage
\ No newline at end of file
+   record_reader
+   pluggable_storage
diff --git a/docs/record_reader.rst b/docs/record_reader.rst
new file mode 100644
index 0000000..a2a10b8
--- /dev/null
+++ b/docs/record_reader.rst
@@ -0,0 +1,104 @@
+..
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+..
+..   http://www.apache.org/licenses/LICENSE-2.0
+..
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+..
+
+Record Reader
+=============
+
+Pinot supports indexing data from various file formats. To support reading from a file format, a record reader needs to
+be provided to read the file and convert records into the general format that the indexing engine can understand. The
+record reader serves as the connector from each individual file format to the Pinot record format.
+
+The Pinot package provides the following record readers out of the box:
+
+- Avro record reader: record reader for Avro format files
+- CSV record reader: record reader for CSV format files
+- JSON record reader: record reader for JSON format files
+- ORC record reader: record reader for ORC format files
+- Thrift record reader: record reader for Thrift format files
+- Pinot segment record reader: record reader for Pinot segments
+
+Initialize Record Reader
+------------------------
+
+To initialize a record reader, provide the data file and the table schema (the Pinot segment record reader needs only
+the index directory, because the schema can be derived from the segment). The output records will follow the
+table schema provided.
+
+For the Avro, JSON, ORC, and Pinot segment record readers, no extra configuration is required because column names and
+multi-values are embedded in the data file.
+
+For the CSV and Thrift record readers, extra configuration determines the column names and multi-values for
+the data (the config is optional for CSV and mandatory for Thrift).
+
+CSV Record Reader Config
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+The CSV record reader config contains the following settings:
+
+- Header: the header for the CSV file (column names)
+- Column delimiter: delimiter for each column
+- Multi-value delimiter: delimiter for each value for a multi-valued column
+
+If no config is provided, the default settings are used:
+
+- Use the first row in the data file as the header
+- Use ',' as the column delimiter
+- Use ';' as the multi-value delimiter
+
+Thrift Record Reader Config
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The Thrift record reader config is mandatory. It contains the Thrift class name for the record reader to de-serialize
+the Thrift objects.
+
+Implement Your Own Record Reader
+--------------------------------
+
+For other file formats, we provide a general record reader interface: `RecordReader <https://github.com/apache/incubator-pinot/blob/master/pinot-core/src/main/java/org/apache/pinot/core/data/readers/RecordReader.java>`_.
+To index the file into a Pinot segment, implement the interface and plug it into the index engine: `SegmentIndexCreationDriverImpl <https://github.com/apache/incubator-pinot/blob/master/pinot-core/src/main/java/org/apache/pinot/core/segment/creator/impl/SegmentIndexCreationDriverImpl.java>`_.
+We use a two-pass algorithm to index the file into a Pinot segment, hence the *rewind()* method is required for the record
+reader.
+
+Generic Row
+~~~~~~~~~~~
+
+`GenericRow <https://github.com/apache/incubator-pinot/blob/master/pinot-core/src/main/java/org/apache/pinot/core/data/GenericRow.java>`_
+is the record abstraction that the index engine can read and index. It is a map from column name (String) to
+column value (Object). For a multi-valued column, the value should be an object array (Object[]).
+
+Contracts for Record Reader
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+There are several contracts for record readers that developers should follow when implementing their own record readers:
+
+- The output GenericRow should follow the table schema provided, in the sense that:
+
+  - All the columns in the schema should be preserved (if a column does not exist in the original record, put the
+    default value instead)
+  - Columns not in the schema should not be included
+  - Values for each column should follow the field spec from the schema (data type, single-valued/multi-valued)
+
+- For the time column (refer to `TimeFieldSpec <https://github.com/apache/incubator-pinot/blob/master/pinot-common/src/main/java/org/apache/pinot/common/data/TimeFieldSpec.java>`_),
+  the record reader should be able to read both incoming and outgoing time. During index creation we allow conversion
+  from *incoming time* (the time value in the original data) to *outgoing time* (the time value stored in Pinot).
+
+  - If the incoming and outgoing time column names are the same, use the incoming time field spec
+  - If the incoming and outgoing time column names are different, put both of them as time field specs
+  - We keep both the incoming and outgoing time columns to handle cases where the input file contains time values that
+    are already converted
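
Illustration (not part of the commit): the record reader contracts documented in the diff above could be sketched
roughly as follows. This is a minimal, self-contained sketch and not Pinot code. The names SimpleRecordReader,
InMemoryCsvReader and RecordReaderSketch are hypothetical, a plain Map stands in for GenericRow, the schema is
reduced to "column name -> is multi-valued", and the string "null" is an assumed placeholder default value. It uses
the CSV defaults listed above (first row as header, ',' between columns, ';' between multi-values) and does not
handle quoted fields.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Demo driver: reads the same input twice, which is why rewind() is needed. */
class RecordReaderSketch {
  public static void main(String[] args) {
    Map<String, Boolean> schema = new LinkedHashMap<>();
    schema.put("id", false);      // single-valued column
    schema.put("tags", true);     // multi-valued column
    schema.put("country", false); // absent from the file, gets the default value

    SimpleRecordReader reader = new InMemoryCsvReader(
        Arrays.asList("id,tags,ignored", "1,a;b,x", "2,c,y"), schema);

    // First pass: gather statistics (here just a row count).
    int rows = 0;
    while (reader.hasNext()) {
      reader.next();
      rows++;
    }
    System.out.println("rows seen in first pass: " + rows);

    // Second pass: start over and consume the rows again.
    reader.rewind();
    while (reader.hasNext()) {
      Map<String, Object> row = reader.next();
      System.out.println(row.keySet());
    }
  }
}

/** Hypothetical, simplified stand-in for a record reader interface (an assumption). */
interface SimpleRecordReader {
  boolean hasNext();
  Map<String, Object> next();
  void rewind();
}

/** Reads CSV lines held in memory and emits rows that follow the given schema. */
class InMemoryCsvReader implements SimpleRecordReader {
  private static final String COLUMN_DELIMITER = ",";
  private static final String MULTI_VALUE_DELIMITER = ";";
  private static final String DEFAULT_VALUE = "null"; // assumed placeholder default

  private final String[] header;
  private final List<String> dataLines;
  private final Map<String, Boolean> schema; // column name -> true if multi-valued
  private int cursor = 0;

  InMemoryCsvReader(List<String> lines, Map<String, Boolean> schema) {
    // The first row is treated as the header (column names).
    this.header = lines.get(0).split(Pattern.quote(COLUMN_DELIMITER), -1);
    this.dataLines = lines.subList(1, lines.size());
    this.schema = schema;
  }

  @Override
  public boolean hasNext() {
    return cursor < dataLines.size();
  }

  @Override
  public Map<String, Object> next() {
    String[] cells = dataLines.get(cursor++).split(Pattern.quote(COLUMN_DELIMITER), -1);
    Map<String, String> raw = new HashMap<>();
    for (int i = 0; i < header.length && i < cells.length; i++) {
      raw.put(header[i], cells[i]);
    }
    // Emit exactly the schema columns: extras are dropped, missing ones get the default.
    Map<String, Object> row = new LinkedHashMap<>();
    for (Map.Entry<String, Boolean> column : schema.entrySet()) {
      String value = raw.getOrDefault(column.getKey(), DEFAULT_VALUE);
      if (column.getValue()) {
        Object[] values = value.split(Pattern.quote(MULTI_VALUE_DELIMITER), -1);
        row.put(column.getKey(), values); // multi-valued column: Object[]
      } else {
        row.put(column.getKey(), value);
      }
    }
    return row;
  }

  @Override
  public void rewind() {
    cursor = 0; // restart from the first data row for the second pass
  }
}
```

All values stay as strings here; a real reader would also convert each value to the data type given by the field
spec, and would handle the time column as described in the contracts.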

