Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2019/10/21 23:16:54 UTC

[GitHub] [incubator-pinot] kishoreg opened a new issue #4731: Simplify Pinot data ingestion job

URL: https://github.com/apache/incubator-pinot/issues/4731
 
 
   Getting data into Pinot is non-trivial. Most users end up writing a custom Hadoop or Spark job for data ingestion, and there is no standard way to write these jobs. The goal of this issue is to come up with a design that makes it easy to ingest data into Pinot.
   
   Before getting to the solution, let's look at all the variables:
   1. **Formats**: JSON, CSV, Avro, Parquet, Thrift, Protobuf, ORC
   2. **Batch Datasources**: HDFS, S3, ADLS, GCS
   3. **Streaming Datasources**: Kafka, EventHub
   4. **Execution Frameworks**: Hadoop, Spark, Flink, Samza, etc.
   
   As of now, we have the `RecordReader` interface to ingest batch data:
   ```
     GenericRow next(GenericRow reuse) throws IOException;
   ```
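
   To make the `reuse` parameter concrete, here is a minimal, self-contained sketch of the pattern. It does not use Pinot's classes: `Row` is a hypothetical stand-in for `GenericRow`, and `CsvLineReader` with its `hasNext()` method is an assumption made for illustration. Only the shape of `next(reuse)` is taken from the signature above.

   ```java
   import java.io.BufferedReader;
   import java.io.IOException;
   import java.io.StringReader;
   import java.util.HashMap;
   import java.util.Map;

   final class RecordReaderSketch {
     // Hypothetical stand-in for GenericRow: a mutable map of field name to value.
     static final class Row {
       private final Map<String, Object> fields = new HashMap<>();
       void putValue(String name, Object value) { fields.put(name, value); }
       Object getValue(String name) { return fields.get(name); }
       void clear() { fields.clear(); }
     }

     // Hypothetical reader over "name,count" lines; not a Pinot class.
     static final class CsvLineReader {
       private final BufferedReader reader;
       private String nextLine;

       CsvLineReader(String csv) throws IOException {
         reader = new BufferedReader(new StringReader(csv));
         nextLine = reader.readLine();
       }

       // Assumed helper so the caller knows when to stop.
       boolean hasNext() { return nextLine != null; }

       // Same shape as the signature above: fill the caller's row, do not allocate a new one.
       Row next(Row reuse) throws IOException {
         String[] parts = nextLine.split(",");
         reuse.clear();
         reuse.putValue("name", parts[0]);
         reuse.putValue("count", Integer.parseInt(parts[1]));
         nextLine = reader.readLine();
         return reuse;
       }
     }

     public static void main(String[] args) throws IOException {
       CsvLineReader reader = new CsvLineReader("a,1\nb,2");
       Row row = new Row();
       while (reader.hasNext()) {
         row = reader.next(row); // one Row instance serves every record
         System.out.println(row.getValue("name") + " -> " + row.getValue("count"));
       }
     }
   }
   ```

   The point of the sketch is that the caller owns a single row object and the reader overwrites it per record, which keeps allocation constant regardless of input size.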
   
   Real-time ingestion uses the `StreamMessageDecoder` interface:
   ```
     GenericRow decode(T payload, GenericRow reuse);
   ```
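
   As a rough illustration of the `decode` shape, the sketch below implements a decoder for one payload type. Everything in it is hypothetical: `Row` stands in for `GenericRow`, `MessageDecoder<T>` only mirrors the method shown above, and the `key=value;key=value` wire format is invented for the example.

   ```java
   import java.nio.charset.StandardCharsets;
   import java.util.HashMap;
   import java.util.Map;

   final class StreamDecoderSketch {
     // Hypothetical stand-in for GenericRow.
     static final class Row {
       private final Map<String, Object> fields = new HashMap<>();
       void putValue(String name, Object value) { fields.put(name, value); }
       Object getValue(String name) { return fields.get(name); }
       void clear() { fields.clear(); }
     }

     // Mirrors the shape of decode(T payload, GenericRow reuse); the interface name is an assumption.
     interface MessageDecoder<T> {
       Row decode(T payload, Row reuse);
     }

     // Hypothetical decoder for byte[] payloads carrying UTF-8 "key=value" pairs separated by ';'.
     static final class KeyValueBytesDecoder implements MessageDecoder<byte[]> {
       @Override
       public Row decode(byte[] payload, Row reuse) {
         reuse.clear();
         String text = new String(payload, StandardCharsets.UTF_8);
         for (String pair : text.split(";")) {
           int eq = pair.indexOf('=');
           if (eq > 0) { // skip fragments without a key
             reuse.putValue(pair.substring(0, eq), pair.substring(eq + 1));
           }
         }
         return reuse;
       }
     }

     public static void main(String[] args) {
       MessageDecoder<byte[]> decoder = new KeyValueBytesDecoder();
       Row row = new Row();
       byte[] message = "user=alice;page=home".getBytes(StandardCharsets.UTF_8);
       decoder.decode(message, row);
       System.out.println(row.getValue("user") + " viewed " + row.getValue("page"));
     }
   }
   ```

   In this sketch the type parameter `T` fixes the payload type the decoder accepts, and the method body fixes the data format, so a single class ends up committed to both at once.
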
   The `RecordReader` implementations are spread all over the pinot-core code, which brings in a lot of external dependencies. We have moved pinot-parquet and pinot-orc to separate packages.
   
   For real-time, we need one implementation per combination of datasource and data format, e.g. `KafkaAvroMessageDecoder`.
   
   Any thoughts on how to standardize Data Ingestion in Pinot?
   
   
   
   
