Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2021/02/01 19:02:10 UTC
[GitHub] [incubator-pinot] jackjlli commented on pull request #6479: Support data ingestion for generating offline segment in one pass

jackjlli commented on pull request #6479:
URL: https://github.com/apache/incubator-pinot/pull/6479#issuecomment-771083123


   > > Thanks for the details @jackjlli, could you also explain whether IntermediateSegment is better than the existing MutableSegment? For example, you could stream input data to MutableSegment and flush it as needed. This also solves multiple problems:
   > > 
   > > * Common code base for offline and RT segment generation (at least for the streaming part).
   > > * Sorting can now be done for offline within SegmentGeneration, instead of requiring users to do so explicitly.
   > > * Auto segment sizing that happens in RT can also be done with offline now.
   > > 
   > > Thoughts @jackjlli @Jackie-Jiang?
   > 
   > I think this is a good idea to explore, but I suspect memory utilization on the offline side may go up significantly.
   > 
   > Also, the auto-segment sizing in realtime is implemented (in the controller) by learning the history of segments already completed. For offline generation, if we can keep a history or some learning mechanism, then it may be possible to implement approximate segment sizing algorithms -- whether we use MutableSegment to build segments or not.
   
   1. Yes, memory utilization would go up significantly, which is why I used `IntermediateSegment` rather than `MutableSegment` as the intermediate container here. The two share a minimal common piece of logic: both have a forward index. The difference is that `MutableSegment` also builds all applicable indices (inverted index, text index, etc.) for querying purposes, whereas `IntermediateSegment` keeps only the minimal components, such as the dictionary.
   2. Also, if we want partitioning or sorting, those steps can be done on the platform (such as MapReduce or Spark) before converting the raw data. In fact, we already have that logic at LinkedIn. Once this PR is committed, we can consider open-sourcing that Spark code as well.
   3. Auto-segment sizing is a good idea: historical data can be used to predict the cardinality or buffer size. However, offline segment generation is not always done on the same machine, so the historical data would be meaningless if it cannot be reused. If the historical data comes from the controller, then all the worker machines have to query the Pinot controller simultaneously to get it, which could send a huge number of queries to the controller. That's why I didn't include it in this PR. We can always add it to `IntermediateSegment` in the future, since the structures of `IntermediateSegment` and `MutableSegment` are pretty much the same.
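
To make point 1 concrete, here is a minimal sketch of a column that keeps only a dictionary and a forward index, with no inverted or text index. The class `DictionaryEncodedColumn` and its methods are hypothetical names chosen for illustration; this is not the actual `IntermediateSegment` code in this PR, only the general dictionary-encoding idea it relies on.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical illustration only; not Pinot's IntermediateSegment.
 * Holds the two minimal structures mentioned above for one column:
 * a dictionary (value <-> dictId) and a forward index (docId -> dictId).
 */
final class DictionaryEncodedColumn {
    // Dictionary: maps each distinct value to a dense integer id.
    private final Map<String, Integer> valueToId = new HashMap<>();
    // Reverse lookup for the dictionary: dictId -> value.
    private final List<String> idToValue = new ArrayList<>();
    // Forward index: position is the docId, element is the dictId.
    private final List<Integer> forwardIndex = new ArrayList<>();

    /** Appends one row, assigning a new dictId on first sight of a value. */
    void add(String value) {
        Integer dictId = valueToId.get(value);
        if (dictId == null) {
            dictId = idToValue.size();
            valueToId.put(value, dictId);
            idToValue.add(value);
        }
        forwardIndex.add(dictId);
    }

    /** Reads a row back through the forward index and the dictionary. */
    String get(int docId) {
        return idToValue.get(forwardIndex.get(docId));
    }

    int numDocs() {
        return forwardIndex.size();
    }

    int cardinality() {
        return idToValue.size();
    }
}
```

Because no per-value posting lists are maintained, memory grows with the number of rows plus the number of distinct values. A container that also served queries would need additional structures on top of this, which is the extra cost described in point 1.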
   
   All three points above are really good features, but they would be too much for a single PR. It'd be good to leave room for them and pick them up in follow-up PRs if applicable.

