You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2020/07/22 18:20:51 UTC

[GitHub] [incubator-pinot] Jackie-Jiang commented on pull request #5712: Add startReplaceSegments, endReplaceSegments controller API

Jackie-Jiang commented on pull request #5712:
URL: https://github.com/apache/incubator-pinot/pull/5712#issuecomment-662610355


   > > > > > > > > I think it is prudent to add an extra query parameter to the controller API for start/end batchupload. I suggest "operation" and make it string type. In this case, "Operation" will be set to "MERGE"
   > > > > > > > > This way, we don't assume that if source segments is null, the the operation MUST be new segment upload. Let us specify the operation clearly, and then it gets easier to evolve the API over other operations we may need.
   > > > > > > > 
   > > > > > > > 
   > > > > > > > @mcvsubbu We want to build primitive operations and we can use them to achieve multiple purpose such as batch upload, batch replace and version control. I don't think we should associate this primitive operation with any specific operation such as `MERGE` or `NEW_UPLOAD`. The semantic of this primitive operation is quite clear: replace the segments in `segmentsFrom` with segments in `segmentsTo` (we might want to rename the parameters if that makes the semantic more clear). @snleee Thoughts?
   > > > > > > 
   > > > > > > 
   > > > > > > If we know that this API will never be used for anything other than merge, then it is ok as defined. In that case, I would rename the API to say startBatchForMerge() or something like that. That way, if we want to introduce a startBatch for upload, we can do so.
   > > > > > > But then we know that this API will perhaps be re-used for batch upload of segments. Since we are reasonably sure that will be the case, but we don't know exactly what arguments it will take when we re-use it for another operation, it is best to specify the operation name we want.
   > > > > > > that way, even if some other set of arguments need to be added, or some null assimptions do not hold true, we have the operation very clearly specified. Yes, we can always add operaton later on, and say that if "operation" is not present, then treat it as merge, but I think it is clearner to specify that now.
   > > > > > 
   > > > > > 
   > > > > > This API will be used for merge, batch upload, batch replace, but I don't want to associate the API with certain type of operation because the semantic of this API is very clear and is independent of the actual operation. It will replace the segments in `segmentsFrom` with segments in `segmentsTo` atomically, and this will hold for all operations.
   > > > > 
   > > > > 
   > > > > If thats the case, then I agree with Subbu. It's better to name this primitive as ReplaceSegments or SwapSegments. Then all higher-level API can leverage this. WDYT?
   > > > 
   > > > 
   > > > The `startBatchUpload` and `endBatchUpload` are the primitives we want to provide. I'm okay renaming them to something like `startReplacingSegments` and `endReplacingSegments`, but I don't want to make the primitives associate with any specific operation (taking an extra query param of "operation") because the behavior should be the same for all operations
   > > 
   > > 
   > > I think that the confusion may come from the naming, `batchUpload`. There was a different design for batch upload protocol a while ago that was aiming to reduce the data inconsistency during the segment push (this design was about using batchId for upload, keep all segments under batchId, and update metadata at once in the end)
   > > The current design is much simpler. This API is essentially a primitive that replaces m segments with n segments atomically. I like the `startReplaceSegments`/ `endReplaceSegments` suggestion because it directly indicates what the API is trying to do.
   > > @mcvsubbu I put some thoughts on this and it's likely that we won't change parameters for these APIs when we implement other operations (versioning/batch replacement/merge). Rather, those operations will use this replacement primitive API as a part of the step. In that sense, we don't need to associate higher level operation to the lower level primitive.
   > > e.g.
   > > Segment Merge -> minion will call these APIs when replacing old segments with merged segments.
   > > Batch Upload/Batch Replacement(backfill) -> Pinot build and push job will call these APIs to upload/backfill segments
   > 
   > When you say "likely" that means we dont have a clear idea. The parameters may change in some way.
   > 
   > If you want to look at how complex coding gets when the same API is used for multiple operations, you don't have to look far. Just look at `/segments` It accommodates so many different options in the same API: refresh vs add, uri vs actual segment, encryption at source/sink , encryption scheme by source/tableschema ....
   > Now imagine if you were to write code for the API and you knew what exactly the intention of the API invoker was right in the beginning. Would the code not be simpler?
   > My argument is simple. If you are using this for multiple operations, and the invoker knows which operation they are going to perform, let the invoker specify the operation. We then have the flexibility to keep the code simple, or support a generic code that handles all possible operations. as in : if (operation == merge) mergeSegments(); else if (operation == batchUpload()) batchUpload(); else throw IllegalOperation()
   
   I would argue another way around, where API should be easy to use, instead of asking the user to understand all the supported operations in order to use it.
   Other than that, I don't see the behavior difference for the primitive for merge vs batch-upload because the semantic of the primitive is just replacing the segments, and it does not matter what is the purpose of using the primitive. For an example, when you use a lock, you wouldn't pass in an extra parameter stating that the lock is for preventing deadlock or solving race conditions.
   Even if we need to add the operation (I don't see that happen), it is always easy to add extra query parameters, but removing them is impossible without backward-incompatible change.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org