Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2021/06/01 16:48:25 UTC

[GitHub] [incubator-pinot] mapshen opened a new issue #7004: Preserve Kafka Message Metadata in Pinot Tables

mapshen opened a new issue #7004:
URL: https://github.com/apache/incubator-pinot/issues/7004


   Opening this issue as per the conversation in Slack.
   
   When streaming from Kafka, Pinot currently lacks a way for users to uniquely identify messages. We could achieve this by exposing Kafka message metadata such as the offset and timestamp.
   
   The approach suggested by @kishoreg is 
   
   > What we need is Kafka decoder to read this metadata and add it to generic row.
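
A minimal sketch of that idea, written in Python rather than Pinot's Java. It assumes a hypothetical `StreamMessage` record and uses a plain dict as a stand-in for Pinot's generic row; none of these names are Pinot APIs, and the `$offset` / `$timestamp` column names are placeholders. The decoder parses the payload and then copies the message metadata into the row:

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamMessage:
    """Hypothetical stand-in for a Kafka record: payload plus metadata."""
    value: bytes
    offset: int
    timestamp_ms: int


def decode_with_metadata(message: StreamMessage) -> dict:
    """Decode a JSON payload and attach the message metadata to the row."""
    row = json.loads(message.value.decode("utf-8"))
    # Placeholder column names (assumption, not an agreed convention).
    row["$offset"] = message.offset
    row["$timestamp"] = message.timestamp_ms
    return row


msg = StreamMessage(value=b'{"product": "A", "count": 3}', offset=42, timestamp_ms=1622566105000)
row = decode_with_metadata(msg)
print(row)
```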


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

[GitHub] [incubator-pinot] mcvsubbu commented on issue #7004: Preserver Kafka Message Metadata in Pinot Tables

Posted by GitBox <gi...@apache.org>.

mcvsubbu commented on issue #7004:
URL: https://github.com/apache/incubator-pinot/issues/7004#issuecomment-884336631

There are many other methods to get the Kafka (or any stream) lag, so I don't consider that a big plus. In fact, we are experimenting with something for which we will put out a PR soon.

I am clearly not a +1000, but I am curious what others in the PPMC (or even the community) think about adding a feature like this. If implemented, it has to be done very carefully to make sure no extra garbage is generated when the feature is turned off (needless to say, it has to be a configurable feature).

I am not sure I fully understand the UDF/real column that you propose, but it sounds promising, and should be looked at. Are you proposing to use a special decoder? We can extend the stream interface to include such UDFs, so that the underlying implementation may decide to populate such columns using such a decoder.

Another thing to consider is whether this can be done sparsely (i.e. sample every so many records).
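
One way to picture the sparse option, as a sketch with hypothetical names (`MetadataSampler`, `sample_every`) rather than any existing Pinot setting: the offset is attached only to every Nth record, and a value of 0 turns the feature off so that rows pass through untouched.

```python
class MetadataSampler:
    """Attach the stream offset to every `sample_every`-th row; 0 disables it."""

    def __init__(self, sample_every: int) -> None:
        if sample_every < 0:
            raise ValueError("sample_every must be >= 0")
        self.sample_every = sample_every
        self._seen = 0

    def maybe_attach(self, row: dict, offset: int) -> dict:
        if self.sample_every == 0:
            return row  # feature off: the row is returned as-is
        self._seen += 1
        if self._seen % self.sample_every == 0:
            row["$offset"] = offset  # placeholder column name (assumption)
        return row


sampler = MetadataSampler(sample_every=3)
rows = [sampler.maybe_attach({"product": "A"}, offset) for offset in range(100, 106)]
sampled_offsets = [r["$offset"] for r in rows if "$offset" in r]

disabled = MetadataSampler(sample_every=0)
untouched = disabled.maybe_attach({"product": "A"}, 100)
```

The trade-off is that sampled offsets can show roughly where a consumer is, but they cannot identify every row.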


[GitHub] [incubator-pinot] mcvsubbu commented on issue #7004: Preserve Kafka Message Metadata in Pinot Tables

Posted by GitBox <gi...@apache.org>.

mcvsubbu commented on issue #7004:
URL: https://github.com/apache/incubator-pinot/issues/7004#issuecomment-885026477


   @mapshen can you please provide a short doc describing your use case? (not just the queries, but the problem you are trying to solve), thanks. That may help us decide.



[GitHub] [incubator-pinot] npawar commented on issue #7004: Preserver Kafka Message Metadata in Pinot Tables

Posted by GitBox <gi...@apache.org>.

npawar commented on issue #7004:
URL: https://github.com/apache/incubator-pinot/issues/7004#issuecomment-884367275


   I also think this will be very useful, especially for debugging.
   Several times we see users who think the consumers are lagging, or stuck. Being able to run a quick query with $offsetID and $segment would be very helpful. I know there are other ways to get offsets, but often when users are starting out, metrics are not being collected anywhere and many times logs get expired/deleted in their environments. 
   $offset seems like a nice extension to $segmentName, $docId and $hostName, and will give us the complete picture in the query console.



[GitHub] [incubator-pinot] mcvsubbu commented on issue #7004: Preserver Kafka Message Metadata in Pinot Tables

Posted by GitBox <gi...@apache.org>.

mcvsubbu commented on issue #7004:
URL: https://github.com/apache/incubator-pinot/issues/7004#issuecomment-882676889








[GitHub] [incubator-pinot] mayankshriv edited a comment on issue #7004: Preserve Kafka Message Metadata in Pinot Tables

Posted by GitBox <gi...@apache.org>.

mayankshriv edited a comment on issue #7004:
URL: https://github.com/apache/incubator-pinot/issues/7004#issuecomment-884357812


   I am +1 on the usefulness of this feature. From reading the discussion, it seems the sticking point is what the implementation would look like, and whether it can be done in a way that has minimal to no impact on use cases that don't need or want this feature. Could we decouple the two concerns here, and resolve them in sequence:
   
   - Is this feature useful? Seems like there are multiple +1's already.
   - Can this be implemented in a way that has minimal to no impact on use cases that don't want to use the feature? From what I gather, having a separate decoder that gets plugged in only when feature is enabled could be a way to go.
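
The pluggable-decoder option can be sketched as follows. This is an illustration only: the config key `preserveStreamMetadata` and the function names are hypothetical, not Pinot settings or APIs. The decoder is chosen once from the table config, so the default path never touches the offset or creates an extra column.

```python
import json
from typing import Callable

# A decoder takes the raw payload and the message offset and returns a row.
Decoder = Callable[[bytes, int], dict]


def plain_decoder(payload: bytes, offset: int) -> dict:
    """Default path: the offset is ignored and no extra column is created."""
    return json.loads(payload)


def metadata_decoder(payload: bytes, offset: int) -> dict:
    """Opt-in path: the offset is stored in a placeholder `$offset` column."""
    row = json.loads(payload)
    row["$offset"] = offset
    return row


def select_decoder(table_config: dict) -> Decoder:
    """Pick the decoder once, at setup time, instead of checking a flag per message."""
    if table_config.get("preserveStreamMetadata", False):
        return metadata_decoder
    return plain_decoder


decode = select_decoder({"preserveStreamMetadata": True})
enabled_row = decode(b'{"product": "A"}', 7)
default_row = select_decoder({})(b'{"product": "A"}', 7)
```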
   
   Thoughts?



[GitHub] [incubator-pinot] amrishlal commented on issue #7004: Preserve Kafka Message Metadata in Pinot Tables

Posted by GitBox <gi...@apache.org>.

amrishlal commented on issue #7004:
URL: https://github.com/apache/incubator-pinot/issues/7004#issuecomment-884523925


   I am a bit confused by these two statements:
   > When streaming from Kafka, Pinot currently lacks a way for users to uniquely identify messages
   
   and
   
   > where offset > last_recorded_offset
   
   It seems that in the first case you are looking for a globally unique identifier for each row. I am assuming this would involve something like a UUID generator that tacks a UUID onto each row that is ingested (?). In the second case, it seems you are looking for a "rowid" with the additional criterion that it should be monotonically increasing and comparable.
   
   I am not quite sure it is possible to do both with a reasonable amount of effort (i.e., generate a globally unique identifier that is monotonically increasing and hence also comparable across all rows of all segments), especially when one considers that we commonly replace segments and also do update operations such as UPSERT. Unless I am missing something, maybe it could be done with a cluster-wide ID generation service in Pinot (?). The first (UUID generation) can probably be done now at ingestion time using an ingestion transform function (?). The second looks very difficult to implement and get right (?).
   
   I think we need more clarity on what exactly is being implemented here: 1) dynamically generated ROWID over resultset only (for supporting cursors), 2) a column that will identify each row with a globally unique identifier (useful for partitioning, indexing, etc), 3) ROWID generated for each row at row creation time that is globally unique and comparable across all rows and all segments and that can be kept up to date with operations such as segment replacement, UPSERT, etc?
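
For context on the Kafka side: an offset is unique and increasing only within a single topic partition, so the bare offset stops being a global row identifier once a topic has more than one partition. A small sketch with made-up records, in which the pair (partition, offset) is unique while the offset alone is not:

```python
# Made-up (partition, offset) pairs for a two-partition topic.
records = [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1)]

# The bare offset repeats across partitions...
offsets_only = [offset for _, offset in records]
offsets_unique = len(set(offsets_only)) == len(offsets_only)

# ...while the (partition, offset) pair identifies each record.
pairs_unique = len(set(records)) == len(records)

# Within one partition, offsets are strictly increasing and therefore comparable.
partition_0 = [offset for partition, offset in records if partition == 0]
partition_0_increasing = all(a < b for a, b in zip(partition_0, partition_0[1:]))
```

With a single-partition topic the offset alone is enough; otherwise a comparison such as `offset > last_recorded_offset` is meaningful only per partition.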



[GitHub] [incubator-pinot] mayankshriv commented on issue #7004: Preserver Kafka Message Metadata in Pinot Tables

Posted by GitBox <gi...@apache.org>.

mayankshriv commented on issue #7004:
URL: https://github.com/apache/incubator-pinot/issues/7004#issuecomment-884357812


   I am +1 on the usefulness of this feature. From reading the discussion, it seems perhaps the gap is in what the implementation would look like, and whether it can be done in a way where there is minimal to no impact for use-cases that don't need/want this feature. Could we decouple the two concerns here, and resolve them in sequence:
   
   - Is this feature useful? Seems like there are multiple +1's already.
   - Can this be implemented in a way that has minimal to no impact on use cases that don't want to use the feature? From what I gather, having a separate decoder that gets plugged in only when feature is enabled could be a way to go.
   
   Thoughts?



[GitHub] [incubator-pinot] sajjad-moradi commented on issue #7004: Preserver Kafka Message Metadata in Pinot Tables

Posted by GitBox <gi...@apache.org>.

sajjad-moradi commented on issue #7004:
URL: https://github.com/apache/incubator-pinot/issues/7004#issuecomment-882736913


   @mapshen Can you elaborate a bit on why this use case is needed? As Subbu mentioned, it may be easier to understand if you provide queries for this use case.



[GitHub] [incubator-pinot] mapshen edited a comment on issue #7004: Preserver Kafka Message Metadata in Pinot Tables

Posted by GitBox <gi...@apache.org>.

mapshen edited a comment on issue #7004:
URL: https://github.com/apache/incubator-pinot/issues/7004#issuecomment-882858547


   Sure. Say we have an inventory table of `products` ingested from a single-partition topic, and, to begin with, a query like
   ```
   select count, offset from inventory where product = 'A' limit 1
   ```
    is executed, from which I record the offset.
   
   Next time a query is issued, if I don't want to reprocess, or even fetch, the rows I have already seen, and only care about the latest change if any, I would do
   ```
   select count, offset  from inventory where offset > <last_recorded_offset> and product = 'A' limit 1
   ```
   If I don't want to miss any messages/rows in between since the last fetch, I could do 
   ```
   select count, offset from inventory where offset > <last_recorded_offset> and product = 'A'
   ```
   From here, I would record the new offset again (if any) and repeat.
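
The record-and-repeat loop above can be sketched on the client side. This is an illustration under stated assumptions: an in-memory list stands in for the `inventory` table, the `offset` field stands in for the proposed column, and `fetch_since` and `poll` are hypothetical helpers, not a Pinot client API. Because the queries above have no `order by`, the sketch sorts by offset itself before recording the latest one.

```python
# In-memory stand-in for the `inventory` table; each row carries its stream offset.
inventory = [
    {"product": "A", "count": 5, "offset": 10},
    {"product": "B", "count": 2, "offset": 11},
    {"product": "A", "count": 4, "offset": 12},
    {"product": "A", "count": 3, "offset": 13},
]


def fetch_since(rows, product, last_offset):
    """Emulate: select count, offset from inventory where offset > <last_offset> and product = <product>."""
    return [r for r in rows if r["product"] == product and r["offset"] > last_offset]


def poll(rows, product, last_offset):
    """Return the unseen rows and the offset to record for the next poll."""
    new_rows = sorted(fetch_since(rows, product, last_offset), key=lambda r: r["offset"])
    next_offset = new_rows[-1]["offset"] if new_rows else last_offset
    return new_rows, next_offset


# First poll after having recorded offset 10; the second poll finds nothing new.
first_batch, recorded = poll(inventory, "A", last_offset=10)
second_batch, recorded_again = poll(inventory, "A", last_offset=recorded)
```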
   
   There was more context in the original thread on Slack, but I couldn't find it anymore. Seems all messages prior to 06/16 got purged (?).
   



[GitHub] [incubator-pinot] mapshen edited a comment on issue #7004: Preserver Kafka Message Metadata in Pinot Tables

Posted by GitBox <gi...@apache.org>.

mapshen edited a comment on issue #7004:
URL: https://github.com/apache/incubator-pinot/issues/7004#issuecomment-882704506


   @mcvsubbu Okay, let me formalize the use case here:
   > As a user, I would like to have a way to uniquely identify a row/message between two queries. It would be really helpful when:
   > * I don't want to reprocess, or even fetch, the rows I have already seen
   > * I don't want to miss any messages/rows in between since the last fetch
   
   We talked over Slack about it if you recall :)
   
   @suddendust, appreciate your offering help. My impression was actually that we agreed to add this in principle but I just couldn't find time to write up a proposal on how to approach this. Perhaps you can start with a proposal?
   
   



[GitHub] [incubator-pinot] mapshen edited a comment on issue #7004: Preserver Kafka Message Metadata in Pinot Tables

Posted by GitBox <gi...@apache.org>.

mapshen edited a comment on issue #7004:
URL: https://github.com/apache/incubator-pinot/issues/7004#issuecomment-882858547


   Sure. Say we have an inventory table of products and, to begin with, a query like
   ```
   select count, offset from inventory where product = 'A' limit 1
   ```
    is executed, from which I record the offset.
   
   Next time a query is issued, if I don't want to reprocess, or even fetch, the rows I have already seen, and only care about the latest change if any, I would do
   ```
   select count, offset  from inventory where offset > last_recorded_offset and product = 'A' limit 1
   ```
   If I don't want to miss any messages/rows in between since the last fetch, I could do 
   ```
   select count, offset from inventory where offset > last_recorded_offset and product = 'A'
   ```
   From here, I would record the new offset again (if any) and repeat.
   
   There was more context in the original thread on Slack, but I couldn't find it anymore. Seems all messages prior to 06/16 got purged (?).
   



[GitHub] [incubator-pinot] mcvsubbu commented on issue #7004: Preserve Kafka Message Metadata in Pinot Tables

Posted by GitBox <gi...@apache.org>.

mcvsubbu commented on issue #7004:
URL: https://github.com/apache/incubator-pinot/issues/7004#issuecomment-885157345


   Thanks, @mapshen . Can you share a few lines on what that application does? (e.g. does it do debugging?)



[GitHub] [incubator-pinot] mapshen commented on issue #7004: Preserver Kafka Message Metadata in Pinot Tables

Posted by GitBox <gi...@apache.org>.

mapshen commented on issue #7004:
URL: https://github.com/apache/incubator-pinot/issues/7004#issuecomment-853316119


   @kishoreg @npawar Supposedly the offset can be added to a decoded row by modifying [1]. I have a few questions regarding the implementation details:
   
   * What would be a good column name that follows our convention? `$offset` or `offset`?
   * Should we make it configurable or always present like the virtual columns?
   * Where should we define its data type when it is indexed [2]? We do this for other columns with a schema.
   * What index should we use for it? Raw value or sorted forward index?
   
   [1] https://github.com/apache/incubator-pinot/blob/master/pinot-plugins/pinot-stream-ingestion/pinot-kafka-base/src/main/java/org/apache/pinot/plugin/stream/kafka/KafkaJSONMessageDecoder.java#L62
   [2] https://github.com/apache/incubator-pinot/blob/master/pinot-core/src/main/java/org/apache/pinot/core/data/manager/realtime/LLRealtimeSegmentDataManager.java#L516



[GitHub] [incubator-pinot] mapshen commented on issue #7004: Preserver Kafka Message Metadata in Pinot Tables

Posted by GitBox <gi...@apache.org>.

mapshen commented on issue #7004:
URL: https://github.com/apache/incubator-pinot/issues/7004#issuecomment-882858547


   Sure. Say we have an inventory table of products and, to begin with, a query like
   ```
   select count, offset from inventory where product = 'A' limit 1
   ```
    is executed, from which I record the offset.
   
   Next time a query is issued, if I don't want to reprocess, or even fetch, the rows I have already seen, and only care about the latest change if any, I would do
   ```
   select count, offset from inventory where offset > last_recorded_offset and product = 'A' limit 1
   ```
   If I don't want to miss any messages/rows in between since the last fetch, I could do 
   ```
   select count, offset from inventory where offset > last_recorded_offset and product = 'A'
   ```
   From here, I would record the new offset again (if any) and repeat.
   
   There was more context in the original thread on Slack, but I couldn't find it anymore. Seems all messages prior to 06/16 got purged.
   



[GitHub] [incubator-pinot] mcvsubbu commented on issue #7004: Preserver Kafka Message Metadata in Pinot Tables

Posted by GitBox <gi...@apache.org>.

mcvsubbu commented on issue #7004:
URL: https://github.com/apache/incubator-pinot/issues/7004#issuecomment-882714584


   Sorry, there was no discussion in the Slack channel after I got added to it. I saw the mention of a meeting and I was hoping that one would be called, but I did not get any invite. None of these requirements were mentioned in the Slack channel.
   
   What are your queries? (Are they like `select *`, and you want to know which new rows came in since you last executed the same query?)
   
   Pinot guarantees exactly-once consumption and guarantees not missing any row from the underlying stream (unless the underlying stream has a low retention). Since I see the Slack channel has a lot of discussion about the actual implementation, perhaps you already know about these guarantees, but I thought I would mention them anyway.



[GitHub] [incubator-pinot] yupeng9 commented on issue #7004: Preserver Kafka Message Metadata in Pinot Tables

Posted by GitBox <gi...@apache.org>.

yupeng9 commented on issue #7004:
URL: https://github.com/apache/incubator-pinot/issues/7004#issuecomment-884343517


   Then perhaps we can make these columns opt-in based on configurations? Based on my observations at Uber, a `_key` column would be useful, and for some topics the key is computed on the fly by deriving from multiple fields in the topic. I can also see the value of offset, especially for CDC use cases where events are sorted in the topic partition.



[GitHub] [incubator-pinot] mayankshriv commented on issue #7004: Preserve Kafka Message Metadata in Pinot Tables

Posted by GitBox <gi...@apache.org>.

mayankshriv commented on issue #7004:
URL: https://github.com/apache/incubator-pinot/issues/7004#issuecomment-884494065


   I still feel we are mixing two discussions, a) whether this is a good feature to implement b) What is the complexity/perf impact of implementing it.
   
   The way I see it, there are five +1's (Kishore, Yupeng, Neha, Map and me). With this, I'd rather flip the question and ask why we shouldn't do this. If the only reason is that it is complex and/or has perf impact, then I'd say we are past a) above and now discussing b). And for that we need a detailed design/PR, without which assessing the impact on s/w or perf is hard (for me at least).



[GitHub] [incubator-pinot] mcvsubbu commented on issue #7004: Preserver Kafka Message Metadata in Pinot Tables

Posted by GitBox <gi...@apache.org>.

mcvsubbu commented on issue #7004:
URL: https://github.com/apache/incubator-pinot/issues/7004#issuecomment-884345130


   > Then perhaps we can make these columns opt-in based on configurations? Based on my observations at Uber, a `_key` column would be useful, and for some topics the key is computed on the fly by deriving from multiple fields in the topic. I can also see the value of offset, especially for CDC use cases where events are sorted in the topic partition.
   
   Like I mentioned before, making these configurable goes without saying. We have production use cases that ingest at extremely high rate while still serving queries (also at a very high QPS). We cannot afford to configure this in.



[GitHub] [incubator-pinot] mapshen commented on issue #7004: Preserver Kafka Message Metadata in Pinot Tables

Posted by GitBox <gi...@apache.org>.

mapshen commented on issue #7004:
URL: https://github.com/apache/incubator-pinot/issues/7004#issuecomment-882704506







[GitHub] [incubator-pinot] mcvsubbu commented on issue #7004: Preserver Kafka Message Metadata in Pinot Tables

Posted by GitBox <gi...@apache.org>.

mcvsubbu commented on issue #7004:
URL: https://github.com/apache/incubator-pinot/issues/7004#issuecomment-884337918


   > +1
   > 
   > Other systems like Presto support this kind of feature. https://prestodb.io/docs/current/connector/kafka-tutorial.html
   > 
   > e.g.
   > ![image](https://user-images.githubusercontent.com/13425258/126526785-2cfaea4c-2699-4b75-a93a-2696404fcf28.png)
   > 
   > All columns with the underscore prefix are ingested by default. I also suggest we add partition id and (possibly key), which are useful for troubleshooting, particularly for upsert tables.
   
   Partition ID is already a part of segment name, and segment name is a virtual column already provided.
   (The value of) key is a property of the stream, so maybe like Kishore suggested, underlying stream can populate certain fields automatically.
   
   That aside, I do agree that other databases may provide such feature. Not all databases are low-latency OLAP, so we need to be careful about feature creep.
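
To illustrate the point that the partition ID can be recovered from the segment name, here is a minimal sketch. It assumes realtime segment names follow the layout `<table>__<partitionId>__<sequence>__<creationTime>`; the function name and the sample segment name are hypothetical, not taken from this thread.

```python
def partition_id_from_segment_name(segment_name: str) -> int:
    """Extract the partition ID from a realtime segment name.

    Assumed layout: <table>__<partitionId>__<sequence>__<creationTime>.
    Splitting from the right tolerates "__" inside the table name.
    """
    parts = segment_name.rsplit("__", 3)
    if len(parts) != 4:
        raise ValueError(f"unexpected segment name layout: {segment_name!r}")
    return int(parts[1])


# Hypothetical segment name, for illustration only.
print(partition_id_from_segment_name("inventory__3__17__20210601T1648Z"))
```

This covers the partition ID only; the per-message offset is not encoded in the segment name, which is the gap the issue is about.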



[GitHub] [incubator-pinot] mcvsubbu commented on issue #7004: Preserver Kafka Message Metadata in Pinot Tables

Posted by GitBox <gi...@apache.org>.

mcvsubbu commented on issue #7004:
URL: https://github.com/apache/incubator-pinot/issues/7004#issuecomment-854927322

+1 to keeping it virtual column
+1 to making it configurable, since it can be quite some overhead
+1 to keeping it a string; Pinot is agnostic to the stream underneath.
We should not be building indices on it. Please use a raw index. It is supported for consuming segments now.

In the case of Kinesis, we should (or may want to) also keep track of other metadata, like the partition IDs in the group during the time the segment was being consumed. I think these do not change (if they do, we close the segment), but @npawar or @KKcorps can comment on that.

Since this is stream dependent, I would make it a string that has (at the minimum) the StreamMsgOffset serialized, and also the partition group ID. Beyond that, each stream may add its own stuff.

Also, consider having a less verbose version of this by keeping the data common to the entire segment in the segment metadata (some of it is already in the ZK metadata). For Kafka, this could mean start/end offset, partition group ID, etc.
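
A minimal sketch of the string encoding suggested above: the serialized offset and the partition group ID at the minimum, plus optional stream-specific extras. The JSON layout, the field names (`offset`, `partitionGroupId`, `shardId`) and both function names are assumptions for illustration, not an existing Pinot format.

```python
import json


def encode_row_metadata(offset: str, partition_group_id: int, **extra: str) -> str:
    """Serialize per-row stream metadata into one compact string.

    The offset stays a string because not every stream uses numeric offsets.
    """
    fields = dict(extra)
    # Required fields are written last so extras cannot overwrite them.
    fields["offset"] = offset
    fields["partitionGroupId"] = partition_group_id
    return json.dumps(fields, sort_keys=True, separators=(",", ":"))


def decode_row_metadata(value: str) -> dict:
    """Inverse of encode_row_metadata."""
    return json.loads(value)


print(encode_row_metadata("42", 0))
print(encode_row_metadata("42", 0, shardId="shard-1"))
```

Storing one such string per row is the verbose variant; the segment-level alternative mentioned above would store the common fields once per segment.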


[GitHub] [incubator-pinot] mapshen edited a comment on issue #7004: Preserver Kafka Message Metadata in Pinot Tables

Posted by GitBox <gi...@apache.org>.

mapshen edited a comment on issue #7004:
URL: https://github.com/apache/incubator-pinot/issues/7004#issuecomment-882858547


   Sure. Say we have an inventory table of `products` ingested from a single-partition topic, and, to begin with, a query like
   ```
   select count, offset from inventory where product = 'A' order by offset desc limit 1
   ```
    is executed, from which I record the offset.
   
   Next time a query is issued, if I don't want to reprocess, or even fetch, the rows I have already seen, and only care about the latest change if any, I would do
   ```
   select count, offset  from inventory where offset > <last_recorded_offset> and product = 'A' order by offset desc limit 1
   ```
   If I don't want to miss any messages/rows in between since the last fetch, I could do 
   ```
   select count, offset from inventory where offset > <last_recorded_offset> and product = 'A' 
   ```
   From here, I would record the new offset again (if any) and repeat.
   
   There was more context in the original thread on Slack, but I couldn't find it anymore. Seems all messages prior to 06/16 got purged (?).
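
The polling pattern in this comment can be sketched with an in-memory stand-in for the table. The sample rows and the helper names `fetch_since` and `poll` are hypothetical; `fetch_since` only mimics the third query above (no `limit 1`) with an `order by offset` added, under the same single-partition assumption.

```python
# Hypothetical rows, as they might look if the Kafka offset were exposed.
ROWS = [
    {"product": "A", "count": 10, "offset": 0},
    {"product": "B", "count": 4, "offset": 1},
    {"product": "A", "count": 9, "offset": 2},
    {"product": "A", "count": 7, "offset": 3},
]


def fetch_since(rows, product, last_offset):
    """Stand-in for: select count, offset from inventory
    where offset > <last_offset> and product = '<product>' order by offset."""
    floor = -1 if last_offset is None else last_offset
    hits = [r for r in rows if r["product"] == product and r["offset"] > floor]
    return sorted(hits, key=lambda r: r["offset"])


def poll(rows, product, last_offset):
    """Fetch unseen rows and return them with the advanced cursor."""
    new_rows = fetch_since(rows, product, last_offset)
    if new_rows:
        last_offset = new_rows[-1]["offset"]
    return new_rows, last_offset


# First poll sees only the first three rows; the second sees all four.
first, cursor = poll(ROWS[:3], "A", None)
second, cursor = poll(ROWS, "A", cursor)
print([r["offset"] for r in first], [r["offset"] for r in second], cursor)
```

Because the cursor only advances when rows come back, an empty poll leaves it unchanged and nothing is skipped or reprocessed.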
   



[GitHub] [incubator-pinot] mapshen commented on issue #7004: Preserver Kafka Message Metadata in Pinot Tables

Posted by GitBox <gi...@apache.org>.

mapshen commented on issue #7004:
URL: https://github.com/apache/incubator-pinot/issues/7004#issuecomment-882903914


   Unfortunately no. We don't always only care about the latest. 
   
   Also, this UPSERT feature has its limitations as documented here [1], of which the required partition expansion is a showstopper already.
   
   On top of that, we need to break ties correctly and reliably when they happen, for which we cannot rely on the time column, as there might be two messages with the same timestamp and primary key but different offsets.
   
   
   [1] https://docs.google.com/document/d/1qljEMndPMxbbKtjlVn9mn2toz7Qrk0TGQsHLfI--7h8/edit#bookmark=kix.zg5cmgwqyx7e
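
The tie-breaking concern can be shown with a small sketch. The messages and both function names are hypothetical. With the timestamp alone, the "latest" message for a key depends on the order in which rows happen to arrive; adding the offset as a secondary sort key makes the choice deterministic.

```python
# Two messages for key "A" share timestamp 1000 but have different offsets.
messages = [
    {"key": "A", "timestamp": 1000, "offset": 41, "count": 5},
    {"key": "A", "timestamp": 1000, "offset": 42, "count": 3},
    {"key": "A", "timestamp": 999, "offset": 40, "count": 8},
]


def latest_by_timestamp(msgs):
    # max() returns the first maximal element, so ties follow input order.
    return max(msgs, key=lambda m: m["timestamp"])


def latest_by_timestamp_then_offset(msgs):
    # The offset breaks timestamp ties deterministically.
    return max(msgs, key=lambda m: (m["timestamp"], m["offset"]))


reordered = list(reversed(messages))
print(latest_by_timestamp(messages)["offset"], latest_by_timestamp(reordered)["offset"])
print(latest_by_timestamp_then_offset(messages)["offset"],
      latest_by_timestamp_then_offset(reordered)["offset"])
```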



[GitHub] [incubator-pinot] mcvsubbu commented on issue #7004: Preserver Kafka Message Metadata in Pinot Tables

Posted by GitBox <gi...@apache.org>.

mcvsubbu commented on issue #7004:
URL: https://github.com/apache/incubator-pinot/issues/7004#issuecomment-854975980


   @mapshen curious what your use case is for this feature. It adds significant overhead to realtime consuming memory. When we do have an error, we put out the offset that caused the error. I don't see the advantage of having the offset information for each row.



[GitHub] [incubator-pinot] mapshen edited a comment on issue #7004: Preserver Kafka Message Metadata in Pinot Tables

Posted by GitBox <gi...@apache.org>.

mapshen edited a comment on issue #7004:
URL: https://github.com/apache/incubator-pinot/issues/7004#issuecomment-882858547


   Sure. Say we have an inventory table of products and, to begin with, a query like
   ```
   select count, offset from inventory where product = 'A' limit 1
   ```
    is executed, from which I record the offset.
   
   Next time a query is issued, if I don't want to reprocess, or even fetch, the rows I have already seen, and only care about the latest change if any, I would do
   ```
   select count, offset  from inventory where offset > <last_recorded_offset> and product = 'A' limit 1
   ```
   If I don't want to miss any messages/rows in between since the last fetch, I could do 
   ```
   select count, offset from inventory where offset > <last_recorded_offset> and product = 'A'
   ```
   From here, I would record the new offset again (if any) and repeat.
   
   There was more context in the original thread on Slack, but I couldn't find it anymore. Seems all messages prior to 06/16 got purged (?).
   



[GitHub] [incubator-pinot] amrishlal edited a comment on issue #7004: Preserve Kafka Message Metadata in Pinot Tables

Posted by GitBox <gi...@apache.org>.

amrishlal edited a comment on issue #7004:
URL: https://github.com/apache/incubator-pinot/issues/7004#issuecomment-884523925


   I am a bit confused by these two statements:
   > When streaming from Kafka, Pinot currently lacks of a way to allow users to uniquely identify messages
   
   and
   
   > where offset > last_recorded_offset
   
   It seems like in the first case, you are looking for a globally unique identifier for each row. I am assuming this would involve something like a UUID generator that tacks a UUID onto each row that is ingested (?). In the second case, it seems like you are looking for a "rowid" with the additional criterion that it should be monotonically increasing and comparable.
   
   I am not quite sure it is possible to do both with a reasonable amount of effort (i.e., generate a globally unique identifier that is monotonically increasing and hence also comparable across all rows of all segments), especially when one considers that we commonly replace segments, generate segments offline, and also do some update operations such as UPSERT. Unless I am missing something, maybe it could be done with a cluster-wide ID generation service in Pinot (?). The first (UUID generation) can probably be done now at ingestion time using an ingestion transform function (?). The second looks very difficult to implement and get right (?).
   
   I think we need more clarity on what exactly is being implemented here: 1) dynamically generated ROWID over resultset only (for supporting cursors), 2) a column that will identify each row with a globally unique identifier (useful for partitioning, indexing, etc), 3) ROWID generated for each row at row creation time that is globally unique and comparable across all rows and all segments and that can be kept up to date with operations such as segment replacement, UPSERT, etc?



[GitHub] [incubator-pinot] mcvsubbu commented on issue #7004: Preserve Kafka Message Metadata in Pinot Tables

Posted by GitBox <gi...@apache.org>.

mcvsubbu commented on issue #7004:
URL: https://github.com/apache/incubator-pinot/issues/7004#issuecomment-884426206

I admit it all sounds and looks very nice. I am asking whether the complexity of implementing this is warranted or not.

@npawar Pinot servers output log messages every once in a while (every 5 minutes, I think) about consumption status. Even if older logs are deleted, the current log should show whether the server is consuming or not. Also, we have recently introduced several APIs at the controller level for debugging purposes. So I don't buy the argument that this feature is needed to provide better ability to debug.

@mayankshriv Assuming we don't count kishore's +1000 as many :-) , there seem to be three +1s on this feature. Can you provide your justification for why you think this feature will benefit more than the one use case (which I don't understand fully, I admit)?

@mapshen can you please provide a short doc describing your use case? (not just the queries, but the problem you are trying to solve), thanks. That may help us decide.


[GitHub] [incubator-pinot] mapshen edited a comment on issue #7004: Preserver Kafka Message Metadata in Pinot Tables

Posted by GitBox <gi...@apache.org>.

mapshen edited a comment on issue #7004:
URL: https://github.com/apache/incubator-pinot/issues/7004#issuecomment-882858547


   Sure. Say we have an inventory table of products and, to begin with, a query like
   ```
   select count, offset from inventory where product = 'A' limit 1
   ```
    is executed, from which I record the offset.
   
   Next time a query is issued, if I don't want to reprocess, or even fetch, the rows I have already seen, and only care about the latest change if any, I would do
   ```
   select count, offset from inventory where offset > last_recorded_offset and product = 'A' limit 1
   ```
   If I don't want to miss any messages/rows in between since the last fetch, I could do 
   ```
   select count, offset from inventory where offset > last_recorded_offset and product = 'A'
   ```
   From here, I would record the new offset again (if any) and repeat.
   
   There was more context in the original thread on Slack, but I couldn't find it anymore. Seems all messages prior to 06/16 got purged (?).
   



[GitHub] [incubator-pinot] mapshen commented on issue #7004: Preserve Kafka Message Metadata in Pinot Tables

Posted by GitBox <gi...@apache.org>.

mapshen commented on issue #7004:
URL: https://github.com/apache/incubator-pinot/issues/7004#issuecomment-885260828


   @mcvsubbu The output would be analytics/predictions consumed by the downstream, which can even be fed back to Pinot for visualization for example. The app is not for debugging purposes per se, but we can definitely rely on the metadata from Kafka to debug the app itself.



[GitHub] [incubator-pinot] suddendust commented on issue #7004: Preserver Kafka Message Metadata in Pinot Tables

Posted by GitBox <gi...@apache.org>.

suddendust commented on issue #7004:
URL: https://github.com/apache/incubator-pinot/issues/7004#issuecomment-882537398


   If this feature is a go, I can pick it up (it might take me a while to implement it tho as I am still navigating my way around Pinot. So hopefully it's not an urgent feature).



[GitHub] [incubator-pinot] mapshen commented on issue #7004: Preserve Kafka Message Metadata in Pinot Tables

Posted by GitBox <gi...@apache.org>.

mapshen commented on issue #7004:
URL: https://github.com/apache/incubator-pinot/issues/7004#issuecomment-885147162


   @mcvsubbu What @kishoreg said above is representative. Our use case is similar: an internal application polls Pinot at regular intervals to fetch updates since the last query and perform complicated calculations.
   
   @amrishlal I would argue that they are the same. Offsets are unique within each partition of a Kafka topic, so combined with the partition ID, an offset uniquely identifies a row in a Pinot table. You might also have noticed that the `where offset > last_recorded_offset` case assumes a single-partition topic.
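
To extend the single-partition case, a cursor over a multi-partition topic could track one offset per partition. The sketch below is hypothetical (sample rows, `rows_after_cursor` and `advance_cursor` are illustrative names); it also shows that offsets alone collide across partitions, while (partition, offset) pairs do not.

```python
# Hypothetical rows from a two-partition topic; offset 5 appears in both.
rows = [
    {"partition": 0, "offset": 5},
    {"partition": 0, "offset": 6},
    {"partition": 1, "offset": 5},
    {"partition": 1, "offset": 7},
]


def rows_after_cursor(rows, cursor):
    """Keep rows whose offset is beyond the last one seen in their partition."""
    return [r for r in rows if r["offset"] > cursor.get(r["partition"], -1)]


def advance_cursor(cursor, rows):
    """Return a new cursor with the highest offset seen per partition."""
    updated = dict(cursor)
    for r in rows:
        p = r["partition"]
        updated[p] = max(updated.get(p, -1), r["offset"])
    return updated


cursor = {0: 5, 1: 4}
fresh = rows_after_cursor(rows, cursor)
cursor = advance_cursor(cursor, fresh)
print(fresh, cursor)
```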
   



[GitHub] [incubator-pinot] yupeng9 commented on issue #7004: Preserver Kafka Message Metadata in Pinot Tables

Posted by GitBox <gi...@apache.org>.

yupeng9 commented on issue #7004:
URL: https://github.com/apache/incubator-pinot/issues/7004#issuecomment-884334196


   +1
   
   Other systems like Presto support this kind of feature. https://prestodb.io/docs/current/connector/kafka-tutorial.html
   
   e.g.
   ![image](https://user-images.githubusercontent.com/13425258/126526785-2cfaea4c-2699-4b75-a93a-2696404fcf28.png)
   
   All columns with the underscore prefix are ingested by default. I also suggest we add the partition ID and (possibly) the key, which are useful for troubleshooting, particularly for upsert tables.


[GitHub] [incubator-pinot] kishoreg commented on issue #7004: Preserver Kafka Message Metadata in Pinot Tables

Posted by GitBox <gi...@apache.org>.

kishoreg commented on issue #7004:
URL: https://github.com/apache/incubator-pinot/issues/7004#issuecomment-884302410


   @mcvsubbu, this is a very powerful feature. It is equivalent to a cursor API in databases. For example, we can use this in ThirdEye to analyze new events that have come in since the last time we ran anomaly detection.
   
   I am +1000 on this feature. Another side benefit is in debugging Kafka lag: Kafka provides an offset API, and we can compare its result with the offsets stored in Pinot to compute the lag.
   
   By the way, I am fine with this being an actual column instead of a virtual one. We can solve this by having a UDF pull the value from the header.
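
The lag comparison mentioned above is simple arithmetic once both numbers are available. This sketch assumes the end offset reported by Kafka for a partition is the offset of the next record to be written (last record's offset + 1), and that the highest ingested offset comes from a query against the proposed offset column. The function name `ingestion_lag` is hypothetical.

```python
def ingestion_lag(end_offset: int, max_ingested_offset: int) -> int:
    """Number of records in a partition that are not yet ingested.

    end_offset: offset of the next record Kafka will write (assumption).
    max_ingested_offset: highest offset visible in the table for that partition.
    """
    next_expected = max_ingested_offset + 1
    return max(0, end_offset - next_expected)


print(ingestion_lag(end_offset=100, max_ingested_offset=89))
```

For a multi-partition topic the same calculation would be done per partition and summed.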


[GitHub] [incubator-pinot] mcvsubbu commented on issue #7004: Preserver Kafka Message Metadata in Pinot Tables

Posted by GitBox <gi...@apache.org>.

mcvsubbu commented on issue #7004:
URL: https://github.com/apache/incubator-pinot/issues/7004#issuecomment-882676889


   @mapshen has not described a use case for this yet. For the reasons I mentioned before, I am reluctant to include this as a feature.


[GitHub] [incubator-pinot] npawar commented on issue #7004: Preserver Kafka Message Metadata in Pinot Tables

Posted by GitBox <gi...@apache.org>.

npawar commented on issue #7004:
URL: https://github.com/apache/incubator-pinot/issues/7004#issuecomment-854909817


   I like the idea of making it a virtual column, and having it be always present: “$offset”.
   
   Since we won't know the datatype of the offset (Kafka offsets are long; Kinesis sequence numbers exceed the long range and hence are strings), could we keep the offset datatype as string?
   This would hold the serialized value of whatever the stream's implementation of StreamPartitionMsgOffset is.
   
   A segment can have only one sorted index, so it won't be a good idea to give this column a sorted index. Do we need an index by default?
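
One consequence of storing the offset as a string is worth illustrating: a plain string comparison does not order numeric offsets correctly, so a filter like `offset > <last_recorded_offset>` would need a numeric interpretation. The sketch assumes offsets are serialized as decimal digit strings, which may not hold for every stream; the helper name and the long sample value are hypothetical.

```python
LONG_MAX = 2**63 - 1


def offset_gt(a: str, b: str) -> bool:
    """Compare two decimal-string offsets numerically."""
    return int(a) > int(b)


# Lexicographic comparison puts "9" after "10"; numeric comparison does not.
print("9" > "10", offset_gt("9", "10"))

# A made-up sequence number far beyond the range of a 64-bit long.
big = "49590338271490256608559692538361571095921575989136588898"
print(int(big) > LONG_MAX)
```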



[GitHub] [incubator-pinot] mapshen commented on issue #7004: Preserver Kafka Message Metadata in Pinot Tables

Posted by GitBox <gi...@apache.org>.

mapshen commented on issue #7004:
URL: https://github.com/apache/incubator-pinot/issues/7004#issuecomment-882704506


   @mcvsubbu Okay, let me formalize the use case here:
   > As a user, I would like to have a way to uniquely identify a row/message between two queries. It would be really helpful when:
   > * I don't want to reprocess, or even fetch, the rows I have already seen
   > * I don't want to miss any messages/rows in between since last fetch
   
   We talked over Slack about it if you recall :)
   
   @suddendust, I appreciate your offer to help. My impression was actually that we had agreed in principle to add this, but I just couldn't find time to write up a proposal on how to approach it.
   
   


[GitHub] [incubator-pinot] mcvsubbu commented on issue #7004: Preserver Kafka Message Metadata in Pinot Tables

Posted by GitBox <gi...@apache.org>.

mcvsubbu commented on issue #7004:
URL: https://github.com/apache/incubator-pinot/issues/7004#issuecomment-882874689


   Can you use the UPSERT feature, which updates the rows that have come in?

