Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2020/05/09 16:41:51 UTC

[GitHub] [pulsar] KannarFr opened a new issue #6930: Pulsar SQL: support user defined indexes

KannarFr opened a new issue #6930:
URL: https://github.com/apache/pulsar/issues/6930


   **Is your feature request related to a problem? Please describe.**
   Currently, no index is used when querying a topic with Presto. `__publish_time__` can be treated as an index because of the way ledgers are stored, but it is not a real one.
   
   **Describe the solution you'd like**
   The AvroSchema used to publish to a topic should come with an index definition. Could we then have a managed ledger for indexes that references the regular managed ledgers? We could then configure the Pulsar Presto implementation to use the user-defined indexes from the schema. (This is a suggestion to start the discussion; as @jerrypeng and I discussed, it is a large discussion to have.)
   
   **Describe alternatives you've considered**
   There are probably multiple ways to do it; feel free to suggest your point of view.
   
   **Additional context**
   Reduce the query runtime.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] pointearth commented on issue #6930: Pulsar SQL: support user defined indexes

Posted by GitBox <gi...@apache.org>.
pointearth commented on issue #6930:
URL: https://github.com/apache/pulsar/issues/6930#issuecomment-766553311


   I think user-defined indexes are very important for Pulsar SQL; they could be what really lets us use Pulsar as a database. I think they would make Pulsar more popular.
   And I agree that user-defined indexes should be defined separately. We can extend "pulsar-admin topics" to manage indexes: create, read, update, delete, and reindex them.
   Can we discuss this further and push it forward?





[GitHub] [pulsar] KannarFr edited a comment on issue #6930: Pulsar SQL: support user defined indexes

Posted by GitBox <gi...@apache.org>.
KannarFr edited a comment on issue #6930:
URL: https://github.com/apache/pulsar/issues/6930#issuecomment-800187333


   @pointearth @sijie How do you imagine the index being structured in relation to ledgers? For a first implementation, given that this is a pub/sub system, a timestamp-based index would be a good start. I think we should first be able to auto-create an index per topic, like:
   
   ```java
   Map<Date, LedgerId>
   ```
   Or maybe
   
   ```java
   Map<Date, IndexItem>
   
   // Index entry that links a ledger to its neighbors.
   class IndexItem { LedgerId previousLedgerId; LedgerId ledgerId; LedgerId nextLedgerId; }
   ```
   
   WDYT? Maybe we should point directly to the message rather than the ledger.
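
The timestamp-to-ledger map proposed above could be sketched roughly as follows. This is only an illustration, not Pulsar or BookKeeper code: the class `TimestampLedgerIndex` and its methods are hypothetical, publish times are plain epoch milliseconds instead of `Date`, ledger IDs are plain `long` values, and publish times are assumed to increase from one ledger to the next. With a sorted map, the previous and next ledgers can be derived from the neighboring keys, so they may not need to be stored in each entry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.OptionalLong;
import java.util.TreeMap;

// Hypothetical sketch, not Pulsar or BookKeeper code.
final class TimestampLedgerIndex {
    // Key: publish time (epoch millis) of a ledger's first message. Value: ledger ID.
    private final NavigableMap<Long, Long> ledgerByFirstPublishTime = new TreeMap<>();

    void addLedger(long firstPublishTimeMillis, long ledgerId) {
        ledgerByFirstPublishTime.put(firstPublishTimeMillis, ledgerId);
    }

    // Last ledger that starts at or before the timestamp, if there is one.
    OptionalLong ledgerFor(long publishTimeMillis) {
        Map.Entry<Long, Long> entry = ledgerByFirstPublishTime.floorEntry(publishTimeMillis);
        return entry == null ? OptionalLong.empty() : OptionalLong.of(entry.getValue());
    }

    // Ledgers that may hold messages published in [fromMillis, toMillis], oldest first.
    List<Long> ledgersBetween(long fromMillis, long toMillis) {
        if (fromMillis > toMillis) {
            return new ArrayList<>();
        }
        Long floor = ledgerByFirstPublishTime.floorKey(fromMillis);
        long lower = (floor == null) ? fromMillis : floor;
        return new ArrayList<>(
                ledgerByFirstPublishTime.subMap(lower, true, toMillis, true).values());
    }
}
```

A time-range query would call `ledgersBetween` first and then scan only the returned ledgers, instead of every ledger of the topic.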





[GitHub] [pulsar] pointearth commented on issue #6930: Pulsar SQL: support user defined indexes

Posted by GitBox <gi...@apache.org>.
pointearth commented on issue #6930:
URL: https://github.com/apache/pulsar/issues/6930#issuecomment-801761819


   I know that querying by timestamp is very fast, because BookKeeper stores the data with the timestamp as the key.
   My suggestion is to create a non-clustered index on top of that BookKeeper key. For example,
   if we want to create an index on the field `name`, we can define the index item as
   map[name, List<__publish_time__>]
   
   Can you describe the relevant source code? Then we can discuss it further.
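
The non-clustered index suggested above could look something like the sketch below. It is a minimal in-memory illustration under stated assumptions: `FieldToPublishTimeIndex` is a hypothetical name, the indexed field is treated as a string, and publish times are epoch milliseconds. It says nothing about where the index would actually be stored.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: value of the indexed field -> publish times of matching messages.
final class FieldToPublishTimeIndex {
    private final Map<String, List<Long>> publishTimesByValue = new HashMap<>();

    // Record one message: the value of its indexed field and its publish time.
    void onMessage(String fieldValue, long publishTimeMillis) {
        publishTimesByValue
                .computeIfAbsent(fieldValue, k -> new ArrayList<>())
                .add(publishTimeMillis);
    }

    // Publish times of all messages whose indexed field equals the given value.
    List<Long> lookup(String fieldValue) {
        List<Long> times = publishTimesByValue.get(fieldValue);
        if (times == null) {
            return Collections.emptyList();
        }
        return Collections.unmodifiableList(times);
    }
}
```

A query such as `WHERE name = 'alice'` would then look up the publish times for `alice` and read only the messages at those times.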





[GitHub] [pulsar] KannarFr edited a comment on issue #6930: Pulsar SQL: support user defined indexes

Posted by GitBox <gi...@apache.org>.
KannarFr edited a comment on issue #6930:
URL: https://github.com/apache/pulsar/issues/6930#issuecomment-800187333


   @pointearth @sijie How do you imagine the index being structured in relation to ledgers? For a first implementation, given that this is a pub/sub system, a timestamp-based index would be a good start. I think we should first be able to auto-create an index per topic, like:
   
   ```java
   Map<Date, LedgerId>
   ```
   Or maybe
   
   ```java
   Map<Date, List<LedgerId>>  // previous, current, and next ledger IDs
   ```
   
   WDYT? Maybe we should point directly to the message rather than the ledger.





[GitHub] [pulsar] golden-yang commented on issue #6930: Pulsar SQL: support user defined indexes

Posted by GitBox <gi...@apache.org>.
golden-yang commented on issue #6930:
URL: https://github.com/apache/pulsar/issues/6930#issuecomment-997149341


   Is there any progress on this issue? 
   Being able to support indexes in Pulsar SQL would be a very meaningful feature.
   
   One way is to support it **natively**; another, I think, is to achieve it through **tiered storage**, for example by combining it with a data lake, with the help of Apache Hudi and similar tools.
   
   I saw some articles about combining Hudi and Pulsar. Is there any progress?
   @sijie  





[GitHub] [pulsar] KannarFr commented on issue #6930: Pulsar SQL: support user defined indexes

Posted by GitBox <gi...@apache.org>.
KannarFr commented on issue #6930:
URL: https://github.com/apache/pulsar/issues/6930#issuecomment-628792371


   I mean adding the index definition to the schema definition, but maybe that is not the best approach. I'm asking for your opinion.





[GitHub] [pulsar] KannarFr commented on issue #6930: Pulsar SQL: support user defined indexes

Posted by GitBox <gi...@apache.org>.
KannarFr commented on issue #6930:
URL: https://github.com/apache/pulsar/issues/6930#issuecomment-628744610


   @sijie 
   OK, about the index definition: what do you think of using AvroSchema to define the indexes?





[GitHub] [pulsar] sijie commented on issue #6930: Pulsar SQL: support user defined indexes

Posted by GitBox <gi...@apache.org>.
sijie commented on issue #6930:
URL: https://github.com/apache/pulsar/issues/6930#issuecomment-628767319


   Are you talking about adding the index definition to the schema definition? Or about using the Avro schema specification to describe the indexes?





[GitHub] [pulsar] pointearth commented on issue #6930: Pulsar SQL: support user defined indexes

Posted by GitBox <gi...@apache.org>.
pointearth commented on issue #6930:
URL: https://github.com/apache/pulsar/issues/6930#issuecomment-801764807


   Can we store it in BookKeeper? It could then be enabled when a Presto enhancement switch is turned on.





[GitHub] [pulsar] KannarFr commented on issue #6930: Pulsar SQL: support user defined indexes

Posted by GitBox <gi...@apache.org>.
KannarFr commented on issue #6930:
URL: https://github.com/apache/pulsar/issues/6930#issuecomment-801032939


   And where and how do we store this index?





[GitHub] [pulsar] sijie commented on issue #6930: Pulsar SQL: support user defined indexes

Posted by GitBox <gi...@apache.org>.
sijie commented on issue #6930:
URL: https://github.com/apache/pulsar/issues/6930#issuecomment-629006598


   I don't think it is a good idea to add an index definition to the schema definition. The schema definition defines the structure of the original data. The index definition depends on the schema definition but it is different from the original data. So the index definition should be associated with the storage that is used for storing the index data. For example, if we are using another managed ledger for storing the index, then the index definition should be the schema definition of the managed ledger. Does that make sense?








[GitHub] [pulsar] sijie commented on issue #6930: Pulsar SQL: support user defined indexes

Posted by GitBox <gi...@apache.org>.
sijie commented on issue #6930:
URL: https://github.com/apache/pulsar/issues/6930#issuecomment-628737943


   @KannarFr 
   
   The indexes can be built in a background process, using the same approach that is used for compaction. The "compacted" ledger is essentially an "index" into the original data.
   
   The index maintains some form of mapping from "keys" to "offsets" in the original data. An "offset" is essentially the message ID, which references a ledger ID and an entry ID. It doesn't matter whether a ledger is still in BookKeeper or has already been offloaded to tiered storage.
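
As a rough illustration of the key-to-offset mapping described above, the sketch below keeps, for each key, the positions of the messages that carry it. Everything here is an assumption made for the example: `KeyToPositionIndex` and `Position` are hypothetical names, a position is reduced to a ledger ID and an entry ID, and the background process that scans the topic is represented only by calls to `index`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch of an index from keys to message positions.
final class KeyToPositionIndex {
    // Simplified stand-in for a message ID: the ledger and entry holding the message.
    static final class Position {
        final long ledgerId;
        final long entryId;

        Position(long ledgerId, long entryId) {
            this.ledgerId = ledgerId;
            this.entryId = entryId;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Position)) {
                return false;
            }
            Position other = (Position) o;
            return ledgerId == other.ledgerId && entryId == other.entryId;
        }

        @Override
        public int hashCode() {
            return Objects.hash(ledgerId, entryId);
        }
    }

    private final Map<String, List<Position>> positionsByKey = new HashMap<>();

    // Called once per scanned message by the (hypothetical) background indexer.
    void index(String key, long ledgerId, long entryId) {
        positionsByKey
                .computeIfAbsent(key, k -> new ArrayList<>())
                .add(new Position(ledgerId, entryId));
    }

    // Positions of all messages seen so far for the key, in scan order.
    List<Position> lookup(String key) {
        List<Position> positions = positionsByKey.get(key);
        if (positions == null) {
            return Collections.emptyList();
        }
        return Collections.unmodifiableList(positions);
    }
}
```

Because a position only names a ledger and an entry, the lookup result is the same whether that ledger is still in BookKeeper or has been offloaded.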





[GitHub] [pulsar] KannarFr commented on issue #6930: Pulsar SQL: support user defined indexes

Posted by GitBox <gi...@apache.org>.
KannarFr commented on issue #6930:
URL: https://github.com/apache/pulsar/issues/6930#issuecomment-800187333


   @pointearth @sijie How do you imagine the index being structured in relation to ledgers? For a first implementation, given that this is a pub/sub system, a timestamp-based index would be a good start. I think we should first be able to auto-create an index per topic, like:
   
   ```java
   Map<Date, LedgerId>
   ```
   Or maybe
   
   ```java
   Map<Date, List<LedgerId>>  // previous, current, and next ledger IDs
   ```
   
   WDYT?

