You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/06/03 03:14:44 UTC

[GitHub] [spark] HeartSaVioR edited a comment on pull request #28707: [SPARK-31894][SS] Introduce UnsafeRow format validation for streaming state store

HeartSaVioR edited a comment on pull request #28707:
URL: https://github.com/apache/spark/pull/28707#issuecomment-637926400


   And personally I'd rather do the check in StateStore with additional overhead of reading "a" row in prior to achieve the same in all stateful operations. 
   
   ```
     /** Get or create a store associated with the id. */
     def get(
         storeProviderId: StateStoreProviderId,
         keySchema: StructType,
         valueSchema: StructType,
         indexOrdinal: Option[Int],
         version: Long,
         storeConf: StateStoreConf,
         hadoopConf: Configuration): StateStore = {
       require(version >= 0)
       val storeProvider = loadedProviders.synchronized {
         startMaintenanceIfNeeded()
         val provider = loadedProviders.getOrElseUpdate(
           storeProviderId,
           StateStoreProvider.createAndInit(
             storeProviderId.storeId, keySchema, valueSchema, indexOrdinal, storeConf, hadoopConf)
         )
         reportActiveStoreInstance(storeProviderId)
         provider
       }
       val store = storeProvider.getStore(version)
       val iter = store.iterator()
       if (iter.nonEmpty) {
         val rowPair = iter.next()
         val key = rowPair.key
         val value = rowPair.value
         // TODO: validate key with key schema
         // TODO: validate value with value schema
       }
       store
     }
   ```
   
   For streaming aggregations it initializes "two" state stores so the overhead goes to "two" rows, but I don't think the overhead matters much.
   
   If we really concern about the overhead, just have a StateStore wrapper wrapping `store` and do the same - only validate once for the first "get". In either way, we never need to restrict the functionality to the streaming aggregation.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org