Posted to notifications@accumulo.apache.org by GitBox <gi...@apache.org> on 2021/07/02 15:02:33 UTC

[GitHub] [accumulo] milleruntime opened a new issue #2187: Sorted Log Recovery configuration design

milleruntime opened a new issue #2187:
URL: https://github.com/apache/accumulo/issues/2187

This ticket is for designing how to configure the new sorted log recovery. With the changes in #2117, RFile now provides new options for configuring sorted recovery files. One that seems particularly useful is `table.file.compress.type`. Depending on the use case, a user may want to configure sorted recovery to use ZStandard or Snappy for compression. A new property could be created to set the compression, such as `tserver.sort.compress.type`. This property would be similar to `tserver.sort.buffer.size`, an existing option that sets the buffer size used during sorting. Other table options may also be desirable for sorted recovery performance:
- table.file.blocksize
- table.file.compress.blocksize
- table.file.compress.blocksize.index
- table.cache.block.enable

Instead of creating an equivalent `tserver.sort` property for all of these, I am wondering if there is a better way. One way to configure the sorted recovery would be to create a dedicated system table in the accumulo namespace (something like `accumulo.recovery`) and allow the user to configure that table the same way they do other tables. This solution is a bit heavy-handed, though, as it would create a new table, and many of the `table` properties would be ignored because they wouldn't apply to sorted recovery.

Another option is to not create any new configuration options and just set the values to something reasonable.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [accumulo] milleruntime commented on issue #2187: Sorted Log Recovery configuration design

Posted by GitBox <gi...@apache.org>.

milleruntime commented on issue #2187:
URL: https://github.com/apache/accumulo/issues/2187#issuecomment-876379764


   > This requires thinking about config that is "local" to RFile operations, and perhaps creating a concrete RFileConfiguration type for use in its public API and other related APIs (like AccumuloFileOutputFormat).
   
   This is an interesting idea that I had not considered. Are you thinking of the new Configuration interface within PluginEnvironment? Or the one in SPI under ServiceEnvironment? It would make sense to me to have one in the public API that goes along with RFile API. 


[GitHub] [accumulo] milleruntime edited a comment on issue #2187: Sorted Log Recovery configuration design

Posted by GitBox <gi...@apache.org>.

milleruntime edited a comment on issue #2187:
URL: https://github.com/apache/accumulo/issues/2187#issuecomment-888497972


   What do you think about just one new prefix property as a catch all?
   <pre>
   TSERV_SORT_FILE_PREFIX("tserver.sort.file.", null, PropertyType.PREFIX,
       "The rfile properties to use when sorting logs during recovery. Most of the properties"
           + " that begin with 'table.file' can be used here. For example, to set the compression"
           + " of the sorted recovery files to snappy use 'tserver.sort.file.compress.type=snappy'",
       "2.1.0"),
   </pre>
   
   An RFileConfiguration object would be nice but we aren't using the RFile API for writing during the sorted recovery. I think we just need a prefix property and translate them to table properties in the constructor of LogSorter. This would allow using the table property validation and if a user specifies a valid property that is not used, it will just be ignored.
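   
   A minimal sketch of the translation step described above, written against plain maps rather than Accumulo's configuration classes. The class name `SortFileProps` and the method `toTableProps` are hypothetical and exist only for illustration, and the `tserver.sort.file.` prefix is the one proposed in this comment, not a settled name. The sketch does no validation; it only shows how a sort-specific key could be rewritten into its `table.file.` equivalent.
   
   ```java
   import java.util.Map;
   import java.util.TreeMap;
   
   public class SortFileProps {
     // Prefix proposed in this thread (assumption: the final name may differ).
     static final String SORT_PREFIX = "tserver.sort.file.";
     static final String TABLE_PREFIX = "table.file.";
   
     // Rewrites every "tserver.sort.file.X" entry as "table.file.X".
     // Entries outside the sort prefix are left out of the result.
     static Map<String,String> toTableProps(Map<String,String> systemProps) {
       Map<String,String> tableProps = new TreeMap<>();
       for (Map.Entry<String,String> e : systemProps.entrySet()) {
         String key = e.getKey();
         if (key.startsWith(SORT_PREFIX)) {
           tableProps.put(TABLE_PREFIX + key.substring(SORT_PREFIX.length()), e.getValue());
         }
       }
       return tableProps;
     }
   
     public static void main(String[] args) {
       Map<String,String> systemProps = new TreeMap<>();
       systemProps.put("tserver.sort.file.compress.type", "snappy"); // sample value
       systemProps.put("tserver.sort.buffer.size", "10%"); // sample value, not under the prefix
       System.out.println(toTableProps(systemProps));
       // prints {table.file.compress.type=snappy}
     }
   }
   ```
   
   Once the keys are in `table.file.` form, the existing table property validation could be applied to them, which is the benefit this approach is after.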


[GitHub] [accumulo] milleruntime closed issue #2187: Sorted Log Recovery configuration design

Posted by GitBox <gi...@apache.org>.

milleruntime closed issue #2187:
URL: https://github.com/apache/accumulo/issues/2187


   


[GitHub] [accumulo] ctubbsii commented on issue #2187: Sorted Log Recovery configuration design

Posted by GitBox <gi...@apache.org>.

ctubbsii commented on issue #2187:
URL: https://github.com/apache/accumulo/issues/2187#issuecomment-876792029


   > Are you thinking of the new Configuration interface within PluginEnvironment? Or the one in SPI under ServiceEnvironment?
   
   Neither. I was thinking about something along the lines of BatchWriterConfig or NewTableConfiguration.
   
   > It would make sense to me to have one in the public API that goes along with RFile API.
   
   I agree. I am thinking of the RFile API as being something that could, in theory, be completely separated from Accumulo, as its own module. The kind of API it should have should be along those lines. I doubt we can separate the RFile API from Accumulo at all right now, but that's the basic idea... that you don't need "Accumulo" to read and write your own RFiles, and you don't need Accumulo's Configuration to create an RFileConfiguration for use in the RFile API. However, we will need to be able to translate our AccumuloConfiguration into an RFileConfiguration (similar to how we translate ClientProperties into BatchWriterConfig), by extracting the relevant subset of key/value pairs.
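   
   As a rough illustration of that idea, here is a sketch of a standalone configuration object with fluent setters, in the general style of BatchWriterConfig. Everything in it is an assumption made for illustration: the class name `RFileConfigSketch`, its two fields, the default values, and the `fromLocalProps` method are hypothetical and are not part of any Accumulo API. It also assumes the block size arrives as a plain number of bytes, which is a simplification.
   
   ```java
   import java.util.Map;
   
   public class RFileConfigSketch {
     private String compressType = "gz"; // illustrative default only
     private long compressBlockSize = 100_000L; // illustrative default only, in bytes
   
     public RFileConfigSketch setCompressType(String type) {
       this.compressType = type;
       return this;
     }
   
     public RFileConfigSketch setCompressBlockSize(long bytes) {
       this.compressBlockSize = bytes;
       return this;
     }
   
     public String getCompressType() {
       return compressType;
     }
   
     public long getCompressBlockSize() {
       return compressBlockSize;
     }
   
     // Builds a config from key/value pairs whose context prefix (for example
     // "table.file.") has already been stripped. Unrecognized keys are ignored.
     public static RFileConfigSketch fromLocalProps(Map<String,String> local) {
       RFileConfigSketch conf = new RFileConfigSketch();
       if (local.containsKey("compress.type")) {
         conf.setCompressType(local.get("compress.type"));
       }
       if (local.containsKey("compress.blocksize")) {
         conf.setCompressBlockSize(Long.parseLong(local.get("compress.blocksize")));
       }
       return conf;
     }
   
     public static void main(String[] args) {
       // Standalone use, with no Accumulo configuration involved.
       RFileConfigSketch direct = new RFileConfigSketch().setCompressType("snappy");
       // Translated use, from an already-localized subset of properties.
       RFileConfigSketch translated = fromLocalProps(Map.of("compress.type", "zstd"));
       System.out.println(direct.getCompressType() + " " + translated.getCompressType());
       // prints snappy zstd
     }
   }
   ```
   
   The point of the shape is that code using the object never sees where the values came from, so the same type could serve both callers with no Accumulo configuration and callers translating from it.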


[GitHub] [accumulo] ctubbsii commented on issue #2187: Sorted Log Recovery configuration design

Posted by GitBox <gi...@apache.org>.

ctubbsii commented on issue #2187:
URL: https://github.com/apache/accumulo/issues/2187#issuecomment-875791496


   Our config tends to be a big massive pool of properties. A challenge is to figure out how to "localize" subsets of the config to the specific scope of the components that utilize it. The way we've done this is by heavily relying on the hierarchical nature of our properties, and using common prefixes to store configuration properties that are related. However, these prefixes can be stripped off when the properties are "localized" to the code that actually reads the properties' values to perform its function.
   
   We currently localize some properties by using `AccumuloConfiguration.getAllPropertiesWithPrefixStripped(Property contextSpecificPrefix)` for extracting and localizing scan dispatch configuration and some compaction configuration.
   
   Because we can "localize" sets of properties this way pretty easily, we can also store the same set of properties in different contexts, each established with its own prefix. For example, `table.file.*` and `tserver.wal.sort.file.*`.
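   
   To make the localization concrete, here is a small self-contained sketch that uses a plain map in place of Accumulo's configuration. The class `PrefixLocalizer` and its `stripPrefix` method are hypothetical stand-ins written for this illustration; they are not the real `getAllPropertiesWithPrefixStripped` implementation, and the property values are samples.
   
   ```java
   import java.util.Map;
   import java.util.TreeMap;
   
   public class PrefixLocalizer {
     // Returns every property under `prefix`, keyed by the rest of its name.
     static Map<String,String> stripPrefix(Map<String,String> pool, String prefix) {
       Map<String,String> local = new TreeMap<>();
       for (Map.Entry<String,String> e : pool.entrySet()) {
         if (e.getKey().startsWith(prefix)) {
           local.put(e.getKey().substring(prefix.length()), e.getValue());
         }
       }
       return local;
     }
   
     public static void main(String[] args) {
       // One big pool holding the same logical setting in two contexts.
       Map<String,String> pool = new TreeMap<>();
       pool.put("table.file.compress.type", "gz");
       pool.put("tserver.wal.sort.file.compress.type", "snappy");
       pool.put("tserver.sort.buffer.size", "10%");
   
       System.out.println(stripPrefix(pool, "table.file."));
       // prints {compress.type=gz}
       System.out.println(stripPrefix(pool, "tserver.wal.sort.file."));
       // prints {compress.type=snappy}
     }
   }
   ```
   
   Both calls yield the same local key, `compress.type`, so the code that consumes the localized map does not need to know which context it was read from.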
   
   So, all we need to do is figure out the set of properties that RFile read/write code needs locally, and then make sure we can extract them from the big pool using a common prefix for different contexts, as in:
   
   ```java
   var tableFileConf = FileConfig.fromMap(tableConf.getAllPropertiesWithPrefixStripped("table.file."));
   var walFileConf = FileConfig.fromMap(systemConf.getAllPropertiesWithPrefixStripped("tserver.wal.sort.file."));
   ```
   
   This requires thinking about config that is "local" to RFile operations, and perhaps creating a concrete RFileConfiguration type for use in its public API and other related APIs (like AccumuloFileOutputFormat).
   
   Alternatively, the simplest thing to do is do nothing, and just read the per-table settings from the system config, as though these RFiles are logically associated with a generic, system-wide, non-existent table, rather than a specific sorted WAL context. We certainly have existing code that behaves this way, but it's less than ideal, and can create confusion for users.


[GitHub] [accumulo] ctubbsii commented on issue #2187: Sorted Log Recovery configuration design

Posted by GitBox <gi...@apache.org>.

ctubbsii commented on issue #2187:
URL: https://github.com/apache/accumulo/issues/2187#issuecomment-888522824


   > What do you think about just one new prefix property as a catch all?
   > 
   > TSERV_SORT_FILE_PREFIX("tserver.sort.file.", null, PropertyType.PREFIX,
   >     "The rfile properties to use when sorting logs during recovery. Most of the properties"
   >         + " that begin with 'table.file' can be used here. For example, to set the compression"
   >         + " of the sorted recovery files to snappy use 'tserver.sort.file.compress.type=snappy'",
   >     "2.1.0"),
   
   I think that's basically what I suggested with `tserver.wal.sort.file.` above. I'm not sure about `tserver.sort.file.` without the `wal` part, because Accumulo's sole function is to maintain sorted data and all RFiles are sorted files, so without the `wal` part to clarify it's related to write-ahead logs, users could easily misunderstand the property name.
   
   > An RFileConfiguration object would be nice but we aren't using the RFile API for writing during the sorted recovery. I think we just need a prefix property and translate them to table properties in the constructor of LogSorter. This would allow using the table property validation and if a user specifies a valid property that is not used, it will just be ignored.
   
   Yeah, that's fine. I was thinking of the configuration object as basically a "translate them" implementation detail. Basically, depending on context, we'd translate table properties into this config object, or we'd translate the sorted WAL properties into this config object. I was just thinking about internally distancing ourselves from where the properties came from once we're inside the RFile code, to decouple the RFile code from Accumulo's configuration mechanisms. But, if it's easier to do the implementation by translating into "table properties", that's fine for now.


[GitHub] [accumulo] milleruntime commented on issue #2187: Sorted Log Recovery configuration design

Posted by GitBox <gi...@apache.org>.

milleruntime commented on issue #2187:
URL: https://github.com/apache/accumulo/issues/2187#issuecomment-875532240


   I did a bit of refactoring on the `LogSorter` in #2191 so it will at least use the system configuration to write the sorted RFiles. 


[GitHub] [accumulo] milleruntime commented on issue #2187: Sorted Log Recovery configuration design

Posted by GitBox <gi...@apache.org>.

milleruntime commented on issue #2187:
URL: https://github.com/apache/accumulo/issues/2187#issuecomment-888497972


   What do you think about just one new prefix property as a catch all?
   <pre>
   TSERV_SORT_FILE_PREFIX("tserver.sort.file.", null, PropertyType.PREFIX,
       "The rfile properties to use when sorting logs during recovery. Most of the properties"
           + " that begin with 'table.file' can be used here. For example, to set the compression"
           + " of the sorted recovery files to snappy use 'tserver.sort.file.compress.type=snappy'",
       "2.1.0"),
   </pre>
   
   An RFileConfiguration object would be nice but we aren't using it for writing during the sorted recovery. I think we just need a prefix property and translate them to table properties in the constructor of LogSorter. This would allow using the table property validation and if a user specifies a valid property that is not used, it will just be ignored.

