Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2014/03/31 01:11:14 UTC

[jira] [Updated] (LUCENE-2446) Add checksums to Lucene segment files

     [ https://issues.apache.org/jira/browse/LUCENE-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2446:
--------------------------------

    Attachment: LUCENE-2446.patch

I think this is a pretty important issue. Besides the case of distributed systems copying files around, today there is no integrity mechanism to detect hardware problems (which can leave developers pulling their hair out trying to debug corruption), and some optimized components do bulk merges that can propagate corruption into new segments over a long time.

Also, in recent JVMs computing a checksum is fast: e.g. in Java 8, CRC32 is an intrinsic and uses CLMUL hardware instructions on x86, and so on.

I created an initial patch: the last 8 bytes of every file are a zlib CRC32 checksum. We also write some additional metadata before it (it's done via CodecUtil.writeFooter) so we can extend the footer in the future if we need to.
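
To make the layout concrete, here is a minimal standalone sketch of the idea, not the code in the patch. The class name FooterSketch, the FOOTER_MAGIC and ALGORITHM_ID values, and the choice to have the checksum cover the metadata as well as the payload are all assumptions for illustration; the real footer is whatever CodecUtil.writeFooter writes.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Illustrative only: names and constant values below are hypothetical,
// not the on-disk format produced by CodecUtil.writeFooter.
final class FooterSketch {
    static final int FOOTER_MAGIC = 0x1234ABCD; // hypothetical marker value
    static final int ALGORITHM_ID = 0;          // hypothetical: 0 means zlib CRC32
    static final int FOOTER_LENGTH = 4 + 4 + 8; // magic + algorithm id + checksum

    /** Returns payload followed by the footer; the last 8 bytes are the CRC32. */
    static byte[] appendFooter(byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(payload.length + FOOTER_LENGTH);
        buf.put(payload).putInt(FOOTER_MAGIC).putInt(ALGORITHM_ID);
        // Assumption: the checksum covers every byte written so far,
        // i.e. the payload and the footer metadata.
        CRC32 crc = new CRC32();
        crc.update(buf.array(), 0, buf.position());
        buf.putLong(crc.getValue());
        return buf.array();
    }

    /** Recomputes the CRC32 over everything before the last 8 bytes and compares. */
    static boolean verify(byte[] file) {
        if (file.length < FOOTER_LENGTH) {
            return false; // too short to contain a footer
        }
        int checksumStart = file.length - 8;
        CRC32 crc = new CRC32();
        crc.update(file, 0, checksumStart);
        long stored = ByteBuffer.wrap(file, checksumStart, 8).getLong();
        return stored == crc.getValue();
    }
}
```

Storing the 32-bit CRC in a full 8-byte long wastes a few bytes per file, but together with the algorithm id it leaves room to switch to a wider checksum later without changing the footer size.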

For small metadata files (e.g. .fnm, .si, .dvm, ...) we just verify the checksum when we open the file, because we are reading the file anyway. So this provides some extra safety.

For larger files this would be expensive: instead, the patch adds AtomicReader.validate(), which asks the codec (or filter reader, or whatever) to ensure everything is valid. This is called by e.g. CheckIndex before decoding.
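
As a rough illustration of why validating a large file is a separate, explicit step: it means one sequential pass over every byte, without decoding anything. The sketch below is not AtomicReader.validate() or any Lucene API. StreamingValidateSketch is a hypothetical helper that assumes only that the last 8 bytes of the file hold the CRC32 of all preceding bytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;

// Hypothetical helper, not part of Lucene: streams a file once and checks
// the trailing 8-byte CRC32 against the bytes that precede it.
final class StreamingValidateSketch {
    static boolean validate(Path file) {
        try (InputStream in = Files.newInputStream(file)) {
            long remaining = Files.size(file) - 8; // bytes covered by the checksum
            if (remaining < 0) {
                return false; // too short to hold a checksum
            }
            CRC32 crc = new CRC32();
            byte[] buffer = new byte[8192];
            while (remaining > 0) {
                int n = in.read(buffer, 0, (int) Math.min(buffer.length, remaining));
                if (n < 0) {
                    return false; // file is shorter than its reported size
                }
                crc.update(buffer, 0, n);
                remaining -= n;
            }
            // Read the stored checksum (big-endian long) from the last 8 bytes.
            byte[] stored = new byte[8];
            int off = 0;
            while (off < stored.length) {
                int n = in.read(stored, off, stored.length - off);
                if (n < 0) {
                    return false;
                }
                off += n;
            }
            return ByteBuffer.wrap(stored).getLong() == crc.getValue();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The cost is proportional to file size, which is cheap enough for a one-off integrity check but too much to pay every time a large file is opened.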
 
The patch adds an option (off by default) on IndexWriterConfig to call this before merging. Ideally we wouldn't need this and would just validate as we merge, but that requires some codec/merge API changes...

File format changes are backwards compatible.

> Add checksums to Lucene segment files
> -------------------------------------
>
>                 Key: LUCENE-2446
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2446
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Lance Norskog
>              Labels: checksum
>         Attachments: LUCENE-2446.patch
>
>
> It would be useful for the different files in a Lucene index to include checksums. This would make it easy to spot corruption while copying index files around; the various cloud efforts assume many more data-copying operations than older single-index implementations.
> This feature might be much easier to implement if all index files are created in a sequential fashion. This issue therefore depends on [LUCENE-2373].



--
This message was sent by Atlassian JIRA
(v6.2#6252)
