Posted to issues@lucene.apache.org by "rmuir (via GitHub)" <gi...@apache.org> on 2023/02/21 17:34:56 UTC

[GitHub] [lucene] rmuir commented on issue #12165: Integrating Apache Lucene into OSS-Fuzz

rmuir commented on issue #12165:
URL: https://github.com/apache/lucene/issues/12165#issuecomment-1438859463

The analyzers example given there is a good one for seeing the differences.

Both approaches (OSS-Fuzz and the existing TestRandomChains) test "random analysis chains", but the current TestRandomChains also tests all possible ctors of these analysis components (not just the default constructor) and injects random stuff into them. The default constructor is usually tested anyway in the component's own unit tests with fuzzed data (see the testRandomData() methods everywhere). Fuzzing the ctors in this way finds problems such as a component missing a sanity or range check on an integer parameter. This might even be more productive overall than actually feeding "fuzzed data".
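
To make the ctor-fuzzing idea concrete, here is a minimal sketch using only the JDK. It is not the actual TestRandomChains code: `CheckedWindow`, `UncheckedWindow`, `buffer()` and `fuzz()` are hypothetical names invented for illustration, and only `int` parameters are handled. It treats an IllegalArgumentException as a sanity check doing its job and counts anything else as a likely bug.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.InvocationTargetException;
import java.util.Random;

final class CtorFuzzSketch {

  /** Hypothetical component that validates its integer parameter. */
  public static class CheckedWindow {
    private final int size;

    public CheckedWindow(int size) {
      if (size < 1) {
        throw new IllegalArgumentException("size must be >= 1, got " + size);
      }
      this.size = size;
    }

    public int[] buffer() {
      return new int[size];
    }
  }

  /** Hypothetical component that forgot the range check. */
  public static class UncheckedWindow {
    private final int size;

    public UncheckedWindow(int size) {
      this.size = size;
    }

    // Throws NegativeArraySizeException when size < 0.
    public int[] buffer() {
      return new int[size];
    }
  }

  /** Builds one random argument per parameter; this sketch only supports int. */
  static Object[] randomArgs(Class<?>[] types, Random random) {
    Object[] args = new Object[types.length];
    for (int i = 0; i < types.length; i++) {
      if (types[i] != int.class) {
        throw new UnsupportedOperationException("sketch only handles int: " + types[i]);
      }
      args[i] = random.nextInt(201) - 100; // values in [-100, 100]
    }
    return args;
  }

  /** Fuzzes every public ctor and returns how many runs failed unexpectedly. */
  static int fuzz(Class<?> clazz, Random random, int iterations) {
    int unexpected = 0;
    for (Constructor<?> ctor : clazz.getConstructors()) {
      for (int i = 0; i < iterations; i++) {
        Object[] args = randomArgs(ctor.getParameterTypes(), random);
        try {
          Object instance = ctor.newInstance(args);
          clazz.getMethod("buffer").invoke(instance);
        } catch (InvocationTargetException e) {
          // IllegalArgumentException means the component rejected bad input: fine.
          if (!(e.getCause() instanceof IllegalArgumentException)) {
            unexpected++;
          }
        } catch (ReflectiveOperationException e) {
          throw new IllegalStateException("reflection setup problem", e);
        }
      }
    }
    return unexpected;
  }

  public static void main(String[] args) {
    System.out.println("checked:   " + fuzz(CheckedWindow.class, new Random(42), 200));
    System.out.println("unchecked: " + fuzz(UncheckedWindow.class, new Random(42), 200));
  }
}
```

The checked component reports no unexpected failures, while the unchecked one fails whenever a negative size is drawn, without any input data being fed through it.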

The current TestRandomChains also doesn't require us to "register" any new components: it just discovers all Tokenizers, CharFilters, TokenFilters, etc. that are available. This ensures we catch problems in new analyzers that get added.
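
As a rough illustration of "discover instead of register", the sketch below finds every concrete subclass of a base type by reflection. It is an assumption-laden stand-in: the real test scans for analysis classes, whereas this sketch only looks at the nested classes of one holder class, and `Component`, `LowerCase`, `Stemmer`, `BaseFolding` and `Helper` are hypothetical names.

```java
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

final class DiscoverySketch {

  /** Hypothetical base type standing in for Tokenizer/TokenFilter/CharFilter. */
  abstract static class Component {}

  static class LowerCase extends Component {}

  static class Stemmer extends Component {}

  /** Abstract helpers cannot be instantiated, so discovery skips them. */
  abstract static class BaseFolding extends Component {}

  /** Unrelated class, ignored by discovery. */
  static class Helper {}

  /** Returns all concrete Component subclasses nested in holder, sorted by name. */
  static List<Class<? extends Component>> discover(Class<?> holder) {
    List<Class<? extends Component>> found = new ArrayList<>();
    for (Class<?> candidate : holder.getDeclaredClasses()) {
      boolean isComponent = Component.class.isAssignableFrom(candidate);
      boolean isConcrete = !Modifier.isAbstract(candidate.getModifiers());
      if (isComponent && isConcrete) {
        found.add(candidate.asSubclass(Component.class));
      }
    }
    // getDeclaredClasses() has no guaranteed order, so sort for reproducibility.
    found.sort((a, b) -> a.getSimpleName().compareTo(b.getSimpleName()));
    return found;
  }

  public static void main(String[] args) {
    for (Class<? extends Component> c : discover(DiscoverySketch.class)) {
      System.out.println(c.getSimpleName());
    }
  }
}
```

Adding another concrete `Component` subclass to the holder makes it show up in the result with no change to the discovery code, which is the property that lets new analyzers get tested automatically.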

As far as actual data fuzzing goes, it is more than just randomized data. Have a look at our base analyzers test class: it is a "torture chamber" for analyzers and will find things such as thread-safety/race issues as well: https://github.com/apache/lucene/blob/main/lucene/test-framework/src/java/org/apache/lucene/tests/analysis/BaseTokenStreamTestCase.java
All analyzers use this class for "fuzzing" in their own unit tests; TestRandomChains is just an "integration test" that then combines them together.
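
One way such a harness can surface thread-safety problems is to analyze random text once on a single thread, then replay the same inputs from several threads and compare the results. The sketch below shows that idea with the JDK only. It is a loose illustration, not the BaseTokenStreamTestCase implementation: `analyze`, `randomText` and `consistentAcrossThreads` are hypothetical, and a plain function stands in for an Analyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Random;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class TortureSketch {

  /** Hypothetical stateless "analyzer": lowercases and splits on whitespace. */
  static List<String> analyze(String text) {
    String trimmed = text.trim();
    if (trimmed.isEmpty()) {
      return new ArrayList<>();
    }
    return Arrays.asList(trimmed.toLowerCase(Locale.ROOT).split("\\s+"));
  }

  /** Random text of length 0..20 over a small alphabet that includes whitespace. */
  static String randomText(Random random) {
    String alphabet = "aBzQ \t";
    int length = random.nextInt(21);
    StringBuilder sb = new StringBuilder(length);
    for (int i = 0; i < length; i++) {
      sb.append(alphabet.charAt(random.nextInt(alphabet.length())));
    }
    return sb.toString();
  }

  /** True if every thread reproduces the tokens computed single-threaded. */
  static boolean consistentAcrossThreads(
      Function<String, List<String>> analyzer, long seed, int docs, int threads) {
    Random random = new Random(seed);
    List<String> inputs = new ArrayList<>();
    List<List<String>> expected = new ArrayList<>();
    for (int i = 0; i < docs; i++) {
      String text = randomText(random);
      inputs.add(text);
      expected.add(analyzer.apply(text)); // single-threaded baseline
    }
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<Boolean>> results = new ArrayList<>();
      for (int t = 0; t < threads; t++) {
        results.add(pool.submit(() -> {
          for (int i = 0; i < inputs.size(); i++) {
            if (!expected.get(i).equals(analyzer.apply(inputs.get(i)))) {
              return false; // output differs from the baseline
            }
          }
          return true;
        }));
      }
      for (Future<Boolean> result : results) {
        if (!result.get()) {
          return false;
        }
      }
      return true;
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException("analysis failed in a worker thread", e);
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println(consistentAcrossThreads(TortureSketch::analyze, 42L, 100, 4));
  }
}
```

A component that keeps shared mutable state between calls would be expected to fail this kind of check intermittently, which is why the seed is a parameter: a failing run can be replayed with the same inputs.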

It is also important to think about how much time it takes to debug a failure. The current setup across both unit and integration tests makes it pretty easy to spot when the problem is a specific analyzer component, vs some crazy "interaction" between more than one of them. Nobody wants to debug an integration test if they can debug a unit test.

We did a lot of work with BaseTokenStreamTestCase/TestRandomChains, such as adding special logging of the analysis chain, adding "ValidatingTokenFilter" at every step, etc. It still sucks to debug this stuff when it fails; we have a lot of analyzers :)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org
