Posted to issues@maven.apache.org by "Michael Bien (Jira)" <ji...@apache.org> on 2023/03/12 19:35:00 UTC

[jira] [Commented] (MINDEXER-185) Document filter doesn't seem to do anything

    [ https://issues.apache.org/jira/browse/MINDEXER-185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699390#comment-17699390 ] 

Michael Bien commented on MINDEXER-185:
---------------------------------------

I was reading up on Lucene yesterday and had entirely forgotten that a key point of its data structure is that index segments are immutable. This means deleting a document doesn't physically remove anything: all it does is set a flag. Flagged documents are skipped while queries run and are only dropped later, during segment merges.
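
To picture the behaviour described above, here is a toy model in plain Java. It is only a sketch and not Lucene's actual code: the class {{ToySegment}} and all of its methods are made up for illustration. It shows a write-once list of documents, a delete that only sets a flag, and a merge that produces a smaller segment.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Toy model of a write-once segment with flag-based deletes (hypothetical, not Lucene code). */
final class ToySegment {
    private final List<String> docs;             // written once, never modified afterwards
    private final BitSet deleted = new BitSet(); // one "deleted" flag per document id

    ToySegment(List<String> docs) {
        this.docs = List.copyOf(docs);
    }

    /** "Deleting" only sets a flag; the stored document stays where it is. */
    void delete(int docId) {
        deleted.set(docId);
    }

    /** Number of documents physically stored, including flagged ones. */
    int storedDocs() {
        return docs.size();
    }

    /** Documents visible to a query: flagged ones are skipped. */
    List<String> liveDocs() {
        List<String> live = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.get(i)) {
                live.add(docs.get(i));
            }
        }
        return live;
    }

    /** A merge writes a new segment that contains the live documents only. */
    ToySegment merge() {
        return new ToySegment(liveDocs());
    }

    public static void main(String[] args) {
        ToySegment segment = new ToySegment(List.of("a", "b", "c"));
        segment.delete(1);
        System.out.println(segment.storedDocs());         // 3: nothing was removed yet
        System.out.println(segment.liveDocs());           // [a, c]
        System.out.println(segment.merge().storedDocs()); // 2: space is reclaimed by the merge
    }
}
```

In this model the stored size only shrinks once {{merge()}} runs, which would match an index that keeps its size right after every document has been flagged as deleted.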

Feel free to close this issue. However, I believe it is worth investigating whether the filter could be run in the reader itself while it is building the index; this should hopefully have an actual effect on the resulting index size.
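
A rough sketch of what filtering in the reader could look like, in plain Java. This is not the Maven Indexer implementation: {{FilteringReaderSketch}} and {{readFiltered}} are hypothetical names, a {{Predicate}} stands in for the document filter, and a plain {{List}} stands in for the index being built. The point is only that a rejected record is never written, so there is nothing to flag or merge away afterwards.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical sketch: apply the filter while reading, before anything is written. */
final class FilteringReaderSketch {

    /**
     * Copies only accepted records from the source to the sink.
     * Returns the number of records that were written.
     */
    static <T> int readFiltered(Iterator<T> source, Predicate<? super T> accept, List<? super T> sink) {
        int written = 0;
        while (source.hasNext()) {
            T record = source.next();
            if (accept.test(record)) { // rejected records never reach the sink
                sink.add(record);
                written++;
            }
        }
        return written;
    }

    public static void main(String[] args) {
        // Stand-in records; the field name "groupId" is chosen for illustration only.
        List<Map<String, String>> records = List.of(
                Map.of("groupId", "org.apache.maven"),
                Map.of("groupId", "com.example"),
                Map.of("groupId", "org.apache.lucene"));

        List<Map<String, String>> index = new ArrayList<>();
        int written = readFiltered(records.iterator(),
                r -> r.get("groupId").startsWith("org.apache."), index);

        System.out.println(written);      // 2
        System.out.println(index.size()); // 2: the rejected record was never stored
    }
}
```

With a filter that rejects everything (the {{doc -> false}} case from the issue description), the sink in this sketch simply stays empty.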

> Document filter doesn't seem to do anything
> -------------------------------------------
>
>                 Key: MINDEXER-185
>                 URL: https://issues.apache.org/jira/browse/MINDEXER-185
>             Project: Maven Indexer
>          Issue Type: Bug
>    Affects Versions: 7.0.1
>            Reporter: Michael Bien
>            Priority: Major
>
> Hello devs!
>  
> I tried to filter the index during extraction using a DocumentFilter and it didn't appear to do anything.
> As a test, I simply set {{indexUpdateRequest.setDocumentFilter(doc -> false);}} before calling {{DefaultIndexUpdater.fetchAndUpdateIndex}}, and the extracted index had the same size (5.6 GB) as without the filter.
>  
> The filter is actually called and it does also add a few minutes to the extraction time.
> https://github.com/apache/maven-indexer/blob/1cd122b1487150613005c8f9aced9aec20fded3e/indexer-core/src/main/java/org/apache/maven/index/updater/DefaultIndexUpdater.java#L238-L241
>  
> I am not sure why the implementation is filtering the index *after* extraction. Wouldn't it be easier and also more efficient to do it in IndexDataReader?
> e.g. https://github.com/apache/maven-indexer/blob/1cd122b1487150613005c8f9aced9aec20fded3e/indexer-core/src/main/java/org/apache/maven/index/updater/IndexDataReader.java#L269


