Posted to issues@maven.apache.org by "Hudson (Jira)" <ji...@apache.org> on 2022/11/04 17:09:00 UTC
[jira] [Commented] (MINDEXER-151) Speed up Index update from remote

    [ https://issues.apache.org/jira/browse/MINDEXER-151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17629118#comment-17629118 ] 

Hudson commented on MINDEXER-151:
---------------------------------

Build succeeded in Jenkins: Maven » Maven TLP » maven-indexer » master #74

See https://ci-maven.apache.org/job/Maven/job/maven-box/job/maven-indexer/job/master/74/

> Speed up Index update from remote
> ---------------------------------
>
>                 Key: MINDEXER-151
>                 URL: https://issues.apache.org/jira/browse/MINDEXER-151
>             Project: Maven Indexer
>          Issue Type: Improvement
>            Reporter: Tamas Cservenak
>            Assignee: Tamas Cservenak
>            Priority: Major
>             Fix For: 7.0.0
>
>
> Currently, if you run the BasicUsageExample from the examples, it performs a "full" update, and the full update (going from an "empty" index to an "up to date" index) takes 15 minutes or more. Yes, the Central index is huge, but there is room for improvement.
> Steps happening during update(s):
>  * properties file downloaded
>  * GZ file(s) downloaded (which ones depends on whether the update is incremental or full)
>  * the GZ files are processed into a temporary Lucene index
>  * the target (being updated) indexing context index is "replaced" (or merged, depending on the case) with the temporary Lucene index
> Downloading the files takes several seconds, but it is the processing of the raw GZIP records into the Lucene index that takes a long time. This can be improved.
> IndexUpdateRequest got a new field {{int threads}} with a default value of 1 (the behavior is then the same as before). When set to a value greater than 1 (accepted values are positive numbers), {{IndexDataReader}} behaves slightly differently than with threads=1: it creates N (threads) "silo" indexes, spawns N threads, and processes the input file on N threads into N silos that are merged at the end. This should improve long update times (the index is huge as well), ideally halving them, as experiments show (the ideal on my HW is 4 threads, which halves the full index update time).
> Using very large numbers may make things worse, as time may be spent on managing/merging silos, so the "sweet spot" is probably HW dependent.
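
The "N silos on N threads, merged at the end" idea described above can be sketched in plain Java. This is only an illustration under stated assumptions, not the actual {{IndexDataReader}} code: the class SiloSketch, the method processIntoSilos, the use of in-memory lists in place of Lucene indexes, and the upper-casing that stands in for per-record processing are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: in-memory lists stand in for the per-thread "silo" indexes.
class SiloSketch {

    // Processes the records on `threads` workers, each filling its own silo,
    // then merges all silos into one result once every worker has finished.
    static List<String> processIntoSilos(List<String> records, int threads) {
        if (threads < 1) {
            throw new IllegalArgumentException("threads must be a positive number");
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> silos = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int offset = t;
                silos.add(pool.submit(() -> {
                    // Each worker owns its silo, so no locking is needed while filling it.
                    List<String> silo = new ArrayList<>();
                    for (int i = offset; i < records.size(); i += threads) {
                        silo.add(records.get(i).toUpperCase()); // stand-in for indexing a record
                    }
                    return silo;
                }));
            }
            // Merge step: this part is sequential, which is one reason why
            // a very large thread count can stop paying off.
            List<String> merged = new ArrayList<>();
            for (Future<List<String>> silo : silos) {
                merged.addAll(silo.get());
            }
            return merged;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while processing silos", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a silo worker failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            records.add("record-" + i);
        }
        System.out.println(processIntoSilos(records, 4).size());
    }
}
```

With threads=1 the sketch degenerates to a single silo filled in input order, which mirrors the "same as before" default. With more threads the merged result holds the same records, but grouped by silo rather than in input order.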



--
This message was sent by Atlassian Jira
(v8.20.10#820010)