Posted to dev@netbeans.apache.org by Michael Bien <mb...@gmail.com> on 2023/03/17 10:06:46 UTC

maven indexing tweaks

Hello everyone,

I experimented a bit with the maven index extraction process and got 
some pretty good results (I think).

There might be a way to filter the index during extraction without 
noteworthy overhead, which allows the following:

  - "sliding window" time filters, e.g. drop all documents older than 2 
years (aka: who uses old libraries?)

  - we can drop fields we don't need from the index. Especially 
interesting for fields which don't compress well (looking at you, sha1 
hash)
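
To make the two filters listed above concrete, here is a minimal sketch 
in Python. It is not the maven-indexer code; the record layout and the 
field names ("modified", "sha1") are assumptions made for illustration, 
as is the choice to keep records that carry no timestamp at all.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical field names, chosen for illustration only.
DROPPED_FIELDS = {"sha1"}


def filter_record(record, cutoff, dropped_fields=DROPPED_FIELDS):
    """Return a slimmed copy of the record, or None if it should be dropped.

    Records without a "modified" timestamp are never dropped by age
    (an assumption: only artifact records are expected to carry one).
    """
    modified = record.get("modified")
    if modified is not None and modified < cutoff:
        return None  # older than the sliding window
    return {k: v for k, v in record.items() if k not in dropped_fields}


def extract(records, now, max_age_days=2 * 365):
    """Yield only the records that survive the time and field filters."""
    cutoff = now - timedelta(days=max_age_days)
    for record in records:
        kept = filter_record(record, cutoff)
        if kept is not None:
            yield kept


now = datetime(2023, 3, 17, tzinfo=timezone.utc)
records = [
    {"id": "org.example:old-lib:1.0", "modified": now - timedelta(days=1200), "sha1": "aa"},
    {"id": "org.example:new-lib:2.0", "modified": now - timedelta(days=30), "sha1": "bb"},
    {"descriptor": "index metadata"},  # no timestamp: not an artifact record
]
kept = list(extract(records, now))
```

Because both filters run on each record as it streams past, nothing 
unwanted is ever written into the index in the first place.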

some results for the time cutoff filter:

full: 5.6 GB
2y: 2.6 GB
1y: 1.4 GB

Now, if we also throw away some fields we likely don't need, we get this:

full: 2.8 GB
2y: 1.4 GB
1y: 0.8 GB

(This would obviously be configurable in the options; someone who 
doesn't care about storage, like myself, would set it to the full index.)
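
To put the two sets of numbers in relation, the relative savings can be 
worked out as follows. This only does arithmetic on the sizes reported 
above; the helper name is made up for the example.

```python
# Index sizes in GB as reported above.
all_fields = {"full": 5.6, "2y": 2.6, "1y": 1.4}
reduced_fields = {"full": 2.8, "2y": 1.4, "1y": 0.8}


def saving(before, after):
    """Fraction of disk space saved, rounded to two decimals."""
    return round(1 - after / before, 2)


for window in all_fields:
    by_fields = saving(all_fields[window], reduced_fields[window])
    overall = saving(all_fields["full"], reduced_fields[window])
    print(f"{window}: fields alone save {by_fields:.0%}, total saving {overall:.0%}")
```

Dropping the fields alone halves the full index, and combining it with 
the 1-year window saves about 86% compared to the unfiltered full index.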

Lucene's storage uses immutable files, which means a remove operation at 
the wrong stage would have no effect on disk usage (it would only set a 
bit). This makes the extraction step the best place for filtering, since 
that is where the index is built. I am not really a Lucene expert, so I 
wouldn't rule out that there are more ways to shrink the index.
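
The effect of immutable files can be illustrated with a toy model. This 
is not Lucene's real segment format or API, just a sketch of the idea 
that a delete flips a bit while the stored bytes remain until the data 
is rewritten.

```python
class Segment:
    """Toy model of a write-once index segment (not Lucene's real format)."""

    def __init__(self, docs):
        self._docs = tuple(docs)             # written once, never modified
        self._deleted = [False] * len(docs)  # side bitmap of deletion marks

    def delete(self, doc_id):
        # A delete only flips a bit; the stored bytes stay where they are.
        self._deleted[doc_id] = True

    def live_docs(self):
        return [d for d, gone in zip(self._docs, self._deleted) if not gone]

    def stored_bytes(self):
        return sum(len(d) for d in self._docs)

    def rewrite(self):
        """Build a new segment from live docs only; this is what frees space."""
        return Segment(self.live_docs())


segment = Segment([b"old-artifact-1", b"old-artifact-2", b"new-artifact"])
before = segment.stored_bytes()
segment.delete(0)
segment.delete(1)
after_delete = segment.stored_bytes()             # unchanged by the deletes
after_rewrite = segment.rewrite().stored_bytes()  # only live docs remain
```

Filtering before the first write avoids the rewrite step entirely.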

Some other features of maven-indexer 7+ we would get for free:

  - multi-threaded extraction (the filter would be hooked into this and 
would be multi-threaded too, assuming it is accepted upstream).

  - Lucene 9.6 uses Panama on JDK 19+ for memory-mapped storage, which 
also makes it a bit faster (and apparently safer, according to the PR). 
I have read that the devs are already excited about the Vector API :)
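
As a rough picture of a filter that is hooked into a multi-threaded 
extraction, consider the sketch below. It shows only the structure, not 
maven-indexer's pipeline; the chunking, the "year" field and the cutoff 
are assumptions for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def keep(record):
    """Hypothetical filter: keep records at or above a cutoff year."""
    return record["year"] >= 2021


def process_chunk(chunk):
    # Filtering happens inside the worker, so it adds no separate pass.
    return [record for record in chunk if keep(record)]


def extract_parallel(chunks, workers=4):
    """Filter chunks concurrently; map() returns results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(process_chunk, chunks)
        return [record for chunk in results for record in chunk]


chunks = [
    [{"id": "a", "year": 2019}, {"id": "b", "year": 2022}],
    [{"id": "c", "year": 2023}, {"id": "d", "year": 2015}],
]
survivors = extract_parallel(chunks)
```

Since each worker filters its own chunk, the filter scales with the 
number of extraction threads instead of becoming an extra serial step.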

This brings the extraction time of the *full* central index down to 
about 6 minutes on my (aging) machine. The weekly delta updates after 
that are much faster.

This all depends, of course, on whether the changes will be accepted 
upstream (and also on the JDK 8 problem, but we have other threads for 
that).


index related and already in master for NB 18:

https://github.com/apache/netbeans/pull/5655

https://github.com/apache/netbeans/pull/5646

blocked:

https://github.com/apache/netbeans/pull/4999

upstream in maven-indexer:

https://github.com/apache/maven-indexer/pull/302

another experiment:

https://github.com/apache/netbeans/pull/4971


best regards,

michael


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@netbeans.apache.org
For additional commands, e-mail: dev-help@netbeans.apache.org

For further information about the NetBeans mailing lists, visit:
https://cwiki.apache.org/confluence/display/NETBEANS/Mailing+lists




RE: maven indexing tweaks

Posted by Eirik Bakke <eb...@ultorg.com>.
> Also: if a lib is in your local .m2 folder already (even snapshots you build), it is in a separate index for local repos - this one isn't 
> filtered either. The index footprint there is also tiny (25 MB for a 4GB .m2 folder in my case)

Thanks for explaining! It sounds like the potential problem I was imagining is not too relevant.

-- Eirik

-----Original Message-----
From: Michael Bien <mb...@gmail.com> 
Sent: Saturday, March 18, 2023 10:37 PM
To: dev@netbeans.apache.org; Eirik Bakke <eb...@ultorg.com>
Subject: Re: maven indexing tweaks



Re: maven indexing tweaks

Posted by Michael Bien <mb...@gmail.com>.
On 18.03.23 14:41, Eirik Bakke wrote:
> - "sliding window" time filters, e.g drop all documents older than 2 years (aka: who uses old libraries?)
>
> Is "document" the same as "maven artifact" here?

Yeah, pretty much. A Lucene document is somewhat comparable to a row in a 
db (I think). The experiments I ran so far filtered everything which had 
a modification date field; those should only be on documents which 
describe artifacts. I used a tool called Luke to inspect the index, 
although I am not an expert in it. The raw data is also 14 GB, so you 
can't quickly look at it and know what is in it or what takes up most of 
the space.

Here are the fields which were once in the data; most aren't used anymore:

https://maven.apache.org/maven-indexer-archives/maven-indexer-LATEST/indexer-core/

E.g. removing the sha1 field cut the Lucene index almost in half, since 
those values don't compress very well or cause other overhead.
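
A quick way to see why hash values compress poorly is to compare them 
with repetitive text under a general-purpose compressor. The sketch 
below uses zlib rather than Lucene's own codecs, and the coordinates are 
made up, so it only illustrates the general effect.

```python
import hashlib
import zlib


def compression_ratio(values):
    """Compressed size divided by raw size for newline-joined values."""
    raw = "\n".join(values).encode("ascii")
    return len(zlib.compress(raw, 9)) / len(raw)


# Hypothetical coordinates: highly repetitive, like real group/artifact ids.
coordinates = [f"org.example.project:module-{i % 50}:1.{i}" for i in range(5000)]
# One sha1 hex digest per coordinate: effectively random characters.
hashes = [hashlib.sha1(c.encode("ascii")).hexdigest() for c in coordinates]

coordinate_ratio = compression_ratio(coordinates)
hash_ratio = compression_ratio(hashes)
```

The digests stay close to their information content however they are 
packed, while the coordinates shrink to a small fraction of their size.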


>   Perhaps an additional condition could be added, "older than 1 year _and_ there are newer versions of this artifact in the cache".

The proposal upstream does not filter the "cache"; that would be slow 
and would not have the desired effect of a reduced on-disk footprint 
unless the index is rebuilt (since Lucene storage uses immutable files). 
It filters during extraction of the raw data of the remote index, before 
it is put into a Lucene index which represents the remote repository 
(e.g. central, but it works with any other too: Apache, a 
company-internal one, etc.). Another advantage is that the filter would 
be a step in an already multi-threaded extraction pipeline, without 
extra steps.

So if you set the cutoff filter to 2 years and use the same cache for 1 
year, there are going to be 3 years of artifact metadata in your index.
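
The same arithmetic with dates, as a small sketch. It assumes the cutoff 
only decides what enters the index at extraction time and that nothing 
is removed afterwards; the function name and the dates are made up for 
the example.

```python
from datetime import date


def covered_years(extracted_on, cutoff_years, today):
    """Years of metadata held by an index filtered once at extraction.

    Assumes records that passed the cutoff are never removed later, so
    the oldest record keeps ageing together with the cache.
    """
    oldest = extracted_on.replace(year=extracted_on.year - cutoff_years)
    return (today - oldest).days / 365.25


span = covered_years(date(2023, 3, 17), cutoff_years=2, today=date(2024, 3, 17))
```

With a 2-year cutoff and a cache that is one year old, the span comes 
out at roughly 3 years.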

Also: if a lib is in your local .m2 folder already (even snapshots you 
build), it is in a separate index for local repos - this one isn't 
filtered either. The index footprint there is also tiny (25 MB for a 4GB 
.m2 folder in my case)

-mbien






RE: maven indexing tweaks

Posted by Eirik Bakke <eb...@ultorg.com>.
> - "sliding window" time filters, e.g drop all documents older than 2 years (aka: who uses old libraries?)

Is "document" the same as "maven artifact" here? Perhaps an additional condition could be added, "older than 1 year _and_ there are newer versions of this artifact in the cache".

-- Eirik

-----Original Message-----
From: Michael Bien <mb...@gmail.com> 
Sent: Friday, March 17, 2023 8:26 PM
To: dev@netbeans.apache.org; Antonio <an...@vieiro.net>
Subject: Re: maven indexing tweaks






Re: maven indexing tweaks

Posted by Michael Bien <mb...@gmail.com>.
On 17.03.23 22:38, Antonio wrote:
> Hi,
>
> These are impressive savings!

Yeah, I am pretty happy about the results too. Especially the removal of 
the sha1 field had a great effect. Technically we do offer this as a 
query through the public API; however, it doesn't appear that anything is 
using it. I have to take another look just to be sure. Even if 
something does, we could make it an option in the settings.


>
>
> Out of curiosity, we don't build the index incrementally using Maven's 
> IndexReader, do we? That's why we download the whole index, right?

First use will download the whole copy; weekly updates are incremental. 
And yes, it uses DefaultIndexReader (and the updater) of the 
maven-indexer project.

That is why we have to make some tweaks upstream to get more 
flexibility (and filtering). For example, at some point in the future we 
might want to change where the temp extraction storage used by 
maven-indexer is located, which is also part of the PR proposed upstream 
right now.

https://repo1.maven.org/maven2/.index/ has the compressed data for 
central (Apache etc. have their own locations, but those indices are 
smaller, so you barely notice anything).

Currently the Lucene index isn't moved into the new NetBeans config from 
old caches. This is something we could take a look at too, but things 
like this are super annoying to test and risky, since someone will find a 
way to import an index from a 10-year-old backup and report that 
something fails (just like users who try to import nb-javac from NB 
12.x, which breaks pretty much everything).

-mbien






Re: maven indexing tweaks

Posted by Antonio <an...@vieiro.net.INVALID>.
Hi,

These are impressive savings!

Out of curiosity, we don't build the index incrementally using Maven's 
IndexReader, do we? That's why we download the whole index, right?

Thanks,
Antonio


[1]

https://maven.apache.org/maven-indexer/indexer-reader/apidocs/org/apache/maven/index/reader/IndexReader.html

