Posted to user@lucenenet.apache.org by BradelSablink <Br...@protonmail.com.INVALID> on 2023/01/09 01:27:42 UTC

Speed up indexing in Lucene.net 3.0.3

I use Lucene.net 3.0.3 in an audio fingerprinting project and was wondering how I could improve the indexing speed. It takes ~1 week to build indexes of subfingerprints for 7+ million songs on a 32-core system with 64 GB of RAM, and I see that only one CPU core is doing 100% of the indexing. How can I use multiple cores to speed up indexing? Or maybe there's a better way to speed it up? I'm a Lucene.net novice compared to all of you, so thank you for any help. The area in question where indexing is slow: https://github.com/nelemans1971/AudioFingerprinting/blob/master/CreateInversedFingerprintIndex/Worker.cs#L237

RE: Speed up indexing in Lucene.net 3.0.3

Posted by Shad Storhaug <sh...@shadstorhaug.com>.
> Explore batching things so one Task handles adding 25(?) Documents at a time so there's less switching between Tasks:
>
> https://stackoverflow.com/questions/13731796/create-batches-in-linq
>
> // I hope this formats correctly, I'm not replying through a traditional email client
> foreach (var batch in dt.Rows.Cast<DataRow>().Batch(25))
> {
>   tasks.Add(Task.Run(() =>
>   {
>     foreach (DataRow row in batch)
>     {
>       Document doc = new Document();

The problem either way (batching or not) is that you are allocating a new Document() object on each loop iteration, which causes a lot of memory pressure when you have millions of rows to index. The GC may be triggered many times to clean up all of these Document instances, which will make indexing very slow.

Instead, move the creation of the Document instance out of the loop and reuse the same instance inside of the loop (resetting/clearing it each time, of course). Lucene.NET will simply serialize the data inside of the document instance. Once it is done with the instance, it is fine to reuse it.

This will also work in batches, but then you need to create an array of Document objects, hold onto the array between loops, reset the array between loops, etc. Although there is more complexity doing it this way, it may perform better than one document at a time, and I would suggest running benchmarks to determine which approach is better.
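A minimal sketch of the reuse pattern described above, assuming the Field.SetValue(string) method that Lucene.NET 3.0.3 ports from Java Lucene's setValue; the field names and the dt/writer variables are illustrative, not taken from the actual project:

```csharp
using System.Data;
using Lucene.Net.Documents;

// Allocate the Document and its Fields once, outside the loop.
// "songId" and "fingerprint" are hypothetical field names.
var doc = new Document();
var songIdField = new Field("songId", "", Field.Store.YES, Field.Index.NOT_ANALYZED);
var fingerprintField = new Field("fingerprint", "", Field.Store.NO, Field.Index.ANALYZED);
doc.Add(songIdField);
doc.Add(fingerprintField);

foreach (DataRow row in dt.Rows)
{
    // Reset the existing fields instead of allocating new objects.
    songIdField.SetValue(row["songId"].ToString());
    fingerprintField.SetValue(row["fingerprint"].ToString());
    writer.AddDocument(doc); // the writer serializes the current field values
}
```

Because AddDocument copies the field data into the index buffers, the same Document instance can be refilled on the next iteration.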

-----Original Message-----
From: Ron Grabowski <rg...@apache.org> 
Sent: Thursday, February 8, 2024 10:14 AM
To: user@lucenenet.apache.org
Subject: Re: Speed up indexing in Lucene.net 3.0.3

I saw you improved the code and got good results! The PR is too big to leave a comment on. My guess is that fingerprinting the file is slow and adding the results to Lucene is fast. Some more ideas:

1)

Run a code profiler to figure out exactly what is slow and concentrate on fixing that.

2)

The current implementation submits 500 Tasks; experiment with raising or lowering that number. Loading that many rows into memory at once may also be slow.

3)

Implement the original idea in https://pastebin.com/g0QKhCb1 to experiment with setting a limit on the Task pool size based on your hardware. The runtime may be scheduling too many things to run at once.
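One way to cap concurrency is to gate Task creation with a SemaphoreSlim. This is a sketch of the idea, not the pastebin code; IndexRow is a hypothetical stand-in for the per-row fingerprint and AddDocument work:

```csharp
using System.Collections.Generic;
using System.Data;
using System.Threading;
using System.Threading.Tasks;

var gate = new SemaphoreSlim(8); // tune to your hardware, e.g. 8-16
var tasks = new List<Task>();

foreach (DataRow row in dt.Rows)
{
    gate.Wait(); // block until one of the 8 slots frees up
    tasks.Add(Task.Run(() =>
    {
        try
        {
            IndexRow(row); // hypothetical: fingerprint + writer.AddDocument
        }
        finally
        {
            gate.Release();
        }
    }));
}
Task.WaitAll(tasks.ToArray());
```

This keeps at most 8 Tasks running at once regardless of how many rows are queued, so the scheduler isn't juggling hundreds of runnable Tasks.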

4)

Explore batching things so one Task handles adding 25(?) Documents at a time so there's less switching between Tasks:

https://stackoverflow.com/questions/13731796/create-batches-in-linq

// I hope this formats correctly, I'm not replying through a traditional email client
foreach (var batch in dt.Rows.Cast<DataRow>().Batch(25))
{
    tasks.Add(Task.Run(() =>
    {
        foreach (DataRow row in batch)
        {
            Document doc = new Document();
            // ... populate fields from row, then writer.AddDocument(doc)
        }
    }));
}

5)

Create a pipeline so you're constantly pulling from the database, then fingerprinting, then indexing. The current implementation stops processing each time another 500 records are queried from the database. A BlockingCollection could be used so that one producer extracts from the database, many consumers fingerprint, and a few consumers add to Lucene.
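A sketch of such a pipeline with BlockingCollection; FetchRows, ComputeFingerprint, AddToIndex, and the Fingerprint type are hypothetical stand-ins, and the stage counts are guesses to tune:

```csharp
using System.Collections.Concurrent;
using System.Data;
using System.Linq;
using System.Threading.Tasks;

// Bounded buffers so the producer can't race ahead of the consumers.
var rows = new BlockingCollection<DataRow>(boundedCapacity: 1000);
var fingerprints = new BlockingCollection<Fingerprint>(boundedCapacity: 1000);

var producer = Task.Run(() =>
{
    foreach (DataRow row in FetchRows())   // streams rows from the database
        rows.Add(row);                     // blocks when the buffer is full
    rows.CompleteAdding();
});

var workers = Enumerable.Range(0, 8).Select(_ => Task.Run(() =>
{
    foreach (DataRow row in rows.GetConsumingEnumerable())
        fingerprints.Add(ComputeFingerprint(row)); // CPU-heavy step, in parallel
})).ToArray();

var indexers = Enumerable.Range(0, 2).Select(_ => Task.Run(() =>
{
    foreach (Fingerprint fp in fingerprints.GetConsumingEnumerable())
        AddToIndex(fp);                    // IndexWriter is thread-safe
})).ToArray();

producer.Wait();
Task.WaitAll(workers);
fingerprints.CompleteAdding();
Task.WaitAll(indexers);
```

With bounded capacities, every stage stays busy: the database read, the fingerprinting, and the Lucene writes overlap instead of alternating in 500-record bursts.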

On 2023/01/15 07:08:50 Ron Grabowski wrote:
> Sounds like more of a producer/consumer problem than a Lucene.net problem. Here's some untested pseudo-code showing how to create a Task pool with a configurable size of 4 workers (8-16 might be better on your hardware). Tasks are quickly submitted to the pool, then the pool works through them four Tasks at a time until all Tasks complete:
> 
> https://pastebin.com/g0QKhCb1
> 
> According to https://lucenenet.apache.org/docs/3.0.3/class_lucene_1_1_net_1_1_index_1_1_index_writer.html#details IndexWriter is thread-safe. Note that I reduced locking on the counter by only updating it at the end of each small batch, not after each Document was added. Batch size could change from 5000 to 2500 for more frequent status updates.
> 
> On 2023/01/09 01:27:42 BradelSablink wrote:
> > I use Lucene.net 3.0.3 in an audio fingerprinting project and was 
> > wondering how I could improve the indexing speed? It takes ~1 week 
> > to make indexes of subfingerprints for 7+ million songs on a 32 core 
> > system with 64GB ram. I see that only 1 CPU core is doing 100% of 
> > the indexing. How can I use multiple cores to speed up indexing? Or 
> > maybe there's a better way to speed it up? I'm a Lucene.net novice 
> > compared to all of you so thank you for any help. The area in 
> > question where indexing is slow:
> > https://github.com/nelemans1971/AudioFingerprinting/blob/master/CreateInversedFingerprintIndex/Worker.cs#L237
> 
