Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/03/03 15:20:30 UTC
[GitHub] [lucene] mayya-sharipova commented on pull request #728: LUCENE-10194 Buffer KNN vectors on disk
mayya-sharipova commented on pull request #728:
URL: https://github.com/apache/lucene/pull/728#issuecomment-1058148842
I've benchmarked this PR with ann-benchmarks on glove-100-angular (M:16, efConstruction:100):
- baseline: main branch with RAMBufferSizeMB left unset, so it defaults to **16 MB**; segments are force merged to 1 at the end.
- candidate: this PR, with RAMBufferSizeMB likewise set to **16 MB**, also with a force merge at the end.
**Indexing**
- baseline took 1099 secs, around **18 mins**
- candidate took 586 secs, around **10 mins**
- search performance is the same.
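
To put the two build times side by side, here is a small sketch of the arithmetic. It uses only the rounded wall-clock figures quoted above; the variable names are illustrative and not part of ann-benchmarks or Lucene.

```python
# Wall-clock build times reported above, in seconds (rounded).
baseline_secs = 1099
candidate_secs = 586

# Ratio of the two build times and the relative time saved.
speedup = baseline_secs / candidate_secs
reduction = (baseline_secs - candidate_secs) / baseline_secs

print(f"baseline:  {baseline_secs / 60:.1f} min")
print(f"candidate: {candidate_secs / 60:.1f} min")
print(f"speedup:   {speedup:.2f}x")
print(f"reduction: {reduction:.1%}")
```

On these numbers the candidate builds the index roughly 1.88x faster, which is about 47% less wall-clock time.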
<details>
<summary>Details on the search performance </summary>
</details>
<details>
<summary>Details on the candidate </summary>
Indexing output
```txt
IW 0 [2022-03-03T14:30:49.413950Z; main]: init: create=true reader=null
ramBufferSizeMB=16.0
maxBufferedDocs=-1
IW 0 [2022-03-03T14:30:49.424202Z; main]: MMapDirectory.UNMAP_SUPPORTED=true
Done indexing 1183514 documents; now flush
IW 0 [2022-03-03T14:30:50.824200Z; main]: now flush at close
IW 0 [2022-03-03T14:30:50.824401Z; main]: start flush: applyAllDeletes=true
IW 0 [2022-03-03T14:30:50.824515Z; main]: index before flush
DW 0 [2022-03-03T14:30:50.824557Z; main]: startFullFlush
DW 0 [2022-03-03T14:30:50.827209Z; main]: anyChanges? numDocsInRam=1183514 deletes=false hasTickets:false pendingChangesInFullFlush: false
DWPT 0 [2022-03-03T14:30:50.831053Z; main]: flush postings as segment _0 numDocs=1183514
HNSW 0 [2022-03-03T14:30:52.334343Z; main]: build graph from 1183514 vectors
...
HNSW 0 [2022-03-03T14:40:31.049504Z; main]: built 1180000 in 5585/578724 ms
...
IW 0 [2022-03-03T14:40:33.492318Z; main]: 582671 msec to write vectors
IFD 0 [2022-03-03T14:40:34.655718Z; main]: 20 msec to checkpoint
Indexed 1183514 documents in 585s
Force merge index in luceneknn-100-16-100.train-16-100.index
IFD 1 [2022-03-03T14:40:34.671943Z; main]: 0 msec to checkpoint
Built index in 586.944657087326
```
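
The timestamps in the log above can be used to cross-check where the time goes. The following is a rough sketch, assuming the `[timestamp; thread]` prefix format seen in these lines; the regex and helper name are made up for illustration and are not a Lucene API.

```python
import re
from datetime import datetime

# Two lines copied from the candidate log above.
LOG = """\
HNSW 0 [2022-03-03T14:30:52.334343Z; main]: build graph from 1183514 vectors
IW 0 [2022-03-03T14:40:33.492318Z; main]: 582671 msec to write vectors
"""

# Assumed prefix shape: "[<ISO timestamp>Z; <thread name>]".
STAMP = re.compile(r"\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+)Z; [^\]]*\]")


def parse_stamp(line):
    """Extract the timestamp from one log line (hypothetical helper)."""
    match = STAMP.search(line)
    if match is None:
        raise ValueError(f"no timestamp in line: {line!r}")
    return datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f")


lines = LOG.splitlines()
graph_start = parse_stamp(lines[0])
vectors_written = parse_stamp(lines[1])
elapsed_secs = (vectors_written - graph_start).total_seconds()

# Share of the total build spent writing vectors, using the figures
# reported in the log (582671 msec out of 586.944657087326 secs).
write_share = (582671 / 1000) / 586.944657087326

print(f"graph build start to vectors written: {elapsed_secs:.1f} s")
print(f"share of total build in vector writing: {write_share:.1%}")
```

By this estimate nearly all of the candidate's build time is spent between the start of the HNSW graph build and the end of the vector write.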
**Files in the index**
```txt
0 -rw-r--r-- 1 mayyasharipova staff 0B 3 Mar 14:30 _0.fdm
10080 -rw-r--r-- 1 mayyasharipova staff 4.6M 3 Mar 14:30 _0.fdt
0 -rw-r--r-- 1 mayyasharipova staff 0B 3 Mar 14:30 _0_Lucene90FieldsIndex-doc_ids_0.tmp
0 -rw-r--r-- 1 mayyasharipova staff 0B 3 Mar 14:30 _0_Lucene90FieldsIndexfile_pointers_1.tmp
929304 -rw-r--r-- 1 mayyasharipova staff 451M 3 Mar 14:30 _0_Lucene91HnswVectorsFormat_0.vec
924624 -rw-r--r-- 1 mayyasharipova staff 451M 3 Mar 14:30 _0_Lucene91HnswVectorsFormat_0.vec_temp_3.tmp
0 -rw-r--r-- 1 mayyasharipova staff 0B 3 Mar 14:30 _0_Lucene91HnswVectorsFormat_0.vem
0 -rw-r--r-- 1 mayyasharipova staff 0B 3 Mar 14:30 _0_Lucene91HnswVectorsFormat_0.vex
953168 -rw-r--r-- 1 mayyasharipova staff 451M 3 Mar 14:30 _0_knn_buffered_vectors_temp_2.tmp
0 -rw-r--r-- 1 mayyasharipova staff 0B 3 Mar 14:30 write.lock
```
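
As a sanity check on the 451M files in the listing, the raw vector size can be estimated from the document count. This is a back-of-the-envelope sketch; it assumes 100 dimensions (from the dataset name glove-100-angular), 4 bytes per float value, and that the listing reports sizes in binary units.

```python
# Assumptions: 100 dims (glove-100-angular), 4-byte float32 values,
# and file sizes in the listing shown in MiB (2**20 bytes).
num_vectors = 1_183_514  # "Done indexing 1183514 documents"
dims = 100
bytes_per_float = 4

raw_bytes = num_vectors * dims * bytes_per_float
raw_mib = raw_bytes / 2**20

# The listing shows three files of about this size at the same time:
# the .vec file, its temp copy, and the buffered-vectors temp file.
copies_in_listing = 3
total_mib = copies_in_listing * raw_mib

print(f"raw vector data: {raw_bytes} bytes = {raw_mib:.1f} MiB")
print(f"three copies:    {total_mib:.0f} MiB")
```

The estimate comes to about 451 MiB per copy, which is consistent with the sizes shown for the `.vec` file and the two temp files.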
</details>
<details>
<summary>Details on the baseline </summary>
Indexing output
```txt
Built index in 1099.0846738815308
```
**Files in the index**
```txt
drwxr-xr-x 12 mayyasharipova staff 384B 3 Mar 15:14 .
drwxr-xr-x 42 mayyasharipova staff 1.3K 3 Mar 15:14 ..
-rw-r--r-- 1 mayyasharipova staff 201B 3 Mar 15:03 _w.fdm
-rw-r--r-- 1 mayyasharipova staff 4.6M 3 Mar 15:03 _w.fdt
-rw-r--r-- 1 mayyasharipova staff 3.5K 3 Mar 15:03 _w.fdx
-rw-r--r-- 1 mayyasharipova staff 192B 3 Mar 15:14 _w.fnm
-rw-r--r-- 1 mayyasharipova staff 532B 3 Mar 15:14 _w.si
-rw-r--r-- 1 mayyasharipova staff 451M 3 Mar 15:14 _w_Lucene91HnswVectorsFormat_0.vec
-rw-r--r-- 1 mayyasharipova staff 309K 3 Mar 15:14 _w_Lucene91HnswVectorsFormat_0.vem
-rw-r--r-- 1 mayyasharipova staff 82M 3 Mar 15:14 _w_Lucene91HnswVectorsFormat_0.vex
-rw-r--r-- 1 mayyasharipova staff 154B 3 Mar 15:14 segments_2
-rw-r--r-- 1 mayyasharipova staff 0B 3 Mar 14:56 write.lock
```
</details>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org