You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/11/07 22:00:12 UTC

[GitHub] [lucene] benwtrent opened a new pull request, #11905: Fix integer overflow when seeking the vector index for connections

benwtrent opened a new pull request, #11905:
URL: https://github.com/apache/lucene/pull/11905

   This bug has been around since 9.1. It relates directly to the number of nodes that are contained in the level 0 of the HNSW graph. Since level 0 contains all the nodes, this implies the following:
   
    - In Lucene 9.1, the bug probably would have appeared once `31580641` (Integer.MAX_VALUE/(maxConn + 1)) vectors were in a single segment
    - In Lucene 9.2+, the bug appears when there are `16268814` (Integer.MAX_VALUE/(M * 2 + 1)) or more vectors in a single segment.
   
   The stack trace would indicate an EOF failure as Lucene attempts to `seek` to a negative number in `ByteBufferIndexInput`.
   
   This commit fixes the type casting and utilizes `Math.exact...` in the number multiplication and addition. The overhead here is minimal as these calculations are done in constructors and then used repeatably afterwards.
   
   
   I put fixes in the older codecs, I don't know if that is typically done, but if somebody has a large segment and wants to read the vectors, they could build this jar and read them now (the bug is only on read and data layout is unchanged)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] benwtrent commented on pull request #11905: Fix integer overflow when seeking the vector index for connections

Posted by GitBox <gi...@apache.org>.
benwtrent commented on PR #11905:
URL: https://github.com/apache/lucene/pull/11905#issuecomment-1307208338

   > We have to start building up tests for these cases because this seems like deja vu as far as int overflows in this area.
   
   I am right there with ya @rmuir. 100% feels like "whackamole". 
   
   > Looks like we need more vectors but they can have less dimensions to trigger this one? 
   
   Yeah, we can probably trigger this overflow by using `16268815` `byte` vectors of few dimensions. Something as small as 2 dimensions could work. 
   
   One issue with HNSW is that completely random vectors can make it run dog-slow on index. Maybe having few dimensions could alleviate this.
   
   > but its still pretty slow so I'm gonna leave it running.
   
   Thanks for digging into writing a test up. I am thinking on how to test it. I initially was thinking about moving all these numeric calculations outside of the I/O path. But that pattern is not prevalent in Lucene. Also, doing this type of "unit tests" that wouldn't ACTUALLY be using large amounts of data still won't solve our lack of coverage in larger test scenarios.
   
   ---------
   
   As an aside, all these scenarios fixed here @rmuir had a Java auto-cast warning. Running IntelliJ's analyzer there are over 100 issues of integer multiplication being auto-cast to long. It may be prudent to dig through these warnings and see if they need fixing. What do you think?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #11905: Fix integer overflow when seeking the vector index for connections

Posted by GitBox <gi...@apache.org>.
rmuir commented on PR #11905:
URL: https://github.com/apache/lucene/pull/11905#issuecomment-1309200544

   and thanks for help battling the testing. it will get better!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #11905: Fix integer overflow when seeking the vector index for connections

Posted by GitBox <gi...@apache.org>.
rmuir commented on PR #11905:
URL: https://github.com/apache/lucene/pull/11905#issuecomment-1307727467

   > * In Lucene 9.2+, the bug appears when there are `16268814` (Integer.MAX_VALUE/(M * 2 + 1)) or more vectors in a single segment.
   
   If this is correct we should just be able to create `new Lucene94HnswVectorsFormat(BIG_ASS_M, defaultBeamWidth)` in the test and only use a few documents to trigger this overflow? could be a normal unit test. I'm not having luck reproducing this issue with a monster test.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] benwtrent commented on pull request #11905: Fix integer overflow when seeking the vector index for connections

Posted by GitBox <gi...@apache.org>.
benwtrent commented on PR #11905:
URL: https://github.com/apache/lucene/pull/11905#issuecomment-1307744298

   @rmuir Thinking outside the box! I will try that. It would definitely cause the graph offset calculation to be completely blown out of proportion! Which is the cause of this overflow.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] benwtrent commented on pull request #11905: Fix integer overflow when seeking the vector index for connections

Posted by GitBox <gi...@apache.org>.
benwtrent commented on PR #11905:
URL: https://github.com/apache/lucene/pull/11905#issuecomment-1309129512

   Updated the monster test. 
   
   Ran about 2hrs on my laptop (but I was working, in virtual meetings, etc. during the entire run). 
   
   Confirmed it fails without this patch. This patch allows it to pass.
   
   @rmuir @jdconrad 
   
   Would be good to get this in a 9.4.2 patch.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #11905: Fix integer overflow when seeking the vector index for connections

Posted by GitBox <gi...@apache.org>.
rmuir commented on PR #11905:
URL: https://github.com/apache/lucene/pull/11905#issuecomment-1307756832

   Yes, if such a test works it may at least prevent similar regressions. 
   
   Another possible idea is to give every vector value of 0, then zip up the index, it should be ~16MB of zeros which would get compressed away. But then we have to regenerate the index (e.g. like backwards-codecs tests) which could be a hassle.
   
   I still havent given up on the monster test, as a backup plan. I tried `16268814` and `16268815` but neither failed. I'm trying `20000000` right now, and configured tests to keep the index afterwards. if it doesn't fail I will start messing with checkindex code...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir merged pull request #11905: Fix integer overflow when seeking the vector index for connections

Posted by GitBox <gi...@apache.org>.
rmuir merged PR #11905:
URL: https://github.com/apache/lucene/pull/11905


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] dweiss commented on pull request #11905: Fix integer overflow when seeking the vector index for connections

Posted by GitBox <gi...@apache.org>.
dweiss commented on PR #11905:
URL: https://github.com/apache/lucene/pull/11905#issuecomment-1307230679

   There's a whole bunch of automated checks you could go through, selectively, and try to enable them for the future. This includes IntLongMath, which is currently off.
   
   https://github.com/apache/lucene/blob/main/gradle/validation/error-prone.gradle#L71-L183


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #11905: Fix integer overflow when seeking the vector index for connections

Posted by GitBox <gi...@apache.org>.
rmuir commented on PR #11905:
URL: https://github.com/apache/lucene/pull/11905#issuecomment-1308697281

   yeah its your search at the end that triggers the issue, because checkindex on vectors is currently too wimpy and doesn't ever call seek(). this is also an issue that we must address here.
   
   you can make the test much faster by sticking with just one dimension and passing some additional args. The one on the other PR runs in 1.5 hours in its current config. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] benwtrent commented on pull request #11905: Fix integer overflow when seeking the vector index for connections

Posted by GitBox <gi...@apache.org>.
benwtrent commented on PR #11905:
URL: https://github.com/apache/lucene/pull/11905#issuecomment-1308664186

   @rmuir i took your test, modified it slightly (changing number of vectors and the assertion). It ran for 3.5 hours and failed on the old code in the exactly correct spot (overflowing on graph location).
   
   I am running it again with my fix to verify the fix. Thank you for iterating on this with me.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #11905: Fix integer overflow when seeking the vector index for connections

Posted by GitBox <gi...@apache.org>.
rmuir commented on PR #11905:
URL: https://github.com/apache/lucene/pull/11905#issuecomment-1307760820

   @jdconrad helped with some math that may explain why previous tests didnt fail:
   ```
   jshell> int M = 16;
   M ==> 16
   
   jshell> long v1 = (1 + (M*2)) * 4 * 16268814;
   v1 ==> 2147483448
   
   jshell> long v2 = (1 + (M*2)) * 4 * 16268815;
   v2 ==> 2147483580
   
   jshell> long v3 = (1 + (M*2)) * 4 * 20000000;
   v3 ==> -1654967296
   ```
   
   So maybe this run with 20M docs will trip. I'll know in a few hours :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #11905: Fix integer overflow when seeking the vector index for connections

Posted by GitBox <gi...@apache.org>.
rmuir commented on PR #11905:
URL: https://github.com/apache/lucene/pull/11905#issuecomment-1306713558

   ok i tried to make a stab at a test in that draft PR, but its still pretty slow so I'm gonna leave it running. we have to start building up tests for these cases because this seems like deja vu as far as int overflows in this area.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #11905: Fix integer overflow when seeking the vector index for connections

Posted by GitBox <gi...@apache.org>.
rmuir commented on PR #11905:
URL: https://github.com/apache/lucene/pull/11905#issuecomment-1307821765

   With the 20M docs it still didnt fail. I have the index saved so i can play around, maybe checkindex doesnt trigger what is needed here (e.g. advance vs next). 
   
   It is a little crazy that this index has 2.5GB .vex file that, if i run `zip`, deflates 98% down to 75MB. very wasteful.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #11905: Fix integer overflow when seeking the vector index for connections

Posted by GitBox <gi...@apache.org>.
rmuir commented on PR #11905:
URL: https://github.com/apache/lucene/pull/11905#issuecomment-1309198710

   nice. i'm fine with the changes. We can open another issue to fix the checkindex stuff. I do really think we should do that before releasing to look for more trouble. Also good if we can go through a candidate list of other potential problems (even if its noisy, its still less than entire codebase).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #11905: Fix integer overflow when seeking the vector index for connections

Posted by GitBox <gi...@apache.org>.
rmuir commented on PR #11905:
URL: https://github.com/apache/lucene/pull/11905#issuecomment-1306601477

   Not good: thanks for splitting this out from your other PR! wonder if we can start cooking up something similar to #11867 ? Looks like we need more vectors but they can have less dimensions to trigger this one? Maybe they can be artificially generated so we don't need a dataset?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #11905: Fix integer overflow when seeking the vector index for connections

Posted by GitBox <gi...@apache.org>.
rmuir commented on PR #11905:
URL: https://github.com/apache/lucene/pull/11905#issuecomment-1307289709

   > Yeah, we can probably trigger this overflow by using 16268815 byte vectors of few dimensions. Something as small as 2 dimensions could work.
   
   > One issue with HNSW is that completely random vectors can make it run dog-slow on index. Maybe having few dimensions could alleviate this.
   
   We may need to modify the draft test then to trigger the bug. I used only one dimension and simple `docid % 256` to assign vector value. I also only used `16268814` documents so it may need another one. I also am unsure if CheckIndex at the end will trigger the issue you describe, maybe it only calls `next` and not `advance` or something like that. In such a case, we may need to fix CheckIndex to do some "advancing", too. It does a similar thing already for the postings, see `// Test skipping` section  of the `checkFields()` method. So its definitely all a WIP
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] benwtrent commented on pull request #11905: Fix integer overflow when seeking the vector index for connections

Posted by GitBox <gi...@apache.org>.
benwtrent commented on PR #11905:
URL: https://github.com/apache/lucene/pull/11905#issuecomment-1307833883

   > It is a little crazy that this index has 2.5GB .vex file that, if i run zip, deflates 98% down to 75MB. very wasteful.
   
   Agreed :). Once this stuff is solved, I hope to further iterate on reducing the size of the vector index. There is LOTS of room for improvement there.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org