Posted to dev@lucene.apache.org by "Adrien Grand (JIRA)" <ji...@apache.org> on 2017/03/03 16:34:46 UTC
[jira] [Updated] (LUCENE-7730) Better encode length normalization in similarities

     [ https://issues.apache.org/jira/browse/LUCENE-7730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-7730:
---------------------------------
    Attachment: LUCENE-7730.patch

Here's a patch that does the following:
 - adds {{LeafReader.getIndexInfos()}}, which bundles the index created version and the index sort so that the number of methods on LeafReader stays contained. This way, similarities can decide how to decode norms based on the created version.
 - adds indexCreatedVersion to {{FieldInvertedState}} so that similarities can decide how to encode norms based on the created version.
 - Given that readers now know about their created version, I improved {{IndexWriter.addIndexes(CodecReader...)}} to fail when a reader that was created with a different version is added.
 - SimilarityBase and BM25Similarity now directly encode the length (rather than {{1/sqrt(length)}}) in a way that preserves 4 significant bits across the whole integer range and is exact for lengths up to 40. ClassicSimilarity is left unmodified, however.
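
For context on the last point, here is a rough sketch of how lossy the previous {{1/sqrt(length)}} scheme can be for short fields. It assumes the factor is stored as an 8-bit float with the layout I believe {{SmallFloat.floatToByte315}} uses, reimplemented here so the snippet is self-contained. The class and method names are hypothetical and this is not code from the patch.

```java
/** Hypothetical sketch of the previous scheme; names and layout are assumptions. */
final class OldNormSketch {
  // Assumed 8-bit float layout, believed to mirror SmallFloat.floatToByte315.
  static byte floatToByte315(float f) {
    int bits = Float.floatToRawIntBits(f);
    int small = bits >> (24 - 3);
    int zero = (63 - 15) << 3;
    if (small <= zero) {
      return (bits <= 0) ? (byte) 0 : (byte) 1; // underflow
    }
    if (small >= zero + 0x100) {
      return (byte) -1; // overflow
    }
    return (byte) (small - zero);
  }

  // Inverse of the layout above.
  static float byte315ToFloat(byte b) {
    if (b == 0) {
      return 0f;
    }
    int bits = (b & 0xff) << (24 - 3);
    bits += (63 - 15) << 24;
    return Float.intBitsToFloat(bits);
  }

  /** Stores 1/sqrt(length) in one byte. */
  static byte encodeOld(int length) {
    return floatToByte315((float) (1 / Math.sqrt(length)));
  }

  /** Recovers an approximate length from the stored byte. */
  static float decodeOldLength(byte norm) {
    float f = byte315ToFloat(norm);
    return 1 / (f * f);
  }
}
```

Under these assumptions, lengths 3 and 4 end up with the same byte, so scoring cannot tell them apart, whereas encoding the length directly keeps every length up to 40 distinct.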

Here is a table of the decoded length for every possible byte value. In effect, each length is rounded down to the nearest value in this table.
|| Byte & 0xff || Length ||
|0|0|
|1|1|
|2|2|
|3|3|
|4|4|
|5|5|
|6|6|
|7|7|
|8|8|
|9|9|
|10|10|
|11|11|
|12|12|
|13|13|
|14|14|
|15|15|
|16|16|
|17|17|
|18|18|
|19|19|
|20|20|
|21|21|
|22|22|
|23|23|
|24|24|
|25|25|
|26|26|
|27|27|
|28|28|
|29|29|
|30|30|
|31|31|
|32|32|
|33|33|
|34|34|
|35|35|
|36|36|
|37|37|
|38|38|
|39|39|
|40|40|
|41|42|
|42|44|
|43|46|
|44|48|
|45|50|
|46|52|
|47|54|
|48|56|
|49|60|
|50|64|
|51|68|
|52|72|
|53|76|
|54|80|
|55|84|
|56|88|
|57|96|
|58|104|
|59|112|
|60|120|
|61|128|
|62|136|
|63|144|
|64|152|
|65|168|
|66|184|
|67|200|
|68|216|
|69|232|
|70|248|
|71|264|
|72|280|
|73|312|
|74|344|
|75|376|
|76|408|
|77|440|
|78|472|
|79|504|
|80|536|
|81|600|
|82|664|
|83|728|
|84|792|
|85|856|
|86|920|
|87|984|
|88|1048|
|89|1176|
|90|1304|
|91|1432|
|92|1560|
|93|1688|
|94|1816|
|95|1944|
|96|2072|
|97|2328|
|98|2584|
|99|2840|
|100|3096|
|101|3352|
|102|3608|
|103|3864|
|104|4120|
|105|4632|
|106|5144|
|107|5656|
|108|6168|
|109|6680|
|110|7192|
|111|7704|
|112|8216|
|113|9240|
|114|10264|
|115|11288|
|116|12312|
|117|13336|
|118|14360|
|119|15384|
|120|16408|
|121|18456|
|122|20504|
|123|22552|
|124|24600|
|125|26648|
|126|28696|
|127|30744|
|128|32792|
|129|36888|
|130|40984|
|131|45080|
|132|49176|
|133|53272|
|134|57368|
|135|61464|
|136|65560|
|137|73752|
|138|81944|
|139|90136|
|140|98328|
|141|106520|
|142|114712|
|143|122904|
|144|131096|
|145|147480|
|146|163864|
|147|180248|
|148|196632|
|149|213016|
|150|229400|
|151|245784|
|152|262168|
|153|294936|
|154|327704|
|155|360472|
|156|393240|
|157|426008|
|158|458776|
|159|491544|
|160|524312|
|161|589848|
|162|655384|
|163|720920|
|164|786456|
|165|851992|
|166|917528|
|167|983064|
|168|1048600|
|169|1179672|
|170|1310744|
|171|1441816|
|172|1572888|
|173|1703960|
|174|1835032|
|175|1966104|
|176|2097176|
|177|2359320|
|178|2621464|
|179|2883608|
|180|3145752|
|181|3407896|
|182|3670040|
|183|3932184|
|184|4194328|
|185|4718616|
|186|5242904|
|187|5767192|
|188|6291480|
|189|6815768|
|190|7340056|
|191|7864344|
|192|8388632|
|193|9437208|
|194|10485784|
|195|11534360|
|196|12582936|
|197|13631512|
|198|14680088|
|199|15728664|
|200|16777240|
|201|18874392|
|202|20971544|
|203|23068696|
|204|25165848|
|205|27263000|
|206|29360152|
|207|31457304|
|208|33554456|
|209|37748760|
|210|41943064|
|211|46137368|
|212|50331672|
|213|54525976|
|214|58720280|
|215|62914584|
|216|67108888|
|217|75497496|
|218|83886104|
|219|92274712|
|220|100663320|
|221|109051928|
|222|117440536|
|223|125829144|
|224|134217752|
|225|150994968|
|226|167772184|
|227|184549400|
|228|201326616|
|229|218103832|
|230|234881048|
|231|251658264|
|232|268435480|
|233|301989912|
|234|335544344|
|235|369098776|
|236|402653208|
|237|436207640|
|238|469762072|
|239|503316504|
|240|536870936|
|241|603979800|
|242|671088664|
|243|738197528|
|244|805306392|
|245|872415256|
|246|939524120|
|247|1006632984|
|248|1073741848|
|249|1207959576|
|250|1342177304|
|251|1476395032|
|252|1610612760|
|253|1744830488|
|254|1879048216|
|255|2013265944|
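
For readers who want to experiment, here is a minimal sketch of one encoding that reproduces the table above. It is not necessarily the code in the attached patch, and the class and method names are hypothetical. The assumption is that the first few byte values store the length verbatim, and the remaining ones store a small float with 3 explicit mantissa bits plus an implicit leading bit (4 significant bits in total).

```java
/** Hypothetical sketch; not necessarily the code from the attached patch. */
final class LengthNormSketch {
  // Byte values reserved for storing the smallest lengths verbatim, chosen
  // so that the code for Integer.MAX_VALUE still fits in 8 bits.
  static final int NUM_FREE_VALUES = 255 - toFloatCode(Integer.MAX_VALUE);

  /** Float-like code: 3 explicit mantissa bits plus an implicit leading bit. */
  static int toFloatCode(long value) {
    int numBits = 64 - Long.numberOfLeadingZeros(value);
    if (numBits < 4) {
      return (int) value; // values below 8 are stored as-is
    }
    int shift = numBits - 4;
    int mantissa = (int) (value >>> shift) & 0x07; // drop the implicit leading 1
    return mantissa | ((shift + 1) << 3); // shift field 0 is reserved for small values
  }

  /** Inverse of toFloatCode, returning the lower bound of the encoded bucket. */
  static long fromFloatCode(int code) {
    long mantissa = code & 0x07;
    int shift = (code >>> 3) - 1;
    return shift == -1 ? mantissa : (mantissa | 0x08) << shift;
  }

  /** Encodes a non-negative length into a single byte, rounding down. */
  static byte encode(int length) {
    if (length < 0) {
      throw new IllegalArgumentException("length must be >= 0");
    }
    if (length < NUM_FREE_VALUES) {
      return (byte) length;
    }
    return (byte) (NUM_FREE_VALUES + toFloatCode(length - NUM_FREE_VALUES));
  }

  /** Decodes a byte back to the length listed in the table. */
  static int decode(byte b) {
    int code = b & 0xff;
    if (code < NUM_FREE_VALUES) {
      return code;
    }
    return (int) (NUM_FREE_VALUES + fromFloatCode(code - NUM_FREE_VALUES));
  }
}
```

With this sketch, {{decode(encode(41))}} yields 40 and {{decode(encode(43))}} yields 42, which matches the rounding-down behaviour described above.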

It is still a work in progress: for instance, some tests that rely on the way accuracy was lost are not passing. Feedback is welcome, e.g. about better ways to propagate the index created version or to encode the norm.

> Better encode length normalization in similarities
> --------------------------------------------------
>
>                 Key: LUCENE-7730
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7730
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Adrien Grand
>         Attachments: LUCENE-7730.patch
>
>
> Now that index-time boosts are gone (LUCENE-6819) and that indices record the version that was used to create them (for backward compatibility, LUCENE-7703), we can look into storing the length normalization factor more efficiently.


