Posted to dev@lucene.apache.org by "Martin Grotzke (JIRA)" <ji...@apache.org> on 2011/06/09 11:46:58 UTC

[jira] [Created] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)

Make external scoring more efficient (ExternalFileField, FileFloatSource)
-------------------------------------------------------------------------

                 Key: SOLR-2583
                 URL: https://issues.apache.org/jira/browse/SOLR-2583
             Project: Solr
          Issue Type: Improvement
          Components: search
            Reporter: Martin Grotzke
            Priority: Minor


External scoring uses a lot of memory, depending on the number of documents in the index. The ExternalFileField (used for external scoring) uses FileFloatSource, where one FileFloatSource is created per external scoring file. FileFloatSource creates a float array sized to the number of docs (this is done even if the file to load is not found). If the scoring file has far fewer entries than the index has docs, the big float array wastes a lot of memory.

This could be optimized by using a map of doc -> score, so that the map contains as many entries as there are scoring entries in the external file, but not more.
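
As a rough illustration of the two layouts being compared, here is a minimal sketch. The class and method names (ScoreLayouts, DenseScores, SparseScores, score) are hypothetical and not taken from FileFloatSource, and returning a caller-supplied default for docs without a score is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only; not the actual FileFloatSource code. */
final class ScoreLayouts {

    /** Dense layout: one float slot per document, whether or not it has a score. */
    static final class DenseScores {
        private final float[] scores;

        DenseScores(int maxDoc) {
            this.scores = new float[maxDoc]; // 4 bytes per doc, always allocated
        }

        void put(int doc, float score) { scores[doc] = score; }

        float score(int doc) { return scores[doc]; }
    }

    /** Sparse layout: only docs that appear in the external file take memory. */
    static final class SparseScores {
        private final Map<Integer, Float> scores = new HashMap<>();
        private final float defaultValue; // assumed fallback for docs without a score

        SparseScores(float defaultValue) { this.defaultValue = defaultValue; }

        void put(int doc, float score) { scores.put(doc, score); }

        float score(int doc) {
            Float s = scores.get(doc);
            return s != null ? s : defaultValue;
        }

        int entries() { return scores.size(); }
    }
}
```

The sparse variant pays for boxing and per-entry overhead, so it only wins when the scoring file covers a small share of the docs.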

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13046688#comment-13046688 ] 

Yonik Seeley commented on SOLR-2583:
------------------------------------

bq. Perhaps it would be good to allow the user to override this, s.th. like sparse=yes/no/auto.

Sounds good!  I wonder what the memory cut-off should be for auto... 10% of maxDoc() or so?

bq. a smallfloat option could help too? (1/4 the ram)

Yep!
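
The sparse=yes/no/auto idea discussed above could be resolved roughly as follows. This is only a sketch: the names (SparseOption, Mode, useSparse), the fallback to auto for unknown values, and passing the cut-off as a parameter are all assumptions, and 0.10 merely mirrors the "10% of maxDoc()" guess.

```java
/** Hypothetical sketch of a sparse=yes/no/auto option; names and cut-off are assumptions. */
final class SparseOption {

    enum Mode { YES, NO, AUTO }

    /** Parses the option value; anything unrecognized falls back to AUTO (an assumption). */
    static Mode parse(String value) {
        if ("yes".equalsIgnoreCase(value)) return Mode.YES;
        if ("no".equalsIgnoreCase(value)) return Mode.NO;
        return Mode.AUTO;
    }

    /**
     * Decides whether to use the sparse representation.
     * In AUTO mode the sparse form is chosen when the number of scores is
     * below cutOffFraction * maxDoc.
     */
    static boolean useSparse(Mode mode, int numScores, int maxDoc, double cutOffFraction) {
        switch (mode) {
            case YES: return true;
            case NO:  return false;
            default:  return numScores < cutOffFraction * maxDoc;
        }
    }
}
```

For example, useSparse(SparseOption.parse("auto"), 50000, 1000000, 0.10) selects the sparse form, while 200000 scores for the same index would not.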



[jira] [Updated] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)

Posted by "Martin Grotzke (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Martin Grotzke updated SOLR-2583:
---------------------------------

    Attachment: FileFloatSource.java.patch

The attached patch changes FileFloatSource to use a map of score by doc.



[jira] [Commented] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)

Posted by "Martin Grotzke (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13046692#comment-13046692 ] 

Martin Grotzke commented on SOLR-2583:
--------------------------------------

Great, sounds like a further optimization for both sparse and non-sparse files. However, as we had 4GB taken by FileFloatSource objects, a reduction to 1/4 would still be too much for us, so for our case I prefer the map-based approach, then combined with SmallFloat.



[jira] [Commented] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049709#comment-13049709 ] 

Robert Muir commented on SOLR-2583:
-----------------------------------

bq. that uses a fixed size and an increasing number of puts

I'm not certain how realistic that is. Remember that behind the scenes CompactByteArray uses blocks,
and if you touch every one (by putting every Kth docid or something), then you are just testing
the worst case.
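
To make the block argument concrete, the following sketch counts how many blocks a set of puts touches. It is only an illustration: the class name TouchedBlocks is hypothetical, and the block size is a parameter because the actual block size of the compact array is not stated in this thread.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper; block size is an assumption passed in by the caller. */
final class TouchedBlocks {

    /** Counts distinct blocks hit by numPuts docids 0, stride, 2*stride, ... */
    static int touchedBlocks(int numPuts, int stride, int blockSize) {
        Set<Integer> blocks = new HashSet<>();
        for (int i = 0; i < numPuts; i++) {
            blocks.add((i * stride) / blockSize);
        }
        return blocks.size();
    }
}
```

With 10000 puts over 1000000 docs and an assumed block size of 256, evenly spaced docids (stride 100) touch almost every block, whereas 10000 contiguous docids (stride 1) touch only a few dozen, so the first pattern leaves nothing to share.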




[jira] [Commented] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)

Posted by "Martin Grotzke (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050435#comment-13050435 ] 

Martin Grotzke commented on SOLR-2583:
--------------------------------------

bq. Are you sure real floats are actually needed?
In our case score values are e.g. 158870000 (one example just taken from one of the files). With a value of this magnitude, the following test fails:
{noformat}
byte small = SmallFloat.floatToByte315(104626500f);
assertEquals(104626500f, SmallFloat.byte315ToFloat(small), 0f);
-> AssertionError: expected:<1.04626496E8> but was:<1.00663296E8>
{noformat}

This shows that we already have a case where SmallFloat produces wrong results, and even if we could work around it in our case, someone else might hit the same issue.
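
The size of the error can be reproduced without Lucene. The sketch below does not call SmallFloat; it emulates a 3-bit mantissa by masking the low 20 of the 23 mantissa bits of an IEEE 754 float, on the assumption that the mantissa width is what loses the precision here. The class and method names are hypothetical.

```java
/** Hypothetical demo; emulates a 3-bit mantissa by masking, it does not call Lucene's SmallFloat. */
final class MantissaDemo {

    /** Keeps the sign, the exponent and the top 3 of the 23 mantissa bits. */
    static float keepTop3MantissaBits(float f) {
        int bits = Float.floatToIntBits(f);
        int mask = ~((1 << 20) - 1); // clears the 20 low mantissa bits
        return Float.intBitsToFloat(bits & mask);
    }

    public static void main(String[] args) {
        float original = 104626500f;
        float truncated = keepTop3MantissaBits(original);
        System.out.println(original + " -> " + truncated);
    }
}
```

With only 3 mantissa bits, values between two powers of two fall into 8 buckets, so scores of this magnitude can be off by several million.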


bq. it would also good to measure performance...
I wouldn't expect the boxing to make a real difference here, especially in relation to the rest of the time spent during a search request.
A time-based performance comparison of real value would take some time: it would have to be put in relation to the rest of a search request (how do you do this?), and finally it would require proper interpretation once everything is together. Right now I don't think it's worth the effort.


{quote}
bq. that uses a fixed size and an increasing number of puts
I'm not certain how realistic that is, remember behind the scenes compactbytearray uses blocks,
and if you touch every one (by putting every K docid or something) then you are just testing
the worst case.
{quote}
Do you want to change the test to something that's more realistic?


@Yonik: what do you say regarding the suggestion to use HashMap up to ~5.5% and above that using the float[]?



[jira] [Commented] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)

Posted by "Martin Grotzke (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049674#comment-13049674 ] 

Martin Grotzke commented on SOLR-2583:
--------------------------------------

The test that produced this output can be found in my lucene-solr fork on github: https://github.com/magro/lucene-solr/commit/b9af87b1
The test method that was executed was testCompareMemoryUsage. For measuring memory usage I used http://code.google.com/p/memory-measurer/ and ran the test/JVM with "-Xmx1G -javaagent:solr/lib/object-explorer.jar" (just from Eclipse).

I just added another test, that uses a fixed size and an increasing number of puts (testCompareMemoryUsageWithFixSizeAndIncreasingNumPuts, https://github.com/magro/lucene-solr/blob/trunk/solr/src/test/org/apache/solr/search/function/FileFloatSourceMemoryTest.java#L56), with the following results:

{noformat}
Size: 1000000
NumPuts 1.000 (0,1%),		CompactFloatArray 918.616,	float[] 4.000.016,	HashMap  72.128
NumPuts 10.000 (1,0%),		CompactFloatArray 3.738.712,	float[] 4.000.016,	HashMap  701.696
NumPuts 50.000 (5,0%),		CompactFloatArray 4.016.472,	float[] 4.000.016,	HashMap  3.383.104
NumPuts 55.000 (5,5%),		CompactFloatArray 4.016.472,	float[] 4.000.016,	HashMap  3.949.120
NumPuts 60.000 (6,0%),		CompactFloatArray 4.016.472,	float[] 4.000.016,	HashMap  4.254.848
NumPuts 100.000 (10,0%),	CompactFloatArray 4.016.472,	float[] 4.000.016,	HashMap  6.622.272
NumPuts 500.000 (50,0%),	CompactFloatArray 4.016.472,	float[] 4.000.016,	HashMap  27.262.976
NumPuts 1.000.000 (100,0%),	CompactFloatArray 4.016.472,	float[] 4.000.016,	HashMap  44.649.664
{noformat}

It seems that the HashMap is the most efficient solution up to ~5.5%. Starting from this threshold CompactFloatArray and float[] use less memory, while the CompactFloatArray has no advantages over float[] for puts > 5%.

Therefore I'd suggest an adaptive strategy: use a HashMap as long as the number of scores is below ~5.5% of the number of docs, and use the original float[] approach from that threshold on.

What do you say?
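
The ~5.5% threshold can be cross-checked with a little arithmetic on the numbers above. This is a back-of-the-envelope sketch that assumes the HashMap cost grows about linearly at the per-entry cost seen at 1.000 puts (72.128 bytes / 1.000 entries); real growth is stepwise because the table resizes. The class name BreakEven is hypothetical.

```java
/** Hypothetical back-of-the-envelope check; inputs are the measured values quoted above. */
final class BreakEven {

    /** Number of map entries at which the map costs about as much as the dense array. */
    static long breakEvenEntries(long arrayBytes, double bytesPerMapEntry) {
        return (long) (arrayBytes / bytesPerMapEntry);
    }

    public static void main(String[] args) {
        long arrayBytes = 4_000_016L;        // float[] for 1.000.000 docs
        double perEntry = 72_128 / 1_000.0;  // HashMap bytes per entry at 1.000 puts
        long entries = breakEvenEntries(arrayBytes, perEntry);
        System.out.printf("break-even at ~%d entries (%.2f%% of 1.000.000 docs)%n",
                entries, 100.0 * entries / 1_000_000);
    }
}
```

The estimate lands between the 55.000 and 60.000 rows of the table, which is where the measured HashMap size crosses the float[] size.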



[jira] [Commented] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13046785#comment-13046785 ] 

Robert Muir commented on SOLR-2583:
-----------------------------------

bq. Though, as we had 4GB taken by FileFloatSource objects a reduction to 1/4 would still be too much for us so for our case I prefer the map based approach - then with Smallfloat.

If the problem is sparsity, maybe use a two-stage table, still faster than a hashmap and much better for the worst case.




[jira] [Commented] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056224#comment-13056224 ] 

Koji Sekiguchi commented on SOLR-2583:
--------------------------------------

I didn't save the test snippet because I wrote it while away from my office (on someone else's PC). All I did was use CompactByteArray instead of CompactFloatArray in your FileFloatSourceMemoryTest.java.




[jira] [Updated] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated SOLR-2583:
------------------------------

    Attachment: patch.txt

{quote}
What do you mean with a two-stage table, can you clarify this please?
{quote}

See: http://www.strchr.com/multi-stage_tables

I attached a patch of a (not great) implementation that I was sort of trying to clean up for other reasons... maybe you can use it.

In the sparse case, blocks that contain only the default value are folded into one block (in this patch blocksize=256, but maybe it should be configurable).

For example, in your 4GB case (1 billion floats), if you use this with SmallFloat, the absolute worst case (no sharing) is 1GB + 16MB or so, and the best case (all default values) is 16MB. The lookups should also be a lot faster than with hashtables (all primitive types, etc.), and it could definitely be improved more.

Really this is still probably overkill, as the data structure is intended to share blocks with the same values in general, when in reality it's probably enough to share just the ones that have only the default value set...

I didn't look at the Solr side to see if it's possible to build it incrementally (this would be better than building and then compact()ing, but I wonder if this is possible due to Lucene docid/Solr id, etc.).
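
For readers unfamiliar with the idea, here is a much simplified sketch of a two-stage table that shares only the all-default block. It is not the attached patch: the class name, the use of 0 as the default value, the int offsets in the index and the growth policy are all assumptions, and growth beyond 1GB of data is not handled.

```java
import java.util.Arrays;

/** Hypothetical two-stage byte table; a simplified sketch, not the attached patch. */
final class TwoStageByteTable {
    private static final int BLOCK_SHIFT = 8;               // block size 256
    private static final int BLOCK_SIZE = 1 << BLOCK_SHIFT;
    private static final int BLOCK_MASK = BLOCK_SIZE - 1;

    private final int[] index;  // stage 1: block number -> offset into data
    private byte[] data;        // stage 2: offset 0 is the shared all-default block
    private int usedBlocks = 1; // the shared block counts as used

    TwoStageByteTable(int size) {
        int numBlocks = (size + BLOCK_SIZE - 1) >>> BLOCK_SHIFT;
        index = new int[numBlocks];   // all zeros: every block points at the shared block
        data = new byte[BLOCK_SIZE];  // shared block, all 0 = default value
    }

    void put(int doc, byte value) {
        int block = doc >>> BLOCK_SHIFT;
        if (index[block] == 0) {
            if (value == 0) return;   // default value, keep sharing
            // First non-default value in this block: give it private storage.
            if ((usedBlocks + 1) * BLOCK_SIZE > data.length) {
                data = Arrays.copyOf(data, data.length * 2);
            }
            index[block] = usedBlocks * BLOCK_SIZE;
            usedBlocks++;
        }
        data[index[block] + (doc & BLOCK_MASK)] = value;
    }

    byte get(int doc) {
        return data[index[doc >>> BLOCK_SHIFT] + (doc & BLOCK_MASK)];
    }

    int allocatedBlocks() { return usedBlocks; }
}
```

Under these assumptions the "16MB or so" figure is consistent with the index alone: 1 billion docs / 256 per block is about 3.9 million int entries, i.e. roughly 15.6 million bytes, and the worst case adds one byte per doc on top of that.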




[jira] [Commented] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)

Posted by "Martin Grotzke (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049256#comment-13049256 ] 

Martin Grotzke commented on SOLR-2583:
--------------------------------------

I just compared memory consumption of the 3 different approaches, with different number of puts (number of scores) and sizes (number of docs):

{noformat}
Puts  1.000, size 1.000.000:	  CompactFloatArray 898.136,	float[] 4.000.016,	HashMap  72.192
Puts  10.000, size 1.000.000:	  CompactFloatArray 3.724.376,	float[] 4.000.016,	HashMap  702.784
Puts  100.000, size 1.000.000:	  CompactFloatArray 4.016.472,	float[] 4.000.016,	HashMap  6.607.808
Puts  1.000.000, size 1.000.000:  CompactFloatArray 4.016.472,	float[] 4.000.016,	HashMap  44.644.032
Puts  1.000, size 5.000.000:	  CompactFloatArray 1.128.536,	float[] 20.000.016,	HashMap  72.256
Puts  10.000, size 5.000.000:	  CompactFloatArray 8.168.536,	float[] 20.000.016,	HashMap  704.832
Puts  100.000, size 5.000.000:	  CompactFloatArray 20.013.144,	float[] 20.000.016,	HashMap  7.385.152
Puts  1.000.000, size 5.000.000:  CompactFloatArray 20.131.160,	float[] 20.000.016,	HashMap  66.395.584
Puts  1.000, size 10.000.000:	  CompactFloatArray 1.275.992,	float[] 40.000.016,	HashMap  72.256
Puts  10.000, size 10.000.000:	  CompactFloatArray 9.289.816,	float[] 40.000.016,	HashMap  705.280
Puts  100.000, size 10.000.000:	  CompactFloatArray 37.130.328,	float[] 40.000.016,	HashMap  7.418.112
Puts  1.000.000, size 10.000.000: CompactFloatArray 40.262.232,	float[] 40.000.016,	HashMap  69.282.496
{noformat}

I'm sharing this as an intermediate result, without further interpretation or conclusion for now (I need to catch my train).



[jira] [Commented] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)

Posted by "Martin Grotzke (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055737#comment-13055737 ] 

Martin Grotzke commented on SOLR-2583:
--------------------------------------

bq. Looking at your test, I think it is reasonable. But I'd like to use CompactByteArray. I saw it wins over HashMap and float[] when 5% and above in my test.

Can you share your test code or something similar? Perhaps you can just fork https://github.com/magro/lucene-solr/ and add an appropriate test that reflects your data?



[jira] [Issue Comment Edited] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)

Posted by "Martin Grotzke (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049256#comment-13049256 ] 

Martin Grotzke edited comment on SOLR-2583 at 6/14/11 4:25 PM:
---------------------------------------------------------------

I just compared the memory consumption of the 3 different approaches, with different numbers of puts (number of scores) and sizes (number of docs); memory is given in bytes:

{noformat}
Puts  1.000, size 1.000.000:	  CompactFloatArray 898.136,	float[] 4.000.016,	HashMap  72.192
Puts  10.000, size 1.000.000:	  CompactFloatArray 3.724.376,	float[] 4.000.016,	HashMap  702.784
Puts  100.000, size 1.000.000:	  CompactFloatArray 4.016.472,	float[] 4.000.016,	HashMap  6.607.808
Puts  1.000.000, size 1.000.000:  CompactFloatArray 4.016.472,	float[] 4.000.016,	HashMap  44.644.032
Puts  1.000, size 5.000.000:	  CompactFloatArray 1.128.536,	float[] 20.000.016,	HashMap  72.256
Puts  10.000, size 5.000.000:	  CompactFloatArray 8.168.536,	float[] 20.000.016,	HashMap  704.832
Puts  100.000, size 5.000.000:	  CompactFloatArray 20.013.144,	float[] 20.000.016,	HashMap  7.385.152
Puts  1.000.000, size 5.000.000:  CompactFloatArray 20.131.160,	float[] 20.000.016,	HashMap  66.395.584
Puts  1.000, size 10.000.000:	  CompactFloatArray 1.275.992,	float[] 40.000.016,	HashMap  72.256
Puts  10.000, size 10.000.000:	  CompactFloatArray 9.289.816,	float[] 40.000.016,	HashMap  705.280
Puts  100.000, size 10.000.000:	  CompactFloatArray 37.130.328,	float[] 40.000.016,	HashMap  7.418.112
Puts  1.000.000, size 10.000.000: CompactFloatArray 40.262.232,	float[] 40.000.016,	HashMap  69.282.496
{noformat}

I'm sharing this as an intermediate result, without further interpretation or conclusion for now (I need to catch my train).

  


[jira] [Commented] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)

Posted by "Martin Grotzke (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13046674#comment-13046674 ] 

Martin Grotzke commented on SOLR-2583:
--------------------------------------

Yes, you're right regarding non-sparse fields. The question for the user will be when to use true or false for sparse. It might also be the case that files differ, in that some are big and others are small. So I'm thinking about making it adaptive: when the number of lines reaches a certain percentage of the number of docs, the float array is used; otherwise the doc->score map is used. Perhaps it would be good to allow the user to override this, with something like sparse=yes/no/auto.

What do you think?
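
A minimal sketch of what such an adaptive choice could look like follows. Everything in it is hypothetical: the class name, the Sparse enum, and the threshold parameter are assumptions for illustration and are not part of Solr or of the attached patch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical score store that picks float[] or a map depending on sparsity. */
final class AdaptiveScores {
    enum Sparse { YES, NO, AUTO }

    private final float[] dense;             // used when many docs have a score
    private final Map<Integer, Float> map;   // used when only a few docs have a score
    private final float defaultValue;

    AdaptiveScores(int maxDoc, Map<Integer, Float> entries, Sparse mode,
                   double autoThreshold, float defaultValue) {
        this.defaultValue = defaultValue;
        // AUTO switches to the map when the entries are below the given fraction of maxDoc.
        boolean useMap = mode == Sparse.YES
                || (mode == Sparse.AUTO && entries.size() < autoThreshold * maxDoc);
        if (useMap) {
            this.map = new HashMap<>(entries);
            this.dense = null;
        } else {
            this.map = null;
            this.dense = new float[maxDoc];
            Arrays.fill(dense, defaultValue);
            for (Map.Entry<Integer, Float> e : entries.entrySet()) {
                dense[e.getKey()] = e.getValue();
            }
        }
    }

    boolean isSparse() {
        return map != null;
    }

    float score(int doc) {
        if (dense != null) {
            return dense[doc];
        }
        Float v = map.get(doc);
        return v == null ? defaultValue : v;
    }
}
```

With sparse=yes or sparse=no the user's choice wins, and only sparse=auto consults the threshold, which would have to come from measurements.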



[jira] [Commented] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)

Posted by "Martin Grotzke (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13046943#comment-13046943 ] 

Martin Grotzke commented on SOLR-2583:
--------------------------------------

bq. If the problem is sparsity, maybe use a two-stage table, still faster than a hashmap and much better for the worst case.

What do you mean by a two-stage table? Could you clarify, please?



[jira] [Commented] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055437#comment-13055437 ] 

Koji Sekiguchi commented on SOLR-2583:
--------------------------------------

I'd like the feature as I'm using ExternalFileField a lot!

bq. what do you say regarding the suggestion to use HashMap up to ~5.5% and above that using the float[]?

Looking at your test, I think it is reasonable. But I'd like to use CompactByteArray. In my test it beat both HashMap and float[] at 5% and above.

How about introducing compact=yes (default is no and float[] is used) with sparse=yes/no/auto?



[jira] [Commented] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)

Posted by "Martin Grotzke (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13046712#comment-13046712 ] 

Martin Grotzke commented on SOLR-2583:
--------------------------------------

> Sounds good!  I wonder what the memory cut-off should be for auto... 10% of maxDoc() or so?

I'd compare both strategies to find the break-even point; that should give an absolute number.



[jira] [Commented] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13046564#comment-13046564 ] 

Yonik Seeley commented on SOLR-2583:
------------------------------------

Yeah, this will help for sparse fields, but hurt quite a bit for non-sparse ones.
Seems like we should make it an option (sparse=true/false on the fieldType definition)?



[jira] [Commented] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13046675#comment-13046675 ] 

Robert Muir commented on SOLR-2583:
-----------------------------------

a smallfloat option could help too? (1/4 the ram)
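
To illustrate where the 1/4 comes from: each score is stored in one byte instead of a four-byte float, at the cost of precision. The sketch below is a standalone encoder and decoder written in the style of Lucene's SmallFloat. The class name and constants are assumptions for illustration, it is not the Lucene class itself, and it only handles non-negative values.

```java
/** Standalone sketch of a lossy one-byte float encoding (illustrative, not Lucene's class). */
final class ByteFloat {
    private static final int SHIFT = 24 - 3;          // low-order float bits that are dropped
    private static final int ZERO = (63 - 15) << 3;   // offset that anchors the smallest exponent

    /** Encodes a float into one byte; zero and negative values map to 0. */
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> SHIFT;
        if (small <= ZERO) {
            // Underflow: zero or negative becomes 0, tiny positive values become 1.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= ZERO + 0x100) {
            // Overflow: clamp to the largest representable value.
            return (byte) -1;
        }
        return (byte) (small - ZERO);
    }

    /** Decodes a byte produced by encode back into a float. */
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << SHIFT;
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }
}
```

Values such as 0.5f, 1.0f and 3.0f survive the round trip unchanged, while 1.1f comes back as 1.0f, so this only fits scores where such coarse precision is acceptable.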



[jira] [Commented] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)

Posted by "Martin Grotzke (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049143#comment-13049143 ] 

Martin Grotzke commented on SOLR-2583:
--------------------------------------

{quote}
See: http://www.strchr.com/multi-stage_tables

i attached a patch, of a (not great) implementation i was sorta kinda trying to clean up for other reasons... maybe you can use it.
{quote}

Thanks, interesting approach!

I just tried to create a CompactFloatArray based on the CompactByteArray to be able to compare memory consumption. There's one change that wasn't just a matter of changing byte to float, and I'm not sure what the right adaptation is in this case:

{code}
diff -w solr/src/java/org/apache/solr/util/CompactByteArray.java solr/src/java/org/apache/solr/util/CompactFloatArray.java
57c57
...
202,203c202,203
<   private void touchBlock(int i, int value) {
<     hashes[i] = (hashes[i] + (value << 1)) | 1;
---
>   private void touchBlock(int i, float value) {
>     hashes[i] = (hashes[i] + (Float.floatToIntBits(value) << 1)) | 1;
{code}

The adapted test is green, so it seems to be correct at least. I'll also attach the full patch for CompactFloatArray.java and TestCompactFloatArray.java
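
For readers unfamiliar with the idea, the basic principle of a two-stage table can be sketched as follows. This is not the CompactByteArray or CompactFloatArray code from the patches. The class name, the block size, and the copy-on-first-write strategy are assumptions chosen to keep the example short.

```java
import java.util.Arrays;

/** Minimal two-stage float table: an index of blocks, with untouched blocks shared. */
final class TwoStageFloatTable {
    private static final int BLOCK_SHIFT = 7;               // 128 values per block
    private static final int BLOCK_SIZE = 1 << BLOCK_SHIFT;
    private static final int BLOCK_MASK = BLOCK_SIZE - 1;

    private final float[][] blocks;      // stage 1: one slot per block of doc ids
    private final float[] defaultBlock;  // stage 2 block shared by all unwritten ranges

    TwoStageFloatTable(int size, float defaultValue) {
        defaultBlock = new float[BLOCK_SIZE];
        Arrays.fill(defaultBlock, defaultValue);
        blocks = new float[(size + BLOCK_MASK) >>> BLOCK_SHIFT][];
        Arrays.fill(blocks, defaultBlock);
    }

    void put(int doc, float value) {
        int b = doc >>> BLOCK_SHIFT;
        if (blocks[b] == defaultBlock) {
            blocks[b] = defaultBlock.clone();   // allocate a private block on first write
        }
        blocks[b][doc & BLOCK_MASK] = value;
    }

    float get(int doc) {
        // Two array reads, no hashing and no boxing.
        return blocks[doc >>> BLOCK_SHIFT][doc & BLOCK_MASK];
    }

    /** Number of blocks that hold their own data rather than the shared default. */
    int allocatedBlocks() {
        int n = 0;
        for (float[] block : blocks) {
            if (block != defaultBlock) {
                n++;
            }
        }
        return n;
    }
}
```

Memory is saved only when whole blocks stay untouched. Doc ids scattered across the index would touch many blocks, so a structure like this helps most when the scored docs are clustered.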



[jira] [Commented] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049706#comment-13049706 ] 

Robert Muir commented on SOLR-2583:
-----------------------------------

Are you sure real floats are actually needed?
Why not use compactbytearray with smallfloat encoding?

It would also be good to measure performance... doesn't a hashmap have to box *per-docid* into an Integer for lookup?
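
The boxing concern can be made concrete with a small comparison of the two lookup paths. The class and method names below are illustrative only. Whether the boxing shows up in practice depends on the JVM, so this demonstrates where it happens and makes no claim about its measured cost.

```java
import java.util.Map;

/** Contrasts a primitive array lookup with a lookup in a HashMap keyed by Integer. */
final class LookupComparison {

    /** Plain array read: no object is created. */
    static float fromArray(float[] scores, int doc) {
        return scores[doc];
    }

    /** Map read: the int doc id is boxed and the Float result is unboxed. */
    static float fromMap(Map<Integer, Float> scores, int doc, float defaultValue) {
        // Autoboxing turns doc into an Integer via Integer.valueOf(doc); only small
        // values come from the Integer cache, larger ids may allocate per call.
        Float boxed = scores.get(doc);
        return boxed == null ? defaultValue : boxed;
    }
}
```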


