Posted to dev@lucene.apache.org by "Erick Erickson (JIRA)" <ji...@apache.org> on 2016/07/14 15:45:20 UTC
[jira] [Commented] (SOLR-9296) Examine SortingResponseWriter with an eye towards removing extra object creation

    [ https://issues.apache.org/jira/browse/SOLR-9296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15377139#comment-15377139 ] 

Erick Erickson commented on SOLR-9296:
--------------------------------------

Some preliminary results:

I'm particularly interested in any pointers the Lucene people have. Now that I've poked at this enough to understand the issues, I may be able to appreciate them.

Bottom line: I've instrumented the SortingResponseWriter class to
1> not write to the client for testing
2> try to reduce object creation
3> report summary results only.

On 10M rows (see the numbers below) I'm seeing 0-11% improvements in rows/second, with one outlier (mv bool fields) showing 5% worse performance. I'm also seeing slightly spikier response times with the old way of doing things, but that's probably within the margin of error of my measurements.

4M fewer char[] objects created (VisualVM)
Roughly the same number of other types of objects created.
40M total objects created. NOTE: I had to stop looking after 2.16M rows were processed since VisualVM was slowing the system to a crawl.

There's still some work to do to understand why roughly the same number of String objects were created, but I think this is encouraging enough to pursue.

First, any suggestions for the most vexing thing of all? Let's say I have to convert an integer to a char[] to output it. Currently that can be done with a formatter that takes an "Appendable". Great, I can reuse one StringBuilder/StringBuffer, resetting the length to 0 each time. Unfortunately, there's no way to get to the underlying char[] buffer without copying it around. The OpenStringBuilder class that I'd like to use (a Lucene utility class) doesn't work because the formatter checks for instanceof StringBuffer and/or StringBuilder or asserts. So I wind up copying to a char[] (which I have one of per field) and writing that.

I have a char[] cbuf that I can reuse for the entire export (one per field), so it looks like this:

format(val, stringBuilder)  // StringBuffer or StringBuilder, depending on what the formatter requires
stringBuilder.getChars(0, stringBuilder.length(), cbuf, 0)
writer.write(cbuf, 0, stringBuilder.length())

Whereas I'd like to avoid the getChars(...) call.
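
A minimal Java sketch of the format/copy/write pattern described above, for illustration only; it is not the actual SortingResponseWriter code. The class name, the method name, the initial buffer sizes, and the use of StringBuilder.append(int) as a stand-in for the real formatter are all assumptions.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical illustration class; not part of Solr or Lucene.
class ReusableBufferSketch {
  // Both buffers are allocated once and reused for every value written.
  private final StringBuilder sb = new StringBuilder(32);
  private char[] cbuf = new char[32];

  // Formats val into the shared StringBuilder, copies it into cbuf, then writes cbuf.
  void writeInt(int val, Writer writer) throws IOException {
    sb.setLength(0);               // reset the builder without dropping its backing array
    sb.append(val);                // assumption: stand-in for the real formatter call
    int len = sb.length();
    if (len > cbuf.length) {
      cbuf = new char[len];        // grow only if a value ever exceeds the buffer
    }
    sb.getChars(0, len, cbuf, 0);  // this is the extra copy the comment wants to avoid
    writer.write(cbuf, 0, len);
  }

  public static void main(String[] args) throws IOException {
    ReusableBufferSketch sketch = new ReusableBufferSketch();
    StringWriter out = new StringWriter();
    sketch.writeInt(12345, out);
    out.write(',');
    sketch.writeInt(-7, out);
    System.out.println(out);
  }
}
```

The getChars call is the step in question: StringBuilder does not expose its internal array, so the characters have to be copied into cbuf before they can be handed to a Writer as a char[].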

I'm traveling today so I won't post the code until perhaps tomorrow. So far:
I've taken out a bunch of conversions to String and created some classes that re-use a char[] to move data around. I created a "null writer" to remove the variable of the client having to read 10M rows for testing purposes.
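
As a rough idea of what such a "null writer" could look like (a sketch under assumptions, not the code used for these measurements): a java.io.Writer subclass that discards everything it is given and only keeps a running character count. The class name and the charsSeen counter are hypothetical.

```java
import java.io.Writer;

// Hypothetical "null writer": throws the data away and keeps only a count.
class CountingNullWriter extends Writer {
  private long charsSeen;

  @Override
  public void write(char[] cbuf, int off, int len) {
    charsSeen += len;  // record how much was "written", then drop it
  }

  @Override
  public void flush() {
    // nothing is buffered, so there is nothing to flush
  }

  @Override
  public void close() {
    // no underlying resource to release
  }

  long charsSeen() {
    return charsSeen;
  }

  public static void main(String[] args) {
    CountingNullWriter w = new CountingNullWriter();
    w.write(new char[] {'1', '2', '3'}, 0, 3);
    System.out.println(w.charsSeen());
  }
}
```

Swapping a writer like this in for the real response stream takes the client's read speed out of the measurement while still exercising the formatting and copying code.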

On a preliminary run (exporting 10M rows of various types: int, long, string), the number of allocated objects dropped by about 4M char[] (of 40M total objects), while most of the other object counts remained about the same. I was surprised that the number of String objects stayed similar; I expected that to drop, so I need to dig into that some more.

Speed-wise, I'm seeing up to an 11% improvement in throughput, mostly in the single-valued case. Why mv should be different I'm not sure yet. Writing mv fields varies from being 5% or so _worse_ (bool field) to 10% or so better.

These measurements were taken with a null writer that just threw the bits on the floor, plus a bit of instrumentation to return the aggregate. Before taking any of them I did a full export of all the fields to try to remove loading and the like from the measurements. For each row below I then took three runs, each exporting all 10.2M docs (no VisualVM attached; that was just for object counting and gets in the way of perf... badly). The times for the three runs were very similar. You'll notice that all the tries return int_sv, which I used as the sort criterion, figuring that would stay constant. Numbers are new/old in thousands of rows per second, so the first entry says "for returning 10.2M single-valued string and integer fields, the new code processed 170K/second and the old code processed 152K/second".

 New   Old   Fields exported (new/old in thousands of rows per second)
 ---   ---   ---------------
 170   152   str_sv,int_sv
 193   175   int_sv
 176   165   long_sv,int_sv
 167   145   date_sv,int_sv
 186   172   bool_sv,int_sv
 171   156   double_sv,int_sv
 131   122   str_mv,int_sv
 149   138   int_mv,int_sv
 146   147   long_mv,int_sv
 120   120   date_mv,int_sv
 174   183   bool_mv,int_sv
 124   125   double_mv,int_sv
  55    55   str_sv,int_sv,date_sv,bool_sv,double_sv,str_mv,int_mv,date_mv,bool_mv,double_mv
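
For anyone who wants to turn the new/old pairs above into relative changes, here is a small Java sketch of the arithmetic. The class and method names are made up for illustration, and only a few of the pairs are copied in.

```java
// Hypothetical helper for comparing new vs. old throughput figures.
class ThroughputDelta {
  // Percent change of the new code relative to the old; both in thousands of rows/second.
  static double percentChange(int newK, int oldK) {
    return 100.0 * (newK - oldK) / oldK;
  }

  public static void main(String[] args) {
    // A few new/old pairs copied from the results above.
    String[] fields = {"str_sv,int_sv", "int_sv", "bool_mv,int_sv"};
    int[][] pairs = {{170, 152}, {193, 175}, {174, 183}};
    for (int i = 0; i < fields.length; i++) {
      System.out.printf("%-16s %+.1f%%%n", fields[i], percentChange(pairs[i][0], pairs[i][1]));
    }
  }
}
```

A positive result means the new code processed more rows per second than the old code; a negative one means it was slower.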




> Examine SortingResponseWriter with an eye towards removing extra object creation
> --------------------------------------------------------------------------------
>
>                 Key: SOLR-9296
>                 URL: https://issues.apache.org/jira/browse/SOLR-9296
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>    Affects Versions: 6.2, master (7.0)
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>
> Assigning to myself just to keep from losing track of it. Anyone who wants to take it, please feel free!
> While looking at SOLR-9166 I noticed that SortingResponseWriter does a toString for each field it writes out. At a _very_ preliminary examination it seems like we create a lot of String objects that need to be GC'd. Could we reduce this by using some kind of CharsRef/ByteBuffer/Whatever?
> I've only looked at this briefly, not quite sure what the gotchas are but throwing it out for discussion.
> Some initial thoughts:
> 1> for the fixed types (numerics, dates, booleans) there's a strict upper limit on the size of each value so we can allocate something up-front.
> 2> for string fields, we already get a chars ref so just pass that through?
> 3> must make sure that whatever does the actual writing transfers all the bytes before returning.
> I'm sure I won't get to this for a week or perhaps more, so grab it if you have the bandwidth.


