Posted to dev@lucene.apache.org by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org> on 2015/08/25 18:29:45 UTC

[jira] [Updated] (SOLR-7971) Reduce memory allocated by JavaBinCodec to encode large strings

     [ https://issues.apache.org/jira/browse/SOLR-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shalin Shekhar Mangar updated SOLR-7971:
----------------------------------------
    Attachment: SOLR-7971.patch

Without this patch, indexing the same 100MB JSON document mentioned in SOLR-7927:
# succeeds with ./bin/solr start -m 2100M
# fails with ./bin/solr start -m 2000M

With this patch, indexing the same document:
# succeeds with ./bin/solr start -m 1900M
# fails with ./bin/solr start -m 1800M

> Reduce memory allocated by JavaBinCodec to encode large strings
> ---------------------------------------------------------------
>
>                 Key: SOLR-7971
>                 URL: https://issues.apache.org/jira/browse/SOLR-7971
>             Project: Solr
>          Issue Type: Sub-task
>          Components: Response Writers, SolrCloud
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>            Priority: Minor
>             Fix For: Trunk, 5.4
>
>         Attachments: SOLR-7971.patch
>
>
> As discussed in SOLR-7927, we can reduce the buffer memory allocated by JavaBinCodec while writing large strings.
> https://issues.apache.org/jira/browse/SOLR-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700420#comment-14700420
> {quote}
> The maximum Unicode code point (as of Unicode 8 anyway) is U+10FFFF ([http://www.unicode.org/glossary/#code_point]).  This is encoded in UTF-16 as surrogate pair {{\uDBFF\uDFFF}}, which takes up two Java chars, and is represented in UTF-8 as the 4-byte sequence {{F4 8F BF BF}}.  This is likely where the mistaken 4-bytes-per-Java-char formulation came from: the maximum number of UTF-8 bytes required to represent a Unicode *code point* is 4.
> The maximum Java char is {{\uFFFF}}, which is represented in UTF-8 as the 3-byte sequence {{EF BF BF}}.
> So I think it's safe to switch to using 3 bytes per Java char (the unit of measurement returned by {{String.length()}}), like {{CompressingStoredFieldsWriter.writeField()}} does.
> {quote}
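
The bound described in the quote above can be checked with a small standalone program. This is only an illustrative sketch, not the code in SOLR-7971.patch: the class name Utf8Bound and the helpers maxUtf8Bytes and utf8Bytes are hypothetical names chosen for this example. It compares the 3-bytes-per-char estimate against the size the JDK's UTF-8 encoder actually produces.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Solr or of the attached patch.
class Utf8Bound {
    /** Worst-case UTF-8 size, assuming 3 bytes per Java char (UTF-16 code unit). */
    static long maxUtf8Bytes(String s) {
        return 3L * s.length(); // long arithmetic avoids int overflow for very large strings
    }

    /** Actual UTF-8 size as produced by the JDK encoder. */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "a",                                     // 1 char, 1 byte
            String.valueOf((char) 0x00E9),           // 1 char, 2 bytes
            String.valueOf((char) 0xFFFF),           // 1 char, 3 bytes (EF BF BF)
            new String(Character.toChars(0x10FFFF))  // 2 chars (surrogate pair), 4 bytes (F4 8F BF BF)
        };
        for (String s : samples) {
            System.out.println(s.length() + " char(s): " + utf8Bytes(s)
                    + " byte(s), bound " + maxUtf8Bytes(s));
        }
    }
}
```

A code point outside the Basic Multilingual Plane needs 4 UTF-8 bytes but occupies two Java chars, so it is covered by a budget of 6 bytes. That is why sizing the buffer at 3 bytes per char is sufficient for any String.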



