Posted to dev@lucene.apache.org by "Joseph Gresock (JIRA)" <ji...@apache.org> on 2014/07/03 16:32:25 UTC

[jira] [Commented] (SOLR-6199) SolrJ, using SolrInputDocument methods, requires entire document to be loaded into memory

    [ https://issues.apache.org/jira/browse/SOLR-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051509#comment-14051509 ] 

Joseph Gresock commented on SOLR-6199:
--------------------------------------

We would also enjoy this feature, per this discussion: http://lucene.472066.n3.nabble.com/Streaming-large-updates-with-SolrJ-td4144527.html

> SolrJ, using SolrInputDocument methods, requires entire document to be loaded into memory
> -----------------------------------------------------------------------------------------
>
>                 Key: SOLR-6199
>                 URL: https://issues.apache.org/jira/browse/SOLR-6199
>             Project: Solr
>          Issue Type: Bug
>          Components: clients - java
>    Affects Versions: 4.7.3
>            Reporter: Karl Wright
>
> ManifoldCF has historically used Solr's extracting update handler to transmit binary documents to Solr.  Recently, we added Tika processing of binary documents and wanted to send a character stream (of a length ManifoldCF does not limit) to Solr as the primary content field instead.  Unfortunately, it appears that the SolrInputDocument model for receiving extracted content and metadata requires all fields to be completely converted to String objects.  This will certainly cause ManifoldCF to run out of memory at some point, when multiple ManifoldCF threads all try to convert large documents to in-memory strings at the same time.
> I looked into what would be needed to add streaming support to UpdateRequest and SolrInputDocument.  Basically, it would become legal to set a field value to a Reader or a Reader[].  This would be straightforward to implement, EXCEPT that SolrCloud apparently makes copies of UpdateRequest, and copying a Reader isn't going to work unless there's a solid backing object somewhere that can be read again.  Even then, I could have gotten this to work by using a temporary file for large streams, but SolrCloud gives no signal when it is done with its copies of UpdateRequest, so there's no place to free any backing storage.
> If anyone knows a good way to do non-extracting updates without loading entire documents into memory, please let me know.
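
For readers trying to picture the temporary-file idea in the description above, here is a minimal sketch in plain Java. It is not SolrJ code and not a proposed patch: the class SpooledText and its methods are hypothetical names invented for illustration, and it assumes the content can be encoded as UTF-8. It only shows how a one-shot Reader could be spooled to disk once so that each copy gets a fresh Reader, without the whole document ever being held as a String.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/**
 * Hypothetical helper (not part of SolrJ): spools a character stream to a
 * temporary file once, then hands out a fresh Reader per consumer.
 */
final class SpooledText implements AutoCloseable {
    private final Path file;

    private SpooledText(Path file) {
        this.file = file;
    }

    /** Copies the source to a temp file through a fixed-size buffer. */
    static SpooledText spool(Reader source) throws IOException {
        Path tmp = Files.createTempFile("spooled-field-", ".txt");
        try (Writer out = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
            char[] buf = new char[8192];
            int n;
            while ((n = source.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            Files.deleteIfExists(tmp); // do not leak the file on failure
            throw e;
        }
        return new SpooledText(tmp);
    }

    /** Each call returns an independent Reader positioned at the start. */
    Reader openReader() throws IOException {
        return Files.newBufferedReader(file, StandardCharsets.UTF_8);
    }

    /** True while the backing file is still on disk. */
    boolean exists() {
        return Files.exists(file);
    }

    /** Frees the backing storage; someone has to know when to call this. */
    @Override
    public void close() throws IOException {
        Files.deleteIfExists(file);
    }

    /** Test/demo helper only: drains a Reader into a String. */
    static String readAll(Reader r) throws IOException {
        try (Reader in = r) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        }
    }
}
```

The close() method is where the problem described in the issue shows up: the sketch can re-read the content as often as needed, but it still depends on some caller knowing when the last copy is finished, which is the signal the reporter says SolrCloud does not provide.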



--
This message was sent by Atlassian JIRA
(v6.2#6252)
