Posted to solr-user@lucene.apache.org by Ken Krugler <kk...@transpac.com> on 2011/11/04 18:16:03 UTC

overwrite=false support with SolrJ client

Hi list,

I'm working on improving the performance of the Solr scheme for Cascading.

This supports generating a Solr index as the output of a Hadoop job. We use SolrJ to write the index locally (via EmbeddedSolrServer).
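
For context, the write side looks roughly like this (a minimal sketch, assuming the classic embedded setup; the solr home path is a placeholder, and CoreContainer bootstrapping varies across SolrJ versions):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.core.CoreContainer;

    // Point at a local solr home containing solr.xml and the core's conf/
    // (the path and core name are illustrative, not the actual scheme's values)
    System.setProperty("solr.solr.home", "/path/to/solr-home");
    CoreContainer.Initializer initializer = new CoreContainer.Initializer();
    CoreContainer coreContainer = initializer.initialize();
    SolrServer server = new EmbeddedSolrServer(coreContainer, "");

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-1");
    server.add(doc);
    server.commit();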

There are mentions of using overwrite=false with the CSV request handler as a way of improving indexing performance.
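
(For reference, the CSV handler route looks something like the following - the URL, core and file name are illustrative, not from our setup:)

    curl 'http://localhost:8983/solr/update/csv?overwrite=false&commit=true' \
         -H 'Content-type: text/csv; charset=utf-8' --data-binary @docs.csv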

I see that https://issues.apache.org/jira/browse/SOLR-653 removed this support from SolrJ, because it was deemed too dangerous for mere mortals.

My question is whether anyone knows how much of a performance boost this really provides.

For Hadoop-based workflows it's straightforward to ensure that the unique key field really is unique, so if the performance gain is significant, I might look into some way (with a trigger lock) of re-enabling this support in SolrJ.

Thanks,

-- Ken

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr

Re: overwrite=false support with SolrJ client

Posted by Jason Rutherglen <ja...@gmail.com>.
It should be supported in SolrJ; I'm surprised it's been lopped out.
Bulk indexing is extremely common.

On Fri, Nov 4, 2011 at 1:16 PM, Ken Krugler <kk...@transpac.com> wrote:
> Hi list,
>
> I'm working on improving the performance of the Solr scheme for Cascading. [...]

Re: overwrite=false support with SolrJ client

Posted by Mark Miller <ma...@gmail.com>.
On Nov 10, 2011, at 1:36 PM, Ken Krugler wrote:

>> : I see that https://issues.apache.org/jira/browse/SOLR-653 removed this 
>> : support from SolrJ, because it was deemed too dangerous for mere 
>> : mortals.

That seems silly. It should just be well documented, or at worst marked as an expert option. Yuck.

If you already know your docs are unique, this can be much more efficient in some cases, because it lets Solr simply add each document rather than doing a delete-then-add update.
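
(Roughly the difference at the Lucene level - an illustrative sketch, not Solr's actual code path:)

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    // overwrite=true behaves like updateDocument(): delete any existing
    // doc with the same unique key, then add the new one.
    // overwrite=false behaves like a plain addDocument(): blind append,
    // with no per-document delete lookup.
    void addDoc(IndexWriter writer, Document doc, String key, boolean overwrite)
        throws java.io.IOException {
      if (overwrite) {
        writer.updateDocument(new Term("id", key), doc);
      } else {
        writer.addDocument(doc);
      }
    }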

It's simple to do this with curl - why shouldn't it be simple with the *java* lib!

- Mark Miller
lucidimagination.com

Re: overwrite=false support with SolrJ client

Posted by Ken Krugler <kk...@transpac.com>.
On Nov 7, 2011, at 12:06pm, Chris Hostetter wrote:

> 
> : I see that https://issues.apache.org/jira/browse/SOLR-653 removed this 
> : support from SolrJ, because it was deemed too dangerous for mere 
> : mortals.
> 
> I believe the concern was that the "novice level" API was very in your 
> face about asking if you wanted to "overwrite" and made it too easy to 
> hurt yourself.
> 
> It should still be fairly trivial to specify overwrite=false in a SolrJ 
> request -- just not using the convenience methods. Something like...
> 
> 	UpdateRequest req = new UpdateRequest();
> 	req.add(myBigCollectionOfDocuments);
> 	req.setParam(UpdateParams.OVERWRITE, true);
> 	req.process(mySolrServer);

That worked, thanks for the suggestion - though note, in case anybody else reads this, that setParam takes a String, and the value you want here is false:

   req.setParam(UpdateParams.OVERWRITE, Boolean.toString(false));

I'll need to run some tests to check performance improvements.
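
(If it's useful to anyone else, the comparison I have in mind is basically the following - a hypothetical sketch, not measured results; "docs" is a prepared List<SolrInputDocument>:)

    // Time a bulk add into a fresh index with overwrite=false; run the
    // same batch with Boolean.toString(true) into another fresh index
    // to get the baseline.
    long start = System.currentTimeMillis();
    UpdateRequest req = new UpdateRequest();
    req.add(docs);
    req.setParam(UpdateParams.OVERWRITE, Boolean.toString(false));
    req.process(server);
    server.commit();
    System.out.println("Elapsed: " + (System.currentTimeMillis() - start) + "ms");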

> : For Hadoop-based workflows, it's straightforward to ensure that the 
> : unique key field is really unique, thus if the performance gain is 
> : significant, I might look into figuring out some way (with a trigger 
> : lock) of re-enabling this support in SolrJ.
> 
> It's not just an issue of knowing that the key is unique -- it's an issue 
> of being certain that your index does not contain any documents with the 
> same key as a document you are about to add. If you are generating a 
> completely new Solr index from data that you are certain is unique -- then 
> you will probably see some perf gains. But if you are adding to an 
> existing index, I would avoid it.

For Hadoop workflows, the output is always fresh (unless you do some interesting helicopter stunts).

So yes, by default the index is always being rebuilt from scratch.

So as long as the primary key is used as the reduce-phase key, each reduce call sees exactly one key value, and uniqueness in the index follows - something like the sketch below.
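
(A minimal sketch of what I mean - the class and field names are hypothetical, not the actual Cascading scheme code:)

    import java.io.IOException;
    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.common.SolrInputDocument;

    public static class SolrReducer
        extends Reducer<Text, MapWritable, NullWritable, NullWritable> {

      private SolrServer solr; // e.g. an EmbeddedSolrServer, created in setup()

      @Override
      protected void reduce(Text key, Iterable<MapWritable> values, Context context)
          throws IOException, InterruptedException {
        // The framework calls reduce() exactly once per key, so emitting
        // one document per call makes the "id" field unique by construction.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", key.toString());
        // ...copy the remaining fields from the (merged) values...
        try {
          solr.add(doc);
        } catch (SolrServerException e) {
          throw new IOException(e);
        }
      }
    }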

Thanks again,

-- Ken

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr

Re: overwrite=false support with SolrJ client

Posted by Chris Hostetter <ho...@fucit.org>.
: I see that https://issues.apache.org/jira/browse/SOLR-653 removed this 
: support from SolrJ, because it was deemed too dangerous for mere 
: mortals.

I believe the concern was that the "novice level" API was very in your 
face about asking if you wanted to "overwrite" and made it too easy to 
hurt yourself.

It should still be fairly trivial to specify overwrite=false in a SolrJ 
request -- just not using the convenience methods. Something like...

	UpdateRequest req = new UpdateRequest();
	req.add(myBigCollectionOfDocuments);
	req.setParam(UpdateParams.OVERWRITE, true);
	req.process(mySolrServer);

: For Hadoop-based workflows, it's straightforward to ensure that the 
: unique key field is really unique, thus if the performance gain is 
: significant, I might look into figuring out some way (with a trigger 
: lock) of re-enabling this support in SolrJ.

It's not just an issue of knowing that the key is unique -- it's an issue 
of being certain that your index does not contain any documents with the 
same key as a document you are about to add. If you are generating a 
completely new Solr index from data that you are certain is unique -- then 
you will probably see some perf gains. But if you are adding to an 
existing index, I would avoid it: with overwrite=false, adding a document 
whose key already exists in the index leaves both copies there.


-Hoss