Posted to user@manifoldcf.apache.org by Jack Krupansky <ja...@lucidimagination.com> on 2010/06/02 17:15:27 UTC

Re: Setting up Solr -- commit

I did in fact try setting commit in the Solr output connection arguments a month ago. It kind of worked, but Solr gave some errors on occasion due to overlapping requests - one request did a commit while other parallel requests from LCF were in various stages of processing. I do not recall whether I tried to set JVM throttling to 1 to force sequential processing of posted documents, but you don't really want to have to force sequential processing anyway.

Side note to Solr guys: What is the "contract" for the ExtractingRequestHandler in terms of handling parallel requests? Is it "the more the merrier" (including lots of PDF files?), or are there specific issues that the client must/should worry about? There is also the potential for multiple clients, LCF or other, simultaneously blasting at /update/extract. Obviously those clients can't know what each other is up to.

-- Jack Krupansky


From: karl.wright@nokia.com 
Sent: Wednesday, June 02, 2010 9:01 AM
To: connectors-user@incubator.apache.org 
Subject: RE: Setting up Solr


You can send any argument you want by configuring the output connector. However, an explicit commit on every post will slow down your crawls.

 

Karl

 

From: ext Rohan.GPatil@cognizant.com [mailto:Rohan.GPatil@cognizant.com] 
Sent: Wednesday, June 02, 2010 9:00 AM
To: connectors-user@incubator.apache.org
Subject: RE: Setting up Solr

 

Hi,

 

Yes, that is where I was stuck: making an explicit commit.

 

Can I send the argument commit=true while configuring the Repo connector?

 

Thanks & Regards,

Rohan G Patil

Cognizant  Programmer Analyst Trainee,Bangalore || Mob # +91 9535577001 

Rohan.GPatil@cognizant.com

 

From: Jack Krupansky [mailto:jack.krupansky@lucidimagination.com] 
Sent: Wednesday, June 02, 2010 4:42 PM
To: connectors-user@incubator.apache.org
Subject: Re: Setting up Solr

 

A short Solr tutorial is here:

 

http://lucene.apache.org/solr/tutorial.html

After running an LCF job that uses a Solr output connection, be sure to manually force a Solr "commit", for example:

 

    cd .../apache-solr-1.4.0/example/exampledocs
    java -jar post.jar
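
For readers who would rather issue the commit over HTTP than through post.jar, here is a minimal Python sketch. It assumes the Solr example server is reachable at http://localhost:8983/solr and that its XML update handler at /update accepts a <commit/> message. The base URL and both function names are illustrative and do not come from this thread.

```python
import urllib.request

# Assumed base URL of the Solr example server; adjust for your install.
SOLR_BASE_URL = "http://localhost:8983/solr"


def build_commit_request(base_url=SOLR_BASE_URL):
    """Build (but do not send) an HTTP POST that asks Solr to commit."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/update",
        data=b"<commit/>",
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


def send_commit(base_url=SOLR_BASE_URL, timeout=30):
    """Send the commit request and return the HTTP status code."""
    request = build_commit_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling send_commit() once an LCF job has finished should make the newly indexed documents searchable, provided the server is running at the assumed address.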


-- Jack Krupansky

 

From: Rohan.GPatil@cognizant.com 

Sent: Wednesday, June 02, 2010 1:46 AM

To: connectors-user@incubator.apache.org 

Subject: Setting up Solr

 

Hi,

 

I am stuck at setting up the Solr server to be used with LCF.

 

I am new to Solr.

 

Thanks & Regards,

Rohan G Patil

Cognizant  Programmer Analyst Trainee,Bangalore || Mob # +91 9535577001 

Rohan.GPatil@cognizant.com

 

      This e-mail and any files transmitted with it are for the sole use of 
      the intended recipient(s) and may contain confidential and privileged 
      information.
      If you are not the intended recipient, please contact the sender by 
      reply e-mail and destroy all copies of the original message.
      Any unauthorized review, use, disclosure, dissemination, forwarding, 
      printing or copying of this email or any action taken in reliance on this 
      e-mail is strictly prohibited and may be unlawful.
     

 



     

 

Re: Setting up Solr -- commit

Posted by Jack Krupansky <ja...@lucidimagination.com>.
I am not convinced. Autocommit works great for your average search engine. 
People are used to the fact that documents appear... whenever, and they 
ponder exactly why their documents haven't been indexed yet, but they accept 
that they have no control. But there is also the issue of developer 
productivity, including initial evaluations. I keep a separate shell window 
with a commit command in it. After I run my LCF test I have to remember to 
go over to that shell, up-arrow to the command to send a commit to Solr, and 
then do my search in Solr. I don't always remember that extra manual step 
and sometimes I think I did it but didn't or got some other command or shell 
by accident. More lost time.  Sure, I could sit there and wait for Solr to 
autocommit as well. Neither solution feels right from a developer 
productivity perspective.

So, five distinct use cases:

1) Initial evaluation. Fewer details to get right (or wrong, or omit).
2) Ongoing repetitive development testing.
3) Production with a "lazy" autocommit policy.
4) High volume of incoming documents, but a size-based commit is optimal for 
Solr.
5) Scheduled high-volume production (incoming documents, changes, or 
re-crawl/re-indexing of full datasets) where there is a well-defined point 
(or points), based on the job definition, where a commit is "best".

-- Jack Krupansky

--------------------------------------------------
From: "Erik Hatcher" <er...@gmail.com>
Sent: Wednesday, June 02, 2010 11:21 AM
To: <co...@incubator.apache.org>
Subject: Re: Setting up Solr -- commit

> Autocommit is really the right answer here for the discussions going on
> today. When there are multiple streams of incoming documents to Solr,
> unless you want to build some kind of coordinated system that'll control
> commits, simply use autocommit. Definitely a commit-per-doc is not
> recommended, and highly discouraged.
>
> As for indexing - it really is the more the merrier, to a point. Server
> RAM is needed to handle incoming requests, and these rich documents are
> typically largish. Throttling so as to not add too many (how many is
> that? gotta test with your system and RAM and solrconfig.xml settings)
> docs at a time is going to be needed in some way.
>
> Erik

RE: Setting up Solr -- commit

Posted by ka...@nokia.com.
LCF already limits the maximum number of connections for a specific output connection instance. So while there's no provision for limiting the speed at which data gets thrown at Solr on each connection, there is a limit on how many connections exist at any given time.

Hopefully this is sufficient.

Karl
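
To picture the kind of cap described above (a limit on the number of simultaneous connections, with no limit on the rate per connection), here is a small Python sketch built on a semaphore. It is not LCF code. The class name, the function names, and the post_fn callable standing in for "send one document to Solr" are all hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ConnectionLimiter:
    """Caps how many callers may hold a connection at the same time."""

    def __init__(self, max_connections):
        self._slots = threading.BoundedSemaphore(max_connections)
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0  # highest concurrency observed, for inspection

    def run(self, post_fn, doc):
        with self._slots:  # blocks while max_connections are in use
            with self._lock:
                self._active += 1
                self.peak = max(self.peak, self._active)
            try:
                return post_fn(doc)
            finally:
                with self._lock:
                    self._active -= 1


def post_all(docs, post_fn, max_connections, workers=8):
    """Post every document, never using more than max_connections at once."""
    limiter = ConnectionLimiter(max_connections)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda d: limiter.run(post_fn, d), docs))
    return results, limiter.peak
```

Each post still runs as fast as it can; only the number of posts in flight is bounded, which matches the distinction drawn in the message above.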




Re: Setting up Solr -- commit

Posted by Erik Hatcher <er...@gmail.com>.
Autocommit is really the right answer here for the discussions going
on today. When there are multiple streams of incoming documents to
Solr, unless you want to build some kind of coordinated system that'll
control commits, simply use autocommit. Definitely a commit-per-doc
is not recommended, and highly discouraged.

As for indexing - it really is the more the merrier, to a point.
Server RAM is needed to handle incoming requests, and these rich
documents are typically largish. Throttling so as to not add too
many (how many is that? gotta test with your system and RAM and
solrconfig.xml settings) docs at a time is going to be needed in some
way.

	Erik
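
Solr's autocommit is configured in solrconfig.xml through an <autoCommit> element with maxDocs and maxTime thresholds. The Python sketch below only simulates that style of policy so the trade-off is easier to see. It is a simplification under stated assumptions (Solr applies the time limit with its own timer, not by polling), and the class and method names are hypothetical.

```python
class AutoCommitPolicy:
    """Decides when pending documents should be committed."""

    def __init__(self, max_docs, max_time_ms):
        self.max_docs = max_docs
        self.max_time_ms = max_time_ms
        self.pending = 0
        self.first_pending_ms = None  # arrival time of the oldest uncommitted doc

    def add(self, now_ms):
        """Record one added document; return True if a commit is now due."""
        if self.pending == 0:
            self.first_pending_ms = now_ms
        self.pending += 1
        return self.commit_due(now_ms)

    def commit_due(self, now_ms):
        """True once either the document count or the age threshold is reached."""
        if self.pending == 0:
            return False
        too_many = self.pending >= self.max_docs
        too_old = now_ms - self.first_pending_ms >= self.max_time_ms
        return too_many or too_old

    def committed(self):
        """Reset the counters after the commit has been performed."""
        self.pending = 0
        self.first_pending_ms = None
```

With several clients posting at once, a single policy like this on the server side decides when to commit, so no client has to coordinate with the others.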




Re: Setting up Solr -- commit

Posted by Erik Hatcher <er...@gmail.com>.
On Jun 2, 2010, at 11:20 AM, <karl.wright@nokia.com> wrote:
> If the ExtractingRequestHandler doesn't properly handle parallel
> requests

It handles them properly, as Solr does with every request. It's a web
application designed for concurrent requests, even commits (though Solr
throttles those internally). RAM and CPU will be the hurdles when blasting
a boatload of rich docs at Solr.

	Erik


RE: Setting up Solr -- commit

Posted by ka...@nokia.com.
If the ExtractingRequestHandler doesn't properly handle parallel requests intermingled with commits, then my previous concerns about complex decision making around when to do a commit become even more pronounced.

Seems to me that this isn't something that LCF should be trying to solve.

Karl

