Posted to user@manifoldcf.apache.org by Scott Schneider <sc...@gmail.com> on 2012/03/28 00:24:10 UTC

Slow performance with a basic setup

Hi all,

I have a pretty simple ManifoldCF setup, but I'm getting very slow
performance.  Can someone help me understand and/or fix this?

My input is a web connector that goes to an Apache HTTP server running
on the local machine, serving static text files.  I have a null
authority service.  I output to Solr, also running locally.

The data I'm crawling is ~20 MB total in ~8,500 small files.  I started
the job one afternoon, and by the next morning it had not finished!  It
had only processed ~2,500 documents.  Strangely, it listed ~10,000
total documents (and ~7,500 active).

My ultimate goal is to figure out how much space the Solr index takes
as I add more access tokens.  That's why I'm using the web connector
and null authority, rather than just using a file system connector.

Thanks,
Scott

Re: Slow performance with a basic setup

Posted by Scott Schneider <sc...@gmail.com>.
Ah, thanks!  I set up PostgreSQL in my previous installation, but
missed it this time.

Scott



Re: Slow performance with a basic setup

Posted by Karl Wright <da...@gmail.com>.
Now it sounds like you are running into known problems with Apache
Derby.  That is why we suggest using PostgreSQL rather than Derby for
any kind of real-world crawling.  Derby is super convenient but it has
problems handling deadlocks properly.

You can also use HSQLDB if you prefer an integrated solution, but
PostgreSQL is faster.

I suggest you look at
http://incubator.apache.org/connectors/en_US/performance-tuning.html
to get an idea of what all this is about, and also don't forget to look
at how-to-build-and-deploy.html for a general idea of how to set up both
single-process and multi-process installations that use PostgreSQL.

Thanks,
Karl


Re: Slow performance with a basic setup

Posted by Scott Schneider <sc...@gmail.com>.
Thanks for the quick response!  I had been using all the default
settings.  Once I deleted the bandwidth throttling, one phase of the
job goes much faster.  The number of active documents goes from 0 to
the total in just a minute or two.  The overall time seems to be shorter,
but it still takes about an hour to process ~600 files totaling ~800 KB.  I
also increased the max connections to 50 on the web, null, and Solr
connections and changed Solr to commit within 30,000 msec rather than
at the end of every job.  That does not seem to have made a
difference.

Actually, I have no idea what state ManifoldCF is in right now.  I hit
restart a few hours ago and the status still says "Restarting".  There
is nothing in the command windows where I started ManifoldCF or Solr
or in the ManifoldCF log file.  The Solr command window does list
ManifoldCFSecurityFilter a few times.

Scott
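
For reference, the figures in this message work out to a very low
throughput.  Below is a small Python sketch of the arithmetic.  It
takes "about an hour" as exactly 3,600 seconds, which is an assumption,
and the function name is made up for illustration.

```python
def throughput(doc_count: int, total_kb: float, elapsed_seconds: float):
    """Return (documents per minute, KB per second) for a crawl."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    docs_per_minute = doc_count / (elapsed_seconds / 60.0)
    kb_per_second = total_kb / elapsed_seconds
    return docs_per_minute, kb_per_second


if __name__ == "__main__":
    # ~600 files totaling ~800 KB in roughly one hour (figures from the message)
    docs_per_min, kb_per_sec = throughput(600, 800, 3600)
    print(f"{docs_per_min:.1f} docs/min, {kb_per_sec:.2f} KB/s")
```

That is roughly 10 documents per minute and well under 1 KB/s, which
is far less than a local HTTP server and a local Solr instance would
normally be expected to handle, so the bottleneck is probably somewhere
else.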



Re: Slow performance with a basic setup

Posted by Karl Wright <da...@gmail.com>.
Let's start with some basics.
First of all, how many web connections do you have configured?  What
do you have for throttling?  If you have not modified the default
settings for throttling and are pulling a number of documents off of
ONE server, then throttling is probably severely limiting your crawl
speed.

Karl
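
To see why a per-server throttle can dominate crawl time, here is a
rough back-of-the-envelope sketch in Python.  The fetch rates are
hypothetical values chosen for illustration, not ManifoldCF's actual
defaults, and the function name is made up; the document count is the
~8,500 from the original post.

```python
def crawl_hours(doc_count: int, max_fetches_per_minute: float) -> float:
    """Lower bound on crawl time when one server is capped at a fetch rate."""
    if max_fetches_per_minute <= 0:
        raise ValueError("fetch rate must be positive")
    return doc_count / max_fetches_per_minute / 60.0


if __name__ == "__main__":
    docs = 8500  # approximate file count from the original post
    for rate in (10, 100, 1000):  # hypothetical fetches per minute
        print(f"{rate:>5} fetches/min -> {crawl_hours(docs, rate):6.2f} hours")
```

With everything coming from ONE server, a cap of around 10 fetches per
minute would already mean more than 14 hours for ~8,500 documents, no
matter how fast the server or Solr is.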
