Posted to solr-user@lucene.apache.org by Zheng Lin Edwin Yeo <ed...@gmail.com> on 2017/05/01 02:39:29 UTC

Slow indexing speed when collection size is large

Hi,

I'm using Solr 6.4.2.

I would like to check: if there are a lot of collections in my Solr which have
very large index sizes, will the indexing speed be affected?

Currently, I have created a new collection in a Solr instance that already has
several collections with very large index sizes, and the indexing speed is much
slower than expected.

Regards,
Edwin

Re: Slow indexing speed when collection size is large

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi Shawn,

Are the two types of indexing (ERH with OCR, and indexing from a DB)
happening on the same Solr server?
A) Yes, they are happening on the same Solr server, but currently, only the
indexing from a DB is running.

Is Solr in a virtual machine?
A) No

Is the 384GB at the hypervisor level, or the virtual machine level?
A) The hypervisor level. The virtual machine for Sybase is allocated
64GB of memory.

Is the 22GB heap the total heap memory, or is that per Solr instance?
A) Per Solr instance.

It's only the Sybase database that is running on a virtual machine under
Hyper-V. Solr is running on the main server.
The main server is running Windows 2012, while the virtual machine is
running SUSE Linux 9. Both Solr instances are on SSDs, while the virtual
machine is on an ordinary hard disk.

What is the best suggestion for the 5TB of indexes? The searching speed is
quite fast currently, even during indexing. It is the indexing speed that
is slow.

Regards,
Edwin



On 7 May 2017 at 21:14, Shawn Heisey <ap...@elyograg.org> wrote:

> On 5/6/2017 6:49 PM, Zheng Lin Edwin Yeo wrote:
> > For my rich document handling, I'm using the Extracting Request
> > Handler, and it requires OCR.
> >
> > However, the slow indexing speed I'm currently experiencing is with
> > indexing done directly from the Sybase database. I fetch about
> > 1000 records at a time from Sybase and store them in a CachedRowSet
> > to be indexed. The query to the Sybase database is quite fast, and most of
> > the time is spent on processing in the CachedRowSet.
> <snip>
> > A) 384 GB
> <snip>
> > A) 22 GB
> <snip>
> > A) 5 TB
> <snip>
> > A) A virtual machine with Sybase database is running on the server
>
> The discussion about the drawbacks of the Extracting Request Handler has
> already taken place.  Tika should be running on separate hardware, not
> embedded in Solr.  Having high-impact Tika processing run on the Solr
> server is going to slow everything down.
>
> Are the two types of indexing (ERH with OCR, and indexing from a DB)
> happening on the same Solr server?
>
> As soon as you mention virtual machines, my mental picture of the setup
> becomes much less clear.  You'll need to fully describe the OS and
> hardware setup, at both the hypervisor and virtual machine level.  Then
> I will know what questions to ask for more detailed information.
>
> Is Solr in a virtual machine?
> Is the 384GB at the hypervisor level, or the virtual machine level?
> Is the 22GB heap the total heap memory, or is that per Solr instance?
>
> If the 5TB is Solr index data, then there's no way you're going to get
> fast performance.  Putting enough memory in one machine to effectively
> cache that much data is impractically expensive, and most server
> hardware doesn't have enough memory slots even if you do have the
> money.  384GB wouldn't be enough for 5TB of index, and that's not even
> taking into account the memory needed by your software, including Solr
> and Sybase.
>
> Thanks,
> Shawn
>
>

Re: Slow indexing speed when collection size is large

Posted by Shawn Heisey <ap...@elyograg.org>.
On 5/6/2017 6:49 PM, Zheng Lin Edwin Yeo wrote:
> For my rich document handling, I'm using the Extracting Request Handler, and it requires OCR.
>
> However, the slow indexing speed I'm currently experiencing is with indexing done directly from the Sybase database. I fetch about 1000 records at a time from Sybase and store them in a CachedRowSet to be indexed. The query to the Sybase database is quite fast, and most of the time is spent on processing in the CachedRowSet.
<snip>
> A) 384 GB
<snip>
> A) 22 GB
<snip>
> A) 5 TB
<snip>
> A) A virtual machine with Sybase database is running on the server

The discussion about the drawbacks of the Extracting Request Handler has
already taken place.  Tika should be running on separate hardware, not
embedded in Solr.  Having high-impact Tika processing run on the Solr
server is going to slow everything down.

Are the two types of indexing (ERH with OCR, and indexing from a DB)
happening on the same Solr server?

As soon as you mention virtual machines, my mental picture of the setup
becomes much less clear.  You'll need to fully describe the OS and
hardware setup, at both the hypervisor and virtual machine level.  Then
I will know what questions to ask for more detailed information.

Is Solr in a virtual machine?
Is the 384GB at the hypervisor level, or the virtual machine level?
Is the 22GB heap the total heap memory, or is that per Solr instance?

If the 5TB is Solr index data, then there's no way you're going to get
fast performance.  Putting enough memory in one machine to effectively
cache that much data is impractically expensive, and most server
hardware doesn't have enough memory slots even if you do have the
money.  384GB wouldn't be enough for 5TB of index, and that's not even
taking into account the memory needed by your software, including Solr
and Sybase.
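
A rough back-of-the-envelope check of these numbers, as a sketch only. It
assumes two Solr instances with a 22GB heap each and a 64GB Sybase virtual
machine (figures reported elsewhere in this thread), and it treats 1TB as
1024GB. The function and variable names are purely illustrative.

```python
# Rough estimate of how much RAM is left for the OS disk cache.
# Figures come from this thread; the two-instance count and the 64 GB VM
# allocation are assumptions based on the follow-up reply.
TOTAL_RAM_GB = 384
SOLR_HEAP_GB = 22
SOLR_INSTANCES = 2          # assumed from "both Solr instances"
SYBASE_VM_GB = 64
INDEX_SIZE_GB = 5 * 1024    # 5 TB of index data, using 1 TB = 1024 GB


def cache_estimate(total_gb, heap_gb, instances, other_gb, index_gb):
    """Return (GB left for the OS disk cache, fraction of the index it covers)."""
    left_gb = total_gb - heap_gb * instances - other_gb
    return left_gb, left_gb / index_gb


left_gb, fraction = cache_estimate(
    TOTAL_RAM_GB, SOLR_HEAP_GB, SOLR_INSTANCES, SYBASE_VM_GB, INDEX_SIZE_GB
)
print(f"RAM left for disk cache: {left_gb} GB")
print(f"Share of index that fits in cache: {fraction:.1%}")
```

Under these assumptions about 276GB remains for the disk cache, which is a
little over 5% of the index. The estimate ignores memory used by the operating
system itself, so the real figure would be somewhat lower.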

Thanks,
Shawn


Re: Slow indexing speed when collection size is large

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi Shawn,

For my rich document handling, I'm using the Extracting Request Handler,
and it requires OCR.

However, the slow indexing speed I'm currently experiencing is with indexing
done directly from the Sybase database. I fetch about 1000 records at a time
from Sybase and store them in a CachedRowSet to be indexed. The query to the
Sybase database is quite fast, and most of the time is spent on processing in
the CachedRowSet.
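
One way to confirm where the time goes is to time each phase separately. The
following is a minimal sketch of that idea, not the actual indexing code:
`build_doc` and `send_batch` are hypothetical callables standing in for the
per-row processing and the Solr update call, and the batch size of 1000
mirrors the fetch size described above.

```python
import time
from itertools import islice


def batched(rows, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def index_in_batches(rows, build_doc, send_batch, size=1000):
    """Convert rows to documents and send them in batches, timing each phase.

    `build_doc` and `send_batch` are hypothetical callables supplied by the
    caller; they stand in for the row processing and the Solr update call.
    """
    timings = {"build": 0.0, "send": 0.0}
    total = 0
    for batch in batched(rows, size):
        # Phase 1: turn database rows into documents.
        start = time.perf_counter()
        docs = [build_doc(row) for row in batch]
        timings["build"] += time.perf_counter() - start

        # Phase 2: hand the finished batch to the indexing call.
        start = time.perf_counter()
        send_batch(docs)
        timings["send"] += time.perf_counter() - start
        total += len(docs)
    return total, timings
```

If the "build" total dominates, the bottleneck is on the client side rather
than in Solr, and changes to the Solr server are unlikely to help much.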


Here are the answers to the other questions:

On a single Solr server, how much total memory is installed?
A) 384 GB

What is the total amount of memory reserved for Solr heaps on that server?
A) 22 GB

What is the total on-disk size of all the Solr indexes on that server?
A) 5 TB

-- Multiple replicas must be included if they are present on one machine.
From the core (shard replica) perspective, how many documents are on
that server?
A) About 200 million documents across both replicas. Each replica has about
100 million. Currently, both replicas are on the same server, but on
different disks.

-- Multiple replicas must be included here too.
Is there software other than the Solr server process(es) running on that
server?
A) A virtual machine with Sybase database is running on the server

Are you making queries at the same time you're indexing?
A) Only occasionally. Most of the time, no queries are being made.

Regards,
Edwin



On 6 May 2017 at 20:41, Shawn Heisey <ap...@elyograg.org> wrote:

> On 5/1/2017 10:17 AM, Zheng Lin Edwin Yeo wrote:
> > I'm using SolrJ for the indexing, not curl. Normally I bundle
> > about 1000 documents for each POST. There's more than 300GB of RAM for
> > that server, and I do not use any sharding at the moment.
>
> Looking over your email history on the list, I was able to determine
> some information, but not everything I was wondering about.  I have some
> questions.
>
> Are you still using the Extracting Request Handler for your rich
> document handling, or have you moved Tika processing outside Solr?
> If it's outside Solr, is it on different machines?
> Are your rich documents still requiring OCR?
>
> Other questions:
>
> On a single Solr server, how much total memory is installed?
> What is the total amount of memory reserved for Solr heaps on that server?
> What is the total on-disk size of all the Solr indexes on that server?
> -- Multiple replicas must be included if they are present on one machine.
> From the core (shard replica) perspective, how many documents are on
> that server?
> -- Multiple replicas must be included here too.
> Is there software other than the Solr server process(es) running on that
> server?
> Are you making queries at the same time you're indexing?
>
> Thanks,
> Shawn
>
>

Re: Slow indexing speed when collection size is large

Posted by Shawn Heisey <ap...@elyograg.org>.
On 5/1/2017 10:17 AM, Zheng Lin Edwin Yeo wrote:
> I'm using SolrJ for the indexing, not curl. Normally I bundle
> about 1000 documents for each POST. There's more than 300GB of RAM for
> that server, and I do not use any sharding at the moment.

Looking over your email history on the list, I was able to determine
some information, but not everything I was wondering about.  I have some
questions.

Are you still using the Extracting Request Handler for your rich
document handling, or have you moved Tika processing outside Solr?
If it's outside Solr, is it on different machines?
Are your rich documents still requiring OCR?

Other questions:

On a single Solr server, how much total memory is installed?
What is the total amount of memory reserved for Solr heaps on that server?
What is the total on-disk size of all the Solr indexes on that server?
-- Multiple replicas must be included if they are present on one machine.
From the core (shard replica) perspective, how many documents are on
that server?
-- Multiple replicas must be included here too.
Is there software other than the Solr server process(es) running on that
server?
Are you making queries at the same time you're indexing?

Thanks,
Shawn


Re: Slow indexing speed when collection size is large

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi Rick,

I'm using SolrJ for the indexing, not curl.
Normally I bundle about 1000 documents for each POST.
There's more than 300GB of RAM for that server, and I do not use any
sharding at the moment.

Regards,
Edwin


On 1 May 2017 at 19:08, Rick Leir <rl...@leirtech.com> wrote:

> Zheng,
> Are you POSTing using curl? Get several processes working in parallel to
> get a small boost. SolrJ should speed you up a bit too (numbers anyone?).
> How many documents do you bundle in a POST?
>
> Do you have lots of RAM? Sharding?
> Cheers -- Rick
>
> On April 30, 2017 10:39:29 PM EDT, Zheng Lin Edwin Yeo <
> edwinyeozl@gmail.com> wrote:
> >Hi,
> >
> >I'm using Solr 6.4.2.
> >
> >I would like to check: if there are a lot of collections in my Solr which
> >have very large index sizes, will the indexing speed be affected?
> >
> >Currently, I have created a new collection in a Solr instance that already
> >has several collections with very large index sizes, and the indexing speed
> >is much slower than expected.
> >
> >Regards,
> >Edwin
>
> --
> Sorry for being brief. Alternate email is rickleir at yahoo dot com

Re: Slow indexing speed when collection size is large

Posted by Rick Leir <rl...@leirtech.com>.
Zheng,
Are you POSTing using curl? Get several processes working in parallel to get a small boost. SolrJ should speed you up a bit too (numbers anyone?). How many documents do you bundle in a POST?
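
A minimal sketch of the parallel-sending idea, using a thread pool rather than
separate processes, on the assumption that sending is mostly waiting on the
network. `send_batch` is a hypothetical callable that posts one batch of
documents and returns how many were accepted; it is not a real Solr API.

```python
from concurrent.futures import ThreadPoolExecutor


def send_in_parallel(batches, send_batch, workers=4):
    """Send each batch with `send_batch` using a small pool of worker threads.

    `send_batch` is a hypothetical callable that posts one batch of documents
    and returns the number of documents accepted. It must be thread-safe.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order and re-raises any exception
        # raised by a worker when its result is consumed.
        return sum(pool.map(send_batch, batches))
```

Note that `pool.map` submits every batch up front, so for a very large input
the batches would need to be fed in smaller groups to keep memory use bounded.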

Do you have lots of RAM? Sharding?
Cheers -- Rick

On April 30, 2017 10:39:29 PM EDT, Zheng Lin Edwin Yeo <ed...@gmail.com> wrote:
>Hi,
>
>I'm using Solr 6.4.2.
>
>I would like to check: if there are a lot of collections in my Solr which
>have very large index sizes, will the indexing speed be affected?
>
>Currently, I have created a new collection in a Solr instance that already
>has several collections with very large index sizes, and the indexing speed
>is much slower than expected.
>
>Regards,
>Edwin

-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com