Posted to solr-user@lucene.apache.org by Darx Oman <da...@gmail.com> on 2011/01/30 05:54:17 UTC

Solr Indexing Performance

Hi guys,

I'm running a Solr instance (trunk) on my dev server to test my
configuration.  I'm doing a DIH full import to index 49 PDF files with their
corresponding database records.  Both the PDF files and the database are
local to the server.

*Server:*

* Windows 2008 R2
* MS SQL Server 2008 R2
* 16-core processor
* 16 GB RAM

*Tomcat (7.0.5):*

* Set JAVA_OPTS = %JAVA_OPTS% -Xms1024M -Xmx8192M

*Solrconfig:*

* Main index configuration:
    <ramBufferSize>2048</ramBufferSize>
    <mergeFactor>50</mergeFactor>

*DIH configuration:*

* 2 data sources defined: JdbcDataSource and BinFileDataSource
* One main entity with 3 sub-entities:

<entity dataSource="myJdbc" …>
    <entity dataSource="myBinFile" …> </entity>
    <entity dataSource="myJdbc" …> </entity>
    <entity dataSource="myJdbc" …> </entity>
</entity>

* The schema has 8 fields in total, three of which are text type and
  multivalued.

*My DIH import status messages:*

* Total Requests made to DataSource = 99
* Total Rows Fetched = 2124
* Total Documents Processed = 49
* Time Taken = 0:2:3:880

Is this time reasonable, or can it be improved?

Re: Solr Indexing Performance

Posted by Darx Oman <da...@gmail.com>.
Thanks, Tomás.
I'll try different configurations.

Re: Solr Indexing Performance

Posted by Gora Mohanty <go...@mimirtech.com>.
On Sat, Feb 5, 2011 at 2:06 PM, Darx Oman <da...@gmail.com> wrote:
> I indexed 1000 PDF files with the same configuration; it completed in about
> 32 minutes.

So, it seems like your indexing time scales at least linearly with the
number of PDF documents that you have.

While this might be good news in your case, it is difficult to estimate
an "expected" indexing rate when indexing from documents.
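
As a rough check of the figures reported in this thread (49 documents in
0:2:3:880, read here as 2 minutes 3.880 seconds, and 1000 documents in about
32 minutes), the per-document cost of the two runs can be compared directly.
This is only a sketch: the function name is made up for illustration, and the
reading of the "Time Taken" value is an assumption.

```python
def seconds_per_doc(total_seconds: float, docs: int) -> float:
    """Average wall-clock seconds spent per indexed document."""
    return total_seconds / docs


# First run: 49 PDFs. "0:2:3:880" is assumed to mean 2 min 3.880 s.
small_run = seconds_per_doc(2 * 60 + 3.880, 49)

# Second run: 1000 PDFs in about 32 minutes.
large_run = seconds_per_doc(32 * 60, 1000)

# What the 1000-document run would take if the cost per document
# stayed the same as in the 49-document run (in minutes).
projected_minutes = small_run * 1000 / 60

print(f"49 docs:   {small_run:.2f} s/doc")
print(f"1000 docs: {large_run:.2f} s/doc")
print(f"linear projection for 1000 docs: {projected_minutes:.1f} min")
```

On these numbers the larger run costs less per document (about 1.9 s versus
about 2.5 s), and a purely linear projection from the small run would be
roughly 42 minutes rather than the 32 actually observed.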

Regards,
Gora

Re: Solr Indexing Performance

Posted by Darx Oman <da...@gmail.com>.
I indexed 1000 PDF files with the same configuration; it completed in about
32 minutes.

Re: Solr Indexing Performance

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi,

2 GB for ramBufferSize is probably too much and not needed, but you could
increase it from the default 32 MB to something like 128 MB or even 512 MB, if
you really have enough data for that to make a difference (you mention only
49 PDF files).  I'd leave mergeFactor at 10 for now.  The slowness (if there is
slowness - how long is it taking?) could be from:
* slow DB
* suboptimal SQL
* PDF content extraction
* indexing itself
* ...

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/




Re: Solr Indexing Performance

Posted by Tomás Fernández Löbbe <to...@gmail.com>.
Well, I would say that the best way to be sure is to benchmark different
configurations.
As far as I know, such a large RAM buffer size is usually not recommended;
the default is 32 MB, and you probably won't see any improvement using more
than 128 MB.
The same goes for the mergeFactor: I know that a larger merge factor is
better for indexing, but 50 sounds like a lot. Anyway, as I said before, the
best thing to do is benchmark different configurations and see which one
works better for you.
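
When comparing runs, it can help to turn the "Time Taken" value from the DIH
status into seconds so that configurations can be compared as plain numbers.
A minimal sketch, assuming the hours:minutes:seconds:milliseconds layout
shown in the original post; both function names are hypothetical, and a "."
before the milliseconds is also accepted in case the status output uses that.

```python
import re


def parse_time_taken(value: str) -> float:
    """Convert a 'Time Taken' string such as '0:2:3:880' to seconds.

    Assumes four numeric parts: hours, minutes, seconds, milliseconds.
    """
    parts = re.split(r"[:.]", value.strip())
    if len(parts) != 4:
        raise ValueError(f"unexpected Time Taken format: {value!r}")
    hours, minutes, seconds, millis = (int(p) for p in parts)
    return hours * 3600 + minutes * 60 + seconds + millis / 1000.0


def docs_per_second(time_taken: str, docs: int) -> float:
    """Indexing throughput for one import run."""
    return docs / parse_time_taken(time_taken)


# The run from the original post: 49 documents, Time Taken = 0:2:3:880.
baseline = docs_per_second("0:2:3:880", 49)
print(f"baseline throughput: {baseline:.3f} docs/s")
```

Running the same import after each configuration change and comparing the
resulting docs/s values gives a simple like-for-like measure.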

Have you tried assigning less memory to the JVM? That would leave more
memory available to the OS.

Tomás
