Posted to solr-user@lucene.apache.org by "Gustavo A. Lopes" <ga...@mediacapital.pt> on 2009/04/19 04:00:18 UTC

Slow indexing with data import handler

I'm indexing around 1 million documents of one type, each of which requires 4 additional queries, plus 0.5 million documents that require only 1 query in total.

I'm using the data import handler from contrib with SolrWriter modified with allowDups = true (doesn't seem to have made any difference).

This doesn't seem to be that many documents; however, after 21 hours, I have only ~700k documents of the first type indexed. The index is currently 2.1 GB.

I've noticed that the initial import rate is relatively high: if it were maintained, all the documents of the first type would be indexed in less than 6 hours. As the number of documents already imported rises, the import rate falls significantly.
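
To put numbers on the slowdown, here is a small sketch in Python that uses only the figures quoted in this message. The constant and function names are made up for illustration. Since 6 hours is an upper bound on the projected time, the initial rate it prints is a lower bound.

```python
# Figures taken from the message; the names below are illustrative only.
DOCS_FIRST_TYPE = 1_000_000      # documents of the first type
INDEXED_SO_FAR = 700_000         # indexed after 21 hours
ELAPSED_HOURS = 21
INITIAL_ESTIMATE_HOURS = 6       # "less than 6 hours" at the initial rate


def docs_per_second(docs: int, hours: float) -> float:
    """Average throughput in documents per second."""
    return docs / (hours * 3600)


initial_rate = docs_per_second(DOCS_FIRST_TYPE, INITIAL_ESTIMATE_HOURS)
average_rate = docs_per_second(INDEXED_SO_FAR, ELAPSED_HOURS)
slowdown = initial_rate / average_rate

print(f"initial rate (at least): {initial_rate:.1f} docs/s")
print(f"average over 21 hours:   {average_rate:.1f} docs/s")
print(f"slowdown factor:         {slowdown:.1f}x")
```

The average includes the fast initial phase, so the current rate is presumably lower still than the 21-hour average.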

Does anyone have any suggestions on how to speed up full imports? What is the bottleneck? I will probably have to make some changes to the schema over the next few days that will require new imports.

thanks

Re: Slow indexing with data import handler

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
Further to Otis's suggestions -- Do you have autoCommit+autowarming turned
on? Maybe that is the cause of the slowdown as the import progresses?
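
One quick way to check this is to read the relevant settings out of solrconfig.xml. The following Python sketch assumes the usual layout, with autoCommit under updateHandler and an autowarmCount attribute on the caches under query. The sample XML and the function name summarize are hypothetical and serve only as input for the sketch; point it at your own file's contents instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical solrconfig.xml excerpt, used only as input for this sketch.
SAMPLE_SOLRCONFIG = """
<config>
  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxDocs>10000</maxDocs>
      <maxTime>60000</maxTime>
    </autoCommit>
  </updateHandler>
  <query>
    <filterCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="256"/>
    <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
  </query>
</config>
"""


def summarize(config_xml: str) -> dict:
    """Report autoCommit settings and caches with a non-zero autowarmCount."""
    root = ET.fromstring(config_xml)

    # A commented-out <autoCommit> block is not parsed, so it reads as "off".
    commit_settings = {}
    auto_commit = root.find("./updateHandler/autoCommit")
    if auto_commit is not None:
        commit_settings = {c.tag: (c.text or "").strip() for c in auto_commit}

    # Collect every cache under <query> that autowarms at least one entry.
    warming = {}
    for cache in root.findall("./query/*[@autowarmCount]"):
        count = int(cache.get("autowarmCount"))
        if count > 0:
            warming[cache.tag] = count

    return {"autoCommit": commit_settings, "autowarming": warming}


result = summarize(SAMPLE_SOLRCONFIG)
print(result)
```

If autoCommit comes back non-empty and any cache autowarms, each automatic commit during the import opens a new searcher and warms it, and that work gets more expensive as the index grows.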

-- 
Regards,
Shalin Shekhar Mangar.

Re: Slow indexing with data import handler

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi,

It could be that you are simply seeing the effect of index segment merges that take longer as segments get bigger. Or it could be that the JVM doesn't have enough memory and is running GC too often. Do you see high CPU load or lots of disk I/O or something else?
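
To illustrate the merge effect, here is a deliberately simplified toy model in Python. It is not Lucene's actual merge policy: it assumes every flush writes one segment of size 1 and that segments are merged whenever merge_factor of them have the same size (the parameter name is loosely modelled on Solr's mergeFactor setting; the function name is made up).

```python
def simulate_merges(flushes: int, merge_factor: int = 10):
    """Return (total_merge_cost, largest_single_merge) for a toy merge model.

    Each flush adds one segment of size 1. Whenever merge_factor segments of
    equal size exist, they are merged into one segment, and the merge costs
    the sum of the sizes it rewrites.
    """
    counts = {}  # segment size -> number of segments of that size
    total_cost = 0
    largest = 0
    for _ in range(flushes):
        size = 1
        counts[size] = counts.get(size, 0) + 1
        # Cascade: one merge may complete a full set at the next size up.
        while counts[size] == merge_factor:
            counts[size] = 0
            merged = size * merge_factor
            total_cost += merged
            largest = max(largest, merged)
            size = merged
            counts[size] = counts.get(size, 0) + 1
    return total_cost, largest


for flushes in (10, 100, 1000):
    print(flushes, simulate_merges(flushes))
```

In this model the largest single merge grows in proportion to the index size, so the occasional long stalls get longer as the import progresses, even though the per-document cost of adding new segments stays the same.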

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch