Posted to solr-user@lucene.apache.org by Pranav Prakash <pr...@gmail.com> on 2012/08/08 07:16:35 UTC

Is this too much time for full Data Import?

Folks,

My full data import takes ~80 hours. It covers ~9M documents, with ~15 SQL
queries for each document. The database servers are separate from the Solr
servers. Each document goes through an update processor chain which (a)
calculates a signature of the document using SignatureUpdateProcessorFactory
and (b) finds terms which have term frequency > 2, using a custom processor.
The index size is ~480 GiB.

I want to know whether this is too long for that document count. How do I
benchmark the import, and what are some of the ways I can improve it? I
believe there are some optimizations that I could make at the Update
Processor Factory level as well. What would be a good way to get my hands
dirty on this?

*Pranav Prakash*

"temet nosce"

Re: Is this too much time for full Data Import?

Posted by Alexey Serba <as...@gmail.com>.
9M * 15 = 135M queries in ~80 hours - that's a lot of queries (>400 QPS on average).
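
As a rough check on that figure, here is the arithmetic using only the numbers from the original post (~9M documents, ~15 queries each, ~80 hours). It is a back-of-the-envelope average, not a measurement, and it assumes the queries are spread evenly over the whole import:

```python
# Back-of-the-envelope query rate from the figures quoted in the thread.
documents = 9_000_000        # ~9M documents
queries_per_document = 15    # ~15 SQL queries per document
import_hours = 80            # ~80 hours for a full import

total_queries = documents * queries_per_document
average_qps = total_queries / (import_hours * 3600)

print(f"{total_queries:,} queries, about {average_qps:.0f} per second on average")
```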

I would try to reduce the number of queries:

1. Rewrite your main (root) query to select all possible data
* use SQL joins instead of DIH nested entities
* select data from 1-N related tables (tags, authors, etc.) in the main
query using the GROUP_CONCAT aggregate function (that's a MySQL-specific
function, but there are similar functions for other RDBMSes), and then
split the concatenated data in a DIH transformer.

2. Identify small tables in nested entities and cache them completely
with CachedSqlEntityProcessor.
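
To make point 1 concrete, here is a minimal sketch of the join + GROUP_CONCAT + split idea. It is not DIH configuration: it uses Python's sqlite3 module so it can run anywhere, and the docs/tags schema, the '|' separator and the split_tags helper are all made up for illustration. SQLite spells the separator as a second argument, while MySQL uses GROUP_CONCAT(t.tag SEPARATOR '|'). The split step stands in for whatever DIH transformer you would use.

```python
import sqlite3

# Hypothetical schema: one "docs" table and a 1-N "tags" table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE tags (doc_id INTEGER, tag TEXT);
    INSERT INTO docs VALUES (1, 'first'), (2, 'second');
    INSERT INTO tags VALUES (1, 'solr'), (1, 'dih'), (2, 'sql');
""")

# One root query instead of one extra tags query per document.
# LEFT JOIN keeps documents that have no tags at all.
ROOT_QUERY = """
    SELECT d.id, d.title, GROUP_CONCAT(t.tag, '|') AS tags
    FROM docs d
    LEFT JOIN tags t ON t.doc_id = d.id
    GROUP BY d.id, d.title
    ORDER BY d.id
"""

def split_tags(row):
    """Turn the concatenated column back into a list (the transformer step)."""
    doc_id, title, tags = row
    return {"id": doc_id, "title": title, "tags": tags.split("|") if tags else []}

documents = [split_tags(row) for row in conn.execute(ROOT_QUERY)]
for doc in documents:
    print(doc)
```

Two things to watch if you try this on MySQL: pick a separator that cannot occur in the data, and check group_concat_max_len (1024 by default), since longer concatenated values are truncated.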




Re: Is this too much time for full Data Import?

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
Hello,

Does your indexer fully utilize CPU/IO? Check with iostat/vmstat.
If it doesn't, take several thread dumps with the jvisualvm sampler or jstack
and try to understand what is blocking your threads from making progress.
You may need to speed up your SQL data consumption. To do this,
you can enable threads in DIH (only in 3.6.1) or move from N+1 SQL queries to
a select all/cache approach:
http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor and
https://issues.apache.org/jira/browse/SOLR-2382
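
For what the "select all/cache" approach means in practice, here is a small sketch of the idea only. It is not how CachedSqlEntityProcessor is implemented; it uses Python's sqlite3 module, and the authors/docs tables are hypothetical. The small table is read once into memory, and each document then does a dictionary lookup instead of its own SQL round trip.

```python
import sqlite3

# Hypothetical schema: a small lookup table and a large documents table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT, author_id INTEGER);
    INSERT INTO authors VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO docs VALUES (10, 'a', 1), (11, 'b', 2), (12, 'c', 1);
""")

# "Select all" once: load the whole small table into memory up front.
author_by_id = dict(conn.execute("SELECT id, name FROM authors"))
queries_issued = 1

documents = []
for doc_id, title, author_id in conn.execute(
        "SELECT id, title, author_id FROM docs ORDER BY id"):
    # In-memory lookup instead of one extra SQL query per document.
    documents.append({"id": doc_id, "title": title,
                      "author": author_by_id.get(author_id)})
queries_issued += 1  # the root query above

print(queries_issued, "queries for", len(documents), "documents")
```

With N+1 queries the same loop would issue one author query per document, so the query count grows with the document count instead of staying at two.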

Good luck




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: Is this too much time for full Data Import?

Posted by Michael Della Bitta <mi...@appinions.com>.
Pranav,

If possible, you may wish to consider moving a job this large outside
of DataImportHandler to a standalone program, as the SQL processing is
somewhat limited by the N+1 subselects problem.
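
A very rough sketch of the shape such a standalone program could take, in Python rather than SolrJ so it stays self-contained. Everything named here is an assumption for illustration: the update URL, the JSON-array payload, the batch size, and the batches/post_batch/index_all helpers. Check the update handler path and payload format for your Solr version before relying on any of it.

```python
import json
import urllib.request
from itertools import islice

# Assumption: a JSON update handler is reachable at this placeholder URL.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update/json"

def batches(rows, size):
    """Yield lists of at most `size` items from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def post_batch(docs, url=SOLR_UPDATE_URL):
    """Send one batch of documents as a JSON array (assumed payload shape)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status

def index_all(rows, size=1000, send=post_batch):
    """Stream rows from one big query to Solr in fixed-size batches."""
    sent = 0
    for chunk in batches(rows, size):
        send(chunk)
        sent += len(chunk)
    return sent
```

The rows argument would be the cursor of a single joined query, so the program controls its own SQL instead of issuing subselects per document. Passing a different send function makes it easy to dry-run the batching without a Solr instance.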

Michael Della Bitta

------------------------------------------------
Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn’t a Game

