Posted to solr-user@lucene.apache.org by Rallavagu <ra...@gmail.com> on 2014/03/05 20:37:52 UTC

Indexing huge data

All,

Wondering about best practices/common practices for indexing/re-indexing
a huge amount of data in Solr. The data is about 6 million entries spread
across a db and other sources (the data is not located in one place). I am
trying a SolrJ-based solution that collects data from the different
sources and indexes it into Solr. It takes hours to index everything.

Thanks in advance

Re: Indexing huge data

Posted by Rallavagu <ra...@gmail.com>.
Erick,

That helps so I can focus on the problem areas. Thanks.


Re: Indexing huge data

Posted by Rallavagu <ra...@gmail.com>.
Thanks for all the responses so far. Test runs do not suggest any 
bottleneck in Solr yet as I continue to work on different approaches. 
Collecting the data from the different sources seems to be consuming 
most of the time.


Re: Indexing huge data

Posted by Erick Erickson <er...@gmail.com>.
Kranti and Susheel's approaches are certainly
reasonable, assuming I bet right :).

Another strategy is to rack together N
indexing programs that simultaneously
feed Solr.

In any of these scenarios, the end goal is to get
Solr using up all the CPU cycles it can, _assuming_
that Solr isn't the bottleneck in the first place.

Best,
Erick
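
One way to approximate "N indexing programs" from a single JVM is SolrJ's
ConcurrentUpdateSolrServer, which buffers adds and streams them to Solr
from a pool of background threads. A minimal sketch, assuming SolrJ 4.x;
the URL, queue size, thread count, and field names are illustrative:

    import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class ParallelFeeder {
      public static void main(String[] args) throws Exception {
        // Illustrative URL, queue size, and feeder thread count.
        ConcurrentUpdateSolrServer server = new ConcurrentUpdateSolrServer(
            "http://localhost:8983/solr/collection1", 10000, 4);
        for (int i = 0; i < 1000000; i++) {
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", Integer.toString(i));
          doc.addField("title_s", "doc " + i); // title_s: assumed dynamic field
          server.add(doc);            // buffered; sent by background threads
        }
        server.blockUntilFinished();  // drain the internal queue
        server.commit();              // one commit at the end, not per doc
        server.shutdown();
      }
    }

The same idea scales past one machine by running several such feeders
against the same Solr instance, as Erick suggests.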


Re: Indexing huge data

Posted by Kranti Parisa <kr...@gmail.com>.
That's what I do: pre-create JSONs that follow the schema and save them in
MongoDB as part of the ETL process. After that, just dump the JSONs into
Solr using batching etc. With this you can do both full and incremental
indexing.

Thanks,
Kranti K. Parisa
http://www.linkedin.com/in/krantiparisa
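
Assuming the pre-built files are in Solr's update-JSON format and the
example /update/json handler is registered, each batch can be posted
as-is; the core and file names here are illustrative:

    for f in batch-*.json; do
      curl 'http://localhost:8983/solr/collection1/update/json' \
           -H 'Content-type:application/json' --data-binary @"$f"
    done
    # a single commit at the end instead of one per batch
    curl 'http://localhost:8983/solr/collection1/update?commit=true'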




Re: Indexing huge data

Posted by Rallavagu <ra...@gmail.com>.
Yeah. I have thought about spitting out JSON and running it against Solr 
using parallel HTTP threads separately. Thanks.


RE: Indexing huge data

Posted by Susheel Kumar <su...@thedigitalgroup.net>.
One more suggestion is to collect/prepare the data in CSV format (a sample of 1-2 million records, depending on size) and then import it directly into Solr using the CSV handler & curl. This will give you the pure indexing time & show where the difference lies.

Thanks,
Susheel  
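
A minimal curl invocation of the kind Susheel describes, assuming the
example /update/csv handler is registered and a sample.csv whose header
row matches the schema fields; the core and file names are illustrative:

    curl 'http://localhost:8983/solr/collection1/update/csv?commit=true' \
         -H 'Content-type:text/csv; charset=utf-8' \
         --data-binary @sample.csv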


Re: Indexing huge data

Posted by Erick Erickson <er...@gmail.com>.
Here's the easiest thing to try to figure out where to
concentrate your energies... Just comment out the
server.add call in your SolrJ program. Well, and any
commits you're doing from SolrJ.

My bet: Your program will run at about the same speed
it does when you actually index the docs, indicating that
your problem is on the data-acquisition side. Of course,
the older I get, the more times I've been wrong :).

You can also monitor the CPU usage on the box running
Solr. I often see it idling along at < 30% when indexing, or
even < 10%, again indicating that the bottleneck is on the
acquisition side.

Note I haven't mentioned any solutions; I'm a believer in
identifying the _problem_ before worrying about a solution.

Best,
Erick
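
A minimal sketch of this test, assuming SolrJ 4.x; fetchFromSources() is
a hypothetical stand-in for the real data-acquisition code. With the add
and commit lines commented out, the elapsed time measures acquisition
alone:

    import java.util.Collections;
    import java.util.List;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class AcquisitionTimer {
      public static void main(String[] args) throws Exception {
        HttpSolrServer server =
            new HttpSolrServer("http://localhost:8983/solr/collection1");
        long start = System.currentTimeMillis();
        int count = 0;
        for (SolrInputDocument doc : fetchFromSources()) {
          // server.add(doc);   // commented out: time acquisition alone
          count++;
        }
        // server.commit();     // also disabled for the test
        System.out.println(count + " docs acquired in "
            + (System.currentTimeMillis() - start) + " ms");
        server.shutdown();
      }

      // Hypothetical stand-in: replace with the code that gathers rows
      // from the db and other sources and maps them to SolrInputDocuments.
      static List<SolrInputDocument> fetchFromSources() {
        return Collections.emptyList();
      }
    }

If the run takes nearly as long with the adds disabled, the time is going
into acquisition, not Solr.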


Re: Indexing huge data

Posted by Jack Krupansky <ja...@basetechnology.com>.
Make sure you're not doing a commit on each individual document add. 
Committing every few minutes, or every few hundred or few thousand 
documents, is sufficient. You can set up auto commit in solrconfig.xml.

-- Jack Krupansky
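
For reference, an auto-commit stanza of the kind Jack describes might
look like this in solrconfig.xml; the thresholds are illustrative, not
recommendations:

    <!-- commit automatically after 10,000 adds or 60 seconds,
         whichever comes first -->
    <autoCommit>
      <maxDocs>10000</maxDocs>
      <maxTime>60000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>

With openSearcher set to false, the automatic commits only make the data
durable; documents become searchable on the next explicit commit or soft
commit.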



Re: Indexing huge data

Posted by Otis Gospodnetic <ot...@gmail.com>.
Hi,

Each doc is 100K?  That's on the big side, yes, and the server seems on the
small side, yes.  Hence the "speed". :)

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/



Re: Indexing huge data

Posted by Rallavagu <ra...@gmail.com>.
Otis,

Good points. I guess you are suggesting that it depends on the 
resources. The documents are 100k each, and the pre-processing server is 
a 2-CPU VM running with 4G RAM. So that could be a relatively "small" 
machine for processing such an amount of data?



Re: Indexing huge data

Posted by Otis Gospodnetic <ot...@gmail.com>.
Hi,

It depends.  Are docs huge or small? Server single core or 32 core?  Heap
big or small?  etc. etc.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/



Re: Indexing huge data

Posted by Rallavagu <ra...@gmail.com>.
It seems the latency is introduced by collecting the data from different 
sources and putting it together, followed by the actual Solr indexing; I 
would say all these activities are contributing equally. So, is it 
normal to expect indexing to run for a long time? Wondering what to 
expect in such cases. Thanks.


Re: Indexing huge data

Posted by Otis Gospodnetic <ot...@gmail.com>.
Hi,

6M is really not huge these days.  6B is big, though also still not huge
any more.  What seems to be the bottleneck?  Solr or DB or network or
something else?

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

