Posted to solr-user@lucene.apache.org by Sphene Software <sp...@gmail.com> on 2012/03/04 11:31:39 UTC

Using DIH to import 10 million records

Folks,

I am planning to use DIH for an index of 10 million records.

I would like to know the following:
- Can DIH scale to an index of this size?
- If DIH is a bottleneck, what is the specific issue and how can it be
addressed?

I also read about SolrNet.
Any experience using it, and its advantages over DIH, would be welcome.

regards
/Sonali.

Re: Using DIH to import 10 million records

Posted by Mikhail Khludnev <mk...@griddynamics.com>.

On Mon, Mar 5, 2012 at 5:56 AM, Lance Norskog <go...@gmail.com> wrote:

> You can run the DIH with multiple threads feeding from the same query.
>
FWIW,
https://issues.apache.org/jira/browse/SOLR-3011


> Depends also on the size of the document: large documents may index
> faster if they have their own threads. This may then interact with the
> new NRT multi-commit code.
>



-- 
Sincerely yours
Mikhail Khludnev
Lucid Certified
Apache Lucene/Solr Developer
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: Using DIH to import 10 million records

Posted by Lance Norskog <go...@gmail.com>.

You can run the DIH with multiple threads feeding from the same query.
Depends also on the size of the document: large documents may index
faster if they have their own threads. This may then interact with the
new NRT multi-commit code.
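
As a rough illustration of the pattern described above (one query feeding several indexing threads), here is a minimal sketch. It is not DIH's actual implementation; `fake_query` and `index_row` are hypothetical stand-ins for the database cursor and the call that sends a document to Solr, and there is no error handling.

```python
import queue
import threading

_DONE = object()  # sentinel telling a worker to stop


def fake_query(n):
    """Hypothetical stand-in for a database cursor yielding rows."""
    for i in range(n):
        yield {"id": i, "title": f"doc {i}"}


def run_import(rows, index_row, num_workers=4, queue_size=1000):
    """Feed rows from one iterator to several indexing threads.

    Returns the number of rows handed to index_row.
    """
    # Bounded queue, so the single reader cannot run far ahead of the workers.
    work = queue.Queue(maxsize=queue_size)
    count_lock = threading.Lock()
    indexed = 0

    def worker():
        nonlocal indexed
        while True:
            row = work.get()
            if row is _DONE:
                break
            index_row(row)
            with count_lock:
                indexed += 1

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for row in rows:      # single reader: the query is consumed exactly once
        work.put(row)
    for _ in threads:     # one sentinel per worker
        work.put(_DONE)
    for t in threads:
        t.join()
    return indexed
```

Whether extra threads help depends on where the time goes: if the database query is the slow part, more indexing workers will mostly sit idle.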




-- 
Lance Norskog
goksron@gmail.com

Re: Using DIH to import 10 million records

Posted by Sonali Sambhus <so...@gmail.com>.

Thanks for the info, Shawn!


Re: Using DIH to import 10 million records

Posted by Shawn Heisey <so...@elyograg.org>.

On 3/4/2012 3:31 AM, Sphene Software wrote:
> Folks,
>
> I am planning to use DIH for an index of 10 million records.
>
> I would like to know the following:
> - Can DIH scale to an index of this size?
> - If DIH is a bottleneck, what is the specific issue and how can it be
> addressed?

My entire index is about 67 million documents.  There are a total of
seven shards; six of them have over 11 million documents each.  I can do
a full dataimport (from MySQL) of those six shards simultaneously in
less than three hours.  The seventh shard has fewer than 500,000 documents
and builds after the others during a full rebuild.  We rarely have to do
a full rebuild; it's mostly at schema change time.
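
For a rough sense of what these figures imply for a 10 million record index, here is a back-of-the-envelope estimate. It uses the bounds stated above (11 million documents per shard, three hours) and assumes a single DIH instance would sustain the same per-shard rate, which in practice depends heavily on the database, document size, and hardware.

```python
def docs_per_second(docs, hours):
    """Average indexing rate for a batch of documents."""
    return docs / (hours * 3600)


def hours_needed(docs, rate_per_second):
    """Time to index `docs` documents at a steady rate."""
    return docs / rate_per_second / 3600


# Figures from the paragraph above: each large shard holds over 11 million
# documents and a full import finishes in under three hours.
per_shard_rate = docs_per_second(11_000_000, 3)   # roughly 1,000 docs/s per shard
combined_rate = 6 * per_shard_rate                # six shards importing in parallel

# Assumption: one DIH instance importing 10 million records at that per-shard rate.
estimate = hours_needed(10_000_000, per_shard_rate)

print(f"per shard: {per_shard_rate:.0f} docs/s")
print(f"combined:  {combined_rate:.0f} docs/s")
print(f"10M docs:  {estimate:.1f} hours")
```

Under those assumptions, a 10 million record import would land a little under three hours.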

I use SolrJ for updates.  My experience with that so far suggests that
doing the full import with my SolrJ code would take significantly longer
than three hours.

Thanks,
Shawn