Posted to solr-user@lucene.apache.org by Sphene Software <sp...@gmail.com> on 2012/03/04 11:31:39 UTC
Using DIH to import 10 million records
Folks,
I am planning to use DIH to build an index of 10 million records.
I would like to know the following:
- Can DIH scale to an index of this size?
- If DIH is a bottleneck, what is the specific issue and how can it be
addressed?
I have also read about SolrNet.
Any experience with it, and its advantages over DIH, would be welcome.
regards
/Sonali.
Re: Using DIH to import 10 million records
Posted by Mikhail Khludnev <mk...@griddynamics.com>.
On Mon, Mar 5, 2012 at 5:56 AM, Lance Norskog <go...@gmail.com> wrote:
> You can run the DIH with multiple threads feeding from the same query.
>
FWIW,
https://issues.apache.org/jira/browse/SOLR-3011
> Depends also on the size of the document: large documents may index
> faster if they have their own threads. This may then interact with the
> new NRT multi-commit code.
>
> On Sun, Mar 4, 2012 at 5:19 PM, Shawn Heisey <so...@elyograg.org> wrote:
> > On 3/4/2012 3:31 AM, Sphene Software wrote:
> >>
> >> Folks,
> >>
> >> I am planning to use DIH for an index of size 10 million records.
> >>
> >> I would like to know the following;
> >> - Can DIH scale for this size of an indexes
> >> - If DIH is a bottleneck, what is the specific issue and how it can be
> >> addressed
> >
> >
> > My entire index is about 67 million documents. There are a total of seven
> > shards, six of them have over 11 million documents each. I can do a full
> > dataimport (from MySQL) of those six shards simultaneously in less than
> > three hours. The seventh shard is less than 500000 documents and builds
> > after the others during a full rebuild. It is rare that we have to do a
> > full rebuild, it's mostly at schema change time.
> >
> > I use SolrJ for updates, my experience with that so far suggests that doing
> > the full import with my SolrJ code would take significantly longer than
> > three hours.
> >
> > Thanks,
> > Shawn
> >
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>
--
Sincerely yours
Mikhail Khludnev
Lucid Certified
Apache Lucene/Solr Developer
Grid Dynamics
<http://www.griddynamics.com>
<mk...@griddynamics.com>
Re: Using DIH to import 10 million records
Posted by Lance Norskog <go...@gmail.com>.
You can run the DIH with multiple threads feeding from the same query.
It also depends on the size of the documents: large documents may index
faster if they have their own threads. This may then interact with the
new NRT multi-commit code.
On Sun, Mar 4, 2012 at 5:19 PM, Shawn Heisey <so...@elyograg.org> wrote:
> On 3/4/2012 3:31 AM, Sphene Software wrote:
>>
>> Folks,
>>
>> I am planning to use DIH for an index of size 10 million records.
>>
>> I would like to know the following;
>> - Can DIH scale for this size of an indexes
>> - If DIH is a bottleneck, what is the specific issue and how it can be
>> addressed
>
>
> My entire index is about 67 million documents. There are a total of seven
> shards, six of them have over 11 million documents each. I can do a full
> dataimport (from MySQL) of those six shards simultaneously in less than
> three hours. The seventh shard is less than 500000 documents and builds
> after the others during a full rebuild. It is rare that we have to do a
> full rebuild, it's mostly at schema change time.
>
> I use SolrJ for updates, my experience with that so far suggests that doing
> the full import with my SolrJ code would take significantly longer than
> three hours.
>
> Thanks,
> Shawn
>
--
Lance Norskog
goksron@gmail.com
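The idea of several threads feeding from one query can be pictured with a generic producer/consumer sketch. This is not DIH's internal code, only an illustration of the pattern: one reader walks the result set once, and a pool of workers turns rows into documents and ships them in batches. The names `fan_out`, `build_doc` and `send_batch` are hypothetical, and `send_batch` is assumed to be safe to call from several threads at once.

```python
import queue
import threading


def fan_out(rows, build_doc, send_batch, num_workers=4, batch_size=100):
    """Read `rows` once and spread document building over worker threads.

    rows       -- any iterable of rows, e.g. a database cursor (assumption)
    build_doc  -- hypothetical callable turning one row into one document
    send_batch -- hypothetical thread-safe callable that ships a list of docs
    """
    work = queue.Queue(maxsize=1000)  # bounded, so the reader cannot race ahead
    done = object()                   # sentinel telling a worker to stop

    def worker():
        batch = []
        while True:
            row = work.get()
            if row is done:
                break
            batch.append(build_doc(row))
            if len(batch) >= batch_size:
                send_batch(batch)
                batch = []
        if batch:                     # flush the final partial batch
            send_batch(batch)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for row in rows:                  # the single reader: one pass over the query
        work.put(row)
    for _ in threads:                 # one sentinel per worker
        work.put(done)
    for t in threads:
        t.join()
```

Whether this helps depends on where the time goes. If building each document is expensive (large documents, heavy transforms), more workers can pay off; if the database query itself is the slow part, extra workers mostly sit idle.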
Re: Using DIH to import 10 million records
Posted by Sonali Sambhus <so...@gmail.com>.
Thanks for the info, Shawn!
On Mon, Mar 5, 2012 at 6:49 AM, Shawn Heisey <so...@elyograg.org> wrote:
> On 3/4/2012 3:31 AM, Sphene Software wrote:
>
>> Folks,
>>
>> I am planning to use DIH for an index of size 10 million records.
>>
>> I would like to know the following;
>> - Can DIH scale for this size of an indexes
>> - If DIH is a bottleneck, what is the specific issue and how it can be
>> addressed
>>
>
> My entire index is about 67 million documents. There are a total of seven
> shards, six of them have over 11 million documents each. I can do a full
> dataimport (from MySQL) of those six shards simultaneously in less than
> three hours. The seventh shard is less than 500000 documents and builds
> after the others during a full rebuild. It is rare that we have to do a
> full rebuild, it's mostly at schema change time.
>
> I use SolrJ for updates, my experience with that so far suggests that
> doing the full import with my SolrJ code would take significantly longer
> than three hours.
>
> Thanks,
> Shawn
>
>
Re: Using DIH to import 10 million records
Posted by Shawn Heisey <so...@elyograg.org>.
On 3/4/2012 3:31 AM, Sphene Software wrote:
> Folks,
>
> I am planning to use DIH for an index of size 10 million records.
>
> I would like to know the following;
> - Can DIH scale for this size of an indexes
> - If DIH is a bottleneck, what is the specific issue and how it can be
> addressed
My entire index is about 67 million documents. There are a total of
seven shards; six of them have over 11 million documents each. I can do
a full dataimport (from MySQL) of those six shards simultaneously in
less than three hours. The seventh shard has fewer than 500,000 documents
and builds after the others during a full rebuild. We rarely have to do
a full rebuild; it's mostly at schema change time.
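For a rough sense of scale, the figures above can be turned into a rate. This is back-of-the-envelope arithmetic only: it takes "over 11 million documents" and "less than three hours" at their stated bounds, so the result is a lower bound for this particular setup and says nothing about other hardware, document sizes, or databases.

```python
def min_docs_per_second(docs, hours):
    """Lower bound on the import rate: at least `docs` in under `hours`."""
    return docs / (hours * 3600)


# Figures from the message: six shards, 11 million+ docs each, under 3 hours.
per_shard = min_docs_per_second(11_000_000, 3)   # about 1019 docs/s per shard
aggregate = 6 * per_shard                        # about 6111 docs/s across six shards

# Time for 10 million records on ONE import running at the per-shard rate.
hours_for_10m = 10_000_000 / per_shard / 3600    # about 2.73 hours

print(f"per shard: {per_shard:.0f} docs/s")
print(f"aggregate: {aggregate:.0f} docs/s")
print(f"10M records at per-shard rate: {hours_for_10m:.2f} h")
```

At that rate, a 10 million record import on a single comparable core would land just under three hours, which is in line with the per-shard experience described here.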
I use SolrJ for updates. My experience with that so far suggests that
doing the full import with my SolrJ code would take significantly longer
than three hours.
Thanks,
Shawn
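A simultaneous import across several shards like the one described above could be kicked off with a small script. The sketch below is not taken from the thread: the base URL, the core names, and the `/dataimport` handler path are assumptions (the path depends on how the handler is registered in solrconfig.xml). It relies on DIH's `command=full-import` request, which starts the import in the background and returns right away, so a plain loop is enough to get all shards running at about the same time.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def full_import_url(base_url, core, handler="/dataimport", clean=True):
    """Build the DIH full-import URL for one core.

    base_url, core and handler are placeholders for illustration; adjust
    them to match the actual deployment.
    """
    params = urlencode({"command": "full-import",
                        "clean": str(clean).lower()})
    return f"{base_url.rstrip('/')}/{core}{handler}?{params}"


def start_full_imports(base_url, cores, timeout=30):
    """Send full-import to every core and return {core: HTTP status}.

    Each request only starts the import; progress has to be polled
    separately with command=status on the same handler.
    """
    statuses = {}
    for core in cores:
        with urlopen(full_import_url(base_url, core), timeout=timeout) as resp:
            statuses[core] = resp.status
    return statuses


if __name__ == "__main__":
    # Hypothetical host and core names, for illustration only.
    shards = [f"shard{i}" for i in range(1, 7)]
    print(start_full_imports("http://localhost:8983/solr", shards))
```

Note that `clean=true` wipes the existing index of each core before importing, which matches a full rebuild but is not what you want for a partial reload.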