Posted to solr-user@lucene.apache.org by Flavio Pompermaier <po...@okkam.it> on 2013/11/22 10:44:22 UTC

Solrcloud: external fields and frequent commits

Hi all,
we're migrating from Solr 3.x to Solr 4.x to use SolrCloud and I have two
big doubts:

1) External fields. When I compute such a file, do I have to copy it into the
data directory of every shard? The external field boosts the results of
queries to a specific collection, so to me it doesn't make sense to put it in
every shard's data dir; it should be something related to the collection
itself.
Am I wrong or missing something? Is there a simple way to upload the
popularity file (for the external field) to all shards at once?

2) My index requires frequent commits (i.e. sometimes up to 100/s). How
should I manage this? Do I have to use soft commits? Is there a simple
configuration/code snippet for using them? Is it true that external fields
affect performance on commit?

Best,
Flavio

Re: Solrcloud: external fields and frequent commits

Posted by Erick Erickson <er...@gmail.com>.
Long blog post on commits and the state of updates here:
http://searchhub.org/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

HDFS is perfectly fine with Solr; there's even an HdfsDirectoryFactory for
your index. It has its own performance characteristics/tuning parameters,
so there'll be something of a learning curve.

Best
Erick


On Sat, Nov 23, 2013 at 4:14 AM, Flavio Pompermaier <po...@okkam.it> wrote:

> Thanks again for such a detailed description.
> In our use case we're going to store the shards' data on HDFS so they all have
> access to a shared location; it would be great to put such a file in one
> place in that case :)
> Do you think that using HDFS as storage is bad for performance?
> Last question: if I soft commit and have to shut down my Tomcat, will the data
> be committed to disk, or do I have to manually force a commit before shutting
> down?
>
> Best,
> Flavio

Re: Solrcloud: external fields and frequent commits

Posted by Flavio Pompermaier <po...@okkam.it>.
Thanks again for such a detailed description.
In our use case we're going to store the shards' data on HDFS so they all have
access to a shared location; it would be great to put such a file in one
place in that case :)
Do you think that using HDFS as storage is bad for performance?
Last question: if I soft commit and have to shut down my Tomcat, will the data
be committed to disk, or do I have to manually force a commit before shutting
down?

Best,
Flavio

On Sat, Nov 23, 2013 at 2:01 AM, Erick Erickson <er...@gmail.com> wrote:

> about <1>. Well, at a high level you're right, of course.
> Having the EFF stuff in a single place seems more elegant. But
> then ugly details crop up. I.e. "one place" implies that you'd have
> to fetch them over the network, potentially a very expensive
> operation every time there was a commit. Is this really a good
> tradeoff? With high network latency, this could be a performance
> killer. But I suspect that the real reason is that nobody has found
> a compelling use-case for this kind of thing. Until and unless
> someone does, and is willing to make a patch, it'll be theory :).
>
> bq: modifications also sent to replicas with this kind of commits
>
> Brief review:
>
> Update process:
> 1> Update goes to a node.
> 2> Node forwards to all leaders.
> 3> Leaders forward to replicas.
> 4> Replicas respond to their leader.
> 5> Leader responds to originating node.
> 6> Originating node responds to caller.
>
> At this point all the replicas for your entire cluster have the
> update. This is entirely independent of commits. Whenever a
> commit is issued the documents currently pending on a node
> are committed and made visible to a searcher.
>
> If one is relying on solrconfig settings, then the commit happens
> a little bit out of sync. Let's say that the commit (hard with
> openSearcher=true or soft) is set to 60 seconds. Each node may
> have a different commit time, depending upon when it was started.
> So there may be a slight difference in when documents are visible.
> You'll probably never notice.
>
> If you issue commits from a client, then the commit is propagated
> to all nodes in the cluster.
>
> HTH,
> Erick

Re: Solrcloud: external fields and frequent commits

Posted by Erick Erickson <er...@gmail.com>.
about <1>. Well, at a high level you're right, of course.
Having the EFF stuff in a single place seems more elegant. But
then ugly details crop up. I.e. "one place" implies that you'd have
to fetch them over the network, potentially a very expensive
operation every time there was a commit. Is this really a good
tradeoff? With high network latency, this could be a performance
killer. But I suspect that the real reason is that nobody has found
a compelling use-case for this kind of thing. Until and unless
someone does, and is willing to make a patch, it'll be theory :).

bq: modifications also sent to replicas with this kind of commits

Brief review:

Update process:
1> Update goes to a node.
2> Node forwards to all leaders.
3> Leaders forward to replicas.
4> Replicas respond to their leader.
5> Leader responds to originating node.
6> Originating node responds to caller.

At this point all the replicas for your entire cluster have the
update. This is entirely independent of commits. Whenever a
commit is issued the documents currently pending on a node
are committed and made visible to a searcher.
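
The point that updates reach every replica before (and independently of) any commit can be illustrated with a toy model. This is only a sketch, not Solr code: the class names, the in-memory lists, and the byte-sum routing are all made up for illustration.

```python
class Replica:
    """One copy of a shard: documents arrive first, become searchable on commit."""

    def __init__(self, name):
        self.name = name
        self.pending = []  # received, but not yet visible to searches
        self.visible = []  # visible to searches after a commit

    def receive(self, doc):
        self.pending.append(doc)
        return True  # acknowledgement sent back up the chain

    def commit(self):
        self.visible.extend(self.pending)
        self.pending.clear()


class Shard:
    def __init__(self, leader, replicas):
        self.leader = leader
        self.replicas = replicas

    def update(self, doc):
        # Step 3: the leader takes the doc and forwards it to its replicas.
        self.leader.receive(doc)
        # Step 4: replicas respond to their leader.
        return all([replica.receive(doc) for replica in self.replicas])

    def nodes(self):
        return [self.leader] + self.replicas


def send_update(shards, doc):
    # Steps 1-2: the receiving node hands the doc to the owning shard's leader.
    # Toy routing (an assumption): byte sum of the id modulo the shard count.
    shard = shards[sum(doc["id"].encode("utf-8")) % len(shards)]
    # Steps 5-6: the acknowledgement travels back to the caller.
    return shard.update(doc)


def commit_all(shards):
    # A commit only changes visibility; the docs were already on every copy.
    for shard in shards:
        for node in shard.nodes():
            node.commit()
```

After send_update returns, the document sits in `pending` on the leader and on every replica of its shard; only commit_all moves it to `visible`.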

If one is relying on solrconfig settings, then the commit happens
a little bit out of sync. Let's say that the commit (hard with
openSearcher=true or soft) is set to 60 seconds. Each node may
have a different commit time, depending upon when it was started.
So there may be a slight difference in when documents are visible.
You'll probably never notice.

If you issue commits from a client, then the commit is propagated
to all nodes in the cluster.
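
For the client-issued case, a minimal sketch of building the commit request URL. The base URL and collection name are placeholders, and the helper name is made up; `commit` and `softCommit` are the standard update request parameters.

```python
from urllib.parse import urlencode


def commit_url(base_url, collection, soft=True):
    """Build the URL for an explicit commit issued by a client.

    soft=True asks for a soft commit (visibility only);
    soft=False asks for a regular hard commit.
    """
    params = {"softCommit": "true"} if soft else {"commit": "true"}
    return f"{base_url.rstrip('/')}/{collection}/update?{urlencode(params)}"
```

Requesting commit_url("http://localhost:8983/solr", "collection1") against any node would be enough, since the commit is propagated from there.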

HTH,
Erick


On Fri, Nov 22, 2013 at 7:23 PM, Flavio Pompermaier <po...@okkam.it> wrote:

> On Fri, Nov 22, 2013 at 2:21 PM, Erick Erickson <erickerickson@gmail.com> wrote:
>
> > 1> I'm not quite sure I understand. External File Fields are keyed
> > by the unique id of the doc. So every shard _must_ have the
> > eff available for at least the documents in that shard. At first glance
> > this doesn't look simple. Perhaps a bit more explanation of what
> > you're using EFF for?
> >
> Thanks Erick for the reply. I use EFF for boosting results by popularity.
> So I was right, I should put the popularity file in every shard's data dir, right? But
> why not keep that file in just one place (obviously the file should be
> reachable by all SolrCloud nodes) and allow external fields to live
> outside the data dir?
>
> >
> > 2> Let's be sure we're talking about the same thing here. In Solr,
> > a "commit" is the command that makes documents visible, often
> > controlled by the autoCommit and autoSoftCommit settings in
> > solrconfig.xml. You will not be able to issue 100 commits/second.
> >
> > If you're using "commit" to mean adding a document to the index,
> > then 100/s should be no problem. I regularly see many times that
> > ingestion rate. The documents won't be visible to search until
> > you do a commit however.
> >
> Yeah, now it's clearer. One more question: soft committing is not a
> problem for my client, but are the modifications also sent to replicas
> with this kind of commit?

Re: Solrcloud: external fields and frequent commits

Posted by Flavio Pompermaier <po...@okkam.it>.
On Fri, Nov 22, 2013 at 2:21 PM, Erick Erickson <er...@gmail.com> wrote:

> 1> I'm not quite sure I understand. External File Fields are keyed
> by the unique id of the doc. So every shard _must_ have the
> eff available for at least the documents in that shard. At first glance
> this doesn't look simple. Perhaps a bit more explanation of what
> you're using EFF for?
>
Thanks Erick for the reply. I use EFF for boosting results by popularity.
So I was right, I should put the popularity file in every shard's data dir, right? But
why not keep that file in just one place (obviously the file should be
reachable by all SolrCloud nodes) and allow external fields to live
outside the data dir?

>
> 2> Let's be sure we're talking about the same thing here. In Solr,
> a "commit" is the command that makes documents visible, often
> controlled by the autoCommit and autoSoftCommit settings in
> solrconfig.xml. You will not be able to issue 100 commits/second.
>
> If you're using "commit" to mean adding a document to the index,
> then 100/s should be no problem. I regularly see many times that
> ingestion rate. The documents won't be visible to search until
> you do a commit however.
>
Yeah, now it's clearer. One more question: soft committing is not a
problem for my client, but are the modifications also sent to replicas
with this kind of commit?


Re: Solrcloud: external fields and frequent commits

Posted by Erick Erickson <er...@gmail.com>.
1> I'm not quite sure I understand. External File Fields are keyed
by the unique id of the doc. So every shard _must_ have the
eff available for at least the documents in that shard. At first glance
this doesn't look simple. Perhaps a bit more explanation of what
you're using EFF for?
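
For reference, a rough sketch of distributing one external file to several data directories. It assumes the usual `external_<fieldname>` file of `id=value` lines; the field name `popularity` and the directory paths are placeholders, and it assumes every shard's data dir is reachable as a local path (on a real cluster the copy step would more likely be scp/rsync to each host). For simplicity it writes the full file everywhere rather than only the ids each shard holds.

```python
import os
import tempfile


def write_eff_file(scores, data_dirs, field_name="popularity"):
    """Write external_<field_name> into every data dir and return the paths.

    scores maps a document's unique id to its float value.
    """
    lines = "".join(f"{doc_id}={value}\n" for doc_id, value in sorted(scores.items()))
    written = []
    for data_dir in data_dirs:
        target = os.path.join(data_dir, f"external_{field_name}")
        # Write to a temp file in the same dir, then rename, so a reader
        # never sees a half-written file.
        fd, tmp_path = tempfile.mkstemp(dir=data_dir, prefix=".eff-")
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(lines)
        os.replace(tmp_path, target)
        written.append(target)
    return written
```

Something like write_eff_file({"doc1": 1.5}, ["/path/to/shard1/data", "/path/to/shard2/data"]) would then be run whenever the popularity values are recomputed.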

2> Let's be sure we're talking about the same thing here. In Solr,
a "commit" is the command that makes documents visible, often
controlled by the autoCommit and autoSoftCommit settings in
solrconfig.xml. You will not be able to issue 100 commits/second.

If you're using "commit" to mean adding a document to the index,
then 100/s should be no problem. I regularly see many times that
ingestion rate. The documents won't be visible to search until
you do a commit however.
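
One way to keep a high add rate without committing per request is to send documents with a `commitWithin` window and let Solr schedule the commit. A minimal sketch follows; the base URL, collection name, helper name, and the 1000 ms default are assumptions for illustration.

```python
import json
from urllib.parse import urlencode


def build_add_request(base_url, collection, docs, commit_within_ms=1000):
    """Return (url, body, headers) for a JSON add with a commitWithin window.

    No explicit commit is sent; Solr is asked to make the documents
    visible within commit_within_ms milliseconds.
    """
    query = urlencode({"commitWithin": commit_within_ms})
    url = f"{base_url.rstrip('/')}/{collection}/update?{query}"
    body = json.dumps(docs).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return url, body, headers
```

The returned triple can be passed to urllib.request.Request(url, data=body, headers=headers); batching many documents per request helps more than shortening the window.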

Best
Erick


On Fri, Nov 22, 2013 at 4:44 AM, Flavio Pompermaier <po...@okkam.it> wrote:

> Hi all,
> we're migrating from Solr 3.x to Solr 4.x to use SolrCloud and I have two
> big doubts:
>
> 1) External fields. When I compute such a file, do I have to copy it into the
> data directory of every shard? The external field boosts the results of
> queries to a specific collection, so to me it doesn't make sense to put it in
> every shard's data dir; it should be something related to the collection
> itself.
> Am I wrong or missing something? Is there a simple way to upload the
> popularity file (for the external field) to all shards at once?
>
> 2) My index requires frequent commits (i.e. sometimes up to 100/s). How
> should I manage this? Do I have to use soft commits? Is there a simple
> configuration/code snippet for using them? Is it true that external fields
> affect performance on commit?
>
> Best,
> Flavio
>