Posted to solr-user@lucene.apache.org by Aristedes Maniatis <ar...@ish.com.au> on 2016/07/22 07:22:33 UTC

loading zookeeper data

Hi everyone

I'm not new to Solr, but I'm upgrading from Solr 4 to 5 and need to deal with the new Zookeeper configuration requirement. It adds a lot of extra complexity to our deployment and I want to check that we are doing it right.


1. We are using Saltstack to push files to deployment servers. That makes it easy to put files anywhere I want, run scripts, etc. If you don't know Salt, it is a lot like Puppet or other configuration management tools. Salt is all python.

2. We use Jenkins to build and test

3. Deployment servers are all FreeBSD.


Now, in the old days, I could just push the right core configuration files to each Solr instance (we have three cores), make sure one is the master and use cron to ensure the master updates. The other Solr slaves all update nicely. The problem we want to escape is that this configuration causes outages and other random issues each time the Solr master does a full reload. It shouldn't, but it does, and hopefully the new SolrCloud setup will be better.


Now, I can still deploy Solr and Zookeeper using Salt. All that works well and is easy. But how do I get the configuration files from our development/test environment (built and tested with Jenkins) into production? Obviously I want those config files in version control. And maybe Jenkins can zip up the 8 configuration files (per core) and push them to our artifact repository.

But then what? In the production cluster it seems I then need to

1. Grab the latest configuration bundle for each core and unpack them
2. Launch Java
3. Execute the Solr jars (from the production server since it must be the right version)
- with org.apache.solr.cloud.ZkCLI
- and some parameters pointing to the production Zookeeper cluster
- pointing also to the unpacked config files
4. Parse the output to understand if any error happened
5. Wait for Solr to pick up the new configuration and do any final production checks
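In shell terms, I imagine steps 1 to 4 come down to something roughly like this (untested; the artifact URL, paths and Zookeeper addresses are placeholders I made up, and the classpath is just what zkcli.sh in the Solr distribution appears to use):

# 1. grab and unpack the latest config bundle for one core (artifact URL is hypothetical)
curl -sfO https://artifacts.example.com/solr-config/mycore.zip
unzip -o mycore.zip -d /tmp/mycore-conf

# 2+3. run ZkCLI with the jars of the installed Solr, pointing at the production Zookeeper
SOLR_DIR=/usr/local/solr
java -classpath "$SOLR_DIR/server/solr-webapp/webapp/WEB-INF/lib/*:$SOLR_DIR/server/lib/ext/*" \
  org.apache.solr.cloud.ZkCLI -cmd upconfig \
  -zkhost zk1:2181,zk2:2181,zk3:2181 \
  -confdir /tmp/mycore-conf -confname mycore

# 4. as far as I can tell, a non-zero exit code is the only error signal
echo "upconfig exit status: $?"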


Am I missing some really simple step, or is this what we must now do?

I'm thinking that gradle might help with 2&3 above since then at least it can launch the right version of Java, download the right Solr version and execute against that. And maybe that can run from Jenkins as a "release" step.

Is that a good approach?

Cheers
Ari




-- 
-------------------------->
Aristedes Maniatis
CEO, ish
https://www.ish.com.au
GPG fingerprint CBFB 84B4 738D 4E87 5E5C  5EFA EF6A 7D2E 3E49 102A


Re: loading zookeeper data

Posted by Erick Erickson <er...@gmail.com>.
A Collection is simply the "SolrCloud" way of thinking about a logical
index that incorporates shards, replication factors, changing topology
of where the replicas live, and the like. In your case it's synonymous
with your core (master and slaves). Since there's no master or slave
role in SolrCloud, it's a little confusing (leader and replica/follower
roles can change in SolrCloud).

bq: Anyhow, the bottom line appears to be that 130Mb of jars are
needed to deploy my configuration to Zookeeper
bq:  I don't want production machines to require VCS checkout credentials

Huh? I think you're confusing deployment tools with how Zookeeper is
used in SolrCloud. Zookeeper has two major functions:

1> store the conf directory (schema.xml, solrconfig.xml and the like),
plus occasionally custom jars and make these automatically available
to all Solr nodes in the cluster. It does NOT store the whole Solr
deployment.

2> be aware of all Solr nodes in the system and notify all the other
Solr nodes when instances go up and down.

Zookeeper was never intended to hold all of Solr and take the place of
puppet or chef. It will not automatically provision a new bare-metal
node with a working Solr etc.

Especially the VCS comment. _Some_ node somewhere has to be VCS
conversant. But once that machine pushes config files to Zookeeper,
they're automagically available to all the Solr nodes in the
collection; the Solr nodes need to know nothing about your VCS system.
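
If you ever want to see what actually landed in Zookeeper, the
zkcli.sh script that ships under server/scripts/cloud-scripts can dump
the tree or pull a configset back down, e.g. (the ensemble address and
config name here are just placeholders):

# print the znode tree SolrCloud keeps in Zookeeper
server/scripts/cloud-scripts/zkcli.sh -zkhost zk1:2181 -cmd list

# copy a named configset back out of Zookeeper to local disk
server/scripts/cloud-scripts/zkcli.sh -zkhost zk1:2181 \
  -cmd downconfig -confname myconfig -confdir /tmp/myconfig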

Anyway, if you're happy with your current setup go ahead and continue
to use it. Just be clear what Zookeeper is intended to solve and what
it isn't. It's perfectly compatible with Puppet, Chef and the like.

Best,
Erick

On Sun, Jul 24, 2016 at 4:46 PM, Aristedes Maniatis <ar...@ish.com.au> wrote:
> Thanks so much for your reply. That's clarified a few things for me.
>
> Erick Erickson wrote:
>
>> Where SolrCloud becomes compelling is when you _do_ need to
>> shard, and deal with HA/DR.
>
> I'm not using shards since the indices are small enough; however, I use master/slave with 6 nodes for two reasons: having a single master poll the database means less load on the database than having every node poll separately. And of course we still want HA and performance, so we balance load with haproxy.
>
>> Then the added step of maintaining things
>> in Zookeeper is a small price to pay for _not_ having to be sure that
>> all the configs on all the servers are all the same. Imagine a cluster
>> with several hundred replicas out there. Being absolutely sure that
>> all of them have the same configs, have been restarted and the like
>> becomes daunting. So having to do an "upconfig" is a good tradeoff
>> IMO.
>
> Saltstack (and ansible, puppet, chef, etc) all make distributed configuration management trivial. So it isn't solving any problem for me, but I understand how people without a configuration management tool would like it.
>
>
>
>> The bin/solr script has a "zk -upconfig" parameter that'll take care
>> of pushing the configs up. Since you already have the configs in VCS,
>> your process is just to pull them from VCS to "somewhere" then
>> bin/solr zk -upconfig -z zookeeper_address -n configset_name -d
>> directory_you_downloaded_to_from_VCS.
>
> Yep, that confirms my guess at how people are expected to use this. It's pretty cumbersome for me because:
>
> 1. I don't want production machines to require VCS checkout credentials
> 2. I don't want to have to install Solr (and keep the version in sync with production) on our build or configuration management machines
> 3. I still need files on disk in order to version control them and tie that into our QA processes. Now I need another step to take those files and inject them into the Zookeeper black box, ensuring they are always up to date.
>
> I do understand that people who managed hundreds of nodes completely by hand would find it useful. But I am surprised that there were any of those people.
>
> I was hoping that Zookeeper had some hidden features that would make my life easier.
>
>
>> Thereafter you simply refer to them by name when you create a
>> collection and the rest of it is automatic. Every time a core reloads
>> it gets the new configs.
>>
>> If you're trying to manipulate _cores_, that may be where you're going
>> wrong. Think of them as _collections_. What's not clear from your
>> problem statement is whether these cores on the various machines are
>> part of the same collection or not.
>
> I was unaware of the concept of collection until now. We use one core for each type of entity we are indexing and that works well.
>
>> Do you have multiple shards in one
>> logical index?
>
> No shards. Every Solr node contains the complete set of all data.
>
>>  Or do you have multiple collections that have
>> masters/slaves (in which case the master and all the slaves that point
>> to it will be a "collection")?
>
> I'm not understanding from https://wiki.apache.org/solr/SolrTerminology what a Collection is and what makes it different from the old concept of a Core.
>
>
>> Do all of the cores you have use the
>> same configurations? Or is each set of master/slaves using a different
>> configuration?
>
> Each core has a different configuration (which is what makes it a different core... different source data, different synonyms, etc). But every node is identical and kept that way with saltstack.
>
>
>
> Anyhow, the bottom line appears to be that 130Mb of jars are needed to deploy my configuration to Zookeeper. In that case, I think I'll do it by building a new deployment project, with a gradle task (so I don't need to worry about all those Solr dependencies for zkcli.sh), and a Jenkins job that can be triggered to run the deployment to either staging or production. A few new holes in my firewall and I'll be done.
>
> Unfortunate new points of failure and complexity, but I can't think of anything simpler.
>
>
> Thanks
>
> Ari
>
>
>
> --
> -------------------------->
> Aristedes Maniatis
> CEO, ish
> https://www.ish.com.au
> GPG fingerprint CBFB 84B4 738D 4E87 5E5C  5EFA EF6A 7D2E 3E49 102A
>

Re: loading zookeeper data

Posted by Aristedes Maniatis <ar...@ish.com.au>.
Thanks so much for your reply. That's clarified a few things for me.

Erick Erickson wrote:

> Where SolrCloud becomes compelling is when you _do_ need to
> shard, and deal with HA/DR. 

I'm not using shards since the indices are small enough; however, I use master/slave with 6 nodes for two reasons: having a single master poll the database means less load on the database than having every node poll separately. And of course we still want HA and performance, so we balance load with haproxy.

> Then the added step of maintaining things
> in Zookeeper is a small price to pay for _not_ having to be sure that
> all the configs on all the servers are all the same. Imagine a cluster
> with several hundred replicas out there. Being absolutely sure that
> all of them have the same configs, have been restarted and the like
> becomes daunting. So having to do an "upconfig" is a good tradeoff
> IMO.

Saltstack (and ansible, puppet, chef, etc) all make distributed configuration management trivial. So it isn't solving any problem for me, but I understand how people without a configuration management tool would like it.



> The bin/solr script has a "zk -upconfig" parameter that'll take care
> of pushing the configs up. Since you already have the configs in VCS,
> your process is just to pull them from VCS to "somewhere" then
> bin/solr zk -upconfig -z zookeeper_address -n configset_name -d
> directory_you_downloaded_to_from_VCS.

Yep, that confirms my guess at how people are expected to use this. It's pretty cumbersome for me because:

1. I don't want production machines to require VCS checkout credentials
2. I don't want to have to install Solr (and keep the version in sync with production) on our build or configuration management machines
3. I still need files on disk in order to version control them and tie that into our QA processes. Now I need another step to take those files and inject them into the Zookeeper black box, ensuring they are always up to date.

I do understand that people who managed hundreds of nodes completely by hand would find it useful. But I am surprised that there were any of those people.

I was hoping that Zookeeper had some hidden features that would make my life easier.


> Thereafter you simply refer to them by name when you create a
> collection and the rest of it is automatic. Every time a core reloads
> it gets the new configs.
> 
> If you're trying to manipulate _cores_, that may be where you're going
> wrong. Think of them as _collections_. What's not clear from your
> problem statement is whether these cores on the various machines are
> part of the same collection or not.

I was unaware of the concept of collection until now. We use one core for each type of entity we are indexing and that works well.

> Do you have multiple shards in one
> logical index?

No shards. Every Solr node contains the complete set of all data.

>  Or do you have multiple collections that have
> masters/slaves (in which case the master and all the slaves that point
> to it will be a "collection")?

I'm not understanding from https://wiki.apache.org/solr/SolrTerminology what a Collection is and what makes it different from the old concept of a Core.


> Do all of the cores you have use the
> same configurations? Or is each set of master/slaves using a different
> configuration?

Each core has a different configuration (which is what makes it a different core... different source data, different synonyms, etc). But every node is identical and kept that way with saltstack.



Anyhow, the bottom line appears to be that 130Mb of jars are needed to deploy my configuration to Zookeeper. In that case, I think I'll do it by building a new deployment project, with a gradle task (so I don't need to worry about all those Solr dependencies for zkcli.sh), and a Jenkins job that can be triggered to run the deployment to either staging or production. A few new holes in my firewall and I'll be done.
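
Concretely I'm picturing a parameterised Jenkins job whose only shell step is something like this (the -PzkHost property is hypothetical; the deployConfig task I sketched earlier hardcodes the zkhost, so I'd need to add it):

# TARGET_ZK is a Jenkins job parameter: the staging or production Zookeeper ensemble
./gradlew deployConfig -PzkHost="$TARGET_ZK"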

Unfortunate new points of failure and complexity, but I can't think of anything simpler.


Thanks

Ari



-- 
-------------------------->
Aristedes Maniatis
CEO, ish
https://www.ish.com.au
GPG fingerprint CBFB 84B4 738D 4E87 5E5C  5EFA EF6A 7D2E 3E49 102A


Re: loading zookeeper data

Posted by Erick Erickson <er...@gmail.com>.
bq: Zookeeper seems a step backward.....

For stand-alone Solr, I tend to agree it's a bit awkward. But as Shawn
says, there's no _need_ to run Zookeeper with a more recent Solr.
Running Solr without Zookeeper is perfectly possible; we call that
"stand alone". And, if you have no need for sharding etc., there's no
compelling reason to run SolrCloud. Well, there are some good reasons
having to do with fail-over and the like, but...

Where SolrCloud becomes compelling is when you _do_ need to
shard, and deal with HA/DR. Then the added step of maintaining things
in Zookeeper is a small price to pay for _not_ having to be sure that
all the configs on all the servers are all the same. Imagine a cluster
with several hundred replicas out there. Being absolutely sure that
all of them have the same configs, have been restarted and the like
becomes daunting. So having to do an "upconfig" is a good tradeoff
IMO.

The bin/solr script has a "zk -upconfig" parameter that'll take care
of pushing the configs up. Since you already have the configs in VCS,
your process is just to pull them from VCS to "somewhere" then
bin/solr zk -upconfig -z zookeeper_address -n configset_name -d
directory_you_downloaded_to_from_VCS.

Thereafter you simply refer to them by name when you create a
collection and the rest of it is automatic. Every time a core reloads
it gets the new configs.
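
For example (the collection and configset names below are made up, and
I'm assuming a node listening on localhost:8983), creating a collection
against an uploaded configset and reloading it later are both just
Collections API calls:

# create a collection that uses the configset named "myconfig"
curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=1&replicationFactor=2&collection.configName=myconfig"

# after another upconfig, reload so every replica picks up the change
curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=mycollection"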

If you're trying to manipulate _cores_, that may be where you're going
wrong. Think of them as _collections_. What's not clear from your
problem statement is whether these cores on the various machines are
part of the same collection or not. Do you have multiple shards in one
logical index? Or do you have multiple collections that have
masters/slaves (in which case the master and all the slaves that point
to it will be a "collection")? Do all of the cores you have use the
same configurations? Or is each set of master/slaves using a different
configuration?

Best,
Erick

On Fri, Jul 22, 2016 at 4:41 PM, Aristedes Maniatis <ar...@ish.com.au> wrote:
> On 22/07/2016 5:22pm, Aristedes Maniatis wrote:
>> But then what? In the production cluster it seems I then need to
>>
>> 1. Grab the latest configuration bundle for each core and unpack them
>> 2. Launch Java
>> 3. Execute the Solr jars (from the production server since it must be the right version)
>> - with org.apache.solr.cloud.ZkCLI
>> - and some parameters pointing to the production Zookeeper cluster
>> - pointing also to the unpacked config files
>> 4. Parse the output to understand if any error happened
>> 5. Wait for Solr to pick up the new configuration and do any final production checks
>
> Shawn wrote:
>
>> If you *do* want to run in cloud mode, then you will need to use zkcli to upload config changes to zookeeper and then issue a collection reload with the Collections API. This will find and reload all the cores related to that collection, across the entire cloud. You have the option of using the ZkCLI java class, or the zkcli.sh script that can be found in all 5.x and 6.x installs at server/scripts/cloud-scripts. As of version 5.3, the jars required for zkcli are already unpacked before Solr is started.
>
>
> Thanks Shawn,
>
> I'm trying to understand the common workflow of deploying configuration to Zookeeper. I'm new to that tool, so at this point it appears to be a big black box which can only be populated with data by a specific Java application. Surely others here on this list use configuration management tools and other non-manual workflows.
>
> I've written a little gradle task to wrap up sending data to zookeeper:
>
> task deployConfig {
>         description = 'Upload configuration to production zookeeper cluster.'
>         file('src/main/resources/solr').eachDir { core ->
>             doLast {
>               javaexec {
>                 classpath configurations.zookeeper
>                 main = 'org.apache.solr.cloud.ZkCLI'
>                 args = [
>                         "-confdir", core,
>                         "-zkhost", "solr.host.com:2181",
>                         "-cmd", "upconfig",
>                         "-confname", core.name
>                 ]
>               }
>             }
>         }
> }
>
>
> That does the trick, although I've not yet figured out how to know whether it was successful because it doesn't return anything. And as I outlined above, it is quite cumbersome to automate. Are you saying that everyone who runs SolrCloud runs all these scripts against their production jars by hand?
>
> Zookeeper seems a step backward from files on disk in terms of ease of automation, inspecting for problems and version control, and it adds a new point of failure.
>
> Perhaps because I'm new to it I'm missing a set of tools that make all that much easier. Or for that matter, I'm missing an understanding of what problem Zookeeper solves.
>
> Ari
>
>
> --
> -------------------------->
> Aristedes Maniatis
> CEO, ish
> https://www.ish.com.au
> GPG fingerprint CBFB 84B4 738D 4E87 5E5C  5EFA EF6A 7D2E 3E49 102A
>

Re: loading zookeeper data

Posted by Aristedes Maniatis <ar...@ish.com.au>.
On 22/07/2016 5:22pm, Aristedes Maniatis wrote:
> But then what? In the production cluster it seems I then need to
> 
> 1. Grab the latest configuration bundle for each core and unpack them
> 2. Launch Java
> 3. Execute the Solr jars (from the production server since it must be the right version)
> - with org.apache.solr.cloud.ZkCLI
> - and some parameters pointing to the production Zookeeper cluster
> - pointing also to the unpacked config files
> 4. Parse the output to understand if any error happened
> 5. Wait for Solr to pick up the new configuration and do any final production checks

Shawn wrote:

> If you *do* want to run in cloud mode, then you will need to use zkcli to upload config changes to zookeeper and then issue a collection reload with the Collections API. This will find and reload all the cores related to that collection, across the entire cloud. You have the option of using the ZkCLI java class, or the zkcli.sh script that can be found in all 5.x and 6.x installs at server/scripts/cloud-scripts. As of version 5.3, the jars required for zkcli are already unpacked before Solr is started.


Thanks Shawn,

I'm trying to understand the common workflow of deploying configuration to Zookeeper. I'm new to that tool, so at this point it appears to be a big black box which can only be populated with data by a specific Java application. Surely others here on this list use configuration management tools and other non-manual workflows.

I've written a little gradle task to wrap up sending data to zookeeper:

task deployConfig {
    description = 'Upload configuration to production zookeeper cluster.'
    file('src/main/resources/solr').eachDir { core ->
        doLast {
            javaexec {
                classpath configurations.zookeeper
                main = 'org.apache.solr.cloud.ZkCLI'
                args = [
                        "-confdir", core,
                        "-zkhost", "solr.host.com:2181",
                        "-cmd", "upconfig",
                        "-confname", core.name
                ]
            }
        }
    }
}


That does the trick, although I've not yet figured out how to know whether it was successful because it doesn't return anything. And as I outlined above, it is quite cumbersome to automate. Are you saying that everyone who runs SolrCloud runs all these scripts against their production jars by hand?
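
The only check I've come up with so far (untested, "mycore" is just a placeholder, and if I read the Gradle docs right javaexec should at least fail the build on a non-zero exit) is to pull the configset straight back down with zkcli.sh and diff it against the source tree:

# fetch the configset back out of Zookeeper and compare it with what we uploaded
server/scripts/cloud-scripts/zkcli.sh -zkhost solr.host.com:2181 \
  -cmd downconfig -confname mycore -confdir /tmp/mycore-check
diff -r src/main/resources/solr/mycore /tmp/mycore-check \
  && echo "config in Zookeeper matches the source tree"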

Zookeeper seems a step backward from files on disk in terms of ease of automation, inspecting for problems and version control, and it adds a new point of failure.

Perhaps because I'm new to it I'm missing a set of tools that make all that much easier. Or for that matter, I'm missing an understanding of what problem Zookeeper solves.

Ari


-- 
-------------------------->
Aristedes Maniatis
CEO, ish
https://www.ish.com.au
GPG fingerprint CBFB 84B4 738D 4E87 5E5C  5EFA EF6A 7D2E 3E49 102A


Re: loading zookeeper data

Posted by Shawn Heisey <ap...@elyograg.org>.
On 7/22/2016 1:22 AM, Aristedes Maniatis wrote:
> I'm not new to Solr, but I'm upgrading from Solr 4 to 5 and needing to
> use the new Zookeeper configuration requirement. It is adding a lot of
> extra complexity to our deployment and I want to check that we are
> doing it right. 

Zookeeper is not required for Solr 5, or even for Solr 6.  It's only
required for SolrCloud.  SolrCloud is an operating mode that is not
mandatory.  SolrCloud has been around since Solr 4.0.0.

> The problem we want to escape is that this configuration causes
> outages and other random issues each time the Solr master does a full
> reload. It shouldn't, but it does and hopefully the new SolrCluster
> will be better.

The fact that Solr does a full replication when the master is
restarted/reloaded is a bug.  This bug is fixed in 5.5.2 and 6.1.0.

https://issues.apache.org/jira/browse/SOLR-9036

If you *do* want to run in cloud mode, then you will need to use zkcli
to upload config changes to zookeeper and then issue a collection reload
with the Collections API.  This will find and reload all the cores
related to that collection, across the entire cloud.  You have the
option of using the ZkCLI java class, or the zkcli.sh script that can be
found in all 5.x and 6.x installs at server/scripts/cloud-scripts.  As
of version 5.3, the jars required for zkcli are already unpacked before
Solr is started.
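
As a rough illustration (the hosts, paths, and collection name here are
placeholders), the whole cycle is two commands:

# upload the edited config directory as the configset the collection uses
server/scripts/cloud-scripts/zkcli.sh -zkhost zk1:2181,zk2:2181,zk3:2181 \
  -cmd upconfig -confdir /path/to/mycollection/conf -confname mycollection
# reload the collection so every core picks up the new configs
curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=mycollection"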

Thanks,
Shawn