Posted to solr-user@lucene.apache.org by "Kelly, Frank" <fr...@here.com> on 2015/12/03 16:09:22 UTC

Solr 5: Schema.xml vs. Managed Schema - which is advisable?

Just wondering if folks have any suggestions on using Schema.xml vs. Managed Schema going forward.

Our deployment will be
> 3 Zk, 3 Shards, 3 replicas
> Copies of each collection in 5 AWS regions (EBS-backed EC2 instances)
> Planning at least 1 Billion objects indexed (currently < 100 million)

I'm sure our schema.xml will have changes and fixes and just wondering which approach (schema.xml vs. managed)
will be easier to deploy / maintain?

Cheers!

-Frank


Frank Kelly
Principal Software Engineer
Predictive Analytics Team (SCBE/HAC/CDA)









Re: Solr 5: Schema.xml vs. Managed Schema - which is advisable?

Posted by Erick Erickson <er...@gmail.com>.
Shawn:

Managed schema is _used_ by "schemaless", but it is not the same thing at
all. For "schemaless" (i.e. "data driven"), you need to include the
update processor chains that do the guessing for you and make use of
the managed schema features to add fields to your schema.

You can also use a managed schema _without_ the update processor chains
that enable "schemaless" mode. In this case you do have a static
schema, with the caveat that "static" means that anyone who can post
directly to Solr can change your schema. But if you allow that, someone
issuing managed schema API calls is the least of your worries ;).

That said, I certainly understand wanting to lock down my schema, but
then I'm a control freak.

Best,
Erick



On Thu, Dec 3, 2015 at 7:25 PM, Shawn Heisey <ap...@elyograg.org> wrote:
> In production, you probably want a schema that cannot change.  The
> managed schema that you find in the data-driven configuration will
> automatically add new fields to the schema if unknown fields are
> encountered in your data ... which means that if somehow a typo makes it
> through your indexing process, you may not know about the problem until
> later.
>
> With a static schema, an indexing request that has an error in a field
> name will be rejected and you will receive an error, which is how I
> would want Solr to behave.
>
> The data-driven schema is good for prototyping, but because the field
> definitions that get added are just a guess by Solr, I would manually
> edit the schema before going into production.  Once in production I
> would want to be in complete manual control of the schema.
>
> Thanks,
> Shawn
>

Re: Solr 5: Schema.xml vs. Managed Schema - which is advisable?

Posted by Shawn Heisey <ap...@elyograg.org>.
On 12/3/2015 8:09 AM, Kelly, Frank wrote:
> Just wondering if folks have any suggestions on using Schema.xml vs. Managed Schema going forward.
> 
> Our deployment will be
>> 3 Zk, 3 Shards, 3 replicas
>> Copies of each collection in 5 AWS regions (EBS-backed EC2 instances)
>> Planning at least 1 Billion objects indexed (currently < 100 million)
> 
> I'm sure our schema.xml will have changes and fixes and just wondering which approach (schema.xml vs. managed)
> will be easier to deploy / maintain?

In production, you probably want a schema that cannot change.  The
managed schema that you find in the data-driven configuration will
automatically add new fields to the schema if unknown fields are
encountered in your data ... which means that if somehow a typo makes it
through your indexing process, you may not know about the problem until
later.

With a static schema, an indexing request that has an error in a field
name will be rejected and you will receive an error, which is how I
would want Solr to behave.
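
A minimal sketch of what that looks like from the indexing client, in Python with only the standard library. The host, the collection name (`products`) and the misspelled field (`titel`) are made up for illustration, and the exact status code and error text depend on your Solr version and configuration:

```python
import json
import urllib.error
import urllib.request


def build_update_request(base_url, collection, docs):
    """Build (but do not send) a JSON update request for a collection."""
    url = f"{base_url}/solr/{collection}/update?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )


def index_docs(base_url, collection, docs):
    """Send the documents and return (ok, message)."""
    request = build_update_request(base_url, collection, docs)
    try:
        with urllib.request.urlopen(request) as response:
            return True, response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # Assumption: with a static schema, a misspelled field name comes
        # back as an HTTP error (typically 400, "unknown field") instead
        # of being silently added to the schema.
        return False, f"HTTP {err.code}: {err.read().decode('utf-8')}"


if __name__ == "__main__":
    # "titel" is a deliberate typo for a hypothetical "title" field.
    ok, message = index_docs(
        "http://localhost:8983", "products", [{"id": "1", "titel": "oops"}]
    )
    print(ok, message)
```

With the data-driven configuration the same request would be expected to succeed and leave a new `titel` field behind in the schema.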

The data-driven schema is good for prototyping, but because the field
definitions that get added are just a guess by Solr, I would manually
edit the schema before going into production.  Once in production I
would want to be in complete manual control of the schema.

Thanks,
Shawn


Re: Solr 5: Schema.xml vs. Managed Schema - which is advisable?

Posted by Upayavira <uv...@odoko.co.uk>.
They are different beasts, but I bet on the managed schema winning in
the long run.

With the bulk API, you can post a heap of fields/etc in one go, so
basically, rather than pushing the schema to Zookeeper, you push it to
Solr. 
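
As a rough sketch of such a bulk call, here is a Python snippet (standard library only) that puts several Schema API commands into a single POST. The host, the collection name (`products`), the field names and the field types are assumptions for illustration; use whatever your own schema defines:

```python
import json
import urllib.request


def build_bulk_schema_request(base_url, collection, fields, copy_fields=()):
    """Build one Schema API request carrying several commands at once."""
    commands = {"add-field": list(fields)}
    if copy_fields:
        commands["add-copy-field"] = list(copy_fields)
    url = f"{base_url}/solr/{collection}/schema"
    return urllib.request.Request(
        url,
        data=json.dumps(commands).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def send(request):
    """POST the request and return Solr's JSON response as a dict."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical fields; "text_general" and "string" must exist as
    # field types in the target schema for this to be accepted.
    request = build_bulk_schema_request(
        "http://localhost:8983",
        "products",
        fields=[
            {"name": "title", "type": "text_general", "stored": True},
            {"name": "sku", "type": "string", "stored": True},
        ],
        copy_fields=[{"source": "title", "dest": "text"}],
    )
    print(send(request))
```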

Look at Solr 5.4 when it comes out shortly. It'll change the way you
think about the schema. The managed schema has been there for ages, but
now the UI has support for it in the schema tab. Being able to really
easily create and remove fields certainly does things to my brain
because it is just so easy.

Upayavira

On Thu, Dec 3, 2015, at 08:35 PM, Erick Erickson wrote:
> It Depends (tm).
> 
> Managed Schema is way cool if you have a front end that lets you
> manipulate the schema via a browser or other program. There's really
> no other way to deal with changing the schema from a browser without
> allowing uploading xml files, which is a security problem. Trust me on
> this one ;).
> 
> For people who know the ins and outs of schema.xml, it's often easier
> just to edit the raw file and upload it to ZK (or use it locally). And
> much faster for mass edits.
> 
> So really they're different beasts. The end result is functionally the
> same: there's a schema that's read by Solr and used. The managed
> schema makes it harder for typos to sneak in and prevent collections
> from loading, at the expense of fast mass editing.
> 
> And there is some ability to change the solrconfig.xml file, see:
> https://cwiki.apache.org/confluence/display/solr/Config+API. But again
> whether you "should" use that or just manually edit solrconfig.xml is
> largely a matter of the tools available and personal taste.
> 
> 
> bq: ....will be easier to deploy / maintain
> 
> 
> Not a lot of difference here. At the end of the day, you have to
> 1> have the configs stored somewhere safely in version control (or at
> least I think you must)
> 2> change the files in the config set on Zookeeper
> 3> reload the collection.
> 
> So with manual editing, to change something you'd
> 1> get the files from VCS
> 2> edit them
> 3> push them to ZK
> 4> reload the collection (collections API) and verify it was correct
> 5> save the configs back to VCS.
> 
> With managed schema you'd
> 1> use the managed schema API to make changes
> 2> reload the collection and verify
> 3> pull the changes from Zookeeper
> 4> put them in VCS
> 
> 
> Best,
> Erick
> 
> 
> 

Re: Solr 5: Schema.xml vs. Managed Schema - which is advisable?

Posted by Erick Erickson <er...@gmail.com>.
It Depends (tm).

Managed Schema is way cool if you have a front end that lets you
manipulate the schema via a browser or other program. There's really
no other way to deal with changing the schema from a browser without
allowing uploading xml files, which is a security problem. Trust me on
this one ;).

For people who know the ins and outs of schema.xml, it's often easier
just to edit the raw file and upload it to ZK (or use it locally). And
much faster for mass edits.

So really they're different beasts. The end result is functionally the
same: there's a schema that's read by Solr and used. The managed
schema makes it harder for typos to sneak in and prevent collections
from loading, at the expense of fast mass editing.

And there is some ability to change the solrconfig.xml file, see:
https://cwiki.apache.org/confluence/display/solr/Config+API. But again
whether you "should" use that or just manually edit solrconfig.xml is
largely a matter of the tools available and personal taste.


bq: ....will be easier to deploy / maintain


Not a lot of difference here. At the end of the day, you have to
1> have the configs stored somewhere safely in version control (or at
least I think you must)
2> change the files in the config set on Zookeeper
3> reload the collection.

So with manual editing, to change something you'd
1> get the files from VCS
2> edit them
3> push them to ZK
4> reload the collection (collections API) and verify it was correct
5> save the configs back to VCS.
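
Steps 3 and 4 of the manual route could be scripted roughly like this in Python. It is only a sketch: the zkcli.sh location, the ZooKeeper address, the config set name and the collection name (`products`) are all placeholders for your own setup:

```python
import subprocess
import urllib.request
from urllib.parse import urlencode


def upconfig_command(zkcli, zk_host, conf_dir, conf_name):
    """Argument list for uploading a local config dir with zkcli.sh."""
    return [
        zkcli, "-zkhost", zk_host, "-cmd", "upconfig",
        "-confdir", conf_dir, "-confname", conf_name,
    ]


def reload_url(base_url, collection):
    """Collections API URL that reloads one collection."""
    query = urlencode({"action": "RELOAD", "name": collection})
    return f"{base_url}/solr/admin/collections?{query}"


def push_and_reload(zkcli, zk_host, conf_dir, conf_name, base_url, collection):
    """Upload the edited configs to ZK, then reload the collection."""
    subprocess.run(
        upconfig_command(zkcli, zk_host, conf_dir, conf_name), check=True
    )
    with urllib.request.urlopen(reload_url(base_url, collection)) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # All paths and names below are hypothetical.
    print(push_and_reload(
        "/opt/solr/server/scripts/cloud-scripts/zkcli.sh",
        "localhost:2181",
        "configs/products/conf",
        "products",
        "http://localhost:8983",
        "products",
    ))
```

Checking the reload response (and the Solr log) is still the "verify it was correct" part of step 4; the script only automates the mechanics.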

With managed schema you'd
1> use the managed schema API to make changes
2> reload the collection and verify
3> pull the changes from Zookeeper
4> put them in VCS
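
For the managed route, steps 3 and 4 (pull from ZooKeeper, put in VCS) might look something like the following Python sketch. It assumes git as the VCS, an absolute path to zkcli.sh, and made-up names for the repository, config set and ZooKeeper host:

```python
import subprocess


def downconfig_command(zkcli, zk_host, conf_dir, conf_name):
    """Argument list for downloading a config set with zkcli.sh."""
    return [
        zkcli, "-zkhost", zk_host, "-cmd", "downconfig",
        "-confdir", conf_dir, "-confname", conf_name,
    ]


def git_commands(conf_dir, message):
    """git commands that stage and commit the downloaded config dir."""
    return [
        ["git", "add", "--all", conf_dir],
        ["git", "commit", "-m", message],
    ]


def pull_and_commit(zkcli, zk_host, repo_dir, conf_dir, conf_name, message):
    """Pull the live config set into the working copy and commit it."""
    subprocess.run(
        downconfig_command(zkcli, zk_host, conf_dir, conf_name),
        check=True, cwd=repo_dir,
    )
    for command in git_commands(conf_dir, message):
        # Note: "git commit" exits non-zero when nothing changed, so
        # check=True raises CalledProcessError in that case.
        subprocess.run(command, check=True, cwd=repo_dir)


if __name__ == "__main__":
    # All paths and names below are hypothetical.
    pull_and_commit(
        "/opt/solr/server/scripts/cloud-scripts/zkcli.sh",
        "localhost:2181",
        "/srv/solr-configs",
        "products/conf",
        "products",
        "Schema change made via the Schema API",
    )
```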


Best,
Erick



On Thu, Dec 3, 2015 at 12:09 PM, Don Bosco Durai <bo...@apache.org> wrote:
> My experience is, once managed-schema is created, then schema.xml even if present is ignored. When both are present, you will get a warning in the Solr log.
>
> I have stopped using schema.xml. Actually, I use it once, start Solr and after it generates managed-schema, I export it and pretty much just update it going forward.
>
> I think the recommended way to manage fields is using API calls, but that might not always be possible, e.g. if you have to save the config in a source control system. If you are doing that, make sure you update the saved copy regularly, because if Solr finds a new field name, it will auto-create it in the managed-schema and your saved copy will be out of date.
>
> Bosco
>
>
>
>

Re: Solr 5: Schema.xml vs. Managed Schema - which is advisable?

Posted by Don Bosco Durai <bo...@apache.org>.
My experience is, once managed-schema is created, then schema.xml even if present is ignored. When both are present, you will get a warning in the Solr log.

I have stopped using schema.xml. Actually, I use it once, start Solr and after it generates managed-schema, I export it and pretty much just update it going forward. 

I think the recommended way to manage fields is using API calls, but that might not always be possible, e.g. if you have to save the config in a source control system. If you are doing that, make sure you update the saved copy regularly, because if Solr finds a new field name, it will auto-create it in the managed-schema and your saved copy will be out of date.

Bosco




On 12/3/15, 11:47 AM, "Jeff Wartes" <jw...@whitepages.com> wrote:

>I’ve never used the managed schema, so I’m probably biased, but I’ve never
>seen much of a point to the Schema API.
>
>I need to make changes sometimes to solrconfig.xml, in addition to
>schema.xml and other config files, and there’s no API for those, so my
>process has been like:
>
>1. Put the entire config directory used by a collection in source control
>somewhere. solrconfig.xml, schema.xml, synonyms.txt, everything.
>2. Make changes, test, commit
>3. “Release” by uploading the whole config dir at a specific commit to ZK
>(overwriting any existing files) and issuing a collections API “reload”.
>
>
>This has the downside that I can upload a broken config and take down my
>collection, but with the whole config dir in source control,
>I can also easily roll back to any point by uploading an old commit.
>You still have to be aware of how the changes you’re making will affect
>your current index, but that’s unavoidable.
>
>


Re: Solr 5: Schema.xml vs. Managed Schema - which is advisable?

Posted by Jeff Wartes <jw...@whitepages.com>.
I’ve never used the managed schema, so I’m probably biased, but I’ve never
seen much of a point to the Schema API.

I need to make changes sometimes to solrconfig.xml, in addition to
schema.xml and other config files, and there’s no API for those, so my
process has been like:

1. Put the entire config directory used by a collection in source control
somewhere. solrconfig.xml, schema.xml, synonyms.txt, everything.
2. Make changes, test, commit
3. “Release” by uploading the whole config dir at a specific commit to ZK
(overwriting any existing files) and issuing a collections API “reload”.


This has the downside that I can upload a broken config and take down my
collection, but with the whole config dir in source control,
I can also easily roll back to any point by uploading an old commit.
You still have to be aware of how the changes you’re making will affect
your current index, but that’s unavoidable.
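
The "upload an old commit" part could be sketched like this in Python, assuming the configs live in a git repository. The repository path, commit id and config path are placeholders; the exported directory is then uploaded to ZK and the collection reloaded as usual:

```python
import io
import subprocess
import tarfile


def archive_command(commit, conf_path):
    """git command that streams conf_path as it was at the given commit."""
    return ["git", "archive", "--format=tar", commit, conf_path]


def export_config(repo_dir, commit, conf_path, dest_dir):
    """Unpack the config directory from one commit into dest_dir."""
    result = subprocess.run(
        archive_command(commit, conf_path),
        cwd=repo_dir, check=True, capture_output=True,
    )
    # The tar stream comes from our own repository, so it is trusted.
    with tarfile.open(fileobj=io.BytesIO(result.stdout)) as archive:
        archive.extractall(dest_dir)


if __name__ == "__main__":
    # Hypothetical repository, commit and paths.
    export_config(
        "/srv/solr-configs", "abc1234", "products/conf", "/tmp/release"
    )
```

Exporting to a scratch directory leaves the working copy untouched, so a roll-back does not depend on what happens to be checked out.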


On 12/3/15, 7:09 AM, "Kelly, Frank" <fr...@here.com> wrote:

>Just wondering if folks have any suggestions on using Schema.xml vs.
>Managed Schema going forward.
>
>Our deployment will be
>> 3 Zk, 3 Shards, 3 replicas
>> Copies of each collection in 5 AWS regions (EBS-backed EC2 instances)
>> Planning at least 1 Billion objects indexed (currently < 100 million)
>
>I'm sure our schema.xml will have changes and fixes and just wondering
>which approach (schema.xml vs. managed)
>will be easier to deploy / maintain?
>
>Cheers!
>
>-Frank
>
>
>Frank Kelly
>Principal Software Engineer
>Predictive Analytics Team (SCBE/HAC/CDA)
>
>
>
>
>
>
>
>