Posted to dev@lucene.apache.org by Erick Erickson <er...@gmail.com> on 2015/02/12 16:30:21 UTC

Enforce "reasonable" field names in Solr?

I was commenting on SOLR-6997 about allowing hyphens in field names
and started to wonder about whether we should try to push people to
"good" names. The ref guide states:

"Field names should consist of alphanumeric or underscore characters
only and not start with a digit"

and SOLR-6997 is a good example of why. I am _not_ at all interested
in supporting the hyphen BTW.

I realize we can't suddenly start enforcing this rule b/c it would
break existing installations. What do people think about defaulting to
throwing an error? Or posting a fat warning with a "deprecation"
message?

I'm envisioning a "strict_field_name" tag or some such that defaults
to true but could be set to false for back compat, with the check
done only when parsing a schema.
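
A rough sketch of what such a check could look like, in Python for
brevity. Nothing here is Solr code: the function name is hypothetical,
the "strict_field_name" flag is only the idea floated above, and
"alphanumeric" is assumed to mean ASCII letters and digits.

```python
import re
import warnings

# Ref-guide rule: alphanumeric or underscore characters only, not starting
# with a digit. ASCII-only is an assumption made for this sketch.
STRICT_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def check_field_names(names, strict_field_name=True):
    """Return the names that break the rule; raise instead in strict mode."""
    bad = [n for n in names if not STRICT_NAME.fullmatch(n)]
    if bad and strict_field_name:
        raise ValueError("invalid field names: " + ", ".join(bad))
    # Back-compat mode: accept the names but emit a deprecation warning.
    for n in bad:
        warnings.warn("field name %r breaks the naming rule" % n,
                      DeprecationWarning)
    return bad
```

With the default, a schema containing "my-field" would fail at parse
time; with strict_field_name=False it would load and only warn.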

I'm not at all sure how that plays with the managed schema stuff though.

Erick

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Enforce "reasonable" field names in Solr?

Posted by Jan Høydahl <ja...@cominvent.com>.
+1 to formally defining field naming

There would be a few absolutes, like
* No spaces (or a convention that spaces will be replaced by _)
* Must not start with + or -, since they already have special meaning in query parsers
* etc

A list of reserved chars would also be helpful, along with a well-defined way
to escape them so they can still be used. It would however be sad if we
disallowed "." since it is quite nice to be able to have dots in field names.
Could the f.<field>.whatever parsing try looking for the longest possible
string? We should probably disallow "," since it is used as a separator in fl
etc. Or would it be acceptable with escaping? e.g.
  fl=name\,last,name\,first
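
Both ideas can be sketched in a few lines of Python. This is only an
illustration under assumptions: the function names are made up, the
backslash escape is the convention suggested above rather than anything
Solr is known to implement here, and "longest possible string" is read as
"longest known field name that fits".

```python
def split_escaped(value, sep=",", esc="\\"):
    """Split value on sep, treating esc followed by a char as that literal char."""
    parts, current = [], []
    chars = iter(value)
    for ch in chars:
        if ch == esc:
            # Take the next character literally (a trailing esc is kept as-is).
            current.append(next(chars, esc))
        elif ch == sep:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def parse_per_field_param(param, known_fields):
    """Parse 'f.<field>.<rest>', preferring the longest known field name."""
    if not param.startswith("f."):
        return None
    body = param[2:]
    for field in sorted(known_fields, key=len, reverse=True):
        # The field must be followed by a dot and a non-empty remainder.
        if body.startswith(field + ".") and len(body) > len(field) + 1:
            return field, body[len(field) + 1:]
    return None
```

For the example above, split_escaped(r"name\,last,name\,first") yields the
two names "name,last" and "name,first". The longest-match lookup needs the
list of known fields, so it is still ambiguous for dynamic fields that are
not known up front.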

+1 to a way to relax the rules for back-compat

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com



Re: Enforce "reasonable" field names in Solr?

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
I wonder if the people who are using dynamic schema care about having
the fields indexed without _them_ doing pre-processing, but don't mind
if they have to use cleaned-up names during search. Like, when you
index from Tika and you just have no clue what possible metadata names
are in various files. So, you just want to throw the whole lot in,
prefixed.

In which case, this could be solved with a specialized
UpdateRequestProcessor step that will normalize the field names in a
consistent fashion.
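
The normalization itself could be as small as the following Python
sketch. It is not an UpdateRequestProcessor and uses no Solr API; the
function names and the "meta_" prefix are assumptions chosen only for
illustration.

```python
import re


def normalize_field_name(raw, prefix="meta_"):
    """Map an arbitrary metadata key to an alphanumeric/underscore name."""
    # Replace each run of characters outside [A-Za-z0-9_] with one underscore.
    cleaned = re.sub(r"[^A-Za-z0-9_]+", "_", raw).strip("_").lower()
    # The prefix also guarantees the result never starts with a digit.
    return prefix + cleaned


def normalize_document(doc, prefix="meta_"):
    """Return a copy of doc with normalized keys; colliding keys are merged."""
    out = {}
    for key, value in doc.items():
        out.setdefault(normalize_field_name(key, prefix), []).append(value)
    return out
```

Different raw names can collapse to the same cleaned name ("Content-Type"
and "content type", say), so the sketch collects values into lists rather
than silently overwriting one with the other.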

Regards,
   Alex.
----
Sign up for my Solr resources newsletter at http://www.solr-start.com/



Re: Enforce "reasonable" field names in Solr?

Posted by Erick Erickson <er...@gmail.com>.
Jack:

re: your "little gotcha". I suspect there are enough of these lying
around that it'd be a rat-hole to formally support them, and as a
developer I'd at least like the choice to "fail early, fail often".

Your point about dynamic field names is well taken; sometimes there
isn't total control of the field names. That is why I suggested that
the strict mode be the default, but overridable.

So not only does the bit about verifying the field names need to take
managed schema into account, but also dynamic field definition...
Siiiggggh...

That is, if we do anything about it.



Re: Enforce "reasonable" field names in Solr?

Posted by Jack Krupansky <ja...@gmail.com>.
I used to be 100% in favor of strict names (well, plus the hyphen!), and in
general it is fine for statically declared fields. But then I started
encountering uses of numbers, spaces, slashes, and other punctuation, but
always in the context of dynamic fields. For example, somebody wants to
support a map-like field using dynamic fields with a dynamic field for each
map key, but their map keys are application-defined and not restricted to
Java name rules, such as a date with punctuation, or something that looks
like a part number with numbers and dashes, or a product name or person or
place name that has spaces and dashes and slashes and commas and periods
and parentheses.

The big question is how Solr might depend on strict names, and then how to
properly escape improper field names. There are a lot of places that use
field names within some larger syntax, but no consistent escaping rules.
For example, the fl and qf parameters, and fielded queries.

Maybe the real bottom line is to ensure that field naming is clearly
documented early on in tutorials and up front in the doc, rather than in
some relatively hidden fine print.

Hmmm... what does Elasticsearch do? As long as the field name is simply a
single quoted string, then there is no issue.

Oh, here's a great little gotcha: field names embedded in parameters that
are field-specific, like f.<field-name>.facet. URL escaping would be
needed, but are names with embedded dots supported? And does the URL query
parameter syntax support escaping of an equal sign in a query parameter
name?
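
On the last question: percent-encoding applies to parameter names as well
as values, so "=" and "&" in a name can be sent as %3D and %26. A small
round trip with Python's standard urllib.parse shows the mechanics; the
field name is made up for illustration, and this says nothing about how
Solr would then interpret the dots inside the decoded name.

```python
from urllib.parse import parse_qsl, urlencode

# Hypothetical field name containing '.', '=', '&' and a space.
field = "part.no=A&B 7"
params = [("f." + field + ".facet.limit", "5")]

# urlencode percent-encodes parameter names as well as values.
query = urlencode(params)

# parse_qsl splits on the raw '&' and '=' first and only then decodes,
# so the encoded characters inside the name do not confuse it.
decoded = parse_qsl(query)

print(query)
print(decoded == params)
```

So the URL layer itself is fine with an equals sign in a parameter name;
the open question is only what the server does with the name once decoded.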


-- Jack Krupansky
