Posted to dev@lucene.apache.org by "Hoss Man (JIRA)" <ji...@apache.org> on 2018/01/06 00:03:00 UTC
[jira] [Commented] (SOLR-11741) Offline training mode for schema guessing

    [ https://issues.apache.org/jira/browse/SOLR-11741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16314169#comment-16314169 ] 

Hoss Man commented on SOLR-11741:
---------------------------------

bq. Proposing an offline training mode where Solr accepts bunch of documents and returns a guessed schema (without indexing). This schema can then be used for actual indexing. I think the original idea is from Hoss.

bq. I think initial implementation can be based on an UpdateRequestProcessor. We can hash out the API soon, as we go along.

FWIW...

What I suggested at one point (I don't remember where ... it may already be in a jira somewhere?) was an UpdateRequestProcessorFactory that could be configured _instead_ of RunUpdateProcessorFactory in a chain (one that does not already include AddSchemaFieldsUpdateProcessorFactory), placed *after* all of the various ParseFooFieldUpdateProcessorFactories.

* For ADD commands, this processor (factory) would iterate over the SolrInputDocuments, and then iterate over the field names in those documents, and record in memory whether any docs had more than one value for that field name, as well as the "Least Common Denominator" of the *java type* of the values found -- ie:
** if docA=1(int), but docB=1.1(float), docC=5.5(float) then we remember "Float"
** if docA=1(int), docB=1.1(float), docC=1.0000000001(double) then we remember "Double".
** If docA=2017-12-12Z(date) and docB=42(int) then we remember "String"
* For COMMIT commands, this processor (factory) would take all of the info it had accumulated from the ADDs received up to that point, and use it to exec Schema Field additions -- using the same sort of "java object class -> fieldType name" mapping that AddSchemaFieldsUpdateProcessorFactory uses.
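
To make the "Least Common Denominator" rule above concrete, here is a minimal sketch in plain Java. It is not existing Solr code: the class name {{TypeWidening}}, the {{widen}} method, and the numeric ordering (Integer, Long, Float, Double) are all assumptions made for illustration, and anything that is not a pair of numeric types simply falls back to String.

```java
import java.util.Arrays;
import java.util.Date;
import java.util.List;

// Hypothetical helper, not part of Solr: tracks the "widest" java type seen for a field.
final class TypeWidening {
    // Assumed ordering of numeric classes, narrowest first.
    private static final List<Class<?>> NUMERIC_ORDER =
        Arrays.asList(Integer.class, Long.class, Float.class, Double.class);

    // Returns the type to remember after seeing a value of type "next"
    // for a field whose remembered type so far is "seen" (null if none yet).
    static Class<?> widen(Class<?> seen, Class<?> next) {
        if (seen == null || seen.equals(next)) {
            return next;
        }
        int a = NUMERIC_ORDER.indexOf(seen);
        int b = NUMERIC_ORDER.indexOf(next);
        if (a >= 0 && b >= 0) {
            // Both numeric: keep whichever is wider.
            return NUMERIC_ORDER.get(Math.max(a, b));
        }
        // Mixed, non-numeric types (e.g. a date and an int): fall back to String.
        return String.class;
    }

    public static void main(String[] args) {
        // The three cases from the list above.
        System.out.println(widen(widen(Integer.class, Float.class), Float.class).getSimpleName());  // Float
        System.out.println(widen(widen(Integer.class, Float.class), Double.class).getSimpleName()); // Double
        System.out.println(widen(Date.class, Integer.class).getSimpleName());                       // String
    }
}
```

Note that treating Float as wider than Long is a simplification (a float cannot represent every long exactly); a real implementation might prefer to widen that pair to Double.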

The idea being that instead of having full "schemaless" mode enabled, there could be an {{/update/train-schema}} RequestHandler configured to use this {{update.chain}}. Users could post a sampling of their docs to {{/update/train-schema}}, then once they were done training send a {{/update/train-schema?commit=true}} command, and the processor (factory) would add all the needed fields.
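
The train-then-commit flow can be sketched the same way, without any Solr dependencies. Everything here is hypothetical: the {{SchemaTrainingSketch}} class, its {{add}} and {{commit}} methods, and the fieldType names passed in are placeholders, and {{commit}} only returns a description of the fields rather than touching a real schema. The widening rule and the "class -> fieldType name" mapping are passed in, since in a real plugin they would come from configuration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

// Hypothetical sketch: accumulate per-field info on ADD, emit field definitions on COMMIT.
final class SchemaTrainingSketch {
    private final BinaryOperator<Class<?>> widen;     // how to combine two observed types
    private final Map<Class<?>, String> typeNames;    // "java object class -> fieldType name"
    private final String defaultTypeName;
    private final Map<String, Class<?>> seenType = new LinkedHashMap<>();
    private final Map<String, Boolean> multiValued = new LinkedHashMap<>();

    SchemaTrainingSketch(BinaryOperator<Class<?>> widen,
                         Map<Class<?>, String> typeNames,
                         String defaultTypeName) {
        this.widen = widen;
        this.typeNames = typeNames;
        this.defaultTypeName = defaultTypeName;
    }

    // Stand-in for handling one ADD command: nothing is indexed, only recorded.
    void add(Map<String, Object> doc) {
        for (Map.Entry<String, Object> e : doc.entrySet()) {
            Object raw = e.getValue();
            Collection<?> values = (raw instanceof Collection)
                ? (Collection<?>) raw
                : Collections.singletonList(raw);
            // Remember whether any doc had more than one value for this field.
            multiValued.merge(e.getKey(), values.size() > 1, Boolean::logicalOr);
            for (Object v : values) {
                if (v != null) {
                    seenType.merge(e.getKey(), v.getClass(), widen);
                }
            }
        }
    }

    // Stand-in for handling COMMIT: describe the fields that would be added.
    List<String> commit() {
        List<String> fields = new ArrayList<>();
        for (Map.Entry<String, Class<?>> e : seenType.entrySet()) {
            String typeName = typeNames.getOrDefault(e.getValue(), defaultTypeName);
            fields.add(e.getKey() + " type=" + typeName
                + " multiValued=" + multiValued.get(e.getKey()));
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<Class<?>, String> names = new LinkedHashMap<>();
        names.put(Integer.class, "plongs");   // placeholder fieldType names
        names.put(Double.class, "pdoubles");
        // Deliberately crude widening for the demo: any mismatch becomes String.
        SchemaTrainingSketch s = new SchemaTrainingSketch(
            (a, b) -> a.equals(b) ? a : String.class, names, "strings");

        Map<String, Object> d1 = new LinkedHashMap<>();
        d1.put("id", 1);
        d1.put("tags", Arrays.asList("a", "b"));
        Map<String, Object> d2 = new LinkedHashMap<>();
        d2.put("id", "x2");
        d2.put("price", 9.5);
        s.add(d1);
        s.add(d2);
        s.commit().forEach(System.out::println);
    }
}
```

In this sketch all state lives in one in-memory object, so every training doc and the final commit have to reach the same instance.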

----
By no means should that idea be considered an end all be all solution / design.

It doesn't play very nicely with distributed updates (you'd either have to ensure all training data was sent to the same node where you send the "commit", or add special custom logic to ensure it all got forwarded to a special node) and there are probably a lot more sophisticated / smarter ways to do it ... it was just something I brainstormed one day as something that should be fairly easy to implement as a solr plugin leveraging most of the existing "schemaless" features of Solr -- where "Parse if possible" update processors already do most of the heavy lifting.

Perhaps it can inspire a more robust solution?

> Offline training mode for schema guessing
> -----------------------------------------
>
>                 Key: SOLR-11741
>                 URL: https://issues.apache.org/jira/browse/SOLR-11741
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Ishan Chattopadhyaya
>
> Our data driven schema guessing doesn't work in many situations. For example, if the first document has a field with value "0", it is guessed as Long and subsequent fields with "0.0" are rejected. Similarly, if the same field had alphanumeric contents for a later document, those documents are rejected. Also, single vs. multi valued field guessing is not ideal.
> Proposing an offline training mode where Solr accepts bunch of documents and returns a guessed schema (without indexing). This schema can then be used for actual indexing. I think the original idea is from Hoss.
> I think initial implementation can be based on an UpdateRequestProcessor. We can hash out the API soon, as we go along.


