You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2020/09/13 03:33:16 UTC

[GitHub] [lucene-solr] arafalov opened a new pull request #1863: SOLR-14701: GuessSchemaFields URP to replace AddSchemaFields URP in schemaless mode

arafalov opened a new pull request #1863:
URL: https://github.com/apache/lucene-solr/pull/1863


   #Description
   
   New URP with tests. Everything is based on AddSchemaFieldsUpdateProcessorFactory but is sufficiently different to be a separate entry for backward compatibility.
   
   # Solution
   
   This solves the problem with original solution by:
   - splitting accumulation of information about data seen (in processAdd) and actual schema modification (in commit)
   - it uses parameter flag instead of disabling the whole chain, which causes problems if you do want to parse dates
   - it supports parameter widening for numeric types to achieve sane results (e.g. Integer promoted to Double)
   - It tracks multiplicity of values, so allows the baseline fieldTypes to be single-valued and fields declare multiValued as appropriate
   
   It removes
   - Sub-selecting which fields this process applies to. Mostly because we are now much more explicit about it being a learning schema. But also because nobody seems to be using that. Nor was it tested.
   - Default options. They were interactive in non-predictable ways and the only sane option is mapping for String type anyway. 
   
   Not in this PR 
   - RefGuide 
   - New Examples
   - Change file notice
   They should be done together in a separate Jira, as they may need additional discussion.
   
   # Tests
   The existing schemaless tests were copied and adjusted to work with new limitations. A couple more tests could be added later to test better for multiplicity (already tested with xpath), but also to test type widening more directly.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms to the standards described there to the best of my ability.
   - [x] I have created a Jira issue and added the issue ID to my pull request title.
   - [x] I have given Solr maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `master` branch.
   - [x] I have run `./gradlew check`.
   - [x] I have added tests for my changes.
   - [-] I have added documentation for the [Ref Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) (for Solr changes only).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] epugh commented on pull request #1863: SOLR-14701: GuessSchemaFields URP to replace AddSchemaFields URP in schemaless mode

Posted by GitBox <gi...@apache.org>.
epugh commented on pull request #1863:
URL: https://github.com/apache/lucene-solr/pull/1863#issuecomment-698864894


   This may be me wearing my "marketing hat", but I think I was somewhat resistent to this whole idea.   This morning I was thinking...   Maybe I just don't like the word "guess", it has some negative connations when it comes to data and schemas and sotware, though may be accurate!   What about **Explore-Schema** or **Predict-Schema** ?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] noblepaul edited a comment on pull request #1863: SOLR-14701: GuessSchemaFields URP to replace AddSchemaFields URP in schemaless mode

Posted by GitBox <gi...@apache.org>.
noblepaul edited a comment on pull request #1863:
URL: https://github.com/apache/lucene-solr/pull/1863#issuecomment-697054163


   > At the point, the schema is created (based on all data) but nothing is actually indexed; the user can review and adjust the schema without any issues
   
   How do you expect the schema to be adjusted? User has no idea what just happened
   
   I'm assuming Step 2 won't be required as it will be built into the end point. 
   
   Is it possible to just return the commands to create a schema?
   
   example
   
   ```
   bin/post -c guess -params "guess-schema=true" example/films/films.json
   ```
   the response can be
   
   ```
   curl -X POST -H 'Content-type: application/json' -d '{"add-field":[
   {
   "name":"id",  
   "type":"string",
    "stored":true },
   {
   "name":"desc",  
   "type":"text",
    "stored":true
    }
   ]}' http://localhost:8983/solr/guess/schema
   ```
   The user has the following choices
   
   1. Run the command as is
   2. edit the command and post it
   
   My point is , our guess schema is worse than useless and anything that it creates will be WRONG for 99% of users. Why bother creating the fields in schema when we know fully well that user will have to edit the schema in ZK soon after


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] noblepaul commented on pull request #1863: SOLR-14701: GuessSchemaFields URP to replace AddSchemaFields URP in schemaless mode

Posted by GitBox <gi...@apache.org>.
noblepaul commented on pull request #1863:
URL: https://github.com/apache/lucene-solr/pull/1863#issuecomment-697194702


   >Strong words there "worse than useless", especially considering that this - to me - seems a strong improvement on the current schemaless mode as it looks at more values and actually supports single/multivalued fields.
   
   I was referring to the current solution we have in Solr (schemaless, guess schema thing) . It's not a comment on the new solution. The current solution is indeed worse than useless
   
   >Generating Schema JSON raises its own questions, such as the shape of the schema it will be applied to, as guessing is currently happening as a differential to the existing schema. 
   
   The command is only relevant for that moment. If you execute it right away, it's useful. Users most likely will just copy paste the command (and edit, if required) 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] noblepaul edited a comment on pull request #1863: SOLR-14701: GuessSchemaFields URP to replace AddSchemaFields URP in schemaless mode

Posted by GitBox <gi...@apache.org>.
noblepaul edited a comment on pull request #1863:
URL: https://github.com/apache/lucene-solr/pull/1863#issuecomment-697054163


   > At the point, the schema is created (based on all data) but nothing is actually indexed; the user can review and adjust the schema without any issues
   
   How do you expect the schema to be adjusted? User has no idea what just happened
   
   I'm assuming Step 2 won't be required as it will be built into the end point. 
   
   Is it possible to just return the commands to create a schema?
   
   example
   
   ```
   bin/post -c guess -params "guess-schema=true" example/films/films.json
   ```
   the response can be
   
   ```
   curl -X POST -H 'Content-type: application/json' -d '{"add-field":[
   {
   "name":"id",  
   "type":"string",
    "stored":true },
   {
   "name":"desc",  
   "type":"text",
    "stored":true
    }
   ]}' http://localhost:8983/solr/guess/config
   ```
   The user has the following choices
   
   1. Run the command as is
   2. edit the command and post it
   
   My point is , our guess schema is worse than useless and anything that it creates will be WRONG for 99% of users. Why bother creating the fields in schema when we know fully well that user will have to edit the schema in ZK soon after


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] arafalov commented on pull request #1863: SOLR-14701: GuessSchemaFields URP to replace AddSchemaFields URP in schemaless mode

Posted by GitBox <gi...@apache.org>.
arafalov commented on pull request #1863:
URL: https://github.com/apache/lucene-solr/pull/1863#issuecomment-697846565


   > Purely responding to the URP response part, it’s definitely not possible for URP to send non-error responses. I do think its something we should implement though, since it will expand the use cases that URPs can solve. Ill create a JIRA for it.
   
   It may be possible to future proof this implementation by making **guess-schema** being a mode switch, instead of current present/absent flag. So, maybe rename it to **guess-mode** instead with options of 
   - **update** - current (only) option basically, 
   - **show** - (if/when there is a way to return suggested JSON), 
   - **update-all** - (if we wanted to - sometimes - have specific fields even if dynamicField definition matches; could be done now if useful, 
   - **none** to support tools easier. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] noblepaul edited a comment on pull request #1863: SOLR-14701: GuessSchemaFields URP to replace AddSchemaFields URP in schemaless mode

Posted by GitBox <gi...@apache.org>.
noblepaul edited a comment on pull request #1863:
URL: https://github.com/apache/lucene-solr/pull/1863#issuecomment-697054163






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] arafalov commented on pull request #1863: SOLR-14701: GuessSchemaFields URP to replace AddSchemaFields URP in schemaless mode

Posted by GitBox <gi...@apache.org>.
arafalov commented on pull request #1863:
URL: https://github.com/apache/lucene-solr/pull/1863#issuecomment-697188648


   Also, I am not even sure there is a pathway to return a non-error message from commit that bin/post will echo to the user as a positive statement. For queries, yes. But we are talking an Update handler with a URP chain.  


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] noblepaul commented on pull request #1863: SOLR-14701: GuessSchemaFields URP to replace AddSchemaFields URP in schemaless mode

Posted by GitBox <gi...@apache.org>.
noblepaul commented on pull request #1863:
URL: https://github.com/apache/lucene-solr/pull/1863#issuecomment-696679556






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] arafalov commented on pull request #1863: SOLR-14701: GuessSchemaFields URP to replace AddSchemaFields URP in schemaless mode

Posted by GitBox <gi...@apache.org>.
arafalov commented on pull request #1863:
URL: https://github.com/apache/lucene-solr/pull/1863#issuecomment-697406760


   Ok, I am glad we are on the same page that the current (let's call it _Add_) solution is rather bad despite all the great work put into it. Let's now get onto the same page about the next step you are actually proposing. I can read the rest of your statement in one of the following ways:
   
   1. Neither original _Add_ nor proposed _Guess_ solutions will address problem. **Next step: that discussion is not about code and should be taken up in the parent JIRA**. 
   That's exactly what it is there for and this code/PR is here to push the discussion from theoretical to practical.
   2. _Guess_ approach is ok overall, but the schema creation is still bad, could it return schema generation commands instead. I just double-checked code and there is no way for the current architecture to return non-error feedback (from either processCommit or SimplePostTool side). **Next step: Propose a way this could be done.** 
   Do note that the reason we are still an URP is because any schema guessing or creation depends on previous chain URPs to be always enabled (e.g. for custom dates formats); that is one of the things really broken with enable/disable flag for _Add_ solution and why I am doing the single-URP level flag.
   3. We need some other Guess approach. **Next action: propose alternative architecture, preferably as straw-man implementation**.
   This would give people on JIRA a chance to select from TWO ways forward, that would be amazing whether we end on one, another or merged solution.
   4. ??? Use veto and keep status quo until somebody yet different has a much better idea than people in last 3 JIRA? 
   5. ??? (I don't claim to read your mind, but I want to move this discussion forward in concrete non-blocking steps)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] noblepaul commented on pull request #1863: SOLR-14701: GuessSchemaFields URP to replace AddSchemaFields URP in schemaless mode

Posted by GitBox <gi...@apache.org>.
noblepaul commented on pull request #1863:
URL: https://github.com/apache/lucene-solr/pull/1863#issuecomment-697054163


   ># 6 At the point, the schema is created (based on all data) but nothing is actually indexed; the user can review and adjust the schema without any issues
   
   How do you expect the schema to be adjusted? User has no idea what just happened
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] noblepaul commented on pull request #1863: SOLR-14701: GuessSchemaFields URP to replace AddSchemaFields URP in schemaless mode

Posted by GitBox <gi...@apache.org>.
noblepaul commented on pull request #1863:
URL: https://github.com/apache/lucene-solr/pull/1863#issuecomment-698728576


   I recommend a new request handler such as `/update/guess-schema`
   
   This way we do not need to add any new functionality, nor do we need to pass any extra params
   
   ```
   curl -F 'data=@datafile.json' http://localhost:8983/gettingstarted/update/guess-schema
   ```
   The response can be 
   
   ```
   curl -X POST -H 'Content-type: application/json' -d '{"add-field":[
   {
   "name":"id",  
   "type":"string",
    "stored":true },
   {
   "name":"desc",  
   "type":"text",
    "stored":true
    }
   ]}' http://localhost:8983/solr/gettingstarted/schema
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] arafalov commented on pull request #1863: SOLR-14701: GuessSchemaFields URP to replace AddSchemaFields URP in schemaless mode

Posted by GitBox <gi...@apache.org>.
arafalov commented on pull request #1863:
URL: https://github.com/apache/lucene-solr/pull/1863#issuecomment-697179182


   Strong words there "worse than useless", especially considering that this - to me - seems a strong improvement on the current schemaless mode as it looks at more values and actually supports single/multivalued fields. 
   
   In general, I was trying to implement Hoss's proposal, but I am open to the other ideas, if we can clarify the use case.
   
   My understanding is that the use case is of having a lot of data that one does not quite know the shape off. So, they want to index it quickly, explore and then do some manual adjustments.  I am not expecting this to be anywhere near production. Schemaless mode should not have been either.
   
   I am not sure how many people will know how to do step 6, but currently they don't even have that option. Switching from single-value to multi-value is impossible (very hard?) once the actual values are in the index. One has to basically delete everything and start again. As happens in the films example, if one misses the README. With this one, they can look at field definitions in Admin UI and remove or add fields as required without underlying lucene indexes throwing complains.
   
   The way I am seeing this (as well as for other example) is to have a super minimal learning configuration where every additional field is quite obvious. That learning schema, clearly, would not need the step 2 as it would be all setup. I thought your question was about how you would test the code for yourself.
   
   Additionally, to help see what was changed, I think the tag JIRA could be helpful. And frankly, in my imagination, it is not a cloud setup, but a simple learning one. Whether that, by itself, is a breaking point for you, we shall have to see.
   
   Generating Schema JSON raises its own questions, such as the shape of the schema it will be applied to, as guessing is currently happening as a differential to the existing schema. Also, this does not seem like the code that should be in this particular URP, but more of a general utility. If one existed, maybe it would make sense to leverage on top of it.
   
   In general, I am open to implement it any way that seems most useful. I will wait for another couple of opinions rather than chasing one very strong one.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] noblepaul edited a comment on pull request #1863: SOLR-14701: GuessSchemaFields URP to replace AddSchemaFields URP in schemaless mode

Posted by GitBox <gi...@apache.org>.
noblepaul edited a comment on pull request #1863:
URL: https://github.com/apache/lucene-solr/pull/1863#issuecomment-697194702


   >Strong words there "worse than useless", especially considering that this - to me - seems a strong improvement on the current schemaless mode as it looks at more values and actually supports single/multivalued fields.
   
   I was referring to the current solution we have in Solr (schemaless, guess schema thing) . It's not a comment on the new solution. The current schemaless is indeed worse than useless
   
   >Generating Schema JSON raises its own questions, such as the shape of the schema it will be applied to, as guessing is currently happening as a differential to the existing schema. 
   
   The command is only relevant for that moment. If you execute it right away, it's useful. Users most likely will just copy paste the command (and edit, if required) 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] noblepaul commented on pull request #1863: SOLR-14701: GuessSchemaFields URP to replace AddSchemaFields URP in schemaless mode

Posted by GitBox <gi...@apache.org>.
noblepaul commented on pull request #1863:
URL: https://github.com/apache/lucene-solr/pull/1863#issuecomment-698728576


   I recommend a new request handler such as `/update/guess-schema`
   
   This way we do not need to add any new functionality, nor do we need to pass any extra params
   
   ```
   curl -F 'data=@datafile.json' http://localhost:8983/gettingstarted/update/guess-schema
   ```
   The response can be 
   
   ```
   curl -X POST -H 'Content-type: application/json' -d '{"add-field":[
   {
   "name":"id",  
   "type":"string",
    "stored":true },
   {
   "name":"desc",  
   "type":"text",
    "stored":true
    }
   ]}' http://localhost:8983/solr/gettingstarted/schema
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] epugh commented on pull request #1863: SOLR-14701: GuessSchemaFields URP to replace AddSchemaFields URP in schemaless mode

Posted by GitBox <gi...@apache.org>.
epugh commented on pull request #1863:
URL: https://github.com/apache/lucene-solr/pull/1863#issuecomment-698864894


   This may be me wearing my "marketing hat", but I think I was somewhat resistent to this whole idea.   This morning I was thinking...   Maybe I just don't like the word "guess", it has some negative connations when it comes to data and schemas and sotware, though may be accurate!   What about **Explore-Schema** or **Predict-Schema** ?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] arafalov commented on pull request #1863: SOLR-14701: GuessSchemaFields URP to replace AddSchemaFields URP in schemaless mode

Posted by GitBox <gi...@apache.org>.
arafalov commented on pull request #1863:
URL: https://github.com/apache/lucene-solr/pull/1863#issuecomment-698881076


   @noblepaul I don't think your proposal is fully thought out. 
   
   1. This seems not orthogonal to being able to send CSV/JSON/XML/DIH to it, if you are proposing for it to be another one of pathVsLoaders
   2. The whole 'I give you schema' proposal ignores the fact that DateParser or BoolParser URP present in guess process also needs to be present in whatever schema you send it to. That 'curl' command is hiding the user from the actual issues the guessing is supposed to help with. 
   3. Even your types "string vs text" is not something that can currently can guess.
   
   Can you do a counter-proposal in code that skips all the guessing and just shows this send/return path? But is still in the execution path to take the concerns above into the account. I could not find such a place.
   
   @epugh Predict works for me. Explore feels a bit too general and confusing (more interactive/UI feel). But, in general, I am not stuck on Guess at all. 
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] ErickErickson commented on pull request #1863: SOLR-14701: GuessSchemaFields URP to replace AddSchemaFields URP in schemaless mode

Posted by GitBox <gi...@apache.org>.
ErickErickson commented on pull request #1863:
URL: https://github.com/apache/lucene-solr/pull/1863#issuecomment-698933577


   I’ve always thought “guess" was mostly a reflection of how inadequate us devs thought the whole process was. I agree its not very confidence-inspiring…
   
   Off the top of my head, the idea of a new handler has a lot of appeal. It’s easy to get tell a user “index as many training documents as you want to this handler to create your schema. NO DOCUMENTS WILL BE INDEXED during this stage, this is for refining your schema. When you believe you’ve send enough training documents for Solr to have a reasonable chance of defining the correct schema, you must send the documents again, this time to the update handler after reloading the collection to be able to search them”. Maybe “schema-trainer”?
   
   It seems to me that there could be a some efficiencies here.
   
   1> Could we bypass reloading the collection each time? We’d have to read the schema directly and modify it. Or at least only reload it at the end of each batch if there were changes.
   
   2> The response could indicate whether there were any changes to the schema. I can imagine a process whereby I send 10 docs, see there were changes. Then send 10 more and see there were changes. Rinse/repeat until I’d been able to send N batches without any changes.
   
   3> This still isn’t going to bullet-proof semi-structured docs. Or any other really, but it’ll at least make things far more robust. What happens if I index 1,000 Word docs, call it good, then index PDFs. Or PNGs or…. anyway, I try to index some new type of document. Or do we change the update handler to put all unrecognized fields into a text field? Probably should log warnings that this is the case.
   
   4> We’d have to throw an error if the training handler was used after there were documents in the index _and_ any existing field needed to be modified. If we did that users could try to send docs through the training handler at any time, which would partially handle <3>.
   
   5> One tricky bit would be how to train on a bunch of documents _after_ some docs were already indexed. Prior to indexing any docs, we can freely modify existing fields in the schema. Back to <3>. I’m indexing all the Word docs just fine. Now I want to index PDFs and they have new fields, so I want to throw a bunch of them at the training handler. If I send them one-by-one, how to distinguish between a field that had been defined that no currently-indexed documents use and can be changed .vs. one that has documents indexed against it? This’ll bear some thought because the replica processing the training could theoretically examine the local index and see no docs using that field but some replica of some other shard _does_ have one or more docs using the field. And in the case where implicit routing is used or even composite keys this gets worse. This either gets really complicated or we make a rule like “if you use the training handler after documents are indexed, you get on
 e chance to send a batch of documents. Training will fail if any existing field needs to be modified”.
   
   6> We’d be able to confine field guessing exclusively to the training handler.
   
   Erick
   
   
   
   > On Sep 25, 2020, at 7:38 AM, Alexandre Rafalovitch <no...@github.com> wrote:
   > 
   > 
   > @noblepaul I don't think your proposal is fully thought out.
   > 
   > 	• This seems not orthogonal to being able to send CSV/JSON/XML/DIH to it, if you are proposing for it to be another one of pathVsLoaders
   > 	• The whole 'I give you schema' proposal ignores the fact that DateParser or BoolParser URP present in guess process also needs to be present in whatever schema you send it to. That 'curl' command is hiding the user from the actual issues the guessing is supposed to help with.
   > 	• Even your types "string vs text" is not something that can currently can guess.
   > Can you do a counter-proposal in code that skips all the guessing and just shows this send/return path? But is still in the execution path to take the concerns above into the account. I could not find such a place.
   > 
   > @epugh Predict works for me. Explore feels a bit too general and confusing (more interactive/UI feel). But, in general, I am not stuck on Guess at all.
   > 
   > —
   > You are receiving this because you are subscribed to this thread.
   > Reply to this email directly, view it on GitHub, or unsubscribe.
   > 
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] arafalov commented on pull request #1863: SOLR-14701: GuessSchemaFields URP to replace AddSchemaFields URP in schemaless mode

Posted by GitBox <gi...@apache.org>.
arafalov commented on pull request #1863:
URL: https://github.com/apache/lucene-solr/pull/1863#issuecomment-696704951


   @noblepaul Good point:
   
   1. **bin/solr start**
   2. Edit the default configset's add-schema-fields definition to
       - Replace AddSchema... with GuessSchema... URP
       - Remove _default_ parameter (code automatically widens to "String" as last step)
       - Remove _Integer_ mapping, where _Long_ already exists (code automatically widens to nearest acceptable definition for numerics)
   3. **bin/solr create -c guess**
   4. **bin/post -c guess -params "guess-schema=true" example/exampledocs/*.xml**
   5. As post does indexing and then commit, it triggers both phases of the URP
   6. At the point, the schema is created (based on all data) but nothing is actually indexed; the user can review and adjust the schema without any issues
   7. Flag value can actually be anything for now, in the future I was thinking that maybe this could be a tag to mark new fields (see SOLR-14855)
   8. **bin/post -c guess  example/exampledocs/*.xml**
   9. Now the data is entered and can be searched
   10. **bin/post -c guess -params "guess-schema=true" example/films/films.json**
   11. **bin/post -c guess  example/films/films.json**
   12. This shows that we don't need the "extra steps" instructions, the films example current requires.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] arafalov commented on pull request #1863: SOLR-14701: GuessSchemaFields URP to replace AddSchemaFields URP in schemaless mode

Posted by GitBox <gi...@apache.org>.
arafalov commented on pull request #1863:
URL: https://github.com/apache/lucene-solr/pull/1863#issuecomment-696704951


   @noblepaul Good point:
   
   1. **bin/solr start**
   2. Edit the default configset's add-schema-fields definition to
       - Replace AddSchema... with GuessSchema... URP
       - Remove _default_ parameter (code automatically widens to "String" as last step)
       - Remove _Integer_ mapping, where _Long_ already exists (code automatically widens to nearest acceptable definition for numerics)
   3. **bin/solr create -c guess**
   4. **bin/post -c guess -params "guess-schema=true" example/exampledocs/*.xml**
   5. As post does indexing and then commit, it triggers both phases of the URP
   6. At the point, the schema is created (based on all data) but nothing is actually indexed; the user can review and adjust the schema without any issues
   7. Flag value can actually be anything for now, in the future I was thinking that maybe this could be a tag to mark new fields (see SOLR-14855)
   8. **bin/post -c guess  example/exampledocs/*.xml**
   9. Now the data is entered and can be searched
   10. **bin/post -c guess -params "guess-schema=true" example/films/films.json**
   11. **bin/post -c guess  example/films/films.json**
   12. This shows that we don't need the "extra steps" instructions, the films example current requires.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] noblepaul edited a comment on pull request #1863: SOLR-14701: GuessSchemaFields URP to replace AddSchemaFields URP in schemaless mode

Posted by GitBox <gi...@apache.org>.
noblepaul edited a comment on pull request #1863:
URL: https://github.com/apache/lucene-solr/pull/1863#issuecomment-697054163


   > At the point, the schema is created (based on all data) but nothing is actually indexed; the user can review and adjust the schema without any issues
   
   How do you expect the schema to be adjusted? User has no idea what just happened
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] HoustonPutman commented on pull request #1863: SOLR-14701: GuessSchemaFields URP to replace AddSchemaFields URP in schemaless mode

Posted by GitBox <gi...@apache.org>.
HoustonPutman commented on pull request #1863:
URL: https://github.com/apache/lucene-solr/pull/1863#issuecomment-697420014


   Purely responding to the URP response part, it’s definitely not possible for URP to send non-error responses. I do think its something we should implement though, since it will expand the use cases that URPs can solve. Ill create a JIRA for it.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] arafalov commented on pull request #1863: SOLR-14701: GuessSchemaFields URP to replace AddSchemaFields URP in schemaless mode

Posted by GitBox <gi...@apache.org>.
arafalov commented on pull request #1863:
URL: https://github.com/apache/lucene-solr/pull/1863#issuecomment-698881076


   @noblepaul I don't think your proposal is fully thought out. 
   
   1. This seems not orthogonal to being able to send CSV/JSON/XML/DIH to it, if you are proposing for it to be another one of pathVsLoaders
   2. The whole 'I give you schema' proposal ignores the fact that DateParser or BoolParser URP present in guess process also needs to be present in whatever schema you send it to. That 'curl' command is hiding the user from the actual issues the guessing is supposed to help with. 
   3. Even your types "string vs text" is not something that can currently can guess.
   
   Can you do a counter-proposal in code that skips all the guessing and just shows this send/return path? But is still in the execution path to take the concerns above into the account. I could not find such a place.
   
   @epugh Predict works for me. Explore feels a bit too general and confusing (more interactive/UI feel). But, in general, I am not stuck on Guess at all. 
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] noblepaul commented on pull request #1863: SOLR-14701: GuessSchemaFields URP to replace AddSchemaFields URP in schemaless mode

Posted by GitBox <gi...@apache.org>.
noblepaul commented on pull request #1863:
URL: https://github.com/apache/lucene-solr/pull/1863#issuecomment-696679556


   @arafalov Can I see an example of how a user would use this?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] noblepaul edited a comment on pull request #1863: SOLR-14701: GuessSchemaFields URP to replace AddSchemaFields URP in schemaless mode

Posted by GitBox <gi...@apache.org>.
noblepaul edited a comment on pull request #1863:
URL: https://github.com/apache/lucene-solr/pull/1863#issuecomment-697194702


   >Strong words there "worse than useless", especially considering that this - to me - seems a strong improvement on the current schemaless mode as it looks at more values and actually supports single/multivalued fields.
   
   I'm sorry for the confusion.
   
   I was referring to the current solution we have in Solr (schemaless, guess schema thing) . It's not a comment on the new solution. The current schemaless is indeed worse than useless
   
   >Generating Schema JSON raises its own questions, such as the shape of the schema it will be applied to, as guessing is currently happening as a differential to the existing schema. 
   
   The command is only relevant for that moment. If you execute it right away, it's useful. Users most likely will just copy paste the command (and edit, if required) 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org