Posted to dev@solr.apache.org by Gus Heck <gu...@gmail.com> on 2021/04/28 20:45:30 UTC

Post Tool

I've generally been of the impression/opinion that the Post Tool is really
just a convenience for folks testing out solr to see what it can do, and
not really meant as a production ingestion solution.

A little while back I had a client that had a third party tool that
"integrated with solr" by invoking post.jar on documents with a script to
loop through all the files in a directory and post them (the third party
software's direct example of how to integrate, not the client's idea at
all). Needless to say this caused difficulties with the gigabytes of data
the third party tool had stored in many directories. Of course I don't
know, but I'd guess that someone with little experience was tasked with the
integration with solr at the third party software company and they followed
some examples... then turned them into an "integration" blissfully unaware
of the limitations of what they had done.
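
For contrast with the one-invocation-per-file loop described above, here is a minimal sketch of grouping documents into batches so that each request carries many documents. It is only an illustration: the helper names `batched` and `build_update_payload` and the batch size are hypothetical, not anything from the post tool or the third party integration.

```python
import json
from typing import Iterable, Iterator, List


def batched(items: Iterable, size: int) -> Iterator[List]:
    """Yield lists of at most `size` items, preserving input order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch


def build_update_payload(docs: List[dict]) -> bytes:
    """Serialize one batch as a JSON array, i.e. one request body per batch."""
    return json.dumps(docs).encode("utf-8")


if __name__ == "__main__":
    docs = ({"id": str(i)} for i in range(2500))
    # Hypothetical batch size; three requests here instead of 2500 separate ones.
    for batch in batched(docs, 1000):
        print(len(batch), len(build_update_payload(batch)))
```

A real ingestion job would also need error handling, retries, and a commit strategy, which this sketch leaves out.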

I just re-read the ref guide page on post tool
<https://solr.apache.org/guide/8_8/post-tool.html>, and there's nothing
there to indicate to the reader that this might not be a good production
level solution. Also I notice a couple of recent Jira issues regarding
handling of corner cases of strange (broken) behavior or content in a web
site's response, giving the impression that that user (who reported both
issues) might be treading a path that will stretch the bounds of what the
post tool can/should be relied upon for.

https://issues.apache.org/jira/browse/SOLR-15381
https://issues.apache.org/jira/browse/SOLR-15370

How do folks feel about adding a warning or info box at the top of the post
tool docs indicating that it is not meant as a production solution, only as
a quick way to test out documents? We might also say something more
concrete like "virtually any use for a corpus containing over a few
thousand documents is a bad idea"... or something like that; suggestions
welcome.

If folks agree then it seems that these two issues are likely to be WONTFIX.

-Gus

-- 
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)

Re: Post Tool

Posted by Gus Heck <gu...@gmail.com>.
Proposed edit: https://github.com/apache/solr/pull/109. I'll leave
expanding/updating the examples with curl to Eric.

On Thu, Apr 29, 2021 at 5:06 PM David Smiley <ds...@apache.org> wrote:

> Documentation needs maintenance long term -- it can say things that
> eventually stop being true, or show snippets that eventually stop working.
> Just keep that in mind.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley

Re: Post Tool

Posted by David Smiley <ds...@apache.org>.
Documentation needs maintenance long term -- it can say things that
eventually stop being true, or show snippets that eventually stop working.
Just keep that in mind.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Thu, Apr 29, 2021 at 4:19 PM Eric Pugh <ep...@opensourceconnections.com>
wrote:

> I’d be interested in picking up the baton on that idea... I could see
> adding both a curl example and a native PowerShell example. In Windows
> PowerShell, curl is an alias for Invoke-WebRequest, so curl commands don’t
> always work as written. I could imagine having three tabs to demonstrate this.

Re: Post Tool

Posted by Eric Pugh <ep...@opensourceconnections.com>.
I’d be interested in picking up the baton on that idea... I could see adding both a curl example and a native PowerShell example. In Windows PowerShell, curl is an alias for Invoke-WebRequest, so curl commands don’t always work as written. I could imagine having three tabs to demonstrate this.


> On Apr 29, 2021, at 3:43 AM, Jan Høydahl <ja...@cominvent.com> wrote:
> 
> Yea, let's add some warnings and keep post tool for demo purposes.
> Perhaps in the tutorial https://solr.apache.org/guide/8_8/solr-tutorial.html we could add cURL examples for indexing the data as well as post.jar (using tabs like we do with v1/v2 api)?
> We can also do a better job suggesting where to look for proper filesystem / web crawlers for those who need that.
> And since SimplePostTool is not a good example of how to integrate with Solr in Java either, we could really use a Solr SDK with code examples of integration best practices and "ready-to-use" snippets, using SolrJ.
> 
> Jan

_______________________
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com | My Free/Busy <http://tinyurl.com/eric-cal>
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such.


Re: Post Tool

Posted by Jan Høydahl <ja...@cominvent.com>.
Yea, let's add some warnings and keep post tool for demo purposes.
Perhaps in the tutorial https://solr.apache.org/guide/8_8/solr-tutorial.html we could add cURL examples for indexing the data as well as post.jar (using tabs like we do with v1/v2 api)?
We can also do a better job suggesting where to look for proper filesystem / web crawlers for those who need that.
And since SimplePostTool is not a good example of how to integrate with Solr in Java either, we could really use a Solr SDK with code examples of integration best practices and "ready-to-use" snippets, using SolrJ.

Jan
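
As a rough idea of what a plain HTTP indexing example involves, here is a minimal Python sketch that only builds the request and does not send it. The base URL, the collection name `techproducts`, and the helper name `build_index_request` are assumptions for illustration; the sketch relies on Solr's JSON update handler accepting a JSON array of documents at `/update`.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_index_request(base_url, collection, docs, commit=True):
    """Build (but do not send) a POST to a collection's /update handler."""
    query = urlencode({"commit": "true" if commit else "false"})
    url = f"{base_url.rstrip('/')}/{collection}/update?{query}"
    body = json.dumps(docs).encode("utf-8")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Assumed local Solr URL and collection name; adjust for a real install.
    req = build_index_request(
        "http://localhost:8983/solr",
        "techproducts",
        [{"id": "doc-1", "title": "Hello Solr"}],
    )
    print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would actually send it. Unlike the post tool, nothing here is implicit: the endpoint, the content type, and the commit are all spelled out by the caller.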



Re: Post Tool

Posted by Ishan Chattopadhyaya <ic...@gmail.com>.
Beginners should experience as little black magic as possible. Post tool is
black magic. Schemaless is black magic. I feel we should remove both.

On Thu, 29 Apr, 2021, 2:56 am Alexandre Rafalovitch, <ar...@gmail.com>
wrote:

> "Good enough/Recommended" for what? Serious question.
>
> Because it may be - more than - good enough to "send files to the
> server", but the post tool is also doing a lot of Solr business logic
> that beginner users may not have understood yet. Like automatic
> commit. Like choosing endpoint and content type based on the file
> extension. Like actually saying what it is doing. Beginners may not
> have the bandwidth to understand all those elements in order to index
> their second document (first document being the tutorial one
> copy/paste here).
>
> Removing a post tool because curl is good enough - in my personal view
> - is abandoning beginners. Unless, that "for what" is clear and the
> gap between curl and post tool is filled in some other ways, through
> better documentation or improved API or whatever.
>
> On the original question, I think the post tool is like DIH and like
> the default schema, people stick to them and push their boundaries
> because our beginner->production story is full of gaps. What to do
> about it though, I am not sure. A suggested warning seems like a
> reasonable non-harmful suggestion, though.
>
> Regards,
>    Alex.
>
> On Wed, 28 Apr 2021 at 17:04, Ishan Chattopadhyaya
> <ic...@gmail.com> wrote:
> >
> > We should remove the post tool altogether. Curl is good enough and
> > recommended.
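
To make Alex's point about "choosing endpoint and content type based on the file extension" concrete, here is a small sketch of that kind of decision. The mapping and the function name `choose_target` are hypothetical and are not copied from the post tool's source; `/update/extract` is assumed to be the extraction handler used for rich documents.

```python
from pathlib import Path

# Hypothetical mapping for illustration only; not the post tool's actual table.
CONTENT_TYPES = {
    ".json": "application/json",
    ".xml": "application/xml",
    ".csv": "text/csv",
}


def choose_target(filename):
    """Return an (endpoint, content_type) pair based on the file extension."""
    suffix = Path(filename).suffix.lower()
    content_type = CONTENT_TYPES.get(suffix)
    if content_type is not None:
        # Structured formats can go straight to the update handler.
        return "/update", content_type
    # Anything else is assumed to need text extraction first.
    return "/update/extract", "application/octet-stream"


if __name__ == "__main__":
    for name in ("books.json", "films.csv", "report.pdf"):
        print(name, choose_target(name))
```

With curl, a beginner has to make each of these choices by hand, which is the gap being described.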

Re: Post Tool

Posted by Gus Heck <gu...@gmail.com>.
I think in previous discussions a "*" schema (all text or all string) was
advanced as a likely replacement for schemaless. It wouldn't bother newbies
with any errors aside from duplicate IDs, and it would be 100% predictable.
Adjustments to the schema from there would all be "whatever the user needed"
rather than foisting anything they don't need on them. If it starts out all
string rather than all text_general, that would give example 2 a great
lead-in to text analysis, I think... going from exact matching to token matching.
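
The difference between exact matching and token matching can be sketched in a few lines. The tokenizer below is a deliberately simplified stand-in (lowercase, split on non-word characters), not the actual text_general analysis chain, and the function names are hypothetical.

```python
import re


def exact_match(field_value, query):
    """String-style matching: the whole value must equal the query."""
    return field_value == query


def tokenize(text):
    """Simplified analysis: lowercase, then split on non-word characters."""
    return re.findall(r"\w+", text.lower())


def token_match(field_value, query):
    """Text-style matching: every query token must occur in the field."""
    field_tokens = set(tokenize(field_value))
    return all(token in field_tokens for token in tokenize(query))


if __name__ == "__main__":
    title = "Apache Solr Reference Guide"
    print(exact_match(title, "solr"))   # False
    print(token_match(title, "solr"))   # True
```

Starting from the first behavior and then showing why the second is usually what people want is the lead-in to text analysis mentioned above.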

On Wed, Apr 28, 2021 at 6:39 PM Timothy Potter <th...@gmail.com> wrote:

> There is some black magic in schemaless for sure. An interesting thing
> about it though ... it's not just doing field guessing and dynamic
> schema mutation, it's also doing some field name normalization
> (removing whitespace), ID injection (if needed), and locale-aware
> parsing of incoming data (which of course it needs to do to be
> effective at guessing). I've had to grapple with this in the Schema
> Designer backend in that I want the field name normalization and
> locale-aware parsing, but I don't need the last stage in the URP Chain
> (mutating the schema) since I have sample docs and a chance to review
> the suggested fields based on those docs before creating the
> ConfigSet.
>
> For 9.0, maybe the default config has three options: 1) no
> transformations on the input data at all; the user must provide input
> data that matches the schema w/o any transformations, 2) basic
> transformations and locale-aware parsing, but no schema mutations, and
> 3) schema mutations (aka schemaless). I'd argue that option #1 is too
> restrictive and makes Solr harder to get started with; I think option
> #2 is useful but could be trappy in that it's not clear (esp. to
> beginners) that these simple transformations are happening. I'm all
> for making option #3 (mutations) an opt-in vs. opt-out as it is now
> esp. now that we'll soon have a Schema Designer in the UI.

Re: Post Tool

Posted by Timothy Potter <th...@gmail.com>.
There is some black magic in schemaless for sure. An interesting thing
about it though ... it's not just doing field guessing and dynamic
schema mutation, it's also doing some field name normalization
(removing whitespace), ID injection (if needed), and locale-aware
parsing of incoming data (which of course it needs to do to be
effective at guessing). I've had to grapple with this in the Schema
Designer backend in that I want the field name normalization and
locale-aware parsing, but I don't need the last stage in the URP Chain
(mutating the schema) since I have sample docs and a chance to review
the suggested fields based on those docs before creating the
ConfigSet.

For 9.0, maybe the default config has three options: 1) no
transformations on the input data at all; the user must provide input
data that matches the schema w/o any transformations, 2) basic
transformations and locale-aware parsing, but no schema mutations, and
3) schema mutations (aka schemaless). I'd argue that option #1 is too
restrictive and makes Solr harder to get started with; I think option
#2 is useful but could be trappy in that it's not clear (esp. to
beginners) that these simple transformations are happening. I'm all
for making option #3 (mutations) an opt-in vs. opt-out as it is now
esp. now that we'll soon have a Schema Designer in the UI.
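
A rough sketch of what option #2 could look like from the document's point of view: field names are normalized and an ID is injected if missing, but nothing touches the schema. This is an illustration only and not how the URP chain is implemented; replacing whitespace with underscores, the UUID-based ID, and the function names are all assumptions.

```python
import re
import uuid


def normalize_field_name(name):
    """Replace runs of whitespace in a field name with a single underscore."""
    return re.sub(r"\s+", "_", name.strip())


def prepare_document(doc, id_field="id"):
    """Normalize field names and inject an ID if the document has none.

    The schema is never modified here; only the incoming document is.
    """
    prepared = {normalize_field_name(key): value for key, value in doc.items()}
    if id_field not in prepared:
        prepared[id_field] = str(uuid.uuid4())  # assumed ID strategy
    return prepared


if __name__ == "__main__":
    print(prepare_document({"first name": "Ada", "year born": 1815}))
```

The "trappy" part is visible even here: the caller sends `first name` and has to know to query `first_name`.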


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@solr.apache.org
For additional commands, e-mail: dev-help@solr.apache.org


Re: Post Tool

Posted by Timothy Potter <th...@gmail.com>.
I agree with Alex here. We can't overload beginners with a bunch of
jargon and complexity just because experts understand how to use curl
effectively. I also don't think we should remove a feature b/c one
instance of misuse is found in the wild; it sounds like Gus' client was
being lazy. Better docs are welcome of course.

I actually want to integrate the PostTool with the Schema Designer, so
new users (or whoever really) can post a bunch of docs into the temp
Schema Designer staging area and then tune their schema in the UI.
Makes for a nice getting started experience.

Tim



Re: Post Tool

Posted by David Smiley <ds...@apache.org>.
+1 to Alex's sentiment.  I like the analogy with the DIH.

A warning/notice for bin/post is fine.  Maybe something like:

> This tool exists to help beginning Solr users and for rapid prototyping.
> It may work fine in some "production" scenarios but you should probably use
> something else like Curl, SolrJ, or something custom.

BTW, years ago I enhanced bin/solr to properly stream massive files into
Solr without putting the whole thing in RAM.  No matter what options I gave
Curl, Curl put the whole thing in RAM.  Perhaps it still does?  Shrug.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley



Re: Post Tool

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
"Good enough/Recommended" for what? Serious question.

Because it may be - more than - good enough to "send files to the
server", but the post tool is also doing a lot of Solr business logic
that beginner users may not have understood yet. Like automatic
commit. Like choosing endpoint and content type based on the file
extension. Like actually saying what it is doing. Beginners may not
have the bandwidth to understand all those elements in order to index
their second document (first document being the tutorial one
copy/paste here).

Removing a post tool because curl is good enough - in my personal view
- is abandoning beginners. Unless that "for what" is clear and the
gap between curl and post tool is filled in some other ways, through
better documentation or improved API or whatever.

On the original question, I think the post tool is like DIH and like
the default schema, people stick to them and push their boundaries
because our beginner->production story is full of gaps. What to do
about it, though, I am not sure. The proposed warning seems like a
reasonable, non-harmful suggestion.

Regards,
   Alex.



Re: Post Tool

Posted by Dawid Weiss <da...@gmail.com>.
You need an extra step to install curl on Windows. I know it may seem
esoteric, but I bet there is still a fair share of folks who are on Windows.

D.


Re: Post Tool

Posted by Ishan Chattopadhyaya <ic...@gmail.com>.
We should remove the post tool altogether. Curl is good enough and
recommended.
