Posted to dev@lucene.apache.org by Alexandre Rafalovitch <ar...@gmail.com> on 2020/08/31 17:47:13 UTC

SIP-10: Solr 9 examples: Can we use Ref Guide as a dogfood example?

Hi,
I need a sanity check.

I am in the planning stages for the new example datasets to ship with
Solr 9. The one I am looking at is great for structured information,
but is quite light on full-text content. So, I am thinking of how
important that is and what other sources could be used.

One - only slightly - crazy idea is to use the Solr Reference Guide itself
as a document source. I am not saying we need to include the guide
with the Solr distribution, but:
1) I could include a couple of sample pages
2) I could index the whole guide (with custom Java code) during the
final build and we could ship the full index (with stored=false) with
Solr, which then basically becomes a local search for the remote guide
(with absolute URLs).

Either way would allow us to also explore what a good search
configuration could look like for the Ref Guide for when we are
actually ready to move beyond its current "headings-only" JavaScript
search. Actually, done right, the same or a similar tool could also feed
subheadings into the JavaScript search.
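
To make the subheading idea concrete, here is a minimal sketch (not code
from this thread) of pulling headings out of an AsciiDoc page so they could
be fed to a search index. The function name, the sample page and the
example.org URL are all hypothetical; it only assumes the usual AsciiDoc
convention that section titles start with one or more "=" characters and
that "----" delimits listing blocks.

```python
import re

# AsciiDoc section titles: one or more '=' characters, a space, then the title.
HEADING = re.compile(r"^(=+)\s+(.*\S)\s*$")


def extract_headings(adoc_text, page_url):
    """Return one record per heading found in an AsciiDoc page.

    page_url is the absolute URL of the rendered page (hypothetical here).
    """
    records = []
    in_listing = False  # inside a ---- listing block, '=' lines are not titles
    for line in adoc_text.splitlines():
        if line.startswith("----"):
            in_listing = not in_listing
            continue
        if in_listing:
            continue
        match = HEADING.match(line)
        if match:
            records.append({
                "depth": len(match.group(1)),  # number of '=' characters
                "title": match.group(2),
                "url": page_url,
            })
    return records


# Hypothetical sample page, for illustration only.
SAMPLE = """= Faceting
Intro text.

== General Facet Parameters
More text.

----
= not a heading, just code
----

=== The facet Parameter
"""

for record in extract_headings(SAMPLE, "https://example.org/guide/faceting.html"):
    print(record["depth"], record["title"], record["url"])
```

Each record carries only a title and an absolute URL, which matches the
"local search for the remote guide" idea: nothing from the page body needs
to be stored.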

Like I said, sanity check?

Regards,
   Alex.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: SIP-10: Solr 9 examples: Can we use Ref Guide as a dogfood example?

Posted by YOGENDRA SONI <so...@gmail.com>.

Any good text information retrieval dataset may be a good candidate:
https://github.com/harpribot/awesome-information-retrieval#datasets
These datasets also come with benchmarks and sample queries.

On Fri, Sep 4, 2020 at 11:26 AM David Smiley <ds...@apache.org> wrote:

> It's tempting to accomplish two goals at once (tutorial & searchable ref
> guide) but I think the realities of making a *good* searchable ref guide
> may distract someone from learning as it tries to do both well.  A
> searchable ref-guide could very well be its own project that we point
> people learning at who move beyond some of the very early basics.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley

Re: SIP-10: Solr 9 examples: Can we use Ref Guide as a dogfood example?

Posted by David Smiley <ds...@apache.org>.

It's tempting to accomplish two goals at once (tutorial & searchable ref
guide), but I think the realities of making a *good* searchable ref guide
may distract someone from learning if the example tries to do both well.  A
searchable ref guide could very well be its own project that we point
learners to once they move beyond some of the very early basics.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Tue, Sep 1, 2020 at 1:23 PM Alexandre Rafalovitch <ar...@gmail.com>
wrote:

> That Jeopardy set reads very dubious. Content that was collected by
> scraping and available on various sharing sites (including Mega!). I
> would not feel comfortable working with that in our context.
>
> There are other dataset sources. I like the ones that Data is Plural
> newsletter collects: https://tinyletter.com/data-is-plural (full list
> at:
> https://docs.google.com/spreadsheets/d/1wZhPLMCHKJvwOkP4juclhjFgqIY8fQFMemwKL2c64vk/edit#gid=0
> ). Again, copyright is important and I think having a local copy is
> important too, for at least tutorial purposes.
>
> But I wish we could figure out a way to include the RefGuide. It is
> just so much more triple-bottom line solution than just random other
> dataset. We could do a graph of cross-references in the guide, figure
> out how to extract java path references, etc.
>
> Anyway, it is not something that is super-urgent. I don't even know
> whether our new build processes can be augmented to do this. I guess
> it is a bit similar to how we run tests.
>
> I just wanted to get a strong yay/nay on the idea. So far it feels
> like I got one strong yay, one caution and one soft nay.
>
> Regards,
>    Alex.

Re: SIP-10: Solr 9 examples: Can we use Ref Guide as a dogfood example?

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

That Jeopardy set looks very dubious: the content was collected by
scraping and is available on various sharing sites (including Mega!). I
would not feel comfortable working with that in our context.

There are other dataset sources. I like the ones that the Data is Plural
newsletter collects: https://tinyletter.com/data-is-plural (full list
at: https://docs.google.com/spreadsheets/d/1wZhPLMCHKJvwOkP4juclhjFgqIY8fQFMemwKL2c64vk/edit#gid=0
). Again, copyright is important and I think having a local copy is
important too, for at least tutorial purposes.

But I wish we could figure out a way to include the RefGuide. It is
so much more of a triple-bottom-line solution than just some random other
dataset. We could do a graph of cross-references in the guide, figure
out how to extract Java path references, etc.
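
As a rough illustration of the cross-reference graph idea (a sketch only,
not an existing tool), the snippet below scans AsciiDoc sources for the two
common inter-document reference forms, <<page.adoc#anchor,Text>> and
xref:page.adoc#anchor[Text], and builds a page-to-pages adjacency map. The
page names and contents are made up, and the regular expression is an
assumption that covers only these simple forms.

```python
import re
from collections import defaultdict

# Two common AsciiDoc inter-document reference forms:
#   <<other-page.adoc#anchor,Link text>>
#   xref:other-page.adoc#anchor[Link text]
XREF = re.compile(
    r"<<([\w./-]+)\.adoc(?:#[^,>]*)?(?:,[^>]*)?>>"
    r"|xref:([\w./-]+)\.adoc(?:#[^\[]*)?\["
)


def build_xref_graph(pages):
    """Map each page name to the sorted list of other pages it links to.

    pages maps a page name (without .adoc) to its AsciiDoc source text.
    """
    graph = defaultdict(set)
    for name, text in pages.items():
        for match in XREF.finditer(text):
            target = match.group(1) or match.group(2)
            if target != name:  # ignore links within the same page
                graph[name].add(target)
    return {page: sorted(targets) for page, targets in graph.items()}


# Hypothetical pages, for illustration only.
PAGES = {
    "searching": "See <<faceting.adoc#faceting,Faceting>> and "
                 "xref:highlighting.adoc#usage[Highlighting].",
    "faceting": "Back to <<searching.adoc#searching,Searching>>. "
                "Also <<faceting.adoc#general,this page>>.",
    "highlighting": "No links here.",
}

for page, targets in build_xref_graph(PAGES).items():
    print(page, "->", ", ".join(targets))
```

The resulting map could be indexed as a multi-valued field per page, which
would give the example something graph-like to query.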

Anyway, it is not something that is super-urgent. I don't even know
whether our new build processes can be augmented to do this. I guess
it is a bit similar to how we run tests.

I just wanted to get a strong yay/nay on the idea. So far it feels
like I got one strong yay, one caution and one soft nay.

Regards,
   Alex.



On Tue, 1 Sep 2020 at 12:28, Jan Høydahl <ja...@cominvent.com> wrote:
>
> What about 200.000 Jeopardy questions in JSON format?
> https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/
> I downloaded the file in a few seconds, and it also has some structured content, e.g.
>
>   {
>     "category": "NOVELS",
>     "air_date": "2005-01-27",
>     "question": "'Even the epilogue is lengthy in this 1869 Tolstoy epic; it comes out in 2 parts &, in our copy, is 105 pages long'",
>     "value": "$400",
>     "answer": "War and Peace",
>     "round": "Jeopardy!",
>     "show_number": "4699"
>   },
>   {
>     "category": "BRIGHT IDEAS",
>     "air_date": "2005-01-27",
>     "question": "'In 1948 scientists at Bristol-Meyers \"buffered\" this medicine for the first time'",
>     "value": "$400",
>     "answer": "aspirin",
>     "round": "Jeopardy!",
>     "show_number": "4699"
>   },
>
> Lots of docs. Enough free-text to learn some analysis, enough metadata for some meaningful facets / filters…
>
> As long as we only provide a URL and not re-distribute the content, licensing is less of a concern.
>
> Jan


Re: SIP-10: Solr 9 examples: Can we use Ref Guide as a dogfood example?

Posted by Jan Høydahl <ja...@cominvent.com>.

What about 200,000 Jeopardy questions in JSON format?
https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/
I downloaded the file in a few seconds, and it also has some structured content, e.g.

  {
    "category": "NOVELS",
    "air_date": "2005-01-27",
    "question": "'Even the epilogue is lengthy in this 1869 Tolstoy epic; it comes out in 2 parts &, in our copy, is 105 pages long'",
    "value": "$400",
    "answer": "War and Peace",
    "round": "Jeopardy!",
    "show_number": "4699"
  },
  {
    "category": "BRIGHT IDEAS",
    "air_date": "2005-01-27",
    "question": "'In 1948 scientists at Bristol-Meyers \"buffered\" this medicine for the first time'",
    "value": "$400",
    "answer": "aspirin",
    "round": "Jeopardy!",
    "show_number": "4699"
  },

Lots of docs. Enough free-text to learn some analysis, enough metadata for some meaningful facets / filters…
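
For illustration, a small sketch of how records shaped like the two above
might be normalized before indexing, so that the metadata behaves well for
facets and filters. The normalize function is hypothetical, and the handling
of a missing "value" is an assumption about the data, not something checked
against the full file. The date suffix reflects that Solr date fields expect
a full UTC timestamp such as 2005-01-27T00:00:00Z.

```python
import json


def normalize(record):
    """Convert one raw quiz record into typed fields for facets and filters."""
    doc = dict(record)

    # "$400" -> 400 so the value can be range-faceted.
    # Assumption: some records may have no value at all.
    raw_value = record.get("value")
    doc["value"] = int(raw_value.lstrip("$").replace(",", "")) if raw_value else None

    doc["show_number"] = int(record["show_number"])

    # The sample questions are wrapped in single quotes; strip one pair if present.
    question = record["question"]
    if len(question) >= 2 and question[0] == "'" and question[-1] == "'":
        question = question[1:-1]
    doc["question"] = question

    # Solr date fields expect a full ISO 8601 instant in UTC.
    doc["air_date"] = record["air_date"] + "T00:00:00Z"
    return doc


# First sample record from the message above.
SAMPLE = {
    "category": "NOVELS",
    "air_date": "2005-01-27",
    "question": "'Even the epilogue is lengthy in this 1869 Tolstoy epic; "
                "it comes out in 2 parts &, in our copy, is 105 pages long'",
    "value": "$400",
    "answer": "War and Peace",
    "round": "Jeopardy!",
    "show_number": "4699",
}

print(json.dumps(normalize(SAMPLE), indent=2))
```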

As long as we only provide a URL and not re-distribute the content, licensing is less of a concern.

Jan

> 1. sep. 2020 kl. 15:59 skrev Alexandre Rafalovitch <ar...@gmail.com>:
> 
> I've thought of providing instructions. But for good indexing, we
> should use the adoc format as the source, rather than html (as Cassandra's
> presentation showed), which means the user would need to pull in build
> dependencies to get the asciidoctor library. Then there is getting the
> content: either git clone or download the whole source, unpack it and
> figure out the directory locations. It feels messy. At that point, it may
> as well be an external package or even an independent project, and it
> would therefore lose its value as shipped tutorial material.
> 
> We could also discuss actually shipping the Solr Reference Guide with
> Solr now that the release cycles align, but that would actually not
> help my sub-project too much, again because of adoc vs. html formats.
> 
> In terms of other datasets:
> *) I could just stay with limited full-text in the one I am thinking
> of. The bulk download mode allows for fields such as Occupation,
> Company and Vehicle model which are 2-7 words long. That's about the
> same length as current examples we ship. It does not allow for a
> meaningful discussion about longer-text issues such as
> length-normalization, but we don't have those now anyway.
> *) I could use a public domain book and break it into parts. From
> somewhere like https://standardebooks.org/ . But there is a question
> about licensing and also whether we will be able to show interesting
> effects with that.
> *) I was also told that there is Wikipedia, but again, would we just
> include a couple of articles at random? What's the license?
> *) It is possible to index Stack Overflow questions, either from the
> feed (DIH was doing that) or as a download. I think the license was
> compatible.
> *) I could augment the dataset with some mix of the above, like a
> "favourite quote" field with random book sentences. This feels like
> fun, but possibly a whole separate project of its own.
> 
> Anyway, I am open to further thoughts. It is quite likely I missed something.
> 
> Regards,
>   Alex.
> 
> On Tue, 1 Sep 2020 at 03:10, Jan Høydahl <ja...@cominvent.com> wrote:
>> 
>> I’d rather ship a tutorial and tooling that explains how to index the ref-guide than ship a binary index.
>> What other full-text datasets have you considered as candidates for getting-started examples?
>> 
>> Jan
>> 
>> 1. sep. 2020 kl. 05:53 skrev Alexandre Rafalovitch <ar...@gmail.com>:
>> 
>> I did not say it was trivial, but I also did not quite mention the previous research.
>> 
>> https://github.com/arafalov/solr-refguide-indexing/blob/master/src/com/solrstart/refguide/Indexer.java
>> 
>> It uses the official AsciidoctorJ library directly. Not sure if that's just the JRuby version of Asciidoctor we currently use to build. But this should only affect the development process, not the final built package.
>> 
>> I think I am more trying to figure out what people think about shipping an actual core with the distribution. That is something I haven't seen done before, and it may have issues I did not think of.
>> 
>> Regards,
>>    Alex
>> 
>> On Mon., Aug. 31, 2020, 10:11 p.m. Gus Heck, <gu...@gmail.com> wrote:
>>> 
>>> Some background to consider before committing to that... it might not be as trivial as you think. (I've often thought it ironic that we don't have real search for our ref guide... )
>>> 
>>> https://www.youtube.com/watch?v=DixlnxAk08s
>>> 
>>> -Gus
>>> 
>>> On Mon, Aug 31, 2020 at 2:06 PM Ishan Chattopadhyaya <ic...@gmail.com> wrote:
>>>> 
>>>> I love the idea of making the ref guide itself as an example dataset. That way, we won't need to ship anything separately. Python's beautiful soup can extract text from the html pages. I'm sure there maybe such things in Java too (can Tika do this?).
>>>> 
>>>> On Mon, 31 Aug, 2020, 11:18 pm Alexandre Rafalovitch, <ar...@gmail.com> wrote:
>>>>> 
>>>>> Hi,
>>>>> I need a sanity check.
>>>>> 
>>>>> I am in the planning stages for the new example datasets to ship with
>>>>> Solr 9. The one I am looking at is great for structured information,
>>>>> but is quite light on full-text content. So, I am thinking of how
>>>>> important that is and what other sources could be used.
>>>>> 
>>>>> One - only slightly - crazy idea is to use Solr Reference Guide itself
>>>>> as a document source. I am not saying we need to include the guide
>>>>> with Solr distribution, but:
>>>>> 1) I could include a couple of sample pages
>>>>> 2) I could index the whole guide (with custom Java-code) during the
>>>>> final build and we could ship the full index (with stored=false) with
>>>>> Solr, which then basically becomes a local search for the remote guide
>>>>> (with absolute URLs).
>>>>> 
>>>>> Either way would allow us to also explore what a good search
>>>>> configuration could look like for the Ref Guide for when we are
>>>>> actually ready to move beyond its current "headings-only" javascript
>>>>> search. Actually, done right, same/similar tool could also feed
>>>>> subheadings into the javascript search.
>>>>> 
>>>>> Like I said, sanity check?
>>>>> 
>>>>> Regards,
>>>>>   Alex.
>>>>> 
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>> 
>>> 
>>> 
>>> --
>>> http://www.needhamsoftware.com (work)
>>> http://www.the111shift.com (play)
>> 
>> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

Re: SIP-10: Solr 9 examples: Can we use Ref Guide as a dogfood example?

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

I've thought of providing instructions. But for good indexing, we
should use the adoc format as the source rather than html (as
Cassandra's presentation showed), which means the user would have to
pull in build dependencies to get the asciidoctor library. They would
also need a way to get the content: either git clone or download the
whole source, unpack it, and figure out the directory locations. It
feels messy. At that point, it may as well be an external package or
even an independent project, and it would therefore lose its value as
shipped tutorial material.
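
To make the adoc-versus-html point concrete, here is a minimal sketch of
why the source format is attractive: section titles are explicit in
AsciiDoc, so a page can be cut into per-section documents with very
little code. This is plain Python with a regular expression, not the
AsciidoctorJ-based approach, and it assumes headings of the form
"= Title" / "== Section" and listing blocks delimited by "----". The
function name and sample text are made up for illustration.

```python
import re

# Matches AsciiDoc section titles such as "= Title" or "== Section".
HEADING = re.compile(r"^(={1,6})\s+(\S.*)$")

def split_adoc_sections(text):
    """Split AsciiDoc source into (level, title, body) tuples by heading."""
    sections = []
    level, title, body = 0, None, []
    in_block = False  # inside a ---- listing block, '=' lines are not headings
    for line in text.splitlines():
        if line.strip() == "----":
            in_block = not in_block
        match = None if in_block else HEADING.match(line)
        if match:
            # Flush the section collected so far before starting a new one.
            if title is not None or any(l.strip() for l in body):
                sections.append((level, title, "\n".join(body).strip()))
            level, title, body = len(match.group(1)), match.group(2).strip(), []
        else:
            body.append(line)
    if title is not None or any(l.strip() for l in body):
        sections.append((level, title, "\n".join(body).strip()))
    return sections

# Hypothetical sample page, not taken from the real Ref Guide.
sample = """= Analyzers
Intro text.

== Tokenizers
Tokenizers break text into tokens.

----
== not a heading
----
"""
for level, title, body in split_adoc_sections(sample):
    print(level, title, len(body))
```

A real indexer would still need to handle includes, attributes and
macros, which is where a proper Asciidoctor library comes in.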

We could also discuss actually shipping the Solr Reference Guide with
Solr now that the release cycles align, but that would actually not
help my sub-project too much, again because of adoc vs. html formats.

In terms of other datasets:
*) I could just stay with limited full-text in the one I am thinking
of. The bulk download mode allows for fields such as Occupation,
Company and Vehicle model which are 2-7 words long. That's about the
same length as current examples we ship. It does not allow for a
meaningful discussion about longer-text issues such as
length-normalization, but we don't have those now anyway.
*) I could use a public domain book and break it into parts, from
somewhere like https://standardebooks.org/. But there is a question
about licensing and also whether we will be able to show interesting
effects with that.
*) I was also told that there is Wikipedia, but again, would we just
include a couple of articles at random? What's the license?
*) It is possible to index Stack Overflow questions, either from the
feed (DIH was doing that) or as a download. I think the license was
compatible.
*) I could augment the dataset with some mix of the above, like a
"favourite quote" field with random book sentences. This feels like
fun, but possibly a whole separate project of its own.
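
As a rough illustration of the "break a book into parts" option, the
sketch below groups paragraphs into chunks of bounded size and turns
each chunk into a dictionary that could be posted to Solr as JSON. The
field names (book_s, part_i, text_t) are hypothetical and only assume
the usual dynamic-field suffixes; the word limit and sample text are
likewise made up.

```python
import json

def chunk_paragraphs(text, max_words=200):
    """Group consecutive paragraphs into chunks of at most max_words words.

    A single paragraph longer than max_words is kept whole as its own chunk.
    """
    chunks, current, count = [], [], 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        words = len(para.split())
        # Start a new chunk when adding this paragraph would exceed the limit.
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks

def to_docs(book_id, text, max_words=200):
    """Turn chunks into dictionaries with hypothetical field names."""
    return [
        {"id": f"{book_id}-{i}", "book_s": book_id, "part_i": i, "text_t": chunk}
        for i, chunk in enumerate(chunk_paragraphs(text, max_words), start=1)
    ]

# Hypothetical sample text standing in for a public domain book.
book = "First paragraph of the book.\n\nSecond paragraph here.\n\nThird one."
print(json.dumps(to_docs("sample-book", book, max_words=8), indent=2))
```

Varying max_words would give parts of different lengths, which is one
way such a dataset could be used to discuss length normalization.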

Anyway, I am open to further thoughts. It is quite likely I missed something.

Regards,
   Alex.

On Tue, 1 Sep 2020 at 03:10, Jan Høydahl <ja...@cominvent.com> wrote:
>
> I’d rather ship a tutorial and tooling that explains how to index the ref-guide, than shipping a binary index.
> What other full-text datasets have you considered as candidates for getting-started examples?
>
> Jan

Re: SIP-10: Solr 9 examples: Can we use Ref Guide as a dogfood example?

Posted by Jan Høydahl <ja...@cominvent.com>.

I’d rather ship a tutorial and tooling that explains how to index the ref-guide than ship a binary index.
What other full-text datasets have you considered as candidates for getting-started examples?

Jan

> On 1 Sep 2020, at 05:53, Alexandre Rafalovitch <ar...@gmail.com> wrote:
> 
> I did not say it was trivial, but I also did not quite mention the previous research.
> 
> https://github.com/arafalov/solr-refguide-indexing/blob/master/src/com/solrstart/refguide/Indexer.java
> 
> Uses official AsciidoctorJ library directory. Not sure if that's just JRuby version of Asciidoctor we currently use to build. But this should only affect the development process, not the final built package. 
> 
> I think I am more trying to figure out what people think about shipping an actual core with the distribution. That is something I haven't seen done before. And may have issues I did not think of. 
> 
> Regards, 
>     Alex

Re: SIP-10: Solr 9 examples: Can we use Ref Guide as a dogfood example?

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

I did not say it was trivial, but I also did not quite mention the previous
research.

https://github.com/arafalov/solr-refguide-indexing/blob/master/src/com/solrstart/refguide/Indexer.java

It uses the official AsciidoctorJ library directly. Not sure if that's
just the JRuby version of the Asciidoctor we currently use to build. But
this should only affect the development process, not the final built
package.

I think I am more trying to figure out what people think about shipping an
actual core with the distribution. That is something I haven't seen
done before, and it may have issues I did not think of.

Regards,
    Alex

On Mon., Aug. 31, 2020, 10:11 p.m. Gus Heck, <gu...@gmail.com> wrote:

> Some background to consider before committing to that... it might not be
> as trivial as you think. (I've often thought it ironic that we don't have
> real search for our ref guide... )
>
> https://www.youtube.com/watch?v=DixlnxAk08s
>
> -Gus

Re: SIP-10: Solr 9 examples: Can we use Ref Guide as a dogfood example?

Posted by Gus Heck <gu...@gmail.com>.

Some background to consider before committing to that... it might not be as
trivial as you think. (I've often thought it ironic that we don't have real
search for our ref guide... )

https://www.youtube.com/watch?v=DixlnxAk08s

-Gus

On Mon, Aug 31, 2020 at 2:06 PM Ishan Chattopadhyaya <
ichattopadhyaya@gmail.com> wrote:

> I love the idea of making the ref guide itself as an example dataset. That
> way, we won't need to ship anything separately. Python's beautiful soup can
> extract text from the html pages. I'm sure there maybe such things in Java
> too (can Tika do this?).

-- 
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)

Re: SIP-10: Solr 9 examples: Can we use Ref Guide as a dogfood example?

Posted by Ishan Chattopadhyaya <ic...@gmail.com>.

I love the idea of making the ref guide itself an example dataset. That
way, we won't need to ship anything separately. Python's Beautiful Soup can
extract text from the HTML pages. I'm sure there may be such things in Java
too (can Tika do this?).
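
For reference, a minimal sketch of the kind of HTML text extraction
meant here. To stay self-contained it uses only html.parser from the
Python standard library rather than Beautiful Soup or Tika, and the
sample page is made up; a real Ref Guide page would also need
navigation and sidebar markup filtered out.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

# Hypothetical page, not an actual Ref Guide file.
page = ("<html><head><style>p{}</style></head><body>"
        "<h1>Analyzers</h1><p>Text <b>analysis</b> basics.</p></body></html>")
print(html_to_text(page))
```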

On Mon, 31 Aug, 2020, 11:18 pm Alexandre Rafalovitch, <ar...@gmail.com>
wrote:

> Hi,
> I need a sanity check.
>
> I am in the planning stages for the new example datasets to ship with
> Solr 9. The one I am looking at is great for structured information,
> but is quite light on full-text content. So, I am thinking of how
> important that is and what other sources could be used.
>
> One - only slightly - crazy idea is to use Solr Reference Guide itself
> as a document source. I am not saying we need to include the guide
> with Solr distribution, but:
> 1) I could include a couple of sample pages
> 2) I could index the whole guide (with custom Java-code) during the
> final build and we could ship the full index (with stored=false) with
> Solr, which then basically becomes a local search for the remote guide
> (with absolute URLs).
>
> Either way would allow us to also explore what a good search
> configuration could look like for the Ref Guide for when we are
> actually ready to move beyond its current "headings-only" javascript
> search. Actually, done right, same/similar tool could also feed
> subheadings into the javascript search.
>
> Like I said, sanity check?
>
> Regards,
>    Alex.