
Posted to dev@lucenenet.apache.org by Philippe Alexis <ph...@gmail.com> on 2021/12/29 19:53:47 UTC

A question out of frustration about the documentation

Hello everyone,

Allow me to get straight to the point: is the 'free' documentation
specifically written in a way so as *not to cannibalize* the 'commercial'
documentation?

I can't believe how (physically!) painfully obvious it is that the ///
<summary> headers in the code go to great lengths to avoid giving anything
away about what's the difference between a PhraseQuery and a
MultiPhraseQuery, especially when one finally comes across (an extract
from) the cookbook on the matter.

We're talking, All Hail to Lucene, capable of full-text searching through
millions of documents within mere milliseconds... and to 'showcase' this
power, the front page sample code on the website shows the addition of ...
*one* "The quick brown fox jumps over the lazy dog" document. One. Document.

And that writer.Flush(triggerMerge: false, applyAllDeletes: false);
why is it there? You could have at least added two documents, and hinted
right away that only one Flush is needed, if at all.

Here I was, new to Lucene, looping through my painstakingly constructed
100,000-line dictionary to create documents... only to find that the
supposedly 'blazingly' fast Lucene took *4 minutes* to index a tenth of
that. (Make of my feedback what you will, I'm just being honest about my
genuinely 'blind' playthrough of Lucene.NET.)
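
For anyone hitting the same wall: calling Flush or Commit once per document is a common cause of slow bulk indexing, though whether that applied here is not stated in this thread. Below is a minimal sketch of a batch-indexing loop with a single commit at the end, assuming the Lucene.NET 4.8 beta packages (Lucene.Net and Lucene.Net.Analysis.Common). The class name BulkIndexDemo, the method IndexLines, and the field names "id" and "body" are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;

public static class BulkIndexDemo
{
    private const LuceneVersion MatchVersion = LuceneVersion.LUCENE_48;

    // Indexes every line as its own document and returns the number of
    // documents visible to a reader opened after the single commit.
    // (Hypothetical helper, not part of Lucene.NET.)
    public static int IndexLines(IEnumerable<string> lines)
    {
        using var dir = new RAMDirectory(); // in-memory, for the sketch only
        using var analyzer = new StandardAnalyzer(MatchVersion);
        var config = new IndexWriterConfig(MatchVersion, analyzer);

        using (var writer = new IndexWriter(dir, config))
        {
            int id = 0;
            foreach (string line in lines)
            {
                var doc = new Document
                {
                    new StringField("id", (id++).ToString(), Field.Store.YES),
                    new TextField("body", line, Field.Store.YES)
                };
                writer.AddDocument(doc); // no Flush or Commit inside the loop
            }
            writer.Commit(); // one commit after the whole batch
        }

        using var reader = DirectoryReader.Open(dir);
        return reader.NumDocs;
    }

    public static void Main()
    {
        var lines = new[] { "first entry", "second entry", "third entry" };
        Console.WriteLine(IndexLines(lines));
    }
}
```

On a real index, RAMDirectory would be swapped for an on-disk directory; the point of the sketch is only that the writer is created once and committed once for the whole batch.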

Now, we're developers, and we're used to coming across, within a couple of
clicks in the documentation, an example usage that tells us just about
everything we need to know to get our work done... Buried somewhere in the
docs is the statement, *"The OR operator is the default conjunction operator"*

Excellent! *Something Google got right since 1998*. Now... where is the
example of what to instantiate, and what to call to pull the search results
out of it, to see this in action?
................ Silence.

Somehow, I clench my teeth, and make my way to QueryParser ->
MultiFieldQueryParser... not quite... then to StandardQueryParser...
maybe?? And I get this 'usage' :

const LuceneVersion matchVersion = LuceneVersion.LUCENE_48;
StandardQueryParser qpHelper = new StandardQueryParser();
QueryConfigHandler config = qpHelper.QueryConfigHandler;
config.Set(ConfigurationKeys.ALLOW_LEADING_WILDCARD, true);
config.Set(ConfigurationKeys.ANALYZER, new WhitespaceAnalyzer(matchVersion));
Query query = qpHelper.Parse("apache AND lucene", "defaultField");


Seriously! The 'usage' simply stops unashamedly short of actually telling
me how to do what this is supposed to be all about: getting the results out
of the Query object. Surely, I'm being trolled, or what?
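
For reference, here is a hedged sketch of the missing last step: handing the parsed Query to an IndexSearcher and reading the hits back out of TopDocs. It assumes the Lucene.NET 4.8 beta packages (Lucene.Net, Lucene.Net.Analysis.Common, Lucene.Net.QueryParser). The three sample documents, the class name SearchDemo, and the helper Search are invented for illustration, and the analyzer is passed to the StandardQueryParser constructor rather than through the config handler shown above.

```csharp
using System;
using System.Collections.Generic;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers.Flexible.Standard;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

public static class SearchDemo
{
    private const LuceneVersion MatchVersion = LuceneVersion.LUCENE_48;
    private const string FieldName = "defaultField";

    // Builds a tiny in-memory index, runs the query text against it, and
    // returns the stored text of each hit. (Hypothetical helper.)
    public static List<string> Search(string queryText)
    {
        using var dir = new RAMDirectory();
        using var analyzer = new WhitespaceAnalyzer(MatchVersion);

        using (var writer = new IndexWriter(dir, new IndexWriterConfig(MatchVersion, analyzer)))
        {
            foreach (string text in new[] { "apache lucene", "apache httpd", "lucene core" })
            {
                writer.AddDocument(new Document
                {
                    new TextField(FieldName, text, Field.Store.YES)
                });
            }
            writer.Commit();
        }

        // Parse the query with the same analyzer that built the index.
        var parser = new StandardQueryParser(analyzer);
        Query query = parser.Parse(queryText, FieldName);

        // The step the 'usage' leaves out: search, then load each hit.
        using var reader = DirectoryReader.Open(dir);
        var searcher = new IndexSearcher(reader);
        TopDocs topDocs = searcher.Search(query, 10); // top 10 hits

        var results = new List<string>();
        foreach (ScoreDoc scoreDoc in topDocs.ScoreDocs)
        {
            Document hit = searcher.Doc(scoreDoc.Doc);
            results.Add(hit.Get(FieldName));
        }
        return results;
    }

    public static void Main()
    {
        foreach (string text in Search("apache AND lucene"))
        {
            Console.WriteLine(text);
        }
    }
}
```

With these sample documents, "apache AND lucene" should match only the first one, while "apache lucene" (no explicit operator, so the default OR applies) should match all three.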

*I know I am not owed anything.* Still, my first impression from this
experience is what it is... the docs seem to try so hard to not give the
'good stuff' away that I feel exhausted from just the thought of this being
about keeping a marketable money-making angle.

Re: A question out of frustration about the documentation

Posted by Philippe Alexis <ph...@gmail.com>.

Thanks for this detailed response, Shad. I've stuck at it and have managed a
PoC where a 10-second full-text search on SQL Server is cut down to under 3
seconds with Lucene + SQL in tandem.

The last hurdle that I needed to overcome might be of interest to the
list, *if it's not by design*:
On the latest 4.8.0 beta, StandardQueryParser + StandardAnalyzer were able
to get a good number of hits (e.g. for "caisse OR imprimante") on an index
of French text, but weren't handling articles with 'elision' very well.

I switched to FrenchAnalyzer... "caisse OR imprimante" now returned no hits
after the index had been rebuilt. Since 'niveau' still returned hits, I was
able to confirm that 'caisse' was present in the index.
It took some switching things around to find out that, for some reason,
MultiFieldQueryParser + FrenchAnalyzer produced much better results than
StandardQueryParser + FrenchAnalyzer.

*Is that an expected behaviour?* I'm new to this, so I had not known to pay
particular attention when pairing parser and analyzer.
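
One thing worth checking in a setup like this is that the query parser is given the same analyzer that built the index, so that query terms go through the same elision and stemming as the indexed text. Whether a mismatch explains the difference described above is only a guess; the thread does not show the actual code. The sketch below assumes the Lucene.NET 4.8 beta packages (Lucene.Net, Lucene.Net.Analysis.Common, Lucene.Net.QueryParser); the class name FrenchSearchDemo, the helper CountHits, the field name "body", and the two French sentences are invented for illustration.

```csharp
using System;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Fr;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers.Flexible.Standard;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

public static class FrenchSearchDemo
{
    private const LuceneVersion MatchVersion = LuceneVersion.LUCENE_48;
    private const string FieldName = "body";

    // Indexes two French sentences with FrenchAnalyzer and returns the
    // number of hits for the given query text. (Hypothetical helper.)
    public static int CountHits(string queryText)
    {
        using var dir = new RAMDirectory();
        using Analyzer analyzer = new FrenchAnalyzer(MatchVersion);

        using (var writer = new IndexWriter(dir, new IndexWriterConfig(MatchVersion, analyzer)))
        {
            foreach (string text in new[]
            {
                "La caisse et l'imprimante sont en panne",
                "Le niveau d'encre est bas"
            })
            {
                writer.AddDocument(new Document
                {
                    new TextField(FieldName, text, Field.Store.YES)
                });
            }
            writer.Commit();
        }

        // Same analyzer instance for parsing as for indexing.
        var parser = new StandardQueryParser(analyzer);
        Query query = parser.Parse(queryText, FieldName);

        using var reader = DirectoryReader.Open(dir);
        var searcher = new IndexSearcher(reader);
        return searcher.Search(query, 10).TotalHits;
    }

    public static void Main()
    {
        Console.WriteLine(CountHits("caisse OR imprimante"));
    }
}
```

If the hit count changes when the parser is built without the analyzer, that would point at the pairing rather than at the index itself.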

Jean Philippe.


On Thu, Dec 30, 2021 at 03:11, Shad Storhaug <sh...@shadstorhaug.com>
wrote:

> Hi Philippe,
>
> Thanks for the feedback.
>
> Unfortunately, you are preaching to the choir. We have faithfully ported
> all the documentation that Lucene (written in Java) provided along with the
> Lucene modules. In fact, we have added a bit more documentation than what
> they provided us with to cover some .NET-specific scenarios.
>
> https://lucene.apache.org/core/4_8_0/
>
> We didn't write any books, either. We are just trying to be as helpful as
> possible to make sure users can find whatever information about Lucene 4.x
> is available, and unfortunately the best documentation is in books (the one
> that seems to be most recommended is "Lucene in Action").
>
> Other users have made us aware of some of the shortcomings of the demos
> and documentation, and we have an open issue about it:
>
> https://github.com/apache/lucenenet/issues/457
>
> Lucene.NET 4.8.0 is still under development and has been since September
> of 2014. Most of the work is still going toward the code rather than the
> documentation.
>
> If you wish to *provide* this documentation as XML doc comments as a
> contribution after you figure it out (so other users can benefit from your
> research), we are all for it. Unfortunately, it is just as much work for us
> to research as it is for you.
>
> Based on the documentation headers, I suspect that Java developers are
> fully expected to be analyzing code to determine how Lucene works rather
> than relying on documentation, and I am afraid that applies to the .NET
> version, as well. The best ways to learn (for free) are to analyze the
> tests (which could probably give you a quicker answer than we could), to
> ask questions on StackOverflow (be sure to include both the lucene.net
> and the lucene tags), or to ask the Lucene team (the guys who designed and
> documented Lucene) directly on the Lucene user mailing list.
>
> https://lucene.apache.org/core/discussion.html
>
> Here is an example of the difference in tests between PhraseQuery and
> MultiPhraseQuery:
>
>
> https://github.com/apache/lucenenet/blob/master/src/Lucene.Net.Tests/Search/TestPhraseQuery.cs
>
> https://github.com/apache/lucenenet/blob/master/src/Lucene.Net.Tests/Search/TestMultiPhraseQuery.cs
>
> But I am sincere when I say that we definitely could use some help with
> researching and adding more information to the documentation, fixing broken
> links, and working on some of the other documentation issues that are open
> on GitHub. This is a HUGE project, and documentation is only a small part
> of it, but it is likely the last thing that we will work on after the
> stable 4.8.0 release unless people pitch in to make it better.
>
>
> https://github.com/apache/lucenenet/issues?q=is%3Aissue+is%3Aopen+label%3Adocs
>
> In addition, any bloggers are encouraged to create articles about
> Lucene.NET to share what they learn about it. We have a "community links"
> section on our website where we try to collect these links periodically,
> but unfortunately it is not that easy to find:
>
> https://lucenenet.apache.org/contributing/community-links.html
>
> Hope this helps,
> Shad Storhaug (NightOwl888)
> Project Chairperson - Apache Lucene.NET

RE: A question out of frustration about the documentation

Posted by Shad Storhaug <sh...@shadstorhaug.com>.

Hi Philippe,

Thanks for the feedback.

Unfortunately, you are preaching to the choir. We have faithfully ported all the documentation that Lucene (written in Java) provided along with the Lucene modules. In fact, we have added a bit more documentation than what they provided us with to cover some .NET-specific scenarios.

https://lucene.apache.org/core/4_8_0/

We didn't write any books, either. We are just trying to be as helpful as possible to make sure users can find whatever information about Lucene 4.x is available, and unfortunately the best documentation is in books (the one that seems to be most recommended is "Lucene in Action").

Other users have made us aware of some of the shortcomings of the demos and documentation, and we have an open issue about it:

https://github.com/apache/lucenenet/issues/457

Lucene.NET 4.8.0 is still under development and has been since September of 2014. Most of the work is still going toward the code rather than the documentation.

If you wish to *provide* this documentation as XML doc comments as a contribution after you figure it out (so other users can benefit from your research), we are all for it. Unfortunately, it is just as much work for us to research as it is for you.

Based on the documentation headers, I suspect that Java developers are fully expected to be analyzing code to determine how Lucene works rather than relying on documentation, and I am afraid that applies to the .NET version, as well. The best ways to learn (for free) are to analyze the tests (which could probably give you a quicker answer than we could), to ask questions on StackOverflow (be sure to include both the lucene.net and the lucene tags), or to ask the Lucene team (the guys who designed and documented Lucene) directly on the Lucene user mailing list.

https://lucene.apache.org/core/discussion.html

Here is an example of the difference in tests between PhraseQuery and MultiPhraseQuery:

https://github.com/apache/lucenenet/blob/master/src/Lucene.Net.Tests/Search/TestPhraseQuery.cs
https://github.com/apache/lucenenet/blob/master/src/Lucene.Net.Tests/Search/TestMultiPhraseQuery.cs
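
As a rough illustration of the distinction those tests exercise: a PhraseQuery matches one exact sequence of terms, while a MultiPhraseQuery allows several alternative terms at a given position. The sketch below is not taken from the linked tests; it assumes the Lucene.NET 4.8 beta packages (Lucene.Net and Lucene.Net.Analysis.Common), and the class name PhraseDemo, the helper Run, the field name "body", and the sample sentences are invented for illustration.

```csharp
using System;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

public static class PhraseDemo
{
    private const LuceneVersion MatchVersion = LuceneVersion.LUCENE_48;
    private const string FieldName = "body";

    // Returns the hit counts of a PhraseQuery and a MultiPhraseQuery
    // run against the same three documents. (Hypothetical helper.)
    public static (int PhraseHits, int MultiPhraseHits) Run()
    {
        using var dir = new RAMDirectory();
        using var analyzer = new WhitespaceAnalyzer(MatchVersion);

        using (var writer = new IndexWriter(dir, new IndexWriterConfig(MatchVersion, analyzer)))
        {
            foreach (string text in new[]
            {
                "the quick brown fox",
                "the quick red fox",
                "the slow brown fox"
            })
            {
                writer.AddDocument(new Document
                {
                    new TextField(FieldName, text, Field.Store.NO)
                });
            }
            writer.Commit();
        }

        // Exactly "quick" immediately followed by "brown".
        var phrase = new PhraseQuery();
        phrase.Add(new Term(FieldName, "quick"));
        phrase.Add(new Term(FieldName, "brown"));

        // "quick" immediately followed by either "brown" or "red".
        var multi = new MultiPhraseQuery();
        multi.Add(new Term(FieldName, "quick"));
        multi.Add(new[] { new Term(FieldName, "brown"), new Term(FieldName, "red") });

        using var reader = DirectoryReader.Open(dir);
        var searcher = new IndexSearcher(reader);
        return (searcher.Search(phrase, 10).TotalHits,
                searcher.Search(multi, 10).TotalHits);
    }

    public static void Main()
    {
        var (phraseHits, multiHits) = Run();
        Console.WriteLine($"PhraseQuery: {phraseHits}, MultiPhraseQuery: {multiHits}");
    }
}
```

With these three documents, the PhraseQuery should match only "the quick brown fox", while the MultiPhraseQuery should also match "the quick red fox".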

But I am sincere when I say that we definitely could use some help with researching and adding more information to the documentation, fixing broken links, and working on some of the other documentation issues that are open on GitHub. This is a HUGE project, and documentation is only a small part of it, but it is likely the last thing that we will work on after the stable 4.8.0 release unless people pitch in to make it better.

https://github.com/apache/lucenenet/issues?q=is%3Aissue+is%3Aopen+label%3Adocs

In addition, any bloggers are encouraged to create articles about Lucene.NET to share what they learn about it. We have a "community links" section on our website where we try to collect these links periodically, but unfortunately it is not that easy to find:

https://lucenenet.apache.org/contributing/community-links.html

Hope this helps,
Shad Storhaug (NightOwl888)
Project Chairperson - Apache Lucene.NET

