Posted to dev@lucene.apache.org by Grant Ingersoll <gs...@apache.org> on 2007/05/25 23:10:01 UTC

Documentation Brainstorming

OK, so I have been really frustrated with the documentation of a  
certain other open source project (a rather large and popular one)  
that shall remain nameless, and it got me thinking about Lucene docs  
and how we can improve them.  Mind you, our docs are an order of  
magnitude better than this other project (in its defense it has a lot  
of dependencies and does not have as active a committer base as us).

So, this is an open call for ideas on how we can improve our docs.   
Here are some areas I think need improving:

1. The demo/tutorial needs to be brought into the current Lucene  
century.  See https://issues.apache.org/jira/browse/LUCENE-805.  The most  
important part of this, I think, is the "big picture" overview of why,  
when, and how.

2.  There need to be some docs that explain what analysis is at the  
top level, similar to the current Scoring documentation.

3.  A best practices page on the Wiki would be great.

4.  The index package level javadocs are non-existent.  See
https://issues.apache.org/jira/browse/LUCENE-765

5.  Glossary of terms, etc.

Any others?  Maybe we should start a wiki page that contains a list  
of docs that need to be added/cleaned up and then we can work through  
them a bit at a time?  I know it isn't sexy, but it does bring a lot  
of goodness to the project.

Ideas for how to do this:

1. Before doing a release, we have 1-2 weeks of code freeze, and we  
focus on documentation and cleaning up bugs in JIRA.

2.  Get the Hudson JIRA integration stuff hooked in so we can know if  
patches are good faster, meaning we can turn around documentation  
patches, and others, faster (notwithstanding Otis' great turnaround  
today on documentation patches)

3.  There is a whole lot of knowledge stored in the email archives,  
how can we leverage it?

Anyone else have ideas?  Should we focus more on the wiki or the  
committed docs?  Is there a way we could let users annotate the  
javadocs, wiki style?  I know there are legal issues with user  
contributions that are "released".

Cheers,
Grant



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: Documentation Brainstorming

Posted by Antony Bowesman <ad...@teamware.com>.

Grant Ingersoll wrote:
> Mind you, our docs are an order of magnitude better than 
> this other project 

I agree, Lucene is a very well documented project compared to many.  In general 
and in conjunction with LIA, it's a pretty easy project to get into.

> 3.  There is a whole lot of knowledge stored in the email archives, how 
> can we leverage it?

This is indeed a key point.  HitCollector and surrounding classes are poorly 
documented and there have been many replies to questions which recommend using a 
HitCollector.

The search package is generally well described, apart from what are described as 
'low level API' or 'expert' methods and classes.  I found I needed to get to 
that level to get the best out of Lucene in a framework that sits on top of it.

Performance is another topic which would really benefit from a 'best practice' 
guide.  The dev and user posts concerning performance always get many responses.
Although a challenge to produce, a set of recommendations relating user data
to reader/writer usage would be great, e.g. which maxBufferedDocs,
maxMergeDocs and mergeFactor values to use in a number of different usage
scenarios.  There's no substitute, though, for evaluating that with your own
data.

A definitive statement about 'optimize', when (not) to use it, and what its 
relationship with performance is would also help.  I know there's lots about
it already, but it's dotted all over the place.

Maybe this sort of information would be better in LIA2...
Antony


Re: Documentation Brainstorming

Posted by Doron Cohen <DO...@il.ibm.com>.

Paul Elschot <pa...@xs4all.nl> wrote on 30/05/2007 23:57:47:

> On Thursday 31 May 2007 05:52, Erik Hatcher wrote:
> >
> > On May 30, 2007, at 9:33 PM, Grant Ingersoll wrote:
> > >> I'd rather see each jar get its own javadoc,
> > >> or at the very least, indicate which jar each
> > >> class is defined in for the ones that aren't
> > >> part of the core.
> > >>
> > >
> > > Yeah, I don't like that all the contribs are built in together.
> > > What do others think?  I would vote for separating them out.
> >
> > I concur with having the contrib docs separated.  I may have been the
> > one (or at least assisted with it) who got the documentation build to
> > fold it all together as that was the goal at the time.  It'd be much
> > easier, build-wise, if all artifacts were kept entirely separate for
> > all the various contrib libraries and the core, as well as the demo.
>
>
> Currently it is not clear in the javadocs whether a class belongs
> to core or contrib. Having separate javadocs would probably
> improve that.
> I have no experience in linking between javadoc "packages",
> so I have no suggestion on how to make such a separation.

I am all for separation.
Though it is sometimes useful to have it all together.  Perhaps two
versions: all, and by module (core, contrib/x, contrib/y, etc.)?
Or is this too cluttered - we already have it by release...



Re: Documentation Brainstorming

Posted by Paul Elschot <pa...@xs4all.nl>.

On Thursday 31 May 2007 05:52, Erik Hatcher wrote:
> 
> On May 30, 2007, at 9:33 PM, Grant Ingersoll wrote:
> >> I'd rather see each jar get its own javadoc,
> >> or at the very least, indicate which jar each
> >> class is defined in for the ones that aren't
> >> part of the core.
> >>
> >
> > Yeah, I don't like that all the contribs are built in together.   
> > What do others think?  I would vote for separating them out.
> 
> I concur with having the contrib docs separated.  I may have been the  
> one (or at least assisted with it) who got the documentation build to  
> fold it all together as that was the goal at the time.  It'd be much  
> easier, build-wise, if all artifacts were kept entirely separate for  
> all the various contrib libraries and the core, as well as the demo.


Currently it is not clear in the javadocs whether a class belongs
to core or contrib. Having separate javadocs would probably
improve that.
I have no experience in linking between javadoc "packages",
so I have no suggestion on how to make such a separation.

Regards,
Paul Elschot




Re: Documentation Brainstorming

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On May 30, 2007, at 9:33 PM, Grant Ingersoll wrote:
>> I'd rather see each jar get its own javadoc,
>> or at the very least, indicate which jar each
>> class is defined in for the ones that aren't
>> part of the core.
>>
>
> Yeah, I don't like that all the contribs are built in together.   
> What do others think?  I would vote for separating them out.

I concur with having the contrib docs separated.  I may have been the  
one (or at least assisted with it) who got the documentation build to  
fold it all together as that was the goal at the time.  It'd be much  
easier, build-wise, if all artifacts were kept entirely separate for  
all the various contrib libraries and the core, as well as the demo.

	Erik


Re: Documentation Brainstorming

Posted by Grant Ingersoll <gs...@apache.org>.

I like the suggestion of having two views: a unified view and then  
also a separate view.  Slightly more work to set up, but should  
satisfy both camps.

On May 31, 2007, at 1:16 PM, Doug Cutting wrote:

>
> I like the single javadoc build.  The linking is nice, e.g., all  
> Analyzer implementations are linked from Analyzer.  It also makes  
> it easier for folks to see everything that's included in the  
> release in one place.

True

>
> Perhaps the names of the sections should be the name of the jar  
> file, and/or the summary sentence in the package.html for contrib  
> packages should name the jar file.  Would that suffice?
>

I find the lower left frame to be the main pain for me, since it  
isn't clear there what is in core and what is in contrib.

> However if most folks really wish to split things, then some new  
> navigational pages are required to provide a home for the various  
> javadocs.  Ideally this would provide the level of integration  
> that, e.g., Ant's optional tasks have with Ant's core tasks: when  
> browsing core tasks there's always a link to optional tasks, and  
> vice-versa, so the optional stuff is always just a single click  
> away.  Putting contrib and core javadoc together achieves this.   
> Achieving it with separate javadocs will be harder.
>

Makes sense.


Re: Documentation Brainstorming

Posted by Doug Cutting <cu...@apache.org>.

Grant Ingersoll wrote:
>> I'd rather see each jar get its own javadoc,
>> or at the very least, indicate which jar each
>> class is defined in for the ones that aren't
>> part of the core.
>>
> 
> Yeah, I don't like that all the contribs are built in together.  What do 
> others think?  I would vote for separating them out.

I like the single javadoc build.  The linking is nice, e.g., all 
Analyzer implementations are linked from Analyzer.  It also makes it 
easier for folks to see everything that's included in the release in one 
place.

Perhaps the names of the sections should be the name of the jar file, 
and/or the summary sentence in the package.html for contrib packages 
should name the jar file.  Would that suffice?

However if most folks really wish to split things, then some new 
navigational pages are required to provide a home for the various 
javadocs.  Ideally this would provide the level of integration that, 
e.g., Ant's optional tasks have with Ant's core tasks: when browsing 
core tasks there's always a link to optional tasks, and vice-versa, so 
the optional stuff is always just a single click away.  Putting contrib 
and core javadoc together achieves this.  Achieving it with separate 
javadocs will be harder.

Doug


Re: Documentation Brainstorming

Posted by Grant Ingersoll <gs...@apache.org>.

Been meaning to get back on this, as there are some good ideas/points  
in here.

On May 25, 2007, at 6:14 PM, Bob Carpenter wrote:

>
>> So, this is an open call for ideas on how we can improve our  
>> docs.  Here are some areas I think need improving:
>
> Before I start suggesting improvements, let
> me qualify them all by saying I'm only
> taking the time to do this because I love
> Lucene and use it all the time.
>

No need to explain your motives; we're all working towards making  
Lucene better.

>
> Web Site Redesign
> ------------------
> I'd like to add a request for a top-level site
> redesign.  I find it very difficult to find
> anything on the site.  This isn't just a Lucene
> problem, it's partly an Apache problem.  I believe
> what most people want is a top-level intro to the
> projects and then a pointer to where to download
> and/or read hello-world getting-started docs.
> (This is, for instance, how Tomcat and MySQL set
> up their home pages and sites.)
>
> I just went to the Lucene site and still
> can't figure out where to download the latest
> Lucene.  I start at http://lucene.apache.org/
> and get a nav choice of "who we are"
> and "buy stuff" and "subprojects".
> So I click on subprojects,
> which opens up a menu and then I click on
> "java" (because I know that there are more
> versions of Lucene than the Java version and
> there's nothing else labeled just Lucene).
> I then get a choice of Features, Who We Are,
> Powered by Lucene, Documentation, Resources,
> Site Versions, and Related Projects.
> I guess the right answer is "Resources"
> then "releases", then I leave the nav for the
> page itself and click "downloads and releases"
> but hey, I'm already there, so I have to go
> into the text and click on "Apache Mirrors".
> I then select a mirror and it gives me a huge
> list to select from.  The README gives me no
> hint as to what's the latest stable version,
> and each version has (old) written next
> to its description.

So, would you prefer the menu items be expanded by default?  Also,  
what about the content of the actual pages outside of the menus?  For  
instance, on the Top Level site, there are brief blurbs about what  
each of the projects are and on the Lucene Java site, the top level  
entry points to a "free download" and the news items generally say  
what release is the latest.  You are right, however, that there are no  
clear links to getting started, etc.

By the same token, though, the download does take a bit of reading to  
find; there is no clear "download latest" button like on MySQL or other  
sites like that.

I also think we should remove older news items, maybe put in a sunset  
policy of 1 year or something.


>
> Ask a coworker who doesn't use Lucene to
> try to find the javadocs, a hello world
> tutorial, and the download on the Lucene
> site.  (Yes, I'm suggesting a usability test.)

Usability, good!  :-)

>
> Altogether, the design should waste less
> whitespace.  Compare an Apache page to
> something like a MySQL page to see the
> difference.
>

I tend to like more whitespace, I find MySQL to be pretty cluttered  
visually, although it is much more efficient.


>
> Class, Method, Construction, Member Doc
> ---------------------------------------
>
> The biggest issue in the doc for me is that
> most methods, packages, classes, etc. are
> hardly documented at all.  For instance, the
> very first class in the 2.1 alphabetical list:
>
> http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/gdata/servlet/handler/AbstractAccountHandler.html
>
> has 7 methods, 6 of which are undocumented
> and 1 of which has inherited redundant doc.
> There's an uncommented field, an uncommented
> constructor, and there's no class doc.
>

ugh.

> It's also out of date.  Someone finally fixed the
> infinite-loop design of Analyzer, but the class doc
> has a big warning that you must implement one
> of the methods.  But now there's only the
> abstract tokenStream() method which must be implemented
> and a getPositionIncrementGap() method (which is
> a useful addition, by the way).

Can you enter a bug for this?  And maybe a patch?

>
> It also doesn't help that there are classes
> with non-descriptive names like Among, which
> have no doc at all.
>
> I'd rather see each jar get its own javadoc,
> or at the very least, indicate which jar each
> class is defined in for the ones that aren't
> part of the core.
>

Yeah, I don't like that all the contribs are built in together.  What  
do others think?  I would vote for separating them out.


>
> Reader Schmeader
> ----------------
>
> This is actually an API, not a doc issue, though the
> doc around this needs work as is, too.
>
> I don't understand why Readers are used in analyzers.
> Using them presents several problems.  First, since
> Analyzer.tokenStream() doesn't throw an IOException,
> all exceptions must be caught somewhere inside.  Second,
> it's not clear who closes the reader or how long the
> analyzer will hold it open.  Every time I've used Lucene,
> I wind up having strings or char sequences or char array
> slices that I need to embed in a Reader. That's because
> I invariably have to parse out the bits of documents
> I want to index in various fields.  Finally, wrapping a
> char sequence  or char array slice in a reader is a rather
> inefficient way to implement a sequence of chars.  Can we
> at least introduce a method that takes a CharSequence or
> even just a String and deprecate the one with Reader?
> Or at least provide an alternative for the usual case
> of not having a reader.  Maybe I'm just missing something
> here, but I don't think it's scaling to streaming input
> that'd overflow memory.
>

This, I believe, is due to the fact that some Fields can be  
constructed with Readers.  The relevant code in DocumentWriter  
(around line 195) is:

    // the field does not have a TokenStream,
    // so we have to obtain one from the analyzer
    if (stream == null) {
      Reader reader;                        // find or make Reader
      if (field.readerValue() != null)
        reader = field.readerValue();
      else if (field.stringValue() != null)
        reader = new StringReader(field.stringValue());
      else
        throw new IllegalArgumentException
                ("field must have either String or Reader value");

      // Tokenize field and add to postingTable
      stream = analyzer.tokenStream(fieldName, reader);
    }

However, you do present interesting use cases.  Also remember that  
some of these APIs have been around for a while and may very well  
benefit from some updating.
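
For illustration only, here is a rough sketch of the kind of String-based
convenience method Bob asks about.  It does not use Lucene's classes:
SketchAnalyzer, WhitespaceSketchAnalyzer and the tokens() methods are
hypothetical stand-ins, and the whitespace splitting is just a placeholder
for real analysis.  The one idea it tries to show is that the String
overload can create, own and close the Reader itself, so the caller never
has to wrap anything or wonder who closes it.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an analyzer; this is NOT Lucene's Analyzer class.
abstract class SketchAnalyzer {
    // Reader-based entry point, similar in shape to the one discussed above.
    abstract List<String> tokens(String fieldName, Reader reader) throws IOException;

    // Hypothetical convenience overload for callers that already hold a String.
    // It creates the Reader, closes it when done, and rethrows any IOException unchecked.
    List<String> tokens(String fieldName, String text) {
        try (Reader reader = new StringReader(text)) {
            return tokens(fieldName, reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

// Placeholder analysis: split the input on whitespace.
class WhitespaceSketchAnalyzer extends SketchAnalyzer {
    @Override
    List<String> tokens(String fieldName, Reader reader) throws IOException {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            if (Character.isWhitespace(c)) {
                if (current.length() > 0) {
                    out.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append((char) c);
            }
        }
        if (current.length() > 0) {
            out.add(current.toString());
        }
        return out;
    }
}

class ReaderSketchDemo {
    public static void main(String[] args) {
        SketchAnalyzer analyzer = new WhitespaceSketchAnalyzer();
        // prints [hello, lucene, world]
        System.out.println(analyzer.tokens("body", "hello lucene  world"));
    }
}
```

The Reader-based method stays as the single place where analysis happens, so
Reader-valued fields keep working; the String overload is only a thin
wrapper around it.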




Re: Documentation Brainstorming

Posted by Bob Carpenter <ca...@alias-i.com>.

> So, this is an open call for ideas on how we can improve our docs. Here
> are some areas I think need improving:

Before I start suggesting improvements, let
me qualify them all by saying I'm only
taking the time to do this because I love
Lucene and use it all the time.

Web Site Redesign
------------------
I'd like to add a request for a top-level site
redesign. I find it very difficult to find
anything on the site. This isn't just a Lucene
problem, it's partly an Apache problem. I believe
what most people want is a top-level intro to the
projects and then a pointer to where to download
and/or read hello-world getting-started docs.
(This is, for instance, how Tomcat and MySQL set
up their home pages and sites.)

I just went to the Lucene site and still
can't figure out where to download the latest
Lucene. I start at http://lucene.apache.org/
and get a nav choice of "who we are"
and "buy stuff" and "subprojects".
So I click on subprojects,
which opens up a menu and then I click on
"java" (because I know that there are more
versions of Lucene than the Java version and
there's nothing else labeled just Lucene).
I then get a choice of Features, Who We Are,
Powered by Lucene, Documentation, Resources,
Site Versions, and Related Projects.
I guess the right answer is "Resources"
then "releases", then I leave the nav for the
page itself and click "downloads and releases"
but hey, I'm already there, so I have to go
into the text and click on "Apache Mirrors".
I then select a mirror and it gives me a huge
list to select from. The README gives me no
hint as to what's the latest stable version,
and each version has (old) written next
to its description.

Ask a coworker who doesn't use Lucene to
try to find the javadocs, a hello world
tutorial, and the download on the Lucene
site. (Yes, I'm suggesting a usability test.)

Altogether, the design should waste less
whitespace. Compare an Apache page to
something like a MySQL page to see the
difference.

Class, Method, Construction, Member Doc
---------------------------------------

The biggest issue in the doc for me is that
most methods, packages, classes, etc. are
hardly documented at all. For instance, the
very first class in the 2.1 alphabetical list:

http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/gdata/servlet/handler/AbstractAccountHandler.html

has 7 methods, 6 of which are undocumented
and 1 of which has inherited redundant doc.
There's an uncommented field, an uncommented
constructor, and there's no class doc.

It's also out of date. Someone finally fixed the
infinite-loop design of Analyzer, but the class doc
has a big warning that you must implement one
of the methods. But now there's only the
abstract tokenStream() method which must be implemented
and a getPositionIncrementGap() method (which is
a useful addition, by the way).

It also doesn't help that there are classes
with non-descriptive names like Among, which
have no doc at all.

I'd rather see each jar get its own javadoc,
or at the very least, indicate which jar each
class is defined in for the ones that aren't
part of the core.

Reader Schmeader
----------------

This is actually an API, not a doc issue, though the
doc around this needs work as is, too.

I don't understand why Readers are used in analyzers.
Using them presents several problems. First, since
Analyzer.tokenStream() doesn't throw an IOException,
all exceptions must be caught somewhere inside. Second,
it's not clear who closes the reader or how long the
analyzer will hold it open. Every time I've used Lucene,
I wind up having strings or char sequences or char array
slices that I need to embed in a Reader. That's because
I invariably have to parse out the bits of documents
I want to index in various fields. Finally, wrapping a
char sequence or char array slice in a reader is a rather
inefficient way to implement a sequence of chars. Can we
at least introduce a method that takes a CharSequence or
even just a String and deprecate the one with Reader?
Or at least provide an alternative for the usual case
of not having a reader. Maybe I'm just missing something
here, but I don't think it's scaling to streaming input
that'd overflow memory.

- Bob Carpenter
Alias-i


Re: Documentation Brainstorming

Posted by Grant Ingersoll <gs...@apache.org>.

Thanks for starting this, Hoss!	

On Jun 7, 2007, at 12:38 AM, Chris Hostetter wrote:

>
> In the spirit of improving documentation, i've started a document
> summarizing the various comments/ideas about improving our
> documentation...
>
> http://wiki.apache.org/lucene-java/Documentation_Improvements
>
> : 3.  There is a whole lot of knowledge stored in the email archives,
> : how can we leverage it?
>
> One email thread down, a bigillion to go!

Back in yesteryear, I used to do Oracle work and came across a few  
bugs.  At any rate, Oracle had this automated system whereby you  
entered a TAR (Technical Assistance Request).  In it you described  
your problem and then submitted it to the system.  Before actually  
putting it into the support queue, however, it took your text and ran  
it as a query against all the previous TARs and the Oracle knowledge  
base and asked you if any of the items satisfied your question.  If  
it did, great, you were satisfied, if not you could continue on with  
your request.  I would say it correctly answered my concern 6 out of  
10 times (that might be a little high).

Why do I mention all of this?  Does anyone know of a Mailing List  
Manager (MLM) that could take in an email, search the archive, spit  
back the top 20 results (plus FAQ and JIRA) to the person?  Then if  
the user finds the answer they are happy and we are happy b/c we  
don't have to answer something that has already been answered 10  
times before.  If not, they simply reply back to the MLM and it  
forwards on the original message to the list.  I briefly thought  
about hooking Lucene up to ezmlm for this, but just didn't have the  
time.  However, if we could get our hands on an MLM that does that,  
maybe this would be one way to leverage all of that knowledge?  Any  
takers?
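
To make the idea a little more concrete, here is a minimal in-memory sketch
of the "search before you post" step.  It is an assumption-laden toy, not an
ezmlm or Lucene integration: ArchiveSuggester, addMessage() and suggest()
are hypothetical names, and the scoring is just a count of shared distinct
terms rather than anything Lucene would do.  A real version would presumably
query a Lucene index of the archive, the FAQ and JIRA instead.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: rank archived messages by how many distinct terms
// they share with an incoming email. No stop words, stemming or weighting.
class ArchiveSuggester {
    // Subject line mapped to the set of lower-cased terms in subject + body.
    private final Map<String, Set<String>> termsBySubject = new LinkedHashMap<>();

    private static Set<String> terms(String text) {
        Set<String> out = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    void addMessage(String subject, String body) {
        termsBySubject.put(subject, terms(subject + " " + body));
    }

    // Returns up to 'limit' subjects sharing at least one term, best match first.
    List<String> suggest(String incomingEmail, int limit) {
        Set<String> query = terms(incomingEmail);
        Map<String, Integer> scores = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> e : termsBySubject.entrySet()) {
            Set<String> shared = new HashSet<>(e.getValue());
            shared.retainAll(query);
            if (!shared.isEmpty()) {
                scores.put(e.getKey(), shared.size());
            }
        }
        List<String> ranked = new ArrayList<>(scores.keySet());
        // List.sort is stable, so ties keep the order messages were added in.
        ranked.sort((a, b) -> scores.get(b) - scores.get(a));
        return new ArrayList<>(ranked.subList(0, Math.min(limit, ranked.size())));
    }
}

class ArchiveSuggesterDemo {
    public static void main(String[] args) {
        ArchiveSuggester archive = new ArchiveSuggester();
        archive.addMessage("When to call optimize",
                "optimize merges segments and can slow down indexing");
        archive.addMessage("HitCollector usage",
                "Use a HitCollector to gather all matching documents");
        archive.addMessage("Javadoc layout",
                "contrib and core javadocs in separate builds");

        List<String> hits = archive.suggest(
                "How do I collect all matching documents with a HitCollector?", 20);
        if (hits.isEmpty()) {
            System.out.println("No suggestions; forward the message to the list.");
        } else {
            // prints [HitCollector usage]
            System.out.println(hits);
        }
    }
}
```

If the list comes back empty, or the poster replies that nothing helped, the
original message simply goes on to the mailing list as usual.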

Maybe this is a good Google SoC project...

-Grant


Re: Documentation Brainstorming

Posted by Chris Hostetter <ho...@fucit.org>.

In the spirit of improving documentation, i've started a document
summarizing the various comments/ideas about improving our
documentation...

http://wiki.apache.org/lucene-java/Documentation_Improvements

: 3.  There is a whole lot of knowledge stored in the email archives,
: how can we leverage it?

One email thread down, a bigillion to go!


-Hoss

