Posted to dev@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2010/11/05 00:30:58 UTC

[lucy-dev] Generalize Tutorial for multiple host languages

Greets,

Most of Lucy's documentation has been written so that it will work with
multiple host languages.  To the extent possible, class descriptions, method
descriptions and so on are host-language neutral -- only code samples need be
customized.

Our multi-chapter Tutorial is still Perl-specific, however.  It would be great
if we could adapt it for use across multiple host languages -- but right now,
it has dependencies which will not be available for every language/platform
combination.

Currently, the tutorial builds sample applications designed to be used in a
web context using an HTML presentation of the United States Constitution as a
corpus.  For HTML parsing, CGI processing, and paging through results, we use
dedicated Perl modules, some of which belong to the Perl core and some of
which must be obtained from CPAN.

To eliminate these dependencies, I think the Tutorial should be simplified to
build a command-line app, and the corpus should be changed to plain text.
Every potential host language has basic file and directory manipulation
capabilities; it should be possible to generalize the tutorial prose so that
it can work with all of them without modification.

Additionally, by eliminating those CPAN prerequisites entirely, we skirt the
issue of dependency licensing.

The only downside is that easily-customizable sample applications are
compelling (see Ruby on Rails), and we'll be taking our "instant web search"
kit and making it less handy.  

We face a similar challenge with the CustomQueryParser Cookbook entry
-- which uses Parse::RecDescent -- but that will be harder to resolve.  I'm
not sure what to do about that one, except possibly remove it from the
distribution and publish it elsewhere as an independent article.  

Marvin Humphrey


Re: [lucy-dev] Generalize Tutorial for multiple host languages

Posted by Peter Karman <pe...@peknet.com>.
Marvin Humphrey wrote on 11/7/10 10:06 AM:

> One thing I'm realizing is that I really don't want to contribute or maintain
> C sample code which operates in a web context.  C is too prone to security
> vulnerabilities, its string handling sucks so you need waaaaay more code, and
> things like URI escaping and HTML tag stripping aren't offered by the standard
> library and aren't easy to fake up.  It's the wrong language for a quickie CGI
> app.

Agreed.

> 
> I think it makes more sense for the C tutorial to operate in a command-line
> context, even if the tutorials for other host language bindings target the
> web.  But then we have a problem: the current HTML format of our sample corpus
> isn't suitable.  The solution, I think, is to change all those docs to plain
> text, with the title on the first line: 
> 
>     Amendment XIII 
> 
>     1. Neither slavery nor involuntary servitude, except as a punishment for
>     crime whereof the party shall have been duly convicted, shall exist within
>     the United States, or any place subject to their jurisdiction.
> 
>     2. Congress shall have power to enforce this article by appropriate
>     legislation.
> 
> Plain text will work for either web or command-line context, and as a bonus,
> for web-context tutorials we no longer have to either pull in an HTML parsing
> dependency or do something hackish with regexes.
> 

Agreed.

For what it's worth, my intention, once we have a working C API, is to include
as part of libswish3 a "swish_lucy.c" example of using Lucy with libswish3,
which *does* do all the HTML/XML parsing.

See, for example:

 http://dev.swish-e.org/browser/libswish3/trunk/src/swish_lint.c
 http://dev.swish-e.org/browser/libswish3/trunk/src/xapian/swish_xapian.cpp

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [lucy-dev] Generalize Tutorial for multiple host languages

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Fri, Nov 05, 2010 at 04:39:08PM +0100, Simon Willnauer wrote:
> > Our multi-chapter Tutorial is still Perl-specific, however.  It would be great
> > if we could adapt it for use across multiple host languages -- but right now,
> > it has dependencies which will not be available for every language/platform
> > combination.
> That would be absolutely awesome. I am not really into Perl (shame on
> me, I know) and something like that would definitely help me a lot. I
> find myself in the situation where I am gonna need it sooner or later
> :)

We're not yet at the point where C is a viable host language binding for Lucy,
but each dependency we eliminate brings us closer.

> > The only downside is that easily-customizable sample applications are
> > compelling (see Ruby on Rails), and we'll be taking our "instant web search"
> > kit and making it less handy.
> 
> I know it would be an overhead, but could we maintain that alongside the
> getting-started example?

One thing I'm realizing is that I really don't want to contribute or maintain
C sample code which operates in a web context.  C is too prone to security
vulnerabilities, its string handling sucks so you need waaaaay more code, and
things like URI escaping and HTML tag stripping aren't offered by the standard
library and aren't easy to fake up.  It's the wrong language for a quickie CGI
app.

I think it makes more sense for the C tutorial to operate in a command-line
context, even if the tutorials for other host language bindings target the
web.  But then we have a problem: the current HTML format of our sample corpus
isn't suitable.  The solution, I think, is to change all those docs to plain
text, with the title on the first line: 

    Amendment XIII 

    1. Neither slavery nor involuntary servitude, except as a punishment for
    crime whereof the party shall have been duly convicted, shall exist within
    the United States, or any place subject to their jurisdiction.

    2. Congress shall have power to enforce this article by appropriate
    legislation.

Plain text will work for either web or command-line context, and as a bonus,
for web-context tutorials we no longer have to either pull in an HTML parsing
dependency or do something hackish with regexes.
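
A minimal sketch of reading that format, assuming the title is always the
first line and everything after it is the body.  The helper name
parse_text() and the `title`/`content` keys are hypothetical; parse_file()
borrows a name used elsewhere in this thread, but its body here is an
assumption and is not taken from the actual sample code:

```perl
use strict;
use warnings;

# Hypothetical helper: split a plain-text document into a title (the first
# line) and a body (everything after it).  Key names are illustrative.
sub parse_text {
    my ($text) = @_;
    $text = '' unless defined $text;
    my ( $title, $body ) = split /\n/, $text, 2;
    $title = '' unless defined $title;
    $body  = '' unless defined $body;
    $title =~ s/^\s+|\s+$//g;    # trim whitespace around the title
    $body  =~ s/^\s+|\s+$//g;    # drop blank lines around the body
    return { title => $title, content => $body };
}

# Hypothetical helper: slurp one file from disk and parse it.
sub parse_file {
    my ($path) = @_;
    open( my $fh, '<:encoding(UTF-8)', $path )
        or die "Can't open '$path': $!";
    my $text = do { local $/; <$fh> };    # read the whole file at once
    close $fh;
    return parse_text($text);
}
```

Only core file I/O and a split are needed, which is the point of the
plain-text corpus: no parsing module is involved.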

Marvin Humphrey



Re: [lucy-dev] Generalize Tutorial for multiple host languages

Posted by Simon Willnauer <si...@googlemail.com>.
hey Marvin,

On Fri, Nov 5, 2010 at 12:30 AM, Marvin Humphrey <ma...@rectangular.com> wrote:
> Greets,
>
> Most of Lucy's documentation has been written so that it will work with
> multiple host languages.  To the extent possible, class descriptions, method
> descriptions and so on are host-language neutral -- only code samples need be
> customized.
>
> Our multi-chapter Tutorial is still Perl-specific, however.  It would be great
> if we could adapt it for use across multiple host languages -- but right now,
> it has dependencies which will not be available for every language/platform
> combination.
That would be absolutely awesome. I am not really into Perl (shame on
me, I know) and something like that would definitely help me a lot. I
find myself in the situation where I am gonna need it sooner or later
:)
>
> Currently, the tutorial builds sample applications designed to be used in a
> web context using an HTML presentation of the United States Constitution as a
> corpus.  For HTML parsing, CGI processing, and paging through results, we use
> dedicated Perl modules, some of which belong to the Perl core and some of
> which must be obtained from CPAN.
>
> To eliminate these dependencies, I think the Tutorial should be simplified to
> build a command-line app, and the corpus should be changed to plain text.
> Every potential host language has basic file and directory manipulation
> capabilities; it should be possible to generalize the tutorial prose so that
> it can work with all of them without modification.
+1
>
> Additionally, by eliminating those CPAN prerequisites entirely, we skirt the
> issue of dependency licensing.
awesome again!
>
> The only downside is that easily-customizable sample applications are
> compelling (see Ruby on Rails), and we'll be taking our "instant web search"
> kit and making it less handy.

I know it would be an overhead, but could we maintain that alongside the
getting-started example?

simon
>
> We face a similar challenge with the CustomQueryParser Cookbook entry
> -- which uses Parse::RecDescent -- but that will be harder to resolve.  I'm
> not sure what to do about that one, except possibly remove it from the
> distribution and publish it elsewhere as an independent article.
>
> Marvin Humphrey
>
>

Re: [lucy-dev] Generalize Tutorial for multiple host languages

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Mon, Nov 08, 2010 at 08:37:53PM -0600, Peter Karman wrote:
> > In conclusion... for the primary Tutorial documentation, we can and arguably
> > *should* eliminate all non-core-Perl dependencies -- if for no other reason
> > than making it easier to run the sample code and go through the Tutorial.
> > 
> 
> eliminating because they are non-core is a decent reason. +1

Done.

Marvin Humphrey


Re: [lucy-dev] Generalize Tutorial for multiple host languages

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Mon, Nov 08, 2010 at 08:37:53PM -0600, Peter Karman wrote:
> > Here's a draft of the question we might ask legal via JIRA:
> > 
> [snip]
> 
> that email looks good to me.

Sent.

    https://issues.apache.org/jira/browse/LEGAL-86

Marvin Humphrey


Re: [lucy-dev] Generalize Tutorial for multiple host languages

Posted by Peter Karman <pe...@peknet.com>.
Marvin Humphrey wrote on 11/8/10 1:00 PM:

> In conclusion... for the primary Tutorial documentation, we can and arguably
> *should* eliminate all non-core-Perl dependencies -- if for no other reason
> than making it easier to run the sample code and go through the Tutorial.
> 

eliminating because they are non-core is a decent reason. +1

> 
> We should take this up with legal and ask for clarification.  It would be nice
> if we didn't have to deal with replacing JSON::XS right away, but could put
> that task off until after the first release.
> 
> Here's a draft of the question we might ask legal via JIRA:
> 
[snip]

that email looks good to me.

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [lucy-dev] Generalize Tutorial for multiple host languages

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Sun, Nov 07, 2010 at 01:07:38PM -0600, Peter Karman wrote:
> HTML::Entities is licensed under the same terms as Perl, which means it may be
> redistributed under the Artistic License or the GPL. But it's up to the
> redistributor to decide, yes?

Hmm, yes.  I suppose that means that we could assert that our usage of those
modules within the tutorial/sample code was under the terms of the Artistic
License and *not* under the GPL.  The question then becomes whether such usage
is compatible with distribution of the tutorial/sample code under the Apache
License 2.0.

For what it's worth, I think switching in CGI::escapeHTML() for some usages of
HTML::Entities::encode_entities() is OK.  I changed the charset for search.cgi
from Latin-1 to UTF-8, so it's no longer necessary to encode code points above
255 as HTML entities.  The only important thing we need entity encoding for
now is to guard against cross-site-scripting attacks, and for that,
CGI::escapeHTML() suffices.

Switching in CGI::escape() instead of encode_entities() for URL encoding is
actually a bugfix.  (If we were going to use a CPAN module for that, it should
have been URI::Escape, which offers the function uri_escape_utf8().)
CGI::escape() would be a gimme except that it was silently dedocumented back in
2005 with CGI.pm version 3.06, when cgi_docs.html was abandoned in favor of the
module POD -- escape() and unescape() didn't make the jump from the old
documentation to the new.  CGI is on version 3.49 now and it's distributed with
the Perl core, so escape() isn't going anywhere -- it's safe to use, just no
longer publicly documented.

> Same is true of HTML::TreeBuilder and Data::Pageset.

Data::Pageset is a mild improvement at best.  HTML::TreeBuilder is no longer
necessary if we go with a plain text corpus.

In conclusion... for the primary Tutorial documentation, we can and arguably
*should* eliminate all non-core-Perl dependencies -- if for no other reason
than making it easier to run the sample code and go through the Tutorial.

Elsewhere, though, there are two Perl-licensed modules that we *do* care
about. 

  * Parse::RecDescent, for the Clownfish compiler and for
    Lucy::Docs::Cookbook::CustomQueryParser. 
  * JSON::XS for Lucy itself, until we write our own JSON parser.

> So I don't see why it's necessary to reinvent those dependencies. The Artistic
> license is *not* listed under the "Category X" page at
> http://www.apache.org/legal/resolved.html#category-x

The Artistic License isn't listed at all on that page, which means it hasn't
yet been ruled on.  Usage has been discussed in
<https://issues.apache.org/jira/browse/LEGAL-64>, but that deals with sample
data, not code dependencies.  It's also come up on legal-discuss@a.o, but it's
never resulted in an official outcome.

We should take this up with legal and ask for clarification.  It would be nice
if we didn't have to deal with replacing JSON::XS right away, but could put
that task off until after the first release.

Here's a draft of the question we might ask legal via JIRA:

    The Apache Lucy Incubator podling is working to pare down its list of
    dependencies, but there are two CPAN distributions which we would like to
    put off replacing for the time being (Parse::RecDescent and JSON::XS).
    These two distributions are both licensed, as is common for CPAN modules,
    under the "same terms as Perl itself".  Perl's licensing is here:

      http://dev.perl.org/licenses/

    We do not wish to bundle these CPAN distributions with Lucy, but instead
    specify them as prerequisites.  We assert that our usage of the modules in
    question falls under the terms of the Artistic License and *not* the GPL.

    Lucy interfaces with these modules in three places:

        * At build time (Parse::RecDescent).
        * Within Lucy itself at runtime (JSON::XS).
        * Within sample/cookbook code (Parse::RecDescent).
    
    We have two questions:

    Is it acceptable for code released under the Apache License 2.0 to have a
    non-optional dependency on code which is licensed under the Artistic
    License?

    Is it acceptable to classify these modules as "system dependencies", which
    the user is expected to install?

Marvin Humphrey


Re: [lucy-dev] Generalize Tutorial for multiple host languages

Posted by Peter Karman <pe...@peknet.com>.
Marvin Humphrey wrote on 11/5/10 5:00 PM:

> 
> Fortunately, even if the Lucy sample apps continue to operate in an HTML/CGI
> context rather than migrate to plain-text/command-line as proposed, it's easy
> to remove the dependencies on HTML::TreeBuilder, HTML::Entities, and
> Data::Pageset, which were actually introduced not too long ago.  We'll just go
> back to manual paging code (slightly more verbose but in an area that doesn't
> matter), CGI::escapeHTML() instead of HTML::Entities::encode_entities() (I
> generally try to avoid CGI.pm but oh well), and stripping of HTML tags with
> regexes (hackish but fine for a demo).
> 

HTML::Entities is licensed under the same terms as Perl, which means it may be
redistributed under the Artistic License or the GPL. But it's up to the
redistributor to decide, yes?

Same is true of HTML::TreeBuilder and Data::Pageset.

So I don't see why it's necessary to reinvent those dependencies. The Artistic
license is *not* listed under the "Category X" page at
http://www.apache.org/legal/resolved.html#category-x

It is idiomatic of Perl code to use CPAN. Unless a CPAN module is explicitly
licensed under the GPL (or other Category X license), why jettison it? I don't
see where the Apache standards dictate that.

We can't (for example) use SWISH::3 because it is explicitly GPL licensed. Many
(most?) CPAN modules are licensed "under the same terms as Perl" which it seems
to me makes them fair game for example code.

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [lucy-dev] Generalize Tutorial for multiple host languages

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Fri, Nov 05, 2010 at 10:50:53AM -0500, Peter Karman wrote:
> > Additionally, by eliminating those CPAN prerequisites entirely, we skirt the
> > issue of dependency licensing.
> 
> How is documenting something, in a tutorial, affected by software
> licensing? We don't distribute those modules.

The problem is the sample application itself.  We distribute it in finished
form as indexer.pl and search.cgi under trunk/perl/sample/, and it's also
available in fragmented form within the Tutorial documentation itself. 

The GPL asserts itself on "derived works" under copyright law.  The position
of the FSF, and of some other prominent GPL advocates such as Linus Torvalds,
is that interfacing with GPL'd software in your source code is sufficient to
create a derived work.  This is not a universally held position -- see e.g.
the qualifying language that Larry Wall inserted into the Perl license text --
and ultimately, enforcement of the GPL depends upon a copyright owner with
standing and willingness to bring suit.  But regardless of the position of
individual copyright holders, the text of the GPL is what it is, and that
makes the licensing of our present sample app questionable.

As I understand things, the position of the ASF is that sample code, like core
project code, must follow the rules for dependencies laid down at
<http://www.apache.org/legal/resolved.html>.  When Apache Pivot proposed
moving demo code with LGPL dependencies off of ASF servers to a separate home
at Google Code, they received this response on legal-discuss@a.o:

  http://markmail.org/message/3th3yoasvmyaugzg

  Understanding that these are demos/examples, that's a substantially
  disappointing direction. You pretty much ensure that the world can't really
  count on basing their code on the demos or examples without being locked
  into a copyleft schema (or questioning the provenance of the code and again,
  being unable to use this).

  So if the plan is to build upon these mixed-license demos at an external
  location, I'd encourage the project to rethink the sense in that (and
  perhaps bring in all IP-clear AL examples back to the project, abandoning
  those with licensing or provenance issues.) This would ensure there are a
  set of adoptable, modifiable demos for users to start with. 

Many Perl users are accustomed to working with GPL'd CPAN modules and won't
care about the licensing.  But we still have to ensure that we are in
compliance with the ASF policy, and I fully agree with the rationale behind
that policy.  Ensuring that GPL'd dependencies do not sneak into your code
requires vigilance and can consume a lot of energy.  The consumers of Apache
Lucy's Tutorial code should not have to concern themselves with whether our
sample apps introduce a vector for the insertion of GPL'd code into their
codebases.

Fortunately, even if the Lucy sample apps continue to operate in an HTML/CGI
context rather than migrate to plain-text/command-line as proposed, it's easy
to remove the dependencies on HTML::TreeBuilder, HTML::Entities, and
Data::Pageset, which were actually introduced not too long ago.  We'll just go
back to manual paging code (slightly more verbose but in an area that doesn't
matter), CGI::escapeHTML() instead of HTML::Entities::encode_entities() (I
generally try to avoid CGI.pm but oh well), and stripping of HTML tags with
regexes (hackish but fine for a demo).
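
A rough sketch of what manual paging arithmetic can look like, assuming a
zero-based offset and a fixed page size.  The function name paging_info()
and the keys it returns are hypothetical and are not taken from search.cgi:

```perl
use strict;
use warnings;
use POSIX qw( ceil );

# Hypothetical helper: work out paging info by hand for a result list.
#   $total     - total number of hits for the query
#   $offset    - zero-based index of the first hit on the current page
#   $page_size - number of hits shown per page
sub paging_info {
    my ( $total, $offset, $page_size ) = @_;
    die "page_size must be positive" unless $page_size > 0;
    my $last_page    = $total > 0 ? ceil( $total / $page_size ) : 1;
    my $current_page = int( $offset / $page_size ) + 1;
    $current_page = $last_page if $current_page > $last_page;

    # Offsets for the "previous" and "next" links; undef means no link.
    my $prev_offset
        = $current_page > 1 ? ( $current_page - 2 ) * $page_size : undef;
    my $next_offset
        = $current_page < $last_page ? $current_page * $page_size : undef;

    return {
        current_page => $current_page,
        last_page    => $last_page,
        prev_offset  => $prev_offset,
        next_offset  => $next_offset,
    };
}
```

A caller would turn prev_offset and next_offset into links and skip any
that are undef.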

Marvin Humphrey


Re: [lucy-dev] Generalize Tutorial for multiple host languages

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Mon, Nov 08, 2010 at 08:36:04PM -0600, Peter Karman wrote:
> Marvin Humphrey wrote on 11/8/10 1:05 PM:
> > On Sun, Nov 07, 2010 at 01:19:00PM -0600, Peter Karman wrote:
> >> The approach I have taken with libswish3 is to make the example code part of the
> >> test suite, so that if the API changes, I have to keep the example(s) up to date
> >> with it. We could do something similar with the tutorial apps.

> > Perhaps we could divide the sample scripts up into testable subroutines, then

> sounds reasonable.

Dividing up the sample code into testable subroutines didn't work so well.
There's a conflict between what tests well and what reads well in
sample/tutorial material.

I wound up adding a test (perl/t/binding/702-sample.t) which simply verifies
that the sample apps work -- that indexer.pl creates an index, and that
search.cgi searches the index and returns the expected number of results.

Marvin Humphrey


Re: [lucy-dev] Generalize Tutorial for multiple host languages

Posted by Peter Karman <pe...@peknet.com>.
Marvin Humphrey wrote on 11/8/10 1:05 PM:
> On Sun, Nov 07, 2010 at 01:19:00PM -0600, Peter Karman wrote:
>> The approach I have taken with libswish3 is to make the example code part of the
>> test suite, so that if the API changes, I have to keep the example(s) up to date
>> with it. We could do something similar with the tutorial apps.
> 
> This is a good plan; I just hadn't figured out how to make it work before.
> 
> Perhaps we could divide the sample scripts up into testable subroutines, then
> perform a require/do to import their subs into the current namespace.
> 
>     require "sample/indexer.pl" or die $@;
>     ok( parse_file("sample/us_constitution/amend10.txt"), "parse_file" );
> 

sounds reasonable.


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [lucy-dev] Generalize Tutorial for multiple host languages

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Sun, Nov 07, 2010 at 01:19:00PM -0600, Peter Karman wrote:
> The approach I have taken with libswish3 is to make the example code part of the
> test suite, so that if the API changes, I have to keep the example(s) up to date
> with it. We could do something similar with the tutorial apps.

This is a good plan; I just hadn't figured out how to make it work before.

Perhaps we could divide the sample scripts up into testable subroutines, then
perform a require/do to import their subs into the current namespace.

    require "sample/indexer.pl" or die $@;
    ok( parse_file("sample/us_constitution/amend10.txt"), "parse_file" );

Marvin Humphrey


Re: [lucy-dev] Generalize Tutorial for multiple host languages

Posted by Peter Karman <pe...@peknet.com>.
Marvin Humphrey wrote on 11/5/10 4:40 PM:
> On Fri, Nov 05, 2010 at 10:03:47AM -0700, Nathan Kurz wrote:
>> I was about to write the same thing as Peter just did: don't generalize a
>> tutorial, rather provide a native version per language with as many
>> real-world examples as possible.  It's OK if you also want to have a meta
>> tutorial as a template guide for developers (although maybe this is the same
>> as the C guide) but don't try to make a single doc that covers both C and
>> Ruby.
> 
> OK, I can work with this.  
> 
> I have misgivings about the violation of DRY and the increase in maintenance
> burden; if Lucy is successful, we will add more bindings and each binding will
> cost more as a result of the branching we're choosing to initiate now.  I
> predict that attempting to keep multiple tutorials up to date is going to
> introduce documentation bugs in the future.
> 

If Lucy is successful, it will be in part because we have added many more
developers and users to our community, thus sharing the maintenance burden.

The approach I have taken with libswish3 is to make the example code part of the
test suite, so that if the API changes, I have to keep the example(s) up to date
with it. We could do something similar with the tutorial apps.


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [lucy-dev] Generalize Tutorial for multiple host languages

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Fri, Nov 05, 2010 at 10:03:47AM -0700, Nathan Kurz wrote:
> I was about to write the same thing as Peter just did: don't generalize a
> tutorial, rather provide a native version per language with as many
> real-world examples as possible.  It's OK if you also want to have a meta
> tutorial as a template guide for developers (although maybe this is the same
> as the C guide) but don't try to make a single doc that covers both C and
> Ruby.

OK, I can work with this.  

I have misgivings about the violation of DRY and the increase in maintenance
burden; if Lucy is successful, we will add more bindings and each binding will
cost more as a result of the branching we're choosing to initiate now.  I
predict that attempting to keep multiple tutorials up to date is going to
introduce documentation bugs in the future.

However, I believe that if there's a place to pour your resources, it's
introductory tutorial documentation and sample applications.  The costs are
justifiable.

Marvin Humphrey


Re: [lucy-dev] Generalize Tutorial for multiple host languages

Posted by Nathan Kurz <na...@verse.com>.
I was about to write the same thing as Peter just did: don't generalize a
tutorial, rather provide a native version per language with as many
real-world examples as possible.  It's OK if you also want to have a meta
tutorial as a template guide for developers (although maybe this is the same
as the C guide) but don't try to make a single doc that covers both C and
Ruby.

Make each one feel as native as it can. You're a clownfish, tolerated by
your host and immune to its poisons, not some generic marauding tube bass
that causes everyone to cower in the kelp.

Nathan Kurz
nate@verse.com
On Nov 5, 2010 8:51 AM, "Peter Karman" <pe...@peknet.com> wrote:

Re: [lucy-dev] Generalize Tutorial for multiple host languages

Posted by Peter Karman <pe...@peknet.com>.
Marvin Humphrey wrote on 11/04/2010 06:30 PM:

> To eliminate these dependencies, I think the Tutorial should be simplified to
> build a command-line app, and the corpus should be changed to plain text.
> Every potential host language has basic file and directory manipulation
> capabilities; it should be possible to generalize the tutorial prose so that
> it can work with all of them without modification.

I favor the opposite. Instead of generalizing the tutorial, let's build
a tutorial for each language implementation.

The philosophy of Lucy is: provide core, shared, C code and idiomatic
language implementations. Let's follow the same philosophy for our
documentation: idiomatic tutorials per-language. It ought to be possible
to lift the example code out of the tutorial and run it; by avoiding any
particular language implementation, we prevent that ease-of-use.

> 
> Additionally, by eliminating those CPAN prerequisites entirely, we skirt the
> issue of dependency licensing.

How is documenting something, in a tutorial, affected by software
licensing? We don't distribute those modules.



-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com