Posted to dev@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2013/05/24 03:54:01 UTC

[lucy-dev] autogen dir

On Mon, May 20, 2013 at 12:09 PM,  <nw...@apache.org> wrote:

> Add parcel prefix to parcel.c and parcel.h
>
> The parcel header file must be publicly available, so add parcel prefix
> to avoid name clashes.

> Project: http://git-wip-us.apache.org/repos/asf/lucy/repo
> Commit: http://git-wip-us.apache.org/repos/asf/lucy/commit/fd43a656
> Tree: http://git-wip-us.apache.org/repos/asf/lucy/tree/fd43a656
> Diff: http://git-wip-us.apache.org/repos/asf/lucy/diff/fd43a656

> -    scratch = chaz_Util_join(dir_sep, "autogen", "source", "parcel.c", NULL);
> +    scratch = chaz_Util_join(dir_sep, "autogen", "source", "lucy_parcel.c",
> +                             NULL);

This bugs me because it's not extendable to fully qualified parcel namespaces,
but it's suggested a tangential idea:

How about eliminating the "autogen" directory and having all our output go
into .h files?  Then we can drop them alongside the .cfh files we used to
generate them.

  core/Lucy/Search/IndexSearcher.cfh  // in
  core/Lucy/Search/IndexSearcher.h    // out

There's a certain amount of tangible .c code that we generate, but maybe we
can stick that in a .h file and enclose it with ifdefs which only one .c file
defines.

    #ifdef C_LUCY_INDEXSEARCHER
    // ....
    #endif
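
To make the ifdef idea concrete, here is a minimal sketch of what one generated header plus its single "owning" .c file might look like, inlined into one file so it compiles on its own. The function `demo_IxSearcher_clamp_doc_max` and the include guard name are hypothetical stand-ins, not actual CFC output.

```c
/* IndexSearcher.c: the one file that opts in to the generated code. */
#define C_LUCY_INDEXSEARCHER

/* --- Contents of a generated IndexSearcher.h would normally arrive via
 * #include; they are inlined here so the sketch is a single file. --- */
#ifndef H_DEMO_INDEXSEARCHER
#define H_DEMO_INDEXSEARCHER

/* Declaration: visible to every file that includes the header. */
int
demo_IxSearcher_clamp_doc_max(int doc_max);

/* Generated implementation: compiled only where C_LUCY_INDEXSEARCHER
 * was defined before the header was included. */
#ifdef C_LUCY_INDEXSEARCHER
int
demo_IxSearcher_clamp_doc_max(int doc_max) {
    /* Hypothetical body, present only to give the guard something to hide. */
    return doc_max > 0 ? doc_max : 0;
}
#endif /* C_LUCY_INDEXSEARCHER */

#endif /* H_DEMO_INDEXSEARCHER */
```

Every other source file would include the same header without defining `C_LUCY_INDEXSEARCHER` and so would see only the declaration.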

The rationale is to make the connection between the .cfh source files and the
generated .h files clearer so that it's easier for both newcomers and
experts to see what's going on.

Marvin Humphrey

Re: [lucy-dev] autogen dir

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Tue, May 28, 2013 at 3:42 AM, Nick Wellnhofer <we...@aevum.de> wrote:
> I'd only like to add that there probably won't be that many Clownfish
> parcels in the foreseeable future. Or, in other words, I would be happy if
> we had so many Clownfish users that they would start to complain about
> symbol clashes ;)

Heh. ;)  Hence, my failed attempt to address prefixing as an aside at first
rather than an imperative that needed to be attended to immediately.  But then
when it appeared we might have substantially different long-term visions, I
figured it was worth investing some time to reconcile them.

>> In terms of alias generation, here's what I think we should be doing:
>>
>>      #define lucy_Indexer_new org_apache_lucy_Indexer_new
>
> But if we define this unconditionally, doesn't it defeat the purpose of
> fully-qualified names?

Yeah, I suppose that it does -- if conflicting Clownfish-powered libraries are
both pound-included in the same source file.  However, at least the macro
aliases are not visible from compiled object files or later stages.
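
As a rough illustration of that last point, the sketch below stringifies the alias after macro expansion to show which name survives preprocessing. The `DEMO_*` macros and `demo_*` functions are hypothetical helpers made up for this example; only the `#define` of the alias comes from the proposal.

```c
#include <string.h>

/* The alias proposed in this thread. */
#define lucy_Indexer_new org_apache_lucy_Indexer_new

/* Two-level stringification: the outer macro expands its argument
 * before the inner one turns it into a string literal. */
#define DEMO_STR(x)  #x
#define DEMO_XSTR(x) DEMO_STR(x)

/* Returns the symbol name the alias expands to at compile time. */
const char*
demo_expanded_name(void) {
    return DEMO_XSTR(lucy_Indexer_new);
}

/* Returns 1 if the alias expands to `expected`, 0 otherwise. */
int
demo_alias_expands_to(const char *expected) {
    return strcmp(demo_expanded_name(), expected) == 0;
}
```

Only the fully qualified name reaches the compiler proper, so two libraries that both define a `lucy_` alias would collide in a single translation unit but not at link time.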

>>  From the perspective of a programmer working with Clownfish, everything in
>>  a parcel should be available via a single pound-include:
>>
>>      #include "org/apache/lucy.h"
>
> Note that this would possibly pull in megabytes of mostly unused header
> files for every source file. This might result in a noticeable compilation
> slowdown.

Maybe so, but...

*   Lucy is pretty big.
*   You'll almost always want everything from the Clownfish core.
*   Things have been skewed up till now because the test class headers have
    also been included.
*   It may be possible to cut down the size of those headers substantially.

> I think most of the bloat comes from the method wrappers.

I'm pretty sure that's right.  Perhaps the "thunk method wrappers" proposal
can help.

> It shouldn't be a problem if headers are included selectively.

I haven't yet given up on the idea of maintaining multiple major versions and
having one header which points to the latest.  I'm envisioning that
"org/apache/lucy.h" would consist of a single line:

    #include "org/apache/lucy/0/parcel.h"

Things get more complicated if individual files have to specify versions in
the directory structure.  I don't really want to inflict this on users:

    #include "org/apache/lucy/0/search/IndexSearcher.h"

Maybe the approach won't work out, but it's worth a try.

Marvin Humphrey

Re: [lucy-dev] autogen dir

Posted by Nick Wellnhofer <we...@aevum.de>.

On 28/05/2013 06:10, Marvin Humphrey wrote:
> Yes, it's to make our namespacing mechanism more robust.  However, it's not
> that Clownfish should *force* users into lengthy parcel names -- it's that we
> should support namespaces properly and give users a choice.  Even after we
> enable nested parcel names, people can still select simple names, or avoid
> explicit parcels altogether.  Using reversed domain names is only a
> convention.
>
> Prefixes are flawed, because anything more than a few characters results in
> symbols which are unacceptably cumbersome to type, but the limited length
> makes clashes more likely.  For example, when we were considering what
> Clownfish ought to use as a prefix, we had to take into account that "CF" is
> used as a prefix by Apple's "Core Foundation" classes.
>
> We're already committed to providing short name aliases for the sake of
> programmer convenience.  We may as well go one step further and build out
> namespacing which remains just as user-friendly yet is more resistant to
> symbol clashes.
>
> *   Individual symbol aliases, which are typed multiple times within source
>      files, should be short.
> *   Imports, which happen only once per file, may be long.
> *   Real names for symbols may be long, since they can remain hidden behind
>      aliases most of the time.
>
> To support namespaces properly in the Clownfish internals, we need to ensure
> that we don't lock systems into place that depend on prefixes within global
> contexts -- which is why the commit in question drew my attention.  We were
> already doing something similar elsewhere -- the "boot" files e.g.
> "lucy_boot.c" and "lucy_boot.h" -- and I was the one who wrote that code.  But
> those were either a bug or a TODO (take your pick) -- and rather than compound
> the mistake, we should fix it... by differentiating autogenerated files using
> directory structures rather than file name prefixes.

These are all valid points. I'd only like to add that there probably 
won't be that many Clownfish parcels in the foreseeable future. Or, in 
other words, I would be happy if we had so many Clownfish users that 
they would start to complain about symbol clashes ;)

> However, I would like to suggest a tweak.  A common complaint in Java-land is
> that the reverse-domain package naming convention results in too deep a
> directory hierarchy.  The extra depth is not a huge deal for installed files,
> but it's a pain when interacting with source trees.  I think we can solve this
> by having CFC allow .cfp parcel files to establish the namespace for files in
> lower directories:
>
>      // This...
>      $CORE/
>            foo.cfp          // com::example::foo
>      $CORE/foo/
>                MyClass.cfh  // com::example::foo::MyClass
>
>      // Not this...
>      $CORE/com/example/
>                        foo.cfp
>      $CORE/com/example/foo/
>                            MyClass.cfh

+1

> (Aside: We may want to go with '.' instead of '::' ourselves.)

+1

> In terms of alias generation, here's what I think we should be doing:
>
>      #define lucy_Indexer_new org_apache_lucy_Indexer_new

But if we define this unconditionally, doesn't it defeat the purpose of 
fully-qualified names?

> (Another aside: perhaps we should enable short names by default and replace
> `LUCY_USE_SHORT_NAMES` with `LUCY_NO_SHORT_ALIASES`.)

+1

>  From the perspective of a programmer working with Clownfish, everything in a
> parcel should be available via a single pound-include:
>
>      #include "org/apache/lucy.h"

Note that this would possibly pull in megabytes of mostly unused header 
files for every source file. This might result in a noticeable
compilation slowdown.

> FWIW, there's definitely some bloat in those headers.
>
> Also, we could address your concern about embedded C code taking up too much
> space in the headers by generating a file called e.g. "parcel.impl" which gets
> pulled in conditionally:
>
>      #ifdef P_ORG_APACHE_LUCY
>        #include "org/apache/lucy/parcel.impl"
>      #endif

I think most of the bloat comes from the method wrappers. It shouldn't 
be a problem if headers are included selectively.

Nick

Re: [lucy-dev] autogen dir

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Sat, May 25, 2013 at 6:17 AM, Nick Wellnhofer <we...@aevum.de> wrote:
> What's the rationale for fully qualified parcel namespaces exactly? Is it to
> work around possible name clashes of parcel prefixes?

Yes, it's to make our namespacing mechanism more robust.  However, it's not
that Clownfish should *force* users into lengthy parcel names -- it's that we
should support namespaces properly and give users a choice.  Even after we
enable nested parcel names, people can still select simple names, or avoid
explicit parcels altogether.  Using reversed domain names is only a
convention.

Prefixes are flawed, because anything more than a few characters results in
symbols which are unacceptably cumbersome to type, but the limited length
makes clashes more likely.  For example, when we were considering what
Clownfish ought to use as a prefix, we had to take into account that "CF" is
used as a prefix by Apple's "Core Foundation" classes.

We're already committed to providing short name aliases for the sake of
programmer convenience.  We may as well go one step further and build out
namespacing which remains just as user-friendly yet is more resistant to
symbol clashes.

*   Individual symbol aliases, which are typed multiple times within source
    files, should be short.
*   Imports, which happen only once per file, may be long.
*   Real names for symbols may be long, since they can remain hidden behind
    aliases most of the time.

To support namespaces properly in the Clownfish internals, we need to ensure
that we don't lock systems into place that depend on prefixes within global
contexts -- which is why the commit in question drew my attention.  We were
already doing something similar elsewhere -- the "boot" files e.g.
"lucy_boot.c" and "lucy_boot.h" -- and I was the one who wrote that code.  But
those were either a bug or a TODO (take your pick) -- and rather than compound
the mistake, we should fix it... by differentiating autogenerated files using
directory structures rather than file name prefixes.

However, I would like to suggest a tweak.  A common complaint in Java-land is
that the reverse-domain package naming convention results in too deep a
directory hierarchy.  The extra depth is not a huge deal for installed files,
but it's a pain when interacting with source trees.  I think we can solve this
by having CFC allow .cfp parcel files to establish the namespace for files in
lower directories:

    // This...
    $CORE/
          foo.cfp          // com::example::foo
    $CORE/foo/
              MyClass.cfh  // com::example::foo::MyClass

    // Not this...
    $CORE/com/example/
                      foo.cfp
    $CORE/com/example/foo/
                          MyClass.cfh

Inside CFC, we should simplify things by using a single symbol table for both
parcels and classes, so that class names are prefixed by the names of the
parcels they live under.  One consequence is that Clownfish class names will
no longer map one-to-one onto Perl package names, so we'll have to perform
per-host mapping.  But we were going to have to do that anyway for other hosts
like Python, where module names are lowercase by convention and '.' is used as
a package separator instead of the double colon.

    Clownfish:  org::apache::lucy::search::IndexSearcher
    Perl:       Lucy::Search::IndexSearcher
    Python:     lucy.search.IndexSearcher
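
One way the per-host mapping could work for the Python row above, as a sketch only: strip the parcel namespace, substitute a host package name, and turn the remaining `::` separators into `.`. The function `demo_cf_to_python` and its parameters are hypothetical; nothing like it exists in CFC today, and the Perl row would additionally need case changes that this sketch does not attempt.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: map a fully qualified Clownfish class name to a
 * Python-style dotted name.  `parcel_ns` is the parcel namespace to strip
 * and `host_pkg` is the host package that replaces it.  Returns 0 on
 * success, -1 if the name is outside the parcel or `out` is too small. */
int
demo_cf_to_python(const char *cf_name, const char *parcel_ns,
                  const char *host_pkg, char *out, size_t out_size) {
    size_t ns_len = strlen(parcel_ns);
    const char *rest;
    size_t pos;

    /* The class name must start with the parcel namespace plus "::". */
    if (strncmp(cf_name, parcel_ns, ns_len) != 0) { return -1; }
    rest = cf_name + ns_len;
    if (strncmp(rest, "::", 2) != 0) { return -1; }

    /* Start the output with the host package name. */
    pos = strlen(host_pkg);
    if (pos >= out_size) { return -1; }
    memcpy(out, host_pkg, pos);

    /* Copy the remainder, replacing each "::" with a single '.'. */
    while (*rest != '\0') {
        if (pos + 1 >= out_size) { return -1; }  /* keep room for the NUL */
        if (rest[0] == ':' && rest[1] == ':') {
            out[pos++] = '.';
            rest += 2;
        }
        else {
            out[pos++] = *rest++;
        }
    }
    out[pos] = '\0';
    return 0;
}
```

Called with the namespace `org::apache::lucy` and the host package `lucy`, this turns the Clownfish name in the table into the Python one.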

(Aside: We may want to go with '.' instead of '::' ourselves.)

In terms of alias generation, here's what I think we should be doing:

    #define lucy_Indexer_new org_apache_lucy_Indexer_new
    #ifdef LUCY_USE_SHORT_NAMES
        #define Indexer_new org_apache_lucy_Indexer_new
    #endif

(Another aside: perhaps we should enable short names by default and replace
`LUCY_USE_SHORT_NAMES` with `LUCY_NO_SHORT_ALIASES`.)

From the perspective of a programmer working with Clownfish, everything in a
parcel should be available via a single pound-include:

    #include "org/apache/lucy.h"

The programmer then uses the parcel prefix if there may be clashes (as we have
to when programming in files which pound-include most host language C
headers), or uses the short names when there are no conflicts (as we do when
programming in a standard C environment).  There's no difference from today as
far as programming; in our case, we won't have to change any of our search
engine code.  However, instead of being a real symbol, `lucy_Indexer_new`
would be an alias -- just like the much more commonly used `Indexer_new` is
already.
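
Here is a self-contained sketch of the three naming layers working together. The struct layout and the one-argument constructor are simplified stand-ins that I made up for the example (the real `Indexer_new` looks nothing like this); only the alias `#define`s follow the scheme proposed above.

```c
/* A source file opts in to short names before including the header. */
#define LUCY_USE_SHORT_NAMES

/* --- What the generated header might provide (hypothetical stand-in) --- */
typedef struct org_apache_lucy_Indexer {
    int num_docs;
} org_apache_lucy_Indexer;

/* The only real symbol carries the fully qualified name. */
static org_apache_lucy_Indexer
org_apache_lucy_Indexer_new(int num_docs) {
    org_apache_lucy_Indexer self = { num_docs };
    return self;
}

/* Prefixed aliases: always defined. */
#define lucy_Indexer      org_apache_lucy_Indexer
#define lucy_Indexer_new  org_apache_lucy_Indexer_new

/* Short aliases: defined only on request. */
#ifdef LUCY_USE_SHORT_NAMES
  #define Indexer      org_apache_lucy_Indexer
  #define Indexer_new  org_apache_lucy_Indexer_new
#endif

/* --- User code: both spellings resolve to the same symbol. --- */
int
demo_num_docs_via_aliases(void) {
    Indexer      a = Indexer_new(3);       /* short alias    */
    lucy_Indexer b = lucy_Indexer_new(4);  /* prefixed alias */
    return a.num_docs + b.num_docs;
}
```

Existing code written against either `Indexer_new` or `lucy_Indexer_new` would keep compiling unchanged; only the symbol emitted into the object file differs.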

> I also don't want to put more internal stuff in the installed headers. They
> already take up considerable space. Here's an example of the footprint of a
> C library installation on OS X:
>
>     $ du -sch lucy/*
>     6.3M        lucy/include
>     2.4M        lucy/lib
>     372K        lucy/man
>     9.1M        total
>
> It's not really a problem but I find it interesting that the headers take up
> more than two times the space of the binary. They're even more than three
> times the size of the stripped binary.

FWIW, there's definitely some bloat in those headers.

Also, we could address your concern about embedded C code taking up too much
space in the headers by generating a file called e.g. "parcel.impl" which gets
pulled in conditionally:

    #ifdef P_ORG_APACHE_LUCY
      #include "org/apache/lucy/parcel.impl"
    #endif

But these are implementation details.

Marvin Humphrey

Re: [lucy-dev] autogen dir

Posted by Nick Wellnhofer <we...@aevum.de>.

On May 24, 2013, at 03:54 , Marvin Humphrey <ma...@rectangular.com> wrote:

> On Mon, May 20, 2013 at 12:09 PM,  <nw...@apache.org> wrote:
> 
>> Add parcel prefix to parcel.c and parcel.h
>> 
>> The parcel header file must be publicly available, so add parcel prefix
>> to avoid name clashes.
> 
>> Project: http://git-wip-us.apache.org/repos/asf/lucy/repo
>> Commit: http://git-wip-us.apache.org/repos/asf/lucy/commit/fd43a656
>> Tree: http://git-wip-us.apache.org/repos/asf/lucy/tree/fd43a656
>> Diff: http://git-wip-us.apache.org/repos/asf/lucy/diff/fd43a656
> 
>> -    scratch = chaz_Util_join(dir_sep, "autogen", "source", "parcel.c", NULL);
>> +    scratch = chaz_Util_join(dir_sep, "autogen", "source", "lucy_parcel.c",
>> +                             NULL);
> 
> This bugs me because it's not extendable to fully qualified parcel namespaces,
> but it's suggested a tangential idea:

What's the rationale for fully qualified parcel namespaces exactly? Is it to work around possible name clashes of parcel prefixes?

> How about eliminating the "autogen" directory and having all our output go
> into .h files?  Then we can drop them alongside the .cfh files we used to
> generate them.
> 
>  core/Lucy/Search/IndexSearcher.cfh  // in
>  core/Lucy/Search/IndexSearcher.h    // out
> 
> There's a certain amount of tangible .c code that we generate, but maybe we
> can stick that in a .h file and enclose it with ifdefs which only one .c file
> defines.
> 
>    #ifdef C_LUCY_INDEXSEARCHER
>    // ....
>    #endif
> 
> The rationale is to make the connection between the .cfh source files and the
> generated .h files clearer so that it's easier for both newcomers and
> experts to see what's going on.

+1 for creating the per-class .h files next to the .cfh files. It only makes installation of the header files for the C library a bit more complicated.

-1 for moving the generated .c code to the .h files. I don't see a problem with the way we generate the .c files right now. Furthermore, there is some per-parcel C code where there isn't a particular class it belongs to.

I also don't want to put more internal stuff in the installed headers. They already take up considerable space. Here's an example of the footprint of a C library installation on OS X:

    $ du -sch lucy/*
    6.3M	lucy/include
    2.4M	lucy/lib
    372K	lucy/man
    9.1M	total

It's not really a problem but I find it interesting that the headers take up more than two times the space of the binary. They're even more than three times the size of the stripped binary.

Nick