Posted to dev@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2006/06/21 04:38:32 UTC

Dependencies

Greets,

Both of Lucy's present target platforms provide much of the  
functionality missing from C and present in Java -- for instance,  
portable filepath handling.  We could also get that from APR a la  
Lucene4C, but while it may be possible to add a C target to Lucy  
someday that uses APR as its foundation, we don't need to complicate  
the install process by making APR a prerequisite for all targets.

There are a few dependencies I think we should bundle with Lucy:

   * Zlib
   * Snowball stemmers
   * some variant of vsnprintf

While Zlib is provided as part of core Perl and possibly as part of  
all other platforms Lucy might target, bundling it means we don't  
have to call back to the native API should we wish to access it from  
C, as we might if FieldsWriter and/or FieldsReader end up implemented  
in C.

The Snowball stemmers are also available via CPAN; I now maintain  
that distribution (Lingua::Stem::Snowball).  However, other platforms  
probably won't have something like that available, and even within  
the Perl world, bundling Snowball means greater flexibility with  
regards to how Lucy interacts with it.

We need vsnprintf for formatting error messages, which may include  
user-controllable input and which are therefore ripe for buffer  
overflow attack.  There are many variants available -- see  
<http://www.ijs.si/software/snprintf/> for links to a few (some are  
outdated).  We may be able to derive something from APR's  
implementation if we can't find one with a compatible license we can  
just bundle and #include.
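
As a rough illustration of why a bounded formatter matters here, the following sketch wraps vsnprintf in a small helper.  The name format_error and the buffer size are assumptions made up for this example, not anything from Lucy, and the sketch assumes a vsnprintf with C99 semantics (the output is always NUL-terminated and the return value is the untruncated length).

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper, not part of Lucy: format an error message into a
 * fixed-size buffer.  vsnprintf writes at most `size` bytes, including
 * the terminating NUL, so oversized input is truncated instead of
 * overflowing the buffer.  Returns the length the complete message
 * would have had, or a negative value on a formatting error. */
int
format_error(char *buf, size_t size, const char *pattern, ...)
{
    va_list args;
    int     needed;

    assert(buf != NULL && size > 0);

    va_start(args, pattern);
    needed = vsnprintf(buf, size, pattern, args);
    va_end(args);

    return needed;
}
```

A caller can compare the return value against the buffer size to detect truncation and, if it cares, retry with a larger buffer.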

If those are our only external dependencies, that implies we'll be  
building a lot from scratch.  Here are some of the utilities we'll  
need to code up:

   * hashtable
   * priority queue
   * byte buffer (an array of bytes that knows its own length)
   * bit vector
   * external sort
   * C test harness
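
To give a sense of scale for these utilities, here is a minimal sketch of a bit vector.  The BitVector type and the BitVec_* function names are hypothetical, chosen only for this example; a real implementation would probably also want growth, counting, and bulk operations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical names; a sketch, not an actual Lucy API. */
typedef struct BitVector {
    unsigned char *bits;      /* packed bits, 8 per byte */
    size_t         capacity;  /* number of addressable bits */
} BitVector;

/* Allocate a vector with all bits cleared.  Returns NULL on failure. */
BitVector*
BitVec_new(size_t capacity)
{
    size_t     num_bytes = (capacity + 7) / 8;
    BitVector *self      = malloc(sizeof(BitVector));
    if (self == NULL) { return NULL; }
    self->bits = calloc(num_bytes ? num_bytes : 1, 1);
    if (self->bits == NULL) { free(self); return NULL; }
    self->capacity = capacity;
    return self;
}

/* Set the bit at position `num`. */
void
BitVec_set(BitVector *self, size_t num)
{
    assert(num < self->capacity);
    self->bits[num >> 3] |= (unsigned char)(1u << (num & 7));
}

/* Return 1 if the bit at position `num` is set, 0 otherwise. */
int
BitVec_get(const BitVector *self, size_t num)
{
    assert(num < self->capacity);
    return (self->bits[num >> 3] >> (num & 7)) & 1;
}

/* Release the vector and its storage. */
void
BitVec_destroy(BitVector *self)
{
    if (self != NULL) {
        free(self->bits);
        free(self);
    }
}
```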

How does that sound?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


Re: Dependencies

Posted by David Balmain <db...@gmail.com>.
On 6/21/06, Marvin Humphrey <ma...@rectangular.com> wrote:
>
> On Jun 20, 2006, at 11:29 PM, David Balmain wrote:
>
> > This seems like a lot to bundle to me, when like you said, it will
> > probably be available on all platforms that Lucy might target. I don't
> > see the problem with calling back to the native API. We are going to
> > have to provide call-backs for things like memory allocation and
> > exception handling so I don't think an extra inflate and deflate
> > callback is going to hurt. But if you feel strongly about this I'm not
> > too fussed.
>
> I could be persuaded.  Callbacks to Perl from C are verbose and kind
> of hard to get your head around, so I have something of an
> instinctive aversion to them.
>
> Here's a wrapper for compress() in Perl:
>
>    use Compress::Zlib qw( compress );
>
>    sub compress_it {
>        my $input = shift;
>        return compress($input);
>    }
>
> ... and here's the equivalent function rendered in XS, calling back
> to Perl..
>
> SV*
> compress_it(SV *input)
> {
>      SV *retval;
>
>      dSP;          /* declare stack pointer */
>      ENTER;        /* opening bracket for a callback */
>      SAVETMPS;     /* opening bracket for temporaries */
>      PUSHMARK(SP); /* start arg stack */
>      XPUSHs( sv_2mortal( newSVsv(input) ) ); /* pass copy of input */
>      PUTBACK;      /* close arg stack */
>      /* invoke compress() */
>      call_pv("Compress::Zlib::compress", G_SCALAR);
>      SPAGAIN;      /* refresh local stack pointer after the call */
>      retval = newSVsv(POPs); /* copy result before temporaries are freed */
>      PUTBACK;      /* sync global stack pointer */
>      FREETMPS;     /* closing bracket for temporaries */
>      LEAVE;        /* closing bracket for a callback */
>
>      return sv_2mortal(retval); /* mortalize in the caller's scope */
> }
>
> Untested.  ;)

Ok, so it's a little easier for me. But it's only two methods,
compress and decompress. I don't think it will be too hard to do.

> <snip>
> > Do you plan on
> > doing any other analysis at the C level or do you just want to make
> > the SnowBall parser available in the target API?
>
> I think we'll want to render TokenBatch in C and make the Snowball
> Stemmer able to act on the TokenBatch's member strings directly.  If
> I have to call back to Perl, I'll have to wrap token text in a Perl
> scalar then recover it back into the TokenBatch, and it won't be as
> efficient -- more copy ops.
>
> I'm not sure how the other Analyzers will be implemented.

The reason I asked is that if we were going to implement Analyzers at
the C level then we are going to have to worry about character
encoding. Unfortunately this is a reality for me in Ferret since there
is no way to lowercase UTF-8 strings in Ruby yet. (Hopefully UTF-8
support will be coming soon). I think for Lucy we should probably
leave the analysis to the target language, at least to start with.

> > What is the byte buffer for in particular?
>
> KinoSearch does a lot of serialization and deserialization.  It's
> really handy to have a string that knows its own length when you're
> doing stuff like concat and truncate ops all the time.
>
> The external sorter takes ByteBuffers as its args.  Without that it
> would have to take char* and a string length at the same time.  It
> would get really messy.  Think qsort with strings that may contain
> null bytes.

Ok, that makes sense.

Re: Dependencies

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Jun 20, 2006, at 11:29 PM, David Balmain wrote:

> This seems like a lot to bundle to me, when like you said, it will
> probably be available on all platforms that Lucy might target. I don't
> see the problem with calling back to the native API. We are going to
> have to provide call-backs for things like memory allocation and
> exception handling so I don't think an extra inflate and deflate
> callback is going to hurt. But if you feel strongly about this I'm not
> too fussed.

I could be persuaded.  Callbacks to Perl from C are verbose and kind  
of hard to get your head around, so I have something of an  
instinctive aversion to them.

Here's a wrapper for compress() in Perl:

   use Compress::Zlib qw( compress );

   sub compress_it {
       my $input = shift;
       return compress($input);
   }

... and here's the equivalent function rendered in XS, calling back  
to Perl..

SV*
compress_it(SV *input)
{
     SV *retval;

     dSP;          /* declare stack pointer */
     ENTER;        /* opening bracket for a callback */
     SAVETMPS;     /* opening bracket for temporaries */
     PUSHMARK(SP); /* start arg stack */
     XPUSHs( sv_2mortal( newSVsv(input) ) ); /* pass copy of input */
     PUTBACK;      /* close arg stack */
     /* invoke compress() */
     call_pv("Compress::Zlib::compress", G_SCALAR);
     SPAGAIN;      /* refresh local stack pointer after the call */
     retval = newSVsv(POPs); /* copy result before temporaries are freed */
     PUTBACK;      /* sync global stack pointer */
     FREETMPS;     /* closing bracket for temporaries */
     LEAVE;        /* closing bracket for a callback */

     return sv_2mortal(retval); /* mortalize in the caller's scope */
}

Untested.  ;)

Larry Wall apparently has said that XS wasn't any simpler because it  
had to be efficient...

>
>> The Snowball stemmers are also available via CPAN; I now maintain
>> that distribution (Lingua::Stem::Snowball).  However, other platforms
>> probably won't have something like that available, and even within
>> the Perl world, bundling Snowball means greater flexibility with
>> regards to how Lucy interacts with it.
>
> This I agree with. I've bundled it with Ferret. I've also bundled the
> lists of stopwords from http://snowball.tartarus.org/.

Yeah, I didn't mention that, but same thing: we should bundle it.   
And same thing, I maintain the CPAN distro Lingua::StopWords, but  
other targets wouldn't have access to something like that.

> Do you plan on
> doing any other analysis at the C level or do you just want to make
> the SnowBall parser available in the target API?

I think we'll want to render TokenBatch in C and make the Snowball  
Stemmer able to act on the TokenBatch's member strings directly.  If  
I have to call back to Perl, I'll have to wrap token text in a Perl  
scalar then recover it back into the TokenBatch, and it won't be as  
efficient -- more copy ops.

I'm not sure how the other Analyzers will be implemented.

>> We need vsnprintf for formatting error messages, which may include
>> user-controllable input and which are therefore ripe for buffer
>> overflow attack.  There are many variants available -- see
>> <http://www.ijs.si/software/snprintf/> for links to a few (some are
>> outdated).  We may be able to derive something from APR's
>> implementation if we can't find one with a compatible license we can
>> just bundle and #include.
>
> I think I'd rather derive something from APR's implementation.

OK, that's cool.  This is another example of something that Perl  
supplied so I didn't have to.

> I've done all these before bar the external sort.

The external sort, I have nailed.  I've been honing that sucker for a  
while now.

> What is the byte buffer for in particular?

KinoSearch does a lot of serialization and deserialization.  It's  
really handy to have a string that knows its own length when you're  
doing stuff like concat and truncate ops all the time.

The external sorter takes ByteBuffers as its args.  Without that it  
would have to take char* and a string length at the same time.  It  
would get really messy.  Think qsort with strings that may contain  
null bytes.
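
To make the null-byte point concrete, here is a minimal sketch of a length-aware buffer and a qsort comparator for it.  The ByteBuf type and function names are hypothetical stand-ins for this example, not KinoSearch's or Lucy's actual definitions.  Because the length travels with the pointer, the comparator can use memcmp and never has to rely on a NUL terminator, which strcmp would.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a ByteBuffer: bytes plus an explicit length. */
typedef struct ByteBuf {
    const char *ptr;  /* bytes; may contain embedded NULs */
    size_t      len;  /* number of valid bytes at ptr */
} ByteBuf;

/* qsort-compatible comparator: bytewise order, shorter buffer first
 * when one buffer is a prefix of the other. */
int
ByteBuf_compare(const void *va, const void *vb)
{
    const ByteBuf *a       = (const ByteBuf*)va;
    const ByteBuf *b       = (const ByteBuf*)vb;
    size_t         min_len = a->len < b->len ? a->len : b->len;
    int            cmp;

    assert(a->ptr != NULL && b->ptr != NULL);

    cmp = memcmp(a->ptr, b->ptr, min_len);
    if (cmp != 0)        { return cmp; }
    if (a->len < b->len) { return -1; }
    if (a->len > b->len) { return 1; }
    return 0;
}

/* Sort an array of buffers in place. */
void
ByteBuf_sort(ByteBuf *bufs, size_t num_bufs)
{
    qsort(bufs, num_bufs, sizeof(ByteBuf), ByteBuf_compare);
}
```

With plain char* arguments, the comparator would need the two lengths passed in through some side channel, which qsort's two-argument callback has no room for.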

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


Re: Dependencies

Posted by David Balmain <db...@gmail.com>.
On 6/21/06, Marvin Humphrey <ma...@rectangular.com> wrote:
> Greets,
>
> Both of Lucy's present target platforms provide much of the
> functionality missing from C and present in Java -- for instance,
> portable filepath handling.  We could also get that from APR a la
> Lucene4C, but while it may be possible to add a C target to Lucy
> someday that uses APR as its foundation, we don't need to complicate
> the install process by making APR a prerequisite for all targets.
>
> There are a few dependencies I think we should bundle with Lucy:
>
>    * Zlib
>    * Snowball stemmers
>    * some variant of vsnprintf
>
> While Zlib is provided as part of core Perl and possibly as part of
> all other platforms Lucy might target, bundling it means we don't
> have to call back to the native API should we wish to access it from
> C, as we might if FieldsWriter and/or FieldsReader end up implemented
> in C.

This seems like a lot to bundle to me, when like you said, it will
probably be available on all platforms that Lucy might target. I don't
see the problem with calling back to the native API. We are going to
have to provide call-backs for things like memory allocation and
exception handling so I don't think an extra inflate and deflate
callback is going to hurt. But if you feel strongly about this I'm not
too fussed.

> The Snowball stemmers are also available via CPAN; I now maintain
> that distribution (Lingua::Stem::Snowball).  However, other platforms
> probably won't have something like that available, and even within
> the Perl world, bundling Snowball means greater flexibility with
> regards to how Lucy interacts with it.

This I agree with. I've bundled it with Ferret. I've also bundled the
lists of stopwords from http://snowball.tartarus.org/. Do you plan on
doing any other analysis at the C level or do you just want to make
the SnowBall parser available in the target API?

> We need vsnprintf for formatting error messages, which may include
> user-controllable input and which are therefore ripe for buffer
> overflow attack.  There are many variants available -- see
> <http://www.ijs.si/software/snprintf/> for links to a few (some are
> outdated).  We may be able to derive something from APR's
> implementation if we can't find one with a compatible license we can
> just bundle and #include.

I think I'd rather derive something from APR's implementation.

> If those are our only external dependencies, that implies we'll be
> building a lot from scratch.  Here are some of the utilities we'll
> need to code up:

>    * hashtable
>    * priority queue
>    * byte buffer (an array of bytes that knows its own length)
>    * bit vector
>    * external sort
>    * C test harness
>
> How does that sound?

Sounds good to me. I've done all these before bar the external sort.
What is the byte buffer for in particular?

Cheers,
Dave

PS: Any progress with the test harness? Would you like me to do it?

Re: Dependencies

Posted by David Balmain <db...@gmail.com>.
On 6/21/06, Yen-Ju Chen <yj...@gmail.com> wrote:
> On 6/20/06, Marvin Humphrey <ma...@rectangular.com> wrote:
> > Greets,
> >
> [snip]
> >
> > If those are our only external dependencies, that implies we'll be
> > building a lot from scratch.  Here are some of the utilities we'll
> > need to code up:
>
>   I did some Googling and hope these links are helpful.
>   I have never used them before.
>   But if both the license and algorithm fit Lucy,
>   it can save time by not reinventing the wheel.
>
> >    * hashtable
>
>   http://www.jeannot.org/~js/code/index.en.html#MapKit
>
> >    * priority queue
>
>   http://www.hpcf.upr.edu/~humberto/software/EPQ/report.html
>
> >    * byte buffer (an array of bytes that knows its own length)
> >    * bit vector
>
>   http://www.csd.uwo.ca/%7ejamie/BitVectors/SeeAlso.html
>   Although it is a Perl module, the C core can be used stand-alone.
>   (See the local-copy announcement).
>
> >    * external sort
> >    * C test harness
>
>   There are two C unit tests: http://sastools.com/b2/post/79394064
>
>   Yen-Ju

Thanks Yen-Ju,

I think we've already reinvented the wheel in our own projects.
These modules are easy enough to implement that the benefits
of having a homogeneous codebase outweigh the cost of having to
reimplement them. Zlib is a good example of something that we wouldn't
do ourselves. Hash tables? Piece of cake.

Cheers,
Dave

Re: Dependencies

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Jun 20, 2006, at 10:52 PM, Yen-Ju Chen wrote:
>  I did some Googling and hope these links are helpful.
>  I have never used them before.
>  But if both the license and algorithm fit Lucy,
>  it can save time by not reinventing the wheel.

>>    * hashtable
>
>  http://www.jeannot.org/~js/code/index.en.html#MapKit

It's funny that Dave thinks the hashtable is the piece of cake.  This was  
the link I found most interesting, as I've written hashtables before  
but can't say I enjoyed the calisthenics.  I also did not write a  
hashtable for KinoSearch, since I was able to use Perl's hashes from  
C via the XS API.

As for the others, we both have tested implementations of most  
everything, since we needed those for our own C code.  We won't  
really be reinventing the wheel, we'll be rotating the tires.  :)

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


Re: Dependencies

Posted by Yen-Ju Chen <yj...@gmail.com>.
On 6/20/06, Marvin Humphrey <ma...@rectangular.com> wrote:
> Greets,
>
[snip]
>
> If those are our only external dependencies, that implies we'll be
> building a lot from scratch.  Here are some of the utilities we'll
> need to code up:

  I did some Googling and hope these links are helpful.
  I have never used them before.
  But if both the license and algorithm fit Lucy,
  it can save time by not reinventing the wheel.

>    * hashtable

  http://www.jeannot.org/~js/code/index.en.html#MapKit

>    * priority queue

  http://www.hpcf.upr.edu/~humberto/software/EPQ/report.html

>    * byte buffer (an array of bytes that knows its own length)
>    * bit vector

  http://www.csd.uwo.ca/%7ejamie/BitVectors/SeeAlso.html
  Although it is a Perl module, the C core can be used stand-alone.
  (See the local-copy announcement).

>    * external sort
>    * C test harness

  There are two C unit tests: http://sastools.com/b2/post/79394064

  Yen-Ju

>
> How does that sound?
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
>
>