Posted to dev@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2006/06/21 04:38:32 UTC
Dependencies
Greets,
Both of Lucy's present target platforms provide much of the
functionality missing from C and present in Java -- for instance,
portable filepath handling. We could also get that from APR a la
Lucene4C, but while it may be possible to add a C target to Lucy
someday that uses APR as its foundation, we don't need to complicate
the install process by making APR a prerequisite for all targets.
There are a few dependencies I think we should bundle with Lucy:
* Zlib
* Snowball stemmers
* some variant of vsnprintf
While Zlib is provided as part of core Perl and possibly as part of
all other platforms Lucy might target, bundling it means we don't
have to call back to the native API should we wish to access it from
C, as we might if FieldsWriter and/or FieldsReader end up implemented
in C.
The Snowball stemmers are also available via CPAN; I now maintain
that distribution (Lingua::Stem::Snowball). However, other platforms
probably won't have something like that available, and even within
the Perl world, bundling Snowball means greater flexibility with
regards to how Lucy interacts with it.
We need vsnprintf for formatting error messages, which may include
user-controllable input and which are therefore ripe for buffer
overflow attack. There are many variants available -- see <http://
www.ijs.si/software/snprintf/> for links to a few (some are
outdated). We may be able to derive something from APR's
implementation if we can't find one with a compatible license we can
just bundle and #include.
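To illustrate the buffer overflow concern, here is a minimal sketch of bounded error formatting. It assumes a C99-conforming vsnprintf (one that returns the length the full message would have needed); the function name format_error is hypothetical and not part of any Lucy API.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Format an error message into a fixed-size buffer.  vsnprintf never
 * writes more than `size` bytes (including the terminating NUL), so
 * oversized user input is truncated instead of overflowing `buf`.
 * Returns 1 if the message was truncated, 0 if it fit, -1 on error. */
static int
format_error(char *buf, size_t size, const char *pattern, ...)
{
    va_list args;
    int     needed;

    assert(buf != NULL && size > 0);

    va_start(args, pattern);
    needed = vsnprintf(buf, size, pattern, args);
    va_end(args);

    if (needed < 0) {
        buf[0] = '\0';      /* encoding error: leave an empty message */
        return -1;
    }
    return (size_t)needed >= size ? 1 : 0;
}
```

Pre-C99 implementations differ in what they return on truncation, which is one reason a bundled variant with known behavior is attractive.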
If those are our only external dependencies, that implies we'll be
building a lot from scratch. Here are some of the utilities we'll
need to code up:
* hashtable
* priority queue
* byte buffer (an array of bytes that knows its own length)
* bit vector
* external sort
* C test harness
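To give a sense of scale for the smaller items on this list, here is a minimal sketch of a fixed-capacity bit vector. The BitVector type and the BitVec_* names are hypothetical, chosen only for illustration; a real implementation would probably also need to grow on demand.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

typedef struct BitVector {
    unsigned char *bits;      /* zero-initialized storage */
    size_t         capacity;  /* number of addressable bits */
} BitVector;

/* Allocate a vector with all bits cleared; returns NULL on failure. */
static BitVector*
BitVec_new(size_t capacity)
{
    size_t     num_bytes = (capacity + CHAR_BIT - 1) / CHAR_BIT;
    BitVector *self      = malloc(sizeof(BitVector));
    if (self == NULL) { return NULL; }
    if (num_bytes == 0) { num_bytes = 1; }   /* avoid a zero-size calloc */
    self->bits = calloc(num_bytes, 1);
    if (self->bits == NULL) { free(self); return NULL; }
    self->capacity = capacity;
    return self;
}

static void
BitVec_set(BitVector *self, size_t tick)
{
    assert(tick < self->capacity);
    self->bits[tick / CHAR_BIT] |= (unsigned char)(1u << (tick % CHAR_BIT));
}

static void
BitVec_clear(BitVector *self, size_t tick)
{
    assert(tick < self->capacity);
    self->bits[tick / CHAR_BIT] &= (unsigned char)~(1u << (tick % CHAR_BIT));
}

/* Out-of-range bits read as 0 rather than triggering an error. */
static int
BitVec_get(const BitVector *self, size_t tick)
{
    if (tick >= self->capacity) { return 0; }
    return (self->bits[tick / CHAR_BIT] >> (tick % CHAR_BIT)) & 1;
}

static void
BitVec_destroy(BitVector *self)
{
    if (self != NULL) {
        free(self->bits);
        free(self);
    }
}
```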
How does that sound?
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Re: Dependencies
Posted by David Balmain <db...@gmail.com>.
On 6/21/06, Marvin Humphrey <ma...@rectangular.com> wrote:
>
> On Jun 20, 2006, at 11:29 PM, David Balmain wrote:
>
> > This seems like a lot to bundle to me, when like you said, it will
> > probably be available on all platforms that Lucy might target. I don't
> > see the problem with calling back to the native API. We are going to
> > have to provide call-backs for things like memory allocation and
> > exception handling so I don't think an extra inflate and deflate
> > callback is going to hurt. But if you feel strongly about this I'm not
> too fussed.
>
> I could be persuaded. Callbacks to Perl from C are verbose and kind
> of hard to get your head around, so I have something of an
> instinctive aversion to them.
>
> Here's a wrapper for compress() in Perl:
>
>     use Compress::Zlib qw( compress );
>
>     sub compress_it {
>         my $input = shift;
>         return compress($input);
>     }
>
> ... and here's the equivalent function rendered in XS, calling back
> to Perl..
>
> SV*
> compress_it(SV *input)
> {
>     SV *result;
>     SV *retval;
>
>     dSP;                                    /* declare stack pointer */
>     ENTER;                                  /* opening bracket for a callback */
>     SAVETMPS;                               /* opening bracket for temporaries */
>     PUSHMARK(SP);                           /* start arg stack */
>     XPUSHs( sv_2mortal( newSVsv(input) ) ); /* pass copy of input */
>     PUTBACK;                                /* close arg stack */
>     call_pv("Compress::Zlib::compress", G_SCALAR); /* invoke compress() */
>     SPAGAIN;                                /* refresh local stack pointer */
>     result = POPs;                          /* pop the single return value */
>     retval = newSVsv(result);               /* copy it before temporaries are freed */
>     PUTBACK;                                /* sync global stack pointer */
>     FREETMPS;                               /* closing bracket for temporaries */
>     LEAVE;                                  /* closing bracket for a callback */
>
>     return retval;                          /* caller owns the new SV */
> }
>
> Untested. ;)
Ok, so it's a little easier for me. But it's only two methods,
compress and decompress. I don't think it will be too hard to do.
> <snip>
> > Do you plan on
> > doing any other analysis at the C level or do you just want to make
> > the SnowBall parser available in the target API?
>
> I think we'll want to render TokenBatch in C and make the Snowball
> Stemmer able to act on the TokenBatch's member strings directly. If
> I have to call back to Perl, I'll have to wrap token text in a Perl
> scalar then recover it back into the TokenBatch, and it won't be as
> efficient -- more copy ops.
>
> I'm not sure how the other Analyzers will be implemented.
The reason I asked is that if we were going to implement Analyzers at
the C level then we are going to have to worry about character
encoding. Unfortunately this is a reality for me in Ferret since there
is no way to lowercase utf-8 strings in Ruby yet. (Hopefully utf-8
support will be coming soon). I think for Lucy we should probably
leave the analysis to the target language, at least to start with.
> > What is the byte buffer for in particular?
>
> KinoSearch does a lot of serialization and deserialization. It's
> really handy to have a string that knows its own length when you're
> doing stuff like concat and truncate ops all the time.
>
> The external sorter takes ByteBuffers as its args. Without that it
> would have to take char* and a string length at the same time. It
> would get really messy. Think qsort with strings that may contain
> null bytes.
Ok, that makes sense.
Re: Dependencies
Posted by Marvin Humphrey <ma...@rectangular.com>.
On Jun 20, 2006, at 11:29 PM, David Balmain wrote:
> This seems like a lot to bundle to me, when like you said, it will
> probably be available on all platforms that Lucy might target. I don't
> see the problem with calling back to the native API. We are going to
> have to provide call-backs for things like memory allocation and
> exception handling so I don't think an extra inflate and deflate
> callback is going to hurt. But if you feel strongly about this I'm not
> too fussed.
I could be persuaded. Callbacks to Perl from C are verbose and kind
of hard to get your head around, so I have something of an
instinctive aversion to them.
Here's a wrapper for compress() in Perl:
    use Compress::Zlib qw( compress );

    sub compress_it {
        my $input = shift;
        return compress($input);
    }
... and here's the equivalent function rendered in XS, calling back
to Perl..
    SV*
    compress_it(SV *input)
    {
        SV *result;
        SV *retval;

        dSP;                                    /* declare stack pointer */
        ENTER;                                  /* opening bracket for a callback */
        SAVETMPS;                               /* opening bracket for temporaries */
        PUSHMARK(SP);                           /* start arg stack */
        XPUSHs( sv_2mortal( newSVsv(input) ) ); /* pass copy of input */
        PUTBACK;                                /* close arg stack */
        call_pv("Compress::Zlib::compress", G_SCALAR); /* invoke compress() */
        SPAGAIN;                                /* refresh local stack pointer */
        result = POPs;                          /* pop the single return value */
        retval = newSVsv(result);               /* copy it before temporaries are freed */
        PUTBACK;                                /* sync global stack pointer */
        FREETMPS;                               /* closing bracket for temporaries */
        LEAVE;                                  /* closing bracket for a callback */

        return retval;                          /* caller owns the new SV */
    }
Untested. ;)
Larry Wall apparently has said that XS wasn't any simpler because it
had to be efficient...
>
>> The Snowball stemmers are also available via CPAN; I now maintain
>> that distribution (Lingua::Stem::Snowball). However, other platforms
>> probably won't have something like that available, and even within
>> the Perl world, bundling Snowball means greater flexibility with
>> regards to how Lucy interacts with it.
>
> This I agree with. I've bundled it with Ferret. I've also bundled the
> lists of stopwords from http://snowball.tartarus.org/.
Yeah, I didn't mention that, but same thing: we should bundle it.
And same thing, I maintain the CPAN distro Lingua::StopWords, but
other targets wouldn't have access to something like that.
> Do you plan on
> doing any other analysis at the C level or do you just want to make
> the SnowBall parser available in the target API?
I think we'll want to render TokenBatch in C and make the Snowball
Stemmer able to act on the TokenBatch's member strings directly. If
I have to call back to Perl, I'll have to wrap token text in a Perl
scalar then recover it back into the TokenBatch, and it won't be as
efficient -- more copy ops.
I'm not sure how the other Analyzers will be implemented.
>> We need vsnprintf for formatting error messages, which may include
>> user-controllable input and which are therefore ripe for buffer
>> overflow attack. There are many variants available -- see <http://
>> www.ijs.si/software/snprintf/> for links to a few (some are
>> outdated). We may be able to derive something from APR's
>> implementation if we can't find one with a compatible license we can
>> just bundle and #include.
>
> I think I'd rather derive something from APR's implementation.
OK, that's cool. This is another example of something that Perl
supplied so I didn't have to.
> I've done all these before bar the external sort.
The external sort, I have nailed. I've been honing that sucker for a
while now.
> What is the byte buffer for in particular?
KinoSearch does a lot of serialization and deserialization. It's
really handy to have a string that knows its own length when you're
doing stuff like concat and truncate ops all the time.
The external sorter takes ByteBuffers as its args. Without that it
would have to take char* and a string length at the same time. It
would get really messy. Think qsort with strings that may contain
null bytes.
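The qsort point can be sketched roughly as follows. This is a minimal illustration, not KinoSearch's actual ByteBuf; the struct layout and the BB_* names are assumptions. Because each element carries its own length, the comparator can use memcmp, which does not stop at embedded null bytes the way strcmp would.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct ByteBuf {
    char   *ptr;   /* bytes; may contain embedded NULs */
    size_t  len;   /* number of valid bytes */
} ByteBuf;

/* Copy `len` bytes into a new buffer; returns NULL on failure. */
static ByteBuf*
BB_new(const char *bytes, size_t len)
{
    ByteBuf *self = malloc(sizeof(ByteBuf));
    if (self == NULL) { return NULL; }
    self->ptr = malloc(len + 1);
    if (self->ptr == NULL) { free(self); return NULL; }
    if (len > 0) { memcpy(self->ptr, bytes, len); }
    self->ptr[len] = '\0';   /* convenience terminator, not counted in len */
    self->len = len;
    return self;
}

static void
BB_destroy(ByteBuf *self)
{
    if (self != NULL) {
        free(self->ptr);
        free(self);
    }
}

/* qsort comparator for an array of ByteBuf pointers.  Compares the
 * shared prefix bytewise; on a tie, the shorter buffer sorts first. */
static int
BB_compare(const void *va, const void *vb)
{
    const ByteBuf *a = *(const ByteBuf *const *)va;
    const ByteBuf *b = *(const ByteBuf *const *)vb;
    size_t min_len = a->len < b->len ? a->len : b->len;
    int    cmp     = min_len > 0 ? memcmp(a->ptr, b->ptr, min_len) : 0;
    if (cmp != 0) { return cmp; }
    return (a->len > b->len) - (a->len < b->len);
}
```

With bare char* arguments, the lengths would have to travel in a parallel array or a wrapper struct anyway, which is the messiness described above.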
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Re: Dependencies
Posted by David Balmain <db...@gmail.com>.
On 6/21/06, Marvin Humphrey <ma...@rectangular.com> wrote:
> Greets,
>
> Both of Lucy's present target platforms provide much of the
> functionality missing from C and present in Java -- for instance,
> portable filepath handling. We could also get that from APR a la
> Lucene4C, but while it may be possible to add a C target to Lucy
> someday that uses APR as its foundation, we don't need to complicate
> the install process by making APR a prerequisite for all targets.
>
> There are a few dependencies I think we should bundle with Lucy:
>
> * Zlib
> * Snowball stemmers
> * some variant of vsnprintf
>
> While Zlib is provided as part of core Perl and possibly as part of
> all other platforms Lucy might target, bundling it means we don't
> have to call back to the native API should we wish to access it from
> C, as we might if FieldsWriter and/or FieldsReader end up implemented
> in C.
This seems like a lot to bundle to me, when like you said, it will
probably be available on all platforms that Lucy might target. I don't
see the problem with calling back to the native API. We are going to
have to provide call-backs for things like memory allocation and
exception handling so I don't think an extra inflate and deflate
callback is going to hurt. But if you feel strongly about this I'm not
too fussed.
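One way the callback approach could look from the C side is a table of function pointers that each host binding fills in at load time. Everything below is a hypothetical sketch: HostCallbacks, Host_register, Core_deflate and the stub codec are invented for illustration, and the stub merely copies bytes where a real binding would call the host's zlib.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Codec signature: write into dest, return bytes written, 0 on failure. */
typedef size_t (*codec_fn)(const char *src, size_t src_len,
                           char *dest, size_t dest_cap);

/* Services the C core expects the host language binding to supply. */
typedef struct HostCallbacks {
    void    *(*alloc)(size_t size);
    void     (*release)(void *ptr);
    codec_fn   deflate;
    codec_fn   inflate;
} HostCallbacks;

static const HostCallbacks *host = NULL;

/* Called once by the binding before any core function is used. */
static void
Host_register(const HostCallbacks *callbacks)
{
    assert(callbacks != NULL);
    assert(callbacks->alloc && callbacks->release);
    assert(callbacks->deflate && callbacks->inflate);
    host = callbacks;
}

/* Core code compresses by calling back through the table. */
static size_t
Core_deflate(const char *src, size_t src_len, char *dest, size_t dest_cap)
{
    assert(host != NULL);
    return host->deflate(src, src_len, dest, dest_cap);
}

/* Stand-in codec that copies bytes unchanged (no real compression). */
static size_t
stub_copy(const char *src, size_t src_len, char *dest, size_t dest_cap)
{
    if (src_len > dest_cap) { return 0; }
    memcpy(dest, src, src_len);
    return src_len;
}

static const HostCallbacks stub_host = { malloc, free, stub_copy, stub_copy };
```

The cost is one indirect call per operation plus whatever marshalling the host side needs, which is where the verbosity of calling back into Perl comes in.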
> The Snowball stemmers are also available via CPAN; I now maintain
> that distribution (Lingua::Stem::Snowball). However, other platforms
> probably won't have something like that available, and even within
> the Perl world, bundling Snowball means greater flexibility with
> regards to how Lucy interacts with it.
This I agree with. I've bundled it with Ferret. I've also bundled the
lists of stopwords from http://snowball.tartarus.org/. Do you plan on
doing any other analysis at the C level or do you just want to make
the SnowBall parser available in the target API?
> We need vsnprintf for formatting error messages, which may include
> user-controllable input and which are therefore ripe for buffer
> overflow attack. There are many variants available -- see <http://
> www.ijs.si/software/snprintf/> for links to a few (some are
> outdated). We may be able to derive something from APR's
> implementation if we can't find one with a compatible license we can
> just bundle and #include.
I think I'd rather derive something from APR's implementation.
> If those are our only external dependencies, that implies we'll be
> building a lot from scratch. Here are some of the utilities we'll
> need to code up:
> * hashtable
> * priority queue
> * byte buffer (an array of bytes that knows its own length)
> * bit vector
> * external sort
> * C test harness
>
> How does that sound?
Sounds good to me. I've done all these before bar the external sort.
What is the byte buffer for in particular?
Cheers,
Dave
PS: Any progress with the test harness? Would you like me to do it?
Re: Dependencies
Posted by David Balmain <db...@gmail.com>.
On 6/21/06, Yen-Ju Chen <yj...@gmail.com> wrote:
> On 6/20/06, Marvin Humphrey <ma...@rectangular.com> wrote:
> > Greets,
> >
> [snip]
> >
> > If those are our only external dependencies, that implies we'll be
> > building a lot from scratch. Here are some of the utilities we'll
> > need to code up:
>
> I did some Googling and hope these links are helpful.
> I have never used them before.
> But if both the license and algorithm fit Lucy,
> they can save time by not reinventing the wheel.
>
> > * hashtable
>
> http://www.jeannot.org/~js/code/index.en.html#MapKit
>
> > * priority queue
>
> http://www.hpcf.upr.edu/~humberto/software/EPQ/report.html
>
> > * byte buffer (an array of bytes that knows its own length)
> > * bit vector
>
> http://www.csd.uwo.ca/%7ejamie/BitVectors/SeeAlso.html
> Although it is a Perl module, the C core can be used stand-alone.
> (See the local-copy announcement).
>
> > * external sort
> > * C test harness
>
> There are two C unit tests: http://sastools.com/b2/post/79394064
>
> Yen-Ju
Thanks Yen-Ju,
I think we've already reinvented the wheel in our own projects.
These modules are easy enough to implement that the benefits of
having a homogeneous codebase outweigh the cost of having to
reimplement them. Zlib is a good example of something that we wouldn't
do ourselves. Hash tables? Piece of cake.
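For a sense of how small a basic hash table can be, here is a minimal sketch using separate chaining with string keys and int values. The Hash type and the Hash_* names are hypothetical, the bucket count is fixed for brevity, and the hash function is 32-bit FNV-1a; none of this is taken from Ferret or KinoSearch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define HASH_NUM_BUCKETS 64   /* fixed size; a real table would grow */

typedef struct HashEntry {
    char             *key;
    int               value;
    struct HashEntry *next;   /* next entry in the same bucket */
} HashEntry;

typedef struct Hash {
    HashEntry *buckets[HASH_NUM_BUCKETS];
} Hash;

/* 32-bit FNV-1a over a NUL-terminated key. */
static uint32_t
hash_string(const char *key)
{
    uint32_t h = 2166136261u;
    while (*key != '\0') {
        h ^= (unsigned char)*key++;
        h *= 16777619u;
    }
    return h;
}

static Hash*
Hash_new(void)
{
    size_t i;
    Hash  *self = malloc(sizeof(Hash));
    if (self == NULL) { return NULL; }
    for (i = 0; i < HASH_NUM_BUCKETS; i++) { self->buckets[i] = NULL; }
    return self;
}

/* Insert or overwrite.  Returns 0 on success, -1 on allocation failure. */
static int
Hash_store(Hash *self, const char *key, int value)
{
    HashEntry **slot;
    HashEntry  *entry;
    assert(self != NULL && key != NULL);
    slot = &self->buckets[hash_string(key) % HASH_NUM_BUCKETS];
    for (entry = *slot; entry != NULL; entry = entry->next) {
        if (strcmp(entry->key, key) == 0) { entry->value = value; return 0; }
    }
    entry = malloc(sizeof(HashEntry));
    if (entry == NULL) { return -1; }
    entry->key = malloc(strlen(key) + 1);
    if (entry->key == NULL) { free(entry); return -1; }
    strcpy(entry->key, key);
    entry->value = value;
    entry->next  = *slot;     /* push onto the front of the chain */
    *slot = entry;
    return 0;
}

/* Returns a pointer to the stored value, or NULL if the key is absent. */
static int*
Hash_fetch(Hash *self, const char *key)
{
    HashEntry *entry = self->buckets[hash_string(key) % HASH_NUM_BUCKETS];
    for (; entry != NULL; entry = entry->next) {
        if (strcmp(entry->key, key) == 0) { return &entry->value; }
    }
    return NULL;
}

static void
Hash_destroy(Hash *self)
{
    size_t i;
    if (self == NULL) { return; }
    for (i = 0; i < HASH_NUM_BUCKETS; i++) {
        HashEntry *entry = self->buckets[i];
        while (entry != NULL) {
            HashEntry *next = entry->next;
            free(entry->key);
            free(entry);
            entry = next;
        }
    }
    free(self);
}
```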
Cheers,
Dave
Re: Dependencies
Posted by Marvin Humphrey <ma...@rectangular.com>.
On Jun 20, 2006, at 10:52 PM, Yen-Ju Chen wrote:
> I did some Googling and hope these links are helpful.
> I have never used them before.
> But if both the license and algorithm fit Lucy,
> they can save time by not reinventing the wheel.
>> * hashtable
>
> http://www.jeannot.org/~js/code/index.en.html#MapKit
It's funny that Dave thinks hashtable's the piece of cake. This was
the link I found most interesting, as I've written hashtables before
but can't say I enjoyed the calisthenics. I also did not write a
hashtable for KinoSearch, since I was able to use Perl's hashes from
C via the XS API.
As for the others, we both have tested implementations of most
everything, since we needed those for our own C code. We won't
really be reinventing the wheel, we'll be rotating the tires. :)
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Re: Dependencies
Posted by Yen-Ju Chen <yj...@gmail.com>.
On 6/20/06, Marvin Humphrey <ma...@rectangular.com> wrote:
> Greets,
>
[snip]
>
> If those are our only external dependencies, that implies we'll be
> building a lot from scratch. Here are some of the utilities we'll
> need to code up:
I did some Googling and hope these links are helpful.
I have never used them before.
But if both the license and algorithm fit Lucy,
they can save time by not reinventing the wheel.
> * hashtable
http://www.jeannot.org/~js/code/index.en.html#MapKit
> * priority queue
http://www.hpcf.upr.edu/~humberto/software/EPQ/report.html
> * byte buffer (an array of bytes that knows its own length)
> * bit vector
http://www.csd.uwo.ca/%7ejamie/BitVectors/SeeAlso.html
Although it is a Perl module, the C core can be used stand-alone.
(See the local-copy announcement).
> * external sort
> * C test harness
There are two C unit tests: http://sastools.com/b2/post/79394064
Yen-Ju
>
> How does that sound?
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
>
>