Posted to dev@lucene.apache.org by Erik Hatcher <er...@ehatchersolutions.com> on 2003/12/02 19:45:14 UTC
Language neutral index format representation (was Re: Confused by writePostings/SegmentTermDocs.next())
On Tuesday, December 2, 2003, at 11:02 AM, Simon Cozens wrote:
> Hi all,
> At my company, we're working on a Perl version of Lucene, which we plan to
> release under the same terms as Lucene. (When we have it working, tested and
> documented.)
Very nice! At FOO you mentioned you were going to probably write a
Perl version - glad you're getting the time to do it now. I've been
dragging my feet on RubyLucene (@ RubyForge.org) - I've gotten some
low-level file I/O Directory implementations working, but nothing above
that yet.
Speaking of language implementations of Lucene's index format and
associated searching/indexing API, I think it would be cool if we
represented the directory and file formats in a computer-readable
(probably XML) format which could be used to generate the
low-level language-specific code for the various implementations.
Conceivably such a representation could be used at runtime, but for
performance reasons it would seem more sensible to use it for
generating the I/O code.
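To make the runtime option concrete, here is a minimal sketch in Python. It is not code from Lucene or any of the ports. The XML element and attribute names, the type names, and every function below are hypothetical, invented for illustration. The variable-length integer follows the usual scheme of seven data bits per byte with the high bit marking continuation; the string encoding (a length followed by UTF-8 bytes) is a simplification.

```python
import io
import struct
import xml.etree.ElementTree as ET

# Hypothetical descriptor: the schema is an assumption made for this sketch.
DESCRIPTOR = """
<file name="example">
  <field name="version" type="uint32"/>
  <field name="count" type="vint"/>
  <field name="label" type="string"/>
</file>
"""

def read_uint32(stream):
    return struct.unpack(">I", stream.read(4))[0]

def write_uint32(stream, value):
    stream.write(struct.pack(">I", value))

def read_vint(stream):
    # Seven data bits per byte, low-order group first; high bit set means "more follows".
    result, shift = 0, 0
    while True:
        byte = stream.read(1)[0]
        result |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return result
        shift += 7

def write_vint(stream, value):
    while value > 0x7F:
        stream.write(bytes([(value & 0x7F) | 0x80]))
        value >>= 7
    stream.write(bytes([value]))

def read_string(stream):
    # Simplification: byte length as a vint, then UTF-8 data.
    return stream.read(read_vint(stream)).decode("utf-8")

def write_string(stream, value):
    data = value.encode("utf-8")
    write_vint(stream, len(data))
    stream.write(data)

# Type name in the descriptor -> (reader, writer).
CODECS = {
    "uint32": (read_uint32, write_uint32),
    "vint": (read_vint, write_vint),
    "string": (read_string, write_string),
}

def field_specs(descriptor_xml):
    root = ET.fromstring(descriptor_xml)
    return [(f.get("name"), f.get("type")) for f in root.findall("field")]

def read_record(descriptor_xml, stream):
    # Interpret the descriptor at runtime, one field at a time.
    return {name: CODECS[kind][0](stream) for name, kind in field_specs(descriptor_xml)}

def write_record(descriptor_xml, stream, record):
    for name, kind in field_specs(descriptor_xml):
        CODECS[kind][1](stream, record[name])

def roundtrip(descriptor_xml, record):
    # Write a record to memory and read it back through the same descriptor.
    buffer = io.BytesIO()
    write_record(descriptor_xml, buffer, record)
    buffer.seek(0)
    return read_record(descriptor_xml, buffer)
```

Interpreting the descriptor on every read costs a dictionary lookup and a function call per field, which is the kind of overhead that makes generating the I/O code ahead of time look more attractive.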
This representation would also be handy for dealing with changes to the
file format: it would make the format more formal and easy to diff, and
tools or implementations could use it to provide graceful backwards
compatibility and such.
What do folks think of this idea? Any drawbacks? Could the Java I/O
code be generated without affecting the design at that level if
such a representation existed?
Erik
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
Re: Language neutral index format representation (was Re: Confused by writePostings/SegmentTermDocs.next())
Posted by robert burrell donkin <rd...@apache.org>.
On 2 Dec 2003, at 18:45, Erik Hatcher wrote:
> On Tuesday, December 2, 2003, at 11:02 AM, Simon Cozens wrote:
>> Hi all,
>> At my company, we're working on a Perl version of Lucene, which we plan to
>> release under the same terms as Lucene. (When we have it working, tested and
>> documented.)
>
> Very nice! At FOO you mentioned you were going to probably write a
> Perl version - glad you're getting the time to do it now. I've been
> dragging my feet on RubyLucene (@ RubyForge.org) - I've gotten some
> low-level file I/O Directory implementations working, but nothing
> above that yet.
(for those who aren't aware of the possibility...)
if the lucene community decided to opt for top level project status at
the ASF then the perl, ruby and java versions could be developed within
the same project here at apache. if this is the route that the lucene
community decides to take, then (i think that) the jakarta pmc would do
whatever they could to help. (at least, that's the current policy :)
- robert
---------------------------------------------------------------------
Re: Language neutral index format representation
Posted by Simon Cozens <si...@simon-cozens.org>.
Erik Hatcher:
> Speaking of tests - are you testing Java/Perl interoperability? For
> example - are you testing that an index created in Java is read fine by
> your Perl API? And vice versa?
Only ad-hoc tests. :( I started testing the index reader by copying across a
Java-created index and making sure it could read that, but now that I can
generate indexes in Perl, I'm using those instead. Ideally, I will have some
tests that make sure that the two representations are identical.
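A test of that kind could be sketched roughly as follows, in Python. It assumes both implementations have indexed the same documents with the same settings into two directories and are expected to produce the same file names; whether that holds in practice is an open question. The function names are hypothetical.

```python
import hashlib
from pathlib import Path

def digest_directory(path):
    # Map each file name in the directory to a SHA-256 digest of its contents.
    return {
        entry.name: hashlib.sha256(entry.read_bytes()).hexdigest()
        for entry in sorted(Path(path).iterdir())
        if entry.is_file()
    }

def differing_files(dir_a, dir_b):
    # Names that are missing on one side or whose contents differ.
    a, b = digest_directory(dir_a), digest_directory(dir_b)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

An empty result from `differing_files` would mean the two index directories are byte-for-byte identical; any names it returns are where to start looking.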
> I'm jealous! Or I guess you might say I should forget Ruby and switch
> to Perl :)
Oh, believe me, I'd love it if I could work full-time in Ruby.
> I'm thinking more in terms of generating classes like FieldInfos and
> SegmentInfos from an XML descriptor that represented the info here:
>
> http://jakarta.apache.org/lucene/docs/fileformats.html
Sounds interesting, and not too difficult.
--
The Blit is a nice terminal, but it runs emacs.
---------------------------------------------------------------------
Re: Language neutral index format representation
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Tuesday, December 2, 2003, at 01:56 PM, Simon Cozens wrote:
> Yep, thanks to Kasei, who are also cleaning up and documenting the code I
> write. For the interested, what I'm doing is at
> http://cvs.simon-cozens.org/viewcvs.cgi/plucene/ and I hope to sync back over
> the docs/tests once they're completed.
Speaking of tests - are you testing Java/Perl interoperability? For
example - are you testing that an index created in Java is read fine by
your Perl API? And vice versa? I'm interested in developing some sort of
test suite to do this with the Ruby port eventually.
> My version's almost there, thanks to a month of basically full-time work on it.
I'm jealous! Or I guess you might say I should forget Ruby and switch
to Perl :)
> I believe so. You'd generate, conceptually, an ObjectSerializer class of
> some sort which has read and write methods, which is overloaded to do
> the right thing with the right object type.
I'm thinking more in terms of generating classes like FieldInfos and
SegmentInfos from an XML descriptor that represented the info here:
http://jakarta.apache.org/lucene/docs/fileformats.html
> However, I can imagine some snags, such as the one which prompted this
> thread: how would you represent sequences of objects with their properties
> delta-encoded, for instance, or the cunning buffer-substring trick used to
> store the terms in the .tis file?
I view this as what the language-specific code generator would build
from a general file format descriptor. For example, in Java I'd
probably write some Velocity templates that keyed off the XML
descriptor. In Ruby, I'd use REXML and ERb templates.
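The template approach could look roughly like this in Python, with `string.Template` standing in for Velocity or ERb. This is only a sketch under assumptions: the descriptor schema, the `FieldInfo` record layout, and the `read_*`/`write_*` helpers are all invented for illustration, and the one-byte string length is a deliberate simplification rather than the real on-disk encoding.

```python
import io
import xml.etree.ElementTree as ET
from string import Template

# Hypothetical descriptor for a single record type.
DESCRIPTOR = """
<record class="FieldInfo">
  <field name="name" type="string"/>
  <field name="bits" type="byte"/>
</record>
"""

# Template for the generated class; the per-field lines are filled in below.
CLASS_TEMPLATE = Template('''\
class $cls:
    def __init__(self, $args):
$assigns

    @classmethod
    def read(cls, stream):
$reads
        return cls($args)

    def write(self, stream):
$writes
''')

def generate_source(descriptor_xml):
    # Turn the descriptor into Python source text for one class.
    root = ET.fromstring(descriptor_xml)
    fields = [(f.get("name"), f.get("type")) for f in root.findall("field")]
    return CLASS_TEMPLATE.substitute(
        cls=root.get("class"),
        args=", ".join(name for name, _ in fields),
        assigns="\n".join(f"        self.{n} = {n}" for n, _ in fields),
        reads="\n".join(f"        {n} = read_{t}(stream)" for n, t in fields),
        writes="\n".join(f"        write_{t}(stream, self.{n})" for n, t in fields),
    )

# Hand-written primitives that the generated code calls.
def read_byte(stream):
    return stream.read(1)[0]

def write_byte(stream, value):
    stream.write(bytes([value]))

def read_string(stream):
    # Simplification: one-byte length, then UTF-8 data.
    return stream.read(read_byte(stream)).decode("utf-8")

def write_string(stream, value):
    data = value.encode("utf-8")
    write_byte(stream, len(data))
    stream.write(data)

def load_generated(descriptor_xml):
    # Compile the generated source and return the namespace holding the new class.
    namespace = {
        "read_byte": read_byte, "write_byte": write_byte,
        "read_string": read_string, "write_string": write_string,
    }
    exec(generate_source(descriptor_xml), namespace)
    return namespace

def serialize(obj):
    # Convenience helper: the bytes an instance of a generated class writes.
    buffer = io.BytesIO()
    obj.write(buffer)
    return buffer.getvalue()
```

Printing `generate_source(DESCRIPTOR)` shows the class a generator would write to disk. In a real build the output would be saved as a source file rather than passed to `exec`.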
I haven't thought through any detailed issues that could come up or if
it would impact the design of the Java "reference implementation" to
accommodate generated code or not.
Erik
---------------------------------------------------------------------
Re: Language neutral index format representation (was Re: Confused by writePostings/SegmentTermDocs.next())
Posted by Simon Cozens <si...@simon-cozens.org>.
Erik Hatcher:
> Very nice! At FOO you mentioned you were going to probably write a
> Perl version - glad you're getting the time to do it now.
Yep, thanks to Kasei, who are also cleaning up and documenting the code I
write. For the interested, what I'm doing is at
http://cvs.simon-cozens.org/viewcvs.cgi/plucene/ and I hope to sync back over
the docs/tests once they're completed.
> I've been dragging my feet on RubyLucene (@ RubyForge.org) - I've gotten
> some low-level file I/O Directory implementations working, but nothing above
> that yet.
My version's almost there, thanks to a month of basically full-time work on it.
> Speaking of language implementations of Lucene's index format and
> associated searching/indexing API, I think it would be cool if we
> represent the directory and file formats in a computer-readable
> (probably XML) format which could be used to generate the
> low-level language-specific code for the various implementations.
That would be quite nifty; I'll have a think about how it might look.
> What do folks think of this idea? Any drawbacks? Could the Java I/O
> code be code generated without affecting the design at that level if
> such a representation existed?
I believe so. You'd generate, conceptually, an ObjectSerializer class of
some sort which has read and write methods, which is overloaded to do
the right thing with the right object type.
However, I can imagine some snags, such as the one which prompted this
thread: how would you represent sequences of objects with their properties
delta-encoded, for instance, or the cunning buffer-substring trick used to
store the terms in the .tis file?
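To spell out why these two are awkward for a per-field descriptor, here is a small Python sketch of both encodings at the level of values rather than bytes. The function names are made up for illustration and this is not the Lucene code. In both cases each entry can only be decoded with knowledge of the previous one, so a descriptor would need some way to express that carried state.

```python
def delta_encode(sorted_values):
    # Store each value as its gap from the previous one (the first is relative to 0).
    gaps, previous = [], 0
    for value in sorted_values:
        gaps.append(value - previous)
        previous = value
    return gaps

def delta_decode(gaps):
    values, total = [], 0
    for gap in gaps:
        total += gap
        values.append(total)
    return values

def prefix_encode(sorted_terms):
    # Store each term as (length of prefix shared with the previous term, remaining suffix).
    entries, previous = [], ""
    for term in sorted_terms:
        shared = 0
        limit = min(len(term), len(previous))
        while shared < limit and term[shared] == previous[shared]:
            shared += 1
        entries.append((shared, term[shared:]))
        previous = term
    return entries

def prefix_decode(entries):
    # Rebuild each term from the previous term's prefix plus the stored suffix.
    terms, previous = [], ""
    for shared, suffix in entries:
        term = previous[:shared] + suffix
        terms.append(term)
        previous = term
    return terms
```

For example, `prefix_encode(["bone", "boy", "by"])` gives `[(0, "bone"), (2, "y"), (1, "y")]`, and `delta_encode([3, 7, 8, 15])` gives `[3, 4, 1, 7]`.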
--
BASH is great, it dumps core and has clear documentation. -Ari Suntioinen
---------------------------------------------------------------------