Posted to dev@lucene.apache.org by Erik Hatcher <er...@ehatchersolutions.com> on 2003/12/02 19:45:14 UTC

Language neutral index format representation (was Re: Confused by writePostings/SegmentTermDocs.next())

On Tuesday, December 2, 2003, at 11:02  AM, Simon Cozens wrote:
> Hi all,
>     At my company, we're working on a Perl version of Lucene, which we plan to
> release under the same terms as Lucene. (When we have it working, tested and
> documented.)

Very nice!  At FOO you mentioned you would probably write a Perl 
version - glad you're getting the time to do it now.  I've been 
dragging my feet on RubyLucene (@ RubyForge.org) - I've gotten some 
low-level file I/O Directory implementations working, but nothing above 
that yet.

Speaking of language implementations of Lucene's index format and 
associated searching/indexing API, I think it would be cool if we 
represented the directory and file formats in a computer-readable 
(probably XML) format which could be used to generate the low-level 
language-specific code for the various implementations.  Conceivably 
such a representation could be used at runtime, but for performance 
reasons the more sensible approach would be to generate the I/O code 
from it.
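
A minimal sketch of the runtime variant mentioned above, assuming a made-up descriptor vocabulary: the <file> and <field> elements and their attributes are invented for illustration and are not part of Lucene. The primitive readers follow the encodings described in the file formats document (big-endian UInt32, VInt with seven payload bits per byte, low-order group first), except that the String reader treats the length as a UTF-8 byte count, which is a simplification.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical descriptor: element and attribute names are illustrative only.
DESCRIPTOR = """
<file name="example">
  <field name="version" type="UInt32"/>
  <field name="count" type="VInt"/>
  <field name="label" type="String"/>
</file>
"""

def read_uint32(stream):
    # Four bytes, high-order byte first.
    return int.from_bytes(stream.read(4), "big")

def read_vint(stream):
    # Seven payload bits per byte; a set high bit means another byte follows.
    result, shift = 0, 0
    while True:
        byte = stream.read(1)[0]
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result
        shift += 7

def read_string(stream):
    # Simplification: length is taken as a byte count of plain UTF-8.
    length = read_vint(stream)
    return stream.read(length).decode("utf-8")

READERS = {"UInt32": read_uint32, "VInt": read_vint, "String": read_string}

def read_record(descriptor_xml, stream):
    # Walk the descriptor in order and read one value per declared field.
    root = ET.fromstring(descriptor_xml)
    return {
        field.get("name"): READERS[field.get("type")](stream)
        for field in root.findall("field")
    }

data = bytes([0, 0, 0, 1, 0xAC, 0x02, 3]) + b"abc"
print(read_record(DESCRIPTOR, io.BytesIO(data)))
# prints {'version': 1, 'count': 300, 'label': 'abc'}
```

Interpreting the descriptor on every read costs a dictionary lookup and an XML walk per record, which is the performance concern that makes generated code the more attractive option.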

This representation would also be handy to deal with changes to the 
file format, making it more formalized and easily diff'd or used by 
tools or implementations to have graceful backwards compatibility and 
such.

What do folks think of this idea?  Any drawbacks?  Could the Java I/O 
code be generated without affecting the design at that level if such a 
representation existed?

	Erik

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

Re: Language neutral index format representation (was Re: Confused by writePostings/SegmentTermDocs.next())

Posted by robert burrell donkin <rd...@apache.org>.

On 2 Dec 2003, at 18:45, Erik Hatcher wrote:

> On Tuesday, December 2, 2003, at 11:02  AM, Simon Cozens wrote:
>> Hi all,
>>     At my company, we're working on a Perl version of Lucene, which we plan to
>> release under the same terms as Lucene. (When we have it working, tested and
>> documented.)
>
> Very nice!  At FOO you mentioned you were going to probably write a 
> Perl version - glad you're getting the time to do it now.  I've been 
> dragging my feet on RubyLucene (@ RubyForge.org) - I've gotten some 
> low-level file I/O Directory implementations working, but nothing 
> above that yet.

(for those who aren't aware of the possibility...)

if the lucene community decided to opt for top level project status at 
the ASF then the perl, ruby and java versions could be developed within 
the same project here at apache. if this is the route that the lucene 
community decides to take, then (i think that) the jakarta pmc would do 
whatever they could to help. (at least, that's the current policy :)

- robert



Re: Language neutral index format representation

Posted by Simon Cozens <si...@simon-cozens.org>.

Erik Hatcher:
> Speaking of tests - are you testing Java/Perl interoperability?  For 
> example - are you testing an index created in Java is read fine by your 
> Perl API?  And vice versa?

Only ad-hoc tests. :( I started testing the index reader by copying across a
Java-created index and making sure it could read that, but now that I can
generate indexes in Perl, I'm using those instead. Ideally, I will have some
tests that make sure that the two representations are identical.
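
One way such a check could look, as a rough sketch: hash every file in two index directories and report the names that differ. The function names are hypothetical, and byte-for-byte equality is an assumption that may be too strict if any file records something like a timestamp or version counter.

```python
import hashlib
from pathlib import Path

def directory_digest(path):
    # Map each file name in the directory to the SHA-256 of its contents.
    digests = {}
    for entry in sorted(Path(path).iterdir()):
        if entry.is_file():
            digests[entry.name] = hashlib.sha256(entry.read_bytes()).hexdigest()
    return digests

def differing_files(dir_a, dir_b):
    # Names that are missing on one side or whose contents differ.
    a, b = directory_digest(dir_a), directory_digest(dir_b)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

An empty result from differing_files(java_index_dir, perl_index_dir) would mean the two implementations wrote identical files for the same input documents.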

> I'm jealous!  Or I guess you might say I should forget Ruby and switch 
> to Perl :)

Oh, believe me, I'd love it if I could work full-time in Ruby.

> I'm thinking more in terms of generating classes like FieldInfos and 
> SegmentInfos from an XML descriptor that represented the info here:
> 
> 	http://jakarta.apache.org/lucene/docs/fileformats.html

Sounds interesting, and not too difficult. 

-- 
The Blit is a nice terminal, but it runs emacs.


Re: Language neutral index format representation

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Tuesday, December 2, 2003, at 01:56  PM, Simon Cozens wrote:
> Yep, thanks to Kasei, who are also cleaning up and documenting the code I
> write. For the interested, what I'm doing is at
> http://cvs.simon-cozens.org/viewcvs.cgi/plucene/ and I hope to sync back over
> the docs/tests once they're completed.

Speaking of tests - are you testing Java/Perl interoperability?  For 
example, are you testing that an index created in Java is read 
correctly by your Perl API?  And vice versa?  I'm interested in 
developing some sort of test suite to do this with the Ruby port 
eventually.

> My version's almost there, thanks to a month basically full-time work on it.

I'm jealous!  Or I guess you might say I should forget Ruby and switch 
to Perl :)

> I believe so. You'd generate, conceptually, an ObjectSerializer class of
> some sort which has read and write methods, which is overloaded to do
> the right thing with the right object type.

I'm thinking more in terms of generating classes like FieldInfos and 
SegmentInfos from an XML descriptor that represented the info here:

	http://jakarta.apache.org/lucene/docs/fileformats.html

> However, I can imagine some snags, such as the one which prompted this
> thread: how would you represent sequences of objects with their properties
> delta-encoded, for instance, or the cunning buffer-substring trick used to
> store the terms in the .tis file?

I view this as what the language-specific code generator would build 
from a general file format descriptor.  For example, in Java I'd 
probably write some Velocity templates that keyed off the XML 
descriptor.  In Ruby, I'd use REXML and ERb templates.
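
To make the template idea concrete, here is a small sketch in Python, with the standard library's string.Template standing in for Velocity or ERb. The <structure> descriptor and the two fields it lists are invented for illustration and do not reproduce the real FieldInfos layout.

```python
import xml.etree.ElementTree as ET
from string import Template

# Hypothetical descriptor: names and layout are illustrative only.
DESCRIPTOR = """
<structure name="FieldInfo">
  <field name="name" type="String"/>
  <field name="is_indexed" type="Byte"/>
</structure>
"""

CLASS_TEMPLATE = Template('''class $class_name:
    """Generated from a descriptor; do not edit by hand."""

    def __init__(self, $args):
$assignments
''')

def generate_class(descriptor_xml):
    # Turn the descriptor's field list into source text for a plain class.
    root = ET.fromstring(descriptor_xml)
    names = [field.get("name") for field in root.findall("field")]
    assignments = "\n".join(f"        self.{name} = {name}" for name in names)
    return CLASS_TEMPLATE.substitute(
        class_name=root.get("name"),
        args=", ".join(names),
        assignments=assignments,
    )

# Compile the generated source and instantiate the resulting class.
namespace = {}
exec(generate_class(DESCRIPTOR), namespace)
info = namespace["FieldInfo"]("title", 1)
print(info.name, info.is_indexed)  # prints: title 1
```

Each port would keep its own template while sharing the one descriptor, so a format change would be made once and regenerated everywhere.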

I haven't thought through the detailed issues that could come up, or 
whether accommodating generated code would affect the design of the 
Java "reference implementation".

	Erik


Re: Language neutral index format representation (was Re: Confused by writePostings/SegmentTermDocs.next())

Posted by Simon Cozens <si...@simon-cozens.org>.

Erik Hatcher:
> Very nice!  At FOO you mentioned you were going to probably write a 
> Perl version - glad you're getting the time to do it now. 

Yep, thanks to Kasei, who are also cleaning up and documenting the code I
write. For the interested, what I'm doing is at
http://cvs.simon-cozens.org/viewcvs.cgi/plucene/ and I hope to sync back over
the docs/tests once they're completed.

> I've been dragging my feet on RubyLucene (@ RubyForge.org) - I've gotten
> some low-level file I/O Directory implementations working, but nothing above
> that yet.

My version's almost there, thanks to a month of basically full-time work on it.

> Speaking of language implementations of Lucene's index format and 
> associated searching/indexing API, I think it would be cool if we 
> represent the directory and file formats in a computer-readable 
> (probably XML) format which could be used to generate the 
> low-level language-specific code for the various implementations.  

That would be quite nifty; I'll have a think about how it might look.

> What do folks think of this idea?  Any drawbacks?  Could the Java I/O 
> code be code generated without affecting the design at that level if 
> such a representation existed?

I believe so. You'd generate, conceptually, an ObjectSerializer class of
some sort which has read and write methods, which is overloaded to do
the right thing with the right object type.
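
A conceptual sketch of that shape in Python, which has no method overloading, so a registry keyed on the object's type plays that role. The FieldInfo record and its one-byte length prefix are assumptions made for the example and are not Lucene's actual encoding.

```python
import io
from dataclasses import dataclass

@dataclass
class FieldInfo:
    # Hypothetical record type used only for this example.
    name: str
    is_indexed: bool

class ObjectSerializer:
    """Dispatches read and write to a codec registered per object type."""

    def __init__(self):
        self._codecs = {}

    def register(self, cls, writer, reader):
        self._codecs[cls] = (writer, reader)

    def write(self, obj, stream):
        writer, _ = self._codecs[type(obj)]
        writer(obj, stream)

    def read(self, cls, stream):
        _, reader = self._codecs[cls]
        return reader(stream)

def write_field_info(info, stream):
    encoded = info.name.encode("utf-8")
    stream.write(bytes([len(encoded)]))  # one length byte: names under 256 bytes
    stream.write(encoded)
    stream.write(bytes([1 if info.is_indexed else 0]))

def read_field_info(stream):
    length = stream.read(1)[0]
    name = stream.read(length).decode("utf-8")
    return FieldInfo(name, stream.read(1)[0] == 1)

serializer = ObjectSerializer()
serializer.register(FieldInfo, write_field_info, read_field_info)

buffer = io.BytesIO()
serializer.write(FieldInfo("title", True), buffer)
buffer.seek(0)
print(serializer.read(FieldInfo, buffer))
# prints FieldInfo(name='title', is_indexed=True)
```

In a generated version, the two codec functions per type are what the generator would emit; the dispatching class itself could stay hand-written.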

However, I can imagine some snags, such as the one which prompted this
thread: how would you represent sequences of objects with their properties
delta-encoded, for instance, or the cunning buffer-substring trick used to
store the terms in the .tis file?
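
To show why these cases are awkward for a flat field-by-field descriptor, here is a small sketch of both encodings. It assumes the "buffer-substring trick" refers to storing each term as the length of the prefix it shares with the previous term plus the remaining suffix; the function names are hypothetical.

```python
def delta_encode(numbers):
    # Store each value as the difference from the previous one.
    previous, deltas = 0, []
    for number in numbers:
        deltas.append(number - previous)
        previous = number
    return deltas

def delta_decode(deltas):
    # Rebuild the original values with a running sum.
    total, numbers = 0, []
    for delta in deltas:
        total += delta
        numbers.append(total)
    return numbers

def prefix_encode(sorted_terms):
    # Store (shared prefix length, suffix) relative to the previous term.
    previous, entries = "", []
    for term in sorted_terms:
        shared = 0
        limit = min(len(previous), len(term))
        while shared < limit and previous[shared] == term[shared]:
            shared += 1
        entries.append((shared, term[shared:]))
        previous = term
    return entries

def prefix_decode(entries):
    # Each term is rebuilt from the previously decoded term.
    previous, terms = "", []
    for shared, suffix in entries:
        term = previous[:shared] + suffix
        terms.append(term)
        previous = term
    return terms

print(delta_encode([5, 8, 20]))               # prints [5, 3, 12]
print(prefix_encode(["bar", "bard", "bat"]))  # prints [(0, 'bar'), (3, 'd'), (2, 't')]
```

In both cases a record cannot be decoded on its own: the reader has to carry state from the previous record, so a descriptor would need some way to say "relative to the preceding entry" rather than only listing field types.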

-- 
BASH is great, it dumps core and has clear documentation.  -Ari Suntioinen
