Posted to dev@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2006/06/21 07:40:51 UTC

Problematic platforms

Greets,

There are some portability problems that may not be worth solving.

On some Crays, ints, longs and pointers are all 8 bytes (the ILP64  
format).  I propose not supporting any machine where we can't  
guarantee that lucy_i8_t is 1 byte and lucy_i32_t is 4 bytes.
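
A minimal sketch of how such a guarantee could be enforced at compile
time. The typedef choices below are assumptions for illustration, not
Lucy's actual definitions; the negative-array-size trick makes the
build fail on a platform where the widths are wrong.

```c
#include <limits.h>

/* Hypothetical typedefs: a real build would pick these during
 * configuration rather than hard-coding them. */
typedef signed char lucy_i8_t;
typedef int         lucy_i32_t;

/* Each typedef below declares an array of size 1 when the condition
 * holds and of size -1 (a compile error) when it does not. */
typedef char sketch_check_char_bit[(CHAR_BIT == 8) ? 1 : -1];
typedef char sketch_check_i8_size[(sizeof(lucy_i8_t) == 1) ? 1 : -1];
typedef char sketch_check_i32_size[(sizeof(lucy_i32_t) == 4) ? 1 : -1];
```

On an ILP64 machine where int is 8 bytes, the last check would stop
the build instead of letting a mis-sized lucy_i32_t through.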

A second esoteric problem is machines that don't use IEEE 754 for  
floats: <http://www.codeproject.com/tools/libnumber.asp>.  I think  
that the norms-encoding routine will break on such machines.  That
ought to be the only problem, but it's gnarly enough that I think we
should just decide not to support those boxes.
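
One way to detect such boxes, sketched under the assumption that
"IEEE 754 floats" means the 32-bit binary format. The function name is
hypothetical; the macros come from the standard <float.h> header.

```c
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical probe, not part of Lucy: returns 1 if `float` appears
 * to be an IEEE 754 binary32 value, 0 otherwise. */
static int sketch_float_is_ieee754(void) {
    float one = 1.0f;
    uint32_t bits = 0;

    /* Compare the format parameters reported by <float.h> against
     * those of binary32. */
    if (sizeof(float) != 4 || FLT_RADIX != 2
        || FLT_MANT_DIG != 24 || FLT_MAX_EXP != 128
        || FLT_MIN_EXP != -125) {
        return 0;
    }

    /* Spot-check the bit layout: 1.0f is 0x3F800000 in binary32. */
    memcpy(&bits, &one, sizeof(bits));
    return bits == UINT32_C(0x3F800000);
}
```

A configuration step could run a check like this and refuse to build
when it returns 0.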

Another wrinkle is large file support.  Machines that don't support  
large files are growing scarcer by the day, but eventually, somebody  
who has one will want to use Lucy.  Index files can get pretty big.

Is it even possible for a machine to have large file support and not  
provide a 64-bit integer?  The only thing Lucene ever uses 64-bit  
integers for is file pointers.  KinoSearch takes advantage of this in  
a weird way -- it uses doubles wherever Lucene uses Java longs.  I  
did it that way because Perl always provides support for doubles, but  
64-bit integer support takes a special compile and generally doesn't  
work very well.  The 52-bit mantissa in an IEEE 754 double is more  
than enough for any file pointer.  But when I made that call, I was  
using native Perl filehandles as InStream objects; KinoSearch doesn't  
do that anymore, and I don't think we should go the doubles-as-file- 
pointers route with Lucy (even though it Just Works).
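
To illustrate why doubles work as file pointers: the 52 stored
mantissa bits plus the implicit leading bit give 53 bits of precision,
so every integer offset up to 2^53 (8 PiB) survives a round trip
through a double. A minimal sketch, with hypothetical names:

```c
#include <stdint.h>

/* 2^53: every integer in [0, 2^53] is exactly representable in an
 * IEEE 754 double. */
#define SKETCH_MAX_EXACT_OFFSET (UINT64_C(1) << 53)

/* Returns 1 if `offset` converts to double and back unchanged. */
static int sketch_offset_survives_double(uint64_t offset) {
    double as_double;

    /* Conservatively reject offsets of 2^63 and above so that the
     * conversion back to uint64_t always stays in range. */
    if (offset >> 63) {
        return 0;
    }
    as_double = (double)offset;
    return (uint64_t)as_double == offset;
}
```

Here sketch_offset_survives_double(SKETCH_MAX_EXACT_OFFSET) holds,
while the same call with SKETCH_MAX_EXACT_OFFSET + 1 does not, because
that value rounds to a neighboring double.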

I'm inclined to require both large file support and 64-bit integers  
for Lucy.  What say?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


Re: Problematic platforms

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Jun 20, 2006, at 11:44 PM, David Balmain wrote:

> If someone needs Lucy to work on one of those
> boxes, it will just be a simple matter of them supplying us with
> float2byte and byte2float methods.

And tests, which we won't be able to run ourselves.

>> I'm inclined to require both large file support and 64-bit integers
>> for Lucy.  What say?
>
> I'm not sure about large file support. You've looked into it more than
> I have, but I do think 64-bit integers are a must.

I thought about replacing what would have been lucy_i64_t with  
lucy_off_t.  One problem is how to fail reliably.  I actually think  
we could pull it off without data corruption, since failure would  
first occur either when the compound file was written (and before the  
segments file gets altered), or at search-time, when the index first  
gets loaded.

However, I just didn't think it would be a common enough case that it  
was worth coding and testing special versions of write_vlong, etc.

> [aside: What I'm doing in Ferret is storing all file pointers as off_t.
> As well as read/write_vint methods, I have read/write_voff_t. The only
> time I use 64-bit integers (i.e., always 64-bit, unlike off_t, which
> could be 32-bit) is when I need to write a fixed-size pointer, as in
> the fields and term_vectors index files. I've only just implemented
> this, but it seems to be working.]

This is one of the main things we need Configurator for.  We need to  
figure out whether off_t is 32-bit or 64-bit.  We need to figure out  
whether the OS is supplying ftello64, whether ftello returns a 64-bit  
off_t, etc.
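
A rough sketch of the simplest of those probes, the width of off_t.
The function names and the LUCY_OFF_T_BITS macro are hypothetical;
detecting ftello64 would need a separate compile-and-link test, which
is not shown.

```c
#include <limits.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical probe: width of off_t in bits for this build.  On some
 * 32-bit systems the answer changes if _FILE_OFFSET_BITS is defined
 * as 64 before any system header is included. */
static int sketch_off_t_bits(void) {
    return (int)(sizeof(off_t) * CHAR_BIT);
}

/* Hypothetical: format one line for a generated config header.
 * Returns what snprintf returns (negative on error). */
static int sketch_write_off_t_define(char *buf, size_t size) {
    return snprintf(buf, size, "#define LUCY_OFF_T_BITS %d\n",
                    sketch_off_t_bits());
}
```

The generated line could then be included by the rest of the build,
which would choose between 32-bit and 64-bit code paths or refuse to
compile.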

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


Re: Problematic platforms

Posted by David Balmain <db...@gmail.com>.
On 6/21/06, Marvin Humphrey <ma...@rectangular.com> wrote:
> Greets,
>
> There are some portability problems that may not be worth solving.
>
> On some Crays, ints, longs and pointers are all 8 bytes (the ILP64
> format).  I propose not supporting any machine where we can't
> guarantee that lucy_i8_t is 1 byte and lucy_i32_t is 4 bytes.
>
> A second esoteric problem is machines that don't use IEEE 754 for
> floats: <http://www.codeproject.com/tools/libnumber.asp>.  I think
> that the norms-encoding routine will break on such machines.  That
> ought to be the only problem, but it's gnarly enough that I think we
> should just decide not to support those boxes.

Sounds fine to me. If someone needs Lucy to work on one of those
boxes, it will just be a simple matter of them supplying us with
float2byte and byte2float methods.
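
For a sense of what such a pair involves, here is a sketch of a
one-byte float encoding with 3 mantissa bits and 5 exponent bits. The
names and constants are assumptions for illustration and are not taken
from Lucy's norms format; the point is that the code reads the IEEE 754
bit layout directly, which is why another float format needs its own
version.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical encoder: negative values and zero map to 0, tiny
 * positive values to 1, and values too large for the format to 255. */
static uint8_t sketch_float2byte(float f) {
    uint32_t bits = 0;
    uint32_t small;

    memcpy(&bits, &f, sizeof(bits));
    if (bits == 0 || (bits & UINT32_C(0x80000000))) {
        return 0;                     /* zero or sign bit set */
    }
    small = bits >> 21;               /* keep exponent + 3 mantissa bits */
    if (small <= (48u << 3)) {
        return 1;                     /* underflow: smallest nonzero code */
    }
    if (small >= (48u << 3) + 0x100u) {
        return 255;                   /* overflow: largest code */
    }
    return (uint8_t)(small - (48u << 3));
}

/* Hypothetical decoder: inverse of the mapping above. */
static float sketch_byte2float(uint8_t b) {
    uint32_t bits;
    float f = 0.0f;

    if (b == 0) {
        return 0.0f;
    }
    bits = ((uint32_t)b << 21) + (48u << 24);
    memcpy(&f, &bits, sizeof(f));
    return f;
}
```

With this layout 1.0f encodes to 124 and decodes back to exactly 1.0f;
most other values lose precision in the round trip.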

> Another wrinkle is large file support.  Machines that don't support
> large files are growing scarcer by the day, but eventually, somebody
> who has one will want to use Lucy.  Index files can get pretty big.
>
> Is it even possible for a machine to have large file support and not
> provide a 64-bit integer?  The only thing Lucene ever uses 64-bit
> integers for is file pointers.  KinoSearch takes advantage of this in
> a weird way -- it uses doubles wherever Lucene uses Java longs.  I
> did it that way because Perl always provides support for doubles, but
> 64-bit integer support takes a special compile and generally doesn't
> work very well.  The 52-bit mantissa in an IEEE 754 double is more
> than enough for any file pointer.  But when I made that call, I was
> using native Perl filehandles as InStream objects; KinoSearch doesn't
> do that anymore, and I don't think we should go the doubles-as-file-
> pointers route with Lucy (even though it Just Works).
>
> I'm inclined to require both large file support and 64-bit integers
> for Lucy.  What say?

I'm not sure about large file support. You've looked into it more than
I have, but I do think 64-bit integers are a must.

[aside: What I'm doing in Ferret is storing all file pointers as off_t.
As well as read/write_vint methods, I have read/write_voff_t. The only
time I use 64-bit integers (i.e., always 64-bit, unlike off_t, which
could be 32-bit) is when I need to write a fixed-size pointer, as in
the fields and term_vectors index files. I've only just implemented
this, but it seems to be working.]
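
For readers unfamiliar with the variable-length integers mentioned
above, a minimal sketch of the usual scheme: seven payload bits per
byte, low-order group first, with the high bit set on every byte
except the last. The function names are hypothetical and this is not
Ferret's code.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode `value` into `buf` (at least 10 bytes); returns bytes used. */
static size_t sketch_write_vlong(uint8_t *buf, uint64_t value) {
    size_t n = 0;

    while (value > 0x7F) {
        buf[n++] = (uint8_t)((value & 0x7F) | 0x80);  /* more follows */
        value >>= 7;
    }
    buf[n++] = (uint8_t)value;                        /* final byte */
    return n;
}

/* Decode one value from `buf` into `*value`; returns bytes consumed.
 * Stops after 10 bytes so malformed input cannot over-shift. */
static size_t sketch_read_vlong(const uint8_t *buf, uint64_t *value) {
    size_t n = 0;
    int shift = 0;
    uint64_t result = 0;
    uint8_t byte;

    do {
        byte = buf[n++];
        result |= (uint64_t)(byte & 0x7F) << shift;
        shift += 7;
    } while ((byte & 0x80) && shift < 64);

    *value = result;
    return n;
}
```

Small values take one byte, while a full 64-bit value takes ten, which
is why a fixed-width field is still needed wherever the reader must
seek to a known position.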