Posted to dev@lucy.apache.org by "Marvin Humphrey (JIRA)" <ji...@apache.org> on 2008/11/21 01:10:45 UTC

[jira] Commented: (LUCY-4) Compound File Format Spec

    [ https://issues.apache.org/jira/browse/LUCY-4?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649549#action_12649549 ] 

Marvin Humphrey commented on LUCY-4:
------------------------------------

Now that we're memory mapping files, we want to be able to cast the buffers to
larger types.  That introduces alignment issues: trying to cast a char* to an
i64_t* causes problems if the char* doesn't start on an 8-byte boundary -- e.g.
if the sub file's offset is 7.
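
To make the problem concrete, here is a minimal sketch (not Lucy code).  The
helper names read_i64_copy and is_aligned_for_i64 are hypothetical, and
int64_t stands in for i64_t.  Copying through memcpy works at any offset,
while a direct cast is only safe when the address is suitably aligned; the
check below conservatively uses sizeof(int64_t) as the required alignment.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a 64-bit integer (host byte order) from any
 * byte offset in a mapped buffer.  memcpy has no alignment requirement,
 * so this is safe even when offset is 7. */
static int64_t
read_i64_copy(const char *buf, size_t offset)
{
    int64_t value;
    memcpy(&value, buf + offset, sizeof(value));
    return value;
}

/* Hypothetical helper: nonzero if ptr sits on an 8-byte boundary, i.e. if
 * casting it to int64_t* and dereferencing would be an aligned access. */
static int
is_aligned_for_i64(const void *ptr)
{
    return ((uintptr_t)ptr % sizeof(int64_t)) == 0;
}
```

The memcpy route always works but gives up the zero-copy cast we want from
memory mapping, which is why padding the offsets is attractive.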

To avert casting difficulties with basic types, we can insert padding between
sub files so that the offset is always a multiple of 8.  However, perhaps it
makes sense to use a larger number.
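
As a rough sketch of what that padding amounts to (the function names
align_up and layout_sub_files are made up for illustration, and the multiple
is assumed to be a power of two):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round offset up to the next multiple of alignment.
 * Assumes alignment is a power of two (8, 64, 4096, ...). */
static uint64_t
align_up(uint64_t offset, uint64_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}

/* Hypothetical layout routine: given the length of each sub file, fill in
 * the offset at which each one would start, and return the total size of
 * the compound file including padding (no padding after the last one). */
static uint64_t
layout_sub_files(const uint64_t *lengths, uint64_t *offsets, size_t count,
                 uint64_t alignment)
{
    uint64_t pos = 0;
    for (size_t i = 0; i < count; i++) {
        pos = align_up(pos, alignment);  /* padding goes before sub file i */
        offsets[i] = pos;
        pos += lengths[i];
    }
    return pos;
}
```

With sub files of 7, 10 and 3 bytes and a multiple of 8, the offsets come out
as 0, 8 and 24; with a multiple of 64 they come out as 0, 64 and 128.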

At the upper end, we could use a common system page size, like 4096.  However,
we'd still have to calculate window width since system page sizes aren't
standardized.  That's a lot of wasted space for no real benefit that I know of.

Another multiple we may want to consider is cache-line size, which is the unit
of CPU memory fetching and is often 32, 64, or 128 bytes (but can be more or
less).  Say that the cache line size is 32 bytes and you have a data structure
which is exactly 32 bytes.  If the data is aligned to cache line size,
fetching it takes a single cache load operation, but if it is unaligned, it
takes two.  Here's a blog post Nathan Kurz passed along on the issue, along
with the Microsoft docs for their __declspec(align(#)) specifier:

http://x264dev.blogspot.com/2008/05/cacheline-splits-aka-intel-hell.html 
http://msdn.microsoft.com/en-us/library/83ythb65.aspx
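
The one-load-versus-two arithmetic can be sketched like so.  The function
name cache_lines_touched is hypothetical, and the line size is passed in as a
parameter because it varies by CPU:

```c
#include <assert.h>
#include <stdint.h>

/* Number of cache lines touched by `size` bytes starting at address (or
 * file offset) `addr`, for a cache line of `line_size` bytes. */
static uint64_t
cache_lines_touched(uint64_t addr, uint64_t size, uint64_t line_size)
{
    assert(size > 0 && line_size > 0);
    uint64_t first = addr / line_size;               /* line holding first byte */
    uint64_t last  = (addr + size - 1) / line_size;  /* line holding last byte  */
    return last - first + 1;
}
```

For a 32-byte structure and 32-byte lines, an address of 64 touches one line
while an address of 8 touches two.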

I'd be surprised if current KS were to reap any benefit from padding to a
common cache line size because of the heavy data compression.  Nevertheless,
it's not very costly to go with 32 or 64 instead of 8, and future posting list
implementations may use codecs which are optimized with cache line size in
mind.

Future posting list implementations may also use specialized assembly for 
optimizing performance on specific processors, and sometimes instruction sets
impose data alignment requirements.  From
<http://en.wikipedia.org/wiki/Data_structure_alignment#x86_and_x64>:

  While the x86 architecture originally did not require aligned memory access
  and still works without it, SSE2 instructions on x86 and x64 CPUs do require
  the data to be 128-bit (16-byte) aligned and there can be substantial
  performance advantages from using aligned data on these architectures.

Taking all that into consideration, 64 bytes seems like a reasonable offset
multiple to stipulate in the Lucy File Format spec.



> Compound File Format Spec
> -------------------------
>
>                 Key: LUCY-4
>                 URL: https://issues.apache.org/jira/browse/LUCY-4
>             Project: Lucy
>          Issue Type: Sub-task
>            Reporter: Marvin Humphrey
>            Assignee: Marvin Humphrey
>            Priority: Minor
>
> Lucene, KinoSearch, etc, use "compound files" for segment data to avoid running up against file descriptor limits, e.g. on Mac OS X where the default max is 256.  InStream objects created against the sub files all share a common file descriptor and specify an offset and a length.  Lucy
> will have to do something similar.
