Posted to dev@nutch.apache.org by Stefan Groschupf <sg...@media-style.com> on 2006/01/31 22:05:48 UTC

Re: Lucene's VInt for lengths/counts/sizes

+1 :-)

On 31.01.2006 at 22:06, Andrzej Bialecki wrote:

> Hi,
>
> I wonder, would it be a good idea to replace the (rather wasteful)  
> 4-byte ints with Lucene's variable-byte int encoding, in all places  
> where size matters? We could "borrow" the code from Lucene and  
> create a VIntWritable for this purpose. I'm thinking specifically  
> about the following places:
>
> * UTF8 (2-byte string length)
>
> * ArrayWritable/BytesWritable/TwoDArrayWritable (4-byte length)
>
> * Properties and derived maps (like ContentProperties): all lengths  
> are written as 4-byte ints.
>
> * any Writable that consists of lists of values is currently  
> serialized using a 4-byte int for the size of each list, e.g.  
> ParseData.outlinks
>
> Overall I think the size savings could be considerable, at the cost  
> of some CPU.
>
> -- 
> Best regards,
> Andrzej Bialecki     <><
> ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
>
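
For readers unfamiliar with the variable-byte encoding proposed above, here is a minimal Java sketch of it. It follows the scheme Lucene documents for VInt: seven data bits per byte, low-order group first, with the high bit set on every byte except the last. The class name VIntSketch and the encode/decode helpers are hypothetical and exist only for illustration; this is not the actual Lucene or Nutch code, and a real VIntWritable would presumably wrap the same logic in its Writable write/readFields methods.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration class; not part of Lucene or Nutch.
final class VIntSketch {

    // Writes i in 1 to 5 bytes: 7 data bits per byte, low-order bits first.
    // The high bit of a byte is set when more bytes follow.
    static void writeVInt(DataOutput out, int i) throws IOException {
        while ((i & ~0x7F) != 0) {
            out.writeByte((i & 0x7F) | 0x80);
            i >>>= 7; // unsigned shift, so negative values terminate after 5 bytes
        }
        out.writeByte(i);
    }

    // Reads a value written by writeVInt.
    static int readVInt(DataInput in) throws IOException {
        byte b = in.readByte();
        int i = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.readByte();
            i |= (b & 0x7F) << shift;
        }
        return i;
    }

    // Convenience helper (assumption, for demonstration): encode to a byte array.
    static byte[] encode(int i) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            writeVInt(new DataOutputStream(buf), i);
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience helper (assumption, for demonstration): decode from a byte array.
    static int decode(byte[] bytes) {
        try {
            return readVInt(new DataInputStream(new ByteArrayInputStream(bytes)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Under this scheme a length below 128 takes one byte and a length below 16384 takes two, compared with the fixed four bytes of writeInt. Values of 2^28 and above (and negative values) take five bytes, so the saving only materializes where lengths and counts are typically small.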