Posted to dev@nutch.apache.org by Stefan Groschupf <sg...@media-style.com> on 2006/01/31 22:05:48 UTC
Re: Lucene's VInt for lengths/counts/sizes
+1 :-)
On 31.01.2006 at 22:06, Andrzej Bialecki wrote:
> Hi,
>
> I wonder, would it be a good idea to replace the (rather wasteful)
> 4-byte ints with Lucene's variable-byte int encoding, in all places
> where size matters? We could "borrow" the code from Lucene and
> create a VIntWritable for this purpose. I'm thinking specifically
> about the following places:
>
> * UTF8 (2-byte string length)
>
> * ArrayWritable/BytesWritable/TwoDArrayWritable (4-byte length)
>
> * Properties and derived maps (like ContentProperties): all lengths
> are written as 4-byte ints.
>
> * any Writable that consists of lists of values is currently
> serialized using 4-byte ints for the size of the list, e.g.
> ParseData.outlinks
>
> Overall I think the size savings could be considerable, at the cost
> of some CPU.
>
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
>
>
>
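For reference, the variable-byte encoding proposed in the quoted message can be sketched roughly as follows. This is a minimal illustration of the general scheme (7 data bits per byte, low-order bits first, high bit set when more bytes follow), not code copied from Lucene; the class name VIntSketch and the methods encode and decode are hypothetical and exist only for this example.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch of a variable-byte int encoding; not Lucene's or Nutch's actual code.
public class VIntSketch {

    // Encode an int using 7 data bits per byte, low-order bits first.
    // The high bit of a byte is set when at least one more byte follows.
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(5);
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7; // unsigned shift so the loop always terminates
        }
        out.write(value);
        return out.toByteArray();
    }

    // Decode a value produced by encode(), reading from the start of the array.
    static int decode(byte[] bytes) {
        int result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break; // high bit clear: this was the last byte
            }
            shift += 7;
        }
        return result;
    }

    public static void main(String[] args) {
        int[] samples = {0, 127, 128, 16383, 16384, Integer.MAX_VALUE};
        for (int v : samples) {
            byte[] enc = encode(v);
            System.out.println(v + " -> " + enc.length + " byte(s), decoded " + decode(enc));
        }
    }
}
```

With this scheme, values below 128 take one byte and values below 16384 take two, so typical lengths and counts shrink compared with a fixed 4-byte int. The trade-off is that values of 2^28 and above (and negative values) need five bytes, and a length of 16384 or more needs three bytes where a fixed 2-byte length field would need two.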