Posted to dev@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2015/04/24 06:17:04 UTC

[lucy-dev] Blob

Greets,

I'd like to propose a new class for the Clownfish core: Blob, a
wrapper for constant binary data.

Blob is to ByteBuf as String is to CharBuf.

Introducing Blob will allow us to wrap host-supplied arbitrary binary
data the same way we wrap host-supplied UTF-8 content with String.
Right now we wrap such data with ByteBuf, which is dangerous because
ByteBuf allows the content to be manipulated.
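
A minimal sketch of the distinction, assuming hypothetical names (`DemoBlob`, `demo_blob_wrap`, `demo_blob_copy` and friends are illustrations, not the Clownfish API): the wrapper exposes its bytes only through a `const` pointer, so host-supplied data can be wrapped without handing out a way to modify it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical illustration only; none of these names are Clownfish API. */
typedef struct {
    const char *buf;      /* bytes are reachable only through a const pointer */
    size_t      size;     /* number of bytes */
    int         owns_buf; /* nonzero if demo_blob_destroy() must free buf */
} DemoBlob;

/* Wrap host-supplied bytes without copying; the host keeps ownership. */
DemoBlob
demo_blob_wrap(const void *bytes, size_t size) {
    DemoBlob blob;
    blob.buf      = (const char*)bytes;
    blob.size     = size;
    blob.owns_buf = 0;
    return blob;
}

/* Copy the bytes so the blob no longer depends on the caller's storage. */
DemoBlob
demo_blob_copy(const void *bytes, size_t size) {
    DemoBlob blob;
    char *copy = (char*)malloc(size > 0 ? size : 1);
    assert(copy != NULL);
    if (size > 0) { memcpy(copy, bytes, size); }
    blob.buf      = copy;
    blob.size     = size;
    blob.owns_buf = 1;
    return blob;
}

/* Read-only accessors: there is deliberately no setter or writable view. */
const char*
demo_blob_get_buf(const DemoBlob *blob) { return blob->buf; }

size_t
demo_blob_get_size(const DemoBlob *blob) { return blob->size; }

/* Release the bytes if this blob owns them, then clear the struct. */
void
demo_blob_destroy(DemoBlob *blob) {
    if (blob->owns_buf) { free((void*)blob->buf); }
    blob->buf      = NULL;
    blob->size     = 0;
    blob->owns_buf = 0;
}
```

A mutable buffer type would add a writable accessor and append operations on top of this; the sketch leaves them out on purpose.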

Marvin Humphrey

Re: [lucy-dev] Blob

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Mon, May 4, 2015 at 10:24 AM, Nick Wellnhofer <we...@aevum.de> wrote:

> I started with the implementation of Blob:
>
>     https://github.com/nwellnhof/lucy-clownfish/commits/CLOWNFISH-11-blob
>
> I also switched most of Lucy over to Blob:
>
>     https://github.com/nwellnhof/lucy/commits/CLOWNFISH-11-blob

Awesome!  I've reviewed both branches -- aside from one glitch where
Blob_new_steal calls Blob_init rather than Blob_init_steal, they look
great!  +1 to merge to master.
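
To illustrate the difference between the two initializers, here is a rough sketch with hypothetical names (`StealBlob`, `steal_blob_init`, `steal_blob_init_steal`, `steal_blob_new_steal`); it is not the code from the branch. The copying initializer duplicates the bytes, while the stealing one adopts a malloc'd buffer as-is.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical illustration only; not the code from the branch. */
typedef struct {
    const char *buf;   /* always owned by the blob in this sketch */
    size_t      size;
} StealBlob;

/* Copying initializer: duplicates the bytes; the caller keeps its buffer. */
StealBlob*
steal_blob_init(StealBlob *self, const void *bytes, size_t size) {
    char *copy = (char*)malloc(size > 0 ? size : 1);
    assert(copy != NULL);
    if (size > 0) { memcpy(copy, bytes, size); }
    self->buf  = copy;
    self->size = size;
    return self;
}

/* Stealing initializer: adopts a malloc'd buffer without copying.
 * The caller must not use or free `bytes` afterwards. */
StealBlob*
steal_blob_init_steal(StealBlob *self, void *bytes, size_t size) {
    self->buf  = (const char*)bytes;
    self->size = size;
    return self;
}

/* The stealing constructor has to pair with the stealing initializer. */
StealBlob*
steal_blob_new_steal(void *bytes, size_t size) {
    StealBlob *self = (StealBlob*)malloc(sizeof(StealBlob));
    assert(self != NULL);
    return steal_blob_init_steal(self, bytes, size);
}

/* Free the owned bytes and the blob itself. */
void
steal_blob_free(StealBlob *self) {
    free((void*)self->buf);
    free(self);
}
```

In a sketch like this, calling the copying initializer from the stealing constructor would still produce a valid blob, but the buffer handed over by the caller would never be freed.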

> ByteBufs remain at the following locations:
>
> * As argument to Read_Record in the default readers. These methods
>   could be changed to return an incremented Blob, incurring
>   additional memory allocations.
> * `anchor_set` ivar in PhraseMatcher.
> * Backing store of RAMFiles.
> * MemoryPool arenas.

These all seem fine as they are -- they're not dangerous and they are
appropriate use cases for ByteBuf (to varying degrees).

Marvin Humphrey

Re: [lucy-dev] Blob

Posted by Nick Wellnhofer <we...@aevum.de>.
On 24/04/2015 06:17, Marvin Humphrey wrote:
> I'd like to propose a new class for the Clownfish core: Blob, a
> wrapper for constant binary data.
>
> Blob is to ByteBuf as String is to CharBuf.
>
> Introducing Blob will allow us to wrap host-supplied arbitrary binary
> data the same way we wrap host-supplied UTF-8 content with String.
> Right now we wrap such data with ByteBuf, which is dangerous because
> ByteBuf allows the content to be manipulated.

I started with the implementation of Blob:

     https://github.com/nwellnhof/lucy-clownfish/commits/CLOWNFISH-11-blob

I also switched most of Lucy over to Blob:

     https://github.com/nwellnhof/lucy/commits/CLOWNFISH-11-blob

ByteBufs remain at the following locations:

* As argument to Read_Record in the default readers. These methods
   could be changed to return an incremented Blob, incurring
   additional memory allocations.
* `anchor_set` ivar in PhraseMatcher.
* Backing store of RAMFiles.
* MemoryPool arenas.

Nick


Re: [lucy-dev] Blob

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Tue, Apr 28, 2015 at 12:21 PM, Nick Wellnhofer <we...@aevum.de> wrote:

> Ultimately, I'd like to have something that can also be used for arrays of
> other number types. Maybe with an API similar to JavaScript's TypedArray or
> DataView:
>
> https://developer.mozilla.org/en-US/docs/Web/JavaScript/Typed_arrays
>
> But with a clear distinction between immutable and mutable objects and the
> ability to append data.

I can see the use case here.  If we add `Set_I32` and `Get_I32` to ByteBuf,
we could use it to replace Lucy's I32Array.
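
Roughly what such accessors might look like, as a sketch under assumed names and signatures (`I32DemoBuf`, `i32_demo_set_i32`, `i32_demo_get_i32` are hypothetical; the real `Set_I32`/`Get_I32` do not exist yet). Values are stored in native byte order and moved with `memcpy` so the byte buffer needs no particular alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical illustration only; not the Clownfish ByteBuf. */
typedef struct {
    char   *buf;   /* raw bytes */
    size_t  size;  /* size in bytes */
} I32DemoBuf;

/* Allocate a zeroed buffer with room for num_ints 32-bit integers. */
I32DemoBuf
i32_demo_new(size_t num_ints) {
    I32DemoBuf self;
    self.size = num_ints * sizeof(int32_t);
    self.buf  = (char*)calloc(num_ints > 0 ? num_ints : 1, sizeof(int32_t));
    assert(self.buf != NULL);
    return self;
}

/* Store value at integer index tick (native byte order). */
void
i32_demo_set_i32(I32DemoBuf *self, size_t tick, int32_t value) {
    assert(tick < self->size / sizeof(int32_t));
    memcpy(self->buf + tick * sizeof(int32_t), &value, sizeof(int32_t));
}

/* Fetch the integer at index tick (native byte order). */
int32_t
i32_demo_get_i32(const I32DemoBuf *self, size_t tick) {
    int32_t value;
    assert(tick < self->size / sizeof(int32_t));
    memcpy(&value, self->buf + tick * sizeof(int32_t), sizeof(int32_t));
    return value;
}

/* Release the bytes and clear the struct. */
void
i32_demo_destroy(I32DemoBuf *self) {
    free(self->buf);
    self->buf  = NULL;
    self->size = 0;
}
```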

For the time being, I think that the Clownfish core runtime only needs to
provide enough data types to support JSON: String, Hash, Vector, Integer,
Float, Bool, and NULL.  We do need Blob/ByteBuf to hold arbitrary binary data
(such as packed, undecoded JSON!) and for technical reasons we need iterators
on at least String and Hash.

IMO, for anything beyond that it would be best to experiment outside the core,
where the type can have a chance to evolve and mature.  The smaller and
simpler the Clownfish core, the easier it is to add support for more
languages, which is where we currently have the most profound need.

So if you want to experiment with a data structure along the lines of
TypedArray or DataView, perhaps a good place would be within Lucy as a private
type replacing I32Array.

Marvin Humphrey

Re: [lucy-dev] Blob

Posted by Nick Wellnhofer <we...@aevum.de>.
On 28/04/2015 04:43, Marvin Humphrey wrote:
> For now, I don't think adding encoders/decoders for primitive types onto
> Blob/ByteBuf is pressing, and I suggest we delay that discussion.

Ultimately, I'd like to have something that can also be used for arrays of 
other number types. Maybe with an API similar to JavaScript's TypedArray or 
DataView:

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Typed_arrays

But with a clear distinction between immutable and mutable objects and the 
ability to append data.

> Lucy uses the NumUtil_ inline routines extensively and in
> performance-sensitive locations like BitVector.  It won't be able to use
> methods on ByteBuf or Blob or their iterators without taking a speed hit.

Yes, I realized that, too.

> For now, I propose that we move the NumberUtils module back to Lucy, where it
> will be private.

+1

Nick


Re: [lucy-dev] Blob

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Fri, Apr 24, 2015 at 4:03 AM, Nick Wellnhofer <we...@aevum.de> wrote:

> +1. Also see CLOWNFISH-11:
>
>     https://issues.apache.org/jira/browse/CLOWNFISH-11

Heh.

From the issue:

    I recently toyed with the idea of reworking Clownfish's ByteBuf in a way
    similar to the immutable String changes. This would mean to use a separate
    class for immutable byte buffers replacing ViewByteBuf and restrict
    ByteBuf to creating byte buffers.

    The immutable ByteBuf class could then have a companion iterator class
    that can be used to parse a byte buffer's content. All the
    NumUtils_decode_* functions could go there, for example.

    The NumUtils_encode_* functions could be made methods of the mutable
    ByteBuf.

+1 to everything in your first paragraph.

For now, I don't think adding encoders/decoders for primitive types onto
Blob/ByteBuf is pressing, and I suggest we delay that discussion.  But it
would be great to do something with NumberUtils -- the Clownfish runtime only
uses it once internally and it isn't essential functionality that Clownfish
absolutely needs to support.

Lucy uses the NumUtil_ inline routines extensively and in
performance-sensitive locations like BitVector.  It won't be able to use
methods on ByteBuf or Blob or their iterators without taking a speed hit.
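
For illustration, the kind of inline routine in question might look like the sketch below. The names `demo_encode_bigend_u32` and `demo_decode_bigend_u32` are hypothetical stand-ins, not the actual NumUtil_ functions. Because they are `static inline`, the compiler can expand them at the call site, which a method call resolved at runtime generally does not allow.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for header-level inline routines; the real
 * NumUtil_ functions may differ in name and signature. */

/* Write value to dest as four bytes, most significant byte first. */
static inline void
demo_encode_bigend_u32(uint32_t value, void *dest) {
    uint8_t *out = (uint8_t*)dest;
    out[0] = (uint8_t)(value >> 24);
    out[1] = (uint8_t)(value >> 16);
    out[2] = (uint8_t)(value >> 8);
    out[3] = (uint8_t)value;
}

/* Read four big-endian bytes from source and return them as a uint32_t. */
static inline uint32_t
demo_decode_bigend_u32(const void *source) {
    const uint8_t *in = (const uint8_t*)source;
    return ((uint32_t)in[0] << 24)
         | ((uint32_t)in[1] << 16)
         | ((uint32_t)in[2] << 8)
         |  (uint32_t)in[3];
}
```

Both routines work on plain pointers, so they can be used on any byte storage without going through an object.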

For now, I propose that we move the NumberUtils module back to Lucy, where it
will be private.

Marvin Humphrey

Re: [lucy-dev] Blob

Posted by Nick Wellnhofer <we...@aevum.de>.
On 24/04/2015 06:17, Marvin Humphrey wrote:
> I'd like to propose a new class for the Clownfish core: Blob, a
> wrapper for constant binary data.
>
> Blob is to ByteBuf as String is to CharBuf.
>
> Introducing Blob will allow us to wrap host-supplied arbitrary binary
> data the same way we wrap host-supplied UTF-8 content with String.
> Right now we wrap such data with ByteBuf, which is dangerous because
> ByteBuf allows the content to be manipulated.

+1. Also see CLOWNFISH-11:

     https://issues.apache.org/jira/browse/CLOWNFISH-11

Nick