You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@asterixdb.apache.org by Jianfeng Jia <ji...@gmail.com> on 2016/01/14 01:27:28 UTC

Implement an SerializableVector in Hyracks

Hi Devs,

First of all, Xi Zhang is a Master student at UCI wants to work with us for a while. Welcome Xi!

We are thinking of making a Frame-based, memory-bound SerializableVector at first. We expect this vector can solve some occasionally Java.Heap.OutOfMemory exceptions in Hyracks.
Though we did a good job on organizing the record-located memory, the OOM exception can still happen while operating the auxiliary data structure. For example in the sort run generator, instead of moving record around we are creating an reference “pointer" array to the original record. However, if the record is small and the size of that int array will be large, then the OOM exception will occur, which is the case of issue [1].

One way to solve this problem is to put auxiliary data structures into the memory-bounded frame as well. In general, it will be much easier to ask for multiple small memory blocks than one big chunk of memory. I guess that was the same reason why we have “SerializableHashTable” for HashJoin. It will be nice to have a more general structure that can be used by all the operators.

The Frame based Vector idea is inspired by the Scala Vector[2] which looks like a List, but internally it is implemented as a 32-ary tree. The performance of it is very stable for variety size of object[3]. It will have all the benefits of ArrayList and the LinkedList. In addition, we can take the memory usage of the auxiliary structure into the calculation. We will work on the detailed design doc later if we are agree on this direction.

Any thoughts or suggestions? Thank you!

[1] https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity <https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity>
[2] https://bitbucket.org/astrieanna/bitmapped-vector-trie <https://bitbucket.org/astrieanna/bitmapped-vector-trie>
[3] http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/ <http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/>

Best,

Jianfeng Jia
PhD Candidate of Computer Science
University of California, Irvine

Re: Implement an SerializableVector in Hyracks

Posted by Chen Li <ch...@gmail.com>.

Jianfeng and Xi: eventually please make sure to put the design doc to
the wiki site.

Chen

On Mon, Feb 1, 2016 at 9:56 AM, Jianfeng Jia <ji...@gmail.com> wrote:
> Hi,
>
> We plan to implement an append-only array at first. The main reason is this is how the auxiliary data structure be used so far. Then the implementation is straightforward.
>
> The tree-structured Vector in Scala can save a lot in updating case mainly because of their immutable requirement. It can saving unnecessary data copies comparing to other immutable list when updating. We only allow in-place update. The tree design may be overkill for us.
>
> Xi made on detailed design doc is here: https://docs.google.com/document/d/1bs3JBCxmvuJZmBq_gKzUDt0FZdM9j3d3MPAmC29ZoSA/edit?usp=sharing
> Any thoughts or comments?
>
>
>> On Jan 18, 2016, at 9:52 AM, Chen Li <ch...@gmail.com> wrote:
>>
>> @Xi and Jianfeng: after we come up with the design, let's share it with the
>> group for an approval before the implementation.
>>
>> Chen
>>
>> On Fri, Jan 15, 2016 at 11:48 AM, Mike Carey <dt...@gmail.com> wrote:
>>
>>> The accounting is just as critical as the chunking - we should do both
>>> (together).
>>>
>>>
>>> On 1/15/16 9:00 AM, Till Westmann wrote:
>>>
>>>> I don’t have relevant experience on the subject. But I think that it
>>>> sounds good to avoid arbitrarily long chunks of memory. Especially - as
>>>> Jianfeng wrote - it would be good to be able to a) account for this memory
>>>> and b) to manage it.
>>>> An interesting question for me would be what the overhead of such a
>>>> Vector is compared to a simple Java array and as a result where it should
>>>> be used to replace arrays. (The comparison in [3] only compares different
>>>> Scala collections, but doesn’t look at plain arrays.)
>>>>
>>>> Cheers,
>>>> Till
>>>>
>>>> On 14 Jan 2016, at 22:05, Chen Li wrote:
>>>>
>>>> Before we ask Xi to work on this project, it will be good to know if
>>>>> other people have seen similar problems and agree with this plan.
>>>>> @Till: can you share some tips?
>>>>>
>>>>> Chen
>>>>>
>>>>> On Wed, Jan 13, 2016 at 4:27 PM, Jianfeng Jia <ji...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Devs,
>>>>>>
>>>>>> First of all, Xi Zhang is a Master student at UCI wants to work with us
>>>>>> for a while. Welcome Xi!
>>>>>>
>>>>>> We are thinking of making a Frame-based, memory-bound
>>>>>> SerializableVector at first. We expect this vector can solve some
>>>>>> occasionally Java.Heap.OutOfMemory exceptions in Hyracks.
>>>>>> Though we did a good job on organizing the record-located memory, the
>>>>>> OOM exception can still happen while operating the auxiliary data
>>>>>> structure. For example in the sort run generator, instead of moving record
>>>>>> around we are creating an reference “pointer" array to the original record.
>>>>>> However, if the record is small and the size of that int array will be
>>>>>> large, then the OOM exception will occur, which is the case of issue [1].
>>>>>>
>>>>>> One way to solve this problem is to put auxiliary data structures into
>>>>>> the memory-bounded frame as well. In general, it will be much easier to ask
>>>>>> for multiple small memory blocks than one big chunk of memory. I guess that
>>>>>> was the same reason why we have “SerializableHashTable” for HashJoin. It
>>>>>> will be nice to have a more general structure that can be used by all the
>>>>>> operators.
>>>>>>
>>>>>> The Frame based Vector idea is inspired by the Scala Vector[2] which
>>>>>> looks like a List, but internally it is implemented as a 32-ary tree. The
>>>>>> performance of it is very stable for variety size of object[3]. It will
>>>>>> have all the benefits of ArrayList and the LinkedList. In addition, we can
>>>>>> take the memory usage of the auxiliary structure into the calculation. We
>>>>>> will work on the detailed design doc later if we are agree on this
>>>>>> direction.
>>>>>>
>>>>>> Any thoughts or suggestions? Thank you!
>>>>>>
>>>>>>
>>>>>> [1]
>>>>>> https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity
>>>>>> <
>>>>>> https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity>
>>>>>>
>>>>>> [2] https://bitbucket.org/astrieanna/bitmapped-vector-trie <
>>>>>> https://bitbucket.org/astrieanna/bitmapped-vector-trie>
>>>>>> [3]
>>>>>> http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/
>>>>>> <
>>>>>> http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/>
>>>>>>
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Jianfeng Jia
>>>>>> PhD Candidate of Computer Science
>>>>>> University of California, Irvine
>>>>>>
>>>>>>
>>>
>
>
>
> Best,
>
> Jianfeng Jia
> PhD Candidate of Computer Science
> University of California, Irvine
>

Re: Implement an SerializableVector in Hyracks

Posted by Chen Li <ch...@gmail.com>.

OK.  Then we will not include Xi's code.  This subject is closed.


On Wed, Apr 6, 2016 at 12:56 PM, Mike Carey <dt...@gmail.com> wrote:

> Agreed!
>
>
> On 4/6/16 10:14 AM, Till Westmann wrote:
>
>> Not really. Just talked with Yingyi as well and we agree with Jianfeng
>> that
>> taking the data structure memory usage into account is probably the most
>> important improvement here.
>>
>> Cheers,
>> Till
>>
>> On 6 Apr 2016, at 9:21, Chen Li wrote:
>>
>> @Till: do you have any thoughts on this subject before we close it?
>>>
>>> On Fri, Apr 1, 2016 at 2:08 PM, Jianfeng Jia <ji...@gmail.com>
>>> wrote:
>>>
>>> Dear devs,
>>>>
>>>> Xi has implemented the SerializableVector(SVector) which allocated
>>>> memory
>>>> increasingly rather than a long java int array upfront.
>>>> We did some tests but it seems not a good enough answer to solve the OOM
>>>> issue caused by long java array allocation.
>>>>
>>>> The first experiment was to test if the standalone SVector approach
>>>> would
>>>> get more memory comparing to the long int array. With the heap size of
>>>> -Xmx2g on a single machine, the SVector could allocate up to 1280M data,
>>>> while the original Java array can allocate up to 1080M data. It can
>>>> allocate about 18% more space.
>>>>
>>>> When we move to the AsterixDB, we replace the int array which stores the
>>>> (FrameId, TupleId) pair with the SVector. Then we gradually increasing
>>>> the
>>>> "compiler.sortmemory" to increase the in-memory buffer for external
>>>> sort.
>>>> Both master and SVector succeeded in 1150M case and failed at 1250M
>>>> case.
>>>> Furthermore, the master branch succeeded in 1200M case but the SVector
>>>> failed in that case.
>>>>
>>>> Given that, we may think that allocate the memory chunk by chunk doesn't
>>>> win too much space comparing to allocate a big one directly.
>>>> I think we need to take the structure memory usage into the spilling
>>>> case
>>>> calculation, but we may not need an extra structure to chop the memory.
>>>>
>>>> Any insights?
>>>> If you are interested, the code sample Xi wrote is here:
>>>> https://github.com/xixZ/JVMExperiment/tree/master/src/testing
>>>>
>>>>
>>>> On Tue, Feb 2, 2016 at 11:01 PM, Jianfeng Jia <ji...@gmail.com>
>>>> wrote:
>>>>
>>>> Thanks Ted for the elaborate introduction!
>>>>> I did some reading about it. Based on my understanding, the main
>>>>>
>>>> advantage
>>>>
>>>>> of ValueVector is its column-wise design. In that case, the per-record
>>>>> based metadata, e.g. indexes or hash keys, can be directly added as an
>>>>> column along with the existing record. Thus, the sorting within one row
>>>>> group should be easily addressed. One question still not very clear to
>>>>> me
>>>>> is how to generate the sorted result across several row groups?
>>>>> If you can get somebody to talk more about it that will be great. Thank
>>>>> you!
>>>>>
>>>>> On Feb 1, 2016, at 6:03 PM, Ted Dunning <te...@gmail.com> wrote:
>>>>>
>>>>> So there are several key points for ValueVectors that I can describe,
>>>>> but
>>>>> for the authoritative voice, others would have to speak.
>>>>>
>>>>> The first point is that in Drill at least (and this is not required)
>>>>> ValueVectors are off-heap.  This helps enormously in managing the
>>>>> life-cycle since vectors can be associated with queries and when the
>>>>>
>>>> query
>>>>
>>>>> ends, all associated vectors can be deallocated quickly.  That also
>>>>>
>>>> allows
>>>>
>>>>> the memory footprint of Drill to be adjusted both up and down while
>>>>> running.
>>>>>
>>>>> Secondly, ValueVectors are stored column-wise, not record-wise.  Most
>>>>> manipulations do not require copies. Projection simply requires
>>>>> ignoring
>>>>> some columns. New columns can be added without disturbing old ones.
>>>>> Filtering is done using a row selection bitmap. Sorting is often done
>>>>>
>>>> using
>>>>
>>>>> an index column.
>>>>>
>>>>> The assumption is also that you will have a row group with something
>>>>>
>>>> like a
>>>>
>>>>> hundred thousand rows in it. This means that the size of a single group
>>>>> isn't usually astronomical although very large data structures in a
>>>>>
>>>> single
>>>>
>>>>> row can make the regulation of the size of row groups more difficult.
>>>>>
>>>>> Of particular interest is the fact that processing code can be
>>>>> generated
>>>>>
>>>> in
>>>>
>>>>> Java that avoids almost all object creation so that most SQL-like
>>>>> queries
>>>>> don't require any object cons'ing at all during the row scans.
>>>>> Moreover,
>>>>> the code generated can even be rendered by the JIT into vectorized low
>>>>> level instructions because the access patterns on ValueVectors are so
>>>>> simple.
>>>>>
>>>>> Nested data structures are handled using invisible marking columns
>>>>>
>>>> similar
>>>>
>>>>> to the way that nesting is marked in Dremel or Parquet. This means that
>>>>>
>>>> you
>>>>
>>>>> get uniformly typed pseudo columns that represent a flattened view of
>>>>> the
>>>>> nested content. Many restructuring operations can be done by simply
>>>>> re-interpreting the nested data without any copying at all.
>>>>>
>>>>> If more detail is desired we should probably get somebody who is more
>>>>> active in the Drill implementation to talk about how this all works and
>>>>>
>>>> how
>>>>
>>>>> it will be extracted into Apache Arrow.
>>>>>
>>>>> More information can be found here:
>>>>>
>>>>> https://drill.apache.org/docs/value-vectors/
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Feb 1, 2016 at 5:08 PM, Jianfeng Jia <ji...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> If I understand correctly, it seems very similar to the IFrame in
>>>>> Hyrack,
>>>>> which is also a container to store a sequence of record into the
>>>>> ByteBuffer.
>>>>>
>>>>> I’m not clear about how records are manipulated inside the
>>>>> ValueVectors.
>>>>> In Hyracks case, we store the pointer(usually a pair of (frameID,
>>>>> recordID)) in one java array and manipulate the pointers instead of the
>>>>> original records. We mainly want to break the that one array into
>>>>>
>>>> multiple
>>>>
>>>>> small ByteBuffer-based arrays. By doing so, we can reduce the risk of
>>>>> OOM
>>>>> for a large array and we may also take the memory usage of those
>>>>> pointers
>>>>> into account for the flush decisions.  @Ted, could you share some
>>>>>
>>>> insights
>>>>
>>>>> about how the ValueVectors handles manipulations? e.g. sort, hashing …
>>>>>
>>>> etc.
>>>>
>>>>>
>>>>> On Feb 1, 2016, at 3:16 PM, Ted Dunning <te...@gmail.com> wrote:
>>>>>
>>>>> Have you guys looked at the Drill ValueVectors?
>>>>>
>>>>> This structure is being spun out as Apache Arrow with multiple
>>>>> interfaces
>>>>> and language bindings.
>>>>>
>>>>> On Mon, Feb 1, 2016 at 9:56 AM, Jianfeng Jia <ji...@gmail.com>
>>>>>
>>>>> wrote:
>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> We plan to implement an append-only array at first. The main reason is
>>>>> this is how the auxiliary data structure be used so far. Then the
>>>>> implementation is straightforward.
>>>>>
>>>>> The tree-structured Vector in Scala can save a lot in updating case
>>>>>
>>>>> mainly
>>>>>
>>>>> because of their immutable requirement. It can saving unnecessary data
>>>>> copies comparing to other immutable list when updating. We only allow
>>>>> in-place update. The tree design may be overkill for us.
>>>>>
>>>>> Xi made on detailed design doc is here:
>>>>>
>>>>>
>>>>>
>>>>>
>>>> https://docs.google.com/document/d/1bs3JBCxmvuJZmBq_gKzUDt0FZdM9j3d3MPAmC29ZoSA/edit?usp=sharing
>>>>
>>>>>
>>>>> Any thoughts or comments?
>>>>>
>>>>>
>>>>> On Jan 18, 2016, at 9:52 AM, Chen Li <ch...@gmail.com> wrote:
>>>>>
>>>>> @Xi and Jianfeng: after we come up with the design, let's share it with
>>>>>
>>>>> the
>>>>>
>>>>> group for an approval before the implementation.
>>>>>
>>>>> Chen
>>>>>
>>>>> On Fri, Jan 15, 2016 at 11:48 AM, Mike Carey <dt...@gmail.com>
>>>>>
>>>>> wrote:
>>>>>
>>>>>
>>>>> The accounting is just as critical as the chunking - we should do both
>>>>> (together).
>>>>>
>>>>>
>>>>> On 1/15/16 9:00 AM, Till Westmann wrote:
>>>>>
>>>>> I don’t have relevant experience on the subject. But I think that it
>>>>> sounds good to avoid arbitrarily long chunks of memory. Especially -
>>>>>
>>>>> as
>>>>>
>>>>> Jianfeng wrote - it would be good to be able to a) account for this
>>>>>
>>>>> memory
>>>>>
>>>>> and b) to manage it.
>>>>> An interesting question for me would be what the overhead of such a
>>>>> Vector is compared to a simple Java array and as a result where it
>>>>>
>>>>> should
>>>>>
>>>>> be used to replace arrays. (The comparison in [3] only compares
>>>>>
>>>>> different
>>>>>
>>>>> Scala collections, but doesn’t look at plain arrays.)
>>>>>
>>>>> Cheers,
>>>>> Till
>>>>>
>>>>> On 14 Jan 2016, at 22:05, Chen Li wrote:
>>>>>
>>>>> Before we ask Xi to work on this project, it will be good to know if
>>>>>
>>>>> other people have seen similar problems and agree with this plan.
>>>>> @Till: can you share some tips?
>>>>>
>>>>> Chen
>>>>>
>>>>> On Wed, Jan 13, 2016 at 4:27 PM, Jianfeng Jia <
>>>>>
>>>>> jianfeng.jia@gmail.com
>>>>>
>>>>>
>>>>> wrote:
>>>>>
>>>>> Hi Devs,
>>>>>
>>>>> First of all, Xi Zhang is a Master student at UCI wants to work
>>>>>
>>>>> with
>>>>>
>>>>> us
>>>>>
>>>>> for a while. Welcome Xi!
>>>>>
>>>>> We are thinking of making a Frame-based, memory-bound
>>>>> SerializableVector at first. We expect this vector can solve some
>>>>> occasionally Java.Heap.OutOfMemory exceptions in Hyracks.
>>>>> Though we did a good job on organizing the record-located memory,
>>>>>
>>>>> the
>>>>>
>>>>> OOM exception can still happen while operating the auxiliary data
>>>>> structure. For example in the sort run generator, instead of moving
>>>>>
>>>>> record
>>>>>
>>>>> around we are creating an reference “pointer" array to the original
>>>>>
>>>>> record.
>>>>>
>>>>> However, if the record is small and the size of that int array will
>>>>>
>>>>> be
>>>>>
>>>>> large, then the OOM exception will occur, which is the case of
>>>>>
>>>>> issue
>>>>>
>>>>> [1].
>>>>>
>>>>>
>>>>> One way to solve this problem is to put auxiliary data structures
>>>>>
>>>>> into
>>>>>
>>>>> the memory-bounded frame as well. In general, it will be much
>>>>>
>>>>> easier
>>>>>
>>>>> to ask
>>>>>
>>>>> for multiple small memory blocks than one big chunk of memory. I
>>>>>
>>>>> guess that
>>>>>
>>>>> was the same reason why we have “SerializableHashTable” for
>>>>>
>>>>> HashJoin. It
>>>>>
>>>>> will be nice to have a more general structure that can be used by
>>>>>
>>>>> all the
>>>>>
>>>>> operators.
>>>>>
>>>>> The Frame based Vector idea is inspired by the Scala Vector[2]
>>>>>
>>>>> which
>>>>>
>>>>> looks like a List, but internally it is implemented as a 32-ary
>>>>>
>>>>> tree. The
>>>>>
>>>>> performance of it is very stable for variety size of object[3]. It
>>>>>
>>>>> will
>>>>>
>>>>> have all the benefits of ArrayList and the LinkedList. In addition,
>>>>>
>>>>> we can
>>>>>
>>>>> take the memory usage of the auxiliary structure into the
>>>>>
>>>>> calculation. We
>>>>>
>>>>> will work on the detailed design doc later if we are agree on this
>>>>> direction.
>>>>>
>>>>> Any thoughts or suggestions? Thank you!
>>>>>
>>>>>
>>>>> [1]
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>> https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity
>>>>
>>>>>
>>>>> <
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>> https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity
>>>>
>>>>>
>>>>>
>>>>>
>>>>> [2] https://bitbucket.org/astrieanna/bitmapped-vector-trie <
>>>>> https://bitbucket.org/astrieanna/bitmapped-vector-trie>
>>>>> [3]
>>>>>
>>>>> http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/
>>>>>
>>>>> <
>>>>>
>>>>> http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Best,
>>>>>
>>>>> Jianfeng Jia
>>>>> PhD Candidate of Computer Science
>>>>> University of California, Irvine
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Best,
>>>>>
>>>>> Jianfeng Jia
>>>>> PhD Candidate of Computer Science
>>>>> University of California, Irvine
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Best,
>>>>>
>>>>> Jianfeng Jia
>>>>> PhD Candidate of Computer Science
>>>>> University of California, Irvine
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Best,
>>>>>
>>>>> Jianfeng Jia
>>>>> PhD Candidate of Computer Science
>>>>> University of California, Irvine
>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>>
>>>> -----------------
>>>> Best Regards
>>>>
>>>> Jianfeng Jia
>>>> Ph.D. Candidate of Computer Science
>>>> University of California, Irvine
>>>>
>>>>
>

Re: Implement an SerializableVector in Hyracks

Posted by Mike Carey <dt...@gmail.com>.

Agreed!

On 4/6/16 10:14 AM, Till Westmann wrote:
> Not really. Just talked with Yingyi as well and we agree with Jianfeng 
> that
> taking the data structure memory usage into account is probably the most
> important improvement here.
>
> Cheers,
> Till
>
> On 6 Apr 2016, at 9:21, Chen Li wrote:
>
>> @Till: do you have any thoughts on this subject before we close it?
>>
>> On Fri, Apr 1, 2016 at 2:08 PM, Jianfeng Jia <ji...@gmail.com> 
>> wrote:
>>
>>> Dear devs,
>>>
>>> Xi has implemented the SerializableVector(SVector) which allocated 
>>> memory
>>> increasingly rather than a long java int array upfront.
>>> We did some tests but it seems not a good enough answer to solve the 
>>> OOM
>>> issue caused by long java array allocation.
>>>
>>> The first experiment was to test if the standalone SVector approach 
>>> would
>>> get more memory comparing to the long int array. With the heap size of
>>> -Xmx2g on a single machine, the SVector could allocate up to 1280M 
>>> data,
>>> while the original Java array can allocate up to 1080M data. It can
>>> allocate about 18% more space.
>>>
>>> When we move to the AsterixDB, we replace the int array which stores 
>>> the
>>> (FrameId, TupleId) pair with the SVector. Then we gradually 
>>> increasing the
>>> "compiler.sortmemory" to increase the in-memory buffer for external 
>>> sort.
>>> Both master and SVector succeeded in 1150M case and failed at 1250M 
>>> case.
>>> Furthermore, the master branch succeeded in 1200M case but the SVector
>>> failed in that case.
>>>
>>> Given that, we may think that allocate the memory chunk by chunk 
>>> doesn't
>>> win too much space comparing to allocate a big one directly.
>>> I think we need to take the structure memory usage into the spilling 
>>> case
>>> calculation, but we may not need an extra structure to chop the memory.
>>>
>>> Any insights?
>>> If you are interested, the code sample Xi wrote is here:
>>> https://github.com/xixZ/JVMExperiment/tree/master/src/testing
>>>
>>>
>>> On Tue, Feb 2, 2016 at 11:01 PM, Jianfeng Jia <ji...@gmail.com>
>>> wrote:
>>>
>>>> Thanks Ted for the elaborate introduction!
>>>> I did some reading about it. Based on my understanding, the main
>>> advantage
>>>> of ValueVector is its column-wise design. In that case, the per-record
>>>> based metadata, e.g. indexes or hash keys, can be directly added as an
>>>> column along with the existing record. Thus, the sorting within one 
>>>> row
>>>> group should be easily addressed. One question still not very clear 
>>>> to me
>>>> is how to generate the sorted result across several row groups?
>>>> If you can get somebody to talk more about it that will be great. 
>>>> Thank
>>>> you!
>>>>
>>>> On Feb 1, 2016, at 6:03 PM, Ted Dunning <te...@gmail.com> wrote:
>>>>
>>>> So there are several key points for ValueVectors that I can 
>>>> describe, but
>>>> for the authoritative voice, others would have to speak.
>>>>
>>>> The first point is that in Drill at least (and this is not required)
>>>> ValueVectors are off-heap.  This helps enormously in managing the
>>>> life-cycle since vectors can be associated with queries and when the
>>> query
>>>> ends, all associated vectors can be deallocated quickly.  That also
>>> allows
>>>> the memory footprint of Drill to be adjusted both up and down while
>>>> running.
>>>>
>>>> Secondly, ValueVectors are stored column-wise, not record-wise.  Most
>>>> manipulations do not require copies. Projection simply requires 
>>>> ignoring
>>>> some columns. New columns can be added without disturbing old ones.
>>>> Filtering is done using a row selection bitmap. Sorting is often done
>>> using
>>>> an index column.
>>>>
>>>> The assumption is also that you will have a row group with something
>>> like a
>>>> hundred thousand rows in it. This means that the size of a single 
>>>> group
>>>> isn't usually astronomical although very large data structures in a
>>> single
>>>> row can make the regulation of the size of row groups more difficult.
>>>>
>>>> Of particular interest is the fact that processing code can be 
>>>> generated
>>> in
>>>> Java that avoids almost all object creation so that most SQL-like 
>>>> queries
>>>> don't require any object cons'ing at all during the row scans. 
>>>> Moreover,
>>>> the code generated can even be rendered by the JIT into vectorized low
>>>> level instructions because the access patterns on ValueVectors are so
>>>> simple.
>>>>
>>>> Nested data structures are handled using invisible marking columns
>>> similar
>>>> to the way that nesting is marked in Dremel or Parquet. This means 
>>>> that
>>> you
>>>> get uniformly typed pseudo columns that represent a flattened view 
>>>> of the
>>>> nested content. Many restructuring operations can be done by simply
>>>> re-interpreting the nested data without any copying at all.
>>>>
>>>> If more detail is desired we should probably get somebody who is more
>>>> active in the Drill implementation to talk about how this all works 
>>>> and
>>> how
>>>> it will be extracted into Apache Arrow.
>>>>
>>>> More information can be found here:
>>>>
>>>> https://drill.apache.org/docs/value-vectors/
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Feb 1, 2016 at 5:08 PM, Jianfeng Jia <ji...@gmail.com>
>>>> wrote:
>>>>
>>>> If I understand correctly, it seems very similar to the IFrame in 
>>>> Hyrack,
>>>> which is also a container to store a sequence of record into the
>>>> ByteBuffer.
>>>>
>>>> I’m not clear about how records are manipulated inside the 
>>>> ValueVectors.
>>>> In Hyracks case, we store the pointer(usually a pair of (frameID,
>>>> recordID)) in one java array and manipulate the pointers instead of 
>>>> the
>>>> original records. We mainly want to break the that one array into
>>> multiple
>>>> small ByteBuffer-based arrays. By doing so, we can reduce the risk 
>>>> of OOM
>>>> for a large array and we may also take the memory usage of those 
>>>> pointers
>>>> into account for the flush decisions.  @Ted, could you share some
>>> insights
>>>> about how the ValueVectors handles manipulations? e.g. sort, hashing …
>>> etc.
>>>>
>>>> On Feb 1, 2016, at 3:16 PM, Ted Dunning <te...@gmail.com> wrote:
>>>>
>>>> Have you guys looked at the Drill ValueVectors?
>>>>
>>>> This structure is being spun out as Apache Arrow with multiple 
>>>> interfaces
>>>> and language bindings.
>>>>
>>>> On Mon, Feb 1, 2016 at 9:56 AM, Jianfeng Jia <ji...@gmail.com>
>>>>
>>>> wrote:
>>>>
>>>>
>>>> Hi,
>>>>
>>>> We plan to implement an append-only array at first. The main reason is
>>>> this is how the auxiliary data structure be used so far. Then the
>>>> implementation is straightforward.
>>>>
>>>> The tree-structured Vector in Scala can save a lot in updating case
>>>>
>>>> mainly
>>>>
>>>> because of their immutable requirement. It can saving unnecessary data
>>>> copies comparing to other immutable list when updating. We only allow
>>>> in-place update. The tree design may be overkill for us.
>>>>
>>>> Xi made on detailed design doc is here:
>>>>
>>>>
>>>>
>>> https://docs.google.com/document/d/1bs3JBCxmvuJZmBq_gKzUDt0FZdM9j3d3MPAmC29ZoSA/edit?usp=sharing 
>>>
>>>>
>>>> Any thoughts or comments?
>>>>
>>>>
>>>> On Jan 18, 2016, at 9:52 AM, Chen Li <ch...@gmail.com> wrote:
>>>>
>>>> @Xi and Jianfeng: after we come up with the design, let's share it 
>>>> with
>>>>
>>>> the
>>>>
>>>> group for an approval before the implementation.
>>>>
>>>> Chen
>>>>
>>>> On Fri, Jan 15, 2016 at 11:48 AM, Mike Carey <dt...@gmail.com>
>>>>
>>>> wrote:
>>>>
>>>>
>>>> The accounting is just as critical as the chunking - we should do both
>>>> (together).
>>>>
>>>>
>>>> On 1/15/16 9:00 AM, Till Westmann wrote:
>>>>
>>>> I don’t have relevant experience on the subject. But I think that it
>>>> sounds good to avoid arbitrarily long chunks of memory. Especially -
>>>>
>>>> as
>>>>
>>>> Jianfeng wrote - it would be good to be able to a) account for this
>>>>
>>>> memory
>>>>
>>>> and b) to manage it.
>>>> An interesting question for me would be what the overhead of such a
>>>> Vector is compared to a simple Java array and as a result where it
>>>>
>>>> should
>>>>
>>>> be used to replace arrays. (The comparison in [3] only compares
>>>>
>>>> different
>>>>
>>>> Scala collections, but doesn’t look at plain arrays.)
>>>>
>>>> Cheers,
>>>> Till
>>>>
>>>> On 14 Jan 2016, at 22:05, Chen Li wrote:
>>>>
>>>> Before we ask Xi to work on this project, it will be good to know if
>>>>
>>>> other people have seen similar problems and agree with this plan.
>>>> @Till: can you share some tips?
>>>>
>>>> Chen
>>>>
>>>> On Wed, Jan 13, 2016 at 4:27 PM, Jianfeng Jia <
>>>>
>>>> jianfeng.jia@gmail.com
>>>>
>>>>
>>>> wrote:
>>>>
>>>> Hi Devs,
>>>>
>>>> First of all, Xi Zhang is a Master student at UCI wants to work
>>>>
>>>> with
>>>>
>>>> us
>>>>
>>>> for a while. Welcome Xi!
>>>>
>>>> We are thinking of making a Frame-based, memory-bound
>>>> SerializableVector at first. We expect this vector can solve some
>>>> occasionally Java.Heap.OutOfMemory exceptions in Hyracks.
>>>> Though we did a good job on organizing the record-located memory,
>>>>
>>>> the
>>>>
>>>> OOM exception can still happen while operating the auxiliary data
>>>> structure. For example in the sort run generator, instead of moving
>>>>
>>>> record
>>>>
>>>> around we are creating an reference “pointer" array to the original
>>>>
>>>> record.
>>>>
>>>> However, if the record is small and the size of that int array will
>>>>
>>>> be
>>>>
>>>> large, then the OOM exception will occur, which is the case of
>>>>
>>>> issue
>>>>
>>>> [1].
>>>>
>>>>
>>>> One way to solve this problem is to put auxiliary data structures
>>>>
>>>> into
>>>>
>>>> the memory-bounded frame as well. In general, it will be much
>>>>
>>>> easier
>>>>
>>>> to ask
>>>>
>>>> for multiple small memory blocks than one big chunk of memory. I
>>>>
>>>> guess that
>>>>
>>>> was the same reason why we have “SerializableHashTable” for
>>>>
>>>> HashJoin. It
>>>>
>>>> will be nice to have a more general structure that can be used by
>>>>
>>>> all the
>>>>
>>>> operators.
>>>>
>>>> The Frame based Vector idea is inspired by the Scala Vector[2]
>>>>
>>>> which
>>>>
>>>> looks like a List, but internally it is implemented as a 32-ary
>>>>
>>>> tree. The
>>>>
>>>> performance of it is very stable for variety size of object[3]. It
>>>>
>>>> will
>>>>
>>>> have all the benefits of ArrayList and the LinkedList. In addition,
>>>>
>>>> we can
>>>>
>>>> take the memory usage of the auxiliary structure into the
>>>>
>>>> calculation. We
>>>>
>>>> will work on the detailed design doc later if we are agree on this
>>>> direction.
>>>>
>>>> Any thoughts or suggestions? Thank you!
>>>>
>>>>
>>>> [1]
>>>>
>>>>
>>>>
>>>>
>>> https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity 
>>>
>>>>
>>>> <
>>>>
>>>>
>>>>
>>>>
>>> https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity 
>>>
>>>>
>>>>
>>>>
>>>> [2] https://bitbucket.org/astrieanna/bitmapped-vector-trie <
>>>> https://bitbucket.org/astrieanna/bitmapped-vector-trie>
>>>> [3]
>>>>
>>>> http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/ 
>>>>
>>>>
>>>> <
>>>>
>>>> http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/ 
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Best,
>>>>
>>>> Jianfeng Jia
>>>> PhD Candidate of Computer Science
>>>> University of California, Irvine
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Best,
>>>>
>>>> Jianfeng Jia
>>>> PhD Candidate of Computer Science
>>>> University of California, Irvine
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Best,
>>>>
>>>> Jianfeng Jia
>>>> PhD Candidate of Computer Science
>>>> University of California, Irvine
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Best,
>>>>
>>>> Jianfeng Jia
>>>> PhD Candidate of Computer Science
>>>> University of California, Irvine
>>>>
>>>>
>>>
>>>
>>> -- 
>>>
>>> -----------------
>>> Best Regards
>>>
>>> Jianfeng Jia
>>> Ph.D. Candidate of Computer Science
>>> University of California, Irvine
>>>

Re: Implement an SerializableVector in Hyracks

Posted by Till Westmann <ti...@apache.org>.

Not really. Just talked with Yingyi as well and we agree with Jianfeng 
that
taking the data structure memory usage into account is probably the most
important improvement here.

Cheers,
Till

On 6 Apr 2016, at 9:21, Chen Li wrote:

> @Till: do you have any thoughts on this subject before we close it?
>
> On Fri, Apr 1, 2016 at 2:08 PM, Jianfeng Jia <ji...@gmail.com> 
> wrote:
>
>> Dear devs,
>>
>> Xi has implemented the SerializableVector(SVector) which allocated 
>> memory
>> increasingly rather than a long java int array upfront.
>> We did some tests but it seems not a good enough answer to solve the 
>> OOM
>> issue caused by long java array allocation.
>>
>> The first experiment was to test if the standalone SVector approach 
>> would
>> get more memory comparing to the long int array. With the heap size 
>> of
>> -Xmx2g on a single machine, the SVector could allocate up to 1280M 
>> data,
>> while the original Java array can allocate up to 1080M data. It can
>> allocate about 18% more space.
>>
>> When we move to the AsterixDB, we replace the int array which stores 
>> the
>> (FrameId, TupleId) pair with the SVector. Then we gradually 
>> increasing the
>> "compiler.sortmemory" to increase the in-memory buffer for external 
>> sort.
>> Both master and SVector succeeded in 1150M case and failed at 1250M 
>> case.
>> Furthermore, the master branch succeeded in 1200M case but the 
>> SVector
>> failed in that case.
>>
>> Given that, we may think that allocate the memory chunk by chunk 
>> doesn't
>> win too much space comparing to allocate a big one directly.
>> I think we need to take the structure memory usage into the spilling 
>> case
>> calculation, but we may not need an extra structure to chop the 
>> memory.
>>
>> Any insights?
>> If you are interested, the code sample Xi wrote is here:
>> https://github.com/xixZ/JVMExperiment/tree/master/src/testing
>>
>>
>> On Tue, Feb 2, 2016 at 11:01 PM, Jianfeng Jia 
>> <ji...@gmail.com>
>> wrote:
>>
>>> Thanks Ted for the elaborate introduction!
>>> I did some reading about it. Based on my understanding, the main
>> advantage
>>> of ValueVector is its column-wise design. In that case, the 
>>> per-record
>>> based metadata, e.g. indexes or hash keys, can be directly added as 
>>> an
>>> column along with the existing record. Thus, the sorting within one 
>>> row
>>> group should be easily addressed. One question still not very clear 
>>> to me
>>> is how to generate the sorted result across several row groups?
>>> If you can get somebody to talk more about it that will be great. 
>>> Thank
>>> you!
>>>
>>> On Feb 1, 2016, at 6:03 PM, Ted Dunning <te...@gmail.com> 
>>> wrote:
>>>
>>> So there are several key points for ValueVectors that I can 
>>> describe, but
>>> for the authoritative voice, others would have to speak.
>>>
>>> The first point is that in Drill at least (and this is not required)
>>> ValueVectors are off-heap.  This helps enormously in managing the
>>> life-cycle since vectors can be associated with queries and when the
>> query
>>> ends, all associated vectors can be deallocated quickly.  That also
>> allows
>>> the memory footprint of Drill to be adjusted both up and down while
>>> running.
>>>
>>> Secondly, ValueVectors are stored column-wise, not record-wise.  
>>> Most
>>> manipulations do not require copies. Projection simply requires 
>>> ignoring
>>> some columns. New columns can be added without disturbing old ones.
>>> Filtering is done using a row selection bitmap. Sorting is often 
>>> done
>> using
>>> an index column.
>>>
>>> The assumption is also that you will have a row group with something
>> like a
>>> hundred thousand rows in it. This means that the size of a single 
>>> group
>>> isn't usually astronomical although very large data structures in a
>> single
>>> row can make the regulation of the size of row groups more 
>>> difficult.
>>>
>>> Of particular interest is the fact that processing code can be 
>>> generated
>> in
>>> Java that avoids almost all object creation so that most SQL-like 
>>> queries
>>> don't require any object cons'ing at all during the row scans. 
>>> Moreover,
>>> the code generated can even be rendered by the JIT into vectorized 
>>> low
>>> level instructions because the access patterns on ValueVectors are 
>>> so
>>> simple.
>>>
>>> Nested data structures are handled using invisible marking columns
>> similar
>>> to the way that nesting is marked in Dremel or Parquet. This means 
>>> that
>> you
>>> get uniformly typed pseudo columns that represent a flattened view 
>>> of the
>>> nested content. Many restructuring operations can be done by simply
>>> re-interpreting the nested data without any copying at all.
>>>
>>> If more detail is desired we should probably get somebody who is 
>>> more
>>> active in the Drill implementation to talk about how this all works 
>>> and
>> how
>>> it will be extracted into Apache Arrow.
>>>
>>> More information can be found here:
>>>
>>> https://drill.apache.org/docs/value-vectors/
>>>
>>>
>>>
>>>
>>> On Mon, Feb 1, 2016 at 5:08 PM, Jianfeng Jia 
>>> <ji...@gmail.com>
>>> wrote:
>>>
>>> If I understand correctly, it seems very similar to the IFrame in 
>>> Hyrack,
>>> which is also a container to store a sequence of record into the
>>> ByteBuffer.
>>>
>>> I’m not clear about how records are manipulated inside the 
>>> ValueVectors.
>>> In Hyracks case, we store the pointer(usually a pair of (frameID,
>>> recordID)) in one java array and manipulate the pointers instead of 
>>> the
>>> original records. We mainly want to break the that one array into
>> multiple
>>> small ByteBuffer-based arrays. By doing so, we can reduce the risk 
>>> of OOM
>>> for a large array and we may also take the memory usage of those 
>>> pointers
>>> into account for the flush decisions.  @Ted, could you share some
>> insights
>>> about how the ValueVectors handles manipulations? e.g. sort, hashing 
>>> …
>> etc.
>>>
>>> On Feb 1, 2016, at 3:16 PM, Ted Dunning <te...@gmail.com> 
>>> wrote:
>>>
>>> Have you guys looked at the Drill ValueVectors?
>>>
>>> This structure is being spun out as Apache Arrow with multiple 
>>> interfaces
>>> and language bindings.
>>>
>>> On Mon, Feb 1, 2016 at 9:56 AM, Jianfeng Jia 
>>> <ji...@gmail.com>
>>>
>>> wrote:
>>>
>>>
>>> Hi,
>>>
>>> We plan to implement an append-only array at first. The main reason 
>>> is
>>> this is how the auxiliary data structure be used so far. Then the
>>> implementation is straightforward.
>>>
>>> The tree-structured Vector in Scala can save a lot in updating case
>>>
>>> mainly
>>>
>>> because of their immutable requirement. It can saving unnecessary 
>>> data
>>> copies comparing to other immutable list when updating. We only 
>>> allow
>>> in-place update. The tree design may be overkill for us.
>>>
>>> Xi made on detailed design doc is here:
>>>
>>>
>>>
>> https://docs.google.com/document/d/1bs3JBCxmvuJZmBq_gKzUDt0FZdM9j3d3MPAmC29ZoSA/edit?usp=sharing
>>>
>>> Any thoughts or comments?
>>>
>>>
>>> On Jan 18, 2016, at 9:52 AM, Chen Li <ch...@gmail.com> wrote:
>>>
>>> @Xi and Jianfeng: after we come up with the design, let's share it 
>>> with
>>>
>>> the
>>>
>>> group for an approval before the implementation.
>>>
>>> Chen
>>>
>>> On Fri, Jan 15, 2016 at 11:48 AM, Mike Carey <dt...@gmail.com>
>>>
>>> wrote:
>>>
>>>
>>> The accounting is just as critical as the chunking - we should do 
>>> both
>>> (together).
>>>
>>>
>>> On 1/15/16 9:00 AM, Till Westmann wrote:
>>>
>>> I don’t have relevant experience on the subject. But I think that 
>>> it
>>> sounds good to avoid arbitrarily long chunks of memory. Especially -
>>>
>>> as
>>>
>>> Jianfeng wrote - it would be good to be able to a) account for this
>>>
>>> memory
>>>
>>> and b) to manage it.
>>> An interesting question for me would be what the overhead of such a
>>> Vector is compared to a simple Java array and as a result where it
>>>
>>> should
>>>
>>> be used to replace arrays. (The comparison in [3] only compares
>>>
>>> different
>>>
>>> Scala collections, but doesn’t look at plain arrays.)
>>>
>>> Cheers,
>>> Till
>>>
>>> On 14 Jan 2016, at 22:05, Chen Li wrote:
>>>
>>> Before we ask Xi to work on this project, it will be good to know if
>>>
>>> other people have seen similar problems and agree with this plan.
>>> @Till: can you share some tips?
>>>
>>> Chen
>>>
>>> On Wed, Jan 13, 2016 at 4:27 PM, Jianfeng Jia <
>>>
>>> jianfeng.jia@gmail.com
>>>
>>>
>>> wrote:
>>>
>>> Hi Devs,
>>>
>>> First of all, Xi Zhang is a Master student at UCI wants to work
>>>
>>> with
>>>
>>> us
>>>
>>> for a while. Welcome Xi!
>>>
>>> We are thinking of making a Frame-based, memory-bound
>>> SerializableVector at first. We expect this vector can solve some
>>> occasionally Java.Heap.OutOfMemory exceptions in Hyracks.
>>> Though we did a good job on organizing the record-located memory,
>>>
>>> the
>>>
>>> OOM exception can still happen while operating the auxiliary data
>>> structure. For example in the sort run generator, instead of moving
>>>
>>> record
>>>
>>> around we are creating an reference “pointer" array to the 
>>> original
>>>
>>> record.
>>>
>>> However, if the record is small and the size of that int array will
>>>
>>> be
>>>
>>> large, then the OOM exception will occur, which is the case of
>>>
>>> issue
>>>
>>> [1].
>>>
>>>
>>> One way to solve this problem is to put auxiliary data structures
>>>
>>> into
>>>
>>> the memory-bounded frame as well. In general, it will be much
>>>
>>> easier
>>>
>>> to ask
>>>
>>> for multiple small memory blocks than one big chunk of memory. I
>>>
>>> guess that
>>>
>>> was the same reason why we have “SerializableHashTable” for
>>>
>>> HashJoin. It
>>>
>>> will be nice to have a more general structure that can be used by
>>>
>>> all the
>>>
>>> operators.
>>>
>>> The Frame based Vector idea is inspired by the Scala Vector[2]
>>>
>>> which
>>>
>>> looks like a List, but internally it is implemented as a 32-ary
>>>
>>> tree. The
>>>
>>> performance of it is very stable for variety size of object[3]. It
>>>
>>> will
>>>
>>> have all the benefits of ArrayList and the LinkedList. In addition,
>>>
>>> we can
>>>
>>> take the memory usage of the auxiliary structure into the
>>>
>>> calculation. We
>>>
>>> will work on the detailed design doc later if we are agree on this
>>> direction.
>>>
>>> Any thoughts or suggestions? Thank you!
>>>
>>>
>>> [1]
>>>
>>>
>>>
>>>
>> https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity
>>>
>>> <
>>>
>>>
>>>
>>>
>> https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity
>>>
>>>
>>>
>>> [2] https://bitbucket.org/astrieanna/bitmapped-vector-trie <
>>> https://bitbucket.org/astrieanna/bitmapped-vector-trie>
>>> [3]
>>>
>>> http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/
>>>
>>> <
>>>
>>> http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/
>>>
>>>
>>>
>>>
>>> Best,
>>>
>>> Jianfeng Jia
>>> PhD Candidate of Computer Science
>>> University of California, Irvine
>>>
>>>
>>>
>>>
>>>
>>>
>>> Best,
>>>
>>> Jianfeng Jia
>>> PhD Candidate of Computer Science
>>> University of California, Irvine
>>>
>>>
>>>
>>>
>>>
>>> Best,
>>>
>>> Jianfeng Jia
>>> PhD Candidate of Computer Science
>>> University of California, Irvine
>>>
>>>
>>>
>>>
>>>
>>> Best,
>>>
>>> Jianfeng Jia
>>> PhD Candidate of Computer Science
>>> University of California, Irvine
>>>
>>>
>>
>>
>> --
>>
>> -----------------
>> Best Regards
>>
>> Jianfeng Jia
>> Ph.D. Candidate of Computer Science
>> University of California, Irvine
>>

Re: Implement an SerializableVector in Hyracks

Posted by Chen Li <ch...@gmail.com>.

@Till: do you have any thoughts on this subject before we close it?

On Fri, Apr 1, 2016 at 2:08 PM, Jianfeng Jia <ji...@gmail.com> wrote:

> Dear devs,
>
> Xi has implemented the SerializableVector(SVector) which allocated memory
> increasingly rather than a long java int array upfront.
> We did some tests but it seems not a good enough answer to solve the OOM
> issue caused by long java array allocation.
>
> The first experiment was to test if the standalone SVector approach would
> get more memory comparing to the long int array. With the heap size of
> -Xmx2g on a single machine, the SVector could allocate up to 1280M data,
> while the original Java array can allocate up to 1080M data. It can
> allocate about 18% more space.
>
> When we move to the AsterixDB, we replace the int array which stores the
> (FrameId, TupleId) pair with the SVector. Then we gradually increasing the
> "compiler.sortmemory" to increase the in-memory buffer for external sort.
> Both master and SVector succeeded in 1150M case and failed at 1250M case.
> Furthermore, the master branch succeeded in 1200M case but the SVector
> failed in that case.
>
> Given that, we may think that allocate the memory chunk by chunk doesn't
> win too much space comparing to allocate a big one directly.
> I think we need to take the structure memory usage into the spilling case
> calculation, but we may not need an extra structure to chop the memory.
>
> Any insights?
> If you are interested, the code sample Xi wrote is here:
> https://github.com/xixZ/JVMExperiment/tree/master/src/testing
>
>
> On Tue, Feb 2, 2016 at 11:01 PM, Jianfeng Jia <ji...@gmail.com>
> wrote:
>
> > Thanks Ted for the elaborate introduction!
> > I did some reading about it. Based on my understanding, the main
> advantage
> > of ValueVector is its column-wise design. In that case, the per-record
> > based metadata, e.g. indexes or hash keys, can be directly added as an
> > column along with the existing record. Thus, the sorting within one row
> > group should be easily addressed. One question still not very clear to me
> > is how to generate the sorted result across several row groups?
> > If you can get somebody to talk more about it that will be great. Thank
> > you!
> >
> > On Feb 1, 2016, at 6:03 PM, Ted Dunning <te...@gmail.com> wrote:
> >
> > So there are several key points for ValueVectors that I can describe, but
> > for the authoritative voice, others would have to speak.
> >
> > The first point is that in Drill at least (and this is not required)
> > ValueVectors are off-heap.  This helps enormously in managing the
> > life-cycle since vectors can be associated with queries and when the
> query
> > ends, all associated vectors can be deallocated quickly.  That also
> allows
> > the memory footprint of Drill to be adjusted both up and down while
> > running.
> >
> > Secondly, ValueVectors are stored column-wise, not record-wise.  Most
> > manipulations do not require copies. Projection simply requires ignoring
> > some columns. New columns can be added without disturbing old ones.
> > Filtering is done using a row selection bitmap. Sorting is often done
> using
> > an index column.
> >
> > The assumption is also that you will have a row group with something
> like a
> > hundred thousand rows in it. This means that the size of a single group
> > isn't usually astronomical although very large data structures in a
> single
> > row can make the regulation of the size of row groups more difficult.
> >
> > Of particular interest is the fact that processing code can be generated
> in
> > Java that avoids almost all object creation so that most SQL-like queries
> > don't require any object cons'ing at all during the row scans. Moreover,
> > the code generated can even be rendered by the JIT into vectorized low
> > level instructions because the access patterns on ValueVectors are so
> > simple.
> >
> > Nested data structures are handled using invisible marking columns
> similar
> > to the way that nesting is marked in Dremel or Parquet. This means that
> you
> > get uniformly typed pseudo columns that represent a flattened view of the
> > nested content. Many restructuring operations can be done by simply
> > re-interpreting the nested data without any copying at all.
> >
> > If more detail is desired we should probably get somebody who is more
> > active in the Drill implementation to talk about how this all works and
> how
> > it will be extracted into Apache Arrow.
> >
> > More information can be found here:
> >
> > https://drill.apache.org/docs/value-vectors/
> >
> >
> >
> >
> > On Mon, Feb 1, 2016 at 5:08 PM, Jianfeng Jia <ji...@gmail.com>
> > wrote:
> >
> > If I understand correctly, it seems very similar to the IFrame in Hyrack,
> > which is also a container to store a sequence of record into the
> > ByteBuffer.
> >
> > I’m not clear about how records are manipulated inside the ValueVectors.
> > In Hyracks case, we store the pointer(usually a pair of (frameID,
> > recordID)) in one java array and manipulate the pointers instead of the
> > original records. We mainly want to break the that one array into
> multiple
> > small ByteBuffer-based arrays. By doing so, we can reduce the risk of OOM
> > for a large array and we may also take the memory usage of those pointers
> > into account for the flush decisions.  @Ted, could you share some
> insights
> > about how the ValueVectors handles manipulations? e.g. sort, hashing …
> etc.
> >
> > On Feb 1, 2016, at 3:16 PM, Ted Dunning <te...@gmail.com> wrote:
> >
> > Have you guys looked at the Drill ValueVectors?
> >
> > This structure is being spun out as Apache Arrow with multiple interfaces
> > and language bindings.
> >
> > On Mon, Feb 1, 2016 at 9:56 AM, Jianfeng Jia <ji...@gmail.com>
> >
> > wrote:
> >
> >
> > Hi,
> >
> > We plan to implement an append-only array at first. The main reason is
> > this is how the auxiliary data structure be used so far. Then the
> > implementation is straightforward.
> >
> > The tree-structured Vector in Scala can save a lot in updating case
> >
> > mainly
> >
> > because of their immutable requirement. It can saving unnecessary data
> > copies comparing to other immutable list when updating. We only allow
> > in-place update. The tree design may be overkill for us.
> >
> > Xi made on detailed design doc is here:
> >
> >
> >
> https://docs.google.com/document/d/1bs3JBCxmvuJZmBq_gKzUDt0FZdM9j3d3MPAmC29ZoSA/edit?usp=sharing
> >
> > Any thoughts or comments?
> >
> >
> > On Jan 18, 2016, at 9:52 AM, Chen Li <ch...@gmail.com> wrote:
> >
> > @Xi and Jianfeng: after we come up with the design, let's share it with
> >
> > the
> >
> > group for an approval before the implementation.
> >
> > Chen
> >
> > On Fri, Jan 15, 2016 at 11:48 AM, Mike Carey <dt...@gmail.com>
> >
> > wrote:
> >
> >
> > The accounting is just as critical as the chunking - we should do both
> > (together).
> >
> >
> > On 1/15/16 9:00 AM, Till Westmann wrote:
> >
> > I don’t have relevant experience on the subject. But I think that it
> > sounds good to avoid arbitrarily long chunks of memory. Especially -
> >
> > as
> >
> > Jianfeng wrote - it would be good to be able to a) account for this
> >
> > memory
> >
> > and b) to manage it.
> > An interesting question for me would be what the overhead of such a
> > Vector is compared to a simple Java array and as a result where it
> >
> > should
> >
> > be used to replace arrays. (The comparison in [3] only compares
> >
> > different
> >
> > Scala collections, but doesn’t look at plain arrays.)
> >
> > Cheers,
> > Till
> >
> > On 14 Jan 2016, at 22:05, Chen Li wrote:
> >
> > Before we ask Xi to work on this project, it will be good to know if
> >
> > other people have seen similar problems and agree with this plan.
> > @Till: can you share some tips?
> >
> > Chen
> >
> > On Wed, Jan 13, 2016 at 4:27 PM, Jianfeng Jia <
> >
> > jianfeng.jia@gmail.com
> >
> >
> > wrote:
> >
> > Hi Devs,
> >
> > First of all, Xi Zhang is a Master student at UCI wants to work
> >
> > with
> >
> > us
> >
> > for a while. Welcome Xi!
> >
> > We are thinking of making a Frame-based, memory-bound
> > SerializableVector at first. We expect this vector can solve some
> > occasionally Java.Heap.OutOfMemory exceptions in Hyracks.
> > Though we did a good job on organizing the record-located memory,
> >
> > the
> >
> > OOM exception can still happen while operating the auxiliary data
> > structure. For example in the sort run generator, instead of moving
> >
> > record
> >
> > around we are creating an reference “pointer" array to the original
> >
> > record.
> >
> > However, if the record is small and the size of that int array will
> >
> > be
> >
> > large, then the OOM exception will occur, which is the case of
> >
> > issue
> >
> > [1].
> >
> >
> > One way to solve this problem is to put auxiliary data structures
> >
> > into
> >
> > the memory-bounded frame as well. In general, it will be much
> >
> > easier
> >
> > to ask
> >
> > for multiple small memory blocks than one big chunk of memory. I
> >
> > guess that
> >
> > was the same reason why we have “SerializableHashTable” for
> >
> > HashJoin. It
> >
> > will be nice to have a more general structure that can be used by
> >
> > all the
> >
> > operators.
> >
> > The Frame based Vector idea is inspired by the Scala Vector[2]
> >
> > which
> >
> > looks like a List, but internally it is implemented as a 32-ary
> >
> > tree. The
> >
> > performance of it is very stable for variety size of object[3]. It
> >
> > will
> >
> > have all the benefits of ArrayList and the LinkedList. In addition,
> >
> > we can
> >
> > take the memory usage of the auxiliary structure into the
> >
> > calculation. We
> >
> > will work on the detailed design doc later if we are agree on this
> > direction.
> >
> > Any thoughts or suggestions? Thank you!
> >
> >
> > [1]
> >
> >
> >
> >
> https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity
> >
> > <
> >
> >
> >
> >
> https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity
> >
> >
> >
> > [2] https://bitbucket.org/astrieanna/bitmapped-vector-trie <
> > https://bitbucket.org/astrieanna/bitmapped-vector-trie>
> > [3]
> >
> > http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/
> >
> > <
> >
> > http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/
> >
> >
> >
> >
> > Best,
> >
> > Jianfeng Jia
> > PhD Candidate of Computer Science
> > University of California, Irvine
> >
> >
> >
> >
> >
> >
> > Best,
> >
> > Jianfeng Jia
> > PhD Candidate of Computer Science
> > University of California, Irvine
> >
> >
> >
> >
> >
> > Best,
> >
> > Jianfeng Jia
> > PhD Candidate of Computer Science
> > University of California, Irvine
> >
> >
> >
> >
> >
> > Best,
> >
> > Jianfeng Jia
> > PhD Candidate of Computer Science
> > University of California, Irvine
> >
> >
>
>
> --
>
> -----------------
> Best Regards
>
> Jianfeng Jia
> Ph.D. Candidate of Computer Science
> University of California, Irvine
>

Re: Implement an SerializableVector in Hyracks

Posted by Jianfeng Jia <ji...@gmail.com>.

Dear devs,

Xi has implemented the SerializableVector(SVector) which allocated memory
increasingly rather than a long java int array upfront.
We did some tests but it seems not a good enough answer to solve the OOM
issue caused by long java array allocation.

The first experiment was to test if the standalone SVector approach would
get more memory comparing to the long int array. With the heap size of
-Xmx2g on a single machine, the SVector could allocate up to 1280M data,
while the original Java array can allocate up to 1080M data. It can
allocate about 18% more space.

When we move to the AsterixDB, we replace the int array which stores the
(FrameId, TupleId) pair with the SVector. Then we gradually increasing the
"compiler.sortmemory" to increase the in-memory buffer for external sort.
Both master and SVector succeeded in 1150M case and failed at 1250M case.
Furthermore, the master branch succeeded in 1200M case but the SVector
failed in that case.

Given that, we may think that allocate the memory chunk by chunk doesn't
win too much space comparing to allocate a big one directly.
I think we need to take the structure memory usage into the spilling case
calculation, but we may not need an extra structure to chop the memory.

Any insights?
If you are interested, the code sample Xi wrote is here:
https://github.com/xixZ/JVMExperiment/tree/master/src/testing


On Tue, Feb 2, 2016 at 11:01 PM, Jianfeng Jia <ji...@gmail.com>
wrote:

> Thanks Ted for the elaborate introduction!
> I did some reading about it. Based on my understanding, the main advantage
> of ValueVector is its column-wise design. In that case, the per-record
> based metadata, e.g. indexes or hash keys, can be directly added as an
> column along with the existing record. Thus, the sorting within one row
> group should be easily addressed. One question still not very clear to me
> is how to generate the sorted result across several row groups?
> If you can get somebody to talk more about it that will be great. Thank
> you!
>
> On Feb 1, 2016, at 6:03 PM, Ted Dunning <te...@gmail.com> wrote:
>
> So there are several key points for ValueVectors that I can describe, but
> for the authoritative voice, others would have to speak.
>
> The first point is that in Drill at least (and this is not required)
> ValueVectors are off-heap.  This helps enormously in managing the
> life-cycle since vectors can be associated with queries and when the query
> ends, all associated vectors can be deallocated quickly.  That also allows
> the memory footprint of Drill to be adjusted both up and down while
> running.
>
> Secondly, ValueVectors are stored column-wise, not record-wise.  Most
> manipulations do not require copies. Projection simply requires ignoring
> some columns. New columns can be added without disturbing old ones.
> Filtering is done using a row selection bitmap. Sorting is often done using
> an index column.
>
> The assumption is also that you will have a row group with something like a
> hundred thousand rows in it. This means that the size of a single group
> isn't usually astronomical although very large data structures in a single
> row can make the regulation of the size of row groups more difficult.
>
> Of particular interest is the fact that processing code can be generated in
> Java that avoids almost all object creation so that most SQL-like queries
> don't require any object cons'ing at all during the row scans. Moreover,
> the code generated can even be rendered by the JIT into vectorized low
> level instructions because the access patterns on ValueVectors are so
> simple.
>
> Nested data structures are handled using invisible marking columns similar
> to the way that nesting is marked in Dremel or Parquet. This means that you
> get uniformly typed pseudo columns that represent a flattened view of the
> nested content. Many restructuring operations can be done by simply
> re-interpreting the nested data without any copying at all.
>
> If more detail is desired we should probably get somebody who is more
> active in the Drill implementation to talk about how this all works and how
> it will be extracted into Apache Arrow.
>
> More information can be found here:
>
> https://drill.apache.org/docs/value-vectors/
>
>
>
>
> On Mon, Feb 1, 2016 at 5:08 PM, Jianfeng Jia <ji...@gmail.com>
> wrote:
>
> If I understand correctly, it seems very similar to the IFrame in Hyrack,
> which is also a container to store a sequence of record into the
> ByteBuffer.
>
> I’m not clear about how records are manipulated inside the ValueVectors.
> In Hyracks case, we store the pointer(usually a pair of (frameID,
> recordID)) in one java array and manipulate the pointers instead of the
> original records. We mainly want to break the that one array into multiple
> small ByteBuffer-based arrays. By doing so, we can reduce the risk of OOM
> for a large array and we may also take the memory usage of those pointers
> into account for the flush decisions.  @Ted, could you share some insights
> about how the ValueVectors handles manipulations? e.g. sort, hashing … etc.
>
> On Feb 1, 2016, at 3:16 PM, Ted Dunning <te...@gmail.com> wrote:
>
> Have you guys looked at the Drill ValueVectors?
>
> This structure is being spun out as Apache Arrow with multiple interfaces
> and language bindings.
>
> On Mon, Feb 1, 2016 at 9:56 AM, Jianfeng Jia <ji...@gmail.com>
>
> wrote:
>
>
> Hi,
>
> We plan to implement an append-only array at first. The main reason is
> this is how the auxiliary data structure be used so far. Then the
> implementation is straightforward.
>
> The tree-structured Vector in Scala can save a lot in updating case
>
> mainly
>
> because of their immutable requirement. It can saving unnecessary data
> copies comparing to other immutable list when updating. We only allow
> in-place update. The tree design may be overkill for us.
>
> Xi made on detailed design doc is here:
>
>
> https://docs.google.com/document/d/1bs3JBCxmvuJZmBq_gKzUDt0FZdM9j3d3MPAmC29ZoSA/edit?usp=sharing
>
> Any thoughts or comments?
>
>
> On Jan 18, 2016, at 9:52 AM, Chen Li <ch...@gmail.com> wrote:
>
> @Xi and Jianfeng: after we come up with the design, let's share it with
>
> the
>
> group for an approval before the implementation.
>
> Chen
>
> On Fri, Jan 15, 2016 at 11:48 AM, Mike Carey <dt...@gmail.com>
>
> wrote:
>
>
> The accounting is just as critical as the chunking - we should do both
> (together).
>
>
> On 1/15/16 9:00 AM, Till Westmann wrote:
>
> I don’t have relevant experience on the subject. But I think that it
> sounds good to avoid arbitrarily long chunks of memory. Especially -
>
> as
>
> Jianfeng wrote - it would be good to be able to a) account for this
>
> memory
>
> and b) to manage it.
> An interesting question for me would be what the overhead of such a
> Vector is compared to a simple Java array and as a result where it
>
> should
>
> be used to replace arrays. (The comparison in [3] only compares
>
> different
>
> Scala collections, but doesn’t look at plain arrays.)
>
> Cheers,
> Till
>
> On 14 Jan 2016, at 22:05, Chen Li wrote:
>
> Before we ask Xi to work on this project, it will be good to know if
>
> other people have seen similar problems and agree with this plan.
> @Till: can you share some tips?
>
> Chen
>
> On Wed, Jan 13, 2016 at 4:27 PM, Jianfeng Jia <
>
> jianfeng.jia@gmail.com
>
>
> wrote:
>
> Hi Devs,
>
> First of all, Xi Zhang is a Master student at UCI wants to work
>
> with
>
> us
>
> for a while. Welcome Xi!
>
> We are thinking of making a Frame-based, memory-bound
> SerializableVector at first. We expect this vector can solve some
> occasionally Java.Heap.OutOfMemory exceptions in Hyracks.
> Though we did a good job on organizing the record-located memory,
>
> the
>
> OOM exception can still happen while operating the auxiliary data
> structure. For example in the sort run generator, instead of moving
>
> record
>
> around we are creating an reference “pointer" array to the original
>
> record.
>
> However, if the record is small and the size of that int array will
>
> be
>
> large, then the OOM exception will occur, which is the case of
>
> issue
>
> [1].
>
>
> One way to solve this problem is to put auxiliary data structures
>
> into
>
> the memory-bounded frame as well. In general, it will be much
>
> easier
>
> to ask
>
> for multiple small memory blocks than one big chunk of memory. I
>
> guess that
>
> was the same reason why we have “SerializableHashTable” for
>
> HashJoin. It
>
> will be nice to have a more general structure that can be used by
>
> all the
>
> operators.
>
> The Frame based Vector idea is inspired by the Scala Vector[2]
>
> which
>
> looks like a List, but internally it is implemented as a 32-ary
>
> tree. The
>
> performance of it is very stable for variety size of object[3]. It
>
> will
>
> have all the benefits of ArrayList and the LinkedList. In addition,
>
> we can
>
> take the memory usage of the auxiliary structure into the
>
> calculation. We
>
> will work on the detailed design doc later if we are agree on this
> direction.
>
> Any thoughts or suggestions? Thank you!
>
>
> [1]
>
>
>
> https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity
>
> <
>
>
>
> https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity
>
>
>
> [2] https://bitbucket.org/astrieanna/bitmapped-vector-trie <
> https://bitbucket.org/astrieanna/bitmapped-vector-trie>
> [3]
>
> http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/
>
> <
>
> http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/
>
>
>
>
> Best,
>
> Jianfeng Jia
> PhD Candidate of Computer Science
> University of California, Irvine
>
>
>
>
>
>
> Best,
>
> Jianfeng Jia
> PhD Candidate of Computer Science
> University of California, Irvine
>
>
>
>
>
> Best,
>
> Jianfeng Jia
> PhD Candidate of Computer Science
> University of California, Irvine
>
>
>
>
>
> Best,
>
> Jianfeng Jia
> PhD Candidate of Computer Science
> University of California, Irvine
>
>


-- 

-----------------
Best Regards

Jianfeng Jia
Ph.D. Candidate of Computer Science
University of California, Irvine

Re: Implement an SerializableVector in Hyracks

Posted by Jianfeng Jia <ji...@gmail.com>.

Thanks Ted for the elaborate introduction! 
I did some reading about it. Based on my understanding, the main advantage of ValueVector is its column-wise design. In that case, the per-record based metadata, e.g. indexes or hash keys, can be directly added as an column along with the existing record. Thus, the sorting within one row group should be easily addressed. One question still not very clear to me is how to generate the sorted result across several row groups? 
If you can get somebody to talk more about it that will be great. Thank you!

> On Feb 1, 2016, at 6:03 PM, Ted Dunning <te...@gmail.com> wrote:
> 
> So there are several key points for ValueVectors that I can describe, but
> for the authoritative voice, others would have to speak.
> 
> The first point is that in Drill at least (and this is not required)
> ValueVectors are off-heap.  This helps enormously in managing the
> life-cycle since vectors can be associated with queries and when the query
> ends, all associated vectors can be deallocated quickly.  That also allows
> the memory footprint of Drill to be adjusted both up and down while running.
> 
> Secondly, ValueVectors are stored column-wise, not record-wise.  Most
> manipulations do not require copies. Projection simply requires ignoring
> some columns. New columns can be added without disturbing old ones.
> Filtering is done using a row selection bitmap. Sorting is often done using
> an index column.
> 
> The assumption is also that you will have a row group with something like a
> hundred thousand rows in it. This means that the size of a single group
> isn't usually astronomical although very large data structures in a single
> row can make the regulation of the size of row groups more difficult.
> 
> Of particular interest is the fact that processing code can be generated in
> Java that avoids almost all object creation so that most SQL-like queries
> don't require any object cons'ing at all during the row scans. Moreover,
> the code generated can even be rendered by the JIT into vectorized low
> level instructions because the access patterns on ValueVectors are so
> simple.
> 
> Nested data structures are handled using invisible marking columns similar
> to the way that nesting is marked in Dremel or Parquet. This means that you
> get uniformly typed pseudo columns that represent a flattened view of the
> nested content. Many restructuring operations can be done by simply
> re-interpreting the nested data without any copying at all.
> 
> If more detail is desired we should probably get somebody who is more
> active in the Drill implementation to talk about how this all works and how
> it will be extracted into Apache Arrow.
> 
> More information can be found here:
> 
> https://drill.apache.org/docs/value-vectors/
> 
> 
> 
> 
> On Mon, Feb 1, 2016 at 5:08 PM, Jianfeng Jia <ji...@gmail.com> wrote:
> 
>> If I understand correctly, it seems very similar to the IFrame in Hyrack,
>> which is also a container to store a sequence of record into the ByteBuffer.
>> 
>> I’m not clear about how records are manipulated inside the ValueVectors.
>> In Hyracks case, we store the pointer(usually a pair of (frameID,
>> recordID)) in one java array and manipulate the pointers instead of the
>> original records. We mainly want to break the that one array into multiple
>> small ByteBuffer-based arrays. By doing so, we can reduce the risk of OOM
>> for a large array and we may also take the memory usage of those pointers
>> into account for the flush decisions.  @Ted, could you share some insights
>> about how the ValueVectors handles manipulations? e.g. sort, hashing … etc.
>> 
>>> On Feb 1, 2016, at 3:16 PM, Ted Dunning <te...@gmail.com> wrote:
>>> 
>>> Have you guys looked at the Drill ValueVectors?
>>> 
>>> This structure is being spun out as Apache Arrow with multiple interfaces
>>> and language bindings.
>>> 
>>> On Mon, Feb 1, 2016 at 9:56 AM, Jianfeng Jia <ji...@gmail.com>
>> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> We plan to implement an append-only array at first. The main reason is
>>>> this is how the auxiliary data structure be used so far. Then the
>>>> implementation is straightforward.
>>>> 
>>>> The tree-structured Vector in Scala can save a lot in updating case
>> mainly
>>>> because of their immutable requirement. It can saving unnecessary data
>>>> copies comparing to other immutable list when updating. We only allow
>>>> in-place update. The tree design may be overkill for us.
>>>> 
>>>> Xi made on detailed design doc is here:
>>>> 
>> https://docs.google.com/document/d/1bs3JBCxmvuJZmBq_gKzUDt0FZdM9j3d3MPAmC29ZoSA/edit?usp=sharing
>>>> Any thoughts or comments?
>>>> 
>>>> 
>>>>> On Jan 18, 2016, at 9:52 AM, Chen Li <ch...@gmail.com> wrote:
>>>>> 
>>>>> @Xi and Jianfeng: after we come up with the design, let's share it with
>>>> the
>>>>> group for an approval before the implementation.
>>>>> 
>>>>> Chen
>>>>> 
>>>>> On Fri, Jan 15, 2016 at 11:48 AM, Mike Carey <dt...@gmail.com>
>> wrote:
>>>>> 
>>>>>> The accounting is just as critical as the chunking - we should do both
>>>>>> (together).
>>>>>> 
>>>>>> 
>>>>>> On 1/15/16 9:00 AM, Till Westmann wrote:
>>>>>> 
>>>>>>> I don’t have relevant experience on the subject. But I think that it
>>>>>>> sounds good to avoid arbitrarily long chunks of memory. Especially -
>> as
>>>>>>> Jianfeng wrote - it would be good to be able to a) account for this
>>>> memory
>>>>>>> and b) to manage it.
>>>>>>> An interesting question for me would be what the overhead of such a
>>>>>>> Vector is compared to a simple Java array and as a result where it
>>>> should
>>>>>>> be used to replace arrays. (The comparison in [3] only compares
>>>> different
>>>>>>> Scala collections, but doesn’t look at plain arrays.)
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> Till
>>>>>>> 
>>>>>>> On 14 Jan 2016, at 22:05, Chen Li wrote:
>>>>>>> 
>>>>>>> Before we ask Xi to work on this project, it will be good to know if
>>>>>>>> other people have seen similar problems and agree with this plan.
>>>>>>>> @Till: can you share some tips?
>>>>>>>> 
>>>>>>>> Chen
>>>>>>>> 
>>>>>>>> On Wed, Jan 13, 2016 at 4:27 PM, Jianfeng Jia <
>> jianfeng.jia@gmail.com
>>>>> 
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Hi Devs,
>>>>>>>>> 
>>>>>>>>> First of all, Xi Zhang is a Master student at UCI wants to work
>> with
>>>> us
>>>>>>>>> for a while. Welcome Xi!
>>>>>>>>> 
>>>>>>>>> We are thinking of making a Frame-based, memory-bound
>>>>>>>>> SerializableVector at first. We expect this vector can solve some
>>>>>>>>> occasionally Java.Heap.OutOfMemory exceptions in Hyracks.
>>>>>>>>> Though we did a good job on organizing the record-located memory,
>> the
>>>>>>>>> OOM exception can still happen while operating the auxiliary data
>>>>>>>>> structure. For example in the sort run generator, instead of moving
>>>> record
>>>>>>>>> around we are creating an reference “pointer" array to the original
>>>> record.
>>>>>>>>> However, if the record is small and the size of that int array will
>>>> be
>>>>>>>>> large, then the OOM exception will occur, which is the case of
>> issue
>>>> [1].
>>>>>>>>> 
>>>>>>>>> One way to solve this problem is to put auxiliary data structures
>>>> into
>>>>>>>>> the memory-bounded frame as well. In general, it will be much
>> easier
>>>> to ask
>>>>>>>>> for multiple small memory blocks than one big chunk of memory. I
>>>> guess that
>>>>>>>>> was the same reason why we have “SerializableHashTable” for
>>>> HashJoin. It
>>>>>>>>> will be nice to have a more general structure that can be used by
>>>> all the
>>>>>>>>> operators.
>>>>>>>>> 
>>>>>>>>> The Frame based Vector idea is inspired by the Scala Vector[2]
>> which
>>>>>>>>> looks like a List, but internally it is implemented as a 32-ary
>>>> tree. The
>>>>>>>>> performance of it is very stable for variety size of object[3]. It
>>>> will
>>>>>>>>> have all the benefits of ArrayList and the LinkedList. In addition,
>>>> we can
>>>>>>>>> take the memory usage of the auxiliary structure into the
>>>> calculation. We
>>>>>>>>> will work on the detailed design doc later if we are agree on this
>>>>>>>>> direction.
>>>>>>>>> 
>>>>>>>>> Any thoughts or suggestions? Thank you!
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> [1]
>>>>>>>>> 
>>>> 
>> https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity
>>>>>>>>> <
>>>>>>>>> 
>>>> 
>> https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity
>>>>> 
>>>>>>>>> 
>>>>>>>>> [2] https://bitbucket.org/astrieanna/bitmapped-vector-trie <
>>>>>>>>> https://bitbucket.org/astrieanna/bitmapped-vector-trie>
>>>>>>>>> [3]
>>>>>>>>> 
>>>> http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/
>>>>>>>>> <
>>>>>>>>> 
>>>> http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/
>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Best,
>>>>>>>>> 
>>>>>>>>> Jianfeng Jia
>>>>>>>>> PhD Candidate of Computer Science
>>>>>>>>> University of California, Irvine
>>>>>>>>> 
>>>>>>>>> 
>>>>>> 
>>>> 
>>>> 
>>>> 
>>>> Best,
>>>> 
>>>> Jianfeng Jia
>>>> PhD Candidate of Computer Science
>>>> University of California, Irvine
>>>> 
>>>> 
>> 
>> 
>> 
>> Best,
>> 
>> Jianfeng Jia
>> PhD Candidate of Computer Science
>> University of California, Irvine
>> 
>> 



Best,

Jianfeng Jia
PhD Candidate of Computer Science
University of California, Irvine

Re: Implement an SerializableVector in Hyracks

Posted by Ted Dunning <te...@gmail.com>.

So there are several key points for ValueVectors that I can describe, but
for the authoritative voice, others would have to speak.

The first point is that in Drill at least (and this is not required)
ValueVectors are off-heap.  This helps enormously in managing the
life-cycle since vectors can be associated with queries and when the query
ends, all associated vectors can be deallocated quickly.  That also allows
the memory footprint of Drill to be adjusted both up and down while running.

Secondly, ValueVectors are stored column-wise, not record-wise.  Most
manipulations do not require copies. Projection simply requires ignoring
some columns. New columns can be added without disturbing old ones.
Filtering is done using a row selection bitmap. Sorting is often done using
an index column.

The assumption is also that you will have a row group with something like a
hundred thousand rows in it. This means that the size of a single group
isn't usually astronomical although very large data structures in a single
row can make the regulation of the size of row groups more difficult.

Of particular interest is the fact that processing code can be generated in
Java that avoids almost all object creation so that most SQL-like queries
don't require any object cons'ing at all during the row scans. Moreover,
the code generated can even be rendered by the JIT into vectorized low
level instructions because the access patterns on ValueVectors are so
simple.

Nested data structures are handled using invisible marking columns similar
to the way that nesting is marked in Dremel or Parquet. This means that you
get uniformly typed pseudo columns that represent a flattened view of the
nested content. Many restructuring operations can be done by simply
re-interpreting the nested data without any copying at all.

If more detail is desired we should probably get somebody who is more
active in the Drill implementation to talk about how this all works and how
it will be extracted into Apache Arrow.

More information can be found here:

https://drill.apache.org/docs/value-vectors/




On Mon, Feb 1, 2016 at 5:08 PM, Jianfeng Jia <ji...@gmail.com> wrote:

> If I understand correctly, it seems very similar to the IFrame in Hyrack,
> which is also a container to store a sequence of record into the ByteBuffer.
>
> I’m not clear about how records are manipulated inside the ValueVectors.
> In Hyracks case, we store the pointer(usually a pair of (frameID,
> recordID)) in one java array and manipulate the pointers instead of the
> original records. We mainly want to break the that one array into multiple
> small ByteBuffer-based arrays. By doing so, we can reduce the risk of OOM
> for a large array and we may also take the memory usage of those pointers
> into account for the flush decisions.  @Ted, could you share some insights
> about how the ValueVectors handles manipulations? e.g. sort, hashing … etc.
>
> > On Feb 1, 2016, at 3:16 PM, Ted Dunning <te...@gmail.com> wrote:
> >
> > Have you guys looked at the Drill ValueVectors?
> >
> > This structure is being spun out as Apache Arrow with multiple interfaces
> > and language bindings.
> >
> > On Mon, Feb 1, 2016 at 9:56 AM, Jianfeng Jia <ji...@gmail.com>
> wrote:
> >
> >> Hi,
> >>
> >> We plan to implement an append-only array at first. The main reason is
> >> this is how the auxiliary data structure be used so far. Then the
> >> implementation is straightforward.
> >>
> >> The tree-structured Vector in Scala can save a lot in updating case
> mainly
> >> because of their immutable requirement. It can saving unnecessary data
> >> copies comparing to other immutable list when updating. We only allow
> >> in-place update. The tree design may be overkill for us.
> >>
> >> Xi made on detailed design doc is here:
> >>
> https://docs.google.com/document/d/1bs3JBCxmvuJZmBq_gKzUDt0FZdM9j3d3MPAmC29ZoSA/edit?usp=sharing
> >> Any thoughts or comments?
> >>
> >>
> >>> On Jan 18, 2016, at 9:52 AM, Chen Li <ch...@gmail.com> wrote:
> >>>
> >>> @Xi and Jianfeng: after we come up with the design, let's share it with
> >> the
> >>> group for an approval before the implementation.
> >>>
> >>> Chen
> >>>
> >>> On Fri, Jan 15, 2016 at 11:48 AM, Mike Carey <dt...@gmail.com>
> wrote:
> >>>
> >>>> The accounting is just as critical as the chunking - we should do both
> >>>> (together).
> >>>>
> >>>>
> >>>> On 1/15/16 9:00 AM, Till Westmann wrote:
> >>>>
> >>>>> I don’t have relevant experience on the subject. But I think that it
> >>>>> sounds good to avoid arbitrarily long chunks of memory. Especially -
> as
> >>>>> Jianfeng wrote - it would be good to be able to a) account for this
> >> memory
> >>>>> and b) to manage it.
> >>>>> An interesting question for me would be what the overhead of such a
> >>>>> Vector is compared to a simple Java array and as a result where it
> >> should
> >>>>> be used to replace arrays. (The comparison in [3] only compares
> >> different
> >>>>> Scala collections, but doesn’t look at plain arrays.)
> >>>>>
> >>>>> Cheers,
> >>>>> Till
> >>>>>
> >>>>> On 14 Jan 2016, at 22:05, Chen Li wrote:
> >>>>>
> >>>>> Before we ask Xi to work on this project, it will be good to know if
> >>>>>> other people have seen similar problems and agree with this plan.
> >>>>>> @Till: can you share some tips?
> >>>>>>
> >>>>>> Chen
> >>>>>>
> >>>>>> On Wed, Jan 13, 2016 at 4:27 PM, Jianfeng Jia <
> jianfeng.jia@gmail.com
> >>>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi Devs,
> >>>>>>>
> >>>>>>> First of all, Xi Zhang is a Master student at UCI wants to work
> with
> >> us
> >>>>>>> for a while. Welcome Xi!
> >>>>>>>
> >>>>>>> We are thinking of making a Frame-based, memory-bound
> >>>>>>> SerializableVector at first. We expect this vector can solve some
> >>>>>>> occasionally Java.Heap.OutOfMemory exceptions in Hyracks.
> >>>>>>> Though we did a good job on organizing the record-located memory,
> the
> >>>>>>> OOM exception can still happen while operating the auxiliary data
> >>>>>>> structure. For example in the sort run generator, instead of moving
> >> record
> >>>>>>> around we are creating an reference “pointer" array to the original
> >> record.
> >>>>>>> However, if the record is small and the size of that int array will
> >> be
> >>>>>>> large, then the OOM exception will occur, which is the case of
> issue
> >> [1].
> >>>>>>>
> >>>>>>> One way to solve this problem is to put auxiliary data structures
> >> into
> >>>>>>> the memory-bounded frame as well. In general, it will be much
> easier
> >> to ask
> >>>>>>> for multiple small memory blocks than one big chunk of memory. I
> >> guess that
> >>>>>>> was the same reason why we have “SerializableHashTable” for
> >> HashJoin. It
> >>>>>>> will be nice to have a more general structure that can be used by
> >> all the
> >>>>>>> operators.
> >>>>>>>
> >>>>>>> The Frame based Vector idea is inspired by the Scala Vector[2]
> which
> >>>>>>> looks like a List, but internally it is implemented as a 32-ary
> >> tree. The
> >>>>>>> performance of it is very stable for variety size of object[3]. It
> >> will
> >>>>>>> have all the benefits of ArrayList and the LinkedList. In addition,
> >> we can
> >>>>>>> take the memory usage of the auxiliary structure into the
> >> calculation. We
> >>>>>>> will work on the detailed design doc later if we are agree on this
> >>>>>>> direction.
> >>>>>>>
> >>>>>>> Any thoughts or suggestions? Thank you!
> >>>>>>>
> >>>>>>>
> >>>>>>> [1]
> >>>>>>>
> >>
> https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity
> >>>>>>> <
> >>>>>>>
> >>
> https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity
> >>>
> >>>>>>>
> >>>>>>> [2] https://bitbucket.org/astrieanna/bitmapped-vector-trie <
> >>>>>>> https://bitbucket.org/astrieanna/bitmapped-vector-trie>
> >>>>>>> [3]
> >>>>>>>
> >> http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/
> >>>>>>> <
> >>>>>>>
> >> http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/
> >
> >>>>>>>
> >>>>>>>
> >>>>>>> Best,
> >>>>>>>
> >>>>>>> Jianfeng Jia
> >>>>>>> PhD Candidate of Computer Science
> >>>>>>> University of California, Irvine
> >>>>>>>
> >>>>>>>
> >>>>
> >>
> >>
> >>
> >> Best,
> >>
> >> Jianfeng Jia
> >> PhD Candidate of Computer Science
> >> University of California, Irvine
> >>
> >>
>
>
>
> Best,
>
> Jianfeng Jia
> PhD Candidate of Computer Science
> University of California, Irvine
>
>

Re: Implement an SerializableVector in Hyracks

Posted by Jianfeng Jia <ji...@gmail.com>.

If I understand correctly, it seems very similar to the IFrame in Hyrack, which is also a container to store a sequence of record into the ByteBuffer.

I’m not clear about how records are manipulated inside the ValueVectors.
In Hyracks case, we store the pointer(usually a pair of (frameID, recordID)) in one java array and manipulate the pointers instead of the original records. We mainly want to break the that one array into multiple small ByteBuffer-based arrays. By doing so, we can reduce the risk of OOM for a large array and we may also take the memory usage of those pointers into account for the flush decisions.  @Ted, could you share some insights about how the ValueVectors handles manipulations? e.g. sort, hashing … etc.  

> On Feb 1, 2016, at 3:16 PM, Ted Dunning <te...@gmail.com> wrote:
> 
> Have you guys looked at the Drill ValueVectors?
> 
> This structure is being spun out as Apache Arrow with multiple interfaces
> and language bindings.
> 
> On Mon, Feb 1, 2016 at 9:56 AM, Jianfeng Jia <ji...@gmail.com> wrote:
> 
>> Hi,
>> 
>> We plan to implement an append-only array at first. The main reason is
>> this is how the auxiliary data structure be used so far. Then the
>> implementation is straightforward.
>> 
>> The tree-structured Vector in Scala can save a lot in updating case mainly
>> because of their immutable requirement. It can saving unnecessary data
>> copies comparing to other immutable list when updating. We only allow
>> in-place update. The tree design may be overkill for us.
>> 
>> Xi made on detailed design doc is here:
>> https://docs.google.com/document/d/1bs3JBCxmvuJZmBq_gKzUDt0FZdM9j3d3MPAmC29ZoSA/edit?usp=sharing
>> Any thoughts or comments?
>> 
>> 
>>> On Jan 18, 2016, at 9:52 AM, Chen Li <ch...@gmail.com> wrote:
>>> 
>>> @Xi and Jianfeng: after we come up with the design, let's share it with
>> the
>>> group for an approval before the implementation.
>>> 
>>> Chen
>>> 
>>> On Fri, Jan 15, 2016 at 11:48 AM, Mike Carey <dt...@gmail.com> wrote:
>>> 
>>>> The accounting is just as critical as the chunking - we should do both
>>>> (together).
>>>> 
>>>> 
>>>> On 1/15/16 9:00 AM, Till Westmann wrote:
>>>> 
>>>>> I don’t have relevant experience on the subject. But I think that it
>>>>> sounds good to avoid arbitrarily long chunks of memory. Especially - as
>>>>> Jianfeng wrote - it would be good to be able to a) account for this
>> memory
>>>>> and b) to manage it.
>>>>> An interesting question for me would be what the overhead of such a
>>>>> Vector is compared to a simple Java array and as a result where it
>> should
>>>>> be used to replace arrays. (The comparison in [3] only compares
>> different
>>>>> Scala collections, but doesn’t look at plain arrays.)
>>>>> 
>>>>> Cheers,
>>>>> Till
>>>>> 
>>>>> On 14 Jan 2016, at 22:05, Chen Li wrote:
>>>>> 
>>>>> Before we ask Xi to work on this project, it will be good to know if
>>>>>> other people have seen similar problems and agree with this plan.
>>>>>> @Till: can you share some tips?
>>>>>> 
>>>>>> Chen
>>>>>> 
>>>>>> On Wed, Jan 13, 2016 at 4:27 PM, Jianfeng Jia <jianfeng.jia@gmail.com
>>> 
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi Devs,
>>>>>>> 
>>>>>>> First of all, Xi Zhang is a Master student at UCI wants to work with
>> us
>>>>>>> for a while. Welcome Xi!
>>>>>>> 
>>>>>>> We are thinking of making a Frame-based, memory-bound
>>>>>>> SerializableVector at first. We expect this vector can solve some
>>>>>>> occasionally Java.Heap.OutOfMemory exceptions in Hyracks.
>>>>>>> Though we did a good job on organizing the record-located memory, the
>>>>>>> OOM exception can still happen while operating the auxiliary data
>>>>>>> structure. For example in the sort run generator, instead of moving
>> record
>>>>>>> around we are creating an reference “pointer" array to the original
>> record.
>>>>>>> However, if the record is small and the size of that int array will
>> be
>>>>>>> large, then the OOM exception will occur, which is the case of issue
>> [1].
>>>>>>> 
>>>>>>> One way to solve this problem is to put auxiliary data structures
>> into
>>>>>>> the memory-bounded frame as well. In general, it will be much easier
>> to ask
>>>>>>> for multiple small memory blocks than one big chunk of memory. I
>> guess that
>>>>>>> was the same reason why we have “SerializableHashTable” for
>> HashJoin. It
>>>>>>> will be nice to have a more general structure that can be used by
>> all the
>>>>>>> operators.
>>>>>>> 
>>>>>>> The Frame based Vector idea is inspired by the Scala Vector[2] which
>>>>>>> looks like a List, but internally it is implemented as a 32-ary
>> tree. The
>>>>>>> performance of it is very stable for variety size of object[3]. It
>> will
>>>>>>> have all the benefits of ArrayList and the LinkedList. In addition,
>> we can
>>>>>>> take the memory usage of the auxiliary structure into the
>> calculation. We
>>>>>>> will work on the detailed design doc later if we are agree on this
>>>>>>> direction.
>>>>>>> 
>>>>>>> Any thoughts or suggestions? Thank you!
>>>>>>> 
>>>>>>> 
>>>>>>> [1]
>>>>>>> 
>> https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity
>>>>>>> <
>>>>>>> 
>> https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity
>>> 
>>>>>>> 
>>>>>>> [2] https://bitbucket.org/astrieanna/bitmapped-vector-trie <
>>>>>>> https://bitbucket.org/astrieanna/bitmapped-vector-trie>
>>>>>>> [3]
>>>>>>> 
>> http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/
>>>>>>> <
>>>>>>> 
>> http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/>
>>>>>>> 
>>>>>>> 
>>>>>>> Best,
>>>>>>> 
>>>>>>> Jianfeng Jia
>>>>>>> PhD Candidate of Computer Science
>>>>>>> University of California, Irvine
>>>>>>> 
>>>>>>> 
>>>> 
>> 
>> 
>> 
>> Best,
>> 
>> Jianfeng Jia
>> PhD Candidate of Computer Science
>> University of California, Irvine
>> 
>> 



Best,

Jianfeng Jia
PhD Candidate of Computer Science
University of California, Irvine

Re: Implement an SerializableVector in Hyracks

Posted by Ted Dunning <te...@gmail.com>.

Have you guys looked at the Drill ValueVectors?

This structure is being spun out as Apache Arrow with multiple interfaces
and language bindings.

On Mon, Feb 1, 2016 at 9:56 AM, Jianfeng Jia <ji...@gmail.com> wrote:

> Hi,
>
> We plan to implement an append-only array at first. The main reason is
> this is how the auxiliary data structure be used so far. Then the
> implementation is straightforward.
>
> The tree-structured Vector in Scala can save a lot in updating case mainly
> because of their immutable requirement. It can saving unnecessary data
> copies comparing to other immutable list when updating. We only allow
> in-place update. The tree design may be overkill for us.
>
> Xi made on detailed design doc is here:
> https://docs.google.com/document/d/1bs3JBCxmvuJZmBq_gKzUDt0FZdM9j3d3MPAmC29ZoSA/edit?usp=sharing
> Any thoughts or comments?
>
>
> > On Jan 18, 2016, at 9:52 AM, Chen Li <ch...@gmail.com> wrote:
> >
> > @Xi and Jianfeng: after we come up with the design, let's share it with
> the
> > group for an approval before the implementation.
> >
> > Chen
> >
> > On Fri, Jan 15, 2016 at 11:48 AM, Mike Carey <dt...@gmail.com> wrote:
> >
> >> The accounting is just as critical as the chunking - we should do both
> >> (together).
> >>
> >>
> >> On 1/15/16 9:00 AM, Till Westmann wrote:
> >>
> >>> I don’t have relevant experience on the subject. But I think that it
> >>> sounds good to avoid arbitrarily long chunks of memory. Especially - as
> >>> Jianfeng wrote - it would be good to be able to a) account for this
> memory
> >>> and b) to manage it.
> >>> An interesting question for me would be what the overhead of such a
> >>> Vector is compared to a simple Java array and as a result where it
> should
> >>> be used to replace arrays. (The comparison in [3] only compares
> different
> >>> Scala collections, but doesn’t look at plain arrays.)
> >>>
> >>> Cheers,
> >>> Till
> >>>
> >>> On 14 Jan 2016, at 22:05, Chen Li wrote:
> >>>
> >>> Before we ask Xi to work on this project, it will be good to know if
> >>>> other people have seen similar problems and agree with this plan.
> >>>> @Till: can you share some tips?
> >>>>
> >>>> Chen
> >>>>
> >>>> On Wed, Jan 13, 2016 at 4:27 PM, Jianfeng Jia <jianfeng.jia@gmail.com
> >
> >>>> wrote:
> >>>>
> >>>>> Hi Devs,
> >>>>>
> >>>>> First of all, Xi Zhang is a Master student at UCI wants to work with
> us
> >>>>> for a while. Welcome Xi!
> >>>>>
> >>>>> We are thinking of making a Frame-based, memory-bound
> >>>>> SerializableVector at first. We expect this vector can solve some
> >>>>> occasionally Java.Heap.OutOfMemory exceptions in Hyracks.
> >>>>> Though we did a good job on organizing the record-located memory, the
> >>>>> OOM exception can still happen while operating the auxiliary data
> >>>>> structure. For example in the sort run generator, instead of moving
> record
> >>>>> around we are creating an reference “pointer" array to the original
> record.
> >>>>> However, if the record is small and the size of that int array will
> be
> >>>>> large, then the OOM exception will occur, which is the case of issue
> [1].
> >>>>>
> >>>>> One way to solve this problem is to put auxiliary data structures
> into
> >>>>> the memory-bounded frame as well. In general, it will be much easier
> to ask
> >>>>> for multiple small memory blocks than one big chunk of memory. I
> guess that
> >>>>> was the same reason why we have “SerializableHashTable” for
> HashJoin. It
> >>>>> will be nice to have a more general structure that can be used by
> all the
> >>>>> operators.
> >>>>>
> >>>>> The Frame based Vector idea is inspired by the Scala Vector[2] which
> >>>>> looks like a List, but internally it is implemented as a 32-ary
> tree. The
> >>>>> performance of it is very stable for variety size of object[3]. It
> will
> >>>>> have all the benefits of ArrayList and the LinkedList. In addition,
> we can
> >>>>> take the memory usage of the auxiliary structure into the
> calculation. We
> >>>>> will work on the detailed design doc later if we are agree on this
> >>>>> direction.
> >>>>>
> >>>>> Any thoughts or suggestions? Thank you!
> >>>>>
> >>>>>
> >>>>> [1]
> >>>>>
> https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity
> >>>>> <
> >>>>>
> https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity
> >
> >>>>>
> >>>>> [2] https://bitbucket.org/astrieanna/bitmapped-vector-trie <
> >>>>> https://bitbucket.org/astrieanna/bitmapped-vector-trie>
> >>>>> [3]
> >>>>>
> http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/
> >>>>> <
> >>>>>
> http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/>
> >>>>>
> >>>>>
> >>>>> Best,
> >>>>>
> >>>>> Jianfeng Jia
> >>>>> PhD Candidate of Computer Science
> >>>>> University of California, Irvine
> >>>>>
> >>>>>
> >>
>
>
>
> Best,
>
> Jianfeng Jia
> PhD Candidate of Computer Science
> University of California, Irvine
>
>

Re: Implement an SerializableVector in Hyracks

Posted by Jianfeng Jia <ji...@gmail.com>.

Hi, 

We plan to implement an append-only array at first. The main reason is this is how the auxiliary data structure be used so far. Then the implementation is straightforward. 

The tree-structured Vector in Scala can save a lot in updating case mainly because of their immutable requirement. It can saving unnecessary data copies comparing to other immutable list when updating. We only allow in-place update. The tree design may be overkill for us.

Xi made on detailed design doc is here: https://docs.google.com/document/d/1bs3JBCxmvuJZmBq_gKzUDt0FZdM9j3d3MPAmC29ZoSA/edit?usp=sharing
Any thoughts or comments? 


> On Jan 18, 2016, at 9:52 AM, Chen Li <ch...@gmail.com> wrote:
> 
> @Xi and Jianfeng: after we come up with the design, let's share it with the
> group for an approval before the implementation.
> 
> Chen
> 
> On Fri, Jan 15, 2016 at 11:48 AM, Mike Carey <dt...@gmail.com> wrote:
> 
>> The accounting is just as critical as the chunking - we should do both
>> (together).
>> 
>> 
>> On 1/15/16 9:00 AM, Till Westmann wrote:
>> 
>>> I don’t have relevant experience on the subject. But I think that it
>>> sounds good to avoid arbitrarily long chunks of memory. Especially - as
>>> Jianfeng wrote - it would be good to be able to a) account for this memory
>>> and b) to manage it.
>>> An interesting question for me would be what the overhead of such a
>>> Vector is compared to a simple Java array and as a result where it should
>>> be used to replace arrays. (The comparison in [3] only compares different
>>> Scala collections, but doesn’t look at plain arrays.)
>>> 
>>> Cheers,
>>> Till
>>> 
>>> On 14 Jan 2016, at 22:05, Chen Li wrote:
>>> 
>>> Before we ask Xi to work on this project, it will be good to know if
>>>> other people have seen similar problems and agree with this plan.
>>>> @Till: can you share some tips?
>>>> 
>>>> Chen
>>>> 
>>>> On Wed, Jan 13, 2016 at 4:27 PM, Jianfeng Jia <ji...@gmail.com>
>>>> wrote:
>>>> 
>>>>> Hi Devs,
>>>>> 
>>>>> First of all, Xi Zhang is a Master student at UCI wants to work with us
>>>>> for a while. Welcome Xi!
>>>>> 
>>>>> We are thinking of making a Frame-based, memory-bound
>>>>> SerializableVector at first. We expect this vector can solve some
>>>>> occasionally Java.Heap.OutOfMemory exceptions in Hyracks.
>>>>> Though we did a good job on organizing the record-located memory, the
>>>>> OOM exception can still happen while operating the auxiliary data
>>>>> structure. For example in the sort run generator, instead of moving record
>>>>> around we are creating an reference “pointer" array to the original record.
>>>>> However, if the record is small and the size of that int array will be
>>>>> large, then the OOM exception will occur, which is the case of issue [1].
>>>>> 
>>>>> One way to solve this problem is to put auxiliary data structures into
>>>>> the memory-bounded frame as well. In general, it will be much easier to ask
>>>>> for multiple small memory blocks than one big chunk of memory. I guess that
>>>>> was the same reason why we have “SerializableHashTable” for HashJoin. It
>>>>> will be nice to have a more general structure that can be used by all the
>>>>> operators.
>>>>> 
>>>>> The Frame based Vector idea is inspired by the Scala Vector[2] which
>>>>> looks like a List, but internally it is implemented as a 32-ary tree. The
>>>>> performance of it is very stable for variety size of object[3]. It will
>>>>> have all the benefits of ArrayList and the LinkedList. In addition, we can
>>>>> take the memory usage of the auxiliary structure into the calculation. We
>>>>> will work on the detailed design doc later if we are agree on this
>>>>> direction.
>>>>> 
>>>>> Any thoughts or suggestions? Thank you!
>>>>> 
>>>>> 
>>>>> [1]
>>>>> https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity
>>>>> <
>>>>> https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity>
>>>>> 
>>>>> [2] https://bitbucket.org/astrieanna/bitmapped-vector-trie <
>>>>> https://bitbucket.org/astrieanna/bitmapped-vector-trie>
>>>>> [3]
>>>>> http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/
>>>>> <
>>>>> http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/>
>>>>> 
>>>>> 
>>>>> Best,
>>>>> 
>>>>> Jianfeng Jia
>>>>> PhD Candidate of Computer Science
>>>>> University of California, Irvine
>>>>> 
>>>>> 
>> 



Best,

Jianfeng Jia
PhD Candidate of Computer Science
University of California, Irvine

Re: Implement an SerializableVector in Hyracks

Posted by Chen Li <ch...@gmail.com>.

@Xi and Jianfeng: after we come up with the design, let's share it with the
group for an approval before the implementation.

Chen

On Fri, Jan 15, 2016 at 11:48 AM, Mike Carey <dt...@gmail.com> wrote:

> The accounting is just as critical as the chunking - we should do both
> (together).
>
>
> On 1/15/16 9:00 AM, Till Westmann wrote:
>
>> I don’t have relevant experience on the subject. But I think that it
>> sounds good to avoid arbitrarily long chunks of memory. Especially - as
>> Jianfeng wrote - it would be good to be able to a) account for this memory
>> and b) to manage it.
>> An interesting question for me would be what the overhead of such a
>> Vector is compared to a simple Java array and as a result where it should
>> be used to replace arrays. (The comparison in [3] only compares different
>> Scala collections, but doesn’t look at plain arrays.)
>>
>> Cheers,
>> Till
>>
>> On 14 Jan 2016, at 22:05, Chen Li wrote:
>>
>> Before we ask Xi to work on this project, it will be good to know if
>>> other people have seen similar problems and agree with this plan.
>>> @Till: can you share some tips?
>>>
>>> Chen
>>>
>>> On Wed, Jan 13, 2016 at 4:27 PM, Jianfeng Jia <ji...@gmail.com>
>>> wrote:
>>>
>>>> Hi Devs,
>>>>
>>>> First of all, Xi Zhang is a Master student at UCI wants to work with us
>>>> for a while. Welcome Xi!
>>>>
>>>> We are thinking of making a Frame-based, memory-bound
>>>> SerializableVector at first. We expect this vector can solve some
>>>> occasionally Java.Heap.OutOfMemory exceptions in Hyracks.
>>>> Though we did a good job on organizing the record-located memory, the
>>>> OOM exception can still happen while operating the auxiliary data
>>>> structure. For example in the sort run generator, instead of moving record
>>>> around we are creating an reference “pointer" array to the original record.
>>>> However, if the record is small and the size of that int array will be
>>>> large, then the OOM exception will occur, which is the case of issue [1].
>>>>
>>>> One way to solve this problem is to put auxiliary data structures into
>>>> the memory-bounded frame as well. In general, it will be much easier to ask
>>>> for multiple small memory blocks than one big chunk of memory. I guess that
>>>> was the same reason why we have “SerializableHashTable” for HashJoin. It
>>>> will be nice to have a more general structure that can be used by all the
>>>> operators.
>>>>
>>>> The Frame based Vector idea is inspired by the Scala Vector[2] which
>>>> looks like a List, but internally it is implemented as a 32-ary tree. The
>>>> performance of it is very stable for variety size of object[3]. It will
>>>> have all the benefits of ArrayList and the LinkedList. In addition, we can
>>>> take the memory usage of the auxiliary structure into the calculation. We
>>>> will work on the detailed design doc later if we are agree on this
>>>> direction.
>>>>
>>>> Any thoughts or suggestions? Thank you!
>>>>
>>>>
>>>> [1]
>>>> https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity
>>>> <
>>>> https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity>
>>>>
>>>> [2] https://bitbucket.org/astrieanna/bitmapped-vector-trie <
>>>> https://bitbucket.org/astrieanna/bitmapped-vector-trie>
>>>> [3]
>>>> http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/
>>>> <
>>>> http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/>
>>>>
>>>>
>>>> Best,
>>>>
>>>> Jianfeng Jia
>>>> PhD Candidate of Computer Science
>>>> University of California, Irvine
>>>>
>>>>
>

Re: Implement an SerializableVector in Hyracks

Posted by Mike Carey <dt...@gmail.com>.

The accounting is just as critical as the chunking - we should do both 
(together).

On 1/15/16 9:00 AM, Till Westmann wrote:
> I don’t have relevant experience on the subject. But I think that it 
> sounds good to avoid arbitrarily long chunks of memory. Especially - 
> as Jianfeng wrote - it would be good to be able to a) account for this 
> memory and b) to manage it.
> An interesting question for me would be what the overhead of such a 
> Vector is compared to a simple Java array and as a result where it 
> should be used to replace arrays. (The comparison in [3] only compares 
> different Scala collections, but doesn’t look at plain arrays.)
>
> Cheers,
> Till
>
> On 14 Jan 2016, at 22:05, Chen Li wrote:
>
>> Before we ask Xi to work on this project, it will be good to know if
>> other people have seen similar problems and agree with this plan.
>> @Till: can you share some tips?
>>
>> Chen
>>
>> On Wed, Jan 13, 2016 at 4:27 PM, Jianfeng Jia 
>> <ji...@gmail.com> wrote:
>>> Hi Devs,
>>>
>>> First of all, Xi Zhang is a Master student at UCI wants to work with 
>>> us for a while. Welcome Xi!
>>>
>>> We are thinking of making a Frame-based, memory-bound 
>>> SerializableVector at first. We expect this vector can solve some 
>>> occasionally Java.Heap.OutOfMemory exceptions in Hyracks.
>>> Though we did a good job on organizing the record-located memory, 
>>> the OOM exception can still happen while operating the auxiliary 
>>> data structure. For example in the sort run generator, instead of 
>>> moving record around we are creating an reference “pointer" array to 
>>> the original record. However, if the record is small and the size of 
>>> that int array will be large, then the OOM exception will occur, 
>>> which is the case of issue [1].
>>>
>>> One way to solve this problem is to put auxiliary data structures 
>>> into the memory-bounded frame as well. In general, it will be much 
>>> easier to ask for multiple small memory blocks than one big chunk of 
>>> memory. I guess that was the same reason why we have 
>>> “SerializableHashTable” for HashJoin. It will be nice to have a more 
>>> general structure that can be used by all the operators.
>>>
>>> The Frame based Vector idea is inspired by the Scala Vector[2] which 
>>> looks like a List, but internally it is implemented as a 32-ary 
>>> tree. The performance of it is very stable for variety size of 
>>> object[3]. It will have all the benefits of ArrayList and the 
>>> LinkedList. In addition, we can take the memory usage of the 
>>> auxiliary structure into the calculation. We will work on the 
>>> detailed design doc later if we are agree on this direction.
>>>
>>> Any thoughts or suggestions? Thank you!
>>>
>>>
>>> [1] 
>>> https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity 
>>> <https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity> 
>>>
>>> [2] https://bitbucket.org/astrieanna/bitmapped-vector-trie 
>>> <https://bitbucket.org/astrieanna/bitmapped-vector-trie>
>>> [3] 
>>> http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/ 
>>> <http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/> 
>>>
>>>
>>> Best,
>>>
>>> Jianfeng Jia
>>> PhD Candidate of Computer Science
>>> University of California, Irvine
>>>

Re: Implement an SerializableVector in Hyracks

Posted by Till Westmann <ti...@apache.org>.

Agreed!

On 15 Jan 2016, at 10:30, Jianfeng Jia wrote:

> Sounds a positive support :-).
> One thing this Vector will sure better than plain array is to support 
> editing. (Update, Insert, Delete, even Append is better)
> It will have some overhead than the plain array as for the random 
> access. How much overhead it introduced will be an interesting 
> experiment to do after we have some initial implementation.
>
>> On Jan 15, 2016, at 9:00 AM, Till Westmann <ti...@apache.org> wrote:
>>
>> I don’t have relevant experience on the subject. But I think that 
>> it sounds good to avoid arbitrarily long chunks of memory. Especially 
>> - as Jianfeng wrote - it would be good to be able to a) account for 
>> this memory and b) to manage it.
>> An interesting question for me would be what the overhead of such a 
>> Vector is compared to a simple Java array and as a result where it 
>> should be used to replace arrays. (The comparison in [3] only 
>> compares different Scala collections, but doesn’t look at plain 
>> arrays.)
>>
>> Cheers,
>> Till
>>
>> On 14 Jan 2016, at 22:05, Chen Li wrote:
>>
>>> Before we ask Xi to work on this project, it will be good to know if
>>> other people have seen similar problems and agree with this plan.
>>> @Till: can you share some tips?
>>>
>>> Chen
>>>
>>> On Wed, Jan 13, 2016 at 4:27 PM, Jianfeng Jia 
>>> <ji...@gmail.com> wrote:
>>>> Hi Devs,
>>>>
>>>> First of all, Xi Zhang is a Master student at UCI wants to work 
>>>> with us for a while. Welcome Xi!
>>>>
>>>> We are thinking of making a Frame-based, memory-bound 
>>>> SerializableVector at first. We expect this vector can solve some 
>>>> occasionally Java.Heap.OutOfMemory exceptions in Hyracks.
>>>> Though we did a good job on organizing the record-located memory, 
>>>> the OOM exception can still happen while operating the auxiliary 
>>>> data structure. For example in the sort run generator, instead of 
>>>> moving record around we are creating an reference “pointer" array 
>>>> to the original record. However, if the record is small and the 
>>>> size of that int array will be large, then the OOM exception will 
>>>> occur, which is the case of issue [1].
>>>>
>>>> One way to solve this problem is to put auxiliary data structures 
>>>> into the memory-bounded frame as well. In general, it will be much 
>>>> easier to ask for multiple small memory blocks than one big chunk 
>>>> of memory. I guess that was the same reason why we have 
>>>> “SerializableHashTable” for HashJoin. It will be nice to have a 
>>>> more general structure that can be used by all the operators.
>>>>
>>>> The Frame based Vector idea is inspired by the Scala Vector[2] 
>>>> which looks like a List, but internally it is implemented as a 
>>>> 32-ary tree. The performance of it is very stable for variety size 
>>>> of object[3]. It will have all the benefits of ArrayList and the 
>>>> LinkedList. In addition, we can take the memory usage of the 
>>>> auxiliary structure into the calculation. We will work on the 
>>>> detailed design doc later if we are agree on this direction.
>>>>
>>>> Any thoughts or suggestions? Thank you!
>>>>
>>>>
>>>> [1] 
>>>> https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity 
>>>> <https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity>
>>>> [2] https://bitbucket.org/astrieanna/bitmapped-vector-trie 
>>>> <https://bitbucket.org/astrieanna/bitmapped-vector-trie>
>>>> [3] 
>>>> http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/ 
>>>> <http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/>
>>>>
>>>> Best,
>>>>
>>>> Jianfeng Jia
>>>> PhD Candidate of Computer Science
>>>> University of California, Irvine
>>>>
>
>
>
> Best,
>
> Jianfeng Jia
> PhD Candidate of Computer Science
> University of California, Irvine

Re: Implement an SerializableVector in Hyracks

Posted by Jianfeng Jia <ji...@gmail.com>.

Sounds a positive support :-). 
One thing this Vector will sure better than plain array is to support editing. (Update, Insert, Delete, even Append is better)
It will have some overhead than the plain array as for the random access. How much overhead it introduced will be an interesting experiment to do after we have some initial implementation.

> On Jan 15, 2016, at 9:00 AM, Till Westmann <ti...@apache.org> wrote:
> 
> I don’t have relevant experience on the subject. But I think that it sounds good to avoid arbitrarily long chunks of memory. Especially - as Jianfeng wrote - it would be good to be able to a) account for this memory and b) to manage it.
> An interesting question for me would be what the overhead of such a Vector is compared to a simple Java array and as a result where it should be used to replace arrays. (The comparison in [3] only compares different Scala collections, but doesn’t look at plain arrays.)
> 
> Cheers,
> Till
> 
> On 14 Jan 2016, at 22:05, Chen Li wrote:
> 
>> Before we ask Xi to work on this project, it will be good to know if
>> other people have seen similar problems and agree with this plan.
>> @Till: can you share some tips?
>> 
>> Chen
>> 
>> On Wed, Jan 13, 2016 at 4:27 PM, Jianfeng Jia <ji...@gmail.com> wrote:
>>> Hi Devs,
>>> 
>>> First of all, Xi Zhang is a Master student at UCI wants to work with us for a while. Welcome Xi!
>>> 
>>> We are thinking of making a Frame-based, memory-bound SerializableVector at first. We expect this vector can solve some occasionally Java.Heap.OutOfMemory exceptions in Hyracks.
>>> Though we did a good job on organizing the record-located memory, the OOM exception can still happen while operating the auxiliary data structure. For example in the sort run generator, instead of moving record around we are creating an reference “pointer" array to the original record. However, if the record is small and the size of that int array will be large, then the OOM exception will occur, which is the case of issue [1].
>>> 
>>> One way to solve this problem is to put auxiliary data structures into the memory-bounded frame as well. In general, it will be much easier to ask for multiple small memory blocks than one big chunk of memory. I guess that was the same reason why we have “SerializableHashTable” for HashJoin. It will be nice to have a more general structure that can be used by all the operators.
>>> 
>>> The Frame based Vector idea is inspired by the Scala Vector[2] which looks like a List, but internally it is implemented as a 32-ary tree. The performance of it is very stable for variety size of object[3]. It will have all the benefits of ArrayList and the LinkedList. In addition, we can take the memory usage of the auxiliary structure into the calculation. We will work on the detailed design doc later if we are agree on this direction.
>>> 
>>> Any thoughts or suggestions? Thank you!
>>> 
>>> 
>>> [1] https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity <https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity>
>>> [2] https://bitbucket.org/astrieanna/bitmapped-vector-trie <https://bitbucket.org/astrieanna/bitmapped-vector-trie>
>>> [3] http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/ <http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/>
>>> 
>>> Best,
>>> 
>>> Jianfeng Jia
>>> PhD Candidate of Computer Science
>>> University of California, Irvine
>>> 



Best,

Jianfeng Jia
PhD Candidate of Computer Science
University of California, Irvine

Re: Implement an SerializableVector in Hyracks

Posted by Till Westmann <ti...@apache.org>.

I don’t have relevant experience on the subject. But I think that it 
sounds good to avoid arbitrarily long chunks of memory. Especially - as 
Jianfeng wrote - it would be good to be able to a) account for this 
memory and b) to manage it.
An interesting question for me would be what the overhead of such a 
Vector is compared to a simple Java array and as a result where it 
should be used to replace arrays. (The comparison in [3] only compares 
different Scala collections, but doesn’t look at plain arrays.)

Cheers,
Till

On 14 Jan 2016, at 22:05, Chen Li wrote:

> Before we ask Xi to work on this project, it will be good to know if
> other people have seen similar problems and agree with this plan.
> @Till: can you share some tips?
>
> Chen
>
> On Wed, Jan 13, 2016 at 4:27 PM, Jianfeng Jia <ji...@gmail.com> 
> wrote:
>> Hi Devs,
>>
>> First of all, Xi Zhang is a Master student at UCI wants to work with 
>> us for a while. Welcome Xi!
>>
>> We are thinking of making a Frame-based, memory-bound 
>> SerializableVector at first. We expect this vector can solve some 
>> occasionally Java.Heap.OutOfMemory exceptions in Hyracks.
>> Though we did a good job on organizing the record-located memory, the 
>> OOM exception can still happen while operating the auxiliary data 
>> structure. For example in the sort run generator, instead of moving 
>> record around we are creating an reference “pointer" array to the 
>> original record. However, if the record is small and the size of that 
>> int array will be large, then the OOM exception will occur, which is 
>> the case of issue [1].
>>
>> One way to solve this problem is to put auxiliary data structures 
>> into the memory-bounded frame as well. In general, it will be much 
>> easier to ask for multiple small memory blocks than one big chunk of 
>> memory. I guess that was the same reason why we have 
>> “SerializableHashTable” for HashJoin. It will be nice to have a 
>> more general structure that can be used by all the operators.
>>
>> The Frame based Vector idea is inspired by the Scala Vector[2] which 
>> looks like a List, but internally it is implemented as a 32-ary tree. 
>> The performance of it is very stable for variety size of object[3]. 
>> It will have all the benefits of ArrayList and the LinkedList. In 
>> addition, we can take the memory usage of the auxiliary structure 
>> into the calculation. We will work on the detailed design doc later 
>> if we are agree on this direction.
>>
>> Any thoughts or suggestions? Thank you!
>>
>>
>> [1] 
>> https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity 
>> <https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity>
>> [2] https://bitbucket.org/astrieanna/bitmapped-vector-trie 
>> <https://bitbucket.org/astrieanna/bitmapped-vector-trie>
>> [3] 
>> http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/ 
>> <http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/>
>>
>> Best,
>>
>> Jianfeng Jia
>> PhD Candidate of Computer Science
>> University of California, Irvine
>>

Re: Implement an SerializableVector in Hyracks

Posted by Chen Li <ch...@gmail.com>.

Before we ask Xi to work on this project, it will be good to know if
other people have seen similar problems and agree with this plan.
@Till: can you share some tips?

Chen

On Wed, Jan 13, 2016 at 4:27 PM, Jianfeng Jia <ji...@gmail.com> wrote:
> Hi Devs,
>
> First of all, Xi Zhang is a Master student at UCI wants to work with us for a while. Welcome Xi!
>
> We are thinking of making a Frame-based, memory-bound SerializableVector at first. We expect this vector can solve some occasionally Java.Heap.OutOfMemory exceptions in Hyracks.
> Though we did a good job on organizing the record-located memory, the OOM exception can still happen while operating the auxiliary data structure. For example in the sort run generator, instead of moving record around we are creating an reference “pointer" array to the original record. However, if the record is small and the size of that int array will be large, then the OOM exception will occur, which is the case of issue [1].
>
> One way to solve this problem is to put auxiliary data structures into the memory-bounded frame as well. In general, it will be much easier to ask for multiple small memory blocks than one big chunk of memory. I guess that was the same reason why we have “SerializableHashTable” for HashJoin. It will be nice to have a more general structure that can be used by all the operators.
>
> The Frame based Vector idea is inspired by the Scala Vector[2] which looks like a List, but internally it is implemented as a 32-ary tree. The performance of it is very stable for variety size of object[3]. It will have all the benefits of ArrayList and the LinkedList. In addition, we can take the memory usage of the auxiliary structure into the calculation. We will work on the detailed design doc later if we are agree on this direction.
>
> Any thoughts or suggestions? Thank you!
>
>
> [1] https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity <https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity>
> [2] https://bitbucket.org/astrieanna/bitmapped-vector-trie <https://bitbucket.org/astrieanna/bitmapped-vector-trie>
> [3] http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/ <http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/>
>
> Best,
>
> Jianfeng Jia
> PhD Candidate of Computer Science
> University of California, Irvine
>

Re: Implement an SerializableVector in Hyracks

Posted by Mike Carey <dt...@gmail.com>.

Sounds like a good project!

On 1/13/16 4:27 PM, Jianfeng Jia wrote:
> Hi Devs,
>
> First of all, Xi Zhang is a Master student at UCI wants to work with us for a while. Welcome Xi!
>
> We are thinking of making a Frame-based, memory-bound SerializableVector at first. We expect this vector can solve some occasionally Java.Heap.OutOfMemory exceptions in Hyracks.
> Though we did a good job on organizing the record-located memory, the OOM exception can still happen while operating the auxiliary data structure. For example in the sort run generator, instead of moving record around we are creating an reference “pointer" array to the original record. However, if the record is small and the size of that int array will be large, then the OOM exception will occur, which is the case of issue [1].
>
> One way to solve this problem is to put auxiliary data structures into the memory-bounded frame as well. In general, it will be much easier to ask for multiple small memory blocks than one big chunk of memory. I guess that was the same reason why we have “SerializableHashTable” for HashJoin. It will be nice to have a more general structure that can be used by all the operators.
>
> The Frame based Vector idea is inspired by the Scala Vector[2] which looks like a List, but internally it is implemented as a 32-ary tree. The performance of it is very stable for variety size of object[3]. It will have all the benefits of ArrayList and the LinkedList. In addition, we can take the memory usage of the auxiliary structure into the calculation. We will work on the detailed design doc later if we are agree on this direction.
>
> Any thoughts or suggestions? Thank you!
>
>
> [1] https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity <https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity>
> [2] https://bitbucket.org/astrieanna/bitmapped-vector-trie <https://bitbucket.org/astrieanna/bitmapped-vector-trie>
> [3] http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/ <http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/>
>
> Best,
>
> Jianfeng Jia
> PhD Candidate of Computer Science
> University of California, Irvine
>
>