Posted to dev@arrow.apache.org by Kirill Lykov <ly...@gmail.com> on 2021/11/17 15:00:38 UTC

updates on top of arrow

Hi,

I'm trying to build something like an embedded database system around
Arrow, with the properties listed below.
Can anyone point me to existing solutions (or attempts) for this kind of
problem?
I'm trying to develop a data layout with these properties:
* reads are typically for a range of keys/fields rather than a single value
* everything is in memory
* reading and writing have low overhead
* reads can be multithreaded, but updates are single-threaded
* 99% of updates append a new data point to the end of a column (a la log)
* data is returned as Arrow tables with minimal overhead (preferably by
slicing the internal storage rather than copying it)
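
To make the requirements above concrete, here is a minimal sketch of one
possible layout, written with the Python standard library only. It is not
Arrow code: the class name AppendOnlyColumn, the CHUNK_ROWS constant and the
read_range method are all hypothetical names chosen for illustration. The
column is a list of preallocated fixed-size chunks; the single writer fills a
slot and only then publishes the new length, and readers get memoryview
slices that reference the chunk storage instead of copying it. In Arrow terms
each chunk would roughly correspond to one array of a chunked column, and
each view to a zero-copy slice of it.

```python
from array import array

CHUNK_ROWS = 4  # hypothetical; tiny so the example crosses chunk boundaries


class AppendOnlyColumn:
    """Append-only int64 column stored as fixed-capacity chunks (a sketch)."""

    def __init__(self, chunk_rows=CHUNK_ROWS):
        self._chunk_rows = chunk_rows
        self._chunks = []  # each chunk is preallocated and never resized
        self._length = 0   # number of committed rows, published last

    def append(self, value):
        # Single writer: fill the slot first, then publish the new length.
        slot = self._length % self._chunk_rows
        if slot == 0:
            self._chunks.append(array("q", [0]) * self._chunk_rows)
        self._chunks[-1][slot] = value
        self._length += 1

    def __len__(self):
        return self._length

    def read_range(self, start, stop):
        """Return memoryview slices covering rows [start, stop), no copying."""
        stop = min(stop, self._length)  # only rows committed so far
        views = []
        while start < stop:
            chunk_index, offset = divmod(start, self._chunk_rows)
            take = min(self._chunk_rows - offset, stop - start)
            view = memoryview(self._chunks[chunk_index])
            views.append(view[offset:offset + take])
            start += take
        return views


col = AppendOnlyColumn()
for v in range(10):
    col.append(v * 10)

# Rows 2..8 span three chunks, so three views come back.
print([list(v) for v in col.read_range(2, 9)])
```

Because chunks are never resized, views handed to readers stay valid while
the writer keeps appending. This sketch leans on a single writer and on the
length being updated last; a real multithreaded implementation would need a
proper atomic or lock for that publication step.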

At a high level I had in mind something like SSTables with leveled
compaction performed when the data is accessed for reading. But before doing
this I wanted to check whether a solution to this problem already exists
that leverages Arrow, or whether all the building blocks for one are
available. Any input (including papers) is welcome.
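
As a rough illustration of the SSTable-style idea, the sketch below keeps
sorted runs in levels and merges a level into a single run one level down
once it holds too many runs. It is a simplified merge scheme, not full
leveled compaction as found in LSM-tree stores, and the names LeveledRuns,
FANOUT, add_run, compact and scan are hypothetical. When compact() should be
called (for example lazily before a read, as described above) is left open.

```python
import heapq

FANOUT = 2  # hypothetical: runs allowed per level before it is merged


class LeveledRuns:
    """Sorted runs of (key, value) rows, grouped into levels (a sketch)."""

    def __init__(self, fanout=FANOUT):
        self.fanout = fanout
        self.levels = [[]]  # levels[i] is a list of sorted runs

    def add_run(self, rows):
        # New data always lands in level 0 as its own sorted run.
        self.levels[0].append(sorted(rows))

    def compact(self):
        # Merge any over-full level into one run on the next level.
        level = 0
        while level < len(self.levels):
            if len(self.levels[level]) > self.fanout:
                merged = list(heapq.merge(*self.levels[level]))
                self.levels[level] = []
                if level + 1 == len(self.levels):
                    self.levels.append([])
                self.levels[level + 1].append(merged)
            level += 1

    def scan(self, lo, hi):
        # Range read: merge all runs and keep keys in [lo, hi).
        runs = [run for lvl in self.levels for run in lvl]
        return [(k, v) for k, v in heapq.merge(*runs) if lo <= k < hi]


store = LeveledRuns()
store.add_run([(5, "e"), (1, "a")])
store.add_run([(4, "d"), (2, "b")])
store.add_run([(3, "c"), (6, "f")])
store.compact()  # three runs in level 0 exceed FANOUT, so they are merged
print(store.scan(2, 5))
```

Fewer, larger runs make range scans cheaper because fewer inputs have to be
merged per read, at the cost of rewriting data during compaction.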

-- 
Best regards,
Kirill Lykov

Re: updates on top of arrow

Posted by Nate Bauernfeind <na...@deephaven.io>.

It's not clear whether this will actually fit your use case.
In particular, "low overhead" means different things to different people.
Also, my suggestion is not a database -- it is an analytics engine. The
persistence part of the problem is off the table at this time, too. I
wanted to mention that up front in case any of it is important to you.

With the caveats out of the way, let's get started. I work at Deephaven. We
have released a community edition of a powerful tool that is built around
real-time updates. We use Arrow Flight for IPC. To subscribe to a changing
stream of data we use DoExchange, which lets the client tell the server
which subset of the data it is interested in when it does not want all of
the data (consider GUI use cases).

The append-only pattern is the simplest, dip-your-toes-in case of updating
data. I highly recommend this article, in which our CTO goes into detail
about the history and evolution of our update model design. It will give
you a lot of insight into what we tried to build and should help you judge
whether it fits your needs. Maybe it will excite you.
Link: https://deephaven.io/core/docs/conceptual/table-update-model/

I wrote an article describing how we arrived at gRPC and Arrow. It assumes
at least a high-level understanding of the update model, since extending it
to work over Arrow Flight was really about finding the right application
metadata to keep the payload in a consistent [Arrow] wire format.
Link: https://deephaven.io/core/docs/conceptual/deephaven-core-api/

I can't find a short and sweet demo on our YouTube channel, but one or two
of the weekly demos may show you whether this is a tool worth investing
your time to learn more about.
Link: https://www.youtube.com/channel/UCoaYOlkX555PSTTJz8ZaI_w/

If you have an hour or two to explore and research -- it takes very little
effort to get our simplest examples running.
The source, including getting-started instructions, is here:
https://github.com/deephaven/deephaven-core

I'd love to answer your questions directly or help you make the most of your
research time. Feel free to jump into our Gitter and ask questions:
https://gitter.im/deephaven/deephaven

If you play around with it and don't like it -- I'd love to hear that
feedback, too.

Nate
