Posted to user@kudu.apache.org by Benjamin Kim <bb...@gmail.com> on 2016/05/12 21:08:20 UTC

Sparse Data

Can Kudu handle the use case where sparse data is involved? In many of our processes, we deal with data that can have any number of columns and many previously unknown column names depending on what attributes are brought in at the time. Currently, we use HBase to handle this. Since Kudu is based on HBase, can it do the same? Or, do we have to use a map data type column for this?

Thanks,
Ben

Re: Sparse Data

Posted by Benjamin Kim <bb...@gmail.com>.

So, basically go back to using relational database techniques. Got it. But, how was the performance?

Cheers,
Ben

> On May 12, 2016, at 2:43 PM, Chris George <Ch...@rms.com> wrote:
> 
> I’ve used Kudu with an EAV model for sparse data and that worked extremely well for us with billions of rows and the correct partitioning.
> -Chris
> 
> On 5/12/16, 3:21 PM, "Dan Burkert" <da...@cloudera.com> wrote:
> 
> Hi Ben,
> 
> Kudu doesn't support sparse datasets with many columns very well. Kudu's data model looks much more like the relational, structured data model of a traditional SQL database than HBase's data model. Kudu doesn't yet have a map column type (or any nested column types), but we do have BINARY typed columns if you can handle your own serialization. Oftentimes, however, it's better to restructure the data so that it can fit Kudu's structure better. If you can give more information about your usage patterns (especially details of the queries you wish to optimize for), I can perhaps give better info.
> 
> - Dan
> 
> On Thu, May 12, 2016 at 2:08 PM, Benjamin Kim <bb...@gmail.com> wrote:
> Can Kudu handle the use case where sparse data is involved? In many of our processes, we deal with data that can have any number of columns and many previously unknown column names depending on what attributes are brought in at the time. Currently, we use HBase to handle this. Since Kudu is based on HBase, can it do the same? Or, do we have to use a map data type column for this?
> 
> Thanks,
> Ben
> 
>

Re: Sparse Data

Posted by Chris George <Ch...@rms.com>.

I've used Kudu with an EAV model for sparse data and that worked extremely well for us with billions of rows and the correct partitioning.
-Chris
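
For readers unfamiliar with the EAV (entity-attribute-value) approach mentioned above, here is a minimal sketch of the idea in plain Python. It is not taken from the thread and does not call any Kudu API: it only shows how sparse records with arbitrary attribute names could be flattened into fixed-width rows. The names `entity_id`, `attribute`, and `value`, and the choice to store every value as a string, are assumptions for illustration; the table layout and partitioning that were actually used are not described in the thread.

```python
from collections import defaultdict


def to_eav_rows(entity_id, attributes):
    """Flatten one sparse record into (entity_id, attribute, value) rows.

    Attributes whose value is None are skipped, so a missing attribute
    simply has no row (hypothetical convention for this sketch).
    """
    return [
        (entity_id, name, str(value))
        for name, value in sorted(attributes.items())
        if value is not None
    ]


def from_eav_rows(rows):
    """Rebuild per-entity attribute dicts from EAV rows."""
    records = defaultdict(dict)
    for entity_id, name, value in rows:
        records[entity_id][name] = value
    return dict(records)


# Two records with different, previously unknown attribute names.
rows = to_eav_rows("user-1", {"color": "red", "height_cm": 180})
rows += to_eav_rows("user-2", {"color": "blue", "plan": None})
```

In a layout like this, each row would presumably map to a table keyed on (`entity_id`, `attribute`) with a single value column, so new attribute names need no schema change. The trade-off is that values lose their native types unless a type column or per-type value columns are added.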

On 5/12/16, 3:21 PM, "Dan Burkert" <da...@cloudera.com> wrote:

Hi Ben,

Kudu doesn't support sparse datasets with many columns very well. Kudu's data model looks much more like the relational, structured data model of a traditional SQL database than HBase's data model. Kudu doesn't yet have a map column type (or any nested column types), but we do have BINARY typed columns if you can handle your own serialization. Oftentimes, however, it's better to restructure the data so that it can fit Kudu's structure better. If you can give more information about your usage patterns (especially details of the queries you wish to optimize for), I can perhaps give better info.

- Dan

On Thu, May 12, 2016 at 2:08 PM, Benjamin Kim <bb...@gmail.com> wrote:
Can Kudu handle the use case where sparse data is involved? In many of our processes, we deal with data that can have any number of columns and many previously unknown column names depending on what attributes are brought in at the time. Currently, we use HBase to handle this. Since Kudu is based on HBase, can it do the same? Or, do we have to use a map data type column for this?

Thanks,
Ben

Re: Sparse Data

Posted by Dan Burkert <da...@cloudera.com>.

Hi Ben,

Kudu doesn't support sparse datasets with many columns very well.  Kudu's
data model looks much more like the relational, structured data model of a
traditional SQL database than HBase's data model.  Kudu doesn't yet have a
map column type (or any nested column types), but we do have BINARY typed
columns if you can handle your own serialization. Oftentimes, however, it's
better to restructure the data so that it can fit Kudu's structure better.
If you can give more information about your usage patterns (especially
details of the queries you wish to optimize for), I can perhaps give better
info.

- Dan
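
As a rough illustration of the "handle your own serialization" option above, the sketch below packs a record's sparse attributes into a single byte string that could be stored in a BINARY column. It is not from the thread and makes no Kudu API calls; the use of JSON and the helper names `encode_attributes` and `decode_attributes` are assumptions chosen for simplicity. Any other encoding (Avro, Protocol Buffers, and so on) would follow the same pattern.

```python
import json


def encode_attributes(attributes):
    """Serialize a dict of sparse attributes to bytes.

    Keys are sorted and whitespace is dropped so that equal dicts
    always produce identical bytes.
    """
    text = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return text.encode("utf-8")


def decode_attributes(blob):
    """Inverse of encode_attributes: bytes back to a dict."""
    return json.loads(blob.decode("utf-8"))


# One record's variable attributes packed into a single value.
blob = encode_attributes({"height_cm": 180, "color": "red"})
restored = decode_attributes(blob)
```

The cost of this design is that the store sees only opaque bytes, so filtering on an individual attribute has to happen in the client after decoding rather than in the scan itself.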

On Thu, May 12, 2016 at 2:08 PM, Benjamin Kim <bb...@gmail.com> wrote:

> Can Kudu handle the use case where sparse data is involved? In many of our
> processes, we deal with data that can have any number of columns and many
> previously unknown column names depending on what attributes are brought in
> at the time. Currently, we use HBase to handle this. Since Kudu is based on
> HBase, can it do the same? Or, do we have to use a map data type column for
> this?
>
> Thanks,
> Ben
>
>