Posted to user@kudu.apache.org by Boris Tyukin <bo...@boristyukin.com> on 2019/09/07 16:42:32 UTC

Long text and complex data types support

Hi guys,

Any plans to support a long text type in Kudu? We would love to use Kudu with
other projects, but unfortunately long text data is pretty common in the
healthcare industry, so we have to use Hive/Impala/HDFS instead, which is
quite painful since we cannot do in-place updates and deletes.

Same question about complex types (arrays, maps, etc.).

Thanks

Re: Long text and complex data types support

Posted by Boris Tyukin <bo...@boristyukin.com>.
I cannot explain why, because it is the vendors who design EMR/EHR systems
and the databases to support them, but in our case (Cerner EHR) there are
text fields taking many MBs; we've seen as high as 64 MB (again, do not ask
me why :))

Re: Long text and complex data types support

Posted by Grant Henke <gh...@cloudera.com>.
>
> Oracle has CLOBs and BLOBs, MS SQL has varchar(max) and binary. I believe
> SnowFlake and Redshift have similar data types.
>

Today Kudu supports String columns, which can hold up to 64 KB of UTF-8
encoded characters. I assume you are asking because that limit is too
small. How large would these text columns need to be?
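One possible workaround while the limit stands is to split a long text
into pieces that each fit in a String cell and store one piece per row.
The following is only a sketch of the splitting step in plain Python; it
does not use any Kudu client API, and the function name and the idea of
keying rows by something like (note_id, chunk_no) are assumptions made for
illustration. It takes care never to cut through a multi-byte UTF-8
character.

```python
from typing import List

MAX_CELL_BYTES = 64 * 1024  # the 64 KB String limit mentioned in this thread


def split_utf8(text: str, max_bytes: int = MAX_CELL_BYTES) -> List[str]:
    """Split text into pieces whose UTF-8 encoding is at most max_bytes each,
    never cutting through a multi-byte character."""
    if max_bytes < 4:
        raise ValueError("max_bytes must be at least 4 (longest UTF-8 sequence)")
    data = text.encode("utf-8")
    chunks: List[str] = []
    start = 0
    while start < len(data):
        end = min(start + max_bytes, len(data))
        # If the cut would land on a continuation byte (0b10xxxxxx),
        # back up to the lead byte of that character.
        while end < len(data) and (data[end] & 0xC0) == 0x80:
            end -= 1
        chunks.append(data[start:end].decode("utf-8"))
        start = end
    return chunks
```

Joining the pieces in chunk order gives back the original text. A 64 MB
note would need at least 1,024 such pieces, so this is a stopgap rather
than a substitute for a real long text type.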





-- 
Grant Henke
Software Engineer | Cloudera
grant@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke

Re: Long text and complex data types support

Posted by Boris Tyukin <bo...@boristyukin.com>.
Hi Grant,

thanks for responding!

Oracle has CLOBs and BLOBs, MS SQL has varchar(max) and binary. I believe
Snowflake and Redshift have similar data types.

In healthcare, a lot of good data is trapped in physician notes, progress
reports, discharge summaries, etc., and it takes time for specially trained
people (medical coders and abstractors) to read these reports and structure
them (assign billing codes, classify procedures and diagnoses, etc.). Some
things will never get coded and will stay trapped in the text.

Another example in healthcare is patient satisfaction surveys with free
text comments.

As for complex data types, we recently had a small project ingesting FHIR
bundles, which are highly nested and complex JSON data sets. Just go to the
HL7 FHIR site to see examples. This is one of the easiest FHIR document
samples to comprehend:
https://www.hl7.org/fhir/patient-example.json.html

We ended up using Hive to store them and Spark to get meaningful data, but
the data is mutable and a lot of rows need to be updated/deleted daily,
which is painful with Hive.
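To give a feel for what "highly nested" means for a flat, column-oriented
table, here is a minimal sketch that flattens a nested JSON-like record
into path/value pairs. The record is a hand-written, hypothetical example
only loosely shaped like a FHIR Patient resource (it is not copied from the
HL7 sample), and the path notation is an arbitrary choice for illustration.

```python
from typing import Any, Dict


def flatten(value: Any, prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts/lists into {path: scalar} pairs."""
    flat: Dict[str, Any] = {}
    if isinstance(value, dict):
        for key, child in value.items():
            flat.update(flatten(child, f"{prefix}.{key}" if prefix else key))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            flat.update(flatten(child, f"{prefix}[{index}]"))
    else:
        flat[prefix] = value
    return flat


# Hypothetical record, loosely shaped like a FHIR Patient resource.
patient = {
    "resourceType": "Patient",
    "id": "example",
    "name": [{"family": "Doe", "given": ["Jane", "Q"]}],
    "active": True,
}

rows = flatten(patient)
for path, scalar in rows.items():
    print(path, "=", scalar)
```

Every repeated element turns into its own path, so the set of "columns"
differs from record to record, which is why a fixed flat schema is an
awkward fit for this kind of data.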

Hope it helps.

Re: Long text and complex data types support

Posted by Dmitry Degrave <dm...@gmail.com>.
> Dmitry, Would you be interested in writing up more details about how you
> are using Kudu in a blog post or even a mailing list email? This sounds
> super interesting.

I wrote about our Kudu cluster and ETL in the posts below; there are still
unexplained results with respect to resource utilization with multiple
tservers:

https://lists.apache.org/thread.html/2b94912dc4a251312000dbd6df2d31c43029723e16d50ffd6e510c90@%3Cuser.kudu.apache.org%3E

~dmitry

Re: Long text and complex data types support

Posted by Grant Henke <gh...@cloudera.com>.
Thanks for the information Dmitry and Mauricio!

> An example from genomics.

Dmitry, Would you be interested in writing up more details about how you
are using Kudu in a blog post or even a mailing list email? This sounds
super interesting.

> Supporting serialized objects (e.g. java's hashtables with
> capabilities to select only rows with hashtables containing some
> specific keys) would make Kudu super-special ;)

I agree supporting something like this would be very cool.

> Would be good if Kudu supported the way Impala can store and query nested
> data

Supporting Impala's syntax on Kudu tables with complex types is absolutely
a priority.

Thanks,
Grant

-- 
Grant Henke
Software Engineer | Cloudera
grant@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke

Re: Long text and complex data types support

Posted by Mauricio Aristizabal <ma...@impact.com>.
It would be good if Kudu supported nested data the way Impala can store and
query it in HDFS/Parquet, so that querying nested data would be (at least
mostly) transparent in either storage engine. We recently had a use for
this (basically storing N order item details along with each order record)
but decided against it because we know we'll be moving that table from
Parquet to Kudu soon.
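Without nested types, the usual alternative for the order example is a
parent table plus a child table with one row per item. A minimal sketch of
that split follows; the field names (order_id, items, item_no, sku, qty)
are hypothetical and not taken from any real schema.

```python
from typing import Any, Dict, List, Tuple


def explode_order(order: Dict[str, Any]) -> Tuple[Dict[str, Any], List[Dict[str, Any]]]:
    """Split one nested order into a parent row and one child row per item."""
    # Parent row: everything except the nested item list.
    parent = {k: v for k, v in order.items() if k != "items"}
    # Child rows: keyed by (order_id, item_no) so they can be joined back.
    children = [
        {"order_id": order["order_id"], "item_no": n, **item}
        for n, item in enumerate(order.get("items", []))
    ]
    return parent, children


# Hypothetical nested order record.
order = {
    "order_id": 42,
    "customer": "acme",
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
}

parent_row, item_rows = explode_order(order)
```

The cost is a join at query time, which is exactly what a nested column
would avoid.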

-- 
Mauricio Aristizabal
Architect - Data Pipeline
mauricio@impact.com | 323 309 4260
https://impact.com
<https://www.linkedin.com/company/impact-martech/>
<https://www.facebook.com/ImpactParTech/>
<https://twitter.com/impactpartech>
<https://www.youtube.com/c/impactmartech>



Re: Long text and complex data types support

Posted by Dmitry Degrave <dm...@gmail.com>.
Hi Grant,

An example from genomics. The current schema is simple [1] (denormalized
for performance), but it requires N = S * V rows in the genotype table (S
is the number of samples, V is the average number of variants in a sample;
a typical value for WGS is V = 5*10^6, and we deal with tens of thousands
of samples). A better schema would keep all variants of a sample in a
single row, which is impossible now.
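To put numbers on N = S * V, here is a small worked calculation. The
sample counts are illustrative picks for "tens of thousands", not figures
from an actual cluster.

```python
VARIANTS_PER_SAMPLE = 5 * 10**6  # V, the typical WGS value quoted above


def genotype_rows(samples: int, variants_per_sample: int = VARIANTS_PER_SAMPLE) -> int:
    """N = S * V rows in the one-row-per-(sample, variant) layout."""
    return samples * variants_per_sample


for samples in (10_000, 50_000):  # illustrative values of S
    print(f"S={samples:,}  N={genotype_rows(samples):,} rows")
```

With S = 10,000 that is already 50 billion rows, whereas a layout with all
variants of a sample in one row would need only S rows.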

Supporting nested data structures, e.g. similar to those implemented in
ClickHouse [2], would be useful too.

Supporting serialized objects (e.g. Java's hashtables, with the
capability to select only rows whose hashtables contain some
specific keys) would make Kudu super-special ;)

~dmitry

[1] https://gist.github.com/dnafault/e55ea987c55d2960c738d94e4811d043
[2] https://clickhouse-docs.readthedocs.io/en/latest/data_types/nested_data_structures/nested.html

Re: Long text and complex data types support

Posted by Grant Henke <gh...@cloudera.com>.
Hi Boris,

Can you describe in more detail what exactly you are looking for in a long
text type? Is there another database that has an equivalent type for
reference?

I have started looking at complex type support and plan to put up a design
document soon. No estimates exist yet on when it would be complete or how
much work is required. Do you have any sample schemas with complex types
you could send me to help inform designs and trade-offs?

Thank you,
Grant

-- 
Grant Henke
Software Engineer | Cloudera
grant@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke