You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@drill.apache.org by mi...@nomura.com on 2015/10/07 16:22:23 UTC

Querying k/v pairs in a CSV column

Hi,

If I have a separated file (PSV/TSV etc) with a number of columns, each with a single value aside from the last column which contains a bunch of key value pairs in a string e.g. var1=test var2=test var3=test. Is it possible to query based on individual key/value items in the last column? I am looking to querying nested data without using LIKE wildcards in where statements (I am not sure this is possible with separated text data).

Does anyone have any advice on whether this is possible?

Thanks,
Mike

This e-mail (including any attachments) is private and confidential, may contain proprietary or privileged information and is intended for the named recipient(s) only. Unintended recipients are strictly prohibited from taking action on the basis of information in this e-mail and must contact the sender immediately, delete this e-mail (and all attachments) and destroy any hard copies. Nomura will not accept responsibility or liability for the accuracy or completeness of, or the presence of any virus or disabling code in, this e-mail. If verification is sought please request a hard copy. Any reference to the terms of executed transactions should be treated as preliminary only and subject to formal written confirmation by Nomura. Nomura reserves the right to retain, monitor and intercept e-mail communications through its networks (subject to and in accordance with applicable laws). No confidentiality or privilege is waived or lost by Nomura by any mistransmission of this e-mail. Any reference to "Nomura" is a reference to any entity in the Nomura Holdings, Inc. group. Please read our Electronic Communications Legal Notice which forms part of this e-mail: http://www.Nomura.com/email_disclaimer.htm

Re: Querying k/v pairs in a CSV column

Posted by Jim Scott <js...@maprtech.com>.

Is this a case where kvgen could be potentially leveraged?

On Wed, Oct 7, 2015 at 9:55 AM, Tomer Shiran <ts...@dremio.com> wrote:

> I think it would be much better to leverage something like
> CONVERT_FROM(col, 'MyKeyValueFormat') where the entire key/value cell is
> parsed into Drill's internal structure just like we do with JSON. That
> exposes all those key/value pairs such that you can leverage them in other
> SQL operators.
>
> On Wed, Oct 7, 2015 at 7:32 AM, Tugdual Grall <tu...@gmail.com> wrote:
>
> > You can create a custom function to do that:
> > - https://drill.apache.org/docs/develop-custom-functions/
> >
> > this will end with something like:
> >  select *
> >  from *.csv
> >  where GET_VALUE(col[x] , 'var1' ) = 'value'
> >
> > (I will publish, in the next few day, a similar function parsing URL in
> > values in this repo :
> https://github.com/mapr-demos/simple-drill-functions
> > )
> >
> > t
> >
> > On Wed, Oct 7, 2015 at 4:22 PM, <mi...@nomura.com> wrote:
> >
> > > Hi,
> > >
> > > If I have a separated file (PSV/TSV etc) with a number of columns, each
> > > with a single value aside from the last column which contains a bunch
> of
> > > key value pairs in a string e.g. var1=test var2=test var3=test. Is it
> > > possible to query based on individual key/value items in the last
> > column? I
> > > am looking to querying nested data without using LIKE wildcards in
> where
> > > statements (I am not sure this is possible with separated text data).
> > >
> > > Does anyone have any advice on whether this is possible?
> > >
> > > Thanks,
> > > Mike
> > >
> > >
> > > This e-mail (including any attachments) is private and confidential,
> may
> > > contain proprietary or privileged information and is intended for the
> > named
> > > recipient(s) only. Unintended recipients are strictly prohibited from
> > > taking action on the basis of information in this e-mail and must
> contact
> > > the sender immediately, delete this e-mail (and all attachments) and
> > > destroy any hard copies. Nomura will not accept responsibility or
> > liability
> > > for the accuracy or completeness of, or the presence of any virus or
> > > disabling code in, this e-mail. If verification is sought please
> request
> > a
> > > hard copy. Any reference to the terms of executed transactions should
> be
> > > treated as preliminary only and subject to formal written confirmation
> by
> > > Nomura. Nomura reserves the right to retain, monitor and intercept
> e-mail
> > > communications through its networks (subject to and in accordance with
> > > applicable laws). No confidentiality or privilege is waived or lost by
> > > Nomura by any mistransmission of this e-mail. Any reference to "Nomura"
> > is
> > > a reference to any entity in the Nomura Holdings, Inc. group. Please
> read
> > > our Electronic Communications Legal Notice which forms part of this
> > e-mail:
> > > http://www.Nomura.com/email_disclaimer.htm
> > >
> > >
> >
>
>
>
> --
> Tomer Shiran
> CEO and Co-Founder, Dremio
>



-- 
*Jim Scott*
Director, Enterprise Strategy & Architecture
+1 (347) 746-9281
@kingmesal <https://twitter.com/kingmesal>

<http://www.mapr.com/>
[image: MapR Technologies] <http://www.mapr.com>

Now Available - Free Hadoop On-Demand Training
<http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available>

Re: Querying k/v pairs in a CSV column

Posted by Christopher Matta <cm...@mapr.com>.

An illustration of my previous email:

Original csv with a varN=value column:

0: jdbc:drill:> select * from `kv.csv`;
+-----------------------------------------------------------+
|                          columns                          |
+-----------------------------------------------------------+
| ["abc123","chris matta","var1=test var2=test var3=test"]  |
+-----------------------------------------------------------+
1 row selected (0.554 seconds)

Convert it to valid JSON with a creative set of regular expressions:

0: jdbc:drill:> select `columns`[0], CONVERT_FROM('{"' ||
regexp_replace(regexp_replace(`columns`[2], '\s', '","'), '=', '": "')
|| '"}', 'JSON') from `kv.csv`;
+---------+----------------------------------------------+
| EXPR$0  |                    EXPR$1                    |
+---------+----------------------------------------------+
| abc123  | {"var1":"test","var2":"test","var3":"test"}  |
+---------+----------------------------------------------+

Then you can use flatten with kvgen:

0: jdbc:drill:> select t.`id`, flatten(kvgen(t.`values`)) from (select
`columns`[0] `id`, CONVERT_FROM('{"' ||
regexp_replace(regexp_replace(`columns`[2], '\s', '","'), '=', '": "')
|| '"}', 'JSON') `values` from `kv.csv`) t;
+---------+--------------------------------+
|   id    |             EXPR$1             |
+---------+--------------------------------+
| abc123  | {"key":"var1","value":"test"}  |
| abc123  | {"key":"var2","value":"test"}  |
| abc123  | {"key":"var3","value":"test"}  |
+---------+--------------------------------+
3 rows selected (0.597 seconds)



Chris Matta
cmatta@mapr.com
215-701-3146

On Mon, Nov 23, 2015 at 10:09 PM, Christopher Matta <cm...@mapr.com> wrote:

> From a suggestion in a separate thread I did something like this to
> convert my delimited substring into a JSON array, this allowed me to use
> FLATTEN:
>
> 0: jdbc:drill:> select `columns`[0] `id`, `columns`[16] `classes` FROM `maprfs.cmatta`.`product.tsv` limit 10;
> +-------------------------------------------------+------------------------------------------------------------------------------------------------+
> |                       id                        |                                            classes                                             |
> +-------------------------------------------------+------------------------------------------------------------------------------------------------+
> | 0002-1200_88439a12-a98f-4551-a98f-85422ab2fbcc  | Radioactive Diagnostic Agent [EPC],Positron Emitting Activity [MoA]                            |
> | 0002-1407_e7af3676-cd9d-4c30-b127-9f6c46ff1589  | Antiarrhythmic [EPC],Cytochrome P450 2D6 Inhibitor [EPC],Cytochrome P450 2D6 Inhibitors [MoA]  |
> | 0002-1433_4137e5fc-0d0b-4171-842d-db80973d1a36  |                                                                                                |
> | 0002-1434_4137e5fc-0d0b-4171-842d-db80973d1a36  |                                                                                                |
> | 0002-1975_6bf7a335-bf6b-4f7f-b8cf-c7b6ee1ba089  | Androgen [EPC],Androgen Receptor Agonists [MoA],Androstanes [Chemical/Ingredient]              |
> | 0002-3004_77661866-2f7c-4b90-8866-dce57ebb87bb  | Serotonin Reuptake Inhibitor [EPC],Serotonin Uptake Inhibitors [MoA]                           |
> | 0002-3227_7ad983d7-6ac4-4560-a5a7-0c9151cd3831  | Norepinephrine Reuptake Inhibitor [EPC],Norepinephrine Uptake Inhibitors [MoA]                 |
> | 0002-3228_7ad983d7-6ac4-4560-a5a7-0c9151cd3831  | Norepinephrine Reuptake Inhibitor [EPC],Norepinephrine Uptake Inhibitors [MoA]                 |
> | 0002-3229_7ad983d7-6ac4-4560-a5a7-0c9151cd3831  | Norepinephrine Reuptake Inhibitor [EPC],Norepinephrine Uptake Inhibitors [MoA]                 |
> +-------------------------------------------------+------------------------------------------------------------------------------------------------+
> 10 rows selected (0.545 seconds)
> 0: jdbc:drill:> select `columns`[0] `id`, FLATTEN(CONVERT_FROM('["' || REGEXP_REPLACE(`columns`[16],',','","') || '"]','JSON')) `classes` FROM `maprfs.cmatta`.`product.tsv` limit 10;
> +-------------------------------------------------+---------------------------------------+
> |                       id                        |                classes                |
> +-------------------------------------------------+---------------------------------------+
> | 0002-1200_88439a12-a98f-4551-a98f-85422ab2fbcc  | Radioactive Diagnostic Agent [EPC]    |
> | 0002-1200_88439a12-a98f-4551-a98f-85422ab2fbcc  | Positron Emitting Activity [MoA]      |
> | 0002-1407_e7af3676-cd9d-4c30-b127-9f6c46ff1589  | Antiarrhythmic [EPC]                  |
> | 0002-1407_e7af3676-cd9d-4c30-b127-9f6c46ff1589  | Cytochrome P450 2D6 Inhibitor [EPC]   |
> | 0002-1407_e7af3676-cd9d-4c30-b127-9f6c46ff1589  | Cytochrome P450 2D6 Inhibitors [MoA]  |
> | 0002-1433_4137e5fc-0d0b-4171-842d-db80973d1a36  |                                       |
> | 0002-1434_4137e5fc-0d0b-4171-842d-db80973d1a36  |                                       |
> | 0002-1975_6bf7a335-bf6b-4f7f-b8cf-c7b6ee1ba089  | Androgen [EPC]                        |
> | 0002-1975_6bf7a335-bf6b-4f7f-b8cf-c7b6ee1ba089  | Androgen Receptor Agonists [MoA]      |
> +-------------------------------------------------+---------------------------------------+
> 10 rows selected (0.876 seconds)
> 0: jdbc:drill:>
>
> If you could come up with a regexp and string formatting to turn your var1=test
> var2=test var3=test column into valid JSON you could use CONVERT_FROM as
> Tomer suggested.
> 
>
> Chris Matta
> cmatta@mapr.com
> 215-701-3146
>
> On Sat, Nov 21, 2015 at 6:38 AM, Christopher Matta <cm...@mapr.com>
> wrote:
>
>> Along the same lines I have a data set that's delimited by pipe with the
>> last column further delimited by commas. It would be great to be able to
>> flatten that last column.
>>
>>
>> On Wednesday, October 7, 2015, Tugdual Grall <tu...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Can you clarify how to use our own format in CONVERT_FROM?
>>>
>>> 1- is it supported out of the box ? (cannot find that in the doc -
>>>
>>> https://drill.apache.org/docs/supported-data-types/#data-types-for-convert_to-and-convert_from-functions
>>> - may be not looking at the proper place.
>>>
>>> 2- it is another extension point? so developer can create their own
>>> parser
>>> and deploy it?
>>>
>>>
>>>
>>> Thanks
>>>
>>> On Wed, Oct 7, 2015 at 4:55 PM, Tomer Shiran <ts...@dremio.com> wrote:
>>>
>>> > I think it would be much better to leverage something like
>>> > CONVERT_FROM(col, 'MyKeyValueFormat') where the entire key/value cell
>>> is
>>> > parsed into Drill's internal structure just like we do with JSON. That
>>> > exposes all those key/value pairs such that you can leverage them in
>>> other
>>> > SQL operators.
>>> >
>>> > On Wed, Oct 7, 2015 at 7:32 AM, Tugdual Grall <tu...@gmail.com>
>>> wrote:
>>> >
>>> >> You can create a custom function to do that:
>>> >> - https://drill.apache.org/docs/develop-custom-functions/
>>> >>
>>> >> this will end with something like:
>>> >>  select *
>>> >>  from *.csv
>>> >>  where GET_VALUE(col[x] , 'var1' ) = 'value'
>>> >>
>>> >> (I will publish, in the next few day, a similar function parsing URL
>>> in
>>> >> values in this repo :
>>> >> https://github.com/mapr-demos/simple-drill-functions )
>>> >>
>>> >> t
>>> >>
>>> >> On Wed, Oct 7, 2015 at 4:22 PM, <mi...@nomura.com> wrote:
>>> >>
>>> >> > Hi,
>>> >> >
>>> >> > If I have a separated file (PSV/TSV etc) with a number of columns,
>>> each
>>> >> > with a single value aside from the last column which contains a
>>> bunch of
>>> >> > key value pairs in a string e.g. var1=test var2=test var3=test. Is
>>> it
>>> >> > possible to query based on individual key/value items in the last
>>> >> column? I
>>> >> > am looking to querying nested data without using LIKE wildcards in
>>> where
>>> >> > statements (I am not sure this is possible with separated text
>>> data).
>>> >> >
>>> >> > Does anyone have any advice on whether this is possible?
>>> >> >
>>> >> > Thanks,
>>> >> > Mike
>>> >> >
>>> >> >
>>> >> > This e-mail (including any attachments) is private and
>>> confidential, may
>>> >> > contain proprietary or privileged information and is intended for
>>> the
>>> >> named
>>> >> > recipient(s) only. Unintended recipients are strictly prohibited
>>> from
>>> >> > taking action on the basis of information in this e-mail and must
>>> >> contact
>>> >> > the sender immediately, delete this e-mail (and all attachments) and
>>> >> > destroy any hard copies. Nomura will not accept responsibility or
>>> >> liability
>>> >> > for the accuracy or completeness of, or the presence of any virus or
>>> >> > disabling code in, this e-mail. If verification is sought please
>>> >> request a
>>> >> > hard copy. Any reference to the terms of executed transactions
>>> should be
>>> >> > treated as preliminary only and subject to formal written
>>> confirmation
>>> >> by
>>> >> > Nomura. Nomura reserves the right to retain, monitor and intercept
>>> >> e-mail
>>> >> > communications through its networks (subject to and in accordance
>>> with
>>> >> > applicable laws). No confidentiality or privilege is waived or lost
>>> by
>>> >> > Nomura by any mistransmission of this e-mail. Any reference to
>>> "Nomura"
>>> >> is
>>> >> > a reference to any entity in the Nomura Holdings, Inc. group. Please
>>> >> read
>>> >> > our Electronic Communications Legal Notice which forms part of this
>>> >> e-mail:
>>> >> > http://www.Nomura.com/email_disclaimer.htm
>>> >> >
>>> >> >
>>> >>
>>> >
>>> >
>>> >
>>> > --
>>> > Tomer Shiran
>>> > CEO and Co-Founder, Dremio
>>> >
>>>
>>
>>
>> --
>> Chris Matta
>> cmatta@mapr.com
>> 215-701-3146
>>
>>
>

Re: Querying k/v pairs in a CSV column

Posted by Christopher Matta <cm...@mapr.com>.

>From a suggestion in a separate thread I did something like this to convert
my delimited substring into a JSON array, this allowed me to use FLATTEN:

0: jdbc:drill:> select `columns`[0] `id`, `columns`[16] `classes` FROM
`maprfs.cmatta`.`product.tsv` limit 10;
+-------------------------------------------------+------------------------------------------------------------------------------------------------+
|                       id                        |
                        classes
     |
+-------------------------------------------------+------------------------------------------------------------------------------------------------+
| 0002-1200_88439a12-a98f-4551-a98f-85422ab2fbcc  | Radioactive
Diagnostic Agent [EPC],Positron Emitting Activity [MoA]
            |
| 0002-1407_e7af3676-cd9d-4c30-b127-9f6c46ff1589  | Antiarrhythmic
[EPC],Cytochrome P450 2D6 Inhibitor [EPC],Cytochrome P450 2D6
Inhibitors [MoA]  |
| 0002-1433_4137e5fc-0d0b-4171-842d-db80973d1a36  |

     |
| 0002-1434_4137e5fc-0d0b-4171-842d-db80973d1a36  |

     |
| 0002-1975_6bf7a335-bf6b-4f7f-b8cf-c7b6ee1ba089  | Androgen
[EPC],Androgen Receptor Agonists [MoA],Androstanes
[Chemical/Ingredient]              |
| 0002-3004_77661866-2f7c-4b90-8866-dce57ebb87bb  | Serotonin Reuptake
Inhibitor [EPC],Serotonin Uptake Inhibitors [MoA]
     |
| 0002-3227_7ad983d7-6ac4-4560-a5a7-0c9151cd3831  | Norepinephrine
Reuptake Inhibitor [EPC],Norepinephrine Uptake Inhibitors [MoA]
         |
| 0002-3228_7ad983d7-6ac4-4560-a5a7-0c9151cd3831  | Norepinephrine
Reuptake Inhibitor [EPC],Norepinephrine Uptake Inhibitors [MoA]
         |
| 0002-3229_7ad983d7-6ac4-4560-a5a7-0c9151cd3831  | Norepinephrine
Reuptake Inhibitor [EPC],Norepinephrine Uptake Inhibitors [MoA]
         |
+-------------------------------------------------+------------------------------------------------------------------------------------------------+
10 rows selected (0.545 seconds)
0: jdbc:drill:> select `columns`[0] `id`, FLATTEN(CONVERT_FROM('["' ||
REGEXP_REPLACE(`columns`[16],',','","') || '"]','JSON')) `classes`
FROM `maprfs.cmatta`.`product.tsv` limit 10;
+-------------------------------------------------+---------------------------------------+
|                       id                        |
classes                |
+-------------------------------------------------+---------------------------------------+
| 0002-1200_88439a12-a98f-4551-a98f-85422ab2fbcc  | Radioactive
Diagnostic Agent [EPC]    |
| 0002-1200_88439a12-a98f-4551-a98f-85422ab2fbcc  | Positron Emitting
Activity [MoA]      |
| 0002-1407_e7af3676-cd9d-4c30-b127-9f6c46ff1589  | Antiarrhythmic
[EPC]                  |
| 0002-1407_e7af3676-cd9d-4c30-b127-9f6c46ff1589  | Cytochrome P450
2D6 Inhibitor [EPC]   |
| 0002-1407_e7af3676-cd9d-4c30-b127-9f6c46ff1589  | Cytochrome P450
2D6 Inhibitors [MoA]  |
| 0002-1433_4137e5fc-0d0b-4171-842d-db80973d1a36  |
                   |
| 0002-1434_4137e5fc-0d0b-4171-842d-db80973d1a36  |
                   |
| 0002-1975_6bf7a335-bf6b-4f7f-b8cf-c7b6ee1ba089  | Androgen [EPC]
                   |
| 0002-1975_6bf7a335-bf6b-4f7f-b8cf-c7b6ee1ba089  | Androgen Receptor
Agonists [MoA]      |
+-------------------------------------------------+---------------------------------------+
10 rows selected (0.876 seconds)
0: jdbc:drill:>

If you could come up with a regexp and string formatting to turn your var1=test
var2=test var3=test column into valid JSON you could use CONVERT_FROM as
Tomer suggested.


Chris Matta
cmatta@mapr.com
215-701-3146

On Sat, Nov 21, 2015 at 6:38 AM, Christopher Matta <cm...@mapr.com> wrote:

> Along the same lines I have a data set that's delimited by pipe with the
> last column further delimited by commas. It would be great to be able to
> flatten that last column.
>
>
> On Wednesday, October 7, 2015, Tugdual Grall <tu...@gmail.com> wrote:
>
>> Hi,
>>
>> Can you clarify how to use our own format in CONVERT_FROM?
>>
>> 1- is it supported out of the box ? (cannot find that in the doc -
>>
>> https://drill.apache.org/docs/supported-data-types/#data-types-for-convert_to-and-convert_from-functions
>> - may be not looking at the proper place.
>>
>> 2- it is another extension point? so developer can create their own parser
>> and deploy it?
>>
>>
>>
>> Thanks
>>
>> On Wed, Oct 7, 2015 at 4:55 PM, Tomer Shiran <ts...@dremio.com> wrote:
>>
>> > I think it would be much better to leverage something like
>> > CONVERT_FROM(col, 'MyKeyValueFormat') where the entire key/value cell is
>> > parsed into Drill's internal structure just like we do with JSON. That
>> > exposes all those key/value pairs such that you can leverage them in
>> other
>> > SQL operators.
>> >
>> > On Wed, Oct 7, 2015 at 7:32 AM, Tugdual Grall <tu...@gmail.com>
>> wrote:
>> >
>> >> You can create a custom function to do that:
>> >> - https://drill.apache.org/docs/develop-custom-functions/
>> >>
>> >> this will end with something like:
>> >>  select *
>> >>  from *.csv
>> >>  where GET_VALUE(col[x] , 'var1' ) = 'value'
>> >>
>> >> (I will publish, in the next few day, a similar function parsing URL in
>> >> values in this repo :
>> >> https://github.com/mapr-demos/simple-drill-functions )
>> >>
>> >> t
>> >>
>> >> On Wed, Oct 7, 2015 at 4:22 PM, <mi...@nomura.com> wrote:
>> >>
>> >> > Hi,
>> >> >
>> >> > If I have a separated file (PSV/TSV etc) with a number of columns,
>> each
>> >> > with a single value aside from the last column which contains a
>> bunch of
>> >> > key value pairs in a string e.g. var1=test var2=test var3=test. Is it
>> >> > possible to query based on individual key/value items in the last
>> >> column? I
>> >> > am looking to querying nested data without using LIKE wildcards in
>> where
>> >> > statements (I am not sure this is possible with separated text data).
>> >> >
>> >> > Does anyone have any advice on whether this is possible?
>> >> >
>> >> > Thanks,
>> >> > Mike
>> >> >
>> >> >
>> >> > This e-mail (including any attachments) is private and confidential,
>> may
>> >> > contain proprietary or privileged information and is intended for the
>> >> named
>> >> > recipient(s) only. Unintended recipients are strictly prohibited from
>> >> > taking action on the basis of information in this e-mail and must
>> >> contact
>> >> > the sender immediately, delete this e-mail (and all attachments) and
>> >> > destroy any hard copies. Nomura will not accept responsibility or
>> >> liability
>> >> > for the accuracy or completeness of, or the presence of any virus or
>> >> > disabling code in, this e-mail. If verification is sought please
>> >> request a
>> >> > hard copy. Any reference to the terms of executed transactions
>> should be
>> >> > treated as preliminary only and subject to formal written
>> confirmation
>> >> by
>> >> > Nomura. Nomura reserves the right to retain, monitor and intercept
>> >> e-mail
>> >> > communications through its networks (subject to and in accordance
>> with
>> >> > applicable laws). No confidentiality or privilege is waived or lost
>> by
>> >> > Nomura by any mistransmission of this e-mail. Any reference to
>> "Nomura"
>> >> is
>> >> > a reference to any entity in the Nomura Holdings, Inc. group. Please
>> >> read
>> >> > our Electronic Communications Legal Notice which forms part of this
>> >> e-mail:
>> >> > http://www.Nomura.com/email_disclaimer.htm
>> >> >
>> >> >
>> >>
>> >
>> >
>> >
>> > --
>> > Tomer Shiran
>> > CEO and Co-Founder, Dremio
>> >
>>
>
>
> --
> Chris Matta
> cmatta@mapr.com
> 215-701-3146
>
>

Re: Querying k/v pairs in a CSV column

Posted by Christopher Matta <cm...@mapr.com>.

Along the same lines I have a data set that's delimited by pipe with the
last column further delimited by commas. It would be great to be able to
flatten that last column.

On Wednesday, October 7, 2015, Tugdual Grall <tu...@gmail.com> wrote:

> Hi,
>
> Can you clarify how to use our own format in CONVERT_FROM?
>
> 1- is it supported out of the box ? (cannot find that in the doc -
>
> https://drill.apache.org/docs/supported-data-types/#data-types-for-convert_to-and-convert_from-functions
> - may be not looking at the proper place.
>
> 2- it is another extension point? so developer can create their own parser
> and deploy it?
>
>
>
> Thanks
>
> On Wed, Oct 7, 2015 at 4:55 PM, Tomer Shiran <tshiran@dremio.com
> <javascript:;>> wrote:
>
> > I think it would be much better to leverage something like
> > CONVERT_FROM(col, 'MyKeyValueFormat') where the entire key/value cell is
> > parsed into Drill's internal structure just like we do with JSON. That
> > exposes all those key/value pairs such that you can leverage them in
> other
> > SQL operators.
> >
> > On Wed, Oct 7, 2015 at 7:32 AM, Tugdual Grall <tugdual@gmail.com
> <javascript:;>> wrote:
> >
> >> You can create a custom function to do that:
> >> - https://drill.apache.org/docs/develop-custom-functions/
> >>
> >> this will end with something like:
> >>  select *
> >>  from *.csv
> >>  where GET_VALUE(col[x] , 'var1' ) = 'value'
> >>
> >> (I will publish, in the next few day, a similar function parsing URL in
> >> values in this repo :
> >> https://github.com/mapr-demos/simple-drill-functions )
> >>
> >> t
> >>
> >> On Wed, Oct 7, 2015 at 4:22 PM, <michael.england@nomura.com
> <javascript:;>> wrote:
> >>
> >> > Hi,
> >> >
> >> > If I have a separated file (PSV/TSV etc) with a number of columns,
> each
> >> > with a single value aside from the last column which contains a bunch
> of
> >> > key value pairs in a string e.g. var1=test var2=test var3=test. Is it
> >> > possible to query based on individual key/value items in the last
> >> column? I
> >> > am looking to querying nested data without using LIKE wildcards in
> where
> >> > statements (I am not sure this is possible with separated text data).
> >> >
> >> > Does anyone have any advice on whether this is possible?
> >> >
> >> > Thanks,
> >> > Mike
> >> >
> >> >
> >> > This e-mail (including any attachments) is private and confidential,
> may
> >> > contain proprietary or privileged information and is intended for the
> >> named
> >> > recipient(s) only. Unintended recipients are strictly prohibited from
> >> > taking action on the basis of information in this e-mail and must
> >> contact
> >> > the sender immediately, delete this e-mail (and all attachments) and
> >> > destroy any hard copies. Nomura will not accept responsibility or
> >> liability
> >> > for the accuracy or completeness of, or the presence of any virus or
> >> > disabling code in, this e-mail. If verification is sought please
> >> request a
> >> > hard copy. Any reference to the terms of executed transactions should
> be
> >> > treated as preliminary only and subject to formal written confirmation
> >> by
> >> > Nomura. Nomura reserves the right to retain, monitor and intercept
> >> e-mail
> >> > communications through its networks (subject to and in accordance with
> >> > applicable laws). No confidentiality or privilege is waived or lost by
> >> > Nomura by any mistransmission of this e-mail. Any reference to
> "Nomura"
> >> is
> >> > a reference to any entity in the Nomura Holdings, Inc. group. Please
> >> read
> >> > our Electronic Communications Legal Notice which forms part of this
> >> e-mail:
> >> > http://www.Nomura.com/email_disclaimer.htm
> >> >
> >> >
> >>
> >
> >
> >
> > --
> > Tomer Shiran
> > CEO and Co-Founder, Dremio
> >
>


-- 
Chris Matta
cmatta@mapr.com
215-701-3146

Re: Querying k/v pairs in a CSV column

Posted by Tugdual Grall <tu...@gmail.com>.

Hi,

Can you clarify how to use our own format in CONVERT_FROM?

1- is it supported out of the box ? (cannot find that in the doc -
https://drill.apache.org/docs/supported-data-types/#data-types-for-convert_to-and-convert_from-functions
- may be not looking at the proper place.

2- it is another extension point? so developer can create their own parser
and deploy it?



Thanks

On Wed, Oct 7, 2015 at 4:55 PM, Tomer Shiran <ts...@dremio.com> wrote:

> I think it would be much better to leverage something like
> CONVERT_FROM(col, 'MyKeyValueFormat') where the entire key/value cell is
> parsed into Drill's internal structure just like we do with JSON. That
> exposes all those key/value pairs such that you can leverage them in other
> SQL operators.
>
> On Wed, Oct 7, 2015 at 7:32 AM, Tugdual Grall <tu...@gmail.com> wrote:
>
>> You can create a custom function to do that:
>> - https://drill.apache.org/docs/develop-custom-functions/
>>
>> this will end with something like:
>>  select *
>>  from *.csv
>>  where GET_VALUE(col[x] , 'var1' ) = 'value'
>>
>> (I will publish, in the next few day, a similar function parsing URL in
>> values in this repo :
>> https://github.com/mapr-demos/simple-drill-functions )
>>
>> t
>>
>> On Wed, Oct 7, 2015 at 4:22 PM, <mi...@nomura.com> wrote:
>>
>> > Hi,
>> >
>> > If I have a separated file (PSV/TSV etc) with a number of columns, each
>> > with a single value aside from the last column which contains a bunch of
>> > key value pairs in a string e.g. var1=test var2=test var3=test. Is it
>> > possible to query based on individual key/value items in the last
>> column? I
>> > am looking to querying nested data without using LIKE wildcards in where
>> > statements (I am not sure this is possible with separated text data).
>> >
>> > Does anyone have any advice on whether this is possible?
>> >
>> > Thanks,
>> > Mike
>> >
>> >
>> > This e-mail (including any attachments) is private and confidential, may
>> > contain proprietary or privileged information and is intended for the
>> named
>> > recipient(s) only. Unintended recipients are strictly prohibited from
>> > taking action on the basis of information in this e-mail and must
>> contact
>> > the sender immediately, delete this e-mail (and all attachments) and
>> > destroy any hard copies. Nomura will not accept responsibility or
>> liability
>> > for the accuracy or completeness of, or the presence of any virus or
>> > disabling code in, this e-mail. If verification is sought please
>> request a
>> > hard copy. Any reference to the terms of executed transactions should be
>> > treated as preliminary only and subject to formal written confirmation
>> by
>> > Nomura. Nomura reserves the right to retain, monitor and intercept
>> e-mail
>> > communications through its networks (subject to and in accordance with
>> > applicable laws). No confidentiality or privilege is waived or lost by
>> > Nomura by any mistransmission of this e-mail. Any reference to "Nomura"
>> is
>> > a reference to any entity in the Nomura Holdings, Inc. group. Please
>> read
>> > our Electronic Communications Legal Notice which forms part of this
>> e-mail:
>> > http://www.Nomura.com/email_disclaimer.htm
>> >
>> >
>>
>
>
>
> --
> Tomer Shiran
> CEO and Co-Founder, Dremio
>

Re: Querying k/v pairs in a CSV column

Posted by Tomer Shiran <ts...@dremio.com>.

I think it would be much better to leverage something like
CONVERT_FROM(col, 'MyKeyValueFormat') where the entire key/value cell is
parsed into Drill's internal structure just like we do with JSON. That
exposes all those key/value pairs such that you can leverage them in other
SQL operators.

On Wed, Oct 7, 2015 at 7:32 AM, Tugdual Grall <tu...@gmail.com> wrote:

> You can create a custom function to do that:
> - https://drill.apache.org/docs/develop-custom-functions/
>
> this will end with something like:
>  select *
>  from *.csv
>  where GET_VALUE(col[x] , 'var1' ) = 'value'
>
> (I will publish, in the next few day, a similar function parsing URL in
> values in this repo : https://github.com/mapr-demos/simple-drill-functions
> )
>
> t
>
> On Wed, Oct 7, 2015 at 4:22 PM, <mi...@nomura.com> wrote:
>
> > Hi,
> >
> > If I have a separated file (PSV/TSV etc) with a number of columns, each
> > with a single value aside from the last column which contains a bunch of
> > key value pairs in a string e.g. var1=test var2=test var3=test. Is it
> > possible to query based on individual key/value items in the last
> column? I
> > am looking to querying nested data without using LIKE wildcards in where
> > statements (I am not sure this is possible with separated text data).
> >
> > Does anyone have any advice on whether this is possible?
> >
> > Thanks,
> > Mike
> >
> >
> > This e-mail (including any attachments) is private and confidential, may
> > contain proprietary or privileged information and is intended for the
> named
> > recipient(s) only. Unintended recipients are strictly prohibited from
> > taking action on the basis of information in this e-mail and must contact
> > the sender immediately, delete this e-mail (and all attachments) and
> > destroy any hard copies. Nomura will not accept responsibility or
> liability
> > for the accuracy or completeness of, or the presence of any virus or
> > disabling code in, this e-mail. If verification is sought please request
> a
> > hard copy. Any reference to the terms of executed transactions should be
> > treated as preliminary only and subject to formal written confirmation by
> > Nomura. Nomura reserves the right to retain, monitor and intercept e-mail
> > communications through its networks (subject to and in accordance with
> > applicable laws). No confidentiality or privilege is waived or lost by
> > Nomura by any mistransmission of this e-mail. Any reference to "Nomura"
> is
> > a reference to any entity in the Nomura Holdings, Inc. group. Please read
> > our Electronic Communications Legal Notice which forms part of this
> e-mail:
> > http://www.Nomura.com/email_disclaimer.htm
> >
> >
>



-- 
Tomer Shiran
CEO and Co-Founder, Dremio

Re: Querying k/v pairs in a CSV column

Posted by Tugdual Grall <tu...@gmail.com>.

You can create a custom function to do that:
- https://drill.apache.org/docs/develop-custom-functions/

this will end with something like:
 select *
 from *.csv
 where GET_VALUE(col[x] , 'var1' ) = 'value'

(I will publish, in the next few day, a similar function parsing URL in
values in this repo : https://github.com/mapr-demos/simple-drill-functions )

t

On Wed, Oct 7, 2015 at 4:22 PM, <mi...@nomura.com> wrote:

> Hi,
>
> If I have a separated file (PSV/TSV etc) with a number of columns, each
> with a single value aside from the last column which contains a bunch of
> key value pairs in a string e.g. var1=test var2=test var3=test. Is it
> possible to query based on individual key/value items in the last column? I
> am looking to querying nested data without using LIKE wildcards in where
> statements (I am not sure this is possible with separated text data).
>
> Does anyone have any advice on whether this is possible?
>
> Thanks,
> Mike
>
>
> This e-mail (including any attachments) is private and confidential, may
> contain proprietary or privileged information and is intended for the named
> recipient(s) only. Unintended recipients are strictly prohibited from
> taking action on the basis of information in this e-mail and must contact
> the sender immediately, delete this e-mail (and all attachments) and
> destroy any hard copies. Nomura will not accept responsibility or liability
> for the accuracy or completeness of, or the presence of any virus or
> disabling code in, this e-mail. If verification is sought please request a
> hard copy. Any reference to the terms of executed transactions should be
> treated as preliminary only and subject to formal written confirmation by
> Nomura. Nomura reserves the right to retain, monitor and intercept e-mail
> communications through its networks (subject to and in accordance with
> applicable laws). No confidentiality or privilege is waived or lost by
> Nomura by any mistransmission of this e-mail. Any reference to "Nomura" is
> a reference to any entity in the Nomura Holdings, Inc. group. Please read
> our Electronic Communications Legal Notice which forms part of this e-mail:
> http://www.Nomura.com/email_disclaimer.htm
>
>