You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Mahender Sarangam <Ma...@outlook.com> on 2016/06/09 16:13:17 UTC

Get 100 items in Comma Separated strings from Hive Column.

Hi,

We have hive table which has a single column with more than 1000 comma separated string items.  Is there a way to retrieve only 100 string items from that Column. Also we need to capture number of comma separated string items. We are looking for more of   "substring_index" functionality, since we are using Hive 1.2 version, we couldn't find "substring_index" UDF function, Is there a way to achieve the same functionality with  "regexp_extract" and I also see there is UDF available not sure whether this helps us achieving same functionality. https://github.com/brndnmtthws/facebook-hive-udfs/blob/master/src/main/java/com/facebook/hive/udf/UDFRegexpExtractAll.java

Scenario : Table1 (Source Table)

RowID stringColumn

1 1,2,3,4...10000

2 2,4,5,8,4

3 10,11,98,100

Now i Would like to show table result structure like below

Row ID 100String count

1 1,2,3...100 10000

2 2,4,5,8,4 5

RE: Get 100 items in Comma Separated strings from Hive Column.

Posted by "Markovitz, Dudu" <dm...@paypal.com>.
You are welcomed

Dudu

From: Mahender Sarangam [mailto:Mahender.BigData@outlook.com]
Sent: Friday, June 10, 2016 8:55 PM
To: user@hive.apache.org
Subject: Re: Get 100 items in Comma Separated strings from Hive Column.


Thanks Dudu. This is wonderful explaination. I'm very thankful

On 6/10/2016 7:24 AM, Markovitz, Dudu wrote:
regexp_extract ('(,?[^,]*){0,10}',0)

(...){0,10}

The expression surrounded by brackets repeats 0 to 10 times.


(,?[…]*)

Optional comma followed by sequence (0 or more) of characters


[^,]

Any character which is not comma


regexp_extract (...,0)

0 stands for the whole expression
1 stands for the 1st expression which is surrounded by brackets (ordered by the opening brackets)
2 stands for the 2nd expression which is surrounded by brackets (ordered by the opening brackets)
3 stands for the 3rd expression which is surrounded by brackets (ordered by the opening brackets)
Etc.



regexp_replace (((,?[^,]*){0,10}).*','$1')

Similar to regexp_extract but this time we’re not extracting the first 10 tokens but replacing the whole expression with the first 10 tokens.
The expression that stands for the first 10 tokens is identical to the one we used in regexp_extract
.* stands for any character that repeats 0 or more times which represent anything following the first 10 tokens
$1 stands for the 1st expression which is surrounded by brackets (ordered by the opening brackets)


From: Mahender Sarangam [mailto:Mahender.BigData@outlook.com]
Sent: Friday, June 10, 2016 2:54 PM
To: user@hive.apache.org<ma...@hive.apache.org>
Subject: Re: Get 100 items in Comma Separated strings from Hive Column.


Thanks Dudu. I will check. Can you please throw some light on regexp_replace (((,?[^,]*){0,10}).*','$1')  regexp_extract ('(,?[^,]*){0,10}',0),

On 6/9/2016 11:33 PM, Markovitz, Dudu wrote:
+ Improvement

The “Count” can be done in a cleaner way
(The previous way works also with simple ‘replace’)

hive> select RowID,length(regexp_replace(stringColumn,'[^,]',''))+1 as count from t;

1              2
2              5
3              24
4              17
5              8
6              11
7              26
8              18
9              9


From: Markovitz, Dudu [mailto:dmarkovitz@paypal.com]
Sent: Thursday, June 09, 2016 11:30 PM
To: user@hive.apache.org<ma...@hive.apache.org>
Subject: RE: Get 100 items in Comma Separated strings from Hive Column.

----------------------------------------------------------------------------------------------------
--  bash
----------------------------------------------------------------------------------------------------
mkdir t

cat>t/data.txt
1|44,85
2|56,37,83,68,43
3|33,48,42,18,23,80,31,86,48,42,37,52,99,55,93,1,63,67,32,75,44,57,70,2
4|77,26,95,53,11,99,74,82,7,55,75,6,32,87,75,99,80
5|48,78,39,62,16,44,43,63
6|35,97,99,19,22,50,29,84,82,25,77
7|80,43,82,94,81,58,70,8,70,6,62,100,60,84,55,24,100,75,84,15,53,5,19,45,61,73
8|66,44,66,4,80,72,81,63,51,24,51,77,87,85,10,36,43,2
9|39,64,29,14,9,42,66,56,33

hdfs dfs -put t /tmp

----------------------------------------------------------------------------------------------------
--  hive
----------------------------------------------------------------------------------------------------

create external table t
(
    RowID           int
   ,stringColumn    string
)
    row format delimited
    fields terminated by '|'
    location '/tmp/t'
;

select RowID,regexp_extract (stringColumn,'(,?[^,]*){0,10}',0) as string10,length(stringColumn)-length(regexp_replace(stringColumn,',',''))+1 as count from t;

1    44,85 2
2    56,37,83,68,43  5
3    33,48,42,18,23,80,31,86,48,42   24
4    77,26,95,53,11,99,74,82,7,55    17
5    48,78,39,62,16,44,43,63    8
6    35,97,99,19,22,50,29,84,82,25   11
7    80,43,82,94,81,58,70,8,70,6     26
8    66,44,66,4,80,72,81,63,51,24    18
9    39,64,29,14,9,42,66,56,33  9


Extracting the first 100 (10 in my example) tokens can be done with regexp_extract or regexp_replace

hive> select regexp_extract ('1,2,3,4,5,6,7,8,9,10,11,12,13,14,15','(,?[^,]*){0,10}',0);

1,2,3,4,5,6,7,8,9,10

hive> select regexp_replace ('1,2,3,4,5,6,7,8,9,10,11,12,13,14,15','((,?[^,]*){0,10}).*','$1');

1,2,3,4,5,6,7,8,9,10


From: Mahender Sarangam [mailto:Mahender.BigData@outlook.com]
Sent: Thursday, June 09, 2016 7:13 PM
To: user@hive.apache.org<ma...@hive.apache.org>
Subject: Get 100 items in Comma Separated strings from Hive Column.


Hi,

We have hive table which has a single column with more than 1000 comma separated string items.  Is there a way to retrieve only 100 string items from that Column. Also we need to capture number of comma separated string items. We are looking for more of   "substring_index" functionality, since we are using Hive 1.2 version, we couldn't find "substring_index" UDF function, Is there a way to achieve the same functionality with  "regexp_extract" and I also see there is UDF available not sure whether this helps us achieving same functionality. https://github.com/brndnmtthws/facebook-hive-udfs/blob/master/src/main/java/com/facebook/hive/udf/UDFRegexpExtractAll.java

Scenario : Table1 (Source Table)

RowID stringColumn

1 1,2,3,4...10000

2 2,4,5,8,4

3 10,11,98,100

Now i Would like to show table result structure like below

Row ID 100String count

1 1,2,3...100 10000

2 2,4,5,8,4 5



Re: Get 100 items in Comma Separated strings from Hive Column.

Posted by Mahender Sarangam <Ma...@outlook.com>.
Thanks Dudu. This is wonderful explaination. I'm very thankful

On 6/10/2016 7:24 AM, Markovitz, Dudu wrote:
regexp_extract ('(,?[^,]*){0,10}',0)

(...){0,10}

The expression surrounded by brackets repeats 0 to 10 times.


(,?[…]*)

Optional comma followed by sequence (0 or more) of characters


[^,]

Any character which is not comma


regexp_extract (...,0)

0 stands for the whole expression
1 stands for the 1st expression which is surrounded by brackets (ordered by the opening brackets)
2 stands for the 2nd expression which is surrounded by brackets (ordered by the opening brackets)
3 stands for the 3rd expression which is surrounded by brackets (ordered by the opening brackets)
Etc.



regexp_replace (((,?[^,]*){0,10}).*','$1')

Similar to regexp_extract but this time we’re not extracting the first 10 tokens but replacing the whole expression with the first 10 tokens.
The expression that stands for the first 10 tokens is identical to the one we used in regexp_extract
.* stands for any character that repeats 0 or more times which represent anything following the first 10 tokens
$1 stands for the 1st expression which is surrounded by brackets (ordered by the opening brackets)


From: Mahender Sarangam [mailto:Mahender.BigData@outlook.com]
Sent: Friday, June 10, 2016 2:54 PM
To: user@hive.apache.org<ma...@hive.apache.org>
Subject: Re: Get 100 items in Comma Separated strings from Hive Column.


Thanks Dudu. I will check. Can you please throw some light on regexp_replace (((,?[^,]*){0,10}).*','$1')  regexp_extract ('(,?[^,]*){0,10}',0),

On 6/9/2016 11:33 PM, Markovitz, Dudu wrote:
+ Improvement

The “Count” can be done in a cleaner way
(The previous way works also with simple ‘replace’)

hive> select RowID,length(regexp_replace(stringColumn,'[^,]',''))+1 as count from t;

1              2
2              5
3              24
4              17
5              8
6              11
7              26
8              18
9              9


From: Markovitz, Dudu [mailto:dmarkovitz@paypal.com]
Sent: Thursday, June 09, 2016 11:30 PM
To: user@hive.apache.org<ma...@hive.apache.org>
Subject: RE: Get 100 items in Comma Separated strings from Hive Column.

----------------------------------------------------------------------------------------------------
--  bash
----------------------------------------------------------------------------------------------------
mkdir t

cat>t/data.txt
1|44,85
2|56,37,83,68,43
3|33,48,42,18,23,80,31,86,48,42,37,52,99,55,93,1,63,67,32,75,44,57,70,2
4|77,26,95,53,11,99,74,82,7,55,75,6,32,87,75,99,80
5|48,78,39,62,16,44,43,63
6|35,97,99,19,22,50,29,84,82,25,77
7|80,43,82,94,81,58,70,8,70,6,62,100,60,84,55,24,100,75,84,15,53,5,19,45,61,73
8|66,44,66,4,80,72,81,63,51,24,51,77,87,85,10,36,43,2
9|39,64,29,14,9,42,66,56,33

hdfs dfs -put t /tmp

----------------------------------------------------------------------------------------------------
--  hive
----------------------------------------------------------------------------------------------------

create external table t
(
    RowID           int
   ,stringColumn    string
)
    row format delimited
    fields terminated by '|'
    location '/tmp/t'
;

select RowID,regexp_extract (stringColumn,'(,?[^,]*){0,10}',0) as string10,length(stringColumn)-length(regexp_replace(stringColumn,',',''))+1 as count from t;

1    44,85 2
2    56,37,83,68,43  5
3    33,48,42,18,23,80,31,86,48,42   24
4    77,26,95,53,11,99,74,82,7,55    17
5    48,78,39,62,16,44,43,63    8
6    35,97,99,19,22,50,29,84,82,25   11
7    80,43,82,94,81,58,70,8,70,6     26
8    66,44,66,4,80,72,81,63,51,24    18
9    39,64,29,14,9,42,66,56,33  9


Extracting the first 100 (10 in my example) tokens can be done with regexp_extract or regexp_replace

hive> select regexp_extract ('1,2,3,4,5,6,7,8,9,10,11,12,13,14,15','(,?[^,]*){0,10}',0);

1,2,3,4,5,6,7,8,9,10

hive> select regexp_replace ('1,2,3,4,5,6,7,8,9,10,11,12,13,14,15','((,?[^,]*){0,10}).*','$1');

1,2,3,4,5,6,7,8,9,10


From: Mahender Sarangam [mailto:Mahender.BigData@outlook.com]
Sent: Thursday, June 09, 2016 7:13 PM
To: user@hive.apache.org<ma...@hive.apache.org>
Subject: Get 100 items in Comma Separated strings from Hive Column.


Hi,

We have hive table which has a single column with more than 1000 comma separated string items.  Is there a way to retrieve only 100 string items from that Column. Also we need to capture number of comma separated string items. We are looking for more of   "substring_index" functionality, since we are using Hive 1.2 version, we couldn't find "substring_index" UDF function, Is there a way to achieve the same functionality with  "regexp_extract" and I also see there is UDF available not sure whether this helps us achieving same functionality. https://github.com/brndnmtthws/facebook-hive-udfs/blob/master/src/main/java/com/facebook/hive/udf/UDFRegexpExtractAll.java

Scenario : Table1 (Source Table)

RowID stringColumn

1 1,2,3,4...10000

2 2,4,5,8,4

3 10,11,98,100

Now i Would like to show table result structure like below

Row ID 100String count

1 1,2,3...100 10000

2 2,4,5,8,4 5



RE: Get 100 items in Comma Separated strings from Hive Column.

Posted by "Markovitz, Dudu" <dm...@paypal.com>.
regexp_extract ('(,?[^,]*){0,10}',0)

(...){0,10}

The expression surrounded by brackets repeats 0 to 10 times.


(,?[…]*)

Optional comma followed by sequence (0 or more) of characters


[^,]

Any character which is not comma


regexp_extract (...,0)

0 stands for the whole expression
1 stands for the 1st expression which is surrounded by brackets (ordered by the opening brackets)
2 stands for the 2nd expression which is surrounded by brackets (ordered by the opening brackets)
3 stands for the 3rd expression which is surrounded by brackets (ordered by the opening brackets)
Etc.



regexp_replace (((,?[^,]*){0,10}).*','$1')

Similar to regexp_extract but this time we’re not extracting the first 10 tokens but replacing the whole expression with the first 10 tokens.
The expression that stands for the first 10 tokens is identical to the one we used in regexp_extract
.* stands for any character that repeats 0 or more times which represent anything following the first 10 tokens
$1 stands for the 1st expression which is surrounded by brackets (ordered by the opening brackets)


From: Mahender Sarangam [mailto:Mahender.BigData@outlook.com]
Sent: Friday, June 10, 2016 2:54 PM
To: user@hive.apache.org
Subject: Re: Get 100 items in Comma Separated strings from Hive Column.


Thanks Dudu. I will check. Can you please throw some light on regexp_replace (((,?[^,]*){0,10}).*','$1')  regexp_extract ('(,?[^,]*){0,10}',0),

On 6/9/2016 11:33 PM, Markovitz, Dudu wrote:
+ Improvement

The “Count” can be done in a cleaner way
(The previous way works also with simple ‘replace’)

hive> select RowID,length(regexp_replace(stringColumn,'[^,]',''))+1 as count from t;

1              2
2              5
3              24
4              17
5              8
6              11
7              26
8              18
9              9


From: Markovitz, Dudu [mailto:dmarkovitz@paypal.com]
Sent: Thursday, June 09, 2016 11:30 PM
To: user@hive.apache.org<ma...@hive.apache.org>
Subject: RE: Get 100 items in Comma Separated strings from Hive Column.

----------------------------------------------------------------------------------------------------
--  bash
----------------------------------------------------------------------------------------------------
mkdir t

cat>t/data.txt
1|44,85
2|56,37,83,68,43
3|33,48,42,18,23,80,31,86,48,42,37,52,99,55,93,1,63,67,32,75,44,57,70,2
4|77,26,95,53,11,99,74,82,7,55,75,6,32,87,75,99,80
5|48,78,39,62,16,44,43,63
6|35,97,99,19,22,50,29,84,82,25,77
7|80,43,82,94,81,58,70,8,70,6,62,100,60,84,55,24,100,75,84,15,53,5,19,45,61,73
8|66,44,66,4,80,72,81,63,51,24,51,77,87,85,10,36,43,2
9|39,64,29,14,9,42,66,56,33

hdfs dfs -put t /tmp

----------------------------------------------------------------------------------------------------
--  hive
----------------------------------------------------------------------------------------------------

create external table t
(
    RowID           int
   ,stringColumn    string
)
    row format delimited
    fields terminated by '|'
    location '/tmp/t'
;

select RowID,regexp_extract (stringColumn,'(,?[^,]*){0,10}',0) as string10,length(stringColumn)-length(regexp_replace(stringColumn,',',''))+1 as count from t;

1    44,85 2
2    56,37,83,68,43  5
3    33,48,42,18,23,80,31,86,48,42   24
4    77,26,95,53,11,99,74,82,7,55    17
5    48,78,39,62,16,44,43,63    8
6    35,97,99,19,22,50,29,84,82,25   11
7    80,43,82,94,81,58,70,8,70,6     26
8    66,44,66,4,80,72,81,63,51,24    18
9    39,64,29,14,9,42,66,56,33  9


Extracting the first 100 (10 in my example) tokens can be done with regexp_extract or regexp_replace

hive> select regexp_extract ('1,2,3,4,5,6,7,8,9,10,11,12,13,14,15','(,?[^,]*){0,10}',0);

1,2,3,4,5,6,7,8,9,10

hive> select regexp_replace ('1,2,3,4,5,6,7,8,9,10,11,12,13,14,15','((,?[^,]*){0,10}).*','$1');

1,2,3,4,5,6,7,8,9,10


From: Mahender Sarangam [mailto:Mahender.BigData@outlook.com]
Sent: Thursday, June 09, 2016 7:13 PM
To: user@hive.apache.org<ma...@hive.apache.org>
Subject: Get 100 items in Comma Separated strings from Hive Column.


Hi,

We have hive table which has a single column with more than 1000 comma separated string items.  Is there a way to retrieve only 100 string items from that Column. Also we need to capture number of comma separated string items. We are looking for more of   "substring_index" functionality, since we are using Hive 1.2 version, we couldn't find "substring_index" UDF function, Is there a way to achieve the same functionality with  "regexp_extract" and I also see there is UDF available not sure whether this helps us achieving same functionality. https://github.com/brndnmtthws/facebook-hive-udfs/blob/master/src/main/java/com/facebook/hive/udf/UDFRegexpExtractAll.java

Scenario : Table1 (Source Table)

RowID stringColumn

1 1,2,3,4...10000

2 2,4,5,8,4

3 10,11,98,100

Now i Would like to show table result structure like below

Row ID 100String count

1 1,2,3...100 10000

2 2,4,5,8,4 5


Re: Get 100 items in Comma Separated strings from Hive Column.

Posted by Mahender Sarangam <Ma...@outlook.com>.
Thanks Dudu. I will check. Can you please throw some light on regexp_replace (((,?[^,]*){0,10}).*','$1')  regexp_extract ('(,?[^,]*){0,10}',0),

On 6/9/2016 11:33 PM, Markovitz, Dudu wrote:
+ Improvement

The “Count” can be done in a cleaner way
(The previous way works also with simple ‘replace’)

hive> select RowID,length(regexp_replace(stringColumn,'[^,]',''))+1 as count from t;

1              2
2              5
3              24
4              17
5              8
6              11
7              26
8              18
9              9


From: Markovitz, Dudu [mailto:dmarkovitz@paypal.com]
Sent: Thursday, June 09, 2016 11:30 PM
To: user@hive.apache.org<ma...@hive.apache.org>
Subject: RE: Get 100 items in Comma Separated strings from Hive Column.

----------------------------------------------------------------------------------------------------
--  bash
----------------------------------------------------------------------------------------------------
mkdir t

cat>t/data.txt
1|44,85
2|56,37,83,68,43
3|33,48,42,18,23,80,31,86,48,42,37,52,99,55,93,1,63,67,32,75,44,57,70,2
4|77,26,95,53,11,99,74,82,7,55,75,6,32,87,75,99,80
5|48,78,39,62,16,44,43,63
6|35,97,99,19,22,50,29,84,82,25,77
7|80,43,82,94,81,58,70,8,70,6,62,100,60,84,55,24,100,75,84,15,53,5,19,45,61,73
8|66,44,66,4,80,72,81,63,51,24,51,77,87,85,10,36,43,2
9|39,64,29,14,9,42,66,56,33

hdfs dfs -put t /tmp

----------------------------------------------------------------------------------------------------
--  hive
----------------------------------------------------------------------------------------------------

create external table t
(
    RowID           int
   ,stringColumn    string
)
    row format delimited
    fields terminated by '|'
    location '/tmp/t'
;

select RowID,regexp_extract (stringColumn,'(,?[^,]*){0,10}',0) as string10,length(stringColumn)-length(regexp_replace(stringColumn,',',''))+1 as count from t;

1    44,85 2
2    56,37,83,68,43  5
3    33,48,42,18,23,80,31,86,48,42   24
4    77,26,95,53,11,99,74,82,7,55    17
5    48,78,39,62,16,44,43,63    8
6    35,97,99,19,22,50,29,84,82,25   11
7    80,43,82,94,81,58,70,8,70,6     26
8    66,44,66,4,80,72,81,63,51,24    18
9    39,64,29,14,9,42,66,56,33  9


Extracting the first 100 (10 in my example) tokens can be done with regexp_extract or regexp_replace

hive> select regexp_extract ('1,2,3,4,5,6,7,8,9,10,11,12,13,14,15','(,?[^,]*){0,10}',0);

1,2,3,4,5,6,7,8,9,10

hive> select regexp_replace ('1,2,3,4,5,6,7,8,9,10,11,12,13,14,15','((,?[^,]*){0,10}).*','$1');

1,2,3,4,5,6,7,8,9,10


From: Mahender Sarangam [mailto:Mahender.BigData@outlook.com]
Sent: Thursday, June 09, 2016 7:13 PM
To: user@hive.apache.org<ma...@hive.apache.org>
Subject: Get 100 items in Comma Separated strings from Hive Column.


Hi,

We have hive table which has a single column with more than 1000 comma separated string items.  Is there a way to retrieve only 100 string items from that Column. Also we need to capture number of comma separated string items. We are looking for more of   "substring_index" functionality, since we are using Hive 1.2 version, we couldn't find "substring_index" UDF function, Is there a way to achieve the same functionality with  "regexp_extract" and I also see there is UDF available not sure whether this helps us achieving same functionality. https://github.com/brndnmtthws/facebook-hive-udfs/blob/master/src/main/java/com/facebook/hive/udf/UDFRegexpExtractAll.java

Scenario : Table1 (Source Table)

RowID stringColumn

1 1,2,3,4...10000

2 2,4,5,8,4

3 10,11,98,100

Now i Would like to show table result structure like below

Row ID 100String count

1 1,2,3...100 10000

2 2,4,5,8,4 5


RE: Get 100 items in Comma Separated strings from Hive Column.

Posted by "Markovitz, Dudu" <dm...@paypal.com>.
+ bug fix
This version will differentiate between empty strings and strings with a single token (both have no commas)

hive> select RowID,length(regexp_replace(stringColumn,'[^,]',''))+if(length(stringColumn)=0,0,1) as count from t;


From: Markovitz, Dudu [mailto:dmarkovitz@paypal.com]
Sent: Friday, June 10, 2016 9:34 AM
To: user@hive.apache.org
Subject: RE: Get 100 items in Comma Separated strings from Hive Column.

+ Improvement

The “Count” can be done in a cleaner way
(The previous way works also with simple ‘replace’)

hive> select RowID,length(regexp_replace(stringColumn,'[^,]',''))+1 as count from t;

1              2
2              5
3              24
4              17
5              8
6              11
7              26
8              18
9              9


From: Markovitz, Dudu [mailto:dmarkovitz@paypal.com]
Sent: Thursday, June 09, 2016 11:30 PM
To: user@hive.apache.org<ma...@hive.apache.org>
Subject: RE: Get 100 items in Comma Separated strings from Hive Column.

----------------------------------------------------------------------------------------------------
--  bash
----------------------------------------------------------------------------------------------------
mkdir t

cat>t/data.txt
1|44,85
2|56,37,83,68,43
3|33,48,42,18,23,80,31,86,48,42,37,52,99,55,93,1,63,67,32,75,44,57,70,2
4|77,26,95,53,11,99,74,82,7,55,75,6,32,87,75,99,80
5|48,78,39,62,16,44,43,63
6|35,97,99,19,22,50,29,84,82,25,77
7|80,43,82,94,81,58,70,8,70,6,62,100,60,84,55,24,100,75,84,15,53,5,19,45,61,73
8|66,44,66,4,80,72,81,63,51,24,51,77,87,85,10,36,43,2
9|39,64,29,14,9,42,66,56,33

hdfs dfs -put t /tmp

----------------------------------------------------------------------------------------------------
--  hive
----------------------------------------------------------------------------------------------------

create external table t
(
    RowID           int
   ,stringColumn    string
)
    row format delimited
    fields terminated by '|'
    location '/tmp/t'
;

select RowID,regexp_extract (stringColumn,'(,?[^,]*){0,10}',0) as string10,length(stringColumn)-length(regexp_replace(stringColumn,',',''))+1 as count from t;

1    44,85 2
2    56,37,83,68,43  5
3    33,48,42,18,23,80,31,86,48,42   24
4    77,26,95,53,11,99,74,82,7,55    17
5    48,78,39,62,16,44,43,63    8
6    35,97,99,19,22,50,29,84,82,25   11
7    80,43,82,94,81,58,70,8,70,6     26
8    66,44,66,4,80,72,81,63,51,24    18
9    39,64,29,14,9,42,66,56,33  9


Extracting the first 100 (10 in my example) tokens can be done with regexp_extract or regexp_replace

hive> select regexp_extract ('1,2,3,4,5,6,7,8,9,10,11,12,13,14,15','(,?[^,]*){0,10}',0);

1,2,3,4,5,6,7,8,9,10

hive> select regexp_replace ('1,2,3,4,5,6,7,8,9,10,11,12,13,14,15','((,?[^,]*){0,10}).*','$1');

1,2,3,4,5,6,7,8,9,10


From: Mahender Sarangam [mailto:Mahender.BigData@outlook.com]
Sent: Thursday, June 09, 2016 7:13 PM
To: user@hive.apache.org<ma...@hive.apache.org>
Subject: Get 100 items in Comma Separated strings from Hive Column.


Hi,

We have hive table which has a single column with more than 1000 comma separated string items.  Is there a way to retrieve only 100 string items from that Column. Also we need to capture number of comma separated string items. We are looking for more of   "substring_index" functionality, since we are using Hive 1.2 version, we couldn't find "substring_index" UDF function, Is there a way to achieve the same functionality with  "regexp_extract" and I also see there is UDF available not sure whether this helps us achieving same functionality. https://github.com/brndnmtthws/facebook-hive-udfs/blob/master/src/main/java/com/facebook/hive/udf/UDFRegexpExtractAll.java

Scenario : Table1 (Source Table)

RowID stringColumn

1 1,2,3,4...10000

2 2,4,5,8,4

3 10,11,98,100

Now i Would like to show table result structure like below

Row ID 100String count

1 1,2,3...100 10000

2 2,4,5,8,4 5

RE: Get 100 items in Comma Separated strings from Hive Column.

Posted by "Markovitz, Dudu" <dm...@paypal.com>.
+ Improvement

The “Count” can be done in a cleaner way
(The previous way works also with simple ‘replace’)

hive> select RowID,length(regexp_replace(stringColumn,'[^,]',''))+1 as count from t;

1              2
2              5
3              24
4              17
5              8
6              11
7              26
8              18
9              9


From: Markovitz, Dudu [mailto:dmarkovitz@paypal.com]
Sent: Thursday, June 09, 2016 11:30 PM
To: user@hive.apache.org
Subject: RE: Get 100 items in Comma Separated strings from Hive Column.

----------------------------------------------------------------------------------------------------
--  bash
----------------------------------------------------------------------------------------------------
mkdir t

cat>t/data.txt
1|44,85
2|56,37,83,68,43
3|33,48,42,18,23,80,31,86,48,42,37,52,99,55,93,1,63,67,32,75,44,57,70,2
4|77,26,95,53,11,99,74,82,7,55,75,6,32,87,75,99,80
5|48,78,39,62,16,44,43,63
6|35,97,99,19,22,50,29,84,82,25,77
7|80,43,82,94,81,58,70,8,70,6,62,100,60,84,55,24,100,75,84,15,53,5,19,45,61,73
8|66,44,66,4,80,72,81,63,51,24,51,77,87,85,10,36,43,2
9|39,64,29,14,9,42,66,56,33

hdfs dfs -put t /tmp

----------------------------------------------------------------------------------------------------
--  hive
----------------------------------------------------------------------------------------------------

create external table t
(
    RowID           int
   ,stringColumn    string
)
    row format delimited
    fields terminated by '|'
    location '/tmp/t'
;

select RowID,regexp_extract (stringColumn,'(,?[^,]*){0,10}',0) as string10,length(stringColumn)-length(regexp_replace(stringColumn,',',''))+1 as count from t;

1    44,85 2
2    56,37,83,68,43  5
3    33,48,42,18,23,80,31,86,48,42   24
4    77,26,95,53,11,99,74,82,7,55    17
5    48,78,39,62,16,44,43,63    8
6    35,97,99,19,22,50,29,84,82,25   11
7    80,43,82,94,81,58,70,8,70,6     26
8    66,44,66,4,80,72,81,63,51,24    18
9    39,64,29,14,9,42,66,56,33  9


Extracting the first 100 (10 in my example) tokens can be done with regexp_extract or regexp_replace

hive> select regexp_extract ('1,2,3,4,5,6,7,8,9,10,11,12,13,14,15','(,?[^,]*){0,10}',0);

1,2,3,4,5,6,7,8,9,10

hive> select regexp_replace ('1,2,3,4,5,6,7,8,9,10,11,12,13,14,15','((,?[^,]*){0,10}).*','$1');

1,2,3,4,5,6,7,8,9,10


From: Mahender Sarangam [mailto:Mahender.BigData@outlook.com]
Sent: Thursday, June 09, 2016 7:13 PM
To: user@hive.apache.org<ma...@hive.apache.org>
Subject: Get 100 items in Comma Separated strings from Hive Column.


Hi,

We have hive table which has a single column with more than 1000 comma separated string items.  Is there a way to retrieve only 100 string items from that Column. Also we need to capture number of comma separated string items. We are looking for more of   "substring_index" functionality, since we are using Hive 1.2 version, we couldn't find "substring_index" UDF function, Is there a way to achieve the same functionality with  "regexp_extract" and I also see there is UDF available not sure whether this helps us achieving same functionality. https://github.com/brndnmtthws/facebook-hive-udfs/blob/master/src/main/java/com/facebook/hive/udf/UDFRegexpExtractAll.java

Scenario : Table1 (Source Table)

RowID stringColumn

1 1,2,3,4...10000

2 2,4,5,8,4

3 10,11,98,100

Now i Would like to show table result structure like below

Row ID 100String count

1 1,2,3...100 10000

2 2,4,5,8,4 5

RE: Get 100 items in Comma Separated strings from Hive Column.

Posted by "Markovitz, Dudu" <dm...@paypal.com>.
----------------------------------------------------------------------------------------------------
--  bash
----------------------------------------------------------------------------------------------------
mkdir t

cat>t/data.txt
1|44,85
2|56,37,83,68,43
3|33,48,42,18,23,80,31,86,48,42,37,52,99,55,93,1,63,67,32,75,44,57,70,2
4|77,26,95,53,11,99,74,82,7,55,75,6,32,87,75,99,80
5|48,78,39,62,16,44,43,63
6|35,97,99,19,22,50,29,84,82,25,77
7|80,43,82,94,81,58,70,8,70,6,62,100,60,84,55,24,100,75,84,15,53,5,19,45,61,73
8|66,44,66,4,80,72,81,63,51,24,51,77,87,85,10,36,43,2
9|39,64,29,14,9,42,66,56,33

hdfs dfs -put t /tmp

----------------------------------------------------------------------------------------------------
--  hive
----------------------------------------------------------------------------------------------------

create external table t
(
    RowID           int
   ,stringColumn    string
)
    row format delimited
    fields terminated by '|'
    location '/tmp/t'
;

select RowID,regexp_extract (stringColumn,'(,?[^,]*){0,10}',0) as string10,length(stringColumn)-length(regexp_replace(stringColumn,',',''))+1 as count from t;

1    44,85 2
2    56,37,83,68,43  5
3    33,48,42,18,23,80,31,86,48,42   24
4    77,26,95,53,11,99,74,82,7,55    17
5    48,78,39,62,16,44,43,63    8
6    35,97,99,19,22,50,29,84,82,25   11
7    80,43,82,94,81,58,70,8,70,6     26
8    66,44,66,4,80,72,81,63,51,24    18
9    39,64,29,14,9,42,66,56,33  9


Extracting the first 100 (10 in my example) tokens can be done with regexp_extract or regexp_replace

hive> select regexp_extract ('1,2,3,4,5,6,7,8,9,10,11,12,13,14,15','(,?[^,]*){0,10}',0);

1,2,3,4,5,6,7,8,9,10

hive> select regexp_replace ('1,2,3,4,5,6,7,8,9,10,11,12,13,14,15','((,?[^,]*){0,10}).*','$1');

1,2,3,4,5,6,7,8,9,10


From: Mahender Sarangam [mailto:Mahender.BigData@outlook.com]
Sent: Thursday, June 09, 2016 7:13 PM
To: user@hive.apache.org
Subject: Get 100 items in Comma Separated strings from Hive Column.


Hi,

We have hive table which has a single column with more than 1000 comma separated string items.  Is there a way to retrieve only 100 string items from that Column. Also we need to capture number of comma separated string items. We are looking for more of   "substring_index" functionality, since we are using Hive 1.2 version, we couldn't find "substring_index" UDF function, Is there a way to achieve the same functionality with  "regexp_extract" and I also see there is UDF available not sure whether this helps us achieving same functionality. https://github.com/brndnmtthws/facebook-hive-udfs/blob/master/src/main/java/com/facebook/hive/udf/UDFRegexpExtractAll.java

Scenario : Table1 (Source Table)

RowID stringColumn

1 1,2,3,4...10000

2 2,4,5,8,4

3 10,11,98,100

Now i Would like to show table result structure like below

Row ID 100String count

1 1,2,3...100 10000

2 2,4,5,8,4 5