Posted to user@carbondata.apache.org by Pallavi Singh <pa...@knoldus.in> on 2017/05/16 06:08:11 UTC

Query regarding behaviour of sort column

Hi Community,

While working with sort columns, I saw that if a column listed in
SORT_COLUMNS has a null value in the data, the row containing that null
value is dropped from the result set. Is this the correct behavior? Ideally,
null values should be sorted and listed at the top of the result set.

-- 
Regards | Pallavi Singh
Software Consultant
Knoldus Software LLP
+91-9911235949

Re: Query regarding behaviour of sort column

Posted by manish gupta <to...@gmail.com>.
Hi Pallavi,

The behavior of a column should be the same irrespective of whether it is a
dictionary, no-dictionary, measure, or sort column.

When you specify a numeric column as a sort column, it is processed as a
no-dictionary column.

In your case, since BAD_RECORDS_ACTION is FORCE, the data load should not
fail and the behavior should be consistent. However, there is a bug in the
code that causes the inconsistency you are seeing.

I will raise a JIRA for this and track the issue.

Regards
Manish Gupta

On Wed, May 17, 2017 at 4:49 PM, Pallavi Singh <pa...@knoldus.in>
wrote:

> Hi Community,
>
> While working with the above problem I found two discussions regarding the
> sort column
>
> 1. https://github.com/apache/carbondata/pull/635 which states : If the
> table need be sorted by a measure, we should use dictionary_include to add
> it to dimension list.
>
> 2. https://github.com/apache/carbondata/pull/757 : if a column of
> SORT_COLUMNS is a measure before, now this column will be created as a
> dimension. And this dimension is a no-dicitonary column(Better to use other
> direct-dictionary).
>
> Now if the columns in my sort_column be measures then I have to add the
> same columns in the dictionary_include other wise in case of null value in
> case of sort_column column the loading fails after the first null encounter
> itself.
>
> for example like this :
>
> CREATE TABLE test_sort_col
>    | (id INT,
>    | name STRING,
>    | age INT
>    | )STORED BY 'org.apache.carbondata.format'
>    | TBLPROPERTIES('SORT_COLUMNS'='id,age','DICTIONARY_INCLUDE'='id,age')
>
> and the csv has following data :
>
> 1,Pallavi,25
> 2,Geetika,24
> 3,Prabhat,twenty six
> 7,Neha,25
> 2,Geetika,22
> 3,Sangeeta,26
>
> and the load gets successful like shown below :
>
> +---+--------+----+
> | id|    name| age|
> +---+--------+----+
> |  1| Pallavi|  25|
> |  2| Geetika|  22|
> |  2| Geetika|  24|
> |  3| Prabhat|null|
> |  3|Sangeeta|  26|
> |  7|    Neha|  25|
> +---+--------+----+
>
> now if i remove the measures of the sort_column from the
> dictionary_include in the  query I get an error and partial data gets
> loaded, snapshot is provided below
>
> Data load request has been received for table default.test_sort_col
> 17/05/17 16:46:51 ERROR UnsafeBatchParallelReadMergeSorterImpl:
> pool-20-thread-1
> java.lang.ClassCastException: java.lang.String cannot be cast to [B
> at org.apache.carbondata.processing.newflow.sort.
> unsafe.UnsafeCarbonRowPage.addRow(UnsafeCarbonRowPage.java:89)
> at org.apache.carbondata.processing.newflow.sort.
> unsafe.UnsafeCarbonRowPage.addRow(UnsafeCarbonRowPage.java:74)
> at org.apache.carbondata.processing.newflow.sort.
> unsafe.UnsafeSortDataRows.addRowBatch(UnsafeSortDataRows.java:170)
> at org.apache.carbondata.processing.newflow.sort.impl.
> UnsafeBatchParallelReadMergeSorterImpl$SortIteratorThread.call(
> UnsafeBatchParallelReadMergeSorterImpl.java:150)
> at org.apache.carbondata.processing.newflow.sort.impl.
> UnsafeBatchParallelReadMergeSorterImpl$SortIteratorThread.call(
> UnsafeBatchParallelReadMergeSorterImpl.java:117)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 17/05/17 16:46:51 AUDIT UnsafeInmemoryHolder: [pallavi][pallavi][Thread-83]Processing
> unsafe inmemory rows page with size : 2
> 17/05/17 16:46:52 ERROR DataLoadExecutor: [Executor task launch
> worker-0][partitionID:default_test_sort_col_296196d0-a469-4273-a382-c41531c32591]
> Data Load is partially success for table test_sort_col
> 17/05/17 16:46:52 AUDIT CarbonDataRDDFactory$: [pallavi][pallavi][Thread-1]Data
> load is partially successful for default.test_sort_col
> +---+-------+---+
> | id|   name|age|
> +---+-------+---+
> |  1|Pallavi| 25|
> |  2|Geetika| 24|
> +---+-------+---+
>
> What is the correct behavior, should we add measures in sort_column to
> dictionary_include or should we modify the load flow to handle null values?
>
>
> On Tue, May 16, 2017 at 11:38 AM, Pallavi Singh <pa...@knoldus.in>
> wrote:
>
>> Hi Community,
>>
>> While working with the sort columns, I saw that if the column listed in
>> the sort_column happens to have a null value in the data while sorting ,
>> then the row corresponding to that null value is eliminated from the result
>> set. Is this correct behavior? Ideally null values should be sorted and
>> listed on the top in the result set.
>>
>
> --
> Regards | Pallavi Singh
> Software Consultant
>

Re: Query regarding behaviour of sort column

Posted by Pallavi Singh <pa...@knoldus.in>.
Hi Community,

While working on the above problem, I found two discussions regarding sort
columns:

1. https://github.com/apache/carbondata/pull/635, which states: If the
table need be sorted by a measure, we should use dictionary_include to add
it to dimension list.

2. https://github.com/apache/carbondata/pull/757, which states: if a column
of SORT_COLUMNS is a measure before, now this column will be created as a
dimension. And this dimension is a no-dictionary column (Better to use other
direct-dictionary).

So if the columns in my SORT_COLUMNS are measures, I have to add the same
columns to DICTIONARY_INCLUDE; otherwise, when a sort column contains a null
value, the load fails at the first null it encounters.

For example:

CREATE TABLE test_sort_col (
  id INT,
  name STRING,
  age INT
)
STORED BY 'org.apache.carbondata.format'
TBLPROPERTIES('SORT_COLUMNS'='id,age','DICTIONARY_INCLUDE'='id,age')

The CSV has the following data:

1,Pallavi,25
2,Geetika,24
3,Prabhat,twenty six
7,Neha,25
2,Geetika,22
3,Sangeeta,26

The load succeeds, as shown below:

+---+--------+----+
| id|    name| age|
+---+--------+----+
|  1| Pallavi|  25|
|  2| Geetika|  22|
|  2| Geetika|  24|
|  3| Prabhat|null|
|  3|Sangeeta|  26|
|  7|    Neha|  25|
+---+--------+----+
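
The ordering in the table above (the unparsable age "twenty six" stored as null and sorted ahead of 26 within id 3) can be reproduced with a small standalone Java sketch. This is only an illustration of "convert bad values to null, then sort with nulls first"; it does not use or mirror CarbonData's load code, and the names NullsFirstSortSketch, Row, parseOrNull and loadAndSort are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class NullsFirstSortSketch {

    /** One parsed CSV row; age is null when the raw value is not a valid integer. */
    static final class Row {
        final int id;
        final String name;
        final Integer age;

        Row(int id, String name, Integer age) {
            this.id = id;
            this.name = name;
            this.age = age;
        }

        @Override
        public String toString() {
            return id + "," + name + "," + age;
        }
    }

    /** Returns the parsed integer, or null for a bad value instead of failing. */
    static Integer parseOrNull(String raw) {
        try {
            return Integer.valueOf(raw.trim());
        } catch (NumberFormatException e) {
            return null; // assumption: a bad record is kept, with the bad field nulled
        }
    }

    /** Parses "id,name,age" lines and sorts by id, then by age with nulls first. */
    static List<Row> loadAndSort(List<String> csvLines) {
        List<Row> rows = new ArrayList<>();
        for (String line : csvLines) {
            String[] f = line.split(",", -1);
            rows.add(new Row(Integer.parseInt(f[0].trim()), f[1], parseOrNull(f[2])));
        }
        // Null ages compare as smaller than any non-null age.
        Comparator<Integer> ageOrder = Comparator.nullsFirst(Comparator.naturalOrder());
        Comparator<Row> byIdThenAge = Comparator
                .comparingInt((Row r) -> r.id)
                .thenComparing((Row r) -> r.age, ageOrder);
        rows.sort(byIdThenAge);
        return rows;
    }

    public static void main(String[] args) {
        List<String> csv = Arrays.asList(
                "1,Pallavi,25",
                "2,Geetika,24",
                "3,Prabhat,twenty six",
                "7,Neha,25",
                "2,Geetika,22",
                "3,Sangeeta,26");
        for (Row row : loadAndSort(csv)) {
            System.out.println(row);
        }
    }
}
```

Running main prints the six rows in the same order as the table, with the Prabhat row carrying a null age.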

Now, if I remove the sort-column measures from DICTIONARY_INCLUDE in the
CREATE TABLE query, I get an error and only part of the data is loaded. The
log and the resulting table are below:

Data load request has been received for table default.test_sort_col
17/05/17 16:46:51 ERROR UnsafeBatchParallelReadMergeSorterImpl: pool-20-thread-1
java.lang.ClassCastException: java.lang.String cannot be cast to [B
at org.apache.carbondata.processing.newflow.sort.unsafe.UnsafeCarbonRowPage.addRow(UnsafeCarbonRowPage.java:89)
at org.apache.carbondata.processing.newflow.sort.unsafe.UnsafeCarbonRowPage.addRow(UnsafeCarbonRowPage.java:74)
at org.apache.carbondata.processing.newflow.sort.unsafe.UnsafeSortDataRows.addRowBatch(UnsafeSortDataRows.java:170)
at org.apache.carbondata.processing.newflow.sort.impl.UnsafeBatchParallelReadMergeSorterImpl$SortIteratorThread.call(UnsafeBatchParallelReadMergeSorterImpl.java:150)
at org.apache.carbondata.processing.newflow.sort.impl.UnsafeBatchParallelReadMergeSorterImpl$SortIteratorThread.call(UnsafeBatchParallelReadMergeSorterImpl.java:117)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/05/17 16:46:51 AUDIT UnsafeInmemoryHolder: [pallavi][pallavi][Thread-83]Processing unsafe inmemory rows page with size : 2
17/05/17 16:46:52 ERROR DataLoadExecutor: [Executor task launch worker-0][partitionID:default_test_sort_col_296196d0-a469-4273-a382-c41531c32591] Data Load is partially success for table test_sort_col
17/05/17 16:46:52 AUDIT CarbonDataRDDFactory$: [pallavi][pallavi][Thread-1]Data load is partially successful for default.test_sort_col
+---+-------+---+
| id|   name|age|
+---+-------+---+
|  1|Pallavi| 25|
|  2|Geetika| 24|
+---+-------+---+
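
For anyone reading the trace: "[B" is the JVM's name for byte[], so the exception says a String reached code that cast the value to a byte array. The standalone Java sketch below reproduces that class of failure. It is my guess at the shape of the problem based only on the exception message, not the actual UnsafeCarbonRowPage code, and the names ClassCastSketch, fieldLength and tryField are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class ClassCastSketch {

    /** Mimics a consumer that assumes every field arrives as a byte[]. */
    static int fieldLength(Object field) {
        byte[] bytes = (byte[]) field; // throws ClassCastException when field is a String
        return bytes.length;
    }

    /** Returns the exception message if the cast fails, or null if it succeeds. */
    static String tryField(Object field) {
        try {
            fieldLength(field);
            return null;
        } catch (ClassCastException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // A field already converted to bytes passes through without error.
        System.out.println(tryField("25".getBytes(StandardCharsets.UTF_8)));
        // A field left as a raw String fails; the exact wording depends on the JDK version.
        System.out.println(tryField("twenty six"));
    }
}
```

If this reading is right, the rows loaded before the bad value survive and the sorter thread dies at the first unconverted field, which would match the two-row partial load shown above.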

What is the correct behavior? Should we add the measures in SORT_COLUMNS to
DICTIONARY_INCLUDE, or should the load flow be modified to handle null values?


On Tue, May 16, 2017 at 11:38 AM, Pallavi Singh <pa...@knoldus.in>
wrote:

> Hi Community,
>
> While working with the sort columns, I saw that if the column listed in
> the sort_column happens to have a null value in the data while sorting ,
> then the row corresponding to that null value is eliminated from the result
> set. Is this correct behavior? Ideally null values should be sorted and
> listed on the top in the result set.
>

-- 
Regards | Pallavi Singh
Software Consultant
