Posted to dev@phoenix.apache.org by Siva <sb...@gmail.com> on 2015/02/12 01:28:01 UTC

Phoenix bulk loading

Hello all,

Is there a way to specify that columns which are not present in the CSV file should be kept as NULL values as part of bulk loading?

The requirement I have is that a few rows in the CSV file contain all of the columns, but other rows contain only some of them.

In HBase, if a given record doesn't have all of the desired columns, ImportTsv just ignores the missing columns and moves on to the next record while loading the data.
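
For reference, the kind of ImportTsv invocation I'm comparing against looks roughly like this (the table name, column family, column names, and paths here are placeholders rather than my actual setup):

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
    -Dimporttsv.separator=, \
    -Dimporttsv.columns=HBASE_ROW_KEY,d:c1,d:c2,d:c3,d:c4 \
    MY_TABLE /path/to/input.csv

The Phoenix bulk load command I'm running is: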

HADOOP_CLASSPATH=/usr/hdp/2.2.0.0-2041/hbase/lib/hbase-protocol.jar:/usr/hdp/2.2.0.0-2041/hbase/conf \
hadoop jar /usr/hdp/2.2.0.0-2041/phoenix/phoenix-4.2.0.2.2.0.0-2041-client.jar \
    org.apache.phoenix.mapreduce.CsvBulkLoadTool \
    --table P_TEST_2_COLS \
    --input /user/sbhavanari/p_h_test_2_cols_less.csv \
    --import-columns NAME,LEADID,D \
    --zookeeper 172.31.45.176:2181:/hbase

Thanks,
Siva.

Re: Phoenix bulk loading

Posted by Siva <sb...@gmail.com>.
Thanks for your inputs, Gabriel. I'm thinking of it like this: keep the data loader's consistency behavior as it is, and add a new parameter that extends the logic to populate NULL values for missing columns. Let's see how it goes :)


Re: Phoenix bulk loading

Posted by Gabriel Reid <ga...@gmail.com>.
Hi Siva,

If I'm not mistaken, the bulk loader will currently just drop rows
that don't have the correct number of fields.

I'm actually in favor of this behavior (i.e. enforcing the consistency
of records in an input file). A CSV file is generally considered a
table with a given set of columns, so I think it's pretty reasonable
to consider a CSV record corrupt if it doesn't include the correct
number of columns.

I think the workaround here (which you may already have realized) would be to encode your input data as follows:

r1,c1,c2,c3,
r2,c1,c2,,
r2,c1,c2,c3,c4

I realize that this makes your use case more complex, because it requires processing the file before loading it into Phoenix, but I think that keeping the Phoenix loader relatively strict (and simple) in this case is the better choice.
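
To sketch what that pre-processing could look like (this is just one possible approach; the expected field count of 5, taken from your row-key-plus-4-columns example, and the file names are assumptions for illustration), a single awk pass could pad every record with empty trailing fields before handing the file to the bulk loader:

# Pad each CSV record out to 5 fields (row key + 4 columns); records that
# already have 5 or more fields pass through unchanged.
awk 'BEGIN { FS = OFS = "," } { for (i = NF + 1; i <= 5; i++) $i = ""; print }' \
    input.csv > padded.csv

The padded.csv output could then be fed to CsvBulkLoadTool as usual.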

- Gabriel


Re: Phoenix bulk loading

Posted by Siva <sb...@gmail.com>.
Hi Gabriel,

Thanks for your response.

Your understanding is correct. The use case we have is that we get data from different sources (with different table structures, in terms of columns, depending on the client type) in CSV format. If a column is not available in the source, we don't even have the option of appending a blank comma (,) in its place. But HBase simply ignores a column if it doesn't find the data for it within a record.

If I have a set of records like the ones below, and I specify 4 columns (excluding the row key), then for the first record it inserts data for 3 columns, for the 2nd record 2 columns, and for the 3rd record 4 columns; it just ignores a column if it doesn't find the data for it.

r1,c1,c2,c3
r2,c1,c2
r2,c1,c2,c3,c4


Since Phoenix doesn't have this capability, we have to create the tables in HBase and load them through HBase, and use Phoenix just for SQL queries.

I think we should enhance the Phoenix data loader to work the same way as the HBase loader. What do you say, any thoughts on this?

Thanks,
Siva.


Re: Phoenix bulk loading

Posted by Gabriel Reid <ga...@gmail.com>.
Hi Siva,

If I understand correctly, you want to explicitly supply null values
in a CSV file for some fields. In general, this should work by just
leaving the field empty in your CSV file. For example, if you have
three fields (id, first_name, last_name) in your CSV file, then a
record like "1,,Reid" should create a record with first_name left as
null.
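
To make that concrete, here's a minimal sketch (the table, column names, file name, and ZooKeeper quorum are all made up for illustration):

# Assuming a table created beforehand, e.g. via sqlline.py:
#   CREATE TABLE PEOPLE (ID BIGINT NOT NULL PRIMARY KEY,
#                        FIRST_NAME VARCHAR, LAST_NAME VARCHAR)
# and a people.csv containing:
#   1,Gabriel,Reid
#   2,,Reid
# loading the file with psql.py (ZooKeeper on localhost) should leave
# FIRST_NAME as null for the second row:
psql.py -t PEOPLE localhost people.csv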

Note that there is still an open bug, PHOENIX-1277 [1], that prevents inserting null values via the bulk loader or psql for some datatypes, so for those datatypes there currently isn't a way to explicitly supply null values.

- Gabriel


1. https://issues.apache.org/jira/browse/PHOENIX-1277
