Posted to user@phoenix.apache.org by "Ciureanu, Constantin (GfK)" <Co...@gfk.com> on 2015/01/13 10:12:13 UTC

MapReduce bulk load into Phoenix table

Hello all,

(Due to the slow speed of Phoenix JDBC – single machine ~ 1000-1500 rows /sec) I am also documenting myself about loading data into Phoenix via MapReduce.

So far I understood that the Key + List<[Key,Value]> to be inserted into HBase table is obtained via a “dummy” Phoenix connection – then those rows are stored into HFiles (then after the MR job finishes it is Bulk loading those HFiles normally into HBase).
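
Roughly, the pattern as I currently understand it (a sketch only - the table, columns and JDBC URL below are placeholders, and I am assuming the PhoenixRuntime.getUncommittedDataIterator() call that the CSV bulk load mapper seems to use):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.Iterator;
    import java.util.List;

    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.util.Pair;
    import org.apache.phoenix.util.PhoenixRuntime;

    // Sketch only - nothing is committed through JDBC; the pending mutations are
    // harvested and the connection is rolled back.
    Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
    conn.setAutoCommit(false);
    PreparedStatement upsert =
            conn.prepareStatement("UPSERT INTO MY_TABLE (ID, COL1) VALUES (?, ?)");
    upsert.setString(1, "row-1");
    upsert.setString(2, "value-1");
    upsert.execute();

    // The uncommitted mutations are the Key + List<KeyValue> pairs that the MR job
    // writes into HFiles (in a Mapper: emit each KeyValue to the HFileOutputFormat).
    Iterator<Pair<byte[], List<KeyValue>>> it = PhoenixRuntime.getUncommittedDataIterator(conn);
    while (it.hasNext()) {
        Pair<byte[], List<KeyValue>> row = it.next();
        for (KeyValue kv : row.getSecond()) {
            // context.write(new ImmutableBytesWritable(kv.getRow()), kv);
        }
    }
    conn.rollback();   // nothing goes through the normal JDBC write path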

My question: Is there any better / faster approach? I assume this cannot reach the maximum speed to load data into Phoenix / HBase table.

Also I would like to find a better / newer sample code than this one:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.phoenix/phoenix/4.0.0-incubating/org/apache/phoenix/mapreduce/CsvToKeyValueMapper.java#CsvToKeyValueMapper.loadPreUpsertProcessor%28org.apache.hadoop.conf.Configuration%29

Thank you,
   Constantin

RE: MapReduce bulk load into Phoenix table

Posted by "Ciureanu, Constantin (GfK)" <Co...@gfk.com>.
Hi Gabriel,
Found the problem :) - it was in my code: I was getting a new Connection later in my Mapper code, and apparently that sometimes locked up.
I'm now reusing the same connection and it's working fine (the level of parallelism is as expected).
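
In case it helps anyone else, a minimal sketch of what the fix looks like (the class and field names below are placeholders, not my actual job):

    import java.io.IOException;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;

    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch: one Phoenix connection per task, opened in setup() and reused.
    public class BulkLoadMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {

        private Connection conn;
        private String jdbcUrl;   // placeholder - e.g. read from the job configuration in setup()

        @Override
        protected void setup(Context context) throws IOException {
            try {
                conn = DriverManager.getConnection(jdbcUrl);
                conn.setAutoCommit(false);
            } catch (SQLException e) {
                throw new IOException(e);
            }
        }

        @Override
        protected void map(LongWritable key, Text line, Context context)
                throws IOException, InterruptedException {
            // reuse 'conn' here - do NOT call DriverManager.getConnection() again per record
        }

        @Override
        protected void cleanup(Context context) throws IOException {
            try {
                if (conn != null) conn.close();
            } catch (SQLException e) {
                throw new IOException(e);
            }
        }
    }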

-----Original Message-----
From: Ciureanu, Constantin (GfK) [mailto:Constantin.Ciureanu@gfk.com] 
Sent: Friday, January 16, 2015 10:17 AM
To: user@phoenix.apache.org
Subject: RE: MapReduce bulk load into Phoenix table

Hi Gabriel,

The processing part is after this code:

		System.out.println(String.format("Connection with driver %s with url %s", PhoenixDriver.class.getName(), jdbcUrl));

		try {
			conn = (PhoenixConnection) DriverManager.getConnection(jdbcUrl, clientInfos);
			stmt = conn.createStatement();
		} catch (SQLException e) {
			throw new RuntimeException(e);
		}

I added some random wait time before this code - and as a result 12 Mappers started to work in parallel (the rest were still blocked for ~10 minutes; upon restart by the MR framework they worked fine).
So one of the lines above was freezing the mapper.

P.S. I will upgrade Phoenix version soon (I use now Phoenix 4.2) - maybe that was fixed already.

Regards,
  Constantin

-----Original Message-----
From: Gabriel Reid [mailto:gabriel.reid@gmail.com]
Sent: Friday, January 16, 2015 9:38 AM
To: user@phoenix.apache.org
Subject: Re: MapReduce bulk load into Phoenix table

Hi Constantin,

The issues you're having sound like they're (probably) much more related to MapReduce than to Phoenix. In order to first determine what the real issue is, could you give a general overview of how your MR job is implemented (or even better, give me a pointer to it on GitHub or something similar)?

- Gabriel


On Thu, Jan 15, 2015 at 2:19 PM, Ciureanu, Constantin (GfK) <Co...@gfk.com> wrote:
> Hello all,
>
> I finished the MR Job - for now it just failed a few times since the Mappers gave some weird timeout (600 seconds) apparently not processing anything meanwhile.
> When I check the running mappers, just 3 of them are progressing (quite fast however, why just 3 are working? - I have 6 machines, 24 tasks can run in the same time).
>
> Can be this because of some limitation on number of connections to Phoenix?
>
> Regards,
>   Constantin
>
>
> -----Original Message-----
> From: Ciureanu, Constantin (GfK) [mailto:Constantin.Ciureanu@gfk.com]
> Sent: Wednesday, January 14, 2015 9:44 AM
> To: user@phoenix.apache.org
> Subject: RE: MapReduce bulk load into Phoenix table
>
> Hello James,
>
> Yes, as low as 1500 rows /sec -> using Phoenix JDBC with Batch Inserts of 1000 records at once, but there are at least 100 dynamic columns for each row.
> I was expecting higher values of course - but I will finish soon coding a MR job to load the same data using Hadoop.
> The code I read and adapt in my MR job is from your CsvBulkLoadTool. [ After finishing it I will test it then post new speed results.] This is basically using Phoenix connection to "dummy upsert" then takes the Key + List<KV> and rollback the connection - that was my question yesterday if there's no other better way.
> My new problem is that the CsvUpsertExecutor needs a list of fields (which I don't have since the columns are dynamic, I do not use anyway a CSV source).
> So it would have been nice to have a "reusable building block of code" for this - I'm sure everyone needs a fast and clean template code to load data into destination HBase (or Phoenix) Table using Phoenix + MR.
> I can create the row key from concatenating my key fields - but I don't know (yet) how to obtain the salting byte(s).
>
> My current test cluster details:
> - 6x dualcore machines (on AWS)
> - more than 100 TB disk space
> - the table is salted into 8 buckets and has 8 columns common to all 
> rows
>
> Thank you for your answer and technical support on this email-list, 
> Constantin
>
> -----Original Message-----
> From: James Taylor [mailto:jamestaylor@apache.org]
> Sent: Tuesday, January 13, 2015 7:23 PM
> To: user
> Subject: Re: MapReduce bulk load into Phoenix table
>
> Hi Constantin,
> 1000-1500 rows per sec? Using our performance.py script, on my Mac laptop, I'm seeing 27,000 rows per sec (Phoenix 4.2.2 with HBase 0.98.9).
>
> If you want to realistically measure performance, I'd recommend doing 
> so on a real cluster. If you'll really only have a single machine, 
> then you're probably better off using something like MySQL. Using the 
> map-reduce based CSV loader on a single node is not going to speed 
> anything up. For a cluster it can make a difference, though. See 
> http://phoenix.apache.org/phoenix_mr.html
>
> FYI, Phoenix indexes are only maintained if you go through Phoenix APIs.
>
> Thanks,
> James
>
>
> On Tue, Jan 13, 2015 at 2:45 AM, Vaclav Loffelmann <va...@socialbakers.com> wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> I think the easiest way how to determine if indexes are maintained 
>> when inserting directly to HBase is to test it. If it is maintained 
>> by region observer coprocessors, it should. (I'll do tests when as 
>> soon I'll have some time.)
>>
>> I don't see any problem with different cols between multiple rows.
>> Make view same as you'd make table definition. Null values are not 
>> stored at HBase hence theres no overhead.
>>
>> I'm afraid there is not any piece of code (publicly avail) how to do 
>> that, but it is very straight forward.
>> If you use composite primary key, then concat multiple results of
>> PDataType.TYPE.toBytes() as rowkey. For values use same logic. Data 
>> types are defined as enums at this class:
>> org.apache.phoenix.schema.PDataType.
>>
>> Good luck,
>> Vaclav;
>>
>> On 01/13/2015 10:58 AM, Ciureanu, Constantin (GfK) wrote:
>>> Thank you Vaclav,
>>>
>>> I have just started today to write some code :) for MR job that will 
>>> load data into HBase + Phoenix. Previously I wrote some application 
>>> to load data using Phoenix JDBC (slow), but I also have experience 
>>> with HBase so I can understand and write code to load data directly 
>>> there.
>>>
>>> If doing so, I'm also worry about: - maintaining (some existing) 
>>> Phoenix indexes (if any) - perhaps this still works in case the
>>> (same) coprocessors would trigger at insert time, but I cannot know 
>>> how it works behind the scenes. - having the Phoenix view around the 
>>> HBase table would "solve" the above problem (so there's no index
>>> whatsoever) but would create a lot of other problems (my table has a 
>>> limited number of common columns and the rest are too different from 
>>> row to row - in total I have hundreds of possible
>>> columns)
>>>
>>> So - to make things faster for me-  is there any good piece of code 
>>> I can find on the internet about how to map my data types to Phoenix 
>>> data types and use the results as regular HBase Bulk Load?
>>>
>>> Regards, Constantin
>>>
>>> -----Original Message----- From: Vaclav Loffelmann 
>>> [mailto:vaclav.loffelmann@socialbakers.com] Sent: Tuesday, January 
>>> 13, 2015 10:30 AM To: user@phoenix.apache.org Subject: Re:
>>> MapReduce bulk load into Phoenix table
>>>
>>> Hi, our daily usage is to import raw data directly to HBase, but 
>>> mapped to Phoenix data types. And for querying we use Phoenix view 
>>> on top of that HBase table.
>>>
>>> Then you should hit bottleneck of HBase itself. It should be from
>>> 10 to 30+ times faster than your current solution. Depending on HW 
>>> of course.
>>>
>>> I'd prefer this solution for stream writes.
>>>
>>> Vaclav
>>>
>>> On 01/13/2015 10:12 AM, Ciureanu, Constantin (GfK) wrote:
>>>> Hello all,
>>>
>>>> (Due to the slow speed of Phoenix JDBC – single machine ~
>>>> 1000-1500 rows /sec) I am also documenting myself about loading 
>>>> data into Phoenix via MapReduce.
>>>
>>>> So far I understood that the Key + List<[Key,Value]> to be inserted 
>>>> into HBase table is obtained via a “dummy” Phoenix connection – 
>>>> then those rows are stored into HFiles (then after the MR job 
>>>> finishes it is Bulk loading those HFiles normally into HBase).
>>>
>>>> My question: Is there any better / faster approach? I assume this 
>>>> cannot reach the maximum speed to load data into Phoenix / HBase 
>>>> table.
>>>
>>>> Also I would like to find a better / newer sample code than this
>>>> one:
>>>> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.phoenix/phoenix/4.0.0-incubating/org/apache/phoenix/mapreduce/CsvToKeyValueMapper.java#CsvToKeyValueMapper.loadPreUpsertProcessor%28org.apache.hadoop.conf.Configuration%29
>>>
>>>> Thank you, Constantin
>>>
>>>
>> -----BEGIN PGP SIGNATURE-----
>> Version: GnuPG v1
>>
>> iQEcBAEBAgAGBQJUtPc4AAoJEG3mD8Wtuk0WWu0IAIJcEveMvKZbrrf3FY0SRbNx
>> 6V5jF44t+Dl88r1EX9VvKZrWzjp/uN6t8r3H6D5nNLmq0314jOfK68T0q3uaIqOp
>> Ui7UwMRAGkmhhY9pqgZBfWvXsLLJSy9hJU70JSk3RFeynprqPsme9a8CWJi8IfDN
>> G83he0X2avQKQJ72hKeDZX9NzKib9cxNQFKtWDpr2NQat5VnCJkCUGprMcMUxU31
>> vqFg9aQ+b40WN0KFJ3p7cI5tlAuJ5Tz7Ogh+KOZUqTVZ8I5OLwQzqpiwDMD6stRb
>> PJ1gc7LCs64wJghv5TIZpHyXl/3HOgmpYrO+UfGv1S1qzySpM3B1o9ajTbww3L0=
>> =Fdvo
>> -----END PGP SIGNATURE-----

RE: MapReduce bulk load into Phoenix table

Posted by "Ciureanu, Constantin (GfK)" <Co...@gfk.com>.
Hi Gabriel,

The processing part is after this code:

		System.out.println(String.format("Connection with driver %s with url %s", PhoenixDriver.class.getName(), jdbcUrl));

		try {
			conn = (PhoenixConnection) DriverManager.getConnection(jdbcUrl, clientInfos);
			stmt = conn.createStatement();
		} catch (SQLException e) {
			throw new RuntimeException(e);
		}

I added some random wait time before this code - and as a result 12 Mappers started to work in parallel (the rest were still blocked for ~10 minutes; upon restart by the MR framework they worked fine).
So one of the lines above was freezing the mapper.

P.S. I will upgrade Phoenix version soon (I use now Phoenix 4.2) - maybe that was fixed already.

Regards,
  Constantin

-----Original Message-----
From: Gabriel Reid [mailto:gabriel.reid@gmail.com] 
Sent: Friday, January 16, 2015 9:38 AM
To: user@phoenix.apache.org
Subject: Re: MapReduce bulk load into Phoenix table

Hi Constantin,

The issues you're having sound like they're (probably) much more related to MapReduce than to Phoenix. In order to first determine what the real issue is, could you give a general overview of how your MR job is implemented (or even better, give me a pointer to it on GitHub or something similar)?

- Gabriel


On Thu, Jan 15, 2015 at 2:19 PM, Ciureanu, Constantin (GfK) <Co...@gfk.com> wrote:
> Hello all,
>
> I finished the MR Job - for now it just failed a few times since the Mappers gave some weird timeout (600 seconds) apparently not processing anything meanwhile.
> When I check the running mappers, just 3 of them are progressing (quite fast however, why just 3 are working? - I have 6 machines, 24 tasks can run in the same time).
>
> Can be this because of some limitation on number of connections to Phoenix?
>
> Regards,
>   Constantin
>
>
> -----Original Message-----
> From: Ciureanu, Constantin (GfK) [mailto:Constantin.Ciureanu@gfk.com]
> Sent: Wednesday, January 14, 2015 9:44 AM
> To: user@phoenix.apache.org
> Subject: RE: MapReduce bulk load into Phoenix table
>
> Hello James,
>
> Yes, as low as 1500 rows /sec -> using Phoenix JDBC with Batch Inserts of 1000 records at once, but there are at least 100 dynamic columns for each row.
> I was expecting higher values of course - but I will finish soon coding a MR job to load the same data using Hadoop.
> The code I read and adapt in my MR job is from your CsvBulkLoadTool. [ After finishing it I will test it then post new speed results.] This is basically using Phoenix connection to "dummy upsert" then takes the Key + List<KV> and rollback the connection - that was my question yesterday if there's no other better way.
> My new problem is that the CsvUpsertExecutor needs a list of fields (which I don't have since the columns are dynamic, I do not use anyway a CSV source).
> So it would have been nice to have a "reusable building block of code" for this - I'm sure everyone needs a fast and clean template code to load data into destination HBase (or Phoenix) Table using Phoenix + MR.
> I can create the row key from concatenating my key fields - but I don't know (yet) how to obtain the salting byte(s).
>
> My current test cluster details:
> - 6x dualcore machines (on AWS)
> - more than 100 TB disk space
> - the table is salted into 8 buckets and has 8 columns common to all 
> rows
>
> Thank you for your answer and technical support on this email-list, 
> Constantin
>
> -----Original Message-----
> From: James Taylor [mailto:jamestaylor@apache.org]
> Sent: Tuesday, January 13, 2015 7:23 PM
> To: user
> Subject: Re: MapReduce bulk load into Phoenix table
>
> Hi Constantin,
> 1000-1500 rows per sec? Using our performance.py script, on my Mac laptop, I'm seeing 27,000 rows per sec (Phoenix 4.2.2 with HBase 0.98.9).
>
> If you want to realistically measure performance, I'd recommend doing 
> so on a real cluster. If you'll really only have a single machine, 
> then you're probably better off using something like MySQL. Using the 
> map-reduce based CSV loader on a single node is not going to speed 
> anything up. For a cluster it can make a difference, though. See 
> http://phoenix.apache.org/phoenix_mr.html
>
> FYI, Phoenix indexes are only maintained if you go through Phoenix APIs.
>
> Thanks,
> James
>
>
> On Tue, Jan 13, 2015 at 2:45 AM, Vaclav Loffelmann <va...@socialbakers.com> wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> I think the easiest way how to determine if indexes are maintained 
>> when inserting directly to HBase is to test it. If it is maintained 
>> by region observer coprocessors, it should. (I'll do tests when as 
>> soon I'll have some time.)
>>
>> I don't see any problem with different cols between multiple rows.
>> Make view same as you'd make table definition. Null values are not 
>> stored at HBase hence theres no overhead.
>>
>> I'm afraid there is not any piece of code (publicly avail) how to do 
>> that, but it is very straight forward.
>> If you use composite primary key, then concat multiple results of
>> PDataType.TYPE.toBytes() as rowkey. For values use same logic. Data 
>> types are defined as enums at this class:
>> org.apache.phoenix.schema.PDataType.
>>
>> Good luck,
>> Vaclav;
>>
>> On 01/13/2015 10:58 AM, Ciureanu, Constantin (GfK) wrote:
>>> Thank you Vaclav,
>>>
>>> I have just started today to write some code :) for MR job that will 
>>> load data into HBase + Phoenix. Previously I wrote some application 
>>> to load data using Phoenix JDBC (slow), but I also have experience 
>>> with HBase so I can understand and write code to load data directly 
>>> there.
>>>
>>> If doing so, I'm also worry about: - maintaining (some existing) 
>>> Phoenix indexes (if any) - perhaps this still works in case the
>>> (same) coprocessors would trigger at insert time, but I cannot know 
>>> how it works behind the scenes. - having the Phoenix view around the 
>>> HBase table would "solve" the above problem (so there's no index
>>> whatsoever) but would create a lot of other problems (my table has a 
>>> limited number of common columns and the rest are too different from 
>>> row to row - in total I have hundreds of possible
>>> columns)
>>>
>>> So - to make things faster for me-  is there any good piece of code 
>>> I can find on the internet about how to map my data types to Phoenix 
>>> data types and use the results as regular HBase Bulk Load?
>>>
>>> Regards, Constantin
>>>
>>> -----Original Message----- From: Vaclav Loffelmann 
>>> [mailto:vaclav.loffelmann@socialbakers.com] Sent: Tuesday, January 
>>> 13, 2015 10:30 AM To: user@phoenix.apache.org Subject: Re:
>>> MapReduce bulk load into Phoenix table
>>>
>>> Hi, our daily usage is to import raw data directly to HBase, but 
>>> mapped to Phoenix data types. And for querying we use Phoenix view 
>>> on top of that HBase table.
>>>
>>> Then you should hit bottleneck of HBase itself. It should be from
>>> 10 to 30+ times faster than your current solution. Depending on HW 
>>> of course.
>>>
>>> I'd prefer this solution for stream writes.
>>>
>>> Vaclav
>>>
>>> On 01/13/2015 10:12 AM, Ciureanu, Constantin (GfK) wrote:
>>>> Hello all,
>>>
>>>> (Due to the slow speed of Phoenix JDBC – single machine ~
>>>> 1000-1500 rows /sec) I am also documenting myself about loading 
>>>> data into Phoenix via MapReduce.
>>>
>>>> So far I understood that the Key + List<[Key,Value]> to be inserted 
>>>> into HBase table is obtained via a “dummy” Phoenix connection – 
>>>> then those rows are stored into HFiles (then after the MR job 
>>>> finishes it is Bulk loading those HFiles normally into HBase).
>>>
>>>> My question: Is there any better / faster approach? I assume this 
>>>> cannot reach the maximum speed to load data into Phoenix / HBase 
>>>> table.
>>>
>>>> Also I would like to find a better / newer sample code than this
>>>> one:
>>>> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.phoenix/phoenix/4.0.0-incubating/org/apache/phoenix/mapreduce/CsvToKeyValueMapper.java#CsvToKeyValueMapper.loadPreUpsertProcessor%28org.apache.hadoop.conf.Configuration%29
>>>
>>>> Thank you, Constantin
>>>
>>>
>> -----BEGIN PGP SIGNATURE-----
>> Version: GnuPG v1
>>
>> iQEcBAEBAgAGBQJUtPc4AAoJEG3mD8Wtuk0WWu0IAIJcEveMvKZbrrf3FY0SRbNx
>> 6V5jF44t+Dl88r1EX9VvKZrWzjp/uN6t8r3H6D5nNLmq0314jOfK68T0q3uaIqOp
>> Ui7UwMRAGkmhhY9pqgZBfWvXsLLJSy9hJU70JSk3RFeynprqPsme9a8CWJi8IfDN
>> G83he0X2avQKQJ72hKeDZX9NzKib9cxNQFKtWDpr2NQat5VnCJkCUGprMcMUxU31
>> vqFg9aQ+b40WN0KFJ3p7cI5tlAuJ5Tz7Ogh+KOZUqTVZ8I5OLwQzqpiwDMD6stRb
>> PJ1gc7LCs64wJghv5TIZpHyXl/3HOgmpYrO+UfGv1S1qzySpM3B1o9ajTbww3L0=
>> =Fdvo
>> -----END PGP SIGNATURE-----

Re: MapReduce bulk load into Phoenix table

Posted by Gabriel Reid <ga...@gmail.com>.
Hi Constantin,

The issues you're having sound like they're (probably) much more
related to MapReduce than to Phoenix. In order to first determine what
the real issue is, could you give a general overview of how your MR
job is implemented (or even better, give me a pointer to it on GitHub
or something similar)?

- Gabriel


On Thu, Jan 15, 2015 at 2:19 PM, Ciureanu, Constantin (GfK)
<Co...@gfk.com> wrote:
> Hello all,
>
> I finished the MR Job - for now it just failed a few times since the Mappers gave some weird timeout (600 seconds) apparently not processing anything meanwhile.
> When I check the running mappers, just 3 of them are progressing (quite fast however, why just 3 are working? - I have 6 machines, 24 tasks can run in the same time).
>
> Can be this because of some limitation on number of connections to Phoenix?
>
> Regards,
>   Constantin
>
>
> -----Original Message-----
> From: Ciureanu, Constantin (GfK) [mailto:Constantin.Ciureanu@gfk.com]
> Sent: Wednesday, January 14, 2015 9:44 AM
> To: user@phoenix.apache.org
> Subject: RE: MapReduce bulk load into Phoenix table
>
> Hello James,
>
> Yes, as low as 1500 rows /sec -> using Phoenix JDBC with Batch Inserts of 1000 records at once, but there are at least 100 dynamic columns for each row.
> I was expecting higher values of course - but I will finish soon coding a MR job to load the same data using Hadoop.
> The code I read and adapt in my MR job is from your CsvBulkLoadTool. [ After finishing it I will test it then post new speed results.] This is basically using Phoenix connection to "dummy upsert" then takes the Key + List<KV> and rollback the connection - that was my question yesterday if there's no other better way.
> My new problem is that the CsvUpsertExecutor needs a list of fields (which I don't have since the columns are dynamic, I do not use anyway a CSV source).
> So it would have been nice to have a "reusable building block of code" for this - I'm sure everyone needs a fast and clean template code to load data into destination HBase (or Phoenix) Table using Phoenix + MR.
> I can create the row key from concatenating my key fields - but I don't know (yet) how to obtain the salting byte(s).
>
> My current test cluster details:
> - 6x dualcore machines (on AWS)
> - more than 100 TB disk space
> - the table is salted into 8 buckets and has 8 columns common to all rows
>
> Thank you for your answer and technical support on this email-list, Constantin
>
> -----Original Message-----
> From: James Taylor [mailto:jamestaylor@apache.org]
> Sent: Tuesday, January 13, 2015 7:23 PM
> To: user
> Subject: Re: MapReduce bulk load into Phoenix table
>
> Hi Constantin,
> 1000-1500 rows per sec? Using our performance.py script, on my Mac laptop, I'm seeing 27,000 rows per sec (Phoenix 4.2.2 with HBase 0.98.9).
>
> If you want to realistically measure performance, I'd recommend doing so on a real cluster. If you'll really only have a single machine, then you're probably better off using something like MySQL. Using the map-reduce based CSV loader on a single node is not going to speed anything up. For a cluster it can make a difference, though. See http://phoenix.apache.org/phoenix_mr.html
>
> FYI, Phoenix indexes are only maintained if you go through Phoenix APIs.
>
> Thanks,
> James
>
>
> On Tue, Jan 13, 2015 at 2:45 AM, Vaclav Loffelmann <va...@socialbakers.com> wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> I think the easiest way how to determine if indexes are maintained
>> when inserting directly to HBase is to test it. If it is maintained by
>> region observer coprocessors, it should. (I'll do tests when as soon
>> I'll have some time.)
>>
>> I don't see any problem with different cols between multiple rows.
>> Make view same as you'd make table definition. Null values are not
>> stored at HBase hence theres no overhead.
>>
>> I'm afraid there is not any piece of code (publicly avail) how to do
>> that, but it is very straight forward.
>> If you use composite primary key, then concat multiple results of
>> PDataType.TYPE.toBytes() as rowkey. For values use same logic. Data
>> types are defined as enums at this class:
>> org.apache.phoenix.schema.PDataType.
>>
>> Good luck,
>> Vaclav;
>>
>> On 01/13/2015 10:58 AM, Ciureanu, Constantin (GfK) wrote:
>>> Thank you Vaclav,
>>>
>>> I have just started today to write some code :) for MR job that will
>>> load data into HBase + Phoenix. Previously I wrote some application
>>> to load data using Phoenix JDBC (slow), but I also have experience
>>> with HBase so I can understand and write code to load data directly
>>> there.
>>>
>>> If doing so, I'm also worry about: - maintaining (some existing)
>>> Phoenix indexes (if any) - perhaps this still works in case the
>>> (same) coprocessors would trigger at insert time, but I cannot know
>>> how it works behind the scenes. - having the Phoenix view around the
>>> HBase table would "solve" the above problem (so there's no index
>>> whatsoever) but would create a lot of other problems (my table has a
>>> limited number of common columns and the rest are too different from
>>> row to row - in total I have hundreds of possible
>>> columns)
>>>
>>> So - to make things faster for me-  is there any good piece of code I
>>> can find on the internet about how to map my data types to Phoenix
>>> data types and use the results as regular HBase Bulk Load?
>>>
>>> Regards, Constantin
>>>
>>> -----Original Message----- From: Vaclav Loffelmann
>>> [mailto:vaclav.loffelmann@socialbakers.com] Sent: Tuesday, January
>>> 13, 2015 10:30 AM To: user@phoenix.apache.org Subject: Re:
>>> MapReduce bulk load into Phoenix table
>>>
>>> Hi, our daily usage is to import raw data directly to HBase, but
>>> mapped to Phoenix data types. And for querying we use Phoenix view on
>>> top of that HBase table.
>>>
>>> Then you should hit bottleneck of HBase itself. It should be from
>>> 10 to 30+ times faster than your current solution. Depending on HW of
>>> course.
>>>
>>> I'd prefer this solution for stream writes.
>>>
>>> Vaclav
>>>
>>> On 01/13/2015 10:12 AM, Ciureanu, Constantin (GfK) wrote:
>>>> Hello all,
>>>
>>>> (Due to the slow speed of Phoenix JDBC – single machine ~
>>>> 1000-1500 rows /sec) I am also documenting myself about loading data
>>>> into Phoenix via MapReduce.
>>>
>>>> So far I understood that the Key + List<[Key,Value]> to be inserted
>>>> into HBase table is obtained via a “dummy” Phoenix connection – then
>>>> those rows are stored into HFiles (then after the MR job finishes it
>>>> is Bulk loading those HFiles normally into HBase).
>>>
>>>> My question: Is there any better / faster approach? I assume this
>>>> cannot reach the maximum speed to load data into Phoenix / HBase
>>>> table.
>>>
>>>> Also I would like to find a better / newer sample code than this
>>>> one:
>>>> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.phoenix/phoenix/4.0.0-incubating/org/apache/phoenix/mapreduce/CsvToKeyValueMapper.java#CsvToKeyValueMapper.loadPreUpsertProcessor%28org.apache.hadoop.conf.Configuration%29
>>>
>>>> Thank you, Constantin
>>>
>>>
>> -----BEGIN PGP SIGNATURE-----
>> Version: GnuPG v1
>>
>> iQEcBAEBAgAGBQJUtPc4AAoJEG3mD8Wtuk0WWu0IAIJcEveMvKZbrrf3FY0SRbNx
>> 6V5jF44t+Dl88r1EX9VvKZrWzjp/uN6t8r3H6D5nNLmq0314jOfK68T0q3uaIqOp
>> Ui7UwMRAGkmhhY9pqgZBfWvXsLLJSy9hJU70JSk3RFeynprqPsme9a8CWJi8IfDN
>> G83he0X2avQKQJ72hKeDZX9NzKib9cxNQFKtWDpr2NQat5VnCJkCUGprMcMUxU31
>> vqFg9aQ+b40WN0KFJ3p7cI5tlAuJ5Tz7Ogh+KOZUqTVZ8I5OLwQzqpiwDMD6stRb
>> PJ1gc7LCs64wJghv5TIZpHyXl/3HOgmpYrO+UfGv1S1qzySpM3B1o9ajTbww3L0=
>> =Fdvo
>> -----END PGP SIGNATURE-----

RE: MapReduce bulk load into Phoenix table

Posted by "Ciureanu, Constantin (GfK)" <Co...@gfk.com>.
Hello all,

I finished the MR job - for now it has just failed a few times because the Mappers hit a weird timeout (600 seconds), apparently without processing anything in the meantime.
When I check the running mappers, just 3 of them are progressing (quite fast, however - so why are just 3 working? I have 6 machines, and 24 tasks can run at the same time).

Could this be because of some limitation on the number of connections to Phoenix?
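
(If the 600-second limit itself becomes a problem again: I assume it is the default mapreduce.task.timeout, and a rough sketch of keeping a slow mapper alive by reporting progress - 'count' and the surrounding class are placeholders - would be:)

    // Sketch: report progress so a long-running task is not killed by
    // mapreduce.task.timeout (600000 ms by default).
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... batch up rows and upsert them ...
        count++;
        if (count % 1000 == 0) {
            context.progress();                              // tell the framework the task is alive
            context.setStatus("processed " + count + " rows");
        }
    }

    // Alternatively the limit itself can be raised on the job:
    // job.getConfiguration().setLong("mapreduce.task.timeout", 30 * 60 * 1000L);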

Regards,
  Constantin


-----Original Message-----
From: Ciureanu, Constantin (GfK) [mailto:Constantin.Ciureanu@gfk.com] 
Sent: Wednesday, January 14, 2015 9:44 AM
To: user@phoenix.apache.org
Subject: RE: MapReduce bulk load into Phoenix table

Hello James,

Yes, as low as 1500 rows /sec -> using Phoenix JDBC with Batch Inserts of 1000 records at once, but there are at least 100 dynamic columns for each row.
I was expecting higher values of course - but I will finish soon coding a MR job to load the same data using Hadoop.
The code I read and adapt in my MR job is from your CsvBulkLoadTool. [ After finishing it I will test it then post new speed results.] This is basically using Phoenix connection to "dummy upsert" then takes the Key + List<KV> and rollback the connection - that was my question yesterday if there's no other better way.
My new problem is that the CsvUpsertExecutor needs a list of fields (which I don't have since the columns are dynamic, I do not use anyway a CSV source).
So it would have been nice to have a "reusable building block of code" for this - I'm sure everyone needs a fast and clean template code to load data into destination HBase (or Phoenix) Table using Phoenix + MR.
I can create the row key from concatenating my key fields - but I don't know (yet) how to obtain the salting byte(s).

My current test cluster details:
- 6x dualcore machines (on AWS)
- more than 100 TB disk space
- the table is salted into 8 buckets and has 8 columns common to all rows

Thank you for your answer and technical support on this email-list, Constantin

-----Original Message-----
From: James Taylor [mailto:jamestaylor@apache.org]
Sent: Tuesday, January 13, 2015 7:23 PM
To: user
Subject: Re: MapReduce bulk load into Phoenix table

Hi Constantin,
1000-1500 rows per sec? Using our performance.py script, on my Mac laptop, I'm seeing 27,000 rows per sec (Phoenix 4.2.2 with HBase 0.98.9).

If you want to realistically measure performance, I'd recommend doing so on a real cluster. If you'll really only have a single machine, then you're probably better off using something like MySQL. Using the map-reduce based CSV loader on a single node is not going to speed anything up. For a cluster it can make a difference, though. See http://phoenix.apache.org/phoenix_mr.html

FYI, Phoenix indexes are only maintained if you go through Phoenix APIs.

Thanks,
James


On Tue, Jan 13, 2015 at 2:45 AM, Vaclav Loffelmann <va...@socialbakers.com> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> I think the easiest way how to determine if indexes are maintained 
> when inserting directly to HBase is to test it. If it is maintained by 
> region observer coprocessors, it should. (I'll do tests when as soon 
> I'll have some time.)
>
> I don't see any problem with different cols between multiple rows.
> Make view same as you'd make table definition. Null values are not 
> stored at HBase hence theres no overhead.
>
> I'm afraid there is not any piece of code (publicly avail) how to do 
> that, but it is very straight forward.
> If you use composite primary key, then concat multiple results of
> PDataType.TYPE.toBytes() as rowkey. For values use same logic. Data 
> types are defined as enums at this class:
> org.apache.phoenix.schema.PDataType.
>
> Good luck,
> Vaclav;
>
> On 01/13/2015 10:58 AM, Ciureanu, Constantin (GfK) wrote:
>> Thank you Vaclav,
>>
>> I have just started today to write some code :) for MR job that will 
>> load data into HBase + Phoenix. Previously I wrote some application 
>> to load data using Phoenix JDBC (slow), but I also have experience 
>> with HBase so I can understand and write code to load data directly 
>> there.
>>
>> If doing so, I'm also worry about: - maintaining (some existing) 
>> Phoenix indexes (if any) - perhaps this still works in case the
>> (same) coprocessors would trigger at insert time, but I cannot know 
>> how it works behind the scenes. - having the Phoenix view around the 
>> HBase table would "solve" the above problem (so there's no index
>> whatsoever) but would create a lot of other problems (my table has a 
>> limited number of common columns and the rest are too different from 
>> row to row - in total I have hundreds of possible
>> columns)
>>
>> So - to make things faster for me-  is there any good piece of code I 
>> can find on the internet about how to map my data types to Phoenix 
>> data types and use the results as regular HBase Bulk Load?
>>
>> Regards, Constantin
>>
>> -----Original Message----- From: Vaclav Loffelmann 
>> [mailto:vaclav.loffelmann@socialbakers.com] Sent: Tuesday, January 
>> 13, 2015 10:30 AM To: user@phoenix.apache.org Subject: Re:
>> MapReduce bulk load into Phoenix table
>>
>> Hi, our daily usage is to import raw data directly to HBase, but 
>> mapped to Phoenix data types. And for querying we use Phoenix view on 
>> top of that HBase table.
>>
>> Then you should hit bottleneck of HBase itself. It should be from
>> 10 to 30+ times faster than your current solution. Depending on HW of 
>> course.
>>
>> I'd prefer this solution for stream writes.
>>
>> Vaclav
>>
>> On 01/13/2015 10:12 AM, Ciureanu, Constantin (GfK) wrote:
>>> Hello all,
>>
>>> (Due to the slow speed of Phoenix JDBC – single machine ~
>>> 1000-1500 rows /sec) I am also documenting myself about loading data 
>>> into Phoenix via MapReduce.
>>
>>> So far I understood that the Key + List<[Key,Value]> to be inserted 
>>> into HBase table is obtained via a “dummy” Phoenix connection – then 
>>> those rows are stored into HFiles (then after the MR job finishes it 
>>> is Bulk loading those HFiles normally into HBase).
>>
>>> My question: Is there any better / faster approach? I assume this 
>>> cannot reach the maximum speed to load data into Phoenix / HBase 
>>> table.
>>
>>> Also I would like to find a better / newer sample code than this
>>> one:
>>> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.phoenix/phoenix/4.0.0-incubating/org/apache/phoenix/mapreduce/CsvToKeyValueMapper.java#CsvToKeyValueMapper.loadPreUpsertProcessor%28org.apache.hadoop.conf.Configuration%29
>>
>>> Thank you, Constantin
>>
>>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1
>
> iQEcBAEBAgAGBQJUtPc4AAoJEG3mD8Wtuk0WWu0IAIJcEveMvKZbrrf3FY0SRbNx
> 6V5jF44t+Dl88r1EX9VvKZrWzjp/uN6t8r3H6D5nNLmq0314jOfK68T0q3uaIqOp
> Ui7UwMRAGkmhhY9pqgZBfWvXsLLJSy9hJU70JSk3RFeynprqPsme9a8CWJi8IfDN
> G83he0X2avQKQJ72hKeDZX9NzKib9cxNQFKtWDpr2NQat5VnCJkCUGprMcMUxU31
> vqFg9aQ+b40WN0KFJ3p7cI5tlAuJ5Tz7Ogh+KOZUqTVZ8I5OLwQzqpiwDMD6stRb
> PJ1gc7LCs64wJghv5TIZpHyXl/3HOgmpYrO+UfGv1S1qzySpM3B1o9ajTbww3L0=
> =Fdvo
> -----END PGP SIGNATURE-----

RE: MapReduce bulk load into Phoenix table

Posted by "Ciureanu, Constantin (GfK)" <Co...@gfk.com>.
Hello James,

Yes, as low as 1500 rows/sec -> using Phoenix JDBC with batch inserts of 1000 records at a time, but there are at least 100 dynamic columns in each row.
I was expecting higher values of course - but I will soon finish coding an MR job to load the same data using Hadoop.
The code I read and adapted for my MR job is from your CsvBulkLoadTool. [After finishing it I will test it and then post new speed results.]
This basically uses a Phoenix connection to do a "dummy upsert", then takes the Key + List<KV> and rolls back the connection - that was my question yesterday, whether there is a better way.
My new problem is that the CsvUpsertExecutor needs a list of fields (which I don't have since the columns are dynamic, and I don't use a CSV source anyway).
So it would have been nice to have a "reusable building block of code" for this - I'm sure everyone needs fast and clean template code to load data into a destination HBase (or Phoenix) table using Phoenix + MR.
I can create the row key by concatenating my key fields - but I don't know (yet) how to obtain the salting byte(s).
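
What I have in mind, assuming org.apache.phoenix.schema.SaltingUtil.getSaltingByte() computes the salt byte the same way the salted write path does (the key-building part below is a placeholder, 8 is my SALT_BUCKETS):

    import org.apache.phoenix.schema.SaltingUtil;

    // Sketch - buildRowKeyWithoutSalt() is a placeholder for concatenating my own key fields.
    byte[] key = buildRowKeyWithoutSalt();
    byte[] salted = new byte[key.length + 1];
    System.arraycopy(key, 0, salted, 1, key.length);   // byte 0 is reserved for the salt
    salted[0] = SaltingUtil.getSaltingByte(salted, 1, key.length, 8);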

My current test cluster details:
- 6x dualcore machines (on AWS)
- more than 100 TB disk space
- the table is salted into 8 buckets and has 8 columns common to all rows

Thank you for your answer and technical support on this email-list,
Constantin

-----Original Message-----
From: James Taylor [mailto:jamestaylor@apache.org] 
Sent: Tuesday, January 13, 2015 7:23 PM
To: user
Subject: Re: MapReduce bulk load into Phoenix table

Hi Constantin,
1000-1500 rows per sec? Using our performance.py script, on my Mac laptop, I'm seeing 27,000 rows per sec (Phoenix 4.2.2 with HBase 0.98.9).

If you want to realistically measure performance, I'd recommend doing so on a real cluster. If you'll really only have a single machine, then you're probably better off using something like MySQL. Using the map-reduce based CSV loader on a single node is not going to speed anything up. For a cluster it can make a difference, though. See http://phoenix.apache.org/phoenix_mr.html

FYI, Phoenix indexes are only maintained if you go through Phoenix APIs.

Thanks,
James


On Tue, Jan 13, 2015 at 2:45 AM, Vaclav Loffelmann <va...@socialbakers.com> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> I think the easiest way how to determine if indexes are maintained 
> when inserting directly to HBase is to test it. If it is maintained by 
> region observer coprocessors, it should. (I'll do tests when as soon 
> I'll have some time.)
>
> I don't see any problem with different cols between multiple rows.
> Make view same as you'd make table definition. Null values are not 
> stored at HBase hence theres no overhead.
>
> I'm afraid there is not any piece of code (publicly avail) how to do 
> that, but it is very straight forward.
> If you use composite primary key, then concat multiple results of
> PDataType.TYPE.toBytes() as rowkey. For values use same logic. Data 
> types are defined as enums at this class:
> org.apache.phoenix.schema.PDataType.
>
> Good luck,
> Vaclav;
>
> On 01/13/2015 10:58 AM, Ciureanu, Constantin (GfK) wrote:
>> Thank you Vaclav,
>>
>> I have just started today to write some code :) for MR job that will 
>> load data into HBase + Phoenix. Previously I wrote some application 
>> to load data using Phoenix JDBC (slow), but I also have experience 
>> with HBase so I can understand and write code to load data directly 
>> there.
>>
>> If doing so, I'm also worry about: - maintaining (some existing) 
>> Phoenix indexes (if any) - perhaps this still works in case the
>> (same) coprocessors would trigger at insert time, but I cannot know 
>> how it works behind the scenes. - having the Phoenix view around the 
>> HBase table would "solve" the above problem (so there's no index 
>> whatsoever) but would create a lot of other problems (my table has a 
>> limited number of common columns and the rest are too different from 
>> row to row - in total I have hundreds of possible
>> columns)
>>
>> So - to make things faster for me-  is there any good piece of code I 
>> can find on the internet about how to map my data types to Phoenix 
>> data types and use the results as regular HBase Bulk Load?
>>
>> Regards, Constantin
>>
>> -----Original Message----- From: Vaclav Loffelmann 
>> [mailto:vaclav.loffelmann@socialbakers.com] Sent: Tuesday, January 
>> 13, 2015 10:30 AM To: user@phoenix.apache.org Subject: Re:
>> MapReduce bulk load into Phoenix table
>>
>> Hi, our daily usage is to import raw data directly to HBase, but 
>> mapped to Phoenix data types. And for querying we use Phoenix view on 
>> top of that HBase table.
>>
>> Then you should hit bottleneck of HBase itself. It should be from
>> 10 to 30+ times faster than your current solution. Depending on HW of 
>> course.
>>
>> I'd prefer this solution for stream writes.
>>
>> Vaclav
>>
>> On 01/13/2015 10:12 AM, Ciureanu, Constantin (GfK) wrote:
>>> Hello all,
>>
>>> (Due to the slow speed of Phoenix JDBC – single machine ~
>>> 1000-1500 rows /sec) I am also documenting myself about loading data 
>>> into Phoenix via MapReduce.
>>
>>> So far I understood that the Key + List<[Key,Value]> to be inserted 
>>> into HBase table is obtained via a “dummy” Phoenix connection – then 
>>> those rows are stored into HFiles (then after the MR job finishes it 
>>> is Bulk loading those HFiles normally into HBase).
>>
>>> My question: Is there any better / faster approach? I assume this  
>>> cannot reach the maximum speed to load data into Phoenix / HBase  
>>> table.
>>
>>> Also I would like to find a better / newer sample code than this
>>> one:
>>> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.phoenix/phoenix/4.0.0-incubating/org/apache/phoenix/mapreduce/CsvToKeyValueMapper.java#CsvToKeyValueMapper.loadPreUpsertProcessor%28org.apache.hadoop.conf.Configuration%29
>>
>>> Thank you, Constantin
>>
>>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1
>
> iQEcBAEBAgAGBQJUtPc4AAoJEG3mD8Wtuk0WWu0IAIJcEveMvKZbrrf3FY0SRbNx
> 6V5jF44t+Dl88r1EX9VvKZrWzjp/uN6t8r3H6D5nNLmq0314jOfK68T0q3uaIqOp
> Ui7UwMRAGkmhhY9pqgZBfWvXsLLJSy9hJU70JSk3RFeynprqPsme9a8CWJi8IfDN
> G83he0X2avQKQJ72hKeDZX9NzKib9cxNQFKtWDpr2NQat5VnCJkCUGprMcMUxU31
> vqFg9aQ+b40WN0KFJ3p7cI5tlAuJ5Tz7Ogh+KOZUqTVZ8I5OLwQzqpiwDMD6stRb
> PJ1gc7LCs64wJghv5TIZpHyXl/3HOgmpYrO+UfGv1S1qzySpM3B1o9ajTbww3L0=
> =Fdvo
> -----END PGP SIGNATURE-----

Re: MapReduce bulk load into Phoenix table

Posted by James Taylor <ja...@apache.org>.
Hi Constantin,
1000-1500 rows per sec? Using our performance.py script, on my Mac
laptop, I'm seeing 27,000 rows per sec (Phoenix 4.2.2 with HBase
0.98.9).

If you want to realistically measure performance, I'd recommend doing
so on a real cluster. If you'll really only have a single machine,
then you're probably better off using something like MySQL. Using the
map-reduce based CSV loader on a single node is not going to speed
anything up. For a cluster it can make a difference, though. See
http://phoenix.apache.org/phoenix_mr.html

FYI, Phoenix indexes are only maintained if you go through Phoenix APIs.

Thanks,
James


On Tue, Jan 13, 2015 at 2:45 AM, Vaclav Loffelmann
<va...@socialbakers.com> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> I think the easiest way how to determine if indexes are maintained
> when inserting directly to HBase is to test it. If it is maintained by
> region observer coprocessors, it should. (I'll do tests when as soon
> I'll have some time.)
>
> I don't see any problem with different cols between multiple rows.
> Make view same as you'd make table definition. Null values are not
> stored at HBase hence theres no overhead.
>
> I'm afraid there is not any piece of code (publicly avail) how to do
> that, but it is very straight forward.
> If you use composite primary key, then concat multiple results of
> PDataType.TYPE.toBytes() as rowkey. For values use same logic. Data
> types are defined as enums at this class:
> org.apache.phoenix.schema.PDataType.
>
> Good luck,
> Vaclav;
>
> On 01/13/2015 10:58 AM, Ciureanu, Constantin (GfK) wrote:
>> Thank you Vaclav,
>>
>> I have just started today to write some code :) for MR job that
>> will load data into HBase + Phoenix. Previously I wrote some
>> application to load data using Phoenix JDBC (slow), but I also have
>> experience with HBase so I can understand and write code to load
>> data directly there.
>>
>> If doing so, I'm also worry about: - maintaining (some existing)
>> Phoenix indexes (if any) - perhaps this still works in case the
>> (same) coprocessors would trigger at insert time, but I cannot know
>> how it works behind the scenes. - having the Phoenix view around
>> the HBase table would "solve" the above problem (so there's no
>> index whatsoever) but would create a lot of other problems (my
>> table has a limited number of common columns and the rest are too
>> different from row to row - in total I have hundreds of possible
>> columns)
>>
>> So - to make things faster for me-  is there any good piece of code
>> I can find on the internet about how to map my data types to
>> Phoenix data types and use the results as regular HBase Bulk Load?
>>
>> Regards, Constantin
>>
>> -----Original Message----- From: Vaclav Loffelmann
>> [mailto:vaclav.loffelmann@socialbakers.com] Sent: Tuesday, January
>> 13, 2015 10:30 AM To: user@phoenix.apache.org Subject: Re:
>> MapReduce bulk load into Phoenix table
>>
>> Hi, our daily usage is to import raw data directly to HBase, but
>> mapped to Phoenix data types. And for querying we use Phoenix view
>> on top of that HBase table.
>>
>> Then you should hit bottleneck of HBase itself. It should be from
>> 10 to 30+ times faster than your current solution. Depending on HW
>> of course.
>>
>> I'd prefer this solution for stream writes.
>>
>> Vaclav
>>
>> On 01/13/2015 10:12 AM, Ciureanu, Constantin (GfK) wrote:
>>> Hello all,
>>
>>> (Due to the slow speed of Phoenix JDBC – single machine ~
>>> 1000-1500 rows /sec) I am also documenting myself about loading
>>> data into Phoenix via MapReduce.
>>
>>> So far I understood that the Key + List<[Key,Value]> to be
>>> inserted into HBase table is obtained via a “dummy” Phoenix
>>> connection – then those rows are stored into HFiles (then after
>>> the MR job finishes it is Bulk loading those HFiles normally into
>>> HBase).
>>
>>> My question: Is there any better / faster approach? I assume this
>>>  cannot reach the maximum speed to load data into Phoenix / HBase
>>>  table.
>>
>>> Also I would like to find a better / newer sample code than this
>>> one:
>>> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.phoenix/phoenix/4.0.0-incubating/org/apache/phoenix/mapreduce/CsvToKeyValueMapper.java#CsvToKeyValueMapper.loadPreUpsertProcessor%28org.apache.hadoop.conf.Configuration%29
>>
>>> Thank you, Constantin
>>
>>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1
>
> iQEcBAEBAgAGBQJUtPc4AAoJEG3mD8Wtuk0WWu0IAIJcEveMvKZbrrf3FY0SRbNx
> 6V5jF44t+Dl88r1EX9VvKZrWzjp/uN6t8r3H6D5nNLmq0314jOfK68T0q3uaIqOp
> Ui7UwMRAGkmhhY9pqgZBfWvXsLLJSy9hJU70JSk3RFeynprqPsme9a8CWJi8IfDN
> G83he0X2avQKQJ72hKeDZX9NzKib9cxNQFKtWDpr2NQat5VnCJkCUGprMcMUxU31
> vqFg9aQ+b40WN0KFJ3p7cI5tlAuJ5Tz7Ogh+KOZUqTVZ8I5OLwQzqpiwDMD6stRb
> PJ1gc7LCs64wJghv5TIZpHyXl/3HOgmpYrO+UfGv1S1qzySpM3B1o9ajTbww3L0=
> =Fdvo
> -----END PGP SIGNATURE-----

Re: MapReduce bulk load into Phoenix table

Posted by Vaclav Loffelmann <va...@socialbakers.com>.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I think the easiest way to determine whether indexes are maintained
when inserting directly into HBase is to test it. If they are maintained by
region observer coprocessors, they should be. (I'll run tests as soon as
I have some time.)

I don't see any problem with different columns across rows.
Make the view the same way you would make the table definition. Null values are not
stored in HBase, hence there is no overhead.

I'm afraid there isn't any (publicly available) piece of code showing how to do
that, but it is very straightforward.
If you use a composite primary key, concatenate the results of
PDataType.TYPE.toBytes() to form the rowkey. For values, use the same logic. The data
types are defined as enums in the class
org.apache.phoenix.schema.PDataType.
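
A minimal sketch, with a placeholder composite key of (VARCHAR, UNSIGNED_LONG):

    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.phoenix.schema.PDataType;

    // Sketch - placeholder key layout; note that Phoenix normally terminates
    // variable-length PK parts that are not last with a zero separator byte.
    byte[] part1 = PDataType.VARCHAR.toBytes("customer-42");
    byte[] part2 = PDataType.UNSIGNED_LONG.toBytes(20150113L);
    byte[] rowkey = Bytes.add(part1, new byte[] { 0 }, part2);

    // Values use the same logic, e.g. a VARCHAR column value:
    byte[] colValue = PDataType.VARCHAR.toBytes("some value");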

Good luck,
Vaclav;

On 01/13/2015 10:58 AM, Ciureanu, Constantin (GfK) wrote:
> Thank you Vaclav,
> 
> I have just started today to write some code :) for MR job that
> will load data into HBase + Phoenix. Previously I wrote some
> application to load data using Phoenix JDBC (slow), but I also have
> experience with HBase so I can understand and write code to load
> data directly there.
> 
> If doing so, I'm also worry about: - maintaining (some existing)
> Phoenix indexes (if any) - perhaps this still works in case the
> (same) coprocessors would trigger at insert time, but I cannot know
> how it works behind the scenes. - having the Phoenix view around
> the HBase table would "solve" the above problem (so there's no
> index whatsoever) but would create a lot of other problems (my
> table has a limited number of common columns and the rest are too
> different from row to row - in total I have hundreds of possible
> columns)
> 
> So - to make things faster for me-  is there any good piece of code
> I can find on the internet about how to map my data types to
> Phoenix data types and use the results as regular HBase Bulk Load?
> 
> Regards, Constantin
> 
> -----Original Message----- From: Vaclav Loffelmann
> [mailto:vaclav.loffelmann@socialbakers.com] Sent: Tuesday, January
> 13, 2015 10:30 AM To: user@phoenix.apache.org Subject: Re:
> MapReduce bulk load into Phoenix table
> 
> Hi, our daily usage is to import raw data directly to HBase, but
> mapped to Phoenix data types. And for querying we use Phoenix view
> on top of that HBase table.
> 
> Then you should hit bottleneck of HBase itself. It should be from
> 10 to 30+ times faster than your current solution. Depending on HW
> of course.
> 
> I'd prefer this solution for stream writes.
> 
> Vaclav
> 
> On 01/13/2015 10:12 AM, Ciureanu, Constantin (GfK) wrote:
>> Hello all,
> 
>> (Due to the slow speed of Phoenix JDBC – single machine ~
>> 1000-1500 rows /sec) I am also documenting myself about loading
>> data into Phoenix via MapReduce.
> 
>> So far I understood that the Key + List<[Key,Value]> to be
>> inserted into HBase table is obtained via a “dummy” Phoenix
>> connection – then those rows are stored into HFiles (then after
>> the MR job finishes it is Bulk loading those HFiles normally into
>> HBase).
> 
>> My question: Is there any better / faster approach? I assume this
>>  cannot reach the maximum speed to load data into Phoenix / HBase
>>  table.
> 
>> Also I would like to find a better / newer sample code than this 
>> one: 
>> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.phoenix/phoenix/4.0.0-incubating/org/apache/phoenix/mapreduce/CsvToKeyValueMapper.java#CsvToKeyValueMapper.loadPreUpsertProcessor%28org.apache.hadoop.conf.Configuration%29
> 
>> Thank you, Constantin
> 
> 
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBAgAGBQJUtPc4AAoJEG3mD8Wtuk0WWu0IAIJcEveMvKZbrrf3FY0SRbNx
6V5jF44t+Dl88r1EX9VvKZrWzjp/uN6t8r3H6D5nNLmq0314jOfK68T0q3uaIqOp
Ui7UwMRAGkmhhY9pqgZBfWvXsLLJSy9hJU70JSk3RFeynprqPsme9a8CWJi8IfDN
G83he0X2avQKQJ72hKeDZX9NzKib9cxNQFKtWDpr2NQat5VnCJkCUGprMcMUxU31
vqFg9aQ+b40WN0KFJ3p7cI5tlAuJ5Tz7Ogh+KOZUqTVZ8I5OLwQzqpiwDMD6stRb
PJ1gc7LCs64wJghv5TIZpHyXl/3HOgmpYrO+UfGv1S1qzySpM3B1o9ajTbww3L0=
=Fdvo
-----END PGP SIGNATURE-----

RE: MapReduce bulk load into Phoenix table

Posted by "Ciureanu, Constantin (GfK)" <Co...@gfk.com>.
Thank you Vaclav,

I just started writing some code today :) for an MR job that will load data into HBase + Phoenix.
Previously I wrote an application to load data using Phoenix JDBC (slow), but I also have experience with HBase, so I can understand and write code to load data directly there.

If I do so, I'm also worried about:
- maintaining (some existing) Phoenix indexes (if any) - perhaps this still works if the (same) coprocessors trigger at insert time, but I don't know how it works behind the scenes.
- having the Phoenix view around the HBase table would "solve" the above problem (there would simply be no index whatsoever) but would create a lot of other problems (my table has a limited number of common columns and the rest differ too much from row to row - in total I have hundreds of possible columns).

So - to make things faster for me - is there any good piece of code I can find on the internet showing how to map my data types to Phoenix data types and use the results in a regular HBase bulk load?

Regards,
  Constantin

-----Original Message-----
From: Vaclav Loffelmann [mailto:vaclav.loffelmann@socialbakers.com] 
Sent: Tuesday, January 13, 2015 10:30 AM
To: user@phoenix.apache.org
Subject: Re: MapReduce bulk load into Phoenix table

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi,
our daily usage is to import raw data directly into HBase, but mapped to Phoenix data types. For querying we use a Phoenix view on top of that HBase table.

Then you should hit the bottleneck of HBase itself. It should be 10 to 30+ times faster than your current solution, depending on hardware of course.

I'd prefer this solution for stream writes.

Vaclav

On 01/13/2015 10:12 AM, Ciureanu, Constantin (GfK) wrote:
> Hello all,
> 
> (Due to the slow speed of Phoenix JDBC – single machine ~ 1000-1500 
> rows /sec) I am also documenting myself about loading data into 
> Phoenix via MapReduce.
> 
> So far I understood that the Key + List<[Key,Value]> to be inserted 
> into HBase table is obtained via a “dummy” Phoenix connection – then 
> those rows are stored into HFiles (then after the MR job finishes it 
> is Bulk loading those HFiles normally into HBase).
> 
> My question: Is there any better / faster approach? I assume this 
> cannot reach the maximum speed to load data into Phoenix / HBase 
> table.
> 
> Also I would like to find a better / newer sample code than this
> one: 
> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.phoenix/phoenix/4.0.0-incubating/org/apache/phoenix/mapreduce/CsvToKeyValueMapper.java#CsvToKeyValueMapper.loadPreUpsertProcessor%28org.apache.hadoop.conf.Configuration%29
>
>  Thank you, Constantin
> 
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBAgAGBQJUtOWVAAoJEG3mD8Wtuk0WIwsIAI/P4DJ9fcVQlmwSCGbLjxsI
5gm2grAPe7kMXewc74GBKN56bAwi8vkg54pZW7ymp3hp1L9LlXa/iHhuUApwE24W
eZ3kArdhXbgK1KGYItjmGCGTypKM3HZ/8HlzljKMzaRsOkqDcsg0JdldeXYbZ7vW
MO58IBBjiyx8sGAN1x757ZimoUzcoDN/lMP9ypsKu9m9GmAEv87h7twMkkGLAl47
W9J9rjoCHDJqMlNZMy5gUBDdZWqtHYNWOsG0Q3s/rbwb4hTCsCwQiCBAjmZt7Nea
Wzgfr53WFeXWQ2LYFqqeWbbs5hdCJ3hfTew0gW4wpjzzsi5TocVcQow3cOTW3/E=
=dy7R
-----END PGP SIGNATURE-----

Re: MapReduce bulk load into Phoenix table

Posted by Vaclav Loffelmann <va...@socialbakers.com>.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi,
our daily usage is to import raw data directly into HBase, but mapped to
Phoenix data types. For querying we use a Phoenix view on top of
that HBase table.

Then you should hit the bottleneck of HBase itself. It should be 10
to 30+ times faster than your current solution, depending on hardware of course.
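
A rough sketch of such a write (HBase 0.98-era client API; the table, family and qualifier names are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.phoenix.schema.PDataType;

    // Sketch: write directly to HBase with Phoenix-encoded bytes; a Phoenix
    // "CREATE VIEW ..." over the same table then makes the row queryable.
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "RAW_EVENTS");
    byte[] rowkey = PDataType.VARCHAR.toBytes("event-1");
    Put put = new Put(rowkey);
    put.add(Bytes.toBytes("0"), Bytes.toBytes("VALUE"), PDataType.UNSIGNED_LONG.toBytes(42L));
    table.put(put);
    table.close();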

I'd prefer this solution for stream writes.

Vaclav

On 01/13/2015 10:12 AM, Ciureanu, Constantin (GfK) wrote:
> Hello all,
> 
> (Due to the slow speed of Phoenix JDBC – single machine ~ 1000-1500
> rows /sec) I am also documenting myself about loading data into
> Phoenix via MapReduce.
> 
> So far I understood that the Key + List<[Key,Value]> to be inserted
> into HBase table is obtained via a “dummy” Phoenix connection –
> then those rows are stored into HFiles (then after the MR job
> finishes it is Bulk loading those HFiles normally into HBase).
> 
> My question: Is there any better / faster approach? I assume this
> cannot reach the maximum speed to load data into Phoenix / HBase
> table.
> 
> Also I would like to find a better / newer sample code than this
> one: 
> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.phoenix/phoenix/4.0.0-incubating/org/apache/phoenix/mapreduce/CsvToKeyValueMapper.java#CsvToKeyValueMapper.loadPreUpsertProcessor%28org.apache.hadoop.conf.Configuration%29
>
>  Thank you, Constantin
> 
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBAgAGBQJUtOWVAAoJEG3mD8Wtuk0WIwsIAI/P4DJ9fcVQlmwSCGbLjxsI
5gm2grAPe7kMXewc74GBKN56bAwi8vkg54pZW7ymp3hp1L9LlXa/iHhuUApwE24W
eZ3kArdhXbgK1KGYItjmGCGTypKM3HZ/8HlzljKMzaRsOkqDcsg0JdldeXYbZ7vW
MO58IBBjiyx8sGAN1x757ZimoUzcoDN/lMP9ypsKu9m9GmAEv87h7twMkkGLAl47
W9J9rjoCHDJqMlNZMy5gUBDdZWqtHYNWOsG0Q3s/rbwb4hTCsCwQiCBAjmZt7Nea
Wzgfr53WFeXWQ2LYFqqeWbbs5hdCJ3hfTew0gW4wpjzzsi5TocVcQow3cOTW3/E=
=dy7R
-----END PGP SIGNATURE-----

Re: Re: MapReduce bulk load into Phoenix table

Posted by "sunfl@certusnet.com.cn" <su...@certusnet.com.cn>.
Yes, I know exactly what HBase bulkload does, and we are applying the schema to bulk load into Phoenix.
Just to clarify, if the WAL is enabled for the Phoenix tables being bulk loaded, the loading performance will be
very poor. So disabling the WAL is an option for better loading performance.
Correct me if I am wrong.

Regards,
Sun.





CertusNet 

From: Nick Dimiduk
Date: 2015-01-14 02:50
To: user
Subject: Re: MapReduce bulk load into Phoenix table
On Tue, Jan 13, 2015 at 1:29 AM, sunfl@certusnet.com.cn <su...@certusnet.com.cn> wrote:
As far as I know, bulk loading into phoenix or hbase may be affected by several conditions, like wal enabled or numbers of split regions. 

Bulkloading in HBase does not go through the WAL, it's using the HFileOutputFormat to write HFiles directly. Region splits will have some impact on bulkload, but not in the same way as it does with online writes.

I agree with James -- it seems your host is very underpowered or your underlying cluster installation is not configured correctly. Please consider profiling the individual steps in isolation so as to better identify the bottleneck.

From: Ciureanu, Constantin (GfK)
Date: 2015-01-13 17:12
To: user@phoenix.apache.org
Subject: MapReduce bulk load into Phoenix table
Hello all,
 
(Due to the slow speed of Phoenix JDBC – single machine ~ 1000-1500 rows /sec) I am also documenting myself about loading data into Phoenix via MapReduce.
 
So far I understood that the Key + List<[Key,Value]> to be inserted into HBase table is obtained via a “dummy” Phoenix connection – then those rows are stored into HFiles (then after the MR job finishes it is Bulk loading those HFiles normally into HBase).
 
My question: Is there any better / faster approach? I assume this cannot reach the maximum speed to load data into Phoenix / HBase table.
   
Also I would like to find a better / newer sample code than this one:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.phoenix/phoenix/4.0.0-incubating/org/apache/phoenix/mapreduce/CsvToKeyValueMapper.java#CsvToKeyValueMapper.loadPreUpsertProcessor%28org.apache.hadoop.conf.Configuration%29
 
Thank you,
   Constantin


Re: MapReduce bulk load into Phoenix table

Posted by Nick Dimiduk <nd...@gmail.com>.
On Tue, Jan 13, 2015 at 1:29 AM, sunfl@certusnet.com.cn <
sunfl@certusnet.com.cn> wrote:

> As far as I know, bulk loading into phoenix or hbase may be affected by
> several conditions, like wal enabled or numbers of split regions.
>

Bulkloading in HBase does not go through the WAL; it uses the
HFileOutputFormat to write HFiles directly. Region splits will have some
impact on bulkload, but not in the same way as they do with online writes.
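
A minimal sketch of that flow (HBase 0.98-era API; the table name and output path are placeholders, and the mapper/input setup is omitted):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Sketch: the job writes HFiles directly (no WAL involved), then the finished
    // files are handed to the region servers in one step.
    static void runBulkLoad(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "bulk load");
        // mapper / input format setup omitted; the mapper emits (ImmutableBytesWritable, KeyValue)
        HTable table = new HTable(conf, "MY_TABLE");
        HFileOutputFormat.configureIncrementalLoad(job, table);   // reducer, partitioner, output format
        FileOutputFormat.setOutputPath(job, new Path("/tmp/hfiles"));
        if (job.waitForCompletion(true)) {
            new LoadIncrementalHFiles(conf).doBulkLoad(new Path("/tmp/hfiles"), table);
        }
    }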

I agree with James -- it seems your host is very underpowered or your
underlying cluster installation is not configured correctly. Please
consider profiling the individual steps in isolation so as to better
identify the bottleneck.

>
> *From:* Ciureanu, Constantin (GfK) <Co...@gfk.com>
> *Date:* 2015-01-13 17:12
> *To:* user@phoenix.apache.org
> *Subject:* MapReduce bulk load into Phoenix table
>
> Hello all,
>
>
>
> (Due to the slow speed of Phoenix JDBC – single machine ~ 1000-1500 rows
> /sec) I am also documenting myself about loading data into Phoenix via
> MapReduce.
>
>
>
> So far I understood that the Key + List<[Key,Value]> to be inserted into
> HBase table is obtained via a “dummy” Phoenix connection – then those rows
> are stored into HFiles (then after the MR job finishes it is Bulk loading
> those HFiles normally into HBase).
>
>
>
> My question: Is there any better / faster approach? I assume this cannot
> reach the maximum speed to load data into Phoenix / HBase table.
>
>
>
> Also I would like to find a better / newer sample code than this one:
>
>
> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.phoenix/phoenix/4.0.0-incubating/org/apache/phoenix/mapreduce/CsvToKeyValueMapper.java#CsvToKeyValueMapper.loadPreUpsertProcessor%28org.apache.hadoop.conf.Configuration%29
>
>
>
> Thank you,
>
>    Constantin
>
>

Re: MapReduce bulk load into Phoenix table

Posted by "sunfl@certusnet.com.cn" <su...@certusnet.com.cn>.
Hi Constantin,
You can try using Apache Spark to do the bulk load job instead of MapReduce. As far as I know, bulk loading into Phoenix or HBase may be affected by several conditions, such as whether the WAL is enabled or the number of split regions. Your HBase or Phoenix configuration parameters may also influence bulk loading performance.

You can share more about your specific data loading setup and I can help you with some tuning work.

You can share more about your specific data loading information and I can help you do some tuning work.

Thanks,
Sun.





CertusNet 

From: Ciureanu, Constantin (GfK)
Date: 2015-01-13 17:12
To: user@phoenix.apache.org
Subject: MapReduce bulk load into Phoenix table
Hello all,
 
(Due to the slow speed of Phoenix JDBC – single machine ~ 1000-1500 rows /sec) I am also documenting myself about loading data into Phoenix via MapReduce.
 
So far I understood that the Key + List<[Key,Value]> to be inserted into HBase table is obtained via a “dummy” Phoenix connection – then those rows are stored into HFiles (then after the MR job finishes it is Bulk loading those HFiles normally into HBase).
 
My question: Is there any better / faster approach? I assume this cannot reach the maximum speed to load data into Phoenix / HBase table.
   
Also I would like to find a better / newer sample code than this one:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.phoenix/phoenix/4.0.0-incubating/org/apache/phoenix/mapreduce/CsvToKeyValueMapper.java#CsvToKeyValueMapper.loadPreUpsertProcessor%28org.apache.hadoop.conf.Configuration%29
 
Thank you,
   Constantin