Posted to user@phoenix.apache.org by Kiru Pakkirisamy <ki...@yahoo.com> on 2015/04/24 02:12:06 UTC

CsvBulkLoadTool question

Hi,
We are trying to load a large number of rows (100-200M) into a table and benchmark it against Hive. We pretty much used the CsvBulkLoadTool as documented, but now, after completion, HBase is still in 'minor compaction' for quite a number of hours. (Also, we see only one region in the table.) A select count on this table does not seem to complete. Any ideas on how to proceed?
Regards,
- kiru

Re: CsvBulkLoadTool question

Posted by Kiru Pakkirisamy <ki...@yahoo.com>.
James,
Thanks. We are having good success with SALT_BUCKETS (thanks Gabriel). And we are not sure what to split on. (BTW, can both of these be used together?)
Regards,
- kiru
      From: James Taylor <ja...@apache.org>
 To: user <us...@phoenix.apache.org>; Kiru Pakkirisamy <ki...@yahoo.com> 
 Sent: Friday, April 24, 2015 10:04 AM
 Subject: Re: CsvBulkLoadTool question
   
Another option, Kiru, is to use the SPLIT ON (...) clause at the end
of your CREATE TABLE call. This will cause your table to be pre-split
without salting it.



On Fri, Apr 24, 2015 at 9:47 AM, Kiru Pakkirisamy
<ki...@yahoo.com> wrote:
> Gabriel,
> Thanks for the tip, I will retry with the SALT_BUCKETS option.
>
> Regards,
> - kiru
> ________________________________
> From: Gabriel Reid <ga...@gmail.com>
> To: user@phoenix.apache.org; Kiru Pakkirisamy <ki...@yahoo.com>
> Sent: Thursday, April 23, 2015 11:57 PM
> Subject: Re: CsvBulkLoadTool question
>
> Hi Kiru,
>
> The CSV bulk loader won't automatically make multiple regions for you, it
> simply loads data into the existing regions of the table. In your case, it
> means that all data has been loaded into a single region (as you're seeing),
> which means that any kind of operations that scan over a large number of
> rows (such as a "select count") will be very slow.
>
> I would recommend pre-splitting your table before running the bulk load
> tool. If you're creating the table directly in Phoenix, you can supply the
> SALT_BUCKETS table option [1] when creating the table.
>
> - Gabriel
>
> 1. http://phoenix.apache.org/language/index.html#options
>
>
>
> On Fri, Apr 24, 2015 at 2:15 AM Kiru Pakkirisamy <ki...@yahoo.com>
> wrote:
>
> Hi,
> We are trying to load large number of rows (100/200M) into a table and
> benchmark it against Hive.
> We pretty much used the CsvBulkLoadTool as documented. But now after
> completion, Hbase is still in 'minor compaction' for quite a number of
> hours.
> (Also, we see only one region in the table.)
> A select count on this table does not seem to complete. Any ideas on how to
> proceed ?
>
> Regards,
> - kiru
>
>
>


  

Re: CsvBulkLoadTool question

Posted by James Taylor <ja...@apache.org>.
Another option, Kiru, is to use the SPLIT ON (...) clause at the end
of your CREATE TABLE call. This will cause your table to be pre-split
without salting it.
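
As a rough sketch (the table, columns, and split points below are
hypothetical; sensible split points depend on your actual row key
distribution), it would look something like:

    CREATE TABLE EXAMPLE_TABLE (
        ID VARCHAR NOT NULL PRIMARY KEY,
        VAL INTEGER
    ) SPLIT ON ('E', 'J', 'O', 'T');

This pre-creates one region per split boundary while leaving the row
keys unmodified.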

On Fri, Apr 24, 2015 at 9:47 AM, Kiru Pakkirisamy
<ki...@yahoo.com> wrote:
> Gabriel,
> Thanks for the tip, I will retry with the SALT_BUCKETS option.
>
> Regards,
> - kiru
> ________________________________
> From: Gabriel Reid <ga...@gmail.com>
> To: user@phoenix.apache.org; Kiru Pakkirisamy <ki...@yahoo.com>
> Sent: Thursday, April 23, 2015 11:57 PM
> Subject: Re: CsvBulkLoadTool question
>
> Hi Kiru,
>
> The CSV bulk loader won't automatically make multiple regions for you, it
> simply loads data into the existing regions of the table. In your case, it
> means that all data has been loaded into a single region (as you're seeing),
> which means that any kind of operations that scan over a large number of
> rows (such as a "select count") will be very slow.
>
> I would recommend pre-splitting your table before running the bulk load
> tool. If you're creating the table directly in Phoenix, you can supply the
> SALT_BUCKETS table option [1] when creating the table.
>
> - Gabriel
>
> 1. http://phoenix.apache.org/language/index.html#options
>
>
>
> On Fri, Apr 24, 2015 at 2:15 AM Kiru Pakkirisamy <ki...@yahoo.com>
> wrote:
>
> Hi,
> We are trying to load large number of rows (100/200M) into a table and
> benchmark it against Hive.
> We pretty much used the CsvBulkLoadTool as documented. But now after
> completion, Hbase is still in 'minor compaction' for quite a number of
> hours.
> (Also, we see only one region in the table.)
> A select count on this table does not seem to complete. Any ideas on how to
> proceed ?
>
> Regards,
> - kiru
>
>
>

Re: CsvBulkLoadTool question

Posted by Kiru Pakkirisamy <ki...@yahoo.com>.
Gabriel,
Thanks for the tip, I will retry with the SALT_BUCKETS option.
Regards,
- kiru
      From: Gabriel Reid <ga...@gmail.com>
 To: user@phoenix.apache.org; Kiru Pakkirisamy <ki...@yahoo.com> 
 Sent: Thursday, April 23, 2015 11:57 PM
 Subject: Re: CsvBulkLoadTool question
   
Hi Kiru,
The CSV bulk loader won't automatically make multiple regions for you, it simply loads data into the existing regions of the table. In your case, it means that all data has been loaded into a single region (as you're seeing), which means that any kind of operations that scan over a large number of rows (such as a "select count") will be very slow.
I would recommend pre-splitting your table before running the bulk load tool. If you're creating the table directly in Phoenix, you can supply the SALT_BUCKETS table option [1] when creating the table.
- Gabriel
1. http://phoenix.apache.org/language/index.html#options



On Fri, Apr 24, 2015 at 2:15 AM Kiru Pakkirisamy <ki...@yahoo.com> wrote:

Hi,
We are trying to load a large number of rows (100-200M) into a table and benchmark it against Hive. We pretty much used the CsvBulkLoadTool as documented, but now, after completion, HBase is still in 'minor compaction' for quite a number of hours. (Also, we see only one region in the table.) A select count on this table does not seem to complete. Any ideas on how to proceed?
Regards,
- kiru



  

Re: CsvBulkLoadTool question

Posted by Gabriel Reid <ga...@gmail.com>.
Hi Kiru,

The CSV bulk loader won't automatically make multiple regions for you, it
simply loads data into the existing regions of the table. In your case, it
means that all data has been loaded into a single region (as you're
seeing), which means that any kind of operations that scan over a large
number of rows (such as a "select count") will be very slow.

I would recommend pre-splitting your table before running the bulk load
tool. If you're creating the table directly in Phoenix, you can supply the
SALT_BUCKETS table option [1] when creating the table.
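
As a minimal sketch (table and column names here are hypothetical;
pick a bucket count roughly in line with the number of region servers
in your cluster), the table could be created along these lines before
running the bulk load:

    CREATE TABLE EXAMPLE_TABLE (
        ID VARCHAR NOT NULL PRIMARY KEY,
        VAL INTEGER
    ) SALT_BUCKETS = 16;

This pre-splits the table into 16 salted regions, so the bulk-loaded
data (and later scans such as a select count) are spread across
multiple regions instead of a single one.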

- Gabriel

1. http://phoenix.apache.org/language/index.html#options

On Fri, Apr 24, 2015 at 2:15 AM Kiru Pakkirisamy <ki...@yahoo.com>
wrote:

> Hi,
> We are trying to load large number of rows (100/200M) into a table and
> benchmark it against Hive.
> We pretty much used the CsvBulkLoadTool as documented. But now after
> completion, Hbase is still in 'minor compaction' for quite a number of
> hours.
> (Also, we see only one region in the table.)
> A select count on this table does not seem to complete. Any ideas on how
> to proceed ?
>
> Regards,
> - kiru
>
>