Posted to user@sqoop.apache.org by Alberto Cordioli <co...@gmail.com> on 2012/09/05 09:37:49 UTC

Sqoop import big MySql table in HBase

Hi all,

I am using Sqoop to import a big MySQL table (around 910 million
records) into HBase.
The command line that I'm using is something like:
sqoop import --connect
jdbc:mysql://<server>/<db>?zeroDateTimeBehavior=round --username <usr>
-P --query '<query>' --split-by <date-field> --hbase-table
"<hbase_table>" --column-family "<fam>" --hbase-row-key "hash"

The strange thing is that it takes a long time to complete the last part of
the map. This is part of the log:

[...]
12/09/04 17:16:45 INFO mapred.JobClient: Running job: job_201209031227_0007
12/09/04 17:16:46 INFO mapred.JobClient:  map 0% reduce 0%
12/09/04 17:24:20 INFO mapred.JobClient:  map 25% reduce 0%
12/09/04 17:24:21 INFO mapred.JobClient:  map 50% reduce 0%
12/09/04 17:24:23 INFO mapred.JobClient:  map 75% reduce 0%

As you can see, it does not take much time to go from start to 75%, but
the last part hasn't finished (although it has been running continuously
for a day).
Is there something wrong? I've taken a look at the logs but everything
seems to be ok.


Thanks,
Alberto



-- 
Alberto Cordioli

Re: Sqoop import big MySql table in HBase

Posted by Alberto Cordioli <co...@gmail.com>.
Perfect.
Thanks for the help!

Cheers,
Alberto


-- 
Alberto Cordioli

Re: Sqoop import big MySql table in HBase

Posted by abhijeet gaikwad <ab...@gmail.com>.
I forgot to mention that the calculations are done on at most the first 8 chars
of the string. So computation-wise you are safe - but that may generate
inaccurate splits in some scenarios.
Anyway, I feel Jarcec's solutions are a better workaround :)

Thanks,
Abhijeet

Re: Sqoop import big MySql table in HBase

Posted by Alberto Cordioli <co...@gmail.com>.
Mmh, I see. Hence I should avoid splitting by string fields, since my hash
field is 72 chars long and it would require a lot of computation (if I
understood correctly).
I think one of the solutions proposed by Jarcec could be ok.
I also think I'll divide my big table into smaller chunks, since the
problem is the query that determines the split points. What do you
think?
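
Something like the following is what I have in mind - just a rough sketch with
placeholder names (the month boundaries and table name would have to match the
real data, the date arithmetic assumes GNU date, and the free-form query still
needs the $CONDITIONS token):

for start in 2012-04-01 2012-05-01; do
  # One smaller import per month, so the split-point query and the splits
  # themselves only ever see one month of data at a time.
  end=$(date -d "$start +1 month" +%Y-%m-%d)
  sqoop import \
    --connect "jdbc:mysql://<server>/<db>?zeroDateTimeBehavior=round" \
    --username <usr> -P \
    --query "SELECT * FROM <table> WHERE <date-field> >= '$start' AND <date-field> < '$end' AND \$CONDITIONS" \
    --split-by <date-field> \
    --hbase-table "<hbase_table>" --column-family "<fam>" --hbase-row-key "hash"
done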

Cheers,
Alberto


-- 
Alberto Cordioli

Re: Sqoop import big MySql table in HBase

Posted by abhijeet gaikwad <ab...@gmail.com>.
Hi Alberto,
Sqoop calculates split points by converting the min. and max. string values
returned by the DBMS for the column into corresponding decimal values, and
then uses the decimal splitter. There is a fairly involved algorithm for
converting (more of a mapping) a string to a decimal. This should help you
understand better (taken from the Javadoc for the split method of the text
splitter):

/**
   * This method needs to determine the splits between two user-provided
   * strings.  In the case where the user's strings are 'A' and 'Z', this is
   * not hard; we could create two splits from ['A', 'M') and ['M', 'Z'], 26
   * splits for strings beginning with each letter, etc.
   *
   * If a user has provided us with the strings "Ham" and "Haze", however, we
   * need to create splits that differ in the third letter.
   *
   * The algorithm used is as follows:
   * Since there are 2**16 unicode characters, we interpret characters as
   * digits in base 65536. Given a string 's' containing characters s_0, s_1
   * .. s_n, we interpret the string as the number: 0.s_0 s_1 s_2.. s_n in
   * base 65536. Having mapped the low and high strings into floating-point
   * values, we then use the BigDecimalSplitter to establish the even split
   * points, then map the resulting floating point values back into strings.
   */
  public List<InputSplit> split(Configuration conf, ResultSet results,
      String colName) throws SQLException {

After the splits are calculated, WHERE clauses are used in the SELECT
queries fired by each mapper (i.e. each query's result set is bounded by
its split points) to retrieve the data.
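
Just to illustrate (hypothetical column and table names, not the literal SQL
Sqoop generates): with two mappers and a string split-by column whose low and
high values map near 'A' and 'Z', the two bounded queries would look roughly
like:

-- mapper 1 (illustrative only)
SELECT <columns> FROM <table> WHERE <hash-col> >= 'A' AND <hash-col> < 'N'
-- mapper 2 (illustrative only)
SELECT <columns> FROM <table> WHERE <hash-col> >= 'N' AND <hash-col> <= 'Z'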

From a user's perspective, you can use a string column for splitting except in
the following scenario: a char split-by column is not recommended when the DBMS
sorts in a case-insensitive manner. The current algorithm used to calculate
splits has some flaws. This is known, and Sqoop displays a warning before
executing the job.

Let me know if you need more details.

Thanks,
Abhijeet


Re: Sqoop import big MySql table in HBase

Posted by Jarek Jarcec Cecho <ja...@apache.org>.
Hi Alberto,
I've never used a text column for data splitting, but it seems that sqoop supports it (I found its splitter in the code). I'm still not sure it's wise, as string operations tend to be much slower on databases and you might end up with performance issues. Unfortunately, Sqoop currently does not provide any direct way to affect split creation.

I tried to think about your problem and came up with two ideas that might help in your use case:

1) Would it be acceptable in your use case to change the zero-date policy from zeroDateTimeBehavior=round to zeroDateTimeBehavior=convertToNull? In case the "split" column contains NULLs, sqoop will create X+1 splits, where the +1 covers all the NULL values. It probably won't be perfect, but it might help distribute your load more evenly.
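
For example (only the JDBC property in the connect string changes; the rest of
your command stays as it is - a sketch, not tested against your setup):

sqoop import \
  --connect "jdbc:mysql://<server>/<db>?zeroDateTimeBehavior=convertToNull" \
  --username <usr> -P \
  --query '<query>' \
  --split-by <date-field> \
  --hbase-table "<hbase_table>" --column-family "<fam>" --hbase-row-key "hash"

The '0000-00-00' values then come back as NULL and land in the extra NULL split
instead of skewing one of the date ranges.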

2) What about splitting the entire job into two parts - first import all the zero dates, and then, in a separate job, the rest of the values. By doing so you might be able to get a decent distribution across the "normal" dates part. Importing all the zero dates might be challenging if you have a lot of them, as there will be only one value available (and thus just one split), so you might need to use the text column for split creation in that case anyway.
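
A rough sketch of that idea (hypothetical WHERE conditions spliced into the
free-form query; the date boundary, table and column names are placeholders,
and the query still needs the $CONDITIONS token):

# Job 1: only the "normal" dates, so the date-based splits stay balanced.
sqoop import \
  --connect "jdbc:mysql://<server>/<db>?zeroDateTimeBehavior=round" \
  --username <usr> -P \
  --query "SELECT * FROM <table> WHERE <date-field> >= '2012-04-01' AND \$CONDITIONS" \
  --split-by <date-field> \
  --hbase-table "<hbase_table>" --column-family "<fam>" --hbase-row-key "hash"

# Job 2: only the zero dates; split on a different column (e.g. the hash),
# because the date column has a single value here and would give one split.
sqoop import \
  --connect "jdbc:mysql://<server>/<db>?zeroDateTimeBehavior=round" \
  --username <usr> -P \
  --query "SELECT * FROM <table> WHERE <date-field> = '0000-00-00' AND \$CONDITIONS" \
  --split-by <hash-col> \
  --hbase-table "<hbase_table>" --column-family "<fam>" --hbase-row-key "hash"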

Jarcec


Re: Sqoop import big MySql table in HBase

Posted by Alberto Cordioli <co...@gmail.com>.
Thanks Jarcec,
you've probably identified the problem immediately. In fact, I checked
the date field, and I think the problem is that my data contains some
"limit" values like '0000-00-00' (damn whoever inserted these).
The other data are evenly distributed over 2 months (from 2012-04-01 to
2012-06-01): as you said, with a parallelism of 3, 2 mappers will get
basically no data while the other will do the "true" job, right?

So now my question becomes: the other field that I could use to split
the job is a hash (string). How does sqoop divide this type of field?
Lexicographic order?


Alberto



-- 
Alberto Cordioli

Re: Sqoop import big MySql table in HBase

Posted by Jarek Jarcec Cecho <ja...@apache.org>.
Hi Alberto,
taking into account that you have 910 million records and your job was able to get to 75% in a matter of 8 minutes and then slowed down significantly, I have a feeling that your splits were not equally divided. Based on your command line it seems that you're dividing the data by some date field. Is this date field uniformly distributed? E.g. is there roughly the same number of rows for each date, or do you have more rows in more recent days?

Because Sqoop has no idea how exactly the data are distributed in your database, it assumes a uniform distribution. Let me explain why it matters with the following example. Let's consider a table with one row on 2012-01-01, a second row on 2012-02-01 and 1M rows on 2012-03-01. Let's assume that we use three mappers (--num-mappers 3). In this case, sqoop will create three splits: 2012-01-01 up to 2012-01-31, 2012-02-01 up to 2012-02-28 and 2012-03-01 up to 2012-03-31. Because the first two mappers have just one row to move, they will finish almost instantly and get the job to 66% done (2 out of 3 mappers done); however, the last mapper will be running for some time as it needs to move 1M rows. To an external observer it would appear that sqoop has stopped, but what really happened is that the data is not uniformly distributed across the mappers.
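
If you want to check whether that is what is happening, one quick option
(placeholder names; adjust the connection details to your environment) is to
look at the row counts per value of the split column before running the import:

# Inspect how rows are distributed over the split column; it will prompt for
# the password because of -p.
mysql -h <server> -u <usr> -p <db> -e \
  "SELECT <date-field>, COUNT(*) AS rows_per_value
     FROM <table>
    GROUP BY <date-field>
    ORDER BY <date-field>"

If a single value holds most of the rows, evenly spaced splits will end up just
as unbalanced as in the example above.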

Jarcec
