
Posted to user@sqoop.apache.org by David Kincaid <ki...@gmail.com> on 2013/06/19 17:33:33 UTC

Strange distribution of keys among mappers

We're seeing a strange thing happen with a sqoop import job with the way
the key range is getting distributed among the 4 mappers that are running.
The minimum key value is 2110 and the maximum value is 288272191. We are
getting one mapper that is only getting one record to import. Here is the
distribution among the mappers:

[2110, 96092137)
[96092137, 192182164)
[192182164, 288272191)
[288272191, 288272192)

you can see that the fourth mapper is given a range with only one value in
it. Could someone help me understand what is going on?
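
A quick way to check the ranges listed above is to compute the width of each half-open range [lo, hi), which covers hi - lo key values. The sketch below is illustrative only; the class and method names are hypothetical and nothing here is taken from Sqoop itself.

```java
public class RangeWidthSketch {

    // Number of integer key values covered by the half-open range [lo, hi).
    public static long width(long lo, long hi) {
        return hi - lo;
    }

    public static void main(String[] args) {
        // The four ranges reported for the import job in this thread.
        long[][] ranges = {
            {2110L, 96092137L},
            {96092137L, 192182164L},
            {192182164L, 288272191L},
            {288272191L, 288272192L},
        };
        for (long[] r : ranges) {
            System.out.println("[" + r[0] + ", " + r[1] + ") covers "
                    + width(r[0], r[1]) + " key values");
        }
    }
}
```

The first three ranges each cover 96090027 key values, while the fourth covers exactly one.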

Thanks,

Dave

Re: Strange distribution of keys among mappers

Posted by David Kincaid <ki...@gmail.com>.

Right. That seems to be what's happening. Thank you for all the help
understanding. It's making sense now.

- Dave


On Wed, Jun 19, 2013 at 7:30 PM, Abraham Elmahrek <ab...@cloudera.com> wrote:

> David,
>
> It's really just a hint. So the splitters will try to hit whatever is
> defined, but an extra may be created. For instance, BigDecimalSplitter will
> create 4 splits for certain ranges with 3 MR tasks specified.
>
> -Abe
>
>
> On Wed, Jun 19, 2013 at 5:03 PM, David Kincaid <ki...@gmail.com>wrote:
>
>> We don't have that set on our cluster and aren't specifying it in our
>> job. When I look at the different sqoop jobs I see both 3 for some and 4
>> for others on the jobs.
>>
>>
>> On Wed, Jun 19, 2013 at 6:50 PM, Abraham Elmahrek <ab...@cloudera.com>wrote:
>>
>>> David,
>>>
>>> Well I think sqoop is looking at "mapred.map.tasks". Do you have that
>>> set in mapred-site.xml? I thought that defaults to 2.
>>>
>>> -Abe
>>>
>>>
>>> On Wed, Jun 19, 2013 at 4:31 PM, Abraham Elmahrek <ab...@cloudera.com>wrote:
>>>
>>>> David,
>>>>
>>>> I've created https://issues.apache.org/jira/browse/SQOOP-1093 to track
>>>> the documentation issue. Thanks for bringing this to the community's
>>>> attention!
>>>>
>>>> -Abe
>>>>
>>>>
>>>> On Wed, Jun 19, 2013 at 4:21 PM, Abraham Elmahrek <ab...@cloudera.com>wrote:
>>>>
>>>>> Hey David,
>>>>>
>>>>> With oracle, the BigDecimalSplitter will be used by default for all
>>>>> number types.
>>>>>
>>>>> -Abe
>>>>>
>>>>>
>>>>> On Wed, Jun 19, 2013 at 4:05 PM, David Kincaid <kincaid.dave@gmail.com
>>>>> > wrote:
>>>>>
>>>>>> Abe, the database is Oracle.
>>>>>>
>>>>>>
>>>>>> On Wed, Jun 19, 2013 at 5:48 PM, Abraham Elmahrek <ab...@cloudera.com>wrote:
>>>>>>
>>>>>>> David,
>>>>>>>
>>>>>>> What database are you importing from? The description I gave was for
>>>>>>> datatypes that map to the BigDecimal Splitter. The userguide might be
>>>>>>> referring to the IntegerSplitter which will add the remainder to the last
>>>>>>> value.
>>>>>>>
>>>>>>> -Abe
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jun 19, 2013 at 1:23 PM, David Kincaid <
>>>>>>> kincaid.dave@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks. We didn't specify the number of mappers, so it's giving us
>>>>>>>> 4. I understand your explanation, but it seems to conflict with the Sqoop
>>>>>>>> user guide (
>>>>>>>> http://sqoop.apache.org/docs/1.4.3/SqoopUserGuide.html#_controlling_parallelism
>>>>>>>> ):
>>>>>>>>
>>>>>>>> "When performing parallel imports, Sqoop needs a criterion by
>>>>>>>> which it can split the workload. Sqoop uses a *splitting column* to
>>>>>>>> split the workload. By default, Sqoop will identify the primary key column
>>>>>>>> (if present) in a table and use it as the splitting column. The low and
>>>>>>>> high values for the splitting column are retrieved from the database, and
>>>>>>>> the map tasks operate on evenly-sized components of the total range. For
>>>>>>>> example, if you had a table with a primary key column of id whose
>>>>>>>> minimum value was 0 and maximum value was 1000, and Sqoop was directed to
>>>>>>>> use 4 tasks, Sqoop would run four processes which each execute SQL
>>>>>>>> statements of the form SELECT * FROM sometable WHERE id >= lo AND
>>>>>>>> id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750),
>>>>>>>> and (750, 1001) in the different tasks."
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Jun 19, 2013 at 3:14 PM, Abraham Elmahrek <abe@cloudera.com
>>>>>>>> > wrote:
>>>>>>>>
>>>>>>>>> Hey David,
>>>>>>>>>
>>>>>>>>> Here's the algorithm:
>>>>>>>>> Split lengths are defined by (max - min)/(# mappers) and whatever
>>>>>>>>> is left is tacked on at the end. So in this case, (288272191-2110)/3
>>>>>>>>> = 96090027 with no remainder (I'm assuming any fraction is rounded down), so split
>>>>>>>>> lengths will be of length 96090027. Sqoop will then create splits
>>>>>>>>> with the following points: (min) + (range length)*(n). We can see
>>>>>>>>> that 2110 + 96090027*0 = 2110, 2110 + 96090027*1 = 96092137, 2110
>>>>>>>>> + 96090027*2 = 192182164, and 2110 + 96090027*3 = 288272191 will
>>>>>>>>> be generated based off of this algorithm. The last point to be added will
>>>>>>>>> be 288272192 because the max value is not part of the generated
>>>>>>>>> split points. Then sqoop will distribute the work accordingly based off of these
>>>>>>>>> points as you've pointed out above.
>>>>>>>>>
>>>>>>>>> Just to be sure, did you configure sqoop to use 3 mappers?
>>>>>>>>>
>>>>>>>>> Hope this helps,
>>>>>>>>> -Abe
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Jun 19, 2013 at 8:33 AM, David Kincaid <
>>>>>>>>> kincaid.dave@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> We're seeing a strange thing happen with a sqoop import job with
>>>>>>>>>> the way the key range is getting distributed among the 4 mappers that are
>>>>>>>>>> running. The minimum key value is 2110 and the maximum value is 288272191.
>>>>>>>>>> We are getting one mapper that is only getting one record to import. Here
>>>>>>>>>> is the distribution among the mappers:
>>>>>>>>>>
>>>>>>>>>> [2110, 96092137)
>>>>>>>>>> [96092137, 192182164)
>>>>>>>>>> [192182164, 288272191)
>>>>>>>>>> [288272191, 288272192)
>>>>>>>>>>
>>>>>>>>>> you can see that the fourth mapper is given a range with only one
>>>>>>>>>> value in it. Could someone help me understand what is going on?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Dave
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Strange distribution of keys among mappers

Posted by Abraham Elmahrek <ab...@cloudera.com>.

David,

It's really just a hint. So the splitters will try to hit whatever is
defined, but an extra may be created. For instance, BigDecimalSplitter will
create 4 splits for certain ranges with 3 MR tasks specified.
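
A minimal sketch of how 3 requested tasks can end up as 4 splits, following only the description given in this thread: step from min in increments of (max - min) / tasks, then add max + 1 as the closing point. This is not the BigDecimalSplitter source; the class and method names are hypothetical, and plain long arithmetic is used instead of BigDecimal for brevity.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitPointSketch {

    // Hypothetical helper: builds split points as described in the thread.
    public static List<Long> splitPoints(long min, long max, int numTasks) {
        // Integer division drops any fractional part; never step by less than 1.
        long size = Math.max(1L, (max - min) / numTasks);
        List<Long> points = new ArrayList<>();
        // Generate min, min + size, min + 2*size, ... while still within the key range.
        for (long p = min; p <= max; p += size) {
            points.add(p);
        }
        // Closing point, so that the last half-open range includes max itself.
        points.add(max + 1);
        return points;
    }

    public static void main(String[] args) {
        // Values from the thread: min 2110, max 288272191, 3 tasks requested.
        List<Long> pts = splitPoints(2110L, 288272191L, 3);
        for (int i = 0; i + 1 < pts.size(); i++) {
            System.out.println("[" + pts.get(i) + ", " + pts.get(i + 1) + ")");
        }
    }
}
```

With these inputs the step is 96090027, which divides the range exactly, so the third step lands on the maximum key and the closing point produces a fourth range holding a single value.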

-Abe


On Wed, Jun 19, 2013 at 5:03 PM, David Kincaid <ki...@gmail.com>wrote:

> We don't have that set on our cluster and aren't specifying it in our job.
> When I look at the different sqoop jobs I see both 3 for some and 4 for
> others on the jobs.
>
>
> On Wed, Jun 19, 2013 at 6:50 PM, Abraham Elmahrek <ab...@cloudera.com>wrote:
>
>> David,
>>
>> Well I think sqoop is looking at "mapred.map.tasks". Do you have that set
>> in mapred-site.xml? I thought that defaults to 2.
>>
>> -Abe
>>
>>
>> On Wed, Jun 19, 2013 at 4:31 PM, Abraham Elmahrek <ab...@cloudera.com>wrote:
>>
>>> David,
>>>
>>> I've created https://issues.apache.org/jira/browse/SQOOP-1093 to track
>>> the documentation issue. Thanks for bringing this to the community's
>>> attention!
>>>
>>> -Abe
>>>
>>>
>>> On Wed, Jun 19, 2013 at 4:21 PM, Abraham Elmahrek <ab...@cloudera.com>wrote:
>>>
>>>> Hey David,
>>>>
>>>> With oracle, the BigDecimalSplitter will be used by default for all
>>>> number types.
>>>>
>>>> -Abe
>>>>
>>>>
>>>> On Wed, Jun 19, 2013 at 4:05 PM, David Kincaid <ki...@gmail.com>wrote:
>>>>
>>>>> Abe, the database is Oracle.
>>>>>
>>>>>
>>>>> On Wed, Jun 19, 2013 at 5:48 PM, Abraham Elmahrek <ab...@cloudera.com>wrote:
>>>>>
>>>>>> David,
>>>>>>
>>>>>> What database are you importing from? The description I gave was for
>>>>>> datatypes that map to the BigDecimal Splitter. The userguide might be
>>>>>> referring to the IntegerSplitter which will add the remainder to the last
>>>>>> value.
>>>>>>
>>>>>> -Abe
>>>>>>
>>>>>>
>>>>>> On Wed, Jun 19, 2013 at 1:23 PM, David Kincaid <
>>>>>> kincaid.dave@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks. We didn't specify the number of mappers, so it's giving us
>>>>>>> 4. I understand your explanation, but it seems to conflict with the Sqoop
>>>>>>> user guide (
>>>>>>> http://sqoop.apache.org/docs/1.4.3/SqoopUserGuide.html#_controlling_parallelism
>>>>>>> ):
>>>>>>>
>>>>>>> "When performing parallel imports, Sqoop needs a criterion by which
>>>>>>> it can split the workload. Sqoop uses a *splitting column* to split
>>>>>>> the workload. By default, Sqoop will identify the primary key column (if
>>>>>>> present) in a table and use it as the splitting column. The low and high
>>>>>>> values for the splitting column are retrieved from the database, and the
>>>>>>> map tasks operate on evenly-sized components of the total range. For
>>>>>>> example, if you had a table with a primary key column of id whose
>>>>>>> minimum value was 0 and maximum value was 1000, and Sqoop was directed to
>>>>>>> use 4 tasks, Sqoop would run four processes which each execute SQL
>>>>>>> statements of the form SELECT * FROM sometable WHERE id >= lo AND
>>>>>>> id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and
>>>>>>> (750, 1001) in the different tasks."
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jun 19, 2013 at 3:14 PM, Abraham Elmahrek <ab...@cloudera.com>wrote:
>>>>>>>
>>>>>>>> Hey David,
>>>>>>>>
>>>>>>>> Here's the algorithm:
>>>>>>>> Split lengths are defined by (max - min)/(# mappers) and whatever
>>>>>>>> is left is tacked on at the end. So in this case, (288272191-2110)/3
>>>>>>>> = 96090027 with no remainder (I'm assuming any fraction is rounded down), so split
>>>>>>>> lengths will be of length 96090027. Sqoop will then create splits
>>>>>>>> with the following points: (min) + (range length)*(n). We can see
>>>>>>>> that 2110 + 96090027*0 = 2110, 2110 + 96090027*1 = 96092137, 2110
>>>>>>>> + 96090027*2 = 192182164, and 2110 + 96090027*3 = 288272191 will
>>>>>>>> be generated based off of this algorithm. The last point to be added will
>>>>>>>> be 288272192 because the max value is not part of the generated
>>>>>>>> split points. Then sqoop will distribute the work accordingly based off of these
>>>>>>>> points as you've pointed out above.
>>>>>>>>
>>>>>>>> Just to be sure, did you configure sqoop to use 3 mappers?
>>>>>>>>
>>>>>>>> Hope this helps,
>>>>>>>> -Abe
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Jun 19, 2013 at 8:33 AM, David Kincaid <
>>>>>>>> kincaid.dave@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> We're seeing a strange thing happen with a sqoop import job with
>>>>>>>>> the way the key range is getting distributed among the 4 mappers that are
>>>>>>>>> running. The minimum key value is 2110 and the maximum value is 288272191.
>>>>>>>>> We are getting one mapper that is only getting one record to import. Here
>>>>>>>>> is the distribution among the mappers:
>>>>>>>>>
>>>>>>>>> [2110, 96092137)
>>>>>>>>> [96092137, 192182164)
>>>>>>>>> [192182164, 288272191)
>>>>>>>>> [288272191, 288272192)
>>>>>>>>>
>>>>>>>>> you can see that the fourth mapper is given a range with only one
>>>>>>>>> value in it. Could someone help me understand what is going on?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Dave
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Strange distribution of keys among mappers

Posted by David Kincaid <ki...@gmail.com>.

We don't have that set on our cluster and aren't specifying it in our job.
When I look at the different sqoop jobs I see both 3 for some and 4 for
others on the jobs.


On Wed, Jun 19, 2013 at 6:50 PM, Abraham Elmahrek <ab...@cloudera.com> wrote:

> David,
>
> Well I think sqoop is looking at "mapred.map.tasks". Do you have that set
> in mapred-site.xml? I thought that defaults to 2.
>
> -Abe
>
>
> On Wed, Jun 19, 2013 at 4:31 PM, Abraham Elmahrek <ab...@cloudera.com>wrote:
>
>> David,
>>
>> I've created https://issues.apache.org/jira/browse/SQOOP-1093 to track
>> the documentation issue. Thanks for bringing this to the community's
>> attention!
>>
>> -Abe
>>
>>
>> On Wed, Jun 19, 2013 at 4:21 PM, Abraham Elmahrek <ab...@cloudera.com>wrote:
>>
>>> Hey David,
>>>
>>> With oracle, the BigDecimalSplitter will be used by default for all
>>> number types.
>>>
>>> -Abe
>>>
>>>
>>> On Wed, Jun 19, 2013 at 4:05 PM, David Kincaid <ki...@gmail.com>wrote:
>>>
>>>> Abe, the database is Oracle.
>>>>
>>>>
>>>> On Wed, Jun 19, 2013 at 5:48 PM, Abraham Elmahrek <ab...@cloudera.com>wrote:
>>>>
>>>>> David,
>>>>>
>>>>> What database are you importing from? The description I gave was for
>>>>> datatypes that map to the BigDecimal Splitter. The userguide might be
>>>>> referring to the IntegerSplitter which will add the remainder to the last
>>>>> value.
>>>>>
>>>>> -Abe
>>>>>
>>>>>
>>>>> On Wed, Jun 19, 2013 at 1:23 PM, David Kincaid <kincaid.dave@gmail.com
>>>>> > wrote:
>>>>>
>>>>>> Thanks. We didn't specify the number of mappers, so it's giving us 4.
>>>>>> I understand your explanation, but it seems to conflict with the Sqoop user
>>>>>> guide (
>>>>>> http://sqoop.apache.org/docs/1.4.3/SqoopUserGuide.html#_controlling_parallelism
>>>>>> ):
>>>>>>
>>>>>> "When performing parallel imports, Sqoop needs a criterion by which
>>>>>> it can split the workload. Sqoop uses a *splitting column* to split
>>>>>> the workload. By default, Sqoop will identify the primary key column (if
>>>>>> present) in a table and use it as the splitting column. The low and high
>>>>>> values for the splitting column are retrieved from the database, and the
>>>>>> map tasks operate on evenly-sized components of the total range. For
>>>>>> example, if you had a table with a primary key column of id whose
>>>>>> minimum value was 0 and maximum value was 1000, and Sqoop was directed to
>>>>>> use 4 tasks, Sqoop would run four processes which each execute SQL
>>>>>> statements of the form SELECT * FROM sometable WHERE id >= lo AND id
>>>>>> < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and
>>>>>> (750, 1001) in the different tasks."
>>>>>>
>>>>>>
>>>>>> On Wed, Jun 19, 2013 at 3:14 PM, Abraham Elmahrek <ab...@cloudera.com>wrote:
>>>>>>
>>>>>>> Hey David,
>>>>>>>
>>>>>>> Here's the algorithm:
>>>>>>> Split lengths are defined by (max - min)/(# mappers) and whatever is
>>>>>>> left is tacked on at the end. So in this case, (288272191-2110)/3 =
>>>>>>> 96090027 with no remainder (I'm assuming any fraction is rounded down), so split lengths
>>>>>>> will be of length 96090027. Sqoop will then create splits with the
>>>>>>> following points: (min) + (range length)*(n). We can see that 2110
>>>>>>> + 96090027*0 = 2110, 2110 + 96090027*1 = 96092137, 2110 + 96090027*2
>>>>>>> = 192182164, and 2110 + 96090027*3 = 288272191 will be generated
>>>>>>> based off of this algorithm. The last point to be added will be 288272192
>>>>>>> because the max value is not part of the generated split points. Then sqoop
>>>>>>> will distribute the work accordingly based off of these points as you've pointed
>>>>>>> out above.
>>>>>>>
>>>>>>> Just to be sure, did you configure sqoop to use 3 mappers?
>>>>>>>
>>>>>>> Hope this helps,
>>>>>>> -Abe
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jun 19, 2013 at 8:33 AM, David Kincaid <
>>>>>>> kincaid.dave@gmail.com> wrote:
>>>>>>>
>>>>>>>> We're seeing a strange thing happen with a sqoop import job with
>>>>>>>> the way the key range is getting distributed among the 4 mappers that are
>>>>>>>> running. The minimum key value is 2110 and the maximum value is 288272191.
>>>>>>>> We are getting one mapper that is only getting one record to import. Here
>>>>>>>> is the distribution among the mappers:
>>>>>>>>
>>>>>>>> [2110, 96092137)
>>>>>>>> [96092137, 192182164)
>>>>>>>> [192182164, 288272191)
>>>>>>>> [288272191, 288272192)
>>>>>>>>
>>>>>>>> you can see that the fourth mapper is given a range with only one
>>>>>>>> value in it. Could someone help me understand what is going on?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Dave
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Strange distribution of keys among mappers

Posted by Abraham Elmahrek <ab...@cloudera.com>.

David,

Well I think sqoop is looking at "mapred.map.tasks". Do you have that set
in mapred-site.xml? I thought that defaults to 2.

-Abe


On Wed, Jun 19, 2013 at 4:31 PM, Abraham Elmahrek <ab...@cloudera.com> wrote:

> David,
>
> I've created https://issues.apache.org/jira/browse/SQOOP-1093 to track
> the documentation issue. Thanks for bringing this to the community's
> attention!
>
> -Abe
>
>
> On Wed, Jun 19, 2013 at 4:21 PM, Abraham Elmahrek <ab...@cloudera.com>wrote:
>
>> Hey David,
>>
>> With oracle, the BigDecimalSplitter will be used by default for all
>> number types.
>>
>> -Abe
>>
>>
>> On Wed, Jun 19, 2013 at 4:05 PM, David Kincaid <ki...@gmail.com>wrote:
>>
>>> Abe, the database is Oracle.
>>>
>>>
>>> On Wed, Jun 19, 2013 at 5:48 PM, Abraham Elmahrek <ab...@cloudera.com>wrote:
>>>
>>>> David,
>>>>
>>>> What database are you importing from? The description I gave was for
>>>> datatypes that map to the BigDecimal Splitter. The userguide might be
>>>> referring to the IntegerSplitter which will add the remainder to the last
>>>> value.
>>>>
>>>> -Abe
>>>>
>>>>
>>>> On Wed, Jun 19, 2013 at 1:23 PM, David Kincaid <ki...@gmail.com>wrote:
>>>>
>>>>> Thanks. We didn't specify the number of mappers, so it's giving us 4.
>>>>> I understand your explanation, but it seems to conflict with the Sqoop user
>>>>> guide (
>>>>> http://sqoop.apache.org/docs/1.4.3/SqoopUserGuide.html#_controlling_parallelism
>>>>> ):
>>>>>
>>>>> "When performing parallel imports, Sqoop needs a criterion by which
>>>>> it can split the workload. Sqoop uses a *splitting column* to split
>>>>> the workload. By default, Sqoop will identify the primary key column (if
>>>>> present) in a table and use it as the splitting column. The low and high
>>>>> values for the splitting column are retrieved from the database, and the
>>>>> map tasks operate on evenly-sized components of the total range. For
>>>>> example, if you had a table with a primary key column of id whose
>>>>> minimum value was 0 and maximum value was 1000, and Sqoop was directed to
>>>>> use 4 tasks, Sqoop would run four processes which each execute SQL
>>>>> statements of the form SELECT * FROM sometable WHERE id >= lo AND id
>>>>> < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and
>>>>> (750, 1001) in the different tasks."
>>>>>
>>>>>
>>>>> On Wed, Jun 19, 2013 at 3:14 PM, Abraham Elmahrek <ab...@cloudera.com>wrote:
>>>>>
>>>>>> Hey David,
>>>>>>
>>>>>> Here's the algorithm:
>>>>>> Split lengths are defined by (max - min)/(# mappers) and whatever is
>>>>>> left is tacked on at the end. So in this case, (288272191-2110)/3 =
>>>>>> 96090027 with no remainder (I'm assuming any fraction is rounded down), so split lengths
>>>>>> will be of length 96090027. Sqoop will then create splits with the
>>>>>> following points: (min) + (range length)*(n). We can see that 2110 + 96090027*0
>>>>>> = 2110, 2110 + 96090027*1 = 96092137, 2110 + 96090027*2 = 192182164,
>>>>>> and 2110 + 96090027*3 = 288272191 will be generated based off of
>>>>>> this algorithm. The last point to be added will be 288272192 because
>>>>>> the max value is not part of the generated split points. Then sqoop will
>>>>>> distribute the work accordingly based off of these points as you've pointed out
>>>>>> above.
>>>>>>
>>>>>> Just to be sure, did you configure sqoop to use 3 mappers?
>>>>>>
>>>>>> Hope this helps,
>>>>>> -Abe
>>>>>>
>>>>>>
>>>>>> On Wed, Jun 19, 2013 at 8:33 AM, David Kincaid <
>>>>>> kincaid.dave@gmail.com> wrote:
>>>>>>
>>>>>>> We're seeing a strange thing happen with a sqoop import job with the
>>>>>>> way the key range is getting distributed among the 4 mappers that are
>>>>>>> running. The minimum key value is 2110 and the maximum value is 288272191.
>>>>>>> We are getting one mapper that is only getting one record to import. Here
>>>>>>> is the distribution among the mappers:
>>>>>>>
>>>>>>> [2110, 96092137)
>>>>>>> [96092137, 192182164)
>>>>>>> [192182164, 288272191)
>>>>>>> [288272191, 288272192)
>>>>>>>
>>>>>>> you can see that the fourth mapper is given a range with only one
>>>>>>> value in it. Could someone help me understand what is going on?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Dave
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Strange distribution of keys among mappers

Posted by Abraham Elmahrek <ab...@cloudera.com>.

David,

I've created https://issues.apache.org/jira/browse/SQOOP-1093 to track the
documentation issue. Thanks for bringing this to the community's attention!

-Abe


On Wed, Jun 19, 2013 at 4:21 PM, Abraham Elmahrek <ab...@cloudera.com> wrote:

> Hey David,
>
> With oracle, the BigDecimalSplitter will be used by default for all number
> types.
>
> -Abe
>
>
> On Wed, Jun 19, 2013 at 4:05 PM, David Kincaid <ki...@gmail.com>wrote:
>
>> Abe, the database is Oracle.
>>
>>
>> On Wed, Jun 19, 2013 at 5:48 PM, Abraham Elmahrek <ab...@cloudera.com>wrote:
>>
>>> David,
>>>
>>> What database are you importing from? The description I gave was for
>>> datatypes that map to the BigDecimal Splitter. The userguide might be
>>> referring to the IntegerSplitter which will add the remainder to the last
>>> value.
>>>
>>> -Abe
>>>
>>>
>>> On Wed, Jun 19, 2013 at 1:23 PM, David Kincaid <ki...@gmail.com>wrote:
>>>
>>>> Thanks. We didn't specify the number of mappers, so it's giving us 4. I
>>>> understand your explanation, but it seems to conflict with the Sqoop user
>>>> guide (
>>>> http://sqoop.apache.org/docs/1.4.3/SqoopUserGuide.html#_controlling_parallelism
>>>> ):
>>>>
>>>> "When performing parallel imports, Sqoop needs a criterion by which it
>>>> can split the workload. Sqoop uses a *splitting column* to split the
>>>> workload. By default, Sqoop will identify the primary key column (if
>>>> present) in a table and use it as the splitting column. The low and high
>>>> values for the splitting column are retrieved from the database, and the
>>>> map tasks operate on evenly-sized components of the total range. For
>>>> example, if you had a table with a primary key column of id whose
>>>> minimum value was 0 and maximum value was 1000, and Sqoop was directed to
>>>> use 4 tasks, Sqoop would run four processes which each execute SQL
>>>> statements of the form SELECT * FROM sometable WHERE id >= lo AND id <
>>>> hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750,
>>>> 1001) in the different tasks."
>>>>
>>>>
>>>> On Wed, Jun 19, 2013 at 3:14 PM, Abraham Elmahrek <ab...@cloudera.com>wrote:
>>>>
>>>>> Hey David,
>>>>>
>>>>> Here's the algorithm:
>>>>> Split lengths are defined by (max - min)/(# mappers) and whatever is
>>>>> left is tacked on at the end. So in this case, (288272191-2110)/3 =
>>>>> 96090027 with no remainder (I'm assuming any fraction is rounded down), so split lengths
>>>>> will be of length 96090027. Sqoop will then create splits with the
>>>>> following points: (min) + (range length)*(n). We can see that 2110 + 96090027*0
>>>>> = 2110, 2110 + 96090027*1 = 96092137, 2110 + 96090027*2 = 192182164,
>>>>> and 2110 + 96090027*3 = 288272191 will be generated based off of this
>>>>> algorithm. The last point to be added will be 288272192 because the
>>>>> max value is not part of the generated split points. Then sqoop will
>>>>> distribute the work accordingly based off of these points as you've pointed out
>>>>> above.
>>>>>
>>>>> Just to be sure, did you configure sqoop to use 3 mappers?
>>>>>
>>>>> Hope this helps,
>>>>> -Abe
>>>>>
>>>>>
>>>>> On Wed, Jun 19, 2013 at 8:33 AM, David Kincaid <kincaid.dave@gmail.com
>>>>> > wrote:
>>>>>
>>>>>> We're seeing a strange thing happen with a sqoop import job with the
>>>>>> way the key range is getting distributed among the 4 mappers that are
>>>>>> running. The minimum key value is 2110 and the maximum value is 288272191.
>>>>>> We are getting one mapper that is only getting one record to import. Here
>>>>>> is the distribution among the mappers:
>>>>>>
>>>>>> [2110, 96092137)
>>>>>> [96092137, 192182164)
>>>>>> [192182164, 288272191)
>>>>>> [288272191, 288272192)
>>>>>>
>>>>>> you can see that the fourth mapper is given a range with only one
>>>>>> value in it. Could someone help me understand what is going on?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Dave
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Strange distribution of keys among mappers

Posted by David Kincaid <ki...@gmail.com>.

Aha. Thank you very much. That definitely clears up how the split was
happening.

Now, my next question is about the number of mappers getting set to 3 by
default. The user guide says that the default should be 4.


On Wed, Jun 19, 2013 at 6:21 PM, Abraham Elmahrek <ab...@cloudera.com> wrote:

> Hey David,
>
> With oracle, the BigDecimalSplitter will be used by default for all number
> types.
>
> -Abe
>
>
> On Wed, Jun 19, 2013 at 4:05 PM, David Kincaid <ki...@gmail.com>wrote:
>
>> Abe, the database is Oracle.
>>
>>
>> On Wed, Jun 19, 2013 at 5:48 PM, Abraham Elmahrek <ab...@cloudera.com>wrote:
>>
>>> David,
>>>
>>> What database are you importing from? The description I gave was for
>>> datatypes that map to the BigDecimal Splitter. The userguide might be
>>> referring to the IntegerSplitter which will add the remainder to the last
>>> value.
>>>
>>> -Abe
>>>
>>>
>>> On Wed, Jun 19, 2013 at 1:23 PM, David Kincaid <ki...@gmail.com>wrote:
>>>
>>>> Thanks. We didn't specify the number of mappers, so it's giving us 4. I
>>>> understand your explanation, but it seems to conflict with the Sqoop user
>>>> guide (
>>>> http://sqoop.apache.org/docs/1.4.3/SqoopUserGuide.html#_controlling_parallelism
>>>> ):
>>>>
>>>> "When performing parallel imports, Sqoop needs a criterion by which it
>>>> can split the workload. Sqoop uses a *splitting column* to split the
>>>> workload. By default, Sqoop will identify the primary key column (if
>>>> present) in a table and use it as the splitting column. The low and high
>>>> values for the splitting column are retrieved from the database, and the
>>>> map tasks operate on evenly-sized components of the total range. For
>>>> example, if you had a table with a primary key column of id whose
>>>> minimum value was 0 and maximum value was 1000, and Sqoop was directed to
>>>> use 4 tasks, Sqoop would run four processes which each execute SQL
>>>> statements of the form SELECT * FROM sometable WHERE id >= lo AND id <
>>>> hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750,
>>>> 1001) in the different tasks."
>>>>
>>>>
>>>> On Wed, Jun 19, 2013 at 3:14 PM, Abraham Elmahrek <ab...@cloudera.com>wrote:
>>>>
>>>>> Hey David,
>>>>>
>>>>> Here's the algorithm:
>>>>> Split lengths are defined by (max - min)/(# mappers) and whatever is
>>>>> left is tacked on at the end. So in this case, (288272191-2110)/3 =
>>>>> 96090027 with no remainder (I'm assuming any fraction is rounded down), so split lengths
>>>>> will be of length 96090027. Sqoop will then create splits with the
>>>>> following points: (min) + (range length)*(n). We can see that 2110 + 96090027*0
>>>>> = 2110, 2110 + 96090027*1 = 96092137, 2110 + 96090027*2 = 192182164,
>>>>> and 2110 + 96090027*3 = 288272191 will be generated based off of this
>>>>> algorithm. The last point to be added will be 288272192 because the
>>>>> max value is not part of the generated split points. Then sqoop will
>>>>> distribute the work accordingly based off of these points as you've pointed out
>>>>> above.
>>>>>
>>>>> Just to be sure, did you configure sqoop to use 3 mappers?
>>>>>
>>>>> Hope this helps,
>>>>> -Abe
>>>>>
>>>>>
>>>>> On Wed, Jun 19, 2013 at 8:33 AM, David Kincaid <kincaid.dave@gmail.com
>>>>> > wrote:
>>>>>
>>>>>> We're seeing a strange thing happen with a sqoop import job with the
>>>>>> way the key range is getting distributed among the 4 mappers that are
>>>>>> running. The minimum key value is 2110 and the maximum value is 288272191.
>>>>>> We are getting one mapper that is only getting one record to import. Here
>>>>>> is the distribution among the mappers:
>>>>>>
>>>>>> [2110, 96092137)
>>>>>> [96092137, 192182164)
>>>>>> [192182164, 288272191)
>>>>>> [288272191, 288272192)
>>>>>>
>>>>>> you can see that the fourth mapper is given a range with only one
>>>>>> value in it. Could someone help me understand what is going on?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Dave
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Strange distribution of keys among mappers

Posted by Abraham Elmahrek <ab...@cloudera.com>.

Hey David,

With oracle, the BigDecimalSplitter will be used by default for all number
types.
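
For contrast, here is a sketch of the behaviour the user guide example quoted in this thread describes (min 0, max 1000, 4 tasks): exactly one range per task, with the last range extended to max + 1 so that it absorbs any remainder. This only reproduces that example; it is not IntegerSplitter's actual code, and the names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class EvenSplitSketch {

    // Returns {lo, hi} pairs describing half-open ranges [lo, hi),
    // one per task; the last range runs up to max + 1.
    public static List<long[]> evenRanges(long min, long max, int numTasks) {
        long size = (max - min) / numTasks; // integer division drops the remainder
        List<long[]> ranges = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            long lo = min + size * i;
            // The final range absorbs the remainder and includes max itself.
            long hi = (i == numTasks - 1) ? max + 1 : lo + size;
            ranges.add(new long[] {lo, hi});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // User guide example: prints (0, 250), (250, 500), (500, 750), (750, 1001).
        for (long[] r : evenRanges(0L, 1000L, 4)) {
            System.out.println("(" + r[0] + ", " + r[1] + ")");
        }
    }
}
```

Under this scheme the number of ranges always equals the number of tasks, which is why the split reported at the top of the thread looked surprising against the user guide.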

-Abe


On Wed, Jun 19, 2013 at 4:05 PM, David Kincaid <ki...@gmail.com>wrote:

> Abe, the database is Oracle.
>
>
> On Wed, Jun 19, 2013 at 5:48 PM, Abraham Elmahrek <ab...@cloudera.com>wrote:
>
>> David,
>>
>> What database are you importing from? The description I gave was for
>> datatypes that map to the BigDecimal Splitter. The userguide might be
>> referring to the IntegerSplitter which will add the remainder to the last
>> value.
>>
>> -Abe
>>
>>
>> On Wed, Jun 19, 2013 at 1:23 PM, David Kincaid <ki...@gmail.com>wrote:
>>
>>> Thanks. We didn't specify the number of mappers, so it's giving us 4. I
>>> understand your explanation, but it seems to conflict with the Sqoop user
>>> guide (
>>> http://sqoop.apache.org/docs/1.4.3/SqoopUserGuide.html#_controlling_parallelism
>>> ):
>>>
>>> "When performing parallel imports, Sqoop needs a criterion by which it
>>> can split the workload. Sqoop uses a *splitting column* to split the
>>> workload. By default, Sqoop will identify the primary key column (if
>>> present) in a table and use it as the splitting column. The low and high
>>> values for the splitting column are retrieved from the database, and the
>>> map tasks operate on evenly-sized components of the total range. For
>>> example, if you had a table with a primary key column of id whose
>>> minimum value was 0 and maximum value was 1000, and Sqoop was directed to
>>> use 4 tasks, Sqoop would run four processes which each execute SQL
>>> statements of the form SELECT * FROM sometable WHERE id >= lo AND id <
>>> hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750,
>>> 1001) in the different tasks."
>>>
>>>
>>> On Wed, Jun 19, 2013 at 3:14 PM, Abraham Elmahrek <ab...@cloudera.com>wrote:
>>>
>>>> Hey David,
>>>>
>>>> Here's the algorithm:
>>>> Split lengths are defined by (max - min)/(# mappers) and whatever is
>>>> left is tacked on at the end. So in this case, (288272191-2110)/3 =
>>>> 96090027 with no remainder (I'm assuming any fraction is rounded down), so split lengths
>>>> will be of length 96090027. Sqoop will then create splits with the
>>>> following points: (min) + (range length)*(n). We can see that 2110 + 96090027*0
>>>> = 2110, 2110 + 96090027*1 = 96092137, 2110 + 96090027*2 = 192182164,
>>>> and 2110 + 96090027*3 = 288272191 will be generated based off of this
>>>> algorithm. The last point to be added will be 288272192 because the
>>>> max value is not part of the generated split points. Then sqoop will
>>>> distribute the work accordingly based off of these points as you've pointed out
>>>> above.
>>>>
>>>> Just to be sure, did you configure sqoop to use 3 mappers?
>>>>
>>>> Hope this helps,
>>>> -Abe
>>>>
>>>>
>>>> On Wed, Jun 19, 2013 at 8:33 AM, David Kincaid <ki...@gmail.com>wrote:
>>>>
>>>>> We're seeing a strange thing happen with a sqoop import job with the
>>>>> way the key range is getting distributed among the 4 mappers that are
>>>>> running. The minimum key value is 2110 and the maximum value is 288272191.
>>>>> We are getting one mapper that is only getting one record to import. Here
>>>>> is the distribution among the mappers:
>>>>>
>>>>> [2110, 96092137)
>>>>> [96092137, 192182164)
>>>>> [192182164, 288272191)
>>>>> [288272191, 288272192)
>>>>>
>>>>> you can see that the fourth mapper is given a range with only one
>>>>> value in it. Could someone help me understand what is going on?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Dave
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Strange distribution of keys among mappers

Posted by David Kincaid <ki...@gmail.com>.

Abe, the database is Oracle.


On Wed, Jun 19, 2013 at 5:48 PM, Abraham Elmahrek <ab...@cloudera.com> wrote:

> David,
>
> What database are you importing from? The description I gave was for
> datatypes that map to the BigDecimalSplitter. The user guide might be
> referring to the IntegerSplitter, which will add the remainder to the last
> value.
>
> -Abe
>
>
> On Wed, Jun 19, 2013 at 1:23 PM, David Kincaid <ki...@gmail.com> wrote:
>
>> Thanks. We didn't specify the number of mappers, so it's giving us 4. I
>> understand your explanation, but it seems to conflict with the Sqoop user
>> guide (
>> http://sqoop.apache.org/docs/1.4.3/SqoopUserGuide.html#_controlling_parallelism
>> ):
>>
>> "When performing parallel imports, Sqoop needs a criterion by which it
>> can split the workload. Sqoop uses a *splitting column* to split the
>> workload. By default, Sqoop will identify the primary key column (if
>> present) in a table and use it as the splitting column. The low and high
>> values for the splitting column are retrieved from the database, and the
>> map tasks operate on evenly-sized components of the total range. For
>> example, if you had a table with a primary key column of id whose
>> minimum value was 0 and maximum value was 1000, and Sqoop was directed to
>> use 4 tasks, Sqoop would run four processes which each execute SQL
>> statements of the form SELECT * FROM sometable WHERE id >= lo AND id < hi,
>> with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750, 1001)
>> in the different tasks."
>>
>>
>> On Wed, Jun 19, 2013 at 3:14 PM, Abraham Elmahrek <ab...@cloudera.com> wrote:
>>
>>> Hey David,
>>>
>>> Here's the algorithm:
>>> Split lengths are defined by (max - min)/(# mappers) and whatever is
>>> left is tacked on at the end. So in this case, (288272191-2110)/3 =
>>> 96090027 with no remainder, so split lengths will be of length
>>> 96090027. Sqoop will then create splits with the following points:
>>> (min) + (range length)*(n). We can see that 2110 + 96090027*0 = 2110,
>>> 2110 + 96090027*1 = 96092137, 2110 + 96090027*2 = 192182164, and
>>> 2110 + 96090027*3 = 288272191 will be generated based on this
>>> algorithm. The last point to be added will be 288272192 because the
>>> upper bound of each range is exclusive, so the max value is not
>>> covered by the generated split points. Then Sqoop will distribute the
>>> work accordingly based on these points, as you've pointed out above.
>>>
>>> Just to be sure, did you configure sqoop to use 3 mappers?
>>>
>>> Hope this helps,
>>> -Abe
>>>
>>>
>>> On Wed, Jun 19, 2013 at 8:33 AM, David Kincaid <ki...@gmail.com> wrote:
>>>
>>>> We're seeing a strange thing happen with a sqoop import job with the
>>>> way the key range is getting distributed among the 4 mappers that are
>>>> running. The minimum key value is 2110 and the maximum value is 288272191.
>>>> We are getting one mapper that is only getting one record to import. Here
>>>> is the distribution among the mappers:
>>>>
>>>> [2110, 96092137)
>>>> [96092137, 192182164)
>>>> [192182164, 288272191)
>>>> [288272191, 288272192)
>>>>
>>>> you can see that the fourth mapper is given a range with only one value
>>>> in it. Could someone help me understand what is going on?
>>>>
>>>> Thanks,
>>>>
>>>> Dave
>>>>
>>>
>>>
>>
>

Re: Strange distribution of keys among mappers

Posted by Abraham Elmahrek <ab...@cloudera.com>.

David,

What database are you importing from? The description I gave was for
datatypes that map to the BigDecimalSplitter. The user guide might be
referring to the IntegerSplitter, which will add the remainder to the last
value.

-Abe


On Wed, Jun 19, 2013 at 1:23 PM, David Kincaid <ki...@gmail.com> wrote:

> Thanks. We didn't specify the number of mappers, so it's giving us 4. I
> understand your explanation, but it seems to conflict with the Sqoop user
> guide (
> http://sqoop.apache.org/docs/1.4.3/SqoopUserGuide.html#_controlling_parallelism
> ):
>
> "When performing parallel imports, Sqoop needs a criterion by which it
> can split the workload. Sqoop uses a *splitting column* to split the
> workload. By default, Sqoop will identify the primary key column (if
> present) in a table and use it as the splitting column. The low and high
> values for the splitting column are retrieved from the database, and the
> map tasks operate on evenly-sized components of the total range. For
> example, if you had a table with a primary key column of id whose minimum
> value was 0 and maximum value was 1000, and Sqoop was directed to use 4
> tasks, Sqoop would run four processes which each execute SQL statements of
> the form SELECT * FROM sometable WHERE id >= lo AND id < hi, with (lo, hi) set
> to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different
> tasks."
>
>
> On Wed, Jun 19, 2013 at 3:14 PM, Abraham Elmahrek <ab...@cloudera.com> wrote:
>
>> Hey David,
>>
>> Here's the algorithm:
>> Split lengths are defined by (max - min)/(# mappers) and whatever is left
>> is tacked on at the end. So in this case, (288272191-2110)/3 = 96090027
>> with no remainder, so split lengths will be of length 96090027. Sqoop
>> will then create splits with the following points:
>> (min) + (range length)*(n). We can see that 2110 + 96090027*0 = 2110,
>> 2110 + 96090027*1 = 96092137, 2110 + 96090027*2 = 192182164, and
>> 2110 + 96090027*3 = 288272191 will be generated based on this algorithm.
>> The last point to be added will be 288272192 because the upper bound of
>> each range is exclusive, so the max value is not covered by the
>> generated split points. Then Sqoop will distribute the work accordingly
>> based on these points, as you've pointed out above.
>>
>> Just to be sure, did you configure sqoop to use 3 mappers?
>>
>> Hope this helps,
>> -Abe
>>
>>
>> On Wed, Jun 19, 2013 at 8:33 AM, David Kincaid <ki...@gmail.com> wrote:
>>
>>> We're seeing a strange thing happen with a sqoop import job with the way
>>> the key range is getting distributed among the 4 mappers that are running.
>>> The minimum key value is 2110 and the maximum value is 288272191. We are
>>> getting one mapper that is only getting one record to import. Here is the
>>> distribution among the mappers:
>>>
>>> [2110, 96092137)
>>> [96092137, 192182164)
>>> [192182164, 288272191)
>>> [288272191, 288272192)
>>>
>>> you can see that the fourth mapper is given a range with only one value
>>> in it. Could someone help me understand what is going on?
>>>
>>> Thanks,
>>>
>>> Dave
>>>
>>
>>
>

Re: Strange distribution of keys among mappers

Posted by David Kincaid <ki...@gmail.com>.

Thanks. We didn't specify the number of mappers, so it's giving us 4. I
understand your explanation, but it seems to conflict with the Sqoop user
guide (
http://sqoop.apache.org/docs/1.4.3/SqoopUserGuide.html#_controlling_parallelism
):

"When performing parallel imports, Sqoop needs a criterion by which it can
split the workload. Sqoop uses a *splitting column* to split the workload.
By default, Sqoop will identify the primary key column (if present) in a
table and use it as the splitting column. The low and high values for the
splitting column are retrieved from the database, and the map tasks operate
on evenly-sized components of the total range. For example, if you had a
table with a primary key column of id whose minimum value was 0 and maximum
value was 1000, and Sqoop was directed to use 4 tasks, Sqoop would run four
processes which each execute SQL statements of the form SELECT * FROM
sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250), (250,
500), (500, 750), and (750, 1001) in the different tasks."
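
For reference, the user guide example quoted above can be reproduced with a
few lines of Python. This is only a sketch of the behaviour the guide
describes, not Sqoop's actual code; the function names even_ranges and
where_clauses are made up for illustration, and it assumes the step is the
integer part of (max - min)/tasks with the last range stretched to max + 1.

```python
def even_ranges(lo, hi, num_tasks):
    """Half-open (lo, hi) pairs as in the user guide example.

    Hypothetical helper: the step is the integer part of (hi - lo) / num_tasks,
    and the last range absorbs any remainder and ends at hi + 1 so that the
    maximum value itself is included.
    """
    step = (hi - lo) // num_tasks
    bounds = [lo + step * n for n in range(num_tasks)] + [hi + 1]
    return list(zip(bounds, bounds[1:]))


def where_clauses(column, ranges):
    """Render each range as the WHERE condition quoted in the guide."""
    return [f"{column} >= {a} AND {column} < {b}" for a, b in ranges]


if __name__ == "__main__":
    for clause in where_clauses("id", even_ranges(0, 1000, 4)):
        print("SELECT * FROM sometable WHERE " + clause)
```

With min 0, max 1000 and 4 tasks this yields the four ranges from the guide.
Under the same assumptions, our keys (2110 to 288272191) with 4 tasks would
give four ranges of roughly 72 million keys each, which is not what we see.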

On Wed, Jun 19, 2013 at 3:14 PM, Abraham Elmahrek <ab...@cloudera.com> wrote:

> Hey David,
>
> Here's the algorithm:
> Split lengths are defined by (max - min)/(# mappers) and whatever is left
> is tacked on at the end. So in this case, (288272191-2110)/3 = 96090027
> with no remainder, so split lengths will be of length 96090027. Sqoop will
> then create splits with the following points: (min) + (range length)*(n).
> We can see that 2110 + 96090027*0 = 2110, 2110 + 96090027*1 = 96092137,
> 2110 + 96090027*2 = 192182164, and 2110 + 96090027*3 = 288272191 will be
> generated based on this algorithm. The last point to be added will be
> 288272192 because the upper bound of each range is exclusive, so the max
> value is not covered by the generated split points. Then Sqoop will
> distribute the work accordingly based on these points, as you've pointed
> out above.
>
> Just to be sure, did you configure sqoop to use 3 mappers?
>
> Hope this helps,
> -Abe
>
>
> On Wed, Jun 19, 2013 at 8:33 AM, David Kincaid <ki...@gmail.com> wrote:
>
>> We're seeing a strange thing happen with a sqoop import job with the way
>> the key range is getting distributed among the 4 mappers that are running.
>> The minimum key value is 2110 and the maximum value is 288272191. We are
>> getting one mapper that is only getting one record to import. Here is the
>> distribution among the mappers:
>>
>> [2110, 96092137)
>> [96092137, 192182164)
>> [192182164, 288272191)
>> [288272191, 288272192)
>>
>> you can see that the fourth mapper is given a range with only one value
>> in it. Could someone help me understand what is going on?
>>
>> Thanks,
>>
>> Dave
>>
>
>

Re: Strange distribution of keys among mappers

Posted by Abraham Elmahrek <ab...@cloudera.com>.

Hey David,

Here's the algorithm:
Split lengths are defined by (max - min)/(# mappers) and whatever is left
is tacked on at the end. So in this case, (288272191-2110)/3 = 96090027
with no remainder, so split lengths will be of length 96090027. Sqoop will
then create splits with the following points: (min) + (range length)*(n).
We can see that 2110 + 96090027*0 = 2110, 2110 + 96090027*1 = 96092137,
2110 + 96090027*2 = 192182164, and 2110 + 96090027*3 = 288272191 will be
generated based on this algorithm. The last point to be added will be
288272192 because the upper bound of each range is exclusive, so the max
value is not covered by the generated split points. Then Sqoop will
distribute the work accordingly based on these points, as you've pointed
out above.

Just to be sure, did you configure sqoop to use 3 mappers?

Hope this helps,
-Abe
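
The arithmetic in the message above can be checked with a short Python
sketch. It follows the steps as described in this thread (integer split
length, points at min + length*n, then one extra point just past max); it is
not taken from Sqoop's source, and the names split_points and to_ranges are
hypothetical.

```python
def split_points(lo, hi, num_splits):
    """Boundary points following the description in this thread.

    Assumption: the split length is the integer part of
    (hi - lo) / num_splits, and one extra point at hi + 1 is appended so
    that half-open ranges still cover the maximum value.
    """
    length = (hi - lo) // num_splits
    points = [lo + length * n for n in range(num_splits + 1)]
    points.append(hi + 1)
    return points


def to_ranges(points):
    """Pair consecutive points into half-open [start, end) ranges."""
    return list(zip(points, points[1:]))


if __name__ == "__main__":
    for start, end in to_ranges(split_points(2110, 288272191, 3)):
        print(f"[{start}, {end})")
```

Run with min 2110, max 288272191 and 3 splits, this prints the same four
ranges David reported, including the final one-value range. Because
(288272191 - 2110) divides evenly by 3, the third generated point lands
exactly on the max, and the appended point creates a fourth range holding
only that key.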


On Wed, Jun 19, 2013 at 8:33 AM, David Kincaid <ki...@gmail.com> wrote:

> We're seeing a strange thing happen with a sqoop import job with the way
> the key range is getting distributed among the 4 mappers that are running.
> The minimum key value is 2110 and the maximum value is 288272191. We are
> getting one mapper that is only getting one record to import. Here is the
> distribution among the mappers:
>
> [2110, 96092137)
> [96092137, 192182164)
> [192182164, 288272191)
> [288272191, 288272192)
>
> you can see that the fourth mapper is given a range with only one value in
> it. Could someone help me understand what is going on?
>
> Thanks,
>
> Dave
>