Posted to user@pig.apache.org by Kurt Muehlner <km...@connexity.com> on 2016/05/05 20:41:35 UTC

data discrepancies related to parallelism

Hello all,

I posted this issue in the Tez user group earlier today, where it was suggested I also post it here.  We have a Pig/Tez application exhibiting data discrepancies which occur only when there is a difference between requested parallelism (via SET_DEFAULT_PARALLEL) and the number of containers YARN is able to allocate to the application.

Has anyone seen this sort of problem, or have any suspicions as to what may be going wrong?

Thanks,
Kurt

Original message to Tez group:

Hello,

We have a Pig/Tez application which is exhibiting a strange problem.  This application was recently migrated from Pig/MR to Pig/Tez.  We carefully vetted during QA that both MR and Tez versions produced identical results.  However, after deploying to production, we noticed that occasionally, results are not the same (either as compared to MR results, or results of Tez processing the same data on a QA cluster).

We’re still looking into the root cause, but I’d like to reach out to the user group in case anyone has seen anything similar, or has suggestions on what might be wrong/what to investigate.

*** What we know so far ***
The results discrepancy occurs ONLY when the number of containers given to the application by YARN is less than the number requested (we have disabled auto-parallelism, and are using SET_DEFAULT_PARALLEL=50 in all pig scripts).  When this occurs, we also see a corresponding discrepancy in the file system counters HDFS_READ_OPS and HDFS_BYTES_READ (lower when the number of containers is low), despite the fact that in all cases the number of records processed is identical.
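[Editor's note: the "SET_DEFAULT_PARALLEL=50" above presumably refers to Pig's standard default_parallel property; a minimal sketch of how that is typically set at the top of a Pig script, assuming the stock property name is used:]

```pig
-- Request 50 reduce tasks for every blocking operator
-- (GROUP, JOIN, ORDER BY, DISTINCT, ...) in this script.
SET default_parallel 50;
```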

Thus, when the production cluster is very busy, we get invalid results.  We have kept a separate instance of the Pig/Tez application running on another cluster where it never competes for resources, so we have been able to compare results for each run of the application, which has allowed us to diagnose the problem this far.  By comparing results on these two clusters, we also know that the ratio (actual HDFS_READ_OPS)/(expected HDFS_READ_OPS) correlates with the ratio (actual containers)/(requested containers).  Likewise, the ratio (actual HDFS_BYTES_READ)/(expected HDFS_BYTES_READ) tracks the container ratio.
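[Editor's note: as a quick illustration of the ratios being described, here is a small sketch using the HDFS counter values quoted below; the container counts themselves are not given in the thread, so only the counter ratios are computed:]

```python
# Counter values from the thread: first value is the production run
# showing the problem, second is the QA run on the same data.
actual_read_ops = 3080           # production HDFS_READ_OPS
expected_read_ops = 3600         # QA HDFS_READ_OPS

actual_bytes = 9487600888        # production HDFS_BYTES_READ
expected_bytes = 17996466110     # QA HDFS_BYTES_READ

# The report is that these ratios track (actual containers)/(requested
# containers) when YARN under-allocates.
ops_ratio = actual_read_ops / expected_read_ops
bytes_ratio = actual_bytes / expected_bytes
print(f"read-ops ratio: {ops_ratio:.3f}")    # ~0.856
print(f"bytes ratio:    {bytes_ratio:.3f}")  # ~0.527
```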

Below are some relevant counters.  For each counter, the first line is the value from the production cluster showing the problem, and the second line is the value from the QA cluster running on the same data.

Any hints/suggestions/questions are most welcome.

Thanks,
Kurt

org.apache.tez.common.counters.DAGCounter

  NUM_SUCCEEDED_TASKS=950
  NUM_SUCCEEDED_TASKS=950
  
  TOTAL_LAUNCHED_TASKS=950
  TOTAL_LAUNCHED_TASKS=950
  
File System Counters

  FILE_BYTES_READ=7745801982
  FILE_BYTES_READ=8003771938

  FILE_BYTES_WRITTEN=9725468612
  FILE_BYTES_WRITTEN=9675253887

  *HDFS_BYTES_READ=9487600888  (when number of containers equals the number requested, this counter is the same between the two clusters)
  *HDFS_BYTES_READ=17996466110

  *HDFS_READ_OPS=3080  (when number of containers equals the number requested, this counter is the same between the two clusters)
  *HDFS_READ_OPS=3600

  HDFS_WRITE_OPS=900
  HDFS_WRITE_OPS=900

org.apache.tez.common.counters.TaskCounter
  INPUT_RECORDS_PROCESSED=28729671
  INPUT_RECORDS_PROCESSED=28729671


  OUTPUT_RECORDS=33655895
  OUTPUT_RECORDS=33655895

  OUTPUT_BYTES=28290888628
  OUTPUT_BYTES=28294000270

Input(s):
Successfully read 2254733 records (1632743360 bytes) from: "input1"
Successfully read 2254733 records (1632743360 bytes) from: "input1"


Output(s):
Successfully stored 0 records in: "output1"
Successfully stored 0 records in: "output1"

Successfully stored 56019 records (10437069 bytes) in: "output2"
Successfully stored 56019 records (10437069 bytes) in: "output2"

Successfully stored 2254733 records (1651936175 bytes) in: "output3"
Successfully stored 2254733 records (1651936175 bytes) in: "output3"

Successfully stored 1160599 records (823479742 bytes) in: "output4"
Successfully stored 1160599 records (823480450 bytes) in: "output4"

Successfully stored 28605 records (21176320 bytes) in: "output5"
Successfully stored 28605 records (21177552 bytes) in: "output5"

Successfully stored 6574 records (4442933 bytes) in: "output6"
Successfully stored 6574 records (4442933 bytes) in: "output6"

Successfully stored 111416 records (164375858 bytes) in: "output7"
Successfully stored 111416 records (164379800 bytes) in: "output7"

Successfully stored 542 records (387761 bytes) in: "output8"
Successfully stored 542 records (387762 bytes) in: "output8"




Re: data discrepancies related to parallelism

Posted by Kurt Muehlner <km...@connexity.com>.
Hi Rohini,

Unfortunately we were not able to find it.  Investigation of this is currently on hold, but we do plan to investigate further.

-Kurt

On 6/1/16, 10:55 AM, "Rohini Palaniswamy" <ro...@gmail.com> wrote:

>Kurt,
>   Did you find the problem?
>
>Regards,
>Rohini


Re: data discrepancies related to parallelism

Posted by Rohini Palaniswamy <ro...@gmail.com>.
Kurt,
   Did you find the problem?

Regards,
Rohini
