Posted to user@pig.apache.org by Rohini Palaniswamy <ro...@gmail.com> on 2016/06/01 17:55:21 UTC

Re: data discrepancies related to parallelism

Kurt,
   Did you find the problem?

Regards,
Rohini

On Thu, May 5, 2016 at 1:41 PM, Kurt Muehlner <km...@connexity.com>
wrote:

> Hello all,
>
> I posted this issue in the Tez user group earlier today, where it was
> suggested I also post it here.  We have a Pig/Tez application exhibiting
> data discrepancies that occur only when there is a difference between
> requested parallelism (set via SET default_parallel) and the number of
> containers YARN is able to allocate to the application.
>
> Has anyone seen this sort of problem, or have any suspicions as to what
> may be going wrong?
>
> Thanks,
> Kurt
>
> Original message to Tez group:
>
> Hello,
>
> We have a Pig/Tez application that is exhibiting a strange problem.  The
> application was recently migrated from Pig/MR to Pig/Tez.  During QA we
> carefully vetted that the MR and Tez versions produced identical results.
> However, after deploying to production, we noticed that occasionally the
> results are not the same (either compared to the MR results, or to Tez
> processing the same data on a QA cluster).
>
> We’re still looking into the root cause, but I’d like to reach out to the
> user group in case anyone has seen anything similar, or has suggestions on
> what might be wrong/what to investigate.
>
> *** What we know so far ***
> The results discrepancy occurs ONLY when the number of containers given to
> the application by YARN is less than the number requested (we have disabled
> auto-parallelism and use SET default_parallel 50 in all of our Pig scripts;
> see the sketch below).  When this occurs, we also see a corresponding
> discrepancy in the file system counters HDFS_READ_OPS and HDFS_BYTES_READ
> (both are lower when the number of containers is low), despite the fact
> that the number of records processed is identical in all cases.
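>
> For reference, here is roughly what we have near the top of each script (a
> minimal sketch; the auto-parallelism property name is from memory, so treat
> it as an assumption rather than a verbatim copy of our scripts):
>
>   -- pin reduce-side parallelism at 50 instead of letting Tez choose it
>   SET default_parallel 50;
>   -- keep Tez from adjusting vertex parallelism at runtime
>   -- (property name approximate)
>   SET pig.tez.auto.parallelism false;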
>
> Thus, when the production cluster is very busy, we get invalid results.
> We have kept a separate instance of the Pig/Tez application running on
> another cluster where it never competes for resources, so we have been able
> to compare results for each run of the application, which has allowed us to
> diagnose the problem this far.  By comparing results on these two clusters,
> we also know that the ratio (actual HDFS_READ_OPS)/(expected HDFS_READ_OPS)
> tracks the ratio (actual containers)/(requested containers): the fewer
> containers we receive relative to the number requested, the fewer HDFS read
> ops we see.
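>
> As a concrete example from the run whose counters are listed below:
> production shows HDFS_READ_OPS=3080 versus 3600 on the QA cluster, a ratio
> of 3080/3600, or about 0.86.  Per the correlation above, that suggests
> production received roughly 86% of the requested containers on this run
> (the actual container counts are not captured in these counters, so this
> is an inference rather than a measurement).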
>
> Below are some relevant counters.  For each counter, the first line is the
> value from the production cluster showing the problem, and the second line
> is the value from the QA cluster running on the same data.
>
> Any hints/suggestions/questions are most welcome.
>
> Thanks,
> Kurt
>
> org.apache.tez.common.counters.DAGCounter
>
>   NUM_SUCCEEDED_TASKS=950
>   NUM_SUCCEEDED_TASKS=950
>
>   TOTAL_LAUNCHED_TASKS=950
>   TOTAL_LAUNCHED_TASKS=950
>
> File System Counters
>
>   FILE_BYTES_READ=7745801982
>   FILE_BYTES_READ=8003771938
>
>   FILE_BYTES_WRITTEN=9725468612
>   FILE_BYTES_WRITTEN=9675253887
>
>   *HDFS_BYTES_READ=9487600888  (when number of containers equals the
> number requested, this counter is the same between the two clusters)
>   *HDFS_BYTES_READ=17996466110
>
>   *HDFS_READ_OPS=3080  (when number of containers equals the number
> requested, this counter is the same between the two clusters)
>   *HDFS_READ_OPS=3600
>
>   HDFS_WRITE_OPS=900
>   HDFS_WRITE_OPS=900
>
> org.apache.tez.common.counters.TaskCounter
>   INPUT_RECORDS_PROCESSED=28729671
>   INPUT_RECORDS_PROCESSED=28729671
>
>
>   OUTPUT_RECORDS=33655895
>   OUTPUT_RECORDS=33655895
>
>   OUTPUT_BYTES=28290888628
>   OUTPUT_BYTES=28294000270
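>
> Note that on the problem run HDFS_BYTES_READ is roughly half of the
> expected value (9487600888 / 17996466110 is about 0.53), even though
> INPUT_RECORDS_PROCESSED and OUTPUT_RECORDS match exactly.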
>
> Input(s):
> Successfully read 2254733 records (1632743360 bytes) from: "input1"
> Successfully read 2254733 records (1632743360 bytes) from: "input1"
>
>
> Output(s):
> Successfully stored 0 records in: "output1"
> Successfully stored 0 records in: "output1"
>
> Successfully stored 56019 records (10437069 bytes) in: "output2"
> Successfully stored 56019 records (10437069 bytes) in: "output2"
>
> Successfully stored 2254733 records (1651936175 bytes) in: "output3"
> Successfully stored 2254733 records (1651936175 bytes) in: "output3"
>
> Successfully stored 1160599 records (823479742 bytes) in: "output4"
> Successfully stored 1160599 records (823480450 bytes) in: "output4"
>
> Successfully stored 28605 records (21176320 bytes) in: "output5"
> Successfully stored 28605 records (21177552 bytes) in: "output5"
>
> Successfully stored 6574 records (4442933 bytes) in: "output6"
> Successfully stored 6574 records (4442933 bytes) in: "output6"
>
> Successfully stored 111416 records (164375858 bytes) in: "output7"
> Successfully stored 111416 records (164379800 bytes) in: "output7"
>
> Successfully stored 542 records (387761 bytes) in: "output8"
> Successfully stored 542 records (387762 bytes) in: "output8"
>
>
>
>

Re: data discrepancies related to parallelism

Posted by Kurt Muehlner <km...@connexity.com>.
Hi Rohini,

Unfortunately we were not able to find it.  Investigation is currently on hold, but we do plan to look into it further.

-Kurt

On 6/1/16, 10:55 AM, "Rohini Palaniswamy" <ro...@gmail.com> wrote:

>Kurt,
>   Did you find the problem?
>
>Regards,
>Rohini