Posted to user@spark.apache.org by Reminia Scarlet <re...@gmail.com> on 2019/10/23 12:57:04 UTC

Spark Streaming logical plan leaf nodes do not equal physical plan leaf nodes, and streaming metrics cannot be reported.

Hi all:
 I use StreamingQueryListener to report the per-batch inputRecordsNum as a
metric, but numInputRows is always 0, and the log from
MicroBatchExecution.scala says:

 2019-10-23 06:56:05 WARN  MicroBatchExecution:66 - Could not report
metrics as number leaves in trigger logical plan did not match that of
the execution plan:

 This makes the number of input rows per source always 0 in the code
below from ProgressReporter.scala, because the leaf counts of the
logical plan and the execution plan do not match.

[image: image.png]
Attached are the leaves of the logical plan and the physical plan. I think
there might be a bug: LogicalRDD seems to be duplicated as a Relation in
the logical plan and is therefore counted twice as a leaf. If the
LogicalRDD were removed, the leaf counts would match.

[image: image.png]
[image: image.png]

Can anyone help? Thanks very much.
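For readers unfamiliar with this code path, the attribution logic in ProgressReporter can be pictured roughly like this. This is a simplified Python sketch of the mechanism, not Spark's actual Scala code, and the leaf names are illustrative only:

```python
# Simplified model of how per-source input-row metrics are attributed.
# If the trigger's logical plan and the executed plan expose different
# numbers of leaf nodes, the positional association between sources and
# scan metrics is ambiguous, so every source ends up reported as 0 rows
# (which is the warning seen in the logs above).

def attribute_input_rows(logical_leaves, physical_leaves, rows_per_physical_leaf):
    """Map each logical leaf (source) to a row count by zipping positionally."""
    if len(logical_leaves) != len(physical_leaves):
        # Mismatch: cannot safely associate sources with scan metrics.
        print("WARN Could not report metrics: leaf counts differ")
        return {leaf: 0 for leaf in logical_leaves}
    return dict(zip(logical_leaves, rows_per_physical_leaf))

# A batch where the logical plan contains a duplicated LogicalRDD alongside
# the Relation (the situation described above): 3 logical leaves vs. 2
# physical leaves, so both real sources are reported as 0.
logical = ["EventHubsSource", "Relation[csv]", "LogicalRDD"]
physical = ["ScanEventHubs", "FileScan[csv]"]
print(attribute_input_rows(logical, physical, [100, 50]))
```

When the counts do match, the zip pairs each source with its scan's row count, which is why a single duplicated leaf silently zeroes out every metric rather than just one.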

Re: Spark Streaming logical plan leaf nodes do not equal physical plan leaf nodes, and streaming metrics cannot be reported.

Posted by Jungtaek Lim <ka...@gmail.com>.
What you've seen is the code path taken when at least one DSv1 source is
used in the query; the leaf counts fail to match because of a known
limitation.

SPARK-24050 describes the "technical limitation" that prevents resolving
this when a DSv1 source is used, so please refer to the issue description
if you're interested.
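The limitation can be pictured as follows: with DSv1, a source hands back an arbitrary DataFrame for each micro-batch, so substituting it into the trigger's logical plan can change the number of leaves. This is a toy Python illustration of that substitution, not Spark code, and the leaf names are made up:

```python
# Toy illustration of the DSv1 substitution described above: the trigger's
# logical plan contains one streaming-relation leaf per source, which the
# micro-batch executor replaces with the batch's materialized plan. A DSv1
# source may return a DataFrame whose plan has any number of leaves, so the
# substitution can break the one-leaf-per-source correspondence.

def substitute_dsv1(logical_plan_leaves):
    """Replace each streaming-relation leaf with its batch plan's leaves."""
    out = []
    for leaf in logical_plan_leaves:
        if leaf.startswith("StreamingRelation"):
            # Hypothetical DSv1 source whose batch DataFrame has two leaves.
            out.extend(["LogicalRDD", "Relation[files]"])
        else:
            out.append(leaf)
    return out

before = ["StreamingRelation[eventhub]", "Relation[csv]"]
after = substitute_dsv1(before)
print(len(before), len(after))  # leaf counts now differ
```

Once the counts diverge like this, the positional matching sketched earlier in the thread has no safe way to pair sources with scans, hence the zeroed metrics.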


On Thu, Oct 24, 2019 at 3:14 PM Reminia Scarlet <re...@gmail.com>
wrote:

> @Jungtaek Lim <ka...@gmail.com>
> We joined streaming from eventhub and static dataframe  from csv and
> parquet with simple spark.read.csv/ parquet method.
> Are sure this is a bug? I am not that familiar with spark codes.
> Also forward to dev email list for help.
>
>
> On Thu, Oct 24, 2019 at 6:11 AM Jungtaek Lim <ka...@gmail.com>
> wrote:
>
>> Sorry I haven't checked the details on SPARK-24050. Looks like it was
>> only resolved with DSv2 sources, and there're some streaming sources still
>> using DSv1.
>> File stream source is one of the case, so SPARK-24050 may not help here.
>> I guess that was technical reason to only dealt with DSv2, so I'm not sure
>> there's a good way to deal with this.
>>
>> Hopefully file stream source seems to be migrated to DSv2 in Spark 3.0,
>> so Spark 3.0 would help solving the problem.
>>
>> On Wed, Oct 23, 2019 at 11:21 PM Reminia Scarlet <
>> reminia.scarlet@gmail.com> wrote:
>>
>>> @Jungtaek
>>> I'm using  Spark 2.4 (HDI 4.0)  in Azure.
>>> Maybe there are other corner cases not taking into consideration.
>>> Also I will decompile the spark jar from Azure to check the source code .
>>>
>>> On Wed, Oct 23, 2019 at 9:39 PM Jungtaek Lim <
>>> kabhwan.opensource@gmail.com> wrote:
>>>
>>>> Which version of Spark you are using?
>>>> I guess there was relevant issue SPARK-24050 [1] which was fixed in
>>>> Spark 2.4.0 so you may want to check the latest version out and try if you
>>>> use lower version.
>>>>
>>>> - Jungtaek Lim (HeartSaVioR)
>>>>
>>>> 1. https://issues.apache.org/jira/browse/SPARK-24050
>>>>
>>>> On Wed, Oct 23, 2019 at 9:57 PM Reminia Scarlet <
>>>> reminia.scarlet@gmail.com> wrote:
>>>>
>>>>> Hi all:
>>>>>  I use StreamingQueryListener to report batch inputRecordsNum as
>>>>> metrics.
>>>>>  But the numInputRows is aways 0. And the debug log  in
>>>>> MicroBatchExecution.scala said:
>>>>>
>>>>>  2019-10-23 06:56:05 WARN  MicroBatchExecution:66 - Could not report metrics as number leaves in trigger logical plan did not match that of the execution plan:
>>>>>
>>>>>  And this causes num input rows by sources always 0 from below codes in ProgressReporter.scala when number of leaves size not matches in logical plan and execution plan.
>>>>>
>>>>> [image: image.png]
>>>>> Attached the output logical plan && physical plan leaves. I think there might be some bugs. Seems LogicalRDD is duplicate as Relation in the logical plan.
>>>>> And counting twice as leaf.If we remove the LogcialRDD, leave size should be the same.
>>>>>
>>>>> [image: image.png]
>>>>> [image: image.png]
>>>>>
>>>>> Can anyone help? Thx very much.
>>>>>
>>>>>


Re: Spark Streaming logical plan leaf nodes do not equal physical plan leaf nodes, and streaming metrics cannot be reported.

Posted by Reminia Scarlet <re...@gmail.com>.
@Jungtaek Lim <ka...@gmail.com>
We joined a stream from Event Hubs with static DataFrames read from CSV
and Parquet using the plain spark.read.csv / spark.read.parquet methods.
Are you sure this is a bug? I am not that familiar with the Spark code.
I am also forwarding this to the dev mailing list for help.


On Thu, Oct 24, 2019 at 6:11 AM Jungtaek Lim <ka...@gmail.com>
wrote:

> Sorry I haven't checked the details on SPARK-24050. Looks like it was only
> resolved with DSv2 sources, and there're some streaming sources still using
> DSv1.
> File stream source is one of the case, so SPARK-24050 may not help here. I
> guess that was technical reason to only dealt with DSv2, so I'm not sure
> there's a good way to deal with this.
>
> Hopefully file stream source seems to be migrated to DSv2 in Spark 3.0, so
> Spark 3.0 would help solving the problem.
>
> On Wed, Oct 23, 2019 at 11:21 PM Reminia Scarlet <
> reminia.scarlet@gmail.com> wrote:
>
>> @Jungtaek
>> I'm using  Spark 2.4 (HDI 4.0)  in Azure.
>> Maybe there are other corner cases not taking into consideration.
>> Also I will decompile the spark jar from Azure to check the source code .
>>
>> On Wed, Oct 23, 2019 at 9:39 PM Jungtaek Lim <
>> kabhwan.opensource@gmail.com> wrote:
>>
>>> Which version of Spark you are using?
>>> I guess there was relevant issue SPARK-24050 [1] which was fixed in
>>> Spark 2.4.0 so you may want to check the latest version out and try if you
>>> use lower version.
>>>
>>> - Jungtaek Lim (HeartSaVioR)
>>>
>>> 1. https://issues.apache.org/jira/browse/SPARK-24050
>>>
>>> On Wed, Oct 23, 2019 at 9:57 PM Reminia Scarlet <
>>> reminia.scarlet@gmail.com> wrote:
>>>
>>>> Hi all:
>>>>  I use StreamingQueryListener to report batch inputRecordsNum as
>>>> metrics.
>>>>  But the numInputRows is aways 0. And the debug log  in
>>>> MicroBatchExecution.scala said:
>>>>
>>>>  2019-10-23 06:56:05 WARN  MicroBatchExecution:66 - Could not report metrics as number leaves in trigger logical plan did not match that of the execution plan:
>>>>
>>>>  And this causes num input rows by sources always 0 from below codes in ProgressReporter.scala when number of leaves size not matches in logical plan and execution plan.
>>>>
>>>> [image: image.png]
>>>> Attached the output logical plan && physical plan leaves. I think there might be some bugs. Seems LogicalRDD is duplicate as Relation in the logical plan.
>>>> And counting twice as leaf.If we remove the LogcialRDD, leave size should be the same.
>>>>
>>>> [image: image.png]
>>>> [image: image.png]
>>>>
>>>> Can anyone help? Thx very much.
>>>>
>>>>


Re: Spark Streaming logical plan leaf nodes do not equal physical plan leaf nodes, and streaming metrics cannot be reported.

Posted by Jungtaek Lim <ka...@gmail.com>.
Sorry, I hadn't checked the details of SPARK-24050. It looks like it was
only resolved for DSv2 sources, and some streaming sources still use DSv1.
The file stream source is one such case, so SPARK-24050 may not help here.
I guess there was a technical reason to deal only with DSv2, so I'm not
sure there's a good way to handle this.

Hopefully the file stream source will be migrated to DSv2 in Spark 3.0, so
Spark 3.0 should solve the problem.


Re: Spark Streaming logical plan leaf nodes do not equal physical plan leaf nodes, and streaming metrics cannot be reported.

Posted by Reminia Scarlet <re...@gmail.com>.
@Jungtaek
I'm using Spark 2.4 (HDI 4.0) on Azure.
Maybe there are other corner cases not taken into consideration.
I will also decompile the Spark jar from Azure to check the source code.


Re: Spark Streaming logical plan leaf nodes do not equal physical plan leaf nodes, and streaming metrics cannot be reported.

Posted by Jungtaek Lim <ka...@gmail.com>.
Which version of Spark are you using?
I guess the relevant issue is SPARK-24050 [1], which was fixed in Spark
2.4.0, so if you are on a lower version you may want to try the latest
release.

- Jungtaek Lim (HeartSaVioR)

1. https://issues.apache.org/jira/browse/SPARK-24050
