Posted to user@spark.apache.org by Venkat Dabri <ve...@gmail.com> on 2018/10/15 18:56:18 UTC

Spark seems to think that a particular broadcast variable is large in size

I am trying to do a broadcast join on two tables. The size of the
smaller table will vary based upon the parameters, but the size of the
larger table is close to 2 TB. What I have noticed is that if I don't
set spark.sql.autoBroadcastJoinThreshold to 10G, some of these
operations do a SortMergeJoin instead of a broadcast join. But the
size of the smaller table shouldn't be that big at all. I wrote the
smaller table to an S3 folder and it took only 12.6 MB of space. I
did some operations on the smaller table so that the shuffle size
appears on the Spark History Server, and the size in memory seemed to
be 150 MB, nowhere near 10G. Also, if I force a broadcast join on the
smaller table it takes a long time to broadcast, leading me to think
that the table might not just be 150 MB in size. What would be a good
way to figure out the actual size that Spark sees when deciding
whether it crosses spark.sql.autoBroadcastJoinThreshold?
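
For reference, a minimal spark-shell sketch of the kind of thing I am
doing (the paths, the join column, and the threshold value here are
illustrative, not my real ones):

scala> import org.apache.spark.sql.functions.broadcast
scala> // the threshold is interpreted in bytes; 10 GB written out as a number
scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024 * 1024)
scala> val small = spark.read.parquet("s3://bucket/small-table")   // illustrative path
scala> val large = spark.read.parquet("s3://bucket/large-table")   // illustrative path
scala> val joined = large.join(broadcast(small), Seq("id"))        // explicit broadcast hint
scala> joined.explain()   // shows BroadcastHashJoin or SortMergeJoin in the physical plan

scala> // on Spark 2.3+ the optimizer's own size estimate (the number compared
scala> // against the threshold) can be read directly:
scala> // small.queryExecution.optimizedPlan.stats.sizeInBytes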



Re: Spark seems to think that a particular broadcast variable is large in size

Posted by Dillon Dukek <di...@placed.com.INVALID>.
You keep mentioning that you're viewing this after the fact in the Spark
History Server. Also, the spark-shell isn't a UI, so I'm not sure what you
mean by saying that the Storage tab is blank in the spark-shell. Just so
I'm clear about what you're doing, are you looking at this info in the
Spark UI (via the ResourceManager link in the EMR console) while your
application is running? That would be the route I would go. I'm not sure
that Spark retains storage information to be viewed after the fact, since
after the program is complete the DataFrame will be freed and you will lose
context about that data. If you would like to do a pared-down test in the
spark-shell you can do that as well. Once Spark is started via the
spark-shell command it will launch a Spark UI for you to view the job
progress. This data even sounds small enough that, if you are allowed to do
so, you should be able to launch this from your local machine and see the UI
at localhost:4040. I've confirmed this works locally for some data that I have.
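
For example, something along these lines in a locally launched
spark-shell (the parquet path is just a placeholder):

scala> val df = spark.read.parquet("/tmp/small-table")   // placeholder path
scala> df.persist()
scala> df.count()   // an action is needed so the cache actually materializes

scala> // while this shell session stays open, http://localhost:4040 -> Storage tab
scala> // should list the cached DataFrame along with its "Size in Memory"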

On Tue, Oct 16, 2018 at 8:05 AM Venkat Dabri <ve...@gmail.com> wrote:

> The same problem is mentioned here:
>
> https://forums.databricks.com/questions/117/why-is-my-rdd-not-showing-up-in-the-storage-tab-of.html
>
> https://stackoverflow.com/questions/44792213/blank-storage-tab-in-spark-history-server
> On Tue, Oct 16, 2018 at 8:06 AM Venkat Dabri <ve...@gmail.com>
> wrote:
> >
> > I did try that mechanism before, but the data never shows up in the
> > Storage tab. The Storage tab is always blank. I have tried it in
> > Zeppelin as well as in the spark-shell.
> >
> > scala> val classCount = spark.read.parquet("s3:// ..../classCount")
> > scala> classCount.persist
> > scala> classCount.count
> >
> > Nothing shows up in the Storage tab of either Zeppelin or the spark-shell.
> > However, I have several running applications in production that do
> > show the data in cache. I am using Scala and Spark 2.2.1 on EMR. Are
> > there any workarounds to see the data in cache?
> > On Mon, Oct 15, 2018 at 2:53 PM Dillon Dukek <di...@placed.com>
> wrote:
> > >
> > > In your program, persist the smaller table and use count to force it to
> > > materialize. Then in the Spark UI go to the Storage tab. The size of your
> > > table as Spark sees it should be displayed there. Out of curiosity, what
> > > version/language of Spark are you using?
> > >
> > > On Mon, Oct 15, 2018 at 11:53 AM Venkat Dabri <ve...@gmail.com>
> wrote:
> > >>
> > >> I am trying to do a broadcast join on two tables. The size of the
> > >> smaller table will vary based upon the parameters, but the size of the
> > >> larger table is close to 2 TB. What I have noticed is that if I don't
> > >> set spark.sql.autoBroadcastJoinThreshold to 10G, some of these
> > >> operations do a SortMergeJoin instead of a broadcast join. But the
> > >> size of the smaller table shouldn't be that big at all. I wrote the
> > >> smaller table to an S3 folder and it took only 12.6 MB of space. I
> > >> did some operations on the smaller table so that the shuffle size
> > >> appears on the Spark History Server, and the size in memory seemed to
> > >> be 150 MB, nowhere near 10G. Also, if I force a broadcast join on the
> > >> smaller table it takes a long time to broadcast, leading me to think
> > >> that the table might not just be 150 MB in size. What would be a good
> > >> way to figure out the actual size that Spark sees when deciding
> > >> whether it crosses spark.sql.autoBroadcastJoinThreshold?
> > >>
>

Re: Spark seems to think that a particular broadcast variable is large in size

Posted by Venkat Dabri <ve...@gmail.com>.
The same problem is mentioned here:
https://forums.databricks.com/questions/117/why-is-my-rdd-not-showing-up-in-the-storage-tab-of.html
https://stackoverflow.com/questions/44792213/blank-storage-tab-in-spark-history-server
On Tue, Oct 16, 2018 at 8:06 AM Venkat Dabri <ve...@gmail.com> wrote:
>
> I did try that mechanism before, but the data never shows up in the
> Storage tab. The Storage tab is always blank. I have tried it in
> Zeppelin as well as in the spark-shell.
>
> scala> val classCount = spark.read.parquet("s3:// ..../classCount")
> scala> classCount.persist
> scala> classCount.count
>
> Nothing shows up in the Storage tab of either Zeppelin or the spark-shell.
> However, I have several running applications in production that do
> show the data in cache. I am using Scala and Spark 2.2.1 on EMR. Are
> there any workarounds to see the data in cache?
> On Mon, Oct 15, 2018 at 2:53 PM Dillon Dukek <di...@placed.com> wrote:
> >
> > In your program, persist the smaller table and use count to force it to materialize. Then in the Spark UI go to the Storage tab. The size of your table as Spark sees it should be displayed there. Out of curiosity, what version/language of Spark are you using?
> >
> > On Mon, Oct 15, 2018 at 11:53 AM Venkat Dabri <ve...@gmail.com> wrote:
> >>
> >> I am trying to do a broadcast join on two tables. The size of the
> >> smaller table will vary based upon the parameters, but the size of the
> >> larger table is close to 2 TB. What I have noticed is that if I don't
> >> set spark.sql.autoBroadcastJoinThreshold to 10G, some of these
> >> operations do a SortMergeJoin instead of a broadcast join. But the
> >> size of the smaller table shouldn't be that big at all. I wrote the
> >> smaller table to an S3 folder and it took only 12.6 MB of space. I
> >> did some operations on the smaller table so that the shuffle size
> >> appears on the Spark History Server, and the size in memory seemed to
> >> be 150 MB, nowhere near 10G. Also, if I force a broadcast join on the
> >> smaller table it takes a long time to broadcast, leading me to think
> >> that the table might not just be 150 MB in size. What would be a good
> >> way to figure out the actual size that Spark sees when deciding
> >> whether it crosses spark.sql.autoBroadcastJoinThreshold?
> >>



Re: Spark seems to think that a particular broadcast variable is large in size

Posted by Venkat Dabri <ve...@gmail.com>.
I did try that mechanism before, but the data never shows up in the
Storage tab. The Storage tab is always blank. I have tried it in
Zeppelin as well as in the spark-shell.

scala> val classCount = spark.read.parquet("s3:// ..../classCount")
scala> classCount.persist
scala> classCount.count

Nothing shows up in the Storage tab of either Zeppelin or the spark-shell.
However, I have several running applications in production that do
show the data in cache. I am using Scala and Spark 2.2.1 on EMR. Are
there any workarounds to see the data in cache?
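
A possible workaround I may fall back on (a rough sketch, same
spark-shell session as above) is to check the cache status
programmatically instead of through the Storage tab:

scala> classCount.storageLevel
scala> // reports a non-NONE StorageLevel once persist has been called

scala> spark.sparkContext.getRDDStorageInfo.foreach(println)
scala> // lists the RDDs that currently hold cached blocks, with their memory and
scala> // disk sizes, which is roughly the same information the Storage tab shows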
On Mon, Oct 15, 2018 at 2:53 PM Dillon Dukek <di...@placed.com> wrote:
>
> In your program, persist the smaller table and use count to force it to materialize. Then in the Spark UI go to the Storage tab. The size of your table as Spark sees it should be displayed there. Out of curiosity, what version/language of Spark are you using?
>
> On Mon, Oct 15, 2018 at 11:53 AM Venkat Dabri <ve...@gmail.com> wrote:
>>
>> I am trying to do a broadcast join on two tables. The size of the
>> smaller table will vary based upon the parameters, but the size of the
>> larger table is close to 2 TB. What I have noticed is that if I don't
>> set spark.sql.autoBroadcastJoinThreshold to 10G, some of these
>> operations do a SortMergeJoin instead of a broadcast join. But the
>> size of the smaller table shouldn't be that big at all. I wrote the
>> smaller table to an S3 folder and it took only 12.6 MB of space. I
>> did some operations on the smaller table so that the shuffle size
>> appears on the Spark History Server, and the size in memory seemed to
>> be 150 MB, nowhere near 10G. Also, if I force a broadcast join on the
>> smaller table it takes a long time to broadcast, leading me to think
>> that the table might not just be 150 MB in size. What would be a good
>> way to figure out the actual size that Spark sees when deciding
>> whether it crosses spark.sql.autoBroadcastJoinThreshold?
>>



Re: Spark seems to think that a particular broadcast variable is large in size

Posted by Dillon Dukek <di...@placed.com.INVALID>.
In your program, persist the smaller table and use count to force it to
materialize. Then in the Spark UI go to the Storage tab. The size of your
table as Spark sees it should be displayed there. Out of curiosity, what
version/language of Spark are you using?
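
Roughly something like this (a sketch only; the path is a placeholder,
not from your job):

val small = spark.read.parquet("s3://bucket/small-table")   // placeholder path
small.persist()   // DataFrames/Datasets default to MEMORY_AND_DISK
small.count()     // the action forces the cached data to be built
// then, while the application is still running, the Spark UI's Storage tab
// shows the table's size in memory as Spark sees it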

On Mon, Oct 15, 2018 at 11:53 AM Venkat Dabri <ve...@gmail.com> wrote:

> I am trying to do a broadcast join on two tables. The size of the
> smaller table will vary based upon the parameters, but the size of the
> larger table is close to 2 TB. What I have noticed is that if I don't
> set spark.sql.autoBroadcastJoinThreshold to 10G, some of these
> operations do a SortMergeJoin instead of a broadcast join. But the
> size of the smaller table shouldn't be that big at all. I wrote the
> smaller table to an S3 folder and it took only 12.6 MB of space. I
> did some operations on the smaller table so that the shuffle size
> appears on the Spark History Server, and the size in memory seemed to
> be 150 MB, nowhere near 10G. Also, if I force a broadcast join on the
> smaller table it takes a long time to broadcast, leading me to think
> that the table might not just be 150 MB in size. What would be a good
> way to figure out the actual size that Spark sees when deciding
> whether it crosses spark.sql.autoBroadcastJoinThreshold?
>