You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Randy Gelhausen <rg...@gmail.com> on 2016/06/26 22:34:22 UTC

Spark 1.6.1: Unexpected partition behavior?

<code>
val enriched_web_logs = sqlContext.sql("""
select web_logs.datetime, web_logs.node as app_host, source_ip, b.node as
source_host, log
from web_logs
left outer join (select distinct node, address from nodes) b on source_ip =
address
""")
enriched_web_logs.coalesce(1).write.format("parquet").mode("overwrite").save(bucket+"derived/enriched_web_logs")
enriched_web_logs.registerTempTable("enriched_web_logs")
sqlContext.cacheTable("enriched_web_logs")
</code>

There are only 524 records in the resulting table, and I have explicitly
attempted to coalesce into 1 partition.

Yet my Spark UI shows 200 (mostly empty) partitions:
RDD NameStorage LevelCached PartitionsFraction CachedSize in MemorySize in
ExternalBlockStoreSize on Disk
In-memory table enriched_web_logs
<http://localhost:4040/storage/rdd?id=86> Memory
Deserialized 1x Replicated 200 100% 22.0 KB 0.0 B 0.0 BWhy would there be
200 partitions despite the coalesce call?

Re: Spark 1.6.1: Unexpected partition behavior?

Posted by Randy Gelhausen <rg...@gmail.com>.
Sorry, please ignore the above.

I now see I called coalesce on a different reference, than I used to
register the table.

On Sun, Jun 26, 2016 at 6:34 PM, Randy Gelhausen <rg...@gmail.com> wrote:

> <code>
> val enriched_web_logs = sqlContext.sql("""
> select web_logs.datetime, web_logs.node as app_host, source_ip, b.node as
> source_host, log
> from web_logs
> left outer join (select distinct node, address from nodes) b on source_ip
> = address
> """)
>
> enriched_web_logs.coalesce(1).write.format("parquet").mode("overwrite").save(bucket+"derived/enriched_web_logs")
> enriched_web_logs.registerTempTable("enriched_web_logs")
> sqlContext.cacheTable("enriched_web_logs")
> </code>
>
> There are only 524 records in the resulting table, and I have explicitly
> attempted to coalesce into 1 partition.
>
> Yet my Spark UI shows 200 (mostly empty) partitions:
> RDD NameStorage LevelCached PartitionsFraction CachedSize in MemorySize
> in ExternalBlockStoreSize on Disk
> In-memory table enriched_web_logs
> <http://localhost:4040/storage/rdd?id=86> Memory Deserialized 1x
> Replicated 200 100% 22.0 KB 0.0 B 0.0 BWhy would there be 200 partitions
> despite the coalesce call?
>

Re: Spark 1.6.1: Unexpected partition behavior?

Posted by Randy Gelhausen <rg...@gmail.com>.
Sorry, please ignore the above.

I now see I called coalesce on a different reference, than I used to
register the table.

On Sun, Jun 26, 2016 at 6:34 PM, Randy Gelhausen <rg...@gmail.com> wrote:

> <code>
> val enriched_web_logs = sqlContext.sql("""
> select web_logs.datetime, web_logs.node as app_host, source_ip, b.node as
> source_host, log
> from web_logs
> left outer join (select distinct node, address from nodes) b on source_ip
> = address
> """)
>
> enriched_web_logs.coalesce(1).write.format("parquet").mode("overwrite").save(bucket+"derived/enriched_web_logs")
> enriched_web_logs.registerTempTable("enriched_web_logs")
> sqlContext.cacheTable("enriched_web_logs")
> </code>
>
> There are only 524 records in the resulting table, and I have explicitly
> attempted to coalesce into 1 partition.
>
> Yet my Spark UI shows 200 (mostly empty) partitions:
> RDD NameStorage LevelCached PartitionsFraction CachedSize in MemorySize
> in ExternalBlockStoreSize on Disk
> In-memory table enriched_web_logs
> <http://localhost:4040/storage/rdd?id=86> Memory Deserialized 1x
> Replicated 200 100% 22.0 KB 0.0 B 0.0 BWhy would there be 200 partitions
> despite the coalesce call?
>