Posted to user@spark.apache.org by Matthew Bucci <mr...@gmail.com> on 2015/03/09 18:21:16 UTC

GraphX Snapshot Partitioning

Hello,

I am working on a project where we want to split graphs of data into
snapshots across partitions, and I was wondering what would happen if one of
the snapshots was too large to fit into a single partition. Would the
snapshot be split equally over two partitions, for example? How is a single
snapshot spread over multiple partitions?

Thank You,
Matthew Bucci



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-Snapshot-Partitioning-tp21977.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: GraphX Snapshot Partitioning

Posted by Takeshi Yamamuro <li...@gmail.com>.
Large edge partitions can cause a java.lang.OutOfMemoryError, and then
Spark tasks fail.

FWIW, each edge partition can hold at most 2^32 edges, because the 64-bit
vertex IDs are mapped to 32-bit local IDs within each partition. If the
number of edges exceeds that limit, GraphX may throw an
ArrayIndexOutOfBoundsException or similar. So each partition can hold more
edges than you might expect.
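To see how edges actually end up distributed, you can count the edges in each edge partition. A minimal sketch, assuming `sc` is a live SparkContext and `graph` is an already-built Graph (both names are placeholders):

```scala
import org.apache.spark.graphx._

// Count the edges held by each edge partition of an existing graph.
val edgesPerPartition = graph.edges
  .mapPartitionsWithIndex { (idx, iter) =>
    Iterator((idx, iter.size.toLong)) // (partition index, edge count)
  }
  .collect()

edgesPerPartition.foreach { case (idx, n) =>
  println(s"partition $idx holds $n edges")
}
```

Any partition whose edge count approaches the 2^32 limit, or whose in-memory size approaches the executor heap, is a candidate for re-partitioning.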







-- 
---
Takeshi Yamamuro

Re: GraphX Snapshot Partitioning

Posted by Matthew Bucci <mr...@gmail.com>.
Hi,

Thanks for the response! That answered some of my questions, but one thing
I am still wondering is what happens if you run a partition strategy and one
of the partitions ends up being too large. For example, say a partition can
hold 64MB (knowing the actual maximum size of a partition would also be
helpful to me). You try to partition the edges of a graph into 3 separate
partitions, but the edges assigned to the first partition amount to 80MB, so
they cannot all fit. Would the extra 16MB spill over into a new 4th
partition, would the system split the load so that the 1st and 4th
partitions each hold 40MB, or would the partition strategy just fail with a
memory error?

Thank You,
Matthew Bucci


Re: GraphX Snapshot Partitioning

Posted by Takeshi Yamamuro <li...@gmail.com>.
Hi,

Vertices are simply hash-partitioned by their 64-bit IDs, so
they are evenly spread over partitions.

As for edges, GraphLoader#edgeList builds edge partitions
through hadoopFile(), so the initial partitions depend
on the InputFormat#getSplits implementation
(e.g., partitions mostly correspond to 64MB blocks on HDFS).

Edges can be re-partitioned with a PartitionStrategy;
the graph is partitioned with its structure in mind, and
the source and destination IDs are used as partition keys.
The partitions might suffer from skew depending
on graph properties (hub nodes, for example).
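The loading and re-partitioning steps above can be sketched as follows; this is a minimal sketch, and the HDFS path and partition count are placeholders:

```scala
import org.apache.spark.graphx._

// Load an edge list; the initial partitions follow the HDFS input splits.
val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt")

// Re-partition the edges. EdgePartition2D keys on both the source and
// destination IDs, which bounds vertex replication at 2 * sqrt(numParts)
// copies per vertex.
val repartitioned = graph.partitionBy(PartitionStrategy.EdgePartition2D, 8)
```

EdgePartition2D is one of the built-in strategies; RandomVertexCut and CanonicalRandomVertexCut hash on the edge endpoints instead, and they behave differently under skew from hub vertices.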

Thanks,
takeshi




-- 
---
Takeshi Yamamuro