Posted to dev@spark.apache.org by ahaider3 <ah...@hawk.iit.edu> on 2016/02/21 05:05:21 UTC

Using Encoding to reduce GraphX's static graph memory consumption

Hi,
I have been looking through the GraphX source code, dissecting the reason
for its high memory consumption compared to the on-disk size of the graph. I
have found that there may be room to reduce the memory footprint of the
graph structures. I think the biggest savings can come from the localSrcIds
and localDstIds in EdgePartitions. 

In particular, instead of storing both a source and a destination local id
for each edge, we could store only the destination id. For example, after
sorting the edges by global source id, we can map the source vertices to
local ids first, followed by the as-yet-unmapped global destination ids.
This makes the local source ids a sorted sequence running from 0 to n, where
n is the number of distinct global source ids. Then, instead of actually
storing the local source id for each edge, we can store an array of size n,
with each element holding an index into localDstIds. From my understanding,
this would also eliminate the need for a separate index for indexed
scanning, since each element of this array marks the start of a cluster of
edges sharing a source. From some extensive testing, this, along with some
delta-encoding strategies on localDstIds and the mapping structures, can
reduce the memory consumption of the graph by nearly half.
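
Concretely, the layout I have in mind looks roughly like the sketch below.
The names are just illustrative (not the actual EdgePartition fields), and I
use n+1 offsets so the last cluster needs no special case:

  // Sketch of the proposed layout. Edges are assumed to be sorted by global
  // source id, and source vertices are assigned local ids 0..n-1 in that order.
  class CompactEdgePartition(
      srcOffsets: Array[Int],   // length n + 1; srcOffsets(i) is the first index
                                // into localDstIds for local source id i
      localDstIds: Array[Int],  // one entry per edge: local id of the destination
      local2global: Array[Long] // local vertex id -> global vertex id
  ) {
    // Scan the out-edges of the local source vertex srcLocalId. The cluster is
    // the range [srcOffsets(srcLocalId), srcOffsets(srcLocalId + 1)), so there
    // is no per-edge localSrcIds array and no separate index for indexed scans.
    def outEdges(srcLocalId: Int): Iterator[(Long, Long)] = {
      val srcGlobal = local2global(srcLocalId)
      (srcOffsets(srcLocalId) until srcOffsets(srcLocalId + 1)).iterator
        .map(i => (srcGlobal, local2global(localDstIds(i))))
    }
  }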

However, I am not entirely sure whether there is a reason to store both
localSrcIds and localDstIds for each edge with respect to integrating future
functionality, such as graph mutations. I noticed there was another post
similar to this one as well, but it had no replies.

The idea is quite similar to the Netflix graph library
<https://github.com/Netflix/netflix-graph>, and I would be happy to open a
JIRA on this issue with partial improvements. But I may not be completely
correct in my thinking!




--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Using-Encoding-to-reduce-GraphX-s-static-graph-memory-consumption-tp16373.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: Using Encoding to reduce GraphX's static graph memory consumption

Posted by "Joseph E. Gonzalez" <jo...@gmail.com>.
Actually, another improvement would be to use something like compressed sparse row (CSR) encoding, which can store A and A^T relatively efficiently (I think using 5 arrays instead of 6). There is also an option to be more cache-aware, using something like a block compressed sparse row encoding.
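
Roughly, one way the array counting could work out is the sketch below (just
an illustration, not the actual GraphX layout): keep a forward CSR for A and,
for A^T, only row pointers plus a permutation back into the forward edge
order, so the edge data is stored once. The price is a search over the row
pointers whenever the source id of a reverse edge is needed.

  // Hedged sketch: 5 arrays for A and A^T together instead of 6.
  class CsrWithReverse[ED](
      srcOffsets: Array[Int],  // length numVertices + 1, row pointers for A
      dstIds: Array[Int],      // one per edge, column indices for A
      attrs: Array[ED],        // edge data, stored once, in forward edge order
      dstOffsets: Array[Int],  // length numVertices + 1, row pointers for A^T
      reverseIdx: Array[Int]   // per reverse edge: its position in forward order
  ) {
    // Source id of the reverse edge at position j (edges grouped by destination):
    // the largest i with srcOffsets(i) <= reverseIdx(j), found by binary search.
    def srcOf(j: Int): Int = {
      val e = reverseIdx(j)
      var lo = 0
      var hi = srcOffsets.length - 1
      while (lo < hi) {
        val mid = (lo + hi + 1) >>> 1
        if (srcOffsets(mid) <= e) lo = mid else hi = mid - 1
      }
      lo
    }
  }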

Ankur also at some point looked at renumbering the vertex ids on each edge to be consecutive within the edge partition. This is a pretty common technique in most graph processing systems, but in our early experiments we didn't see much of a gain. In theory, however, this should allow lower-bit-precision encoding and a denser representation, and eliminate the need for a hash lookup when joining vertices with edges.

Joey

Re: Using Encoding to reduce GraphX's static graph memory consumption

Posted by Adnan Haider <ah...@hawk.iit.edu>.
Hi,
I have created a JIRA for this issue here:
<https://issues.apache.org/jira/browse/SPARK-13460?jql=project%20%3D%20SPARK%20AND%20created%3E%3D-1w%20ORDER%20BY%20created%20DESC>
As for the pull request, my implementation is based on removing localSrcIds
and storing an array of offsets into localDstIds. I am running into issues
with this method when testing operations that create partitions from
existing edge partitions. The other way to implement this would be to use a
hashmap from local id to offset/length pairs. That seems to work fine, but
it has more storage overhead, since each source vertex requires space for
three integers.
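
For reference, the hashmap variant looks roughly like the sketch below (names
are just illustrative; in practice a primitive-keyed open hash map would be
used instead of a boxed Scala HashMap):

  import scala.collection.mutable

  // Sketch of the hashmap variant: each source vertex costs roughly three
  // integers, the key (local source id) plus the offset and length of its
  // cluster in localDstIds.
  class HashIndexedEdgePartition(
      clusters: mutable.HashMap[Int, (Int, Int)], // local src id -> (offset, length)
      localDstIds: Array[Int]                     // one entry per edge
  ) {
    // Scan the destinations of the out-edges of srcLocalId, if it has any here.
    def outDsts(srcLocalId: Int): Iterator[Int] =
      clusters.get(srcLocalId) match {
        case Some((offset, length)) =>
          (offset until offset + length).iterator.map(i => localDstIds(i))
        case None => Iterator.empty
      }
  }

The offset-array variant only needs one integer per distinct source vertex on
top of the shared localDstIds, which is where the extra savings come from.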

Thanks, Adnan Haider
B.S. Candidate, Computer Science
Illinois Institute of Technology
www.adnanhaider.com


Re: Using Encoding to reduce GraphX's static graph memory consumption

Posted by Adnan Haider <ah...@hawk.iit.edu>.
Yes, sounds good. I can submit the pull request.

Re: Using Encoding to reduce GraphX's static graph memory consumption

Posted by Reynold Xin <rx...@databricks.com>.
+ Joey

We think this is worth doing. Are you interested in submitting a pull
request?

