Posted to dev@giraph.apache.org by Armando Miraglia <a....@student.vu.nl> on 2013/09/02 18:00:27 UTC

ZooKeeper barrier in the OutputFormat

Hi guys,

I am writing to you because I am facing an issue that I need to solve in
order to complete the implementation of the RexsterOutputFormat API.

Based on a short discussion I had with Nitay and Claudio, I am trying to
implement a barrier inside the RexsterOutputFormat so that vertices are
guaranteed to be saved before the edges are sent to the Rexster
endpoint. This is needed because, when saving an edge, both its source
and its destination vertex must already be present in the database;
otherwise Blueprints cannot save the edges consistently.

The naive implementation based on the new EdgeOutputFormat API is
already working, but it does not have any global, cluster-wide barrier.
This means that it works in a pseudo-distributed environment but cannot
work on an actual cluster (since saving the edges has to be ordered
after saving the vertices).

At this point the implementation of the barrier would be straightforward
if I had either the number of workers that are up and running while the
vertices are being written, or the global number of vertices. The second
piece of information looks better to me, because then I would not need
to deal with workers dying. I would just need to check that all the
vertices counted at the end of the final superstep have been saved. I
could use a znode in which the workers record the number of vertices
they have saved, and the barrier can be left once the total number of
saved vertices equals the expected global count.
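
To make the idea concrete, here is a minimal sketch of the counting
logic only. It is not Giraph or ZooKeeper code: the class name
VertexCountBarrier and its methods are hypothetical, and a plain field
stands in for the value that would be kept in the znode, so it can be
run without a ZooKeeper ensemble. It also assumes the expected global
vertex count is already known, which is exactly the piece I am missing.

```java
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical in-memory stand-in for the counting barrier.
 * In a real job the counter would be stored in a znode shared by all
 * workers; here it is a field guarded by the object's monitor.
 */
final class VertexCountBarrier {
  /** Expected global number of vertices (assumed to be known). */
  private final long expectedTotal;
  /** Stand-in for the running total that would live in the znode. */
  private long saved;

  VertexCountBarrier(long expectedTotal) {
    if (expectedTotal < 0) {
      throw new IllegalArgumentException("expectedTotal must be >= 0");
    }
    this.expectedTotal = expectedTotal;
  }

  /** Called by a worker after it has written `count` vertices. */
  synchronized void reportSaved(long count) {
    saved += count;
    if (saved >= expectedTotal) {
      notifyAll(); // wake up everyone waiting to start writing edges
    }
  }

  /**
   * Blocks until the saved total reaches the expected count or the
   * timeout expires. Returns true if the barrier was passed.
   */
  synchronized boolean awaitAllSaved(long timeoutMillis)
      throws InterruptedException {
    long deadline =
        System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    while (saved < expectedTotal) {
      long remainingNanos = deadline - System.nanoTime();
      if (remainingNanos <= 0) {
        return false; // timed out, vertices are still missing
      }
      TimeUnit.NANOSECONDS.timedWait(this, remainingNanos);
    }
    return true;
  }

  /** Current total of saved vertices. */
  synchronized long savedSoFar() {
    return saved;
  }

  public static void main(String[] args) throws Exception {
    VertexCountBarrier barrier = new VertexCountBarrier(300);
    // Three simulated workers, each reporting 100 saved vertices.
    for (int i = 0; i < 3; i++) {
      new Thread(() -> barrier.reportSaved(100)).start();
    }
    boolean ready = barrier.awaitAllSaved(5000);
    System.out.println(ready
        ? "all vertices saved, edges can be written"
        : "timed out waiting for vertices");
  }
}
```

Each worker would call reportSaved() once its vertices are written and
awaitAllSaved() before it starts sending edges. Since the check only
depends on the totals, it does not need to know how many workers are
alive.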

Now, my problem is: how can I access this information from the
OutputFormat? I checked around and it seems that the global state is
only accessible in the service scope. This would mean that I would have
to add something in the BspServiceWorker to pass this information to the
RexsterVertexOutputFormat.

Do you have any idea how I could achieve this while keeping the approach
consistent with the Giraph code base? Do you have any other suggestions
or possible solutions that I could implement to achieve the same goal,
namely saving the vertices before the edges at the cluster level?

Thanks!
Armando