Posted to user@giraph.apache.org by Sai Ganesh Muthuraman <sa...@gmail.com> on 2017/02/25 06:24:53 UTC

Re: RELATION BETWEEN THE NUMBER OF GIRAPH WORKERS AND THE PROBLEM SIZE

Hi,

I used one worker per node and that worked for smaller files. When the file size was more than 25 MB, I got this strange exception. I tried using 2 nodes and 3 nodes; the result was the same.

*ERROR* [org.apache.giraph.master.MasterThread] master.BspServiceMaster (BspServiceMaster.java:barrierOnWorkerList(1415)) - barrierOnWorkerList: *Missing chosen workers* [Worker(hostname=comet-10-68.sdsc.edu, MRtaskID=2, port=30002)] on superstep 2

*FATAL* [org.apache.giraph.master.MasterThread] master.BspServiceMaster (BspServiceMaster.java:getLastGoodCheckpoint(1291)) - getLastGoodCheckpoint: No last good checkpoints can be found, killing the job.

java.io.FileNotFoundException: File hdfs://comet-10-33.ibnet:54310/user/saiganes/_bsp/_checkpoints/giraph_yarn_application_1488002378889_0001 does not exist.

        at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:697)

        at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:105)


Re: RELATION BETWEEN THE NUMBER OF GIRAPH WORKERS AND THE PROBLEM SIZE

Posted by José Luis Larroque <la...@gmail.com>.
You aren't setting -yh (yarn heap), and without this parameter every
container will get 1024 MB of heap by default. You should use -yh 10240 (the
same value as mapreduce.map.memory.mb).
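
For example (just a sketch, not tested here): this is your own command from
the previous message, with only the -yh flag added, and 10240 mirroring your
mapreduce.map.memory.mb value:

hadoop jar myjar.jar org.apache.giraph.GiraphRunner MyBetweenness.BetweennessComputation \
    -vif MyBetweenness.VertexDataInputFormat -vip /user/$USER/inputbc/wiki-Vote \
    -vof MyBetweenness.VertexDataOutputFormat -op /user/$USER/outputBC 1.0 \
    -w 2 -yj myjar.jar -yh 10240 \
    -ca giraph.useOutOfCoreMessages=true -ca giraph.maxMessagesInMemory=10000 \
    -ca giraph.useOutOfCoreGraph=true -ca giraph.waitTaskDoneTimeoutMs=450000 \
    -ca giraph.logLevel=debug -ca giraph.isStatic=true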

You should ask about Giraph 1.2 compilation in a separate email thread, where
it will be easier to follow. I haven't compiled it yet, so I'm unable to help
you with that issue.

-- 
*José Luis Larroque*
Analista Programador Universitario - Facultad de Informática - UNLP
Desarrollador Java y .NET  en LIFIA

2017-03-02 14:18 GMT-03:00 Sai Ganesh Muthuraman <sa...@gmail.com>:

> Hi,
>
> I tried building giraph-1.2.0 with yarn profile
>
> The build was successful when I just tried *mvn -DskipTests package.*
>
> But my hadoop version is hadoop-2.6.0. So I removed that build.
>
> I tried installing using the command *mvn -Phadoop_yarn
> -Dhadoop.version=2.6.0 -DskipTests package*
>
> I am getting 'dependencies.dependency.version' errors at lines 1277, 1281,
> and 1285 of pom.xml.
>
> Is it possible to build giraph-1.2.0 for hadoop-2.6.0 or should I do
> something else?
>
> The build was successful when I just tried *mvn -DskipTests package.*
>
>

Re: RELATION BETWEEN THE NUMBER OF GIRAPH WORKERS AND THE PROBLEM SIZE

Posted by Sai Ganesh Muthuraman <sa...@gmail.com>.
Hi,

I tried building giraph-1.2.0 with yarn profile

The build was successful when I just tried *mvn -DskipTests package.*

But my hadoop version is hadoop-2.6.0. So I removed that build.

I tried installing using the command *mvn -Phadoop_yarn
-Dhadoop.version=2.6.0 -DskipTests package*

I am getting 'dependencies.dependency.version' errors at lines 1277, 1281,
and 1285 of pom.xml.

Is it possible to build giraph-1.2.0 for hadoop-2.6.0 or should I do
something else?

The build was successful when I just tried *mvn -DskipTests package.*

Re: RELATION BETWEEN THE NUMBER OF GIRAPH WORKERS AND THE PROBLEM SIZE

Posted by Sai Ganesh Muthuraman <sa...@gmail.com>.
Hi,

I am still using Giraph-1.1.0 and I will upgrade soon.
Thanks a lot for your suggestions. This is the exact command I used

hadoop jar myjar.jar org.apache.giraph.GiraphRunner MyBetweenness.BetweennessComputation -vif MyBetweenness.VertexDataInputFormat -vip /user/$USER/inputbc/wiki-Vote -vof MyBetweenness.VertexDataOutputFormat -op /user/$USER/outputBC 1.0 -w 2 -yj myjar.jar -ca giraph.useOutOfCoreMessages=true -ca giraph.maxMessagesInMemory=10000 -ca giraph.useOutOfCoreGraph=true -ca giraph.waitTaskDoneTimeoutMs=450000 -ca giraph.logLevel=debug -ca giraph.isStatic=true

I checked the memory information. After superstep -1, free memory is 907 MB and total memory is 987 MB.
mapreduce.map.memory.mb is set to 10240 (MB).


Sai Ganesh

Re: RELATION BETWEEN THE NUMBER OF GIRAPH WORKERS AND THE PROBLEM SIZE

Posted by José Luis Larroque <la...@gmail.com>.
There are lots of suggestions for dealing with that problem.

First ones:
- Decrease the number of workers to 1 per node, to maximize the amount of
RAM that each worker has. Xmx and Xms should be set to the same value; as
far as I know this is good practice in every Java environment.
- Post the exact command that you are using to invoke the Giraph
algorithm, so everyone here can help you.
- Check how much memory is left when all of your graph is
loaded into the workers. In superstep -1 (minus one) your graph is loaded
into memory. You should look at how much memory loading the graph takes, and
check what memory is left for the rest of the supersteps.
- Increase the logging level of your application. You can get more detailed
information using giraph.logLevel=debug (the default Giraph logging level is
info).
- Enable the isStaticGraph option, to stop graph mutations that can
make your memory problems worse, at least until you have a clue of what is
going on (see the sketch after this list).
- You should be using Giraph 1.2, which has better out-of-core support,
instead of the previous 1.1 version. Are you using it?
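
As a rough sketch, the logging and isStaticGraph options can be passed to
GiraphRunner with -ca like this (everything in angle brackets is a
placeholder for your own jar, classes, and paths):

hadoop jar <your-jar> org.apache.giraph.GiraphRunner <YourComputation> \
    -vif <YourInputFormat> -vip <input path> \
    -vof <YourOutputFormat> -op <output path> \
    -w <workers> -yj <your-jar> \
    -ca giraph.logLevel=debug \
    -ca giraph.isStatic=true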

Check these tips and let us know any new information.

Bye

-- 
*José Luis Larroque*
Analista Programador Universitario - Facultad de Informática - UNLP
Desarrollador Java y .NET  en LIFIA

2017-03-02 7:01 GMT-03:00 Sai Ganesh Muthuraman <sa...@gmail.com>:

> Hi Jose,
>
> I went through the container logs and found that the following error was
> happening
> *java.lang.OutOfMemoryError: Java heap space*
> This was probably causing the missing chosen workers error.
>
> This happens only when the graph size exceeds 50k vertices and
> 100k edges.
> I enabled out-of-core messaging and out-of-core computation, and the heap
> size is 8 GB, which is reasonable given that I have 128 GB of RAM in every
> node.
> I tried increasing the number of workers and the number of nodes to 6.
> Still the same result.
> This happens in superstep 2 itself.
> Any suggestions?
>
> Sai Ganesh
>
>
> On Feb 27, 2017, at 10:27 PM, José Luis Larroque <us...@giraph.apache.org>
> wrote:
>
> It could be a lot of different reasons: memory problems, algorithm problems,
> etc.
>
> I recommend you focus on getting to the logs instead of guessing why the
> workers are dying. Maybe you are looking in the wrong place; maybe you can
> access them through the web UI instead of the command line.
>
> From the terminal, running yarn logs -applicationId "id" will be enough
> to see them. If you want to access the physical files on your nodes, you
> should go to every node and search for the different containers of your
> application in the directory where those logs are kept.
>
> Another link that may help:
> http://stackoverflow.com/questions/32713587/how-to-keep-yarns-log-files.
>
> Maybe you could test the algorithm locally instead of running it on the
> cluster, for a better understanding of the relation between YARN and Giraph.
>
> Bye
>
> --
> *José Luis Larroque*
> Analista Programador Universitario - Facultad de Informática - UNLP
> Desarrollador Java y .NET  en LIFIA
>
> 2017-02-27 12:27 GMT-03:00 Sai Ganesh Muthuraman <sa...@gmail.com>:
>
> Hi,
>
> The first container in the application logs usually contains the gam logs.
> But the first container logs are not available. Hence no gam logs.
> What could be the possible reasons for some of the workers dying?
>
>
> Sai Ganesh
>
>
>
> On Feb 25, 2017, at 9:30 PM, José Luis Larroque <us...@giraph.apache.org>
> wrote:
>
> You are probably looking at your giraph application manager (gam) logs.
> You should look for your workers' logs; each one has its own log (the
> container logs). If you can't find them, you should look at your YARN
> configuration to know where they are, see this:
> http://stackoverflow.com/questions/21621755/where-does-hadoop-store-the-logs-of-yarn-applications.
>
> I don't recommend enabling checkpointing until you know the specific
> error that you are facing. If you are facing out-of-memory errors, for
> example, checkpointing won't be helpful in my experience; the same error
> will happen over and over.
>
> --
> *José Luis Larroque*
> Analista Programador Universitario - Facultad de Informática - UNLP
> Desarrollador Java y .NET  en LIFIA
>
> 2017-02-25 12:38 GMT-03:00 Sai Ganesh Muthuraman <sa...@gmail.com>:
>
> Hi Jose,
>
> Which logs do I have to look into exactly, because in the application
> logs, I found the error message that I mentioned and it was also mentioned
> that there was *No good last checkpoint.*
> I am not able to figure out the reason for the failure of a worker for
> bigger files. What do I have to look for in the logs?
> Also, how do I enable checkpointing?
>
>
> - Sai Ganesh Muthuraman
>
>
>
>
>
>
>
>
>
>

Re: RELATION BETWEEN THE NUMBER OF GIRAPH WORKERS AND THE PROBLEM SIZE

Posted by Sai Ganesh Muthuraman <sa...@gmail.com>.
Hi Jose,

I went through the container logs and found that the following error was happening
java.lang.OutOfMemoryError: Java heap space
This was probably causing the missing chosen workers error.

This happens only when the graph size exceeds 50k vertices and 100k edges.
I enabled out-of-core messaging and out-of-core computation, and the heap size is 8 GB, which is reasonable given that I have 128 GB of RAM in every node.
I tried increasing the number of workers and the number of nodes to 6. Still the same result. 
This happens in superstep 2 itself.
Any suggestions?
Sai Ganesh


On Feb 27, 2017, at 10:27 PM, José Luis Larroque <us...@giraph.apache.org> wrote:

> It could be a lot of different reasons: memory problems, algorithm problems, etc.
> 
> I recommend you focus on getting to the logs instead of guessing why the workers are dying. Maybe you are looking in the wrong place; maybe you can access them through the web UI instead of the command line.
> 
> From the terminal, running yarn logs -applicationId "id" will be enough to see them.

Re: RELATION BETWEEN THE NUMBER OF GIRAPH WORKERS AND THE PROBLEM SIZE

Posted by José Luis Larroque <la...@gmail.com>.
It could be a lot of different reasons: memory problems, algorithm problems,
etc.

I recommend you focus on getting to the logs instead of guessing why the
workers are dying. Maybe you are looking in the wrong place; maybe you can
access them through the web UI instead of the command line.

From the terminal, running yarn logs -applicationId "id" will be enough
to see them. If you want to access the physical files on your nodes, you
should go to every node and search for the different containers of your
application in the directory where those logs are kept.

Another link that may help:
http://stackoverflow.com/questions/32713587/how-to-keep-yarns-log-files.
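
If the logs disappear once the application finishes, you probably need log
aggregation turned on in yarn-site.xml, roughly like this (these are standard
YARN properties; the retention value is only an example):

  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
  </property>

With aggregation enabled, yarn logs -applicationId "id" can fetch the
container logs from HDFS even after the containers are gone.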

Maybe you could test the algorithm locally instead of running it on the
cluster, for a better understanding of the relation between YARN and Giraph.

Bye

-- 
*José Luis Larroque*
Analista Programador Universitario - Facultad de Informática - UNLP
Desarrollador Java y .NET  en LIFIA

2017-02-27 12:27 GMT-03:00 Sai Ganesh Muthuraman <sa...@gmail.com>:

> Hi,
>
> The first container in the application logs usually contains the gam logs.
> But the first container logs are not available. Hence no gam logs.
> What could be the possible reasons for some of the workers dying?
>
>
> Sai Ganesh
>
>
>
> On Feb 25, 2017, at 9:30 PM, José Luis Larroque <us...@giraph.apache.org>
> wrote:
>
> You are probably looking at your giraph application manager (gam) logs.
> You should look for your workers' logs; each one has its own log (the
> container logs). If you can't find them, you should look at your YARN
> configuration to know where they are, see this:
> http://stackoverflow.com/questions/21621755/where-does-hadoop-store-the-logs-of-yarn-applications.
>
> I don't recommend enabling checkpointing until you know the specific
> error that you are facing. If you are facing out-of-memory errors, for
> example, checkpointing won't be helpful in my experience; the same error
> will happen over and over.
>
> --
> *José Luis Larroque*
> Analista Programador Universitario - Facultad de Informática - UNLP
> Desarrollador Java y .NET  en LIFIA
>
> 2017-02-25 12:38 GMT-03:00 Sai Ganesh Muthuraman <sa...@gmail.com>:
>
> Hi Jose,
>
> Which logs do I have to look into exactly, because in the application
> logs, I found the error message that I mentioned and it was also mentioned
> that there was *No good last checkpoint.*
> I am not able to figure out the reason for the failure of a worker for
> bigger files. What do I have to look for in the logs?
> Also, how do I enable checkpointing?
>
>
> - Sai Ganesh Muthuraman
>
>
>
>
>
>
>

Re: RELATION BETWEEN THE NUMBER OF GIRAPH WORKERS AND THE PROBLEM SIZE

Posted by Sai Ganesh Muthuraman <sa...@gmail.com>.
Hi,

The first container in the application logs usually contains the gam logs. But the first container logs are not available, hence no gam logs. What could be the possible reasons for some of the workers dying?

Sai Ganesh

On Feb 25, 2017, at 9:30 PM, José Luis Larroque <us...@giraph.apache.org> wrote:

You are probably looking at your giraph application manager (gam) logs. You should look for your workers' logs; each one has its own log (the container logs). If you can't find them, you should look at your YARN configuration to know where they are, see this: http://stackoverflow.com/questions/21621755/where-does-hadoop-store-the-logs-of-yarn-applications.

I don't recommend enabling checkpointing until you know the specific error that you are facing. If you are facing out-of-memory errors, for example, checkpointing won't be helpful in my experience; the same error will happen over and over.


Re: RELATION BETWEEN THE NUMBER OF GIRAPH WORKERS AND THE PROBLEM SIZE

Posted by José Luis Larroque <la...@gmail.com>.
You are probably looking at your giraph application manager (gam) logs. You
should look for your workers' logs; each one has its own log (the container
logs). If you can't find them, you should look at your YARN configuration to
know where they are, see this:
http://stackoverflow.com/questions/21621755/where-does-hadoop-store-the-logs-of-yarn-applications.
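
If you don't have log aggregation, the per-container files normally sit under
the directory configured by yarn.nodemanager.log-dirs in yarn-site.xml, for
example (the path is only an illustration, check your own configuration):

  <property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>/var/log/hadoop-yarn/containers</value>
  </property>

On each node there is then typically a subdirectory per application and per
container, e.g. <log-dir>/application_<id>/container_<id>/, holding that
container's stdout and stderr.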

I don't recommend enabling checkpointing until you know the specific
error that you are facing. If you are facing out-of-memory errors, for
example, checkpointing won't be helpful in my experience; the same error
will happen over and over.

-- 
*José Luis Larroque*
Analista Programador Universitario - Facultad de Informática - UNLP
Desarrollador Java y .NET  en LIFIA

2017-02-25 12:38 GMT-03:00 Sai Ganesh Muthuraman <sa...@gmail.com>:

> Hi Jose,
>
> Which logs do I have to look into exactly, because in the application
> logs, I found the error message that I mentioned and it was also mentioned
> that there was *No good last checkpoint.*
> I am not able to figure out the reason for the failure of a worker for
> bigger files. What do I have to look for in the logs?
> Also, how do I enable checkpointing?
>
>
> - Sai Ganesh Muthuraman
>
>
>
>

Re: RELATION BETWEEN THE NUMBER OF GIRAPH WORKERS AND THE PROBLEM SIZE

Posted by Sai Ganesh Muthuraman <sa...@gmail.com>.
Hi Jose,

Which logs do I have to look into exactly? In the application logs, I found the error message that I mentioned, and it was also mentioned that there was *No good last checkpoint.* I am not able to figure out the reason for the failure of a worker on bigger files. What do I have to look for in the logs?

Also, how do I enable checkpointing?

- Sai Ganesh Muthuraman

Re: RELATION BETWEEN THE NUMBER OF GIRAPH WORKERS AND THE PROBLEM SIZE

Posted by José Luis Larroque <la...@gmail.com>.
Hi Ganesh,

For some reason, some of your workers are dying. When that happens, Giraph
automatically detects that the number of workers is below what is necessary in
"barrierOnWorkerList" and searches for an existing checkpoint (a checkpoint is
a backup of the state of a Giraph application). You apparently don't have
checkpointing enabled, so the entire job is being killed. I recommend that you
look in your container logs and try to detect why one or more workers are
dying when you have bigger files.
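
In case you do want to try checkpointing later, as far as I remember it is
enabled by giving it a non-zero frequency, along these lines (the option names
come from GiraphConfiguration; the values are only examples):

    -ca giraph.checkpointFrequency=2 \
    -ca giraph.checkpointDirectory=/user/<your-user>/_bsp/_checkpoints

A frequency of 2 writes a checkpoint every 2 supersteps; the default of 0
means no checkpoints, which matches the "No last good checkpoints" message you
are seeing.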

Bye!



-- 
*José Luis Larroque*
Analista Programador Universitario - Facultad de Informática - UNLP
Desarrollador Java y .NET  en LIFIA

2017-02-25 3:24 GMT-03:00 Sai Ganesh Muthuraman <sa...@gmail.com>:

> Hi,
>
> I used one worker per node and that worked for smaller files. When the
> file size was more than 25 MB, I got this strange exception. I tried using
> 2 nodes and 3 nodes; the result was the same.
>
> *ERROR* [org.apache.giraph.master.MasterThread] master.BspServiceMaster (BspServiceMaster.java:barrierOnWorkerList(1415)) - barrierOnWorkerList: *Missing chosen workers* [Worker(hostname=comet-10-68.sdsc.edu, MRtaskID=2, port=30002)] on superstep 2
> *FATAL* [org.apache.giraph.master.MasterThread] master.BspServiceMaster (BspServiceMaster.java:getLastGoodCheckpoint(1291)) - getLastGoodCheckpoint: No last good checkpoints can be found, killing the job.
> java.io.FileNotFoundException: File hdfs://comet-10-33.ibnet:54310/user/saiganes/_bsp/_checkpoints/giraph_yarn_application_1488002378889_0001 does not exist.
>         at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:697)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:105)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:755)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:751)
>         at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:751)
>         at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1485)
>         at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1525)
>         at org.apache.giraph.utils.CheckpointingUtils.getLastCheckpointedSuperstep(CheckpointingUtils.java:107)
>         at org.apache.giraph.bsp.BspService.getLastCheckpointedSuperstep(BspService.java:1196)
>         at org.apache.giraph.master.BspServiceMaster.getLastGoodCheckpoint(BspServiceMaster.java:1289)
>         at org.apache.giraph.master.MasterThread.run(MasterThread.java:149)
>
>
> - Sai Ganesh
>
>