Posted to user@accumulo.apache.org by "Parker, Matthew - IS" <Ma...@exelisinc.com> on 2013/01/31 16:19:06 UTC

Accumulo Configuration Question

TWIMC:

I'm new to Accumulo and I've been trying to come up with a good architecture for a 20 node cluster. I have been running a map/reduce program, and it encounters issues when it comes to running the Accumulo section of the code. Once the job's completion rate exceeds 93%, it starts dropping tens of tasks during the process, because they eventually time out. The completion rate drops back down, but the job eventually finishes. I have a suspicion it's due to the way I have the system configured, and I wanted to get some feedback as to what's the generally preferred architecture when installing Accumulo?

Since you have the choice of installing HDFS, map/reduce, and Accumulo tablet servers on any node, the general guideline is to install two per machine (data node and tablet server, or data node and map/reduce), as per the Hardware section in the Administration documentation.

http://accumulo.apache.org/1.4/user_manual/Administration.html#Hardware

Does that mean you have one large group of data nodes that's installed on the majority of the cluster, or are they somehow split into two groups such that map/reduce & hdfs runs on one set of nodes, and Accumulo tablet servers and hdfs uses another?

I was wondering whether people would comment on what a working configuration might look like?

TIA,

Matt

________________________________

This e-mail and any files transmitted with it may be proprietary and are intended solely for the use of the individual or entity to whom they are addressed. If you have received this e-mail in error please notify the sender. Please note that any views or opinions presented in this e-mail are solely those of the author and do not necessarily represent those of Exelis Inc. The recipient should check this e-mail and any attachments for the presence of viruses. Exelis Inc. accepts no liability for any damage caused by any virus transmitted by this e-mail.

RE: Accumulo Configuration Question

Posted by "Parker, Matthew - IS" <Ma...@exelisinc.com>.
>>> Whether tasks time out can be due to the data and the reduce logic, in addition to the configuration. Are things timing out in the reduce phase?

The job is loading the data in the map phase.

>>> Also, do you notice that it's the same tasktrackers that experience timeouts?

Yes. There are 3-4 that do. I get messages like the following:

2013-01-30 15:34:21,282 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.1.16:50010, storageID=DS-1659690528-192.168.3.16-50010-1355437253451, infoPort=50075, ipcPort=50020):DataXceiver
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/192.168.1.16:50010 remote=/192.168.1.26:57262]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:397)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:493)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:279)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:175)

I adjusted the timeout setting to a large value, and even tried having the system wait indefinitely for completion, but things just got worse: the system started dropping even more tasks.
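For reference, the 480000 ms in that stack trace is the default of the DataNode's write-path timeout, `dfs.datanode.socket.write.timeout`; the read side is governed by `dfs.socket.timeout`. A sketch of raising both in hdfs-site.xml on a Hadoop 1.x-era cluster might look like the following (the values here are purely illustrative, not recommendations):

```xml
<!-- hdfs-site.xml: illustrative values only -->
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>960000</value> <!-- write-side timeout; default 480000 ms (8 minutes) -->
</property>
<property>
  <name>dfs.socket.timeout</name>
  <value>120000</value> <!-- read-side timeout; default 60000 ms -->
</property>
```

Raising timeouts only hides back-pressure, though, which matches the symptom described above of things getting worse rather than better.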

>>> Finally, are you doing MapReduce from HDFS to HDFS? Or are you reading from or writing to Accumulo tables? You alluded to an Accumulo section of your code. Are you reading/writing to/from HDFS but doing scans/lookups/inserts to Accumulo from your mappers or reducers?

Reading text files stored in HDFS and writing data to Accumulo. Just inserting data. Each file has 1 million records, and it'll process anywhere from 1-50 files depending on the run. The job's setup code looks like the following:

// Relevant imports (Accumulo 1.4 / Hadoop 0.20-era mapreduce API):
import org.apache.accumulo.core.client.mapreduce.AccumuloOutputFormat;
import org.apache.accumulo.core.data.Mutation;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

        Configuration conf = new Configuration();

        Job job = new Job(conf, "Load Accumulo Data Table");
        job.setJarByClass(LoadAccumuloDataTable.class);

        // Read the text files from HDFS
        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.addInputPath(job, inputDataPath);

        // Map-only job: the mapper writes Mutations directly to Accumulo
        job.setMapperClass(DataLoader.class);
        job.setNumReduceTasks(0);

        job.setOutputFormatClass(AccumuloOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Mutation.class);

        boolean createTables = true;
        AccumuloOutputFormat.setOutputInfo(job.getConfiguration(), accumuloUserName, accumuloPassword.getBytes(), createTables, dataTable);
        AccumuloOutputFormat.setZooKeeperInstance(job.getConfiguration(), zookeeperInstanceName, zookeeperServers);
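The DataLoader mapper itself isn't shown above. As a rough, self-contained sketch of the record-parsing step such a mapper might perform (the tab-delimited layout and the field order here are assumptions, not from the original post), with the Accumulo-specific calls indicated in comments:

```java
import java.util.Arrays;

public class RecordParseSketch {
    // Split one tab-delimited input line into row, column family,
    // column qualifier, and value. In the real DataLoader.map(), the
    // four fields would feed something like:
    //   Mutation m = new Mutation(new Text(fields[0]));
    //   m.put(new Text(fields[1]), new Text(fields[2]),
    //         new Value(fields[3].getBytes()));
    //   context.write(new Text(dataTable), m);
    static String[] parseRecord(String line) {
        String[] fields = line.split("\t", 4);
        if (fields.length != 4) {
            throw new IllegalArgumentException("expected 4 fields, got: " + line);
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parseRecord("row1\tcf\tcq\thello")));
    }
}
```

Running it prints `[row1, cf, cq, hello]`. The per-record cost of this step is trivial; with a million records per file, the interesting load is on the AccumuloOutputFormat's internal BatchWriter and the tablet servers it targets.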

>>> You can certainly just use one large group of HDFS data nodes, and MapReduce and Accumulo will work fine. Also, depending on your hardware, you can run all three processes on each node. You just want to make sure each process has enough RAM/CPU.

Each node has 16 CPUs and 48 GB of RAM. I set the map/reduce system defaults to support 8 mappers and 8 reducers per node, and allocated each mapper/reducer 2 GB of memory.

I believe the tablet servers and task trackers are running with their default settings.
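For context on those numbers: 8 mappers plus 8 reducers at 2 GB each can commit 32 GB of the 48 GB before the tablet server, data node, task tracker, and OS get anything. One way to rein that in on a Hadoop 1.x cluster is to lower the slot counts in mapred-site.xml (the counts and heap size below are illustrative only, not tuned recommendations):

```xml
<!-- mapred-site.xml: illustrative values only -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>6</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx2048m</value>
</property>
```

Whatever the exact numbers, the budget has to leave dedicated headroom for the tablet server's own heap and native maps on co-located nodes.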

________________________________
From: Aaron Cordova [aaron@cordovas.org]
Sent: Thursday, January 31, 2013 10:57 AM
To: user@accumulo.apache.org
Subject: Re: Accumulo Configuration Question


On Jan 31, 2013, at 10:19 AM, "Parker, Matthew - IS" <Ma...@exelisinc.com> wrote:

TWIMC:

I'm new to Accumulo and I've been trying to come up with a good architecture for a 20 node cluster. I have been running a map/reduce program, and it encounters issues when it comes to running the Accumulo section of the code. Once the job's completion rate exceeds 93%, it starts dropping tens of tasks during the process, because they eventually time out. The completion rate drops back down, but the job eventually finishes. I have a suspicion it's due to the way I have the system configured, and I wanted to get some feedback as to what's the generally preferred architecture when installing Accumulo?

Whether tasks time out can be due to the data and the reduce logic, in addition to the configuration. Are things timing out in the reduce phase?

Also, do you notice that it's the same tasktrackers that experience timeouts?

Finally, are you doing MapReduce from HDFS to HDFS? Or are you reading from or writing to Accumulo tables? You alluded to an Accumulo section of your code. Are you reading/writing to/from HDFS but doing scans/lookups/inserts to Accumulo from your mappers or reducers?

Since you have the choice of installing HDFS, map/reduce, and Accumulo tablet servers on any node, the general guideline is to install two per machine (data node and tablet server, or data node and map/reduce), as per the Hardware section in the Administration documentation.

http://accumulo.apache.org/1.4/user_manual/Administration.html#Hardware

Does that mean you have one large group of data nodes that's installed on the majority of the cluster, or are they somehow split into two groups such that map/reduce & hdfs runs on one set of nodes, and Accumulo tablet servers and hdfs uses another?

You can certainly just use one large group of HDFS data nodes, and MapReduce and Accumulo will work fine. Also, depending on your hardware, you can run all three processes on each node. You just want to make sure each process has enough RAM/CPU.

If you want to keep Accumulo IO somewhat isolated from MapReduce you can control the location of HDFS block replicas to a certain degree to achieve more independence of failures and IO. Of course writing to or reading from Accumulo in a MapReduce will still absorb resources from the Accumulo side.


I was wondering whether people would comment on what a working configuration might look like?

TIA,

Matt



Re: Accumulo Configuration Question

Posted by Aaron Cordova <aa...@cordovas.org>.
On Jan 31, 2013, at 10:19 AM, "Parker, Matthew - IS" <Ma...@exelisinc.com> wrote:

> TWIMC:
> 
> I'm new to Accumulo and I've been trying to come up with a good architecture for a 20 node cluster. I have been running a map/reduce program, and it encounters issues when it comes to running the Accumulo section of the code. Once the job's completion rate exceeds 93%, it starts dropping tens of tasks during the process, because they eventually time out. The completion rate drops back down, but the job eventually finishes. I have a suspicion it's due to the way I have the system configured, and I wanted to get some feedback as to what's the generally preferred architecture when installing Accumulo?

Whether tasks time out can be due to the data and the reduce logic, in addition to the configuration. Are things timing out in the reduce phase?

Also, do you notice that it's the same tasktrackers that experience timeouts?

Finally, are you doing MapReduce from HDFS to HDFS? Or are you reading from or writing to Accumulo tables? You alluded to an Accumulo section of your code. Are you reading/writing to/from HDFS but doing scans/lookups/inserts to Accumulo from your mappers or reducers?

> Since you have the choice of installing HDFS, map/reduce, and Accumulo tablet servers on any node, the general guideline is to install two per machine (data node and tablet server, or data node and map/reduce), as per the Hardware section in the Administration documentation.
> 
> http://accumulo.apache.org/1.4/user_manual/Administration.html#Hardware
> 
> Does that mean you have one large group of data nodes that's installed on the majority of the cluster, or are they somehow split into two groups such that map/reduce & hdfs runs on one set of nodes, and Accumulo tablet servers and hdfs uses another?

You can certainly just use one large group of HDFS data nodes, and MapReduce and Accumulo will work fine. Also, depending on your hardware, you can run all three processes on each node. You just want to make sure each process has enough RAM/CPU.

If you want to keep Accumulo IO somewhat isolated from MapReduce you can control the location of HDFS block replicas to a certain degree to achieve more independence of failures and IO. Of course writing to or reading from Accumulo in a MapReduce will still absorb resources from the Accumulo side.


> I was wondering whether people would comment on what a working configuration might look like?
> 
> TIA,
> 
> Matt 
> 
> 


RE: Accumulo Configuration Question

Posted by "Parker, Matthew - IS" <Ma...@exelisinc.com>.
I don't control the system, and the admins won't open up ports to the outside. I'm stuck with PuTTY access.

________________________________
From: Jason Morris [jmorris@texeltek.com]
Sent: Thursday, January 31, 2013 11:44 AM
To: user@accumulo.apache.org
Subject: Re: Accumulo Configuration Question

Have you tried setting up a SOCKS proxy via SSH and pulling the status page that way?


On Thu, Jan 31, 2013 at 11:19 AM, Parker, Matthew - IS <Ma...@exelisinc.com> wrote:
I'm sort of flying blind. The cluster is in a headless environment, and I can only access the system via PuTTY at the command prompt. I've had to resort to using lynx to browse the monitor page. Unfortunately, the graphs don't translate well in a text-based browser. Is there another way to get that info through the logs?

________________________________
From: William Slacum [wilhelm.von.cloud@accumulo.net]
Sent: Thursday, January 31, 2013 10:52 AM
To: user@accumulo.apache.org
Subject: Re: Accumulo Configuration Question

This doesn't have much to do with your cluster setup, but what does the monitor say as your jobs are nearing completion and things start failing? Are there hold times for the table(s) you are writing to?

On Thu, Jan 31, 2013 at 10:19 AM, Parker, Matthew - IS <Ma...@exelisinc.com> wrote:
TWIMC:

I'm new to Accumulo and I've been trying to come up with a good architecture for a 20 node cluster. I have been running a map/reduce program, and it encounters issues when it comes to running the Accumulo section of the code. Once the job's completion rate exceeds 93%, it starts dropping tens of tasks during the process, because they eventually time out. The completion rate drops back down, but the job eventually finishes. I have a suspicion it's due to the way I have the system configured, and I wanted to get some feedback as to what's the generally preferred architecture when installing Accumulo?

Since you have the choice of installing HDFS, map/reduce, and Accumulo tablet servers on any node, the general guideline is to install two per machine (data node and tablet server, or data node and map/reduce), as per the Hardware section in the Administration documentation.

http://accumulo.apache.org/1.4/user_manual/Administration.html#Hardware

Does that mean you have one large group of data nodes that's installed on the majority of the cluster, or are they somehow split into two groups such that map/reduce & hdfs runs on one set of nodes, and Accumulo tablet servers and hdfs uses another?

I was wondering whether people would comment on what a working configuration might look like?

TIA,

Matt





--
Jason Morris
TexelTek Inc.
308 Sentinel Drive
Suite 500
Annapolis Junction, MD  20701
Office: 301.880.7123 Ext. 6677

Re: Accumulo Configuration Question

Posted by Jason Morris <jm...@texeltek.com>.
Have you tried setting up a SOCKS proxy via SSH and pulling the status page
that way?
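For anyone similarly restricted, the SOCKS approach can be sketched as follows; the hostnames and the local port are placeholders, and 50095 is the default Accumulo 1.4 monitor port:

```shell
# Open a SOCKS proxy on local port 1080 through the cluster's SSH gateway.
# (PuTTY equivalent: Connection > SSH > Tunnels, type "Dynamic", port 1080.)
ssh -D 1080 -N user@gateway-host

# Then point a local browser (or curl) at the monitor through the proxy:
curl --socks5-hostname localhost:1080 http://monitor-host:50095/
```

This keeps everything over the existing SSH session, so no new server-side ports need to be opened.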


On Thu, Jan 31, 2013 at 11:19 AM, Parker, Matthew - IS <
Matthew.Parker@exelisinc.com> wrote:

>  I'm sort of flying blind. The cluster is on a headless environment, and
> I can only access the system via putty at the command prompt. I've had to
> resort to using lynx to browse the monitor page. Unfortunately, the graphs
> don't translate well when using a text-based browser. Is there another way
> to get that info through the logs?
>
>  ------------------------------
> *From:* William Slacum [wilhelm.von.cloud@accumulo.net]
> *Sent:* Thursday, January 31, 2013 10:52 AM
> *To:* user@accumulo.apache.org
> *Subject:* Re: Accumulo Configuration Question
>
>  This doesn't have much to do with your cluster set up, but what does the
> monitor say as your jobs are nearing completion and things start failing?
> Are there hold times for the table(s) you are writing to?
>
> On Thu, Jan 31, 2013 at 10:19 AM, Parker, Matthew - IS <
> Matthew.Parker@exelisinc.com> wrote:
>
>>  TWIMC:
>>
>> I'm new to Accumulo and I've been trying to come up with a good
>> architecture for a 20 node cluster. I have been running a map/reduce
>> program, and it encounters issues when it comes to running the Accumulo
>> section of the code. Once the job's completion rate exceeds 93%, it
>> starts dropping tens of tasks during the process, because they
>> eventually time out. The completion rate drops back down, but the job
>> eventually finishes. I have a suspicion it's due to the way I have the
>> system configured and I wanted to get some feedback as to what's the
>> generally preferred architecture when installing Accumulo?
>>
>> Since you have the choice of installing hdfs, map/reduce, and tablet
>> servers on any three, the general guideline is to install two per
>> machine (data node and table server, or data node and map/reduce) as per
>> the Hardware section in the Administration documentation.
>>
>> http://accumulo.apache.org/1.4/user_manual/Administration.html#Hardware
>>
>> Does that mean you have one large group of data nodes that's installed on
>> the majority of the cluster, or are they somehow split into two groups such
>> that map/reduce & hdfs runs on one set of nodes, and Accumulo tablet
>> servers and hdfs uses another?
>>
>> I was wondering whether people would comment on what a working
>> configuration might look like?
>>
>> TIA,
>>
>> Matt
>>
>>
>
>


-- 
Jason Morris
TexelTek Inc.
308 Sentinel Drive
Suite 500
Annapolis Junction, MD  20701
Office: 301.880.7123 Ext. 6677

RE: Accumulo Configuration Question

Posted by "Parker, Matthew - IS" <Ma...@exelisinc.com>.
I'm sort of flying blind. The cluster is in a headless environment, and I can only access the system via PuTTY at the command prompt. I've had to resort to using lynx to browse the monitor page. Unfortunately, the graphs don't translate well in a text-based browser. Is there another way to get that info through the logs?

________________________________
From: William Slacum [wilhelm.von.cloud@accumulo.net]
Sent: Thursday, January 31, 2013 10:52 AM
To: user@accumulo.apache.org
Subject: Re: Accumulo Configuration Question

This doesn't have much to do with your cluster setup, but what does the monitor say as your jobs are nearing completion and things start failing? Are there hold times for the table(s) you are writing to?

On Thu, Jan 31, 2013 at 10:19 AM, Parker, Matthew - IS <Ma...@exelisinc.com> wrote:
TWIMC:

I'm new to Accumulo and I've been trying to come up with a good architecture for a 20 node cluster. I have been running a map/reduce program, and it encounters issues when it comes to running the Accumulo section of the code. Once the job's completion rate exceeds 93%, it starts dropping tens of tasks during the process, because they eventually time out. The completion rate drops back down, but the job eventually finishes. I have a suspicion it's due to the way I have the system configured, and I wanted to get some feedback as to what's the generally preferred architecture when installing Accumulo?

Since you have the choice of installing HDFS, map/reduce, and Accumulo tablet servers on any node, the general guideline is to install two per machine (data node and tablet server, or data node and map/reduce), as per the Hardware section in the Administration documentation.

http://accumulo.apache.org/1.4/user_manual/Administration.html#Hardware

Does that mean you have one large group of data nodes that's installed on the majority of the cluster, or are they somehow split into two groups such that map/reduce & hdfs runs on one set of nodes, and Accumulo tablet servers and hdfs uses another?

I was wondering whether people would comment on what a working configuration might look like?

TIA,

Matt



Re: Accumulo Configuration Question

Posted by William Slacum <wi...@accumulo.net>.
This doesn't have much to do with your cluster setup, but what does the
monitor say as your jobs are nearing completion and things start failing?
Are there hold times for the table(s) you are writing to?

On Thu, Jan 31, 2013 at 10:19 AM, Parker, Matthew - IS <
Matthew.Parker@exelisinc.com> wrote:

>  TWIMC:
>
> I'm new to Accumulo and I've been trying to come up with a good
> architecture for a 20 node cluster. I have been running a map/reduce
> program, and it encounters issues when it comes to running the Accumulo
> section of the code. Once the job's completion rate exceeds 93%, it
> starts dropping tens of tasks during the process, because they
> eventually time out. The completion rate drops back down, but the job
> eventually finishes. I have a suspicion it's due to the way I have the
> system configured and I wanted to get some feedback as to what's the
> generally preferred architecture when installing Accumulo?
>
> Since you have the choice of installing hdfs, map/reduce, and tablet
> servers on any three, the general guideline is to install two per machine
> (data node and table server, or data node and map/reduce) as per the
> Hardware section in the Administration documentation.
>
> http://accumulo.apache.org/1.4/user_manual/Administration.html#Hardware
>
> Does that mean you have one large group of data nodes that's installed on
> the majority of the cluster, or are they somehow split into two groups such
> that map/reduce & hdfs runs on one set of nodes, and Accumulo tablet
> servers and hdfs uses another?
>
> I was wondering whether people would comment on what a working
> configuration might look like?
>
> TIA,
>
> Matt
>
>