bulk ingest without mapred

Posted to user@accumulo.apache.org by pdread <pa...@siginttech.com> on 2014/04/08 15:40:08 UTC

Hi

I interface to an accumulo cloud (100s of nodes) which I don't maintain.
I'll try and keep this short: the interface app is used to ingest millions
of docs/week from various streams, some of which are required in near real time. A
problem came up where the tservers would not stay up and our ingest would
halt. Now the admins are working on fixing this but I'm not optimistic.
Others who have run into this tell me it's the use of Mutations that is
causing the problem and it will go away if I do bulk ingest. However,
mapreduce is way too slow to spin up and does not map to our architecture.

So here is what I have been trying to do. After much research I think I
should be able to bulk ingest if I create the RFile and feed it to
TableOperations.importDirectory(). I can create the RFile OK, at least I
think so, and I create the "failure" directory using Hadoop's file system. I
check that the failure directory is there and is a directory, but when I feed
it to the import I get an error in the accumulo master log saying that it
cannot find the failure directory. The interesting thing is I have
traced the code through the accumulo client and it checks successfully for the
load file and the failure directory. What am I doing wrong?
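
A rough sketch of that flow, assuming the 1.4-era internal FileOperations/FileSKVWriter
API and placeholder instance, credentials, table and paths, might look like this:

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.conf.AccumuloConfiguration;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.file.FileOperations;
import org.apache.accumulo.core.file.FileSKVWriter;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;

public class BulkLoadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // needs core-site.xml/hdfs-site.xml on the classpath
    FileSystem fs = FileSystem.get(conf);     // otherwise this silently becomes the local filesystem

    // Both directories live in HDFS; the failure directory must already exist (and be empty).
    Path loadDir = new Path("/tmp/bulk/entities/load");
    Path failDir = new Path("/tmp/bulk/entities_fails/failures");
    fs.mkdirs(loadDir);
    fs.mkdirs(failDir);

    // The ".rf" extension matters: the file factory picks the RFile writer from it.
    FileSKVWriter writer = FileOperations.getInstance().openWriter(
        new Path(loadDir, "entities.rf").toString(), fs, conf,
        AccumuloConfiguration.getDefaultConfiguration());
    writer.startDefaultLocalityGroup();
    // Keys must be appended in sorted order.
    writer.append(new Key(new Text("row1"), new Text("cf"), new Text("cq")),
        new Value("value".getBytes()));
    writer.close();

    // Instance name, zookeepers, user, password and table name are all placeholders.
    Connector conn = new ZooKeeperInstance("myinstance", "zkhost:2181")
        .getConnector("user", "password".getBytes());
    conn.tableOperations().importDirectory("entities", loadDir.toString(),
        failDir.toString(), false);
  }
}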

First the client error:

org.apache.accumulo.core.client.AccumuloException: Internal error processing waitForTableOperation
	at org.apache.accumulo.core.client.admin.TableOperationsImpl.doTableOperation(TableOperationsImpl.java:290)
	at org.apache.accumulo.core.client.admin.TableOperationsImpl.doTableOperation(TableOperationsImpl.java:258)
	at org.apache.accumulo.core.client.admin.TableOperationsImpl.importDirectory(TableOperationsImpl.java:945)
	at airs.medr.accumulo.server.table.EntityTable.writeEntities(EntityTable.java:130)

Now the master log exception:

2014-04-08 08:33:50,609 [thrift.MasterClientService$Processor] ERROR: Internal error processing waitForTableOperation
java.lang.RuntimeException: java.io.FileNotFoundException: File does not exist: bulk/entities_fails/failures
        at org.apache.accumulo.server.master.Master$MasterClientServiceHandler.waitForTableOperation(Master.java:1053)
        at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.accumulo.cloudtrace.instrument.thrift.TraceWrap$1.invoke(TraceWrap.java:59)
        at $Proxy6.waitForTableOperation(Unknown Source)
        at org.apache.accumulo.core.master.thrift.MasterClientService$Processor$waitForTableOperation.process(MasterClientService.java:2004)
        at org.apache.accumulo.core.master.thrift.MasterClientService$Processor.process(MasterClientService.java:1472)
        at org.apache.accumulo.server.util.TServerUtils$TimedProcessor.process(TServerUtils.java:154)
        at org.apache.thrift.server.TNonblockingServer$FrameBuffer.invoke(TNonblockingServer.java:631)
        at org.apache.accumulo.server.util.TServerUtils$THsHaServer$Invocation.run(TServerUtils.java:202)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.FileNotFoundException: File does not exist: bulk/entities_fails/failures
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:528)
        at org.apache.accumulo.server.trace.TraceFileSystem.getFileStatus(TraceFileSystem.java:797)
        at org.apache.accumulo.server.master.tableOps.BulkImport.call(BulkImport.java:157)
        at org.apache.accumulo.server.master.tableOps.BulkImport.call(BulkImport.java:110)
        at org.apache.accumulo.server.master.tableOps.TraceRepo.call(TraceRepo.java:65)
        at org.apache.accumulo.server.fate.Fate$TransactionRunner.run(Fate.java:65)

 
Thoughts?

Thanks

Paul
  




Re: bulk ingest without mapred

Posted by William Slacum <wi...@accumulo.net>.
You should be creating those directories via a FileSystem object. Browse
your file system using the Namenode's webapp (if it's local, it'll usually
be http://localhost:50070) to see if those directories exist.
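
A minimal programmatic check along those lines (run inside a main() with the usual checked exceptions declared; the path below is just an example) could be:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();  // must pick up core-site.xml/hdfs-site.xml
FileSystem fs = FileSystem.get(conf);
System.out.println("filesystem: " + fs.getUri());  // expect hdfs://..., not file:///

Path failures = new Path("/tmp/bulk/entities_fails/failures");  // example path
fs.mkdirs(failures);
FileStatus status = fs.getFileStatus(failures);
System.out.println(status.getPath() + " isDir=" + status.isDir());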

On Tue, Apr 8, 2014 at 12:48 PM, pdread <pa...@siginttech.com> wrote:

> Josh
>
> So what you're telling me there is no programmatic way to add files to HDFS
> except via the command line?
>
> If thats the case then its a pretty sad system. The world doesn't run on
> the
> command line.
>
> Thanks
>
> Paul
>
>
>
> --
> View this message in context:
> http://apache-accumulo.1065345.n5.nabble.com/bulk-ingest-without-mapred-tp8904p8916.html
> Sent from the Users mailing list archive at Nabble.com.
>

Re: bulk ingest without mapred

Posted by John Vines <vi...@apache.org>.
It looks like you're not actually adding the files to hdfs, you're just
putting them in the datanode block directories and expecting them to show
up. Look into
http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/fs/FileSystem.html
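
For example, an RFile written locally could be pushed into the HDFS load directory through that API rather than by writing under the dfs.data.dir directories (both paths below are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

FileSystem fs = FileSystem.get(new Configuration());
// Goes through the namenode, so HDFS actually knows about the file,
// unlike files dropped into the datanodes' local block directories.
fs.copyFromLocalFile(new Path("/local/scratch/entities.rf"),
    new Path("/tmp/bulk/entities/load/entities.rf"));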


On Tue, Apr 8, 2014 at 12:48 PM, pdread <pa...@siginttech.com> wrote:

> Josh
>
> So what you're telling me there is no programmatic way to add files to HDFS
> except via the command line?
>
> If thats the case then its a pretty sad system. The world doesn't run on
> the
> command line.
>
> Thanks
>
> Paul
>
>
>
> --
> View this message in context:
> http://apache-accumulo.1065345.n5.nabble.com/bulk-ingest-without-mapred-tp8904p8916.html
> Sent from the Users mailing list archive at Nabble.com.
>

Re: bulk ingest without mapred

Posted by pdread <pa...@siginttech.com>.
Sean

Cool idea. Thanks. I just started looking at hadoop (programming-wise)
yesterday.

Paul




Re: bulk ingest without mapred

Posted by Sean Busbey <bu...@cloudera.com>.
Rather than hard-code the extension, you should rely on
o.a.a.core.file.rfile.RFile.EXTENSION.

Keep in mind that since it isn't in the publicly supported API, RFile-related
things might change with little to no warning. (It appears to have
held stable through 1.5 and 1.6, FWIW.)
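
Something like the following, assuming the constant lives in o.a.a.core.file.rfile.RFile as it does in 1.4/1.5 (its value is "rf" without the dot):

import org.apache.accumulo.core.file.rfile.RFile;
import org.apache.hadoop.fs.Path;

// Add the separator yourself; RFile.EXTENSION is just "rf".
Path loadFile = new Path("/tmp/bulk/entities/load", "entities." + RFile.EXTENSION);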


On Tue, Apr 8, 2014 at 10:36 AM, pdread <pa...@siginttech.com> wrote:

> Bill
>
> Hey thanks.. I was almost there...
>
> Yes I used the appropriate writer.
>
> Paul
>
>
>
> --
> View this message in context:
> http://apache-accumulo.1065345.n5.nabble.com/bulk-ingest-without-mapred-tp8904p8924.html
> Sent from the Users mailing list archive at Nabble.com.
>



-- 
Sean

Re: bulk ingest without mapred

Posted by pdread <pa...@siginttech.com>.

Thanks, one and all...it worked...now I can spin some tomcats up on the tserver
nodes and bulk-load the files, hopefully at a very high rate.

I may get to keep my job!






Re: bulk ingest without mapred

Posted by pdread <pa...@siginttech.com>.
Bill

Hey thanks.. I was almost there... 

Yes I used the appropriate writer.

Paul




Re: bulk ingest without mapred

Posted by William Slacum <wi...@accumulo.net>.
The extension is ".rf". Are you using an RFile.Writer?


On Tue, Apr 8, 2014 at 1:29 PM, pdread <pa...@siginttech.com> wrote:

> Josh
>
> As I had stated in one of my previous posts I am using FileSystem. I am
> using the code from the MapReduce bulk ingest without the MapReduce. I did
> feed the TableOperations.importDirectory a "load" directory and that is
> where if found the "entities.txt", in that load directory. So now the only
> question remains is what is the proper extension for the RFile. The
> "entities.txt" is a RFile which I created witht the appropriate Key/Value
> pairs that should load/match my table.
>
> Thanks
>
> Paul
>
>
>
> --
> View this message in context:
> http://apache-accumulo.1065345.n5.nabble.com/bulk-ingest-without-mapred-tp8904p8922.html
> Sent from the Users mailing list archive at Nabble.com.
>

Re: bulk ingest without mapred

Posted by pdread <pa...@siginttech.com>.
Josh

As I had stated in one of my previous posts I am using FileSystem. I am
using the code from the MapReduce bulk ingest without the MapReduce. I did
feed TableOperations.importDirectory a "load" directory, and that is
where it found the "entities.txt", in that load directory. So now the only
question that remains is what is the proper extension for the RFile. The
"entities.txt" is an RFile which I created with the appropriate Key/Value
pairs that should load/match my table.

Thanks

Paul




Re: bulk ingest without mapred

Posted by Josh Elser <jo...@gmail.com>.
You're still a bit confused :). John's link is worth reading too.

The "FileSystem" I referred to originally is a class provided by HDFS.
It has multiple implementations (notably the LocalFileSystem and
DistributedFileSystem) which are returned when you use
FileSystem.get(...) based on the contents of the Configuration object
pulled in by the hdfs configuration files (core-site.xml and
hdfs-site.xml). A complete programmatic API is available via HDFS --
this is what Accumulo uses.

Regarding your most recent error, you should be providing a directory
of RFiles. Not sure what's in the text file you provided, or what you
intended it to do...
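
If the config files can't easily go on the classpath, they can also be added to the Configuration explicitly before calling FileSystem.get(...); the file locations below are only examples:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));  // example locations
conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));

FileSystem fs = FileSystem.get(conf);
// With fs.default.name pointing at the namenode this is a DistributedFileSystem;
// without it you silently get the LocalFileSystem (file:///).
System.out.println(fs.getUri());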

On Tue, Apr 8, 2014 at 12:59 PM, pdread <pa...@siginttech.com> wrote:
> Josh
>
> I want to thank you...you solved some of my problems, unintentionally but
> none the less.
>
> I added all the hadoop config files to the classpath and bingo, accumulo can
> now find the files...the only down side now is my load file must require a
> certain extension. I had used ".txt" but it complained with
>
> 2014-04-08 12:53:47,764 [tableOps.BulkImport] WARN :
> hdfs://localhost:9000/data/accu1/hdfs/tmp/bulk/entities/load/entities.txt
> does not have a valid extension, ignoring
>
> So now the hunt is on for the correct RFile extension.
>
> Thanks
>
> Paul
>
>
>
> --
> View this message in context: http://apache-accumulo.1065345.n5.nabble.com/bulk-ingest-without-mapred-tp8904p8917.html
> Sent from the Users mailing list archive at Nabble.com.

Re: bulk ingest without mapred

Posted by pdread <pa...@siginttech.com>.
Josh

I want to thank you...you solved some of my problems, unintentionally but
nonetheless.

I added all the hadoop config files to the classpath and bingo, accumulo can
now find the files...the only downside now is that my load file requires a
certain extension. I had used ".txt" but it complained with

2014-04-08 12:53:47,764 [tableOps.BulkImport] WARN :
hdfs://localhost:9000/data/accu1/hdfs/tmp/bulk/entities/load/entities.txt
does not have a valid extension, ignoring

So now the hunt is on for the correct RFile extension.

Thanks

Paul




Re: bulk ingest without mapred

Posted by pdread <pa...@siginttech.com>.
Josh

So what you're telling me is that there is no programmatic way to add files to HDFS
except via the command line?

If that's the case then it's a pretty sad system. The world doesn't run on the
command line.

Thanks

Paul




Re: bulk ingest without mapred

Posted by Josh Elser <jo...@gmail.com>.
Paul,

It might be a good idea to re-read a basic overview on HDFS. You
shouldn't be modifying anything beneath the HDFS data directories.
Those directories on the local filesystem are used by HDFS to create a
distributed filesystem (which is what Accumulo is using).

The paths that you provide to Accumulo for bulk imports all exist on
that distributed filesystem, which should be modified using the hadoop
or hdfs executable on the command line, or the FileSystem API with your
hdfs-site.xml configuration file on the classpath.
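
As a quick sanity check of what the master will actually see, the load directory can be listed through that same API (the path below is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

FileSystem fs = FileSystem.get(new Configuration());  // with the site files on the classpath
for (FileStatus f : fs.listStatus(new Path("/tmp/bulk/entities/load"))) {
  System.out.println(f.getPath());  // the files a bulk import of this directory would consider
}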

On Tue, Apr 8, 2014 at 12:36 PM, pdread <pa...@siginttech.com> wrote:
> My hdfs-site.xml has the data nodes (space?) defined as
>
> <property>
>         <name>dfs.data.dir</name>
>         <value>/data/accu1/hdfs,/data/accu2/hdfs</value>
> </property>
>
> So I created the files/directories under /data/accu1/hdfs/tmp/bulk, and so
> they were.
>
> After more exploring I found the Hadoop code that is causing the problem,
> DFSClient.getFileInfo() is returning null.
>
>  public FileStatus getFileInfo(String src) throws IOException {
>     FileStatus fileStatus;
>
>     checkOpen();
>     try {
>       if (fileStatusCache != null) {
>         fileStatus = fileStatusCache.get(src);
>         if (fileStatus != FileStatusCache.nullFileStatus) {
>           return fileStatus;
>         }
>       }
>       fileStatus = namenodeProtocolProxy == null ?
> versionBasedGetFileInfo(src)
>           : methodBasedGetFileInfo(src);
>    if (fileStatusCache != null) {
>    fileStatusCache.set(src, fileStatus);
>    }
>
>    return fileStatus;
>     } catch(RemoteException re) {
>       throw re.unwrapRemoteException(AccessControlException.class);
>     }
>   }
>
> So I guess now why is this the case. I noticed that no logging was done to
> the hadoop logs, specifically the namenode and datanode logs. The DFSClient
> code refers to rpc calls which would suggest its connection into the hadoop
> system and not looking at the disk directly. Since I used FileSystem to do
> the file manipulation is there additional bookkeeping that needs to be done
> to let the "hadoop" system know there are files out there? In other words
> even though I used hadoop to create the files does "hadoop" proper know
> about them? If not then what bookkeeping has to be done to get them into the
> system.
>
> Just a guess here. But since the files are clear there and clearly available
> there must be something else at play.
>
> Thanks
>
> Paul
>
>
>
> --
> View this message in context: http://apache-accumulo.1065345.n5.nabble.com/bulk-ingest-without-mapred-tp8904p8914.html
> Sent from the Users mailing list archive at Nabble.com.

Re: bulk ingest without mapred

Posted by pdread <pa...@siginttech.com>.
My hdfs-site.xml has the data nodes (space?) defined as

<property>
        <name>dfs.data.dir</name>
        <value>/data/accu1/hdfs,/data/accu2/hdfs</value>
</property>

So I created the files/directories under /data/accu1/hdfs/tmp/bulk, and there
they were.

After more exploring I found the Hadoop code that is causing the problem:
DFSClient.getFileInfo() is returning null.

  public FileStatus getFileInfo(String src) throws IOException {
    FileStatus fileStatus;

    checkOpen();
    try {
      if (fileStatusCache != null) {
        fileStatus = fileStatusCache.get(src);
        if (fileStatus != FileStatusCache.nullFileStatus) {
          return fileStatus;
        }
      }
      fileStatus = namenodeProtocolProxy == null
          ? versionBasedGetFileInfo(src)
          : methodBasedGetFileInfo(src);
      if (fileStatusCache != null) {
        fileStatusCache.set(src, fileStatus);
      }

      return fileStatus;
    } catch (RemoteException re) {
      throw re.unwrapRemoteException(AccessControlException.class);
    }
  }

So now I guess the question is why this is the case. I noticed that no logging was done to
the hadoop logs, specifically the namenode and datanode logs. The DFSClient
code refers to rpc calls, which would suggest it's connecting into the hadoop
system and not looking at the disk directly. Since I used FileSystem to do
the file manipulation, is there additional bookkeeping that needs to be done
to let the "hadoop" system know there are files out there? In other words,
even though I used hadoop to create the files, does "hadoop" proper know
about them? If not, then what bookkeeping has to be done to get them into the
system?

Just a guess here. But since the files are clearly there and clearly available,
there must be something else at play.

Thanks

Paul 




Re: bulk ingest without mapred

Posted by William Slacum <wi...@accumulo.net>.
Do you mean "/bulk/entities/load"?

On Tue, Apr 8, 2014 at 11:08 AM, pdread <pa...@siginttech.com> wrote:

> Ok I changed the locations to be the same location as the hdfs-site.xml,
> verified the files/directorys are there with the correct permissions, and
> get the same error. Accumulo master cannot find that location.
>
> As an aside my test server is setup with a single node Accumulo so I have
> complete control over this Accumulo. Running version 1.4.4. and hadoop
> 1.1.2. My test (eclipse) is run on the test server so all have access to
> the
> hdfs location.
>
> Thanks
>
> Paul
>
>
>
> --
> View this message in context:
> http://apache-accumulo.1065345.n5.nabble.com/bulk-ingest-without-mapred-tp8904p8911.html
> Sent from the Users mailing list archive at Nabble.com.
>

Re: bulk ingest without mapred

Posted by pdread <pa...@siginttech.com>.
Ok, I changed the locations to be the same location as in hdfs-site.xml,
verified the files/directories are there with the correct permissions, and
get the same error. The Accumulo master cannot find that location.

As an aside, my test server is set up with a single-node Accumulo, so I have
complete control over this Accumulo. Running version 1.4.4 and hadoop
1.1.2. My test (Eclipse) is run on the test server so everything has access to the
hdfs location.

Thanks

Paul




Re: bulk ingest without mapred

Posted by pdread <pa...@siginttech.com>.
Sorry for answering my own post but that was a very dumb question..

I should just use the hdfs file location defined in the hdfs-site.xml....dah

Thanks

Paul




Re: bulk ingest without mapred

Posted by pdread <pa...@siginttech.com>.
This is where my limited knowledge of Accumulo/hadoop shows.

After running the mapreduce examples and seeing how they define the files and
directories, my assumption was that what to do with the files was under the
control of the Hadoop FileSystem. Now, sitting here thinking about it, I did
notice that mapreduce created a user area where it was placing files.

So, using the hadoop FileSystem, how would I define a working area that
accumulo proper would know about?

This is how I call importDirectory:

getAdmin().getConnector().tableOperations()
    .importDirectory(name(), "bulk/entities/load", "bulk/entities_fails/failures", false);

Thanks

Paul





Re: bulk ingest without mapred

Posted by William Slacum <wi...@accumulo.net>.
 java.io.FileNotFoundException: File does not exist:
bulk/entities_fails/failures

sticks out to me. It looks like a relative path. Where does that directory
exist on your file system?
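
Relative Paths in Hadoop are resolved against the working directory of whichever process evaluates them (on HDFS typically /user/<username>), so the master may not be looking where the client did. A sketch of the earlier call with absolute, placeholder paths:

// getAdmin() and name() are the helpers from the earlier snippet; the paths are only examples.
getAdmin().getConnector().tableOperations()
    .importDirectory(name(), "/tmp/bulk/entities/load", "/tmp/bulk/entities_fails/failures", false);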


On Tue, Apr 8, 2014 at 9:40 AM, pdread <pa...@siginttech.com> wrote:

> Hi
>
> I interface to an accumulo cloud (100s of nodes) which I don't maintain.
> I'll try and keep this short, the interface App is used to ingest millions
> of docs/week from various streams, some are required near real time. A
> problem came up where the tservers would not stay up and our ingest would
> halt. Now the admins are working on fixing this but I'm not optimistic.
> Others who have run into this tell me its the use of Mutations that is
> causing the problem and it will go away if I do bulk ingest. However
> mapreduce is way to slow to spin up and does not map to our arch.
>
> So here is what I have been trying to do. After much research I think I
> should be able to bulk ingest if I create the RFile and feed this to
> TableOperations.importDirectory(). I can create the RFile ok, at least I
> thinks so, I create the "failure" directory using hadoops' file system. I
> check that the failure directory is there and is a directory but when I
> feed
> it to the import I get an error over on the accumulo master log that the it
> can not find the failure directory. Now the interesting thing is I have
> traced the code thourgh the accumulo client it checks successfully for the
> load file and the failure directory. What am I doing wrong?
>
> First the client error:
>
> org.apache.accumulo.core.client.AccumuloException: Internal error
> processing
> waitForTableOperation
>         at
>
> org.apache.accumulo.core.client.admin.TableOperationsImpl.doTableOperation(TableOperationsImpl.java:290)
>         at
>
> org.apache.accumulo.core.client.admin.TableOperationsImpl.doTableOperation(TableOperationsImpl.java:258)
>         at
>
> org.apache.accumulo.core.client.admin.TableOperationsImpl.importDirectory(TableOperationsImpl.java:945)
>         at
>
> airs.medr.accumulo.server.table.EntityTable.writeEntities(EntityTable.java:130)
>
> Now the master log exception:
>
> 2014-04-08 08:33:50,609 [thrift.MasterClientService$Processor] ERROR:
> Internal error processing waitForTableOperation
> java.lang.RuntimeException: java.io.FileNotFoundException: File does not
> exist: bulk/entities_fails/failures
>         at
>
> org.apache.accumulo.server.master.Master$MasterClientServiceHandler.waitForTableOperation(Master.java:1053)
>         at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
>         at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at
>
> org.apache.accumulo.cloudtrace.instrument.thrift.TraceWrap$1.invoke(TraceWrap.java:59)
>         at $Proxy6.waitForTableOperation(Unknown Source)
>         at
>
> org.apache.accumulo.core.master.thrift.MasterClientService$Processor$waitForTableOperation.process(MasterClientService.java:2004)
>         at
>
> org.apache.accumulo.core.master.thrift.MasterClientService$Processor.process(MasterClientService.java:1472)
>         at
>
> org.apache.accumulo.server.util.TServerUtils$TimedProcessor.process(TServerUtils.java:154)
>         at
>
> org.apache.thrift.server.TNonblockingServer$FrameBuffer.invoke(TNonblockingServer.java:631)
>         at
>
> org.apache.accumulo.server.util.TServerUtils$THsHaServer$Invocation.run(TServerUtils.java:202)
>         at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at
> org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: java.io.FileNotFoundException: File does not exist:
> bulk/entities_fails/failures
>         at
>
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:528)
>         at
>
> org.apache.accumulo.server.trace.TraceFileSystem.getFileStatus(TraceFileSystem.java:797)
>         at
>
> org.apache.accumulo.server.master.tableOps.BulkImport.call(BulkImport.java:157)
>         at
>
> org.apache.accumulo.server.master.tableOps.BulkImport.call(BulkImport.java:110)
>         at
>
> org.apache.accumulo.server.master.tableOps.TraceRepo.call(TraceRepo.java:65)
>         at
> org.apache.accumulo.server.fate.Fate$TransactionRunner.run(Fate.java:65)
>
>
> Thoughts?
>
> Thanks
>
> Paul
>
>
>
>
> --
> View this message in context:
> http://apache-accumulo.1065345.n5.nabble.com/bulk-ingest-without-mapred-tp8904.html
> Sent from the Users mailing list archive at Nabble.com.
>