
Posted to user@accumulo.apache.org by pdread <pa...@siginttech.com> on 2014/04/10 13:10:44 UTC

Stream fed accumulo

Hi

This has been bothering me for some time, and I suspect it's a dumb question,
but what the heck.

The Accumulo client API only accepts byte[] or Text as its Mutation input.
Would it be possible to use a stream instead (developers?)? I am processing
streams and have to handle files to the tune of 10GB, which I would like to
store in Accumulo, but I have read that I cannot. It would reduce the memory
footprint on my Tomcats if I could stream my data into Accumulo and not deal
with bytes/text.

Oh, and Accumulo developers, while you're at it adding this new feature, it
would be nice if the bulk loads could append instead of just replacing the
tables.

Thanks

Paul 



--
View this message in context: http://apache-accumulo.1065345.n5.nabble.com/Stream-fed-accumulo-tp8981.html
Sent from the Users mailing list archive at Nabble.com.

Re: Stream fed accumulo

Posted by David Medinets <da...@gmail.com>.

I have stored large numbers of large files in Accumulo using a derivative
of the File System Archive example
(https://accumulo.apache.org/1.4/examples/dirlist.html). My code accepted
streams but stored chunks, rather than the whole file, in the Value. Each
ColumnQualifier essentially carried a chunk index.
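
A minimal sketch of the chunking approach described above, assuming a fixed
chunk size and a zero-padded chunk index used as the column qualifier.
StreamChunker, ChunkSink, and the qualifier format are hypothetical names for
illustration; they are not part of the Accumulo API or of the dirlist example.
In real use the sink would presumably build one Mutation per chunk and hand it
to a BatchWriter.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.Locale;

/** Hypothetical helper: splits a stream into fixed-size chunks keyed by a chunk index. */
final class StreamChunker {

    /**
     * Hypothetical callback. A real sink might do something like (assumption):
     *   Mutation m = new Mutation(rowId);
     *   m.put(new Text("data"), new Text(chunkQualifier), new Value(chunk));
     *   batchWriter.addMutation(m);
     */
    interface ChunkSink {
        void accept(String rowId, String chunkQualifier, byte[] chunk) throws IOException;
    }

    /** Zero-padded so that lexicographic order of qualifiers matches chunk order. */
    static String qualifier(int chunkIndex) {
        return String.format(Locale.ROOT, "%010d", chunkIndex);
    }

    /**
     * Reads the stream chunkSize bytes at a time and hands each chunk to the sink.
     * The whole file is never held in memory; as long as the sink does not retain
     * chunks, memory use stays proportional to chunkSize.
     * Returns the number of chunks handed to the sink.
     */
    static int writeChunks(InputStream in, String rowId, int chunkSize, ChunkSink sink)
            throws IOException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        byte[] buffer = new byte[chunkSize];
        int index = 0;
        while (true) {
            int filled = 0;
            // Fill the buffer completely unless the stream ends first.
            while (filled < chunkSize) {
                int n = in.read(buffer, filled, chunkSize - filled);
                if (n < 0) {
                    break;
                }
                filled += n;
            }
            if (filled == 0) {
                break; // nothing left to write
            }
            sink.accept(rowId, qualifier(index++), Arrays.copyOf(buffer, filled));
            if (filled < chunkSize) {
                break; // a short chunk means the stream has ended
            }
        }
        return index;
    }
}
```

With a 25-byte stream and a chunk size of 10, the sink receives three chunks
(10, 10, and 5 bytes) under qualifiers 0000000000 through 0000000002. The
zero-padding keeps the qualifiers in chunk order under Accumulo's
lexicographic key sorting.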



Re: Stream fed accumulo

Posted by Joe Gresock <jg...@gmail.com>.

Right, but you can make the byte[] buffer as large or as small as you
want. For example, BufferedInputStream defaults to an 8K buffer; this
would be similar. This was a good solution for us, since Accumulo doesn't
inherently support streaming at the moment.



Re: Stream fed accumulo

Posted by pdread <pa...@siginttech.com>.

You're still passing byte[] to your mutations, which means your process
allocates space for that buffer. I was hoping for

public void myPut(Stream mydata, String key, String type)
{
  Mutation mutation = new Mutation(generateRowId(key, type));
  mutation.put(DATA_CF, encoder.encode(sequenceNum), visibility, timestamp,
      mydata);
}

Then under the covers the Accumulo client streams the data to wherever the
data goes. Maybe similar to how one would use the DataHandler in a REST
service to pass a stream in a REST call.

Thanks

Paul




Re: Stream fed accumulo

Posted by Joe Gresock <jg...@gmail.com>.

We were able to use this implementation in our code to stream to and from
Accumulo:
https://github.com/calrissian/accumulo-recipes/blob/master/store/blob-store/src/main/java/org/calrissian/accumulorecipes/blobstore/impl/AccumuloBlobStore.java
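
For the read side, a rough sketch of how stored chunks could be exposed as a
single InputStream again. This is not taken from the linked AccumuloBlobStore;
ChunkedStreamReader is a hypothetical name, and the sketch assumes the chunks
arrive in chunk-index order (for example, from an iterator over scan results
for one row).

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.util.Enumeration;
import java.util.Iterator;

/** Hypothetical helper: presents an ordered sequence of chunks as one stream. */
final class ChunkedStreamReader {

    /**
     * Wraps each chunk in a ByteArrayInputStream only when the reader reaches it,
     * so chunks are pulled from the iterator one at a time rather than all at once.
     */
    static InputStream open(final Iterator<byte[]> chunks) {
        Enumeration<InputStream> streams = new Enumeration<InputStream>() {
            @Override
            public boolean hasMoreElements() {
                return chunks.hasNext();
            }

            @Override
            public InputStream nextElement() {
                return new ByteArrayInputStream(chunks.next());
            }
        };
        // SequenceInputStream reads each underlying stream to its end, then moves on.
        return new SequenceInputStream(streams);
    }
}
```

The caller reads the returned stream like any other InputStream; only the
chunk currently being read needs to be in memory, provided the iterator itself
fetches lazily.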




Re: Stream fed accumulo

Posted by pdread <pa...@siginttech.com>.

Ariel 

Actually, as of next week we are storing anything over 128M in HDFS. Our
system is very large and fairly complex, and I was not really intending to
go into detail; I was just wondering if there was a way the Mutation thread
to Accumulo could be made more efficient.

In the past we have reduced our Tomcat footprint by going fully
stream-based, which increased speed and the number of clients we could
handle. Most of our docs are in the 10-50K range, but we try to process many
at one time, plus I have 20TB of data to be processed that is over 100M per
doc, which starts to bog the system down. You have to understand we process
many millions of docs per week, and any kind of performance boost makes
everyone happier.

Thanks

Paul





Re: Stream fed accumulo

Posted by Ariel Valentin <ar...@arielvalentin.com>.

I agree that loading 10GB files into memory during file uploads is inefficient, but I am not sure that storing 10GB files in an Accumulo cell is the best approach.

I would encourage you to store the file directly in HDFS and, if you need to, store the metadata about that file (e.g. MIME type, file name, date created) in Accumulo.

Thanks,
Ariel

Re: Stream fed accumulo

Posted by Christopher <ct...@apache.org>.

Yes, for those versions, and for any others. Bulk load has always been this way.

--
Christopher L Tubbs II
http://gravatar.com/ctubbsii



Re: Stream fed accumulo

Posted by pdread <pa...@siginttech.com>.

Thanks Chris, 

I got some bad advice from our Accumulo admins then. Or I misunderstood what
they said. The append would be a godsend for us.

(I assume this is for versions 1.4.x through 1.5.x? We are running 1.4.3.)

Thanks again,

Paul




Re: Stream fed accumulo

Posted by Christopher <ct...@apache.org>.

Regarding the bulk load part of your comment, you should know that
bulk load *does* append to a table. It does not replace it.

--
Christopher L Tubbs II
http://gravatar.com/ctubbsii

