Posted to user@accumulo.apache.org by Bob Cook <bc...@gmail.com> on 2016/10/15 15:24:15 UTC

Fwd: Extracting ALL Data using multiple java processes

All,

I'm new to Accumulo and inherited this project, which extracts all data from
Accumulo (assembled as a "document" by RowID) into another web service.

So I started with SimpleReadClient.java to "scan" all data and build a
"document" from the RowID, ColumnFamily and Value, then send
this "document" to the service.
Example data.
RowID    ColumnFamily  Value
RowID_1  createdDate   "2015-01-01:00:00:01 UTC"
RowID_1  data          "this is a test"
RowID_1  title         "My test title"

RowID_2  createdDate   "2015-01-01:12:01:01 UTC"
RowID_2  data          "this is test 2"
RowID_2  title         "My test2 title"

...

So my table is pretty simple,  RowID, ColumnFamily and Value (no
ColumnQualifier)
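
The "document by RowID" assembly described above can be sketched in plain Java. This is only an illustration under assumptions: the Entry holder and the RowDocuments class are hypothetical stand-ins for whatever the scanner returns (they are not Accumulo types), and each row is assumed to have at most one value per ColumnFamily, as in the sample data.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RowDocuments {
    /** Hypothetical holder for one scanned entry: RowID, ColumnFamily, Value. */
    static final class Entry {
        final String row;
        final String family;
        final String value;

        Entry(String row, String family, String value) {
            this.row = row;
            this.family = family;
            this.value = value;
        }
    }

    /** Groups entries into one "document" (ColumnFamily -> Value) per RowID, keeping scan order. */
    static Map<String, Map<String, String>> group(List<Entry> entries) {
        Map<String, Map<String, String>> docs = new LinkedHashMap<>();
        for (Entry e : entries) {
            docs.computeIfAbsent(e.row, r -> new LinkedHashMap<>()).put(e.family, e.value);
        }
        return docs;
    }

    public static void main(String[] args) {
        List<Entry> scanned = new ArrayList<>();
        scanned.add(new Entry("RowID_1", "createdDate", "2015-01-01:00:00:01 UTC"));
        scanned.add(new Entry("RowID_1", "data", "this is a test"));
        scanned.add(new Entry("RowID_1", "title", "My test title"));
        scanned.add(new Entry("RowID_2", "createdDate", "2015-01-01:12:01:01 UTC"));
        // Print each assembled document; a real client would send it to the service instead.
        group(scanned).forEach((row, doc) -> System.out.println(row + " -> " + doc));
    }
}
```

This version holds a whole batch in memory. At a billion rows, a client would more likely emit each document as soon as the RowID changes, which works because a scan returns entries sorted by row.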

I need to process one billion "OLD" unique RowIDs (a year's worth of data)
on a live system that is ingesting "new data" at a rate of about 4 million
RowIDs a day.
That is, I need to process data from September 2015 through September 2016,
without worrying about new data coming in.

So I'm thinking I need to run multiple processes to extract ALL the data in
this "data range" to be more efficient.
It may also allow me to run the processes at a lower priority and during
off-hours, when traffic is lower.

My issue is how to specify the "range" to scan.

1. Is using the "createdDate" a good idea? If so, how would I specify the
range for it?

2. How about the TimestampFilter? If I set my start and end to cover one
day (about 4 million unique RowIDs),
will this get me all ColumnFamilies and Values for a given RowID? Or could I
miss something because its timestamp
was the next day? I don't really understand timestamps with respect to Accumulo.

3. Does a map-reduce job make sense? If so, how would I specify it?


Thanks,

Bob

Re: Fwd: Extracting ALL Data using multiple java processes

Posted by Josh Elser <jo...@gmail.com>.

There should be a static setRanges(Configuration, Collection<Range>) 
method somewhere in the type hierarchy of AccumuloInputFormat which lets 
you specify the Range[s].

If you can't use the TimestampFilter (that is, you can't use the Key
timestamp for this filtering), you have two options for filtering rows
based on the value of a column.

1) Perform the filter on the client side. If you have significant 
amounts of data, this will be slow. Even with MapReduce, this may 
present a significant overhead in processing.

2) Implement a custom Accumulo Iterator which can perform this filtering 
in Accumulo itself. I would recommend using the WholeRowIterator in 
conjunction with this filter you would implement.

At a high level, configure the WholeRowIterator to aggregate all of the
Keys in one row into a single Key-Value pair. Then, implement and
configure a custom Iterator (ideally, extend the abstract Filter
iterator) which deserializes that single key-value pair back into many,
extracts the createdDate column, and decides whether or not
the row should be returned to the client.

On the client, you would then unpack the serialized row into many 
key-value pairs again.
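
The decision step in option 2 (does this row's createdDate fall in the wanted range?) can be sketched in plain Java, independent of the Accumulo iterator API. This is a minimal sketch under assumptions: the CreatedDateCheck class and its methods are hypothetical, the date format is inferred from the sample values in the original message ("2015-01-01:00:00:01 UTC"), and the half-open range [start, end) is a choice made here, not something the thread specifies.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

class CreatedDateCheck {
    // Format assumed from the sample data: date and time joined by ':' with a literal " UTC" suffix.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd:HH:mm:ss 'UTC'");

    /** Parses a createdDate value such as "2015-01-01:00:00:01 UTC" as a UTC instant. */
    static Instant parse(String createdDate) {
        return LocalDateTime.parse(createdDate, FORMAT).toInstant(ZoneOffset.UTC);
    }

    /** Returns true when createdDate falls in the half-open range [start, end). */
    static boolean inRange(String createdDate, Instant start, Instant end) {
        Instant t = parse(createdDate);
        return !t.isBefore(start) && t.isBefore(end);
    }

    public static void main(String[] args) {
        Instant start = Instant.parse("2015-09-01T00:00:00Z");
        Instant end = Instant.parse("2016-09-01T00:00:00Z");
        System.out.println(inRange("2015-01-01:00:00:01 UTC", start, end));
        System.out.println(inRange("2016-03-15:08:30:00 UTC", start, end));
    }
}
```

A custom Filter would apply a check like inRange to the createdDate value after unpacking the row that the WholeRowIterator produced. The same check could also run on the client side (option 1).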


Re: Fwd: Extracting ALL Data using multiple java processes

Posted by Bob Cook <bc...@gmail.com>.

Josh,

Thanks. I was able to get TimestampFilter to work for my needs. But I
originally wanted "createdDate", since our application creates that date, which
is known to the user
and may differ from the Accumulo timestamp, depending on when the data actually
got processed into Accumulo.

So if I wanted to use the ColumnFamily "createdDate" and its value, what
Java code would I have to write?

I looked at the AccumuloInputFormat class, but I'm confused about how to specify
the "range" for the date range that I'm interested in.

So would I use the TimestampFilter class the same way I'm using it with
"scanner.addScanIterator", but with
"AccumuloInputFormat.addIterator(job, is)" instead, as below?

IteratorSetting is = new IteratorSetting(30, TimestampFilter.class);
TimestampFilter.setRange(is, startDate, endDate);
AccumuloInputFormat.addIterator(job, is);

Or could I use
is.addOption("start", startDate);
is.addOption("end", endDate);

NOTE: for me, neither "TimestampFilter.setRange" nor "TimestampFilter.setStart"
and "TimestampFilter.setEnd" seemed to work.

On Sun, Oct 16, 2016 at 2:05 PM, Josh Elser <jo...@gmail.com> wrote:

> The TimestampFilter will return only the Keys whose timestamps fall in the
> range you specify. The timestamp is an attribute on every Key, a long value
> which, when not set by the client at write time, is the number of millis
> since the epoch. You specify the numeric range of timestamps you want. This
> is a post-filter operation -- Accumulo must still read all of the data in
> the table.
>
> You need to tell *us* what the time component you're actually filtering
> on: the timestamp on each Key, or the createdDate column in each row.
>
> MapReduce is likely more efficient to do this batch processing (as
> MapReduce is a batch processing system). See the AccumuloInputFormat class.

Re: Fwd: Extracting ALL Data using multiple java processes

Posted by Josh Elser <jo...@gmail.com>.

The TimestampFilter returns only the Keys whose timestamps fall in
the range you specify. The timestamp is an attribute on every Key: a
long value which, when not set by the client at write time, is the
number of milliseconds since the epoch. You specify the numeric range of
timestamps you want. This is a post-filter operation -- Accumulo must
still read all of the data in the table.
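
Since the timestamps are epoch milliseconds, a date window has to be converted to a pair of long values before it can be used as a range. The sketch below splits a date span into one window per UTC day, for instance to hand a different day to each extraction process. It is an illustration only: the DayRanges class is hypothetical, the windows are half-open [start, end) by choice, and whether the filter's bounds are inclusive should be checked against the TimestampFilter documentation.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.ArrayList;
import java.util.List;

class DayRanges {
    /** Returns one {startMillis, endMillis} pair per UTC day in [first, lastExclusive). */
    static List<long[]> dailyRanges(LocalDate first, LocalDate lastExclusive) {
        List<long[]> ranges = new ArrayList<>();
        for (LocalDate d = first; d.isBefore(lastExclusive); d = d.plusDays(1)) {
            long start = d.atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
            long end = d.plusDays(1).atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
            ranges.add(new long[] {start, end});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // September 2015 through the end of August 2016, as in the original question.
        List<long[]> ranges = dailyRanges(LocalDate.of(2015, 9, 1), LocalDate.of(2016, 9, 1));
        System.out.println(ranges.size() + " daily windows");
        System.out.println("first window: " + ranges.get(0)[0] + " to " + ranges.get(0)[1]);
    }
}
```

Note that a per-day window selects individual Keys, not whole rows: if the entries of one RowID were written on different days, they would land in different windows.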

You need to tell *us* which time component you're actually filtering
on: the timestamp on each Key, or the createdDate column in each row.

MapReduce is likely a more efficient way to do this batch processing (as
MapReduce is a batch processing system). See the AccumuloInputFormat class.
