Posted to user@accumulo.apache.org by "Iezzi, Adam [USA]" <ie...@bah.com> on 2013/06/18 03:56:52 UTC

Storing, Indexing, and Querying data in Accumulo (geo + timeseries)

I've been asked by my client to store a dataset which contains a time series and geospatial coordinates (points) in Accumulo. At the moment, we have very dense data stored in Accumulo using the following table schema:

Row ID:      <geohash>_<reverse timestamp>
Family:      <id>
Qualifier:   attribute
Value:       <value>

We are salting our RowIDs with a geohash to prevent hotspotting. When we query the data, we use a prefix scan (center tile and eight neighbors), then use an Iterator to filter out the outliers (points and time). Unfortunately, we've noticed some performance issues with this approach: the initial prefix scan brings back a ton of data, forcing the iterators to discard a significant number of outliers. More than 50% of what we scan is being filtered out, which seems inefficient to us. Unfortunately for us, our users will always query by space and time, making them equally important for each query. Because of the time series component to our data, we're often bringing back a significant amount of data for each given point. Each point can have ten entries due to the time series, making our data set very dense.
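
For reference, here is a minimal sketch of our nine-tile prefix scan (assuming the Accumulo client API and the ch.hsr.geohash library; the table name, precision, and the SpaceTimeFilter class are placeholders rather than our real code):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    import org.apache.accumulo.core.client.BatchScanner;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.IteratorSetting;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;

    import ch.hsr.geohash.GeoHash;

    public class TileScan {
        public static void scanTileAndNeighbors(Connector conn, double lat, double lon)
                throws Exception {
            // Center tile plus its eight neighbors, each scanned as a row prefix.
            GeoHash center = GeoHash.withCharacterPrecision(lat, lon, 5);
            List<Range> ranges = new ArrayList<Range>();
            ranges.add(Range.prefix(center.toBase32()));
            for (GeoHash neighbor : center.getAdjacent()) {
                ranges.add(Range.prefix(neighbor.toBase32()));
            }
            BatchScanner scanner = conn.createBatchScanner("geo_table", Authorizations.EMPTY, 4);
            scanner.setRanges(ranges);
            // Server-side space-time filter; the class name is a placeholder.
            scanner.addScanIterator(new IteratorSetting(30, "stFilter", "com.example.SpaceTimeFilter"));
            for (Map.Entry<Key, Value> entry : scanner) {
                System.out.println(entry.getKey() + " -> " + entry.getValue());
            }
            scanner.close();
        }
    }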

The following are some options we're considering:


1. Salt a master table with an ID rather than the geohash (<id>_<reverse timestamp>), and then create a spatial index table. If we choose this option, I assume we would scan the index first, then use a batch scanner with the IDs from the first query. Unfortunately, I still see us filtering out a significant amount of data with this approach.

2. Keep the table design as is, and maybe apply a RegExFilter via a custom Iterator (see the sketch after this list).

3. Do something completely different, such as use a Column Family and the temporal aspect of the dataset together in some way.
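
For option 2, a rough sketch of what we have in mind using Accumulo's built-in RegExFilter (the row pattern is a hypothetical set of geohash prefixes):

    import org.apache.accumulo.core.client.IteratorSetting;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.iterators.user.RegExFilter;

    public class RegexScan {
        static void addRowFilter(Scanner scanner) {
            IteratorSetting setting = new IteratorSetting(30, "rowRegex", RegExFilter.class);
            // Match rows whose geohash prefix is one of the tiles of interest
            // (example prefixes; substitute the nine tiles for a real query).
            RegExFilter.setRegexs(setting, "^(dqcjq|dqcjr|dqcjw).*", null, null, null, false);
            scanner.addScanIterator(setting);
        }
    }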

Any advice or guidance would be greatly appreciated.

Thank you,

Adam

Re: [External] Re: Storing, Indexing, and Querying data in Accumulo (geo + timeseries)

Posted by Kurt Christensen <ho...@hoodel.com>.
I thought I might chime in late, too. I think we're talking about the 
same thing, with perhaps different encoding.

Yes. In the bit-interleaving scheme I mentioned, each 3 bits of the hash 
is equivalent to a level in an octree ("3D quadtree") ... and yes, there 
is a trick to picking the time scales right.

-- Kurt


On 6/25/13 9:08 AM, Jamie Stephens wrote:
> [...]

-- 

Kurt Christensen
P.O. Box 811
Westminster, MD 21158-0811

------------------------------------------------------------------------
"One of the penalties for refusing to participate in politics is that 
you end up being governed by your inferiors."
--- Plato

Re: [External] Re: Storing, Indexing, and Querying data in Accumulo (geo + timeseries)

Posted by Jamie Stephens <js...@morphism.com>.
Adam & Co.,

Sorry to chime in late here.

One of our projects has similar requirements: queries based on time-space
constraints. (Tracking a particular entity through time and space is a
different requirement.)

We've used the following scheme with decent results.

Our basic approach is to use a 3D quadtree based on lat, lon, and time.
 Longitude and time are first transformed so that a quadtree key prefix
represents a cube (approximately).  Alternatively, roll your
own quadtree algorithm to give similar results.  So some number of prefix
bytes of a quadtree key represents an approximate time-space cube of
dimensions 1km x 1km x 1day.  Pick your time unit.  Another variation: use
a 3D geohash instead of a quadtree.

Then use the first N bytes of the key as the row ID and the remaining bytes
for the column qualifier.  Rationale: Sometimes there is virtue in keeping
points in a cube on the same tablet server.  (Or you might want to, say,
use only spatial key prefixes as row IDs.  Lots of flavors to consider.)
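
As a rough sketch of that split (illustrative only; the bit widths, normalization bounds, and N are assumptions to tune per dataset):

    public class ZKey3D {
        // Map a value into [0, 2^21) given fixed bounds; three 21-bit values
        // interleave into a 63-bit key.
        static long normalize(double v, double min, double max) {
            return (long) ((v - min) / (max - min) * ((1L << 21) - 1));
        }

        // Interleave the bits of x, y, t: bit i of x, y, t lands at position
        // 3i, 3i+1, 3i+2 respectively, so a key prefix is a coarse 3D cell.
        static long interleave(long x, long y, long t) {
            long key = 0;
            for (int i = 0; i < 21; i++) {
                key |= ((x >> i) & 1L) << (3 * i)
                     | ((y >> i) & 1L) << (3 * i + 1)
                     | ((t >> i) & 1L) << (3 * i + 2);
            }
            return key;
        }

        public static void main(String[] args) {
            long x = normalize(-77.01, -180, 180);                       // lon
            long y = normalize(38.89, -90, 90);                          // lat
            long t = normalize(1371513600L, 1356998400L, 2303683200L);   // epoch seconds
            String hex = String.format("%016x", interleave(x, y, t));
            int n = 6;                    // N: row-ID prefix length (the tuning knob)
            System.out.println(hex.substring(0, n) + " / " + hex.substring(n));
        }
    }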

Disadvantages: You have to pick N and the time unit up front.  N and the
time unit are the basic index tuning parameters.  In our applications,
setting those parameters isn't too hard because we understand the data and
its uses pretty well.  However, as you've suggested, hotspots due to
concentrations can still be a problem.  We try to turn up N to adjust.

Variation: Use the military grid reference system (MGRS) grid zone
designator and square identifier as row ID and a quadtree-code numerical
location for the column qualifier.  Etc.

I'll see if I can get an example on github.

--Jamie


On Mon, Jun 24, 2013 at 9:47 AM, Jim Klucar <kl...@gmail.com> wrote:

> [...]

Re: [External] Re: Storing, Indexing, and Querying data in Accumulo (geo + timeseries)

Posted by Jim Klucar <kl...@gmail.com>.
Adam,

Usually with geo-queries, points of interest are pretty dense (as you've
stated is your case). The indexing typically used (geohash or z-order) is
efficient for points spread evenly across the earth, which isn't the
typical case (think population density). One method I've heard of (never
actually tried myself) is to store points as distances from known
locations. You can then find points close to each other by finding similar
distances to 2 or 3 known locations. The known locations can then be
created and distributed based on your expected point density, allowing even
dense areas to be spread evenly across a cluster.
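
A minimal sketch of the idea (the anchor locations and the bucket width are arbitrary placeholders):

    public class AnchorIndex {
        static final double[][] ANCHORS = { {38.9, -77.0}, {40.7, -74.0}, {34.0, -118.2} };
        static final double BUCKET_KM = 50.0;

        // Great-circle distance in kilometers (haversine formula).
        static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
            double dLat = Math.toRadians(lat2 - lat1), dLon = Math.toRadians(lon2 - lon1);
            double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                  * Math.sin(dLon / 2) * Math.sin(dLon / 2);
            return 6371.0 * 2 * Math.asin(Math.sqrt(a));
        }

        // Bucket the distance to each anchor and concatenate the buckets, so
        // nearby points share the same or adjacent row-ID prefixes.
        static String rowId(double lat, double lon) {
            StringBuilder sb = new StringBuilder();
            for (double[] anchor : ANCHORS) {
                long bucket = (long) (haversineKm(lat, lon, anchor[0], anchor[1]) / BUCKET_KM);
                sb.append(String.format("%04d", bucket)).append('_'); // fixed width sorts lexically
            }
            return sb.toString();
        }

        public static void main(String[] args) {
            // Two nearby points land in the same buckets for every anchor.
            System.out.println(rowId(38.95, -77.05));
            System.out.println(rowId(38.96, -77.04));
        }
    }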

There's plenty of math, table design, and query design work to get it all
working, but I think it's feasible.

Jim


On Wed, Jun 19, 2013 at 1:07 AM, Kurt Christensen <ho...@hoodel.com> wrote:

> [...]

Re: [External] Re: Storing, Indexing, and Querying data in Accumulo (geo + timeseries)

Posted by Kurt Christensen <ho...@hoodel.com>.
To clarify: By 'geologic', I was referring to time-scale (like 100s of 
millions of years, with more detail near present, suggesting a log scale).

Your use of id is surprising. Maybe I don't understand what you're 
trying to do.
From what I was thinking, since you made reference to time series, no 
efficiency is gained through this id. If, instead, the id were for a 
whole time series rather than for each individual point, then for each 
timestamp you would have X(id, timestamp), Y(id, timestamp), and 
whatever else (id, timestamp) already organized as time series ... all 
with the same row id:

bithash+id, INDEX, id, ... - (query to get a list of IDs intersecting your space-time region)
id, POSITION, XY, vis, TIMESTAMP, (x,y) - (use iterators to filter these points)
id, MEAS, name, vis, TIMESTAMP, named_measurement

Alternately, if you wanted rich points, and not individual values:

bithash+id, INDEX, id, ... - (query to get a list of IDs intersecting your space-time region)
id, SAMPLE, (x,y), vis, TIMESTAMP, sampleObject(JSON?) - (all in one column)

If this is way off base from what you are trying to do, please ignore.

Kurt

-----

On 6/18/13 10:14 PM, Iezzi, Adam [USA] wrote:
> [...]

-- 

Kurt Christensen
P.O. Box 811
Westminster, MD 21158-0811

------------------------------------------------------------------------
"One of the penalties for refusing to participate in politics is that 
you end up being governed by your inferiors."
--- Plato

RE: [External] Re: Storing, Indexing, and Querying data in Accumulo (geo + timeseries)

Posted by "Iezzi, Adam [USA]" <ie...@bah.com>.
All,

Thank you for all of the replies. To answer some of the questions:

Q: You say you have point data. Are time series geographically fixed, with only the time dimension changing? ... or are the time series moving in space-time?
A: The time series will be moving in space-time; therefore, the dataset is geologic. 

Q: If you have time series (identified by <id>) moving in space-time, then I would add an indirection.
A: Our dataset is very similar to what you describe. Each geospatial point and time stamp is defined by an id. Since I'm new to the Accumulo world, I'm not very familiar with this pattern/approach in table design. But I will look around now that I have some guidance.

Overall, I think I need to create a space-time hash of my dataset, but the biggest question I have is, "what time span do I use?". At the moment, I only have a year's worth of data; therefore, my MIN_DATE = Jan 01 and MAX_DATE = Dec 31. But we obviously expect this data to continue to grow; therefore, we would want to account for additional data in the future.

Thanks again for all of the guidance. I will digest some of the comments and will report back.

Adam

-----Original Message-----
From: Kurt Christensen [mailto:hoodel@hoodel.com] 
Sent: Tuesday, June 18, 2013 8:54 PM
To: user@accumulo.apache.org
Subject: [External] Re: Storing, Indexing, and Querying data in Accumulo (geo + timeseries)


[...]

Re: Storing, Indexing, and Querying data in Accumulo (geo + timeseries)

Posted by Kurt Christensen <ho...@hoodel.com>.
An effective optimization strategy will be largely influenced by the 
nature of your data.

You say you have point data. Are time series geographically fixed, with 
only the time dimension changing? ... or are the time series moving in 
space-time?

I was going to suggest a 3-D approach, bit-interleaving your space and 
time [modulo timespan] together (or point-tree, or octree, or k-d 
trie, or r-d trie). The trick there is to pick a time span large enough 
so that any interval you query is small relative to the time span, but 
small enough so that you don't waste a bunch (up to an eighth) of your 
usable hash values with no useful time data (i.e. populate your most 
significant bits). This would work if your data were geographically 
fixed, but changing only in time. If your time span is geologic, you 
might want to use a logarithmic time scale.
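
To put rough numbers on that tradeoff, here is a small sketch (the bounds and the 21-bit width are illustrative assumptions, not a recommendation):

    public class TimeSpan {
        // Jan 01 2013 and Jan 01 2043 as epoch seconds: one year of data today,
        // thirty years of headroom for growth. Too much headroom leaves the
        // most significant time bits always zero, wasting hash range.
        static final long MIN_EPOCH = 1356998400L;  // 2013-01-01T00:00:00Z
        static final long MAX_EPOCH = 2303683200L;  // 2043-01-01T00:00:00Z

        // Scale into [0, 2^21) so the value can be bit-interleaved with lat/lon.
        static long timeComponent(long epochSeconds) {
            double frac = (double) (epochSeconds - MIN_EPOCH) / (MAX_EPOCH - MIN_EPOCH);
            return (long) (frac * ((1L << 21) - 1));
        }

        public static void main(String[] args) {
            // Mid-2013 sits in the lowest ~1.5% of this range, so almost all
            // of the high time bits go unused: the span is too generous.
            System.out.println(timeComponent(1371513600L));  // 2013-06-18T00:00:00Z
        }
    }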

If you have time series (identified by <id>) moving in space-time, then 
I would add an indirection. Use the space-time hash to determine the IDs 
intersecting your zone and then query again, using the IDs to pull out 
the time series, filtering with your iterator, perhaps using the native 
timestamp field.
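
A minimal sketch of that two-phase query against Accumulo's client API (the table names and column layout are illustrative only):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    import org.apache.accumulo.core.client.BatchScanner;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;

    public class IndirectQuery {
        public static void query(Connector conn, String hashLo, String hashHi)
                throws Exception {
            // Phase 1: rows of the index table are bithash+id; the qualifier
            // carries the bare id.
            Scanner index = conn.createScanner("st_index", Authorizations.EMPTY);
            index.setRange(new Range(hashLo, hashHi));
            List<Range> idRanges = new ArrayList<Range>();
            for (Map.Entry<Key, Value> e : index) {
                idRanges.add(Range.exact(e.getKey().getColumnQualifier().toString()));
            }

            // Phase 2: pull whole time series by id, filtering precisely with
            // a scan-time iterator or on the client.
            BatchScanner series = conn.createBatchScanner("series", Authorizations.EMPTY, 8);
            series.setRanges(idRanges);
            for (Map.Entry<Key, Value> e : series) {
                System.out.println(e.getKey() + " -> " + e.getValue());
            }
            series.close();
        }
    }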

I hope that helps. Good luck.

Kurt

BTW: 50% filtering isn't really that inefficient. - kkc


On 6/18/13 12:36 AM, Jared Winick wrote:
> [...]

-- 

Kurt Christensen
P.O. Box 811
Westminster, MD 21158-0811

------------------------------------------------------------------------
"One of the penalties for refusing to participate in politics is that 
you end up being governed by your inferiors."
--- Plato

Re: Storing, Indexing, and Querying data in Accumulo (geo + timeseries)

Posted by Jared Winick <ja...@gmail.com>.
Have you considered a "geohash" of all 3 dimensions together, using that
as the RowID? I have never implemented a geohash exactly, but I do know it
is possible to build a z-order curve on more than 2 dimensions, which may
be what you want considering that it sounds like all your queries are in
3 dimensions.


On Mon, Jun 17, 2013 at 7:56 PM, Iezzi, Adam [USA] <ie...@bah.com> wrote:

> [...]

Re: Storing, Indexing, and Querying data in Accumulo (geo + timeseries)

Posted by Eric Newton <er...@gmail.com>.
It may be that the 50% you are filtering out would need to be
seeked/scanned anyhow, because those entries belong in blocks close to the
data you want.

Have you experimented with smaller tiles?

Do you know how your geohash is being mapped to nodes?  Does it spread well
over your cluster?  You may want to look at a custom balancer.

Can you give us an idea of the scale you are working at (10s, 100s,
1000s of nodes)?

Can you describe in more detail what CF <id> is used for now?

-Eric


On Mon, Jun 17, 2013 at 9:56 PM, Iezzi, Adam [USA] <ie...@bah.com> wrote:

> [...]