Posted to user@accumulo.apache.org by Arshak Navruzyan <ar...@gmail.com> on 2013/12/26 21:10:59 UTC

schema examples

Hello,

I am trying to get my head around Accumulo schema designs.  I went through
a lot of trouble to get the wikisearch example running, but since the data
is in protobuf lists, it's not that illustrative (for a newbie).

I'd love to find another example that is a little simpler to understand.
In particular, I am interested in Java/Scala code that mimics the D4M
schema design (I'm not a Matlab guy).

Thanks,

Arshak

Re: schema examples

Posted by Josh Elser <jo...@gmail.com>.
Arshak,

Yes and no. Accumulo Combiners help a bit here.

For servicing inserts and deletes (treating an update as the combination
of the two), both models work, although a serialized list is a little
trickier to manage (as most optimizations end up being).

You will most likely want a Combiner set on your inverted index to
aggregate multiple inserts into a single Key-Value. This happens
naturally at scan time (by virtue of the combiner) and then gets
persisted to disk in a merged form during a major compaction. The same
logic can be applied to deletions, and keeping a sorted list of IDs in
your serialized structure makes this algorithm pretty easy. One caveat:
Accumulo won't always compact *every* file in a tablet, so deletion
markers may need to be carried in the serialized structure itself to
ensure the deletion actually takes effect (we can go more into that
later, as I assume that isn't clear).

Speaking loosely about D4M, as I haven't seen how its code uses
Accumulo: both models should ensure referential integrity, and as such
both should be capable of servicing the same use cases. While keeping a
serialized list is a bit more work in your code, that approach should
yield performance gains.
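
For illustration, a minimal sketch of such a combiner. The class, table
name, and priority are hypothetical; a comma-separated string stands in
for a protobuf-serialized list, and "-"-prefixed ids act as the
persisted deletion markers described above:

    import static java.nio.charset.StandardCharsets.UTF_8;

    import java.util.Iterator;
    import java.util.TreeSet;

    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.iterators.Combiner;

    public class DocIdListCombiner extends Combiner {
      @Override
      public Value reduce(Key key, Iterator<Value> iter) {
        TreeSet<String> present = new TreeSet<>();
        TreeSet<String> removed = new TreeSet<>();
        while (iter.hasNext()) { // values arrive newest-first for a given key
          for (String tok : new String(iter.next().get(), UTF_8).split(",")) {
            if (tok.isEmpty()) {
              continue;
            } else if (tok.startsWith("-")) { // deletion marker
              String id = tok.substring(1);
              if (!present.contains(id)) {
                removed.add(id); // suppress this id in older values
              }
            } else if (!removed.contains(tok)) {
              present.add(tok);
            }
          }
        }
        // Re-emit the markers: a partial major compaction can leave older
        // files that still contain the id, so the deletion itself must
        // survive in the merged value.
        TreeSet<String> out = new TreeSet<>(present);
        for (String id : removed) {
          out.add("-" + id);
        }
        return new Value(String.join(",", out).getBytes(UTF_8));
      }
    }

It would be attached to the index table like any combiner, e.g.:

    IteratorSetting is = new IteratorSetting(10, "docIds", DocIdListCombiner.class);
    Combiner.setCombineAllColumns(is, true);
    connector.tableOperations().attachIterator("invertedIndex", is);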

On 12/29/2013 5:45 PM, Arshak Navruzyan wrote:
> Josh, I am still a little stuck on how this would work in a
> transactional app (aka a mixed workload of reads and writes).
>
> I definitely see the power of using a serialized structure to
> minimize the number of records, but what will happen when rows get
> deleted out of the main table (or mutated)? In the bloated model I
> could see some referential-integrity code zapping the index entries
> as well. In the serialized-structure design it seems pretty complex
> to go and update every array that referenced that row.
>
> Is it fair to say that the D4M approach is a little better suited for
> transactional apps and the wikisearch approach is better for
> read-optimized index apps?

Re: schema examples

Posted by Jeremy Kepner <ke...@ll.mit.edu>.
I would be reluctant to make generalizations.

On Sun, Dec 29, 2013 at 05:45:28PM -0500, Arshak Navruzyan wrote:
>    Josh, I am still a little stuck on how this would work in a
>    transactional app (aka a mixed workload of reads and writes).
>    I definitely see the power of using a serialized structure to
>    minimize the number of records, but what will happen when rows get deleted
>    out of the main table (or mutated)? In the bloated model I could see
>    some referential-integrity code zapping the index entries as well. In the
>    serialized-structure design it seems pretty complex to go and update every
>    array that referenced that row.
>    Is it fair to say that the D4M approach is a little better suited for
>    transactional apps and the wikisearch approach is better for
>    read-optimized index apps?

Re: schema examples

Posted by Arshak Navruzyan <ar...@gmail.com>.
Josh, I am still a little stuck on how this would work in a
transactional app (aka a mixed workload of reads and writes).

I definitely see the power of using a serialized structure to minimize
the number of records, but what will happen when rows get deleted out of
the main table (or mutated)? In the bloated model I could see some
referential-integrity code zapping the index entries as well. In the
serialized-structure design it seems pretty complex to go and update
every array that referenced that row.

Is it fair to say that the D4M approach is a little better suited for
transactional apps and the wikisearch approach is better for read-optimized
index apps?


On Sun, Dec 29, 2013 at 12:27 PM, Josh Elser <jo...@gmail.com> wrote:

> Some context here in regards to the wikisearch:
>
> The point of the protocol buffers here (or any serialized structure in the
> Value) is to reduce the ingest pressure and increase query performance on
> the inverted index (or transpose table, if I follow the d4m phrasing).
>
> This works well because most languages (especially English) follow a
> Zipfian distribution: some terms appear very frequently while some occur
> very infrequently. For common terms, we don't want to bloat our index, nor
> spend time creating those index records (e.g. "the"). For uncommon terms,
> we still want direct access to these infrequent words (e.g. "
> supercalifragilisticexpialidocious").
>
> The ingest effect is also rather interesting when dealing with Accumulo, as
> you're not just writing more data, but typically writing data to most (if
> not all) tservers. Even the tokenization of a single document is likely to
> create inserts to a majority of the tablets for your inverted index. When
> dealing with high ingest rates (live *or* bulk -- you still have to send
> data to these servers), minimizing the number of records becomes
> important, as it may be a bottleneck in your pipeline.
>
> The query implications are pretty straightforward: common terms don't
> bloat the index in size nor affect uncommon term lookups and those uncommon
> term lookups remain specific to documents rather than a range (shard) of
> documents.

Re: schema examples

Posted by Josh Elser <jo...@gmail.com>.
Some context here in regards to the wikisearch:

The point of the protocol buffers here (or any serialized structure in 
the Value) is to reduce the ingest pressure and increase query 
performance on the inverted index (or transpose table, if I follow the 
d4m phrasing).

This works well because most languages (especially English) follow a 
Zipfian distribution: some terms appear very frequently while some occur 
very infrequently. For common terms, we don't want to bloat our index, 
nor spend time creating those index records (e.g. "the"). For uncommon 
terms, we still want direct access to these infrequent words (e.g.
"supercalifragilisticexpialidocious").

The ingest effect is also rather interesting when dealing with Accumulo,
as you're not just writing more data, but typically writing data to most
(if not all) tservers. Even the tokenization of a single document is
likely to create inserts to a majority of the tablets for your inverted
index. When dealing with high ingest rates (live *or* bulk -- you still
have to send data to these servers), minimizing the number of records
becomes important, as it may be a bottleneck in your pipeline.

The query implications are pretty straightforward: common terms don't
bloat the index in size nor affect uncommon term lookups, and those
uncommon term lookups remain specific to documents rather than a range
(shard) of documents.
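
To make that trade-off concrete, here is a rough sketch of the two
layouts side by side. This is not the wikisearch code itself -- the
cutoff value, names, and comma-separated encoding are illustrative
assumptions:

    import static java.nio.charset.StandardCharsets.UTF_8;

    import java.util.List;

    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;

    public class TermIndexer {
      // Assumed threshold; a real cutoff would be tuned empirically.
      static final int CARDINALITY_CUTOFF = 20;

      static Mutation index(String shardId, String term, List<String> docIds) {
        Mutation m = new Mutation(shardId);
        if (docIds.size() <= CARDINALITY_CUTOFF) {
          // Uncommon term: track the document set directly in one Value.
          m.put("index", term, new Value(String.join(",", docIds).getBytes(UTF_8)));
        } else {
          // Common term: don't enumerate documents; a single marker entry
          // says "this term appears in the shard", keeping the index small
          // and the write count per document bounded.
          m.put("index", term, new Value("*".getBytes(UTF_8)));
        }
        return m;
      }
    }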

On 12/29/2013 11:57 AM, Arshak Navruzyan wrote:
> Sorry I mixed things up.  It was in the wikisearch example:
>
> http://accumulo.apache.org/example/wikisearch.html
>
> "If the cardinality is small enough, it will track the set of documents
> by term directly."

Re: schema examples

Posted by Arshak Navruzyan <ar...@gmail.com>.
Got it, thanks again Jeremy!


On Sun, Dec 29, 2013 at 9:12 AM, Kepner, Jeremy - 0553 - MITLL <
kepner@ll.mit.edu> wrote:

> FYI, we just insert all the triples into both Tedge and TedgeTranspose
> using separate batch writers and let Accumulo figure out which ones belong
> in the same row. This has worked well for us.

Re: schema examples

Posted by "Kepner, Jeremy - 0553 - MITLL" <ke...@ll.mit.edu>.
FYI, we just insert all the triples into both Tedge and TedgeTranspose using separate batch writers and let Accumulo figure out which ones belong in the same row. This has worked well for us.
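
A minimal sketch of that dual write (assuming a 1.5-era Accumulo client
API; table names as in the example quoted below, error handling
omitted). Note that the example row keys appear to be digit-reversed
timestamps plus the machine name -- 1388191975000 reversed is
0005791918831 -- which spreads sequential readings across tablets:

    import static java.nio.charset.StandardCharsets.UTF_8;

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;

    public class TripleWriter {
      static void write(Connector conn, String rowId, String colKey) throws Exception {
        Value one = new Value("1".getBytes(UTF_8));
        BatchWriter edge = conn.createBatchWriter("Tedge", new BatchWriterConfig());
        BatchWriter transpose = conn.createBatchWriter("TedgeTranspose", new BatchWriterConfig());

        Mutation m = new Mutation(rowId);   // e.g. "0005791918831-neptune"
        m.put("", colKey, one);             // e.g. colKey = "Machine|neptune"
        edge.addMutation(m);

        Mutation t = new Mutation(colKey);  // the same triple, flipped
        t.put("", rowId, one);
        transpose.addMutation(t);

        edge.close();      // close() flushes; Accumulo itself groups the
        transpose.close(); // entries that share a row key
      }
    }

TedgeDegree would be kept the same way with "+1" entries, letting a
SummingCombiner on the table do the accumulating (pre-summed in the
client at high rates, per the note quoted below), e.g.:

    IteratorSetting deg = new IteratorSetting(10, "degSum", SummingCombiner.class);
    SummingCombiner.setEncodingType(deg, LongCombiner.Type.STRING);
    SummingCombiner.setCombineAllColumns(deg, true);
    conn.tableOperations().attachIterator("TedgeDegree", deg);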

On Dec 29, 2013, at 11:57 AM, Arshak Navruzyan <ar...@gmail.com> wrote:

> Sorry I mixed things up.  It was in the wikisearch example:
> 
> http://accumulo.apache.org/example/wikisearch.html
> 
> "If the cardinality is small enough, it will track the set of documents by term directly."
> 
> 
> On Sun, Dec 29, 2013 at 8:42 AM, Kepner, Jeremy - 0553 - MITLL <ke...@ll.mit.edu> wrote:
> Hi Arshak,
>   See interspersed below.
> Regards.  -Jeremy
> 
> On Dec 29, 2013, at 11:34 AM, Arshak Navruzyan <ar...@gmail.com> wrote:
> 
>> Jeremy,
>> 
>> Thanks for the detailed explanation.  Just a couple of final questions:
>> 
>> 1.  What's your advice on the transpose table as far as whether to repeat the indexed term (one per matching row id) or try to store all matching row ids from tedge in a single row in tedgetranspose (using protobuf for example)?  What's the performance implication of each approach?  In the paper you mentioned that if it's a few values they should just be stored together.  Was there a cut-off point in your testing?
> 
> Can you clarify?  I am not sure what you're asking.
> 
>> 
>> 2.  You mentioned that the degrees should be calculated beforehand for high ingest rates.  Doesn't this change Accumulo from being a true database to being more of an index?  If changes to the data cause the degree table to get out of sync, it sounds like changes have to be applied elsewhere first and Accumulo has to be reloaded periodically.  Or perhaps letting the degree table get out of sync is ok since it's just an assist...
> 
> My point was a very narrow comment on optimization in very high performance situations. I probably shouldn't have mentioned it.  If you ever have performance issues with your degree tables, that would be the time to discuss. You may never encounter this issue.
> 
>> Thanks,
>> 
>> Arshak
>> 
>> 
>> On Sat, Dec 28, 2013 at 10:36 AM, Kepner, Jeremy - 0553 - MITLL <ke...@ll.mit.edu> wrote:
>> Hi Arshak,
>>   Here is how you might do it.  We implement everything with batch writers and batch scanners.  Note: if you are doing high ingest rates, the degree table can be tricky and usually requires pre-summing prior to ingestion to reduce the pressure on the accumulator inside of Accumulo.  Feel free to ask further questions, as I would imagine there are details that still wouldn't be clear.  In particular, why we do it this way.
>> 
>> Regards.  -Jeremy
>> 
>> Original data:
>> 
>> Machine,Pool,Load,ReadingTimestamp
>> neptune,west,5,1388191975000
>> neptune,west,9,1388191975010
>> pluto,east,13,1388191975090
>> 
>> 
>> Tedge table:
>> rowKey,columnQualifier,value
>> 
>> 0005791918831-neptune,Machine|neptune,1
>> 0005791918831-neptune,Pool|west,1
>> 0005791918831-neptune,Load|5,1
>> 0005791918831-neptune,ReadingTimestamp|1388191975000,1
>> 0105791918831-neptune,Machine|neptune,1
>> 0105791918831-neptune,Pool|west,1
>> 0105791918831-neptune,Load|9,1
>> 0105791918831-neptune,ReadingTimestamp|1388191975010,1
>> 0905791918831-pluto,Machine|pluto,1
>> 0905791918831-pluto,Pool|east,1
>> 0905791918831-pluto,Load|13,1
>> 0905791918831-pluto,ReadingTimestamp|1388191975090,1
>> 
>> 
>> TedgeTranspose table:
>> rowKey,columnQualifier,value
>> 
>> Machine|neptune,0005791918831-neptune,1
>> Pool|west,0005791918831-neptune,1
>> Load|5,0005791918831-neptune,1
>> ReadingTimestamp|1388191975000,0005791918831-neptune,1
>> Machine|neptune,0105791918831-neptune,1
>> Pool|west,0105791918831-neptune,1
>> Load|9,0105791918831-neptune,1
>> ReadingTimestamp|1388191975010,0105791918831-neptune,1
>> Machine|pluto,0905791918831-pluto,1
>> Pool|east,0905791918831-pluto,1
>> Load|13,0905791918831-pluto,1
>> ReadingTimestamp|1388191975090,0905791918831-pluto,1
>> 
>> 
>> TedgeDegree table:
>> rowKey,columnQualifier,value
>> 
>> Machine|neptune,Degree,2
>> Pool|west,Degree,2
>> Load|5,Degree,1
>> ReadingTimestamp|1388191975000,Degree,1
>> Load|9,Degree,1
>> ReadingTimestamp|1388191975010,Degree,1
>> Machine|pluto,Degree,1
>> Pool|east,Degree,1
>> Load|13,Degree,1
>> ReadingTimestamp|1388191975090,Degree,1
>> 
>> 
>> TedgeText table:
>> rowKey,columnQualifier,value
>> 
>> 0005791918831-neptune,Text,< ... raw text of original log ...>
>> 0105791918831-neptune,Text,< ... raw text of original log ...>
>> 0905791918831-pluto,Text,< ... raw text of original log ...>
>> 
>> On Dec 27, 2013, at 8:01 PM, Arshak Navruzyan <ar...@gmail.com> wrote:
>> 
>> > Jeremy,
>> >
>> > Wow, didn't expect to get help from the author :)
>> >
>> > How about something simple like this:
>> >
>> > Machine    Pool      Load        ReadingTimestamp
>> > neptune     west      5            1388191975000
>> > neptune     west      9            1388191975010
>> > pluto         east       13           1388191975090
>> >
>> > These are the areas I am unclear on:
>> >
>> > 1.  Should the transpose table be built as part of ingest code or as an accumulo combiner?
>> > 2.  What does the degree table do in this example ?  The paper mentions it's useful for query optimization.  How?
>> > 3.  Does D4M accommodate "repurposing" the row_id to a partition key?  The wikisearch shows how the partition id is important for parallel scans of the index.  But since Accumulo is a row store how can you do fast lookups by row if you've used the row_id as a partition key.
>> >
>> > Thank you,
>> >
>> > Arshak
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Thu, Dec 26, 2013 at 5:31 PM, Jeremy Kepner <ke...@ll.mit.edu> wrote:
>> > Hi Arshak,
>> >   Maybe you can send a few (~3) records of data that you are familiar with
>> > and we can walk you through how the D4M schema would be applied to those records.
>> >
>> > Regards.  -Jeremy
>> >
>> > On Thu, Dec 26, 2013 at 03:10:59PM -0500, Arshak Navruzyan wrote:
>> > >    Hello,
>> > >    I am trying to get my head around Accumulo schema designs.  I went through
>> > >    a lot of trouble to get the wikisearch example running but since the data
>> > >    is in protobuf lists, it's not that illustrative (for a newbie).
>> > >    Would love to find another example that is a little simpler to understand.
>> > >     In particular I am interested in java/scala code that mimics the D4M
>> > >    schema design (not a Matlab guy).
>> > >    Thanks,
>> > >    Arshak
>> >
>> 
>> 
> 
> 


Re: schema examples

Posted by Arshak Navruzyan <ar...@gmail.com>.
Sorry, I mixed things up.  It was in the wikisearch example:

http://accumulo.apache.org/example/wikisearch.html

"If the cardinality is small enough, it will track the set of documents by
term directly."


On Sun, Dec 29, 2013 at 8:42 AM, Kepner, Jeremy - 0553 - MITLL <
kepner@ll.mit.edu> wrote:

> Hi Arshak,
>   See interspersed below.
> Regards.  -Jeremy
>
> On Dec 29, 2013, at 11:34 AM, Arshak Navruzyan <ar...@gmail.com> wrote:
>
> Jeremy,
>
> Thanks for the detailed explanation.  Just a couple of final questions:
>
> 1.  What's your advice on the transpose table: repeat the indexed term (one
> entry per matching row id), or store all matching row ids from tedge in a
> single row in tedgetranspose (using protobuf, for example)?  What's the
> performance implication of each approach?  In the paper you mentioned that
> if there are only a few values they should just be stored together.  Was
> there a cut-off point in your testing?
>
>
> Can you clarify?  I am not sure what you're asking.
>
>
> 2.  You mentioned that the degrees should be calculated beforehand for
> high ingest rates.  Doesn't this change Accumulo from being a true database
> to being more of an index?  If changes to the data cause the degree table
> to get out of sync, sounds like changes have to be applied elsewhere first
> and Accumulo has to be reloaded periodically.  Or perhaps letting the
> degree table get out of sync is ok since it's just an assist...
>
>
> My point was a very narrow comment on optimization in very high
> performance situations. I probably shouldn't have mentioned it.  If you
> ever have performance issues with your degree tables, that would be the
> time to discuss it.  You may never encounter this issue.
>
> Thanks,
>
> Arshak
>
>
> On Sat, Dec 28, 2013 at 10:36 AM, Kepner, Jeremy - 0553 - MITLL <
> kepner@ll.mit.edu> wrote:
>
>> Hi Arshak,
>>   Here is how you might do it.  We implement everything with batch
>> writers and batch scanners.  Note: if you are doing high ingest rates, the
>> degree table can be tricky and usually requires pre-summing prior to
>> ingestion to reduce the pressure on the accumulator inside of Accumulo.
>>  Feel free to ask further questions as I would imagine there are details
>> that still wouldn't be clear.  In particular, why we do it this way.
>>
>> Regards.  -Jeremy
>>
>> Original data:
>>
>> Machine,Pool,Load,ReadingTimestamp
>> neptune,west,5,1388191975000
>> neptune,west,9,1388191975010
>> pluto,east,13,1388191975090
>>
>>
>> Tedge table:
>> rowKey,columnQualifier,value
>>
>> 0005791918831-neptune,Machine|neptune,1
>> 0005791918831-neptune,Pool|west,1
>> 0005791918831-neptune,Load|5,1
>> 0005791918831-neptune,ReadingTimestamp|1388191975000,1
>> 0105791918831-neptune,Machine|neptune,1
>> 0105791918831-neptune,Pool|west,1
>> 0105791918831-neptune,Load|9,1
>> 0105791918831-neptune,ReadingTimestamp|1388191975010,1
>> 0905791918831-pluto,Machine|pluto,1
>> 0905791918831-pluto,Pool|east,1
>> 0905791918831-pluto,Load|13,1
>> 0905791918831-pluto,ReadingTimestamp|1388191975090,1
>>
>>
>> TedgeTranspose table:
>> rowKey,columnQualifier,value
>>
>> Machine|neptune,0005791918831-neptune,1
>> Pool|west,0005791918831-neptune,1
>> Load|5,0005791918831-neptune,1
>> ReadingTimestamp|1388191975000,0005791918831-neptune,1
>> Machine|neptune,0105791918831-neptune,1
>> Pool|west,0105791918831-neptune,1
>> Load|9,0105791918831-neptune,1
>> ReadingTimestamp|1388191975010,0105791918831-neptune,1
>> Machine|pluto,0905791918831-pluto,1
>> Pool|east,0905791918831-pluto,1
>> Load|13,0905791918831-pluto,1
>> ReadingTimestamp|1388191975090,0905791918831-pluto,1
>>
>>
>> TedgeDegree table:
>> rowKey,columnQualifier,value
>>
>> Machine|neptune,Degree,2
>> Pool|west,Degree,2
>> Load|5,Degree,1
>> ReadingTimestamp|1388191975000,Degree,1
>> Load|9,Degree,1
>> ReadingTimestamp|1388191975010,Degree,1
>> Machine|pluto,Degree,1
>> Pool|east,Degree,1
>> Load|13,Degree,1
>> ReadingTimestamp|1388191975090,Degree,1
>>
>>
>> TedgeText table:
>> rowKey,columnQualifier,value
>>
>> 0005791918831-neptune,Text,< ... raw text of original log ...>
>> 0105791918831-neptune,Text,< ... raw text of original log ...>
>> 0905791918831-pluto,Text,< ... raw text of original log ...>
>>
>> On Dec 27, 2013, at 8:01 PM, Arshak Navruzyan <ar...@gmail.com> wrote:
>>
>> > Jeremy,
>> >
>> > Wow, didn't expect to get help from the author :)
>> >
>> > How about something simple like this:
>> >
>> > Machine    Pool      Load        ReadingTimestamp
>> > neptune     west      5            1388191975000
>> > neptune     west      9            1388191975010
>> > pluto         east       13           1388191975090
>> >
>> > These are the areas I am unclear on:
>> >
>> > 1.  Should the transpose table be built as part of ingest code or as an
>> accumulo combiner?
>> > 2.  What does the degree table do in this example ?  The paper mentions
>> it's useful for query optimization.  How?
>> > 3.  Does D4M accommodate "repurposing" the row_id to a partition key?
>>  The wikisearch shows how the partition id is important for parallel scans
>> of the index.  But since Accumulo is a row store how can you do fast
>> lookups by row if you've used the row_id as a partition key.
>> >
>> > Thank you,
>> >
>> > Arshak
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Thu, Dec 26, 2013 at 5:31 PM, Jeremy Kepner <ke...@ll.mit.edu>
>> wrote:
>> > Hi Arshak,
>> >   Maybe you can send a few (~3) records of data that you are familiar
>> with
>> > and we can walk you through how the D4M schema would be applied to
>> those records.
>> >
>> > Regards.  -Jeremy
>> >
>> > On Thu, Dec 26, 2013 at 03:10:59PM -0500, Arshak Navruzyan wrote:
>> > >    Hello,
>> > >    I am trying to get my head around Accumulo schema designs.  I went
>> through
>> > >    a lot of trouble to get the wikisearch example running but since
>> the data
>> > >    is in protobuf lists, it's not that illustrative (for a newbie).
>> > >    Would love to find another example that is a little simpler to
>> understand.
>> > >     In particular I am interested in java/scala code that mimics the
>> D4M
>> > >    schema design (not a Matlab guy).
>> > >    Thanks,
>> > >    Arshak
>> >
>>
>>
>
>

Re: schema examples

Posted by "Kepner, Jeremy - 0553 - MITLL" <ke...@ll.mit.edu>.
Hi Arshak,
  See interspersed below.
Regards.  -Jeremy

On Dec 29, 2013, at 11:34 AM, Arshak Navruzyan <ar...@gmail.com> wrote:

> Jeremy,
> 
> Thanks for the detailed explanation.  Just a couple of final questions:
> 
> 1.  What's your advice on the transpose table: repeat the indexed term (one entry per matching row id), or store all matching row ids from tedge in a single row in tedgetranspose (using protobuf, for example)?  What's the performance implication of each approach?  In the paper you mentioned that if there are only a few values they should just be stored together.  Was there a cut-off point in your testing?

Can you clarify?  I am not sure what you're asking.

> 
> 2.  You mentioned that the degrees should be calculated beforehand for high ingest rates.  Doesn't this change Accumulo from being a true database to being more of an index?  If changes to the data cause the degree table to get out of sync, sounds like changes have to be applied elsewhere first and Accumulo has to be reloaded periodically.  Or perhaps letting the degree table get out of sync is ok since it's just an assist...

My point was a very narrow comment on optimization in very high performance situations. I probably shouldn't have mentioned it.  If you ever have performance issues with your degree tables, that would be the time to discuss it.  You may never encounter this issue.
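
For concreteness, client-side pre-summing might look something like this rough, untested Java sketch (writer setup omitted; the class and method names are just placeholders).  The idea is to fold many +1 updates into a single mutation per term per flush, so the tablet servers see far fewer cells to combine:

import java.util.HashMap;
import java.util.Map;
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.MutationsRejectedException;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

public class DegreePreSummer {
  private final Map<String, Long> pending = new HashMap<String, Long>();

  // Count a term locally instead of sending a +1 mutation per record.
  public void recordTerm(String term) {
    Long c = pending.get(term);
    pending.put(term, c == null ? 1L : c + 1);
  }

  // Emit one pre-summed mutation per term; the SummingCombiner on the
  // degree table folds these partial sums into the stored totals.
  public void flush(BatchWriter degreeWriter) throws MutationsRejectedException {
    for (Map.Entry<String, Long> e : pending.entrySet()) {
      Mutation m = new Mutation(new Text(e.getKey()));
      m.put(new Text(""), new Text("Degree"),
          new Value(Long.toString(e.getValue()).getBytes()));
      degreeWriter.addMutation(m);
    }
    pending.clear();
  }
}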

> Thanks,
> 
> Arshak
> 
> 
> On Sat, Dec 28, 2013 at 10:36 AM, Kepner, Jeremy - 0553 - MITLL <ke...@ll.mit.edu> wrote:
> Hi Arshak,
>   Here is how you might do it.  We implement everything with batch writers and batch scanners.  Note: if you are doing high ingest rates, the degree table can be tricky and usually requires pre-summing prior to ingestion to reduce the pressure on the accumulator inside of Accumulo.  Feel free to ask further questions as I would imagine there are details that still wouldn't be clear.  In particular, why we do it this way.
> 
> Regards.  -Jeremy
> 
> Original data:
> 
> Machine,Pool,Load,ReadingTimestamp
> neptune,west,5,1388191975000
> neptune,west,9,1388191975010
> pluto,east,13,1388191975090
> 
> 
> Tedge table:
> rowKey,columnQualifier,value
> 
> 0005791918831-neptune,Machine|neptune,1
> 0005791918831-neptune,Pool|west,1
> 0005791918831-neptune,Load|5,1
> 0005791918831-neptune,ReadingTimestamp|1388191975000,1
> 0105791918831-neptune,Machine|neptune,1
> 0105791918831-neptune,Pool|west,1
> 0105791918831-neptune,Load|9,1
> 0105791918831-neptune,ReadingTimestamp|1388191975010,1
> 0905791918831-pluto,Machine|pluto,1
> 0905791918831-pluto,Pool|east,1
> 0905791918831-pluto,Load|13,1
> 0905791918831-pluto,ReadingTimestamp|1388191975090,1
> 
> 
> TedgeTranspose table:
> rowKey,columnQualifier,value
> 
> Machine|neptune,0005791918831-neptune,1
> Pool|west,0005791918831-neptune,1
> Load|5,0005791918831-neptune,1
> ReadingTimestamp|1388191975000,0005791918831-neptune,1
> Machine|neptune,0105791918831-neptune,1
> Pool|west,0105791918831-neptune,1
> Load|9,0105791918831-neptune,1
> ReadingTimestamp|1388191975010,0105791918831-neptune,1
> Machine|pluto,0905791918831-pluto,1
> Pool|east,0905791918831-pluto,1
> Load|13,0905791918831-pluto,1
> ReadingTimestamp|1388191975090,0905791918831-pluto,1
> 
> 
> TedgeDegree table:
> rowKey,columnQualifier,value
> 
> Machine|neptune,Degree,2
> Pool|west,Degree,2
> Load|5,Degree,1
> ReadingTimestamp|1388191975000,Degree,1
> Load|9,Degree,1
> ReadingTimestamp|1388191975010,Degree,1
> Machine|pluto,Degree,1
> Pool|east,Degree,1
> Load|13,Degree,1
> ReadingTimestamp|1388191975090,Degree,1
> 
> 
> TedgeText table:
> rowKey,columnQualifier,value
> 
> 0005791918831-neptune,Text,< ... raw text of original log ...>
> 0105791918831-neptune,Text,< ... raw text of original log ...>
> 0905791918831-pluto,Text,< ... raw text of original log ...>
> 
> On Dec 27, 2013, at 8:01 PM, Arshak Navruzyan <ar...@gmail.com> wrote:
> 
> > Jeremy,
> >
> > Wow, didn't expect to get help from the author :)
> >
> > How about something simple like this:
> >
> > Machine    Pool      Load        ReadingTimestamp
> > neptune     west      5            1388191975000
> > neptune     west      9            1388191975010
> > pluto         east       13           1388191975090
> >
> > These are the areas I am unclear on:
> >
> > 1.  Should the transpose table be built as part of ingest code or as an accumulo combiner?
> > 2.  What does the degree table do in this example ?  The paper mentions it's useful for query optimization.  How?
> > 3.  Does D4M accommodate "repurposing" the row_id to a partition key?  The wikisearch shows how the partition id is important for parallel scans of the index.  But since Accumulo is a row store how can you do fast lookups by row if you've used the row_id as a partition key.
> >
> > Thank you,
> >
> > Arshak
> >
> >
> >
> >
> >
> >
> > On Thu, Dec 26, 2013 at 5:31 PM, Jeremy Kepner <ke...@ll.mit.edu> wrote:
> > Hi Arshak,
> >   Maybe you can send a few (~3) records of data that you are familiar with
> > and we can walk you through how the D4M schema would be applied to those records.
> >
> > Regards.  -Jeremy
> >
> > On Thu, Dec 26, 2013 at 03:10:59PM -0500, Arshak Navruzyan wrote:
> > >    Hello,
> > >    I am trying to get my head around Accumulo schema designs.  I went through
> > >    a lot of trouble to get the wikisearch example running but since the data
> > >    is in protobuf lists, it's not that illustrative (for a newbie).
> > >    Would love to find another example that is a little simpler to understand.
> > >     In particular I am interested in java/scala code that mimics the D4M
> > >    schema design (not a Matlab guy).
> > >    Thanks,
> > >    Arshak
> >
> 
> 


Re: schema examples

Posted by Arshak Navruzyan <ar...@gmail.com>.
Jeremy,

Thanks for the detailed explanation.  Just a couple of final questions:

1.  What's your advice on the transpose table: repeat the indexed term (one
entry per matching row id), or store all matching row ids from tedge in a
single row in tedgetranspose (using protobuf, for example)?  What's the
performance implication of each approach?  In the paper you mentioned that
if there are only a few values they should just be stored together.  Was
there a cut-off point in your testing?

2.  You mentioned that the degrees should be calculated beforehand for high
ingest rates.  Doesn't this change Accumulo from being a true database to
being more of an index?  If changes to the data cause the degree table to
get out of sync, sounds like changes have to be applied elsewhere first and
Accumulo has to be reloaded periodically.  Or perhaps letting the degree
table get out of sync is ok since it's just an assist...

Thanks,

Arshak


On Sat, Dec 28, 2013 at 10:36 AM, Kepner, Jeremy - 0553 - MITLL <
kepner@ll.mit.edu> wrote:

> Hi Arshak,
>   Here is how you might do it.  We implement everything with batch writers
> and batch scanners.  Note: if you are doing high ingest rates, the degree
> table can be tricky and usually requires pre-summing prior to ingestion to
> reduce the pressure on the accumulator inside of Accumulo.  Feel free to
> ask further questions as I would imagine that there a details that still
> wouldn't be clear.  In particular, why we do it this way.
>
> Regards.  -Jeremy
>
> Original data:
>
> Machine,Pool,Load,ReadingTimestamp
> neptune,west,5,1388191975000
> neptune,west,9,1388191975010
> pluto,east,13,1388191975090
>
>
> Tedge table:
> rowKey,columnQualifier,value
>
> 0005791918831-neptune,Machine|neptune,1
> 0005791918831-neptune,Pool|west,1
> 0005791918831-neptune,Load|5,1
> 0005791918831-neptune,ReadingTimestamp|1388191975000,1
> 0105791918831-neptune,Machine|neptune,1
> 0105791918831-neptune,Pool|west,1
> 0105791918831-neptune,Load|9,1
> 0105791918831-neptune,ReadingTimestamp|1388191975010,1
> 0905791918831-pluto,Machine|pluto,1
> 0905791918831-pluto,Pool|east,1
> 0905791918831-pluto,Load|13,1
> 0905791918831-pluto,ReadingTimestamp|1388191975090,1
>
>
> TedgeTranspose table:
> rowKey,columnQualifier,value
>
> Machine|neptune,0005791918831-neptune,1
> Pool|west,0005791918831-neptune,1
> Load|5,0005791918831-neptune,1
> ReadingTimestamp|1388191975000,0005791918831-neptune,1
> Machine|neptune,0105791918831-neptune,1
> Pool|west,0105791918831-neptune,1
> Load|9,0105791918831-neptune,1
> ReadingTimestamp|1388191975010,0105791918831-neptune,1
> Machine|pluto,0905791918831-pluto,1
> Pool|east,0905791918831-pluto,1
> Load|13,0905791918831-pluto,1
> ReadingTimestamp|1388191975090,0905791918831-pluto,1
>
>
> TedgeDegree table:
> rowKey,columnQualifier,value
>
> Machine|neptune,Degree,2
> Pool|west,Degree,2
> Load|5,Degree,1
> ReadingTimestamp|1388191975000,Degree,1
> Load|9,Degree,1
> ReadingTimestamp|1388191975010,Degree,1
> Machine|pluto,Degree,1
> Pool|east,Degree,1
> Load|13,Degree,1
> ReadingTimestamp|1388191975090,Degree,1
>
>
> TedgeText table:
> rowKey,columnQualifier,value
>
> 0005791918831-neptune,Text,< ... raw text of original log ...>
> 0105791918831-neptune,Text,< ... raw text of original log ...>
> 0905791918831-pluto,Text,< ... raw text of original log ...>
>
> On Dec 27, 2013, at 8:01 PM, Arshak Navruzyan <ar...@gmail.com> wrote:
>
> > Jeremy,
> >
> > Wow, didn't expect to get help from the author :)
> >
> > How about something simple like this:
> >
> > Machine    Pool      Load        ReadingTimestamp
> > neptune     west      5            1388191975000
> > neptune     west      9            1388191975010
> > pluto         east       13           1388191975090
> >
> > These are the areas I am unclear on:
> >
> > 1.  Should the transpose table be built as part of ingest code or as an
> accumulo combiner?
> > 2.  What does the degree table do in this example ?  The paper mentions
> it's useful for query optimization.  How?
> > 3.  Does D4M accommodate "repurposing" the row_id to a partition key?
>  The wikisearch shows how the partition id is important for parallel scans
> of the index.  But since Accumulo is a row store how can you do fast
> lookups by row if you've used the row_id as a partition key.
> >
> > Thank you,
> >
> > Arshak
> >
> >
> >
> >
> >
> >
> > On Thu, Dec 26, 2013 at 5:31 PM, Jeremy Kepner <ke...@ll.mit.edu>
> wrote:
> > Hi Arshak,
> >   Maybe you can send a few (~3) records of data that you are familiar
> with
> > and we can walk you through how the D4M schema would be applied to those
> records.
> >
> > Regards.  -Jeremy
> >
> > On Thu, Dec 26, 2013 at 03:10:59PM -0500, Arshak Navruzyan wrote:
> > >    Hello,
> > >    I am trying to get my head around Accumulo schema designs.  I went
> through
> > >    a lot of trouble to get the wikisearch example running but since
> the data
> > >    is in protobuf lists, it's not that illustrative (for a newbie).
> > >    Would love to find another example that is a little simpler to
> understand.
> > >     In particular I am interested in java/scala code that mimics the
> D4M
> > >    schema design (not a Matlab guy).
> > >    Thanks,
> > >    Arshak
> >
>
>

Re: schema examples

Posted by "Kepner, Jeremy - 0553 - MITLL" <ke...@ll.mit.edu>.
Hi Arshak,
  Here is how you might do it.  We implement everything with batch writers and batch scanners.  Note: if you are doing high ingest rates, the degree table can be tricky and usually requires pre-summing prior to ingestion to reduce the pressure on the accumulator inside of Accumulo.  Feel free to ask further questions as I would imagine there are details that still wouldn't be clear.  In particular, why we do it this way.

Regards.  -Jeremy

Original data:

Machine,Pool,Load,ReadingTimestamp
neptune,west,5,1388191975000
neptune,west,9,1388191975010
pluto,east,13,1388191975090


Tedge table:
rowKey,columnQualifier,value

0005791918831-neptune,Machine|neptune,1
0005791918831-neptune,Pool|west,1
0005791918831-neptune,Load|5,1
0005791918831-neptune,ReadingTimestamp|1388191975000,1
0105791918831-neptune,Machine|neptune,1
0105791918831-neptune,Pool|west,1
0105791918831-neptune,Load|9,1
0105791918831-neptune,ReadingTimestamp|1388191975010,1
0905791918831-pluto,Machine|pluto,1
0905791918831-pluto,Pool|east,1
0905791918831-pluto,Load|13,1
0905791918831-pluto,ReadingTimestamp|1388191975090,1


TedgeTranspose table:
rowKey,columnQualifier,value

Machine|neptune,0005791918831-neptune,1
Pool|west,0005791918831-neptune,1
Load|5,0005791918831-neptune,1
ReadingTimestamp|1388191975000,0005791918831-neptune,1
Machine|neptune,0105791918831-neptune,1
Pool|west,0105791918831-neptune,1
Load|9,0105791918831-neptune,1
ReadingTimestamp|1388191975010,0105791918831-neptune,1
Machine|pluto,0905791918831-pluto,1
Pool|east,0905791918831-pluto,1
Load|13,0905791918831-pluto,1
ReadingTimestamp|1388191975090,0905791918831-pluto,1


TedgeDegree table:
rowKey,columnQualifier,value

Machine|neptune,Degree,2
Pool|west,Degree,2
Load|5,Degree,1
ReadingTimestamp|1388191975000,Degree,1
Load|9,Degree,1
ReadingTimestamp|1388191975010,Degree,1
Machine|pluto,Degree,1
Pool|east,Degree,1
Load|13,Degree,1
ReadingTimestamp|1388191975090,Degree,1


TedgeText table:
rowKey,columnQualifier,value

0005791918831-neptune,Text,< ... raw text of original log ...>
0105791918831-neptune,Text,< ... raw text of original log ...>
0905791918831-pluto,Text,< ... raw text of original log ...>
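
For readers who, like the original post, would rather see Java than Matlab: below is a minimal, untested sketch of the ingest for the four tables above, written against the 1.5-era client API.  The instance name, ZooKeeper host, and credentials are placeholders, and it assumes a SummingCombiner has been attached to TedgeDegree (see elsewhere in this thread):

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

public class TedgeIngest {
  private static final Text EMPTY_CF = new Text("");  // D4M keys carry no column family
  private static final Value ONE = new Value("1".getBytes());

  public static void main(String[] args) throws Exception {
    Connector conn = new ZooKeeperInstance("myInstance", "zkhost:2181")
        .getConnector("myUser", new PasswordToken("myPass"));
    BatchWriterConfig cfg = new BatchWriterConfig();
    BatchWriter edge = conn.createBatchWriter("Tedge", cfg);
    BatchWriter transpose = conn.createBatchWriter("TedgeTranspose", cfg);
    BatchWriter degree = conn.createBatchWriter("TedgeDegree", cfg);
    BatchWriter text = conn.createBatchWriter("TedgeText", cfg);

    // One CSV record: neptune,west,5,1388191975000
    String machine = "neptune", pool = "west", load = "5", ts = "1388191975000";
    // Reversed timestamp plus machine name, matching the row keys above.
    String row = new StringBuilder(ts).reverse() + "-" + machine;
    String[] cols = { "Machine|" + machine, "Pool|" + pool,
        "Load|" + load, "ReadingTimestamp|" + ts };

    for (String col : cols) {
      Mutation me = new Mutation(new Text(row));      // Tedge: row x column
      me.put(EMPTY_CF, new Text(col), ONE);
      edge.addMutation(me);

      Mutation mt = new Mutation(new Text(col));      // transpose: column x row
      mt.put(EMPTY_CF, new Text(row), ONE);
      transpose.addMutation(mt);

      Mutation md = new Mutation(new Text(col));      // degree: +1, summed server-side
      md.put(EMPTY_CF, new Text("Degree"), ONE);
      degree.addMutation(md);
    }

    Mutation mx = new Mutation(new Text(row));        // raw record for rehydration
    mx.put(EMPTY_CF, new Text("Text"),
        new Value((machine + "," + pool + "," + load + "," + ts).getBytes()));
    text.addMutation(mx);

    for (BatchWriter w : new BatchWriter[] { edge, transpose, degree, text }) {
      w.close();
    }
  }
}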

On Dec 27, 2013, at 8:01 PM, Arshak Navruzyan <ar...@gmail.com> wrote:

> Jeremy,
> 
> Wow, didn't expect to get help from the author :)
> 
> How about something simple like this:
> 
> Machine    Pool      Load        ReadingTimestamp
> neptune     west      5            1388191975000
> neptune     west      9            1388191975010
> pluto         east       13           1388191975090
> 
> These are the areas I am unclear on:
> 
> 1.  Should the transpose table be built as part of ingest code or as an accumulo combiner?
> 2.  What does the degree table do in this example ?  The paper mentions it's useful for query optimization.  How?  
> 3.  Does D4M accommodate "repurposing" the row_id to a partition key?  The wikisearch shows how the partition id is important for parallel scans of the index.  But since Accumulo is a row store how can you do fast lookups by row if you've used the row_id as a partition key.
> 
> Thank you,
> 
> Arshak
> 
> 
> 
> 
> 
> 
> On Thu, Dec 26, 2013 at 5:31 PM, Jeremy Kepner <ke...@ll.mit.edu> wrote:
> Hi Arshak,
>   Maybe you can send a few (~3) records of data that you are familiar with
> and we can walk you through how the D4M schema would be applied to those records.
> 
> Regards.  -Jeremy
> 
> On Thu, Dec 26, 2013 at 03:10:59PM -0500, Arshak Navruzyan wrote:
> >    Hello,
> >    I am trying to get my head around Accumulo schema designs.  I went through
> >    a lot of trouble to get the wikisearch example running but since the data
> >    is in protobuf lists, it's not that illustrative (for a newbie).
> >    Would love to find another example that is a little simpler to understand.
> >     In particular I am interested in java/scala code that mimics the D4M
> >    schema design (not a Matlab guy).
> >    Thanks,
> >    Arshak
> 


Re: schema examples

Posted by Arshak Navruzyan <ar...@gmail.com>.
Dylan, thanks for the detailed explanation.  Very helpful!

Josh, appreciate the hint about combiners.


On Fri, Dec 27, 2013 at 9:53 PM, Dylan Hutchison <dh...@stevens.edu> wrote:

> Hi Arshak,
>
> Perhaps this might help, though I will defer to Jeremy on the finer
> points.  Keep in mind that writing methods that interface with the D4M
> schema will duplicate some of the Matlab D4M code.  Jeremy, is the Java
> source for the D4M JARs available?
>
> First let's weigh potential choices for row IDs:
>
>    1. row ID = ReadingTimestamp.  The advantage is that you can easily
>    query for continuous periods of time via the row, but the disadvantage is
>    that all of your writes will go at the end to the last tablet server,
>    assuming that your data is streaming in live.  If your data is not
>    streaming in live but available offline, you can create table splits to
>    distribute the time stamps evenly across your tablet servers.  Scanning may
>    or may not bottleneck on one tablet server depending on your use case.
>    2. row ID = ReadingTimestamp *reversed*.  For example, use
>    0005791918831 instead of 1388191975000.  Now writes will naturally
>    distribute across all your tablet servers.  However, you lose
>    the ability to easily query continuous periods of time.
>    3. row ID = a new unique ID assigned to each row in your table,
>    leaving the ReadingTimestamp as another column.  This is a good all-around
>    solution for the regular table, but what about the transpose table?
>     Because it switches the rows and columns, we still have the disadvantage
>    in option #1 in the transpose table.
>
> Since we don't have a clear winner, I wrote a quick implementation of
> option #1.  Here's the original table, followed by the exploded table, its
> transpose and the degree table.
>
> Original
>               Load,Machine,Pool,
> 1388191975000,005, neptune,west,
> 1388191975010,009, neptune,west,
> 1388191975090,013, pluto,  east,
>
> Exploded
>
>               Load|005,Load|009,Load|013,Machine|neptune,Machine|pluto,Pool|east,Pool|west,
> 1388191975000,1,                         1,                                      1,
> 1388191975010,         1,                1,                                      1,
> 1388191975090,                  1,                       1,            1,
>
>
>
> Transpose
>                 1388191975000,1388191975010,1388191975090,
> Load|005,       1,
> Load|009,                     1,
> Load|013,                                   1,
> Machine|neptune,1,            1,
> Machine|pluto,                              1,
> Pool|east,                                  1,
> Pool|west,      1,            1,
>
>
> Degree
>                 degree,
> Load|005,       1,
> Load|009,       1,
> Load|013,       1,
> Machine|neptune,2,
> Machine|pluto,  1,
> Pool|east,      1,
> Pool|west,      2,
>
>
> *1.  Should the transpose table be built as part of ingest code or as an
> accumulo combiner?*
> I recommend ingest code for much greater simplicity, though it may be
> possible to build a combiner to automatically ingest to a second table.
>  When inserting (row,col,val) triples, do another insert to the transpose
> with (col,row,val).
> Use summing combiners to create the degree tables.
>
> *2.  What does the degree table do in this example ?  The paper mentions
> it's useful for query optimization.  How?  *
> Suppose you're interested in querying the row timestamps that log a *load
> > 5* (query A) and come from the *east pool *(query B). (You want A & B.)
>  *I assume that the number of returned rows matching A & B is far less
> than the total number of rows.*  (If they are roughly equal, you must
> query all the rows anyway and a database won't help you.)  There are a few
> options:
>
>    1. Scan through all the rows and return only those that match A & B.
>     Sure, this will do the job, but what if you have 100M rows?  We can do
>    faster with our transpose table.
>    2. Scan through the transpose table for the timestamps that match
>    query A.  In our example, this returns 2 rows with loads > 5.  Then retain
>    only the rows that also match query B and come from the East pool.
>    3. Scan through the transpose table for the timestamps that match
>    query B.  In our example, this returns 1 row that comes from the East pool.
>     Then retain only the rows that also match query A and have loads > 5.
>
> How can we tell whether query strategy #2 or #3 is better?  Choose the one
> that returns the fewest rows first!  We determine that by consulting the
> degree table twice (two quick lookups for counts) and verifying that there
> are 2 rows from query A and 1 row for query B, thus making strategy #3 the
> better choice.  Of course, the difference is 1 row in our small example,
> but it could be much larger on a real database.
>
> *3.  Does D4M accommodate "repurposing" the row_id to a partition key?
>  The wikisearch shows how the partition id is important for parallel scans
> of the index.  But since Accumulo is a row store how can you do fast
> lookups by row if you've used the row_id as a partition key.*
> Table design is tough when optimizing for the general case.  Hopefully the
> above tradeoff analysis helps present different options.
>
>
> I attached the CSV file and a few lines of Matlab code to recreate the
> tables.  If you don't mind reading a little Matlab, take a look at the
> tutorial here:
>
> https://github.com/denine99/d4mBB
>
> I work out a very similar query optimization solution on baseball data in
> the section "%% Find how many players weigh < 200 lb. and bat with left
> hand or both hands".
>
> Cheers,
> Dylan Hutchison
>
>
>
> On Fri, Dec 27, 2013 at 8:01 PM, Arshak Navruzyan <ar...@gmail.com> wrote:
>
>> Jeremy,
>>
>> Wow, didn't expect to get help from the author :)
>>
>> How about something simple like this:
>>
>> Machine    Pool      Load        ReadingTimestamp
>> neptune     west      5            1388191975000
>> neptune     west      9            1388191975010
>> pluto         east       13           1388191975090
>>
>> These are the areas I am unclear on:
>>
>> 1.  Should the transpose table be built as part of ingest code or as an
>> accumulo combiner?
>> 2.  What does the degree table do in this example ?  The paper mentions
>> it's useful for query optimization.  How?
>> 3.  Does D4M accommodate "repurposing" the row_id to a partition key?
>>  The wikisearch shows how the partition id is important for parallel scans
>> of the index.  But since Accumulo is a row store how can you do fast
>> lookups by row if you've used the row_id as a partition key.
>>
>> Thank you,
>>
>> Arshak
>>
>>
>>
>>
>>
>>
>> On Thu, Dec 26, 2013 at 5:31 PM, Jeremy Kepner <ke...@ll.mit.edu> wrote:
>>
>>> Hi Arshak,
>>>   Maybe you can send a few (~3) records of data that you are familiar
>>> with
>>> and we can walk you through how the D4M schema would be applied to those
>>> records.
>>>
>>> Regards.  -Jeremy
>>>
>>> On Thu, Dec 26, 2013 at 03:10:59PM -0500, Arshak Navruzyan wrote:
>>> >    Hello,
>>> >    I am trying to get my head around Accumulo schema designs.  I went
>>> through
>>> >    a lot of trouble to get the wikisearch example running but since
>>> the data
>>> >    is in protobuf lists, it's not that illustrative (for a newbie).
>>> >    Would love to find another example that is a little simpler to
>>> understand.
>>> >     In particular I am interested in java/scala code that mimics the
>>> D4M
>>> >    schema design (not a Matlab guy).
>>> >    Thanks,
>>> >    Arshak
>>>
>>
>>
>
>
> --
> www.cs.stevens.edu/~dhutchis
>

Re: schema examples

Posted by Josh Elser <jo...@gmail.com>.
On 12/28/2013 12:53 AM, Dylan Hutchison wrote:
> /1.  Should the transpose table be built as part of ingest code or as an
> accumulo combiner?/
> I recommend ingest code for much greater simplicity, though it may be
> possible to build a combiner to automatically ingest to a second table.
>   When inserting (row,col,val) triples, do another insert to the
> transpose with (col,row,val).
> Use summing combiners to create the degree tables.

Using a combiner is likely to be much more hassle than it's worth. When 
your Combiner gets invoked server-side, you have no notion of lifecycle 
management and the only way to write to another table is to instantiate 
a Connector and BatchWriter.

As such, it's very difficult, and possibly impossible with the current 
API, to write to a separate table inside of a Combiner without leaking 
resources inside of the TabletServer.

Definitely implement it in your ingest code :)
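
The degree table is the exception that proves the rule: there the combiner only aggregates values within a single table, which is exactly what the Combiner API is for.  A rough, untested sketch of attaching a SummingCombiner to the Degree column via the 1.5 client API (the class name, iterator name, and priority here are arbitrary):

import java.util.Collections;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.iterators.LongCombiner;
import org.apache.accumulo.core.iterators.user.SummingCombiner;

public class DegreeCombinerSetup {
  // One-time setup after creating the degree table: sum the +1 cells
  // written at ingest time into a single count per term.
  public static void attach(Connector conn) throws Exception {
    IteratorSetting is = new IteratorSetting(10, "degsum", SummingCombiner.class);
    LongCombiner.setEncodingType(is, LongCombiner.Type.STRING);  // "1" + "1" -> "2"
    SummingCombiner.setColumns(is, Collections.singletonList(
        new IteratorSetting.Column("", "Degree")));
    conn.tableOperations().attachIterator("TedgeDegree", is);
  }
}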

Re: schema examples

Posted by Dylan Hutchison <dh...@stevens.edu>.
Hi Arshak,

Perhaps this might help, though I will defer to Jeremy on the finer points.
 Keep in mind that writing methods that interface with the D4M schema will
duplicate some of the Matlab D4M code.  Jeremy, is the Java source for the
D4M JARs available?

First let's weigh potential choices for row IDs:

   1. row ID = ReadingTimestamp.  The advantage is that you can easily
   query for continuous periods of time via the row, but the disadvantage is
   that all of your writes will go at the end to the last tablet server,
   assuming that your data is streaming in live.  If your data is not
   streaming in live but available offline, you can create table splits to
   distribute the time stamps evenly across your tablet servers.  Scanning may
   or may not bottleneck on one tablet server depending on your use case.
   2. row ID = ReadingTimestamp *reversed*.  For example, use 0005791918831
   instead of 1388191975000.  Now writes will naturally distribute across all
   your tablet servers.  However, you lose the ability to easily query
   continuous periods of time.  (A short helper sketch follows this list.)
   3. row ID = a new unique ID assigned to each row in your table, leaving
   the ReadingTimestamp as another column.  This is a good all-around solution
   for the regular table, but what about the transpose table?  Because it
   switches the rows and columns, we still have the disadvantage in option #1
   in the transpose table.
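
Here is the promised helper for option #2 (an untested one-liner; the machine-name suffix matches Jeremy's example keys elsewhere in this thread):

// Reverse the timestamp digits so consecutive readings spread across
// tablets instead of piling onto the last one.
static String reversedRowId(long readingTimestamp, String machine) {
  return new StringBuilder(Long.toString(readingTimestamp)).reverse()
      + "-" + machine;  // 1388191975000, "neptune" -> "0005791918831-neptune"
}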

Since we don't have a clear winner, I wrote a quick implementation of
option #1.  Here's the original table, followed by the exploded table, its
transpose and the degree table.

Original
              Load,Machine,Pool,
1388191975000,005, neptune,west,
1388191975010,009, neptune,west,
1388191975090,013, pluto,  east,

Exploded

              Load|005,Load|009,Load|013,Machine|neptune,Machine|pluto,Pool|east,Pool|west,
1388191975000,1,                         1,                                      1,
1388191975010,         1,                1,                                      1,
1388191975090,                  1,                       1,            1,



Transpose
                1388191975000,1388191975010,1388191975090,
Load|005,       1,
Load|009,                     1,
Load|013,                                   1,
Machine|neptune,1,            1,
Machine|pluto,                              1,
Pool|east,                                  1,
Pool|west,      1,            1,


Degree
                degree,
Load|005,       1,
Load|009,       1,
Load|013,       1,
Machine|neptune,2,
Machine|pluto,  1,
Pool|east,      1,
Pool|west,      2,


*1.  Should the transpose table be built as part of ingest code or as an
accumulo combiner?*
I recommend ingest code for much greater simplicity, though it may be
possible to build a combiner to automatically ingest to a second table.
 When inserting (row,col,val) triples, do another insert to the transpose
with (col,row,val).
Use summing combiners to create the degree tables.

*2.  What does the degree table do in this example ?  The paper mentions
it's useful for query optimization.  How?  *
Suppose you're interested in querying the row timestamps that log a *load >
5* (query A) and come from the *east pool *(query B). (You want A & B.)  *I
assume that the number of returned rows matching A & B is far less than the
total number of rows.*  (If they are roughly equal, you must query all the
rows anyway and a database won't help you.)  There are a few options:

   1. Scan through all the rows and return only those that match A & B.
    Sure, this will do the job, but what if you have 100M rows?  We can do
   faster with our transpose table.
   2. Scan through the transpose table for the timestamps that match query
   A.  In our example, this returns 2 rows with loads > 5.  Then retain only
   the rows that also match query B and come from the East pool.
   3. Scan through the transpose table for the timestamps that match query
   B.  In our example, this returns 1 row that comes from the East pool.  Then
   retain only the rows that also match query A and have loads > 5.

How can we tell whether query strategy #2 or #3 is better?  Choose the one
that returns the fewest rows first!  We determine that by consulting the
degree table twice (two quick lookups for counts) and verifying that there
are 2 rows from query A and 1 row for query B, thus making strategy #3 the
better choice.  Of course, the difference is 1 row in our small example,
but it could be much larger on a real database.
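
To make that concrete, here is a rough, untested Java sketch of the plan: two quick degree lookups, a transpose scan on the rarer term, then a per-candidate check against the main table.  Exact-match terms are used for simplicity; a range predicate like load > 5 would instead scan the span of Load| rows in the degree and transpose tables.  Table names follow Jeremy's examples; everything else is a placeholder:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class DegreeAwareQuery {
  // Quick count lookup in the degree table.
  static long degreeOf(Connector conn, String term) throws Exception {
    Scanner s = conn.createScanner("TedgeDegree", Authorizations.EMPTY);
    s.setRange(new Range(new Text(term)));
    long d = 0;
    for (Map.Entry<Key, Value> e : s)
      d = Long.parseLong(e.getValue().toString());
    return d;
  }

  // Row ids matching a term, read from the transpose table.
  static List<String> rowsMatching(Connector conn, String term) throws Exception {
    Scanner s = conn.createScanner("TedgeTranspose", Authorizations.EMPTY);
    s.setRange(new Range(new Text(term)));
    List<String> rows = new ArrayList<String>();
    for (Map.Entry<Key, Value> e : s)
      rows.add(e.getKey().getColumnQualifier().toString());
    return rows;
  }

  // A & B: scan the transpose for the rarer term, then verify the other
  // term per candidate row.  (A BatchScanner would batch these lookups.)
  static List<String> queryAnd(Connector conn, String a, String b) throws Exception {
    String first = degreeOf(conn, a) <= degreeOf(conn, b) ? a : b;
    String second = first.equals(a) ? b : a;
    List<String> result = new ArrayList<String>();
    for (String row : rowsMatching(conn, first)) {
      Scanner s = conn.createScanner("Tedge", Authorizations.EMPTY);
      s.setRange(new Range(new Text(row)));
      s.fetchColumn(new Text(""), new Text(second));
      if (s.iterator().hasNext())
        result.add(row);
    }
    return result;
  }
}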

*3.  Does D4M accommodate "repurposing" the row_id to a partition key?  The
wikisearch shows how the partition id is important for parallel scans of
the index.  But since Accumulo is a row store how can you do fast lookups
by row if you've used the row_id as a partition key.*
Table design is tough when optimizing for the general case.  Hopefully the
above tradeoff analysis helps present different options.


I attached the CSV file and a few lines of Matlab code to recreate the
tables.  If you don't mind reading a little Matlab, take a look at the
tutorial here:

https://github.com/denine99/d4mBB

I work out a very similar query optimization solution on baseball data in
the section "%% Find how many players weigh < 200 lb. and bat with left
hand or both hands".

Cheers,
Dylan Hutchison



On Fri, Dec 27, 2013 at 8:01 PM, Arshak Navruzyan <ar...@gmail.com> wrote:

> Jeremy,
>
> Wow, didn't expect to get help from the author :)
>
> How about something simple like this:
>
> Machine    Pool      Load        ReadingTimestamp
> neptune     west      5            1388191975000
> neptune     west      9            1388191975010
> pluto         east       13           1388191975090
>
> These are the areas I am unclear on:
>
> 1.  Should the transpose table be built as part of ingest code or as an
> accumulo combiner?
> 2.  What does the degree table do in this example ?  The paper mentions
> it's useful for query optimization.  How?
> 3.  Does D4M accommodate "repurposing" the row_id to a partition key?  The
> wikisearch shows how the partition id is important for parallel scans of
> the index.  But since Accumulo is a row store how can you do fast lookups
> by row if you've used the row_id as a partition key.
>
> Thank you,
>
> Arshak
>
>
>
>
>
>
> On Thu, Dec 26, 2013 at 5:31 PM, Jeremy Kepner <ke...@ll.mit.edu> wrote:
>
>> Hi Arshak,
>>   Maybe you can send a few (~3) records of data that you are familiar with
>> and we can walk you through how the D4M schema would be applied to those
>> records.
>>
>> Regards.  -Jeremy
>>
>> On Thu, Dec 26, 2013 at 03:10:59PM -0500, Arshak Navruzyan wrote:
>> >    Hello,
>> >    I am trying to get my head around Accumulo schema designs.  I went
>> through
>> >    a lot of trouble to get the wikisearch example running but since the
>> data
>> >    is in protobuf lists, it's not that illustrative (for a newbie).
>> >    Would love to find another example that is a little simpler to
>> understand.
>> >     In particular I am interested in java/scala code that mimics the D4M
>> >    schema design (not a Matlab guy).
>> >    Thanks,
>> >    Arshak
>>
>
>


-- 
www.cs.stevens.edu/~dhutchis

Re: schema examples

Posted by Arshak Navruzyan <ar...@gmail.com>.
Jeremy,

Wow, didn't expect to get help from the author :)

How about something simple like this:

Machine    Pool      Load        ReadingTimestamp
neptune     west      5            1388191975000
neptune     west      9            1388191975010
pluto         east       13           1388191975090

These are the areas I am unclear on:

1.  Should the transpose table be built as part of ingest code or as an
accumulo combiner?
2.  What does the degree table do in this example ?  The paper mentions
it's useful for query optimization.  How?
3.  Does D4M accommodate "repurposing" the row_id to a partition key?  The
wikisearch shows how the partition id is important for parallel scans of
the index.  But since Accumulo is a row store how can you do fast lookups
by row if you've used the row_id as a partition key.

Thank you,

Arshak






On Thu, Dec 26, 2013 at 5:31 PM, Jeremy Kepner <ke...@ll.mit.edu> wrote:

> Hi Arshak,
>   Maybe you can send a few (~3) records of data that you are familiar with
> and we can walk you through how the D4M schema would be applied to those
> records.
>
> Regards.  -Jeremy
>
> On Thu, Dec 26, 2013 at 03:10:59PM -0500, Arshak Navruzyan wrote:
> >    Hello,
> >    I am trying to get my head around Accumulo schema designs.  I went
> through
> >    a lot of trouble to get the wikisearch example running but since the
> data
> >    is in protobuf lists, it's not that illustrative (for a newbie).
> >    Would love to find another example that is a little simpler to
> understand.
> >     In particular I am interested in java/scala code that mimics the D4M
> >    schema design (not a Matlab guy).
> >    Thanks,
> >    Arshak
>

Re: schema examples

Posted by Jeremy Kepner <ke...@ll.mit.edu>.
Hi Arshak,
  Maybe you can send a few (~3) records of data that you are familiar with
and we can walk you through how the D4M schema would be applied to those records.

Regards.  -Jeremy

On Thu, Dec 26, 2013 at 03:10:59PM -0500, Arshak Navruzyan wrote:
>    Hello,
>    I am trying to get my head around Accumulo schema designs.  I went through
>    a lot of trouble to get the wikisearch example running but since the data
>    is in protobuf lists, it's not that illustrative (for a newbie).
>    Would love to find another example that is a little simpler to understand.
>     In particular I am interested in java/scala code that mimics the D4M
>    schema design (not a Matlab guy).
>    Thanks,
>    Arshak
