Posted to user@accumulo.apache.org by Jamie Johnson <je...@gmail.com> on 2014/01/25 04:23:21 UTC

scanner question in regards to columns loaded

If I have a row whose key is a particular term, with a set of columns that
store the documents the term appears in, and I load that row, are the
contents of all of the columns also loaded?  Is there a way to page over
the columns such that only N columns are in memory at any point?  In this
particular case the documents are all in a particular column family (say
docs) and the column qualifier is created dynamically; for argument's sake
we can say they are UUIDs.

Re: scanner question in regards to columns loaded

Posted by William Slacum <wi...@accumulo.net>.
Filters (and more generally, iterators) are executed on the server. There
is an option to run them client side. See
http://accumulo.apache.org/1.4/apidocs/org/apache/accumulo/core/client/ClientSideIteratorScanner.html
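
As a minimal sketch of the client-side option (assuming a hypothetical table
named "termIndex" and a Connector already in hand):

    import org.apache.accumulo.core.client.ClientSideIteratorScanner;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.client.TableNotFoundException;
    import org.apache.accumulo.core.security.Authorizations;

    public class ClientSideScanExample {
        // Wraps a normal Scanner so that any iterators added to it run in the
        // client JVM instead of on the tablet servers.
        public static Scanner clientSide(Connector conn) throws TableNotFoundException {
            Scanner base = conn.createScanner("termIndex", new Authorizations()); // hypothetical table name
            return new ClientSideIteratorScanner(base);
        }
    }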

Using fetchColumnFamily will return only keys that are in the specified
column families, not whole rows.

If I have a few keys in a table:

row1 family1: qualifier1
row1 family2: qualifier2
row2 family1: qualifier1

Let's say I call `scanner.fetchColumnFamily("family1")`. My scanner will
return:

row1 family1: qualifier1
row2 family1: qualifier1

Now let's say I want to do a scan, but call
`scanner.fetchColumnFamily("family2")`. My scanner will return:

row1 family2: qualifier2
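
In code, that column family restriction might look like the following sketch
("mytable" is a hypothetical table name, and a Connector is assumed to
already exist):

    import java.util.Map.Entry;

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.client.TableNotFoundException;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;
    import org.apache.hadoop.io.Text;

    public class FetchColumnFamilyExample {
        // Prints only the entries whose column family is "family1"; rows without
        // that family simply contribute no entries to the scan.
        public static void scanFamily1(Connector conn) throws TableNotFoundException {
            Scanner scanner = conn.createScanner("mytable", new Authorizations()); // hypothetical table name
            scanner.fetchColumnFamily(new Text("family1"));
            for (Entry<Key, Value> entry : scanner) {
                System.out.println(entry.getKey());
            }
        }
    }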

If you want whole rows that contain specific column families, then I
believe you'd have to write a custom iterator using the RowFilter
http://accumulo.apache.org/1.4/apidocs/org/apache/accumulo/core/iterators/user/RowFilter.html
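
A minimal sketch of such a RowFilter subclass might look like this (the
"family1" name is hypothetical; a production iterator would normally read the
family from iterator options in init(), and may also need to handle
deepCopy()):

    import java.io.IOException;

    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.iterators.SortedKeyValueIterator;
    import org.apache.accumulo.core.iterators.user.RowFilter;
    import org.apache.hadoop.io.Text;

    // Keeps only rows containing at least one entry in the "family1" column family.
    public class HasFamilyRowFilter extends RowFilter {
        private static final Text WANTED = new Text("family1"); // hypothetical family

        @Override
        public boolean acceptRow(SortedKeyValueIterator<Key, Value> rowIterator) throws IOException {
            while (rowIterator.hasTop()) {
                if (rowIterator.getTopKey().getColumnFamily().equals(WANTED)) {
                    return true; // the whole row is passed through
                }
                rowIterator.next();
            }
            return false; // the whole row is suppressed
        }
    }

The compiled class would need to be on the tablet servers' classpath; it could
then be attached to a scan with something like
scanner.addScanIterator(new IteratorSetting(30, "hasFamily", HasFamilyRowFilter.class)).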


On Sun, Jan 26, 2014 at 7:39 PM, Jamie Johnson <je...@gmail.com> wrote:

> After a little reading...if I use fetchColumnFamily does that skip any
> rows that does not have the column family?
> On Jan 26, 2014 7:27 PM, "Jamie Johnson" <je...@gmail.com> wrote:
>
>> Thanks for the ideas.  Filters are client side right?
>>
>> I need to read the documentation more as I don't know how to just query a
>> column family.  Would it be possible to get all terms that start with a
>> particular value?  I was thinking that we would need a special prefix for
>> this but if something could be done without needing it that would work well.
>> On Jan 26, 2014 5:44 PM, "Christopher" <ct...@apache.org> wrote:
>>
>>> Ah, I see. Well, you could do that with a custom filter (iterator),
>>> but otherwise, no, not unless you had some other special per-term
>>> entry to query (rather than per-term/document pair). The design of
>>> this kind of table though, seems focused on finding documents which
>>> contain the given terms, though, not listing all terms seen. If you
>>> need that additional feature and don't want to write a custom filter,
>>> you could achieve that by putting a special entry in its own row for
>>> each term, in addition to the entries per-term/document pair, as in:
>>>
>>> RowID                       ColumnFamily     Column Qualifier     Value
>>> <term1>                    term                   -                            -
>>> <term1>=<doc_id2>   index                  count                     5
>>>
>>> Then, you could list terms by querying the "term" column family
>>> without getting duplicates. And, you could get decent performance with
>>> this scan if you put the "term" column family and the "index" column
>>> family in separate locality groups. You could even make this entry an
>>> aggregated count for all documents (see documentation for combiners),
>>> in case you want corpus-wide term frequencies (for something like
>>> TF-IDF computations).
>>>
>>> --
>>> Christopher L Tubbs II
>>> http://gravatar.com/ctubbsii
>>>
>>>
>>> On Sun, Jan 26, 2014 at 7:55 AM, Jamie Johnson <je...@gmail.com>
>>> wrote:
>>> > I mean if a user asked for all terms that started with "term" is there
>>> a way
>>> > to get term1 and term2 just once while scanning or would I get each
>>> twice,
>>> > once for each docid and need to filter client side?
>>> >
>>> > On Jan 26, 2014 1:33 AM, "Christopher" <ct...@apache.org> wrote:
>>> >>
>>> >> If you use the Range constructor that takes two arguments, then yes,
>>> >> you'd get two entries. However, "count" would come before "doc_id",
>>> >> though, because the qualifier is part of the Key, and therefore, part
>>> >> of the sort order. There's also a Range constructor that allows you to
>>> >> specify whether you want the startKey and endKey to be inclusive or
>>> >> exclusive.
>>> >>
>>> >> I don't know of a specific document that outlines various strategies
>>> >> that I can link to. Perhaps I'll put one together, when I get some
>>> >> spare time, if nobody else does. I think most people do a lot of
>>> >> experimentation to figure out which strategies work best.
>>> >>
>>> >> I'm not entirely sure what you mean about "getting an iterator over
>>> >> all terms without duplicates". I'm assuming you don't mean duplicate
>>> >> versions of a single entry, which is handled by the
>>> >> VersioningIterator, which should be on new tables by default, and set
>>> >> to retain the recent 1 version, to support updates. With the scheme I
>>> >> suggested, your table would look something like the following,
>>> >> instead:
>>> >>
>>> >> RowID                       ColumnFamily     Column Qualifier     Value
>>> >> <term1>=<doc_id1>   index                  count                     10
>>> >> <term1>=<doc_id2>   index                  count                     5
>>> >> <term2>=<doc_id3>   index                  count                     3
>>> >> <term3>=<doc_id1>   index                  count                     12
>>> >>
>>> >> With this scheme, you'd have only a single entry (a count) for each
>>> >> row, and a single row for each term/document combination, so you
>>> >> wouldn't have any duplicate counts for any given term/document. If
>>> >> that's what you mean by duplicates...
>>> >>
>>> >>
>>> >> --
>>> >> Christopher L Tubbs II
>>> >> http://gravatar.com/ctubbsii
>>> >>
>>> >>
>>> >> On Sat, Jan 25, 2014 at 12:19 AM, Jamie Johnson <je...@gmail.com>
>>> wrote:
>>> >> > Thanks for the reply Chris.  Say I had the following
>>> >> >
>>> >> > RowID     ColumnFamily     Column Qualifier     Value
>>> >> > term         Occurrence~1     doc_id                    1
>>> >> > term         Occurrence~1     count                      10
>>> >> > term2       Occurrence~2      doc_id                     2
>>> >> > term2       Occurrence~2      count                      1
>>> >> >
>>> >> > creating a scanner with start key new Key(new Text("term"), new
>>> >> > Text("Occurrence~1")) and end key new Key(new Text("term"), new
>>> >> > Text("Occurrence~1")) I would get an iterator with two entries, the
>>> >> > first
>>> >> > key would be doc_id and the second would be count.  Is that
>>> accurate?
>>> >> >
>>> >> > In regards to the other strategies is there anywhere that some of
>>> these
>>> >> > are
>>> >> > captured?  Also in the your example, how would you go about getting
>>> an
>>> >> > iterator over all terms without duplicates?  Again thanks
>>> >> >
>>> >> >
>>> >> > On Fri, Jan 24, 2014 at 11:34 PM, Christopher <ct...@apache.org>
>>> >> > wrote:
>>> >> >>
>>> >> >> It's not quite clear what you mean by "load", but I think you mean
>>> >> >> "iterate over"?
>>> >> >>
>>> >> >> A simplified explanation is this:
>>> >> >>
>>> >> >> When you scan an Accumulo table, you are streaming each entry
>>> >> >> (Key/Value pair), one at a time, through your client code. They are
>>> >> >> only held in memory if you do that yourself in your client code. A
>>> row
>>> >> >> in Accumulo is the set of entries that share a particular value of
>>> the
>>> >> >> Row portion of the Key. They are logically grouped, but are not
>>> >> >> grouped in memory unless you do that.
>>> >> >>
>>> >> >> One additional note is regarding your index schema of a row being a
>>> >> >> search term and columns being documents. You will likely have
>>> issues
>>> >> >> with this strategy, as the number of documents for high frequency
>>> >> >> terms grows, because tablets do not split in the middle of a row.
>>> With
>>> >> >> your schema, a row could get too large to manage on a single tablet
>>> >> >> server. A slight variation, like concatenating the search term
>>> with a
>>> >> >> document identifier in the row (term=doc1, term=doc2, ....) would
>>> >> >> allow the high frequency terms to split into multiple tablets if
>>> they
>>> >> >> get too large. There are better strategies, but that's just one
>>> simple
>>> >> >> option.
>>> >> >>
>>> >> >>
>>> >> >> --
>>> >> >> Christopher L Tubbs II
>>> >> >> http://gravatar.com/ctubbsii
>>> >> >>
>>> >> >>
>>> >> >> On Fri, Jan 24, 2014 at 10:23 PM, Jamie Johnson <jej2003@gmail.com
>>> >
>>> >> >> wrote:
>>> >> >> > If I have a row that as the key is a particular term and a set of
>>> >> >> > columns
>>> >> >> > that stores the documents that the term appears in if I load the
>>> row
>>> >> >> > is
>>> >> >> > the
>>> >> >> > contents of all of the columns also loaded?  Is there a way to
>>> page
>>> >> >> > over
>>> >> >> > the
>>> >> >> > columns such that only N columns are in memory at any point?  In
>>> this
>>> >> >> > particular case the documents are all in a particular column
>>> family
>>> >> >> > (say
>>> >> >> > docs) and the column qualifier is created dynamically, for
>>> arguments
>>> >> >> > sake we
>>> >> >> > can say they are UUIDs.
>>> >> >
>>> >> >
>>>
>>

Re: scanner question in regards to columns loaded

Posted by Christopher <ct...@apache.org>.
Filters are iterators, which are configured to run on the server-side.
fetchColumnFamily will only return entries in the specified column
families. If a row has no entries in the specified column families,
then no entries for that row will be returned.
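
For example, a built-in filter such as RegExFilter can be attached to a scan
like this (a sketch; "termIndex" is a hypothetical table name and a Connector
is assumed to be in hand):

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.IteratorSetting;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.client.TableNotFoundException;
    import org.apache.accumulo.core.iterators.user.RegExFilter;
    import org.apache.accumulo.core.security.Authorizations;

    public class ServerSideFilterExample {
        // Attaches a built-in filter (RegExFilter) to a scan; it runs on the tablet
        // servers, so entries that do not match are never sent to the client.
        public static Scanner valuesStartingWithOne(Connector conn) throws TableNotFoundException {
            Scanner scanner = conn.createScanner("termIndex", new Authorizations()); // hypothetical table name
            IteratorSetting setting = new IteratorSetting(30, "valueFilter", RegExFilter.class);
            RegExFilter.setRegexs(setting, null, null, null, "1.*", false); // value regex only; null fields are not checked
            scanner.addScanIterator(setting);
            return scanner;
        }
    }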

--
Christopher L Tubbs II
http://gravatar.com/ctubbsii


On Sun, Jan 26, 2014 at 7:39 PM, Jamie Johnson <je...@gmail.com> wrote:
> After a little reading...if I use fetchColumnFamily does that skip any rows
> that does not have the column family?
>
> On Jan 26, 2014 7:27 PM, "Jamie Johnson" <je...@gmail.com> wrote:
>>
>> Thanks for the ideas.  Filters are client side right?
>>
>> I need to read the documentation more as I don't know how to just query a
>> column family.  Would it be possible to get all terms that start with a
>> particular value?  I was thinking that we would need a special prefix for
>> this but if something could be done without needing it that would work well.
>>
>> On Jan 26, 2014 5:44 PM, "Christopher" <ct...@apache.org> wrote:
>>>
>>> Ah, I see. Well, you could do that with a custom filter (iterator),
>>> but otherwise, no, not unless you had some other special per-term
>>> entry to query (rather than per-term/document pair). The design of
>>> this kind of table though, seems focused on finding documents which
>>> contain the given terms, though, not listing all terms seen. If you
>>> need that additional feature and don't want to write a custom filter,
>>> you could achieve that by putting a special entry in its own row for
>>> each term, in addition to the entries per-term/document pair, as in:
>>>
>>> RowID                       ColumnFamily     Column Qualifier     Value
>>> <term1>                    term                   -                            -
>>> <term1>=<doc_id2>   index                  count                     5
>>>
>>> Then, you could list terms by querying the "term" column family
>>> without getting duplicates. And, you could get decent performance with
>>> this scan if you put the "term" column family and the "index" column
>>> family in separate locality groups. You could even make this entry an
>>> aggregated count for all documents (see documentation for combiners),
>>> in case you want corpus-wide term frequencies (for something like
>>> TF-IDF computations).
>>>
>>> --
>>> Christopher L Tubbs II
>>> http://gravatar.com/ctubbsii
>>>
>>>
>>> On Sun, Jan 26, 2014 at 7:55 AM, Jamie Johnson <je...@gmail.com> wrote:
>>> > I mean if a user asked for all terms that started with "term" is there
>>> > a way
>>> > to get term1 and term2 just once while scanning or would I get each
>>> > twice,
>>> > once for each docid and need to filter client side?
>>> >
>>> > On Jan 26, 2014 1:33 AM, "Christopher" <ct...@apache.org> wrote:
>>> >>
>>> >> If you use the Range constructor that takes two arguments, then yes,
>>> >> you'd get two entries. However, "count" would come before "doc_id",
>>> >> though, because the qualifier is part of the Key, and therefore, part
>>> >> of the sort order. There's also a Range constructor that allows you to
>>> >> specify whether you want the startKey and endKey to be inclusive or
>>> >> exclusive.
>>> >>
>>> >> I don't know of a specific document that outlines various strategies
>>> >> that I can link to. Perhaps I'll put one together, when I get some
>>> >> spare time, if nobody else does. I think most people do a lot of
>>> >> experimentation to figure out which strategies work best.
>>> >>
>>> >> I'm not entirely sure what you mean about "getting an iterator over
>>> >> all terms without duplicates". I'm assuming you don't mean duplicate
>>> >> versions of a single entry, which is handled by the
>>> >> VersioningIterator, which should be on new tables by default, and set
>>> >> to retain the recent 1 version, to support updates. With the scheme I
>>> >> suggested, your table would look something like the following,
>>> >> instead:
>>> >>
>>> >> RowID                       ColumnFamily     Column Qualifier     Value
>>> >> <term1>=<doc_id1>   index                  count                     10
>>> >> <term1>=<doc_id2>   index                  count                     5
>>> >> <term2>=<doc_id3>   index                  count                     3
>>> >> <term3>=<doc_id1>   index                  count                     12
>>> >>
>>> >> With this scheme, you'd have only a single entry (a count) for each
>>> >> row, and a single row for each term/document combination, so you
>>> >> wouldn't have any duplicate counts for any given term/document. If
>>> >> that's what you mean by duplicates...
>>> >>
>>> >>
>>> >> --
>>> >> Christopher L Tubbs II
>>> >> http://gravatar.com/ctubbsii
>>> >>
>>> >>
>>> >> On Sat, Jan 25, 2014 at 12:19 AM, Jamie Johnson <je...@gmail.com>
>>> >> wrote:
>>> >> > Thanks for the reply Chris.  Say I had the following
>>> >> >
>>> >> > RowID     ColumnFamily     Column Qualifier     Value
>>> >> > term         Occurrence~1     doc_id                    1
>>> >> > term         Occurrence~1     count                      10
>>> >> > term2       Occurrence~2      doc_id                     2
>>> >> > term2       Occurrence~2      count                      1
>>> >> >
>>> >> > creating a scanner with start key new Key(new Text("term"), new
>>> >> > Text("Occurrence~1")) and end key new Key(new Text("term"), new
>>> >> > Text("Occurrence~1")) I would get an iterator with two entries, the
>>> >> > first
>>> >> > key would be doc_id and the second would be count.  Is that
>>> >> > accurate?
>>> >> >
>>> >> > In regards to the other strategies is there anywhere that some of
>>> >> > these
>>> >> > are
>>> >> > captured?  Also in the your example, how would you go about getting
>>> >> > an
>>> >> > iterator over all terms without duplicates?  Again thanks
>>> >> >
>>> >> >
>>> >> > On Fri, Jan 24, 2014 at 11:34 PM, Christopher <ct...@apache.org>
>>> >> > wrote:
>>> >> >>
>>> >> >> It's not quite clear what you mean by "load", but I think you mean
>>> >> >> "iterate over"?
>>> >> >>
>>> >> >> A simplified explanation is this:
>>> >> >>
>>> >> >> When you scan an Accumulo table, you are streaming each entry
>>> >> >> (Key/Value pair), one at a time, through your client code. They are
>>> >> >> only held in memory if you do that yourself in your client code. A
>>> >> >> row
>>> >> >> in Accumulo is the set of entries that share a particular value of
>>> >> >> the
>>> >> >> Row portion of the Key. They are logically grouped, but are not
>>> >> >> grouped in memory unless you do that.
>>> >> >>
>>> >> >> One additional note is regarding your index schema of a row being a
>>> >> >> search term and columns being documents. You will likely have
>>> >> >> issues
>>> >> >> with this strategy, as the number of documents for high frequency
>>> >> >> terms grows, because tablets do not split in the middle of a row.
>>> >> >> With
>>> >> >> your schema, a row could get too large to manage on a single tablet
>>> >> >> server. A slight variation, like concatenating the search term with
>>> >> >> a
>>> >> >> document identifier in the row (term=doc1, term=doc2, ....) would
>>> >> >> allow the high frequency terms to split into multiple tablets if
>>> >> >> they
>>> >> >> get too large. There are better strategies, but that's just one
>>> >> >> simple
>>> >> >> option.
>>> >> >>
>>> >> >>
>>> >> >> --
>>> >> >> Christopher L Tubbs II
>>> >> >> http://gravatar.com/ctubbsii
>>> >> >>
>>> >> >>
>>> >> >> On Fri, Jan 24, 2014 at 10:23 PM, Jamie Johnson <je...@gmail.com>
>>> >> >> wrote:
>>> >> >> > If I have a row that as the key is a particular term and a set of
>>> >> >> > columns
>>> >> >> > that stores the documents that the term appears in if I load the
>>> >> >> > row
>>> >> >> > is
>>> >> >> > the
>>> >> >> > contents of all of the columns also loaded?  Is there a way to
>>> >> >> > page
>>> >> >> > over
>>> >> >> > the
>>> >> >> > columns such that only N columns are in memory at any point?  In
>>> >> >> > this
>>> >> >> > particular case the documents are all in a particular column
>>> >> >> > family
>>> >> >> > (say
>>> >> >> > docs) and the column qualifier is created dynamically, for
>>> >> >> > arguments
>>> >> >> > sake we
>>> >> >> > can say they are UUIDs.
>>> >> >
>>> >> >

Re: scanner question in regards to columns loaded

Posted by Jamie Johnson <je...@gmail.com>.
After a little reading...if I use fetchColumnFamily, does that skip any rows
that do not have the column family?
On Jan 26, 2014 7:27 PM, "Jamie Johnson" <je...@gmail.com> wrote:

> Thanks for the ideas.  Filters are client side right?
>
> I need to read the documentation more as I don't know how to just query a
> column family.  Would it be possible to get all terms that start with a
> particular value?  I was thinking that we would need a special prefix for
> this but if something could be done without needing it that would work well.
> On Jan 26, 2014 5:44 PM, "Christopher" <ct...@apache.org> wrote:
>
>> Ah, I see. Well, you could do that with a custom filter (iterator),
>> but otherwise, no, not unless you had some other special per-term
>> entry to query (rather than per-term/document pair). The design of
>> this kind of table though, seems focused on finding documents which
>> contain the given terms, though, not listing all terms seen. If you
>> need that additional feature and don't want to write a custom filter,
>> you could achieve that by putting a special entry in its own row for
>> each term, in addition to the entries per-term/document pair, as in:
>>
>> RowID                       ColumnFamily     Column Qualifier     Value
>> <term1>                    term                   -                            -
>> <term1>=<doc_id2>   index                  count                     5
>>
>> Then, you could list terms by querying the "term" column family
>> without getting duplicates. And, you could get decent performance with
>> this scan if you put the "term" column family and the "index" column
>> family in separate locality groups. You could even make this entry an
>> aggregated count for all documents (see documentation for combiners),
>> in case you want corpus-wide term frequencies (for something like
>> TF-IDF computations).
>>
>> --
>> Christopher L Tubbs II
>> http://gravatar.com/ctubbsii
>>
>>
>> On Sun, Jan 26, 2014 at 7:55 AM, Jamie Johnson <je...@gmail.com> wrote:
>> > I mean if a user asked for all terms that started with "term" is there
>> a way
>> > to get term1 and term2 just once while scanning or would I get each
>> twice,
>> > once for each docid and need to filter client side?
>> >
>> > On Jan 26, 2014 1:33 AM, "Christopher" <ct...@apache.org> wrote:
>> >>
>> >> If you use the Range constructor that takes two arguments, then yes,
>> >> you'd get two entries. However, "count" would come before "doc_id",
>> >> though, because the qualifier is part of the Key, and therefore, part
>> >> of the sort order. There's also a Range constructor that allows you to
>> >> specify whether you want the startKey and endKey to be inclusive or
>> >> exclusive.
>> >>
>> >> I don't know of a specific document that outlines various strategies
>> >> that I can link to. Perhaps I'll put one together, when I get some
>> >> spare time, if nobody else does. I think most people do a lot of
>> >> experimentation to figure out which strategies work best.
>> >>
>> >> I'm not entirely sure what you mean about "getting an iterator over
>> >> all terms without duplicates". I'm assuming you don't mean duplicate
>> >> versions of a single entry, which is handled by the
>> >> VersioningIterator, which should be on new tables by default, and set
>> >> to retain the recent 1 version, to support updates. With the scheme I
>> >> suggested, your table would look something like the following,
>> >> instead:
>> >>
>> >> RowID                       ColumnFamily     Column Qualifier     Value
>> >> <term1>=<doc_id1>   index                  count                     10
>> >> <term1>=<doc_id2>   index                  count                     5
>> >> <term2>=<doc_id3>   index                  count                     3
>> >> <term3>=<doc_id1>   index                  count                     12
>> >>
>> >> With this scheme, you'd have only a single entry (a count) for each
>> >> row, and a single row for each term/document combination, so you
>> >> wouldn't have any duplicate counts for any given term/document. If
>> >> that's what you mean by duplicates...
>> >>
>> >>
>> >> --
>> >> Christopher L Tubbs II
>> >> http://gravatar.com/ctubbsii
>> >>
>> >>
>> >> On Sat, Jan 25, 2014 at 12:19 AM, Jamie Johnson <je...@gmail.com>
>> wrote:
>> >> > Thanks for the reply Chris.  Say I had the following
>> >> >
>> >> > RowID     ColumnFamily     Column Qualifier     Value
>> >> > term         Occurrence~1     doc_id                    1
>> >> > term         Occurrence~1     count                      10
>> >> > term2       Occurrence~2      doc_id                     2
>> >> > term2       Occurrence~2      count                      1
>> >> >
>> >> > creating a scanner with start key new Key(new Text("term"), new
>> >> > Text("Occurrence~1")) and end key new Key(new Text("term"), new
>> >> > Text("Occurrence~1")) I would get an iterator with two entries, the
>> >> > first
>> >> > key would be doc_id and the second would be count.  Is that accurate?
>> >> >
>> >> > In regards to the other strategies is there anywhere that some of
>> these
>> >> > are
>> >> > captured?  Also in the your example, how would you go about getting
>> an
>> >> > iterator over all terms without duplicates?  Again thanks
>> >> >
>> >> >
>> >> > On Fri, Jan 24, 2014 at 11:34 PM, Christopher <ct...@apache.org>
>> >> > wrote:
>> >> >>
>> >> >> It's not quite clear what you mean by "load", but I think you mean
>> >> >> "iterate over"?
>> >> >>
>> >> >> A simplified explanation is this:
>> >> >>
>> >> >> When you scan an Accumulo table, you are streaming each entry
>> >> >> (Key/Value pair), one at a time, through your client code. They are
>> >> >> only held in memory if you do that yourself in your client code. A
>> row
>> >> >> in Accumulo is the set of entries that share a particular value of
>> the
>> >> >> Row portion of the Key. They are logically grouped, but are not
>> >> >> grouped in memory unless you do that.
>> >> >>
>> >> >> One additional note is regarding your index schema of a row being a
>> >> >> search term and columns being documents. You will likely have issues
>> >> >> with this strategy, as the number of documents for high frequency
>> >> >> terms grows, because tablets do not split in the middle of a row.
>> With
>> >> >> your schema, a row could get too large to manage on a single tablet
>> >> >> server. A slight variation, like concatenating the search term with
>> a
>> >> >> document identifier in the row (term=doc1, term=doc2, ....) would
>> >> >> allow the high frequency terms to split into multiple tablets if
>> they
>> >> >> get too large. There are better strategies, but that's just one
>> simple
>> >> >> option.
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Christopher L Tubbs II
>> >> >> http://gravatar.com/ctubbsii
>> >> >>
>> >> >>
>> >> >> On Fri, Jan 24, 2014 at 10:23 PM, Jamie Johnson <je...@gmail.com>
>> >> >> wrote:
>> >> >> > If I have a row that as the key is a particular term and a set of
>> >> >> > columns
>> >> >> > that stores the documents that the term appears in if I load the
>> row
>> >> >> > is
>> >> >> > the
>> >> >> > contents of all of the columns also loaded?  Is there a way to
>> page
>> >> >> > over
>> >> >> > the
>> >> >> > columns such that only N columns are in memory at any point?  In
>> this
>> >> >> > particular case the documents are all in a particular column
>> family
>> >> >> > (say
>> >> >> > docs) and the column qualifier is created dynamically, for
>> arguments
>> >> >> > sake we
>> >> >> > can say they are UUIDs.
>> >> >
>> >> >
>>
>

Re: scanner question in regards to columns loaded

Posted by Jamie Johnson <je...@gmail.com>.
Thanks for the ideas.  Filters are client side, right?

I need to read the documentation more, as I don't know how to just query a
column family.  Would it be possible to get all terms that start with a
particular value?  I was thinking that we would need a special prefix for
this, but if something could be done without needing it, that would work well.
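
For the "starts with" part, one simple option is a range scan that the client
cuts off once the prefix stops matching (a sketch, assuming the term is at the
start of the row ID, a hypothetical table named "termIndex", and an existing
Connector):

    import java.util.Map.Entry;

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.client.TableNotFoundException;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;
    import org.apache.hadoop.io.Text;

    public class PrefixScanExample {
        // Scans rows whose row ID starts with the given prefix: start the range at
        // the prefix and stop (client side) at the first row that no longer matches.
        public static void scanPrefix(Connector conn, String prefix) throws TableNotFoundException {
            Scanner scanner = conn.createScanner("termIndex", new Authorizations()); // hypothetical table name
            scanner.setRange(new Range(new Text(prefix), null)); // open-ended range starting at the prefix
            for (Entry<Key, Value> entry : scanner) {
                if (!entry.getKey().getRow().toString().startsWith(prefix)) {
                    break; // past the last row with this prefix
                }
                System.out.println(entry.getKey().getRow());
            }
        }
    }
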
On Jan 26, 2014 5:44 PM, "Christopher" <ct...@apache.org> wrote:

> Ah, I see. Well, you could do that with a custom filter (iterator),
> but otherwise, no, not unless you had some other special per-term
> entry to query (rather than per-term/document pair). The design of
> this kind of table though, seems focused on finding documents which
> contain the given terms, though, not listing all terms seen. If you
> need that additional feature and don't want to write a custom filter,
> you could achieve that by putting a special entry in its own row for
> each term, in addition to the entries per-term/document pair, as in:
>
> RowID                       ColumnFamily     Column Qualifier     Value
> <term1>                    term                   -                            -
> <term1>=<doc_id2>   index                  count                     5
>
> Then, you could list terms by querying the "term" column family
> without getting duplicates. And, you could get decent performance with
> this scan if you put the "term" column family and the "index" column
> family in separate locality groups. You could even make this entry an
> aggregated count for all documents (see documentation for combiners),
> in case you want corpus-wide term frequencies (for something like
> TF-IDF computations).
>
> --
> Christopher L Tubbs II
> http://gravatar.com/ctubbsii
>
>
> On Sun, Jan 26, 2014 at 7:55 AM, Jamie Johnson <je...@gmail.com> wrote:
> > I mean if a user asked for all terms that started with "term" is there a
> way
> > to get term1 and term2 just once while scanning or would I get each
> twice,
> > once for each docid and need to filter client side?
> >
> > On Jan 26, 2014 1:33 AM, "Christopher" <ct...@apache.org> wrote:
> >>
> >> If you use the Range constructor that takes two arguments, then yes,
> >> you'd get two entries. However, "count" would come before "doc_id",
> >> though, because the qualifier is part of the Key, and therefore, part
> >> of the sort order. There's also a Range constructor that allows you to
> >> specify whether you want the startKey and endKey to be inclusive or
> >> exclusive.
> >>
> >> I don't know of a specific document that outlines various strategies
> >> that I can link to. Perhaps I'll put one together, when I get some
> >> spare time, if nobody else does. I think most people do a lot of
> >> experimentation to figure out which strategies work best.
> >>
> >> I'm not entirely sure what you mean about "getting an iterator over
> >> all terms without duplicates". I'm assuming you don't mean duplicate
> >> versions of a single entry, which is handled by the
> >> VersioningIterator, which should be on new tables by default, and set
> >> to retain the recent 1 version, to support updates. With the scheme I
> >> suggested, your table would look something like the following,
> >> instead:
> >>
> >> RowID                       ColumnFamily     Column Qualifier     Value
> >> <term1>=<doc_id1>   index                  count                     10
> >> <term1>=<doc_id2>   index                  count                     5
> >> <term2>=<doc_id3>   index                  count                     3
> >> <term3>=<doc_id1>   index                  count                     12
> >>
> >> With this scheme, you'd have only a single entry (a count) for each
> >> row, and a single row for each term/document combination, so you
> >> wouldn't have any duplicate counts for any given term/document. If
> >> that's what you mean by duplicates...
> >>
> >>
> >> --
> >> Christopher L Tubbs II
> >> http://gravatar.com/ctubbsii
> >>
> >>
> >> On Sat, Jan 25, 2014 at 12:19 AM, Jamie Johnson <je...@gmail.com>
> wrote:
> >> > Thanks for the reply Chris.  Say I had the following
> >> >
> >> > RowID     ColumnFamily     Column Qualifier     Value
> >> > term         Occurrence~1     doc_id                    1
> >> > term         Occurrence~1     count                      10
> >> > term2       Occurrence~2      doc_id                     2
> >> > term2       Occurrence~2      count                      1
> >> >
> >> > creating a scanner with start key new Key(new Text("term"), new
> >> > Text("Occurrence~1")) and end key new Key(new Text("term"), new
> >> > Text("Occurrence~1")) I would get an iterator with two entries, the
> >> > first
> >> > key would be doc_id and the second would be count.  Is that accurate?
> >> >
> >> > In regards to the other strategies is there anywhere that some of
> these
> >> > are
> >> > captured?  Also in the your example, how would you go about getting an
> >> > iterator over all terms without duplicates?  Again thanks
> >> >
> >> >
> >> > On Fri, Jan 24, 2014 at 11:34 PM, Christopher <ct...@apache.org>
> >> > wrote:
> >> >>
> >> >> It's not quite clear what you mean by "load", but I think you mean
> >> >> "iterate over"?
> >> >>
> >> >> A simplified explanation is this:
> >> >>
> >> >> When you scan an Accumulo table, you are streaming each entry
> >> >> (Key/Value pair), one at a time, through your client code. They are
> >> >> only held in memory if you do that yourself in your client code. A
> row
> >> >> in Accumulo is the set of entries that share a particular value of
> the
> >> >> Row portion of the Key. They are logically grouped, but are not
> >> >> grouped in memory unless you do that.
> >> >>
> >> >> One additional note is regarding your index schema of a row being a
> >> >> search term and columns being documents. You will likely have issues
> >> >> with this strategy, as the number of documents for high frequency
> >> >> terms grows, because tablets do not split in the middle of a row.
> With
> >> >> your schema, a row could get too large to manage on a single tablet
> >> >> server. A slight variation, like concatenating the search term with a
> >> >> document identifier in the row (term=doc1, term=doc2, ....) would
> >> >> allow the high frequency terms to split into multiple tablets if they
> >> >> get too large. There are better strategies, but that's just one
> simple
> >> >> option.
> >> >>
> >> >>
> >> >> --
> >> >> Christopher L Tubbs II
> >> >> http://gravatar.com/ctubbsii
> >> >>
> >> >>
> >> >> On Fri, Jan 24, 2014 at 10:23 PM, Jamie Johnson <je...@gmail.com>
> >> >> wrote:
> >> >> > If I have a row that as the key is a particular term and a set of
> >> >> > columns
> >> >> > that stores the documents that the term appears in if I load the
> row
> >> >> > is
> >> >> > the
> >> >> > contents of all of the columns also loaded?  Is there a way to page
> >> >> > over
> >> >> > the
> >> >> > columns such that only N columns are in memory at any point?  In
> this
> >> >> > particular case the documents are all in a particular column family
> >> >> > (say
> >> >> > docs) and the column qualifier is created dynamically, for
> arguments
> >> >> > sake we
> >> >> > can say they are UUIDs.
> >> >
> >> >
>

Re: scanner question in regards to columns loaded

Posted by Christopher <ct...@apache.org>.
Ah, I see. Well, you could do that with a custom filter (iterator),
but otherwise, no, not unless you had some other special per-term
entry to query (rather than a per-term/document pair). The design of
this kind of table, though, seems focused on finding documents which
contain the given terms, not on listing all terms seen. If you
need that additional feature and don't want to write a custom filter,
you could achieve that by putting a special entry in its own row for
each term, in addition to the per-term/document entries, as in:

RowID                       ColumnFamily     Column Qualifier     Value
<term1>                    term                   -                            -
<term1>=<doc_id2>   index                  count                     5

Then, you could list terms by querying the "term" column family
without getting duplicates. And, you could get decent performance with
this scan if you put the "term" column family and the "index" column
family in separate locality groups. You could even make this entry an
aggregated count for all documents (see documentation for combiners),
in case you want corpus-wide term frequencies (for something like
TF-IDF computations).
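
As a rough sketch of that locality-group setup (assuming a hypothetical table
named "termIndex" and an existing Connector):

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    import org.apache.accumulo.core.client.AccumuloException;
    import org.apache.accumulo.core.client.AccumuloSecurityException;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.TableNotFoundException;
    import org.apache.hadoop.io.Text;

    public class LocalityGroupExample {
        // Puts the "term" and "index" column families in separate locality groups so
        // a scan over just the "term" family does not have to read the index data.
        public static void configure(Connector conn)
                throws AccumuloException, AccumuloSecurityException, TableNotFoundException {
            Map<String, Set<Text>> groups = new HashMap<String, Set<Text>>();

            Set<Text> termGroup = new HashSet<Text>();
            termGroup.add(new Text("term"));
            groups.put("termGroup", termGroup);

            Set<Text> indexGroup = new HashSet<Text>();
            indexGroup.add(new Text("index"));
            groups.put("indexGroup", indexGroup);

            conn.tableOperations().setLocalityGroups("termIndex", groups); // hypothetical table name
        }
    }

Existing data picks up the new grouping after a compaction, and the
corpus-wide counts mentioned above could be maintained by attaching a combiner
(for example SummingCombiner) to the "term" column family.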

--
Christopher L Tubbs II
http://gravatar.com/ctubbsii


On Sun, Jan 26, 2014 at 7:55 AM, Jamie Johnson <je...@gmail.com> wrote:
> I mean if a user asked for all terms that started with "term" is there a way
> to get term1 and term2 just once while scanning or would I get each twice,
> once for each docid and need to filter client side?
>
> On Jan 26, 2014 1:33 AM, "Christopher" <ct...@apache.org> wrote:
>>
>> If you use the Range constructor that takes two arguments, then yes,
>> you'd get two entries. However, "count" would come before "doc_id",
>> though, because the qualifier is part of the Key, and therefore, part
>> of the sort order. There's also a Range constructor that allows you to
>> specify whether you want the startKey and endKey to be inclusive or
>> exclusive.
>>
>> I don't know of a specific document that outlines various strategies
>> that I can link to. Perhaps I'll put one together, when I get some
>> spare time, if nobody else does. I think most people do a lot of
>> experimentation to figure out which strategies work best.
>>
>> I'm not entirely sure what you mean about "getting an iterator over
>> all terms without duplicates". I'm assuming you don't mean duplicate
>> versions of a single entry, which is handled by the
>> VersioningIterator, which should be on new tables by default, and set
>> to retain the recent 1 version, to support updates. With the scheme I
>> suggested, your table would look something like the following,
>> instead:
>>
>> RowID                       ColumnFamily     Column Qualifier     Value
>> <term1>=<doc_id1>   index                  count                     10
>> <term1>=<doc_id2>   index                  count                     5
>> <term2>=<doc_id3>   index                  count                     3
>> <term3>=<doc_id1>   index                  count                     12
>>
>> With this scheme, you'd have only a single entry (a count) for each
>> row, and a single row for each term/document combination, so you
>> wouldn't have any duplicate counts for any given term/document. If
>> that's what you mean by duplicates...
>>
>>
>> --
>> Christopher L Tubbs II
>> http://gravatar.com/ctubbsii
>>
>>
>> On Sat, Jan 25, 2014 at 12:19 AM, Jamie Johnson <je...@gmail.com> wrote:
>> > Thanks for the reply Chris.  Say I had the following
>> >
>> > RowID     ColumnFamily     Column Qualifier     Value
>> > term         Occurrence~1     doc_id                    1
>> > term         Occurrence~1     count                      10
>> > term2       Occurrence~2      doc_id                     2
>> > term2       Occurrence~2      count                      1
>> >
>> > creating a scanner with start key new Key(new Text("term"), new
>> > Text("Occurrence~1")) and end key new Key(new Text("term"), new
>> > Text("Occurrence~1")) I would get an iterator with two entries, the
>> > first
>> > key would be doc_id and the second would be count.  Is that accurate?
>> >
>> > In regards to the other strategies is there anywhere that some of these
>> > are
>> > captured?  Also in the your example, how would you go about getting an
>> > iterator over all terms without duplicates?  Again thanks
>> >
>> >
>> > On Fri, Jan 24, 2014 at 11:34 PM, Christopher <ct...@apache.org>
>> > wrote:
>> >>
>> >> It's not quite clear what you mean by "load", but I think you mean
>> >> "iterate over"?
>> >>
>> >> A simplified explanation is this:
>> >>
>> >> When you scan an Accumulo table, you are streaming each entry
>> >> (Key/Value pair), one at a time, through your client code. They are
>> >> only held in memory if you do that yourself in your client code. A row
>> >> in Accumulo is the set of entries that share a particular value of the
>> >> Row portion of the Key. They are logically grouped, but are not
>> >> grouped in memory unless you do that.
>> >>
>> >> One additional note is regarding your index schema of a row being a
>> >> search term and columns being documents. You will likely have issues
>> >> with this strategy, as the number of documents for high frequency
>> >> terms grows, because tablets do not split in the middle of a row. With
>> >> your schema, a row could get too large to manage on a single tablet
>> >> server. A slight variation, like concatenating the search term with a
>> >> document identifier in the row (term=doc1, term=doc2, ....) would
>> >> allow the high frequency terms to split into multiple tablets if they
>> >> get too large. There are better strategies, but that's just one simple
>> >> option.
>> >>
>> >>
>> >> --
>> >> Christopher L Tubbs II
>> >> http://gravatar.com/ctubbsii
>> >>
>> >>
>> >> On Fri, Jan 24, 2014 at 10:23 PM, Jamie Johnson <je...@gmail.com>
>> >> wrote:
>> >> > If I have a row that as the key is a particular term and a set of
>> >> > columns
>> >> > that stores the documents that the term appears in if I load the row
>> >> > is
>> >> > the
>> >> > contents of all of the columns also loaded?  Is there a way to page
>> >> > over
>> >> > the
>> >> > columns such that only N columns are in memory at any point?  In this
>> >> > particular case the documents are all in a particular column family
>> >> > (say
>> >> > docs) and the column qualifier is created dynamically, for arguments
>> >> > sake we
>> >> > can say they are UUIDs.
>> >
>> >

Re: scanner question in regards to columns loaded

Posted by Jamie Johnson <je...@gmail.com>.
I mean, if a user asked for all terms that started with "term", is there a
way to get term1 and term2 just once while scanning, or would I get each
twice (once for each docid) and need to filter client side?
On Jan 26, 2014 1:33 AM, "Christopher" <ct...@apache.org> wrote:

> If you use the Range constructor that takes two arguments, then yes,
> you'd get two entries. However, "count" would come before "doc_id",
> though, because the qualifier is part of the Key, and therefore, part
> of the sort order. There's also a Range constructor that allows you to
> specify whether you want the startKey and endKey to be inclusive or
> exclusive.
>
> I don't know of a specific document that outlines various strategies
> that I can link to. Perhaps I'll put one together, when I get some
> spare time, if nobody else does. I think most people do a lot of
> experimentation to figure out which strategies work best.
>
> I'm not entirely sure what you mean about "getting an iterator over
> all terms without duplicates". I'm assuming you don't mean duplicate
> versions of a single entry, which is handled by the
> VersioningIterator, which should be on new tables by default, and set
> to retain the recent 1 version, to support updates. With the scheme I
> suggested, your table would look something like the following,
> instead:
>
> RowID                       ColumnFamily     Column Qualifier     Value
> <term1>=<doc_id1>   index                  count                     10
> <term1>=<doc_id2>   index                  count                     5
> <term2>=<doc_id3>   index                  count                     3
> <term3>=<doc_id1>   index                  count                     12
>
> With this scheme, you'd have only a single entry (a count) for each
> row, and a single row for each term/document combination, so you
> wouldn't have any duplicate counts for any given term/document. If
> that's what you mean by duplicates...
>
>
> --
> Christopher L Tubbs II
> http://gravatar.com/ctubbsii
>
>
> On Sat, Jan 25, 2014 at 12:19 AM, Jamie Johnson <je...@gmail.com> wrote:
> > Thanks for the reply Chris.  Say I had the following
> >
> > RowID     ColumnFamily     Column Qualifier     Value
> > term         Occurrence~1     doc_id                    1
> > term         Occurrence~1     count                      10
> > term2       Occurrence~2      doc_id                     2
> > term2       Occurrence~2      count                      1
> >
> > creating a scanner with start key new Key(new Text("term"), new
> > Text("Occurrence~1")) and end key new Key(new Text("term"), new
> > Text("Occurrence~1")) I would get an iterator with two entries, the first
> > key would be doc_id and the second would be count.  Is that accurate?
> >
> > In regards to the other strategies is there anywhere that some of these
> are
> > captured?  Also in the your example, how would you go about getting an
> > iterator over all terms without duplicates?  Again thanks
> >
> >
> > On Fri, Jan 24, 2014 at 11:34 PM, Christopher <ct...@apache.org>
> wrote:
> >>
> >> It's not quite clear what you mean by "load", but I think you mean
> >> "iterate over"?
> >>
> >> A simplified explanation is this:
> >>
> >> When you scan an Accumulo table, you are streaming each entry
> >> (Key/Value pair), one at a time, through your client code. They are
> >> only held in memory if you do that yourself in your client code. A row
> >> in Accumulo is the set of entries that share a particular value of the
> >> Row portion of the Key. They are logically grouped, but are not
> >> grouped in memory unless you do that.
> >>
> >> One additional note is regarding your index schema of a row being a
> >> search term and columns being documents. You will likely have issues
> >> with this strategy, as the number of documents for high frequency
> >> terms grows, because tablets do not split in the middle of a row. With
> >> your schema, a row could get too large to manage on a single tablet
> >> server. A slight variation, like concatenating the search term with a
> >> document identifier in the row (term=doc1, term=doc2, ....) would
> >> allow the high frequency terms to split into multiple tablets if they
> >> get too large. There are better strategies, but that's just one simple
> >> option.
> >>
> >>
> >> --
> >> Christopher L Tubbs II
> >> http://gravatar.com/ctubbsii
> >>
> >>
> >> On Fri, Jan 24, 2014 at 10:23 PM, Jamie Johnson <je...@gmail.com>
> wrote:
> >> > If I have a row that as the key is a particular term and a set of
> >> > columns
> >> > that stores the documents that the term appears in if I load the row
> is
> >> > the
> >> > contents of all of the columns also loaded?  Is there a way to page
> over
> >> > the
> >> > columns such that only N columns are in memory at any point?  In this
> >> > particular case the documents are all in a particular column family
> (say
> >> > docs) and the column qualifier is created dynamically, for arguments
> >> > sake we
> >> > can say they are UUIDs.
> >
> >
>

Re: scanner question in regards to columns loaded

Posted by Christopher <ct...@apache.org>.
If you use the Range constructor that takes two arguments, then yes,
you'd get two entries. However, "count" would come before "doc_id",
because the qualifier is part of the Key and is therefore part
of the sort order. There's also a Range constructor that allows you to
specify whether you want the startKey and endKey to be inclusive or
exclusive.
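
In code, the two constructors mentioned above might be used like this (a
sketch built from the keys in the earlier example):

    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Range;
    import org.apache.hadoop.io.Text;

    public class RangeExamples {
        public static void main(String[] args) {
            Key start = new Key(new Text("term"), new Text("Occurrence~1"));
            Key end = new Key(new Text("term"), new Text("Occurrence~1"));

            // Two-argument constructor: uses the default start/end inclusivity.
            Range simple = new Range(start, end);

            // Four-argument constructor: inclusivity of each end is stated explicitly.
            Range explicit = new Range(start, true, end, true);

            System.out.println(simple);
            System.out.println(explicit);
        }
    }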

I don't know of a specific document that outlines various strategies
that I can link to. Perhaps I'll put one together, when I get some
spare time, if nobody else does. I think most people do a lot of
experimentation to figure out which strategies work best.

I'm not entirely sure what you mean about "getting an iterator over
all terms without duplicates". I'm assuming you don't mean duplicate
versions of a single entry, which is handled by the
VersioningIterator, which should be on new tables by default, and set
to retain only the most recent version, to support updates. With the scheme I
suggested, your table would look something like the following,
instead:

RowID                       ColumnFamily     Column Qualifier     Value
<term1>=<doc_id1>   index                  count                     10
<term1>=<doc_id2>   index                  count                     5
<term2>=<doc_id3>   index                  count                     3
<term3>=<doc_id1>   index                  count                     12

With this scheme, you'd have only a single entry (a count) for each
row, and a single row for each term/document combination, so you
wouldn't have any duplicate counts for any given term/document. If
that's what you mean by duplicates...


--
Christopher L Tubbs II
http://gravatar.com/ctubbsii


On Sat, Jan 25, 2014 at 12:19 AM, Jamie Johnson <je...@gmail.com> wrote:
> Thanks for the reply Chris.  Say I had the following
>
> RowID     ColumnFamily     Column Qualifier     Value
> term         Occurrence~1     doc_id                    1
> term         Occurrence~1     count                      10
> term2       Occurrence~2      doc_id                     2
> term2       Occurrence~2      count                      1
>
> creating a scanner with start key new Key(new Text("term"), new
> Text("Occurrence~1")) and end key new Key(new Text("term"), new
> Text("Occurrence~1")) I would get an iterator with two entries, the first
> key would be doc_id and the second would be count.  Is that accurate?
>
> In regards to the other strategies is there anywhere that some of these are
> captured?  Also in the your example, how would you go about getting an
> iterator over all terms without duplicates?  Again thanks
>
>
> On Fri, Jan 24, 2014 at 11:34 PM, Christopher <ct...@apache.org> wrote:
>>
>> It's not quite clear what you mean by "load", but I think you mean
>> "iterate over"?
>>
>> A simplified explanation is this:
>>
>> When you scan an Accumulo table, you are streaming each entry
>> (Key/Value pair), one at a time, through your client code. They are
>> only held in memory if you do that yourself in your client code. A row
>> in Accumulo is the set of entries that share a particular value of the
>> Row portion of the Key. They are logically grouped, but are not
>> grouped in memory unless you do that.
>>
>> One additional note is regarding your index schema of a row being a
>> search term and columns being documents. You will likely have issues
>> with this strategy, as the number of documents for high frequency
>> terms grows, because tablets do not split in the middle of a row. With
>> your schema, a row could get too large to manage on a single tablet
>> server. A slight variation, like concatenating the search term with a
>> document identifier in the row (term=doc1, term=doc2, ....) would
>> allow the high frequency terms to split into multiple tablets if they
>> get too large. There are better strategies, but that's just one simple
>> option.
>>
>>
>> --
>> Christopher L Tubbs II
>> http://gravatar.com/ctubbsii
>>
>>
>> On Fri, Jan 24, 2014 at 10:23 PM, Jamie Johnson <je...@gmail.com> wrote:
>> > If I have a row that as the key is a particular term and a set of
>> > columns
>> > that stores the documents that the term appears in if I load the row is
>> > the
>> > contents of all of the columns also loaded?  Is there a way to page over
>> > the
>> > columns such that only N columns are in memory at any point?  In this
>> > particular case the documents are all in a particular column family (say
>> > docs) and the column qualifier is created dynamically, for arguments
>> > sake we
>> > can say they are UUIDs.
>
>

Re: scanner question in regards to columns loaded

Posted by Jamie Johnson <je...@gmail.com>.
Thanks for the reply Chris.  Say I had the following

RowID     ColumnFamily     Column Qualifier     Value
term         Occurrence~1     doc_id                    1
term         Occurrence~1     count                      10
term2       Occurrence~2      doc_id                     2
term2       Occurrence~2      count                      1

Creating a scanner with start key new Key(new Text("term"), new
Text("Occurrence~1")) and end key new Key(new Text("term"), new
Text("Occurrence~1")), I would get an iterator with two entries: the first
key would be doc_id and the second would be count.  Is that accurate?

In regard to the other strategies, is there anywhere that some of these are
captured?  Also, in your example, how would you go about getting an
iterator over all terms without duplicates?  Again, thanks.


On Fri, Jan 24, 2014 at 11:34 PM, Christopher <ct...@apache.org> wrote:

> It's not quite clear what you mean by "load", but I think you mean
> "iterate over"?
>
> A simplified explanation is this:
>
> When you scan an Accumulo table, you are streaming each entry
> (Key/Value pair), one at a time, through your client code. They are
> only held in memory if you do that yourself in your client code. A row
> in Accumulo is the set of entries that share a particular value of the
> Row portion of the Key. They are logically grouped, but are not
> grouped in memory unless you do that.
>
> One additional note is regarding your index schema of a row being a
> search term and columns being documents. You will likely have issues
> with this strategy, as the number of documents for high frequency
> terms grows, because tablets do not split in the middle of a row. With
> your schema, a row could get too large to manage on a single tablet
> server. A slight variation, like concatenating the search term with a
> document identifier in the row (term=doc1, term=doc2, ....) would
> allow the high frequency terms to split into multiple tablets if they
> get too large. There are better strategies, but that's just one simple
> option.
>
>
> --
> Christopher L Tubbs II
> http://gravatar.com/ctubbsii
>
>
> On Fri, Jan 24, 2014 at 10:23 PM, Jamie Johnson <je...@gmail.com> wrote:
> > If I have a row that as the key is a particular term and a set of columns
> > that stores the documents that the term appears in if I load the row is
> the
> > contents of all of the columns also loaded?  Is there a way to page over
> the
> > columns such that only N columns are in memory at any point?  In this
> > particular case the documents are all in a particular column family (say
> > docs) and the column qualifier is created dynamically, for arguments
> sake we
> > can say they are UUIDs.
>

Re: scanner question in regards to columns loaded

Posted by Sean Busbey <bu...@cloudera.com>.
One small addendum to Christopher's explanation:

If you are using the IsolatedScanner[1], then the entire row will be
buffered on the client side. If you have configured a table to use the
WholeRowIterator[2] in order to gain isolation guarantees while using e.g.
the BatchScanner for performance reasons, then that buffering instead has to
happen on the tablet servers.

Note that neither of these is configured for use by default.

[1]:

http://accumulo.apache.org/1.5/accumulo_user_manual.html#_isolated_scanner
http://accumulo.apache.org/1.5/examples/isolation.html
http://accumulo.apache.org/1.5/apidocs/org/apache/accumulo/core/client/IsolatedScanner.html

[2]:
http://accumulo.apache.org/1.5/apidocs/org/apache/accumulo/core/iterators/user/WholeRowIterator.html
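
A rough sketch of both options (assuming a hypothetical table named
"termIndex" and an existing Connector):

    import java.io.IOException;
    import java.util.Map.Entry;
    import java.util.SortedMap;

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.IsolatedScanner;
    import org.apache.accumulo.core.client.IteratorSetting;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.client.TableNotFoundException;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.iterators.user.WholeRowIterator;
    import org.apache.accumulo.core.security.Authorizations;

    public class IsolationExamples {
        // Option 1: IsolatedScanner buffers each row on the client to give row isolation.
        public static Scanner isolated(Connector conn) throws TableNotFoundException {
            return new IsolatedScanner(conn.createScanner("termIndex", new Authorizations()));
        }

        // Option 2: WholeRowIterator packs each row into a single Key/Value on the
        // tablet server; the client unpacks it with decodeRow().
        public static void wholeRows(Connector conn) throws TableNotFoundException, IOException {
            Scanner scanner = conn.createScanner("termIndex", new Authorizations());
            scanner.addScanIterator(new IteratorSetting(50, "wholeRows", WholeRowIterator.class));
            for (Entry<Key, Value> rowEntry : scanner) {
                SortedMap<Key, Value> row = WholeRowIterator.decodeRow(rowEntry.getKey(), rowEntry.getValue());
                System.out.println(row.size() + " entries in row " + rowEntry.getKey().getRow());
            }
        }
    }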

-Sean

On Fri, Jan 24, 2014 at 10:34 PM, Christopher <ct...@apache.org> wrote:

> It's not quite clear what you mean by "load", but I think you mean
> "iterate over"?
>
> A simplified explanation is this:
>
> When you scan an Accumulo table, you are streaming each entry
> (Key/Value pair), one at a time, through your client code. They are
> only held in memory if you do that yourself in your client code. A row
> in Accumulo is the set of entries that share a particular value of the
> Row portion of the Key. They are logically grouped, but are not
> grouped in memory unless you do that.
>
> One additional note is regarding your index schema of a row being a
> search term and columns being documents. You will likely have issues
> with this strategy, as the number of documents for high frequency
> terms grows, because tablets do not split in the middle of a row. With
> your schema, a row could get too large to manage on a single tablet
> server. A slight variation, like concatenating the search term with a
> document identifier in the row (term=doc1, term=doc2, ....) would
> allow the high frequency terms to split into multiple tablets if they
> get too large. There are better strategies, but that's just one simple
> option.
>
>
> --
> Christopher L Tubbs II
> http://gravatar.com/ctubbsii
>
>
> On Fri, Jan 24, 2014 at 10:23 PM, Jamie Johnson <je...@gmail.com> wrote:
> > If I have a row that as the key is a particular term and a set of columns
> > that stores the documents that the term appears in if I load the row is
> the
> > contents of all of the columns also loaded?  Is there a way to page over
> the
> > columns such that only N columns are in memory at any point?  In this
> > particular case the documents are all in a particular column family (say
> > docs) and the column qualifier is created dynamically, for arguments
> sake we
> > can say they are UUIDs.
>

Re: scanner question in regards to columns loaded

Posted by Christopher <ct...@apache.org>.
It's not quite clear what you mean by "load", but I think you mean
"iterate over"?

A simplified explanation is this:

When you scan an Accumulo table, you are streaming each entry
(Key/Value pair), one at a time, through your client code. They are
only held in memory if you do that yourself in your client code. A row
in Accumulo is the set of entries that share a particular value of the
Row portion of the Key. They are logically grouped, but are not
grouped in memory unless you do that.
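
For example, iterating a single row with a plain Scanner streams its entries
batch by batch rather than materializing the whole row (a sketch; the
"termIndex" table and "docs" family are hypothetical, and a Connector is
assumed to already exist):

    import java.util.Map.Entry;

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.client.TableNotFoundException;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;
    import org.apache.hadoop.io.Text;

    public class RowStreamingExample {
        // Streams the columns of a single row without holding the whole row in memory.
        public static void scanOneRow(Connector conn, String term) throws TableNotFoundException {
            Scanner scanner = conn.createScanner("termIndex", new Authorizations()); // hypothetical table name
            scanner.setRange(new Range(new Text(term)));    // all entries of this one row
            scanner.fetchColumnFamily(new Text("docs"));    // only the "docs" family
            scanner.setBatchSize(100);                      // roughly how many entries are fetched per round trip
            for (Entry<Key, Value> entry : scanner) {
                // Entries arrive one at a time; only the current batch is held client side.
                System.out.println(entry.getKey().getColumnQualifier() + " -> " + entry.getValue());
            }
        }
    }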

One additional note concerns your index schema of a row being a
search term and columns being documents. You will likely have issues
with this strategy, as the number of documents for high frequency
terms grows, because tablets do not split in the middle of a row. With
your schema, a row could get too large to manage on a single tablet
server. A slight variation, like concatenating the search term with a
document identifier in the row (term=doc1, term=doc2, ....) would
allow the high frequency terms to split into multiple tablets if they
get too large. There are better strategies, but that's just one simple
option.
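
A sketch of writing that variation ("termIndex" is a hypothetical table name;
this uses the 1.5-style createBatchWriter(table, BatchWriterConfig), whereas
the 1.4 API takes max memory/latency/threads arguments instead):

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.MutationsRejectedException;
    import org.apache.accumulo.core.client.TableNotFoundException;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;
    import org.apache.hadoop.io.Text;

    public class TermDocIndexWriter {
        // Writes one row per term/document pair, e.g. "term1=doc42", so that hot
        // terms can split across tablets instead of growing a single huge row.
        public static void writeCount(Connector conn, String term, String docId, long count)
                throws TableNotFoundException, MutationsRejectedException {
            BatchWriter writer = conn.createBatchWriter("termIndex", new BatchWriterConfig()); // hypothetical table
            try {
                Mutation m = new Mutation(new Text(term + "=" + docId));
                m.put(new Text("index"), new Text("count"), new Value(Long.toString(count).getBytes()));
                writer.addMutation(m);
            } finally {
                writer.close();
            }
        }
    }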


--
Christopher L Tubbs II
http://gravatar.com/ctubbsii


On Fri, Jan 24, 2014 at 10:23 PM, Jamie Johnson <je...@gmail.com> wrote:
> If I have a row that as the key is a particular term and a set of columns
> that stores the documents that the term appears in if I load the row is the
> contents of all of the columns also loaded?  Is there a way to page over the
> columns such that only N columns are in memory at any point?  In this
> particular case the documents are all in a particular column family (say
> docs) and the column qualifier is created dynamically, for arguments sake we
> can say they are UUIDs.