Posted to dev@accumulo.apache.org by "damodaram.sundaram@harman.com" <da...@harman.com> on 2017/07/06 06:39:38 UTC

Sorted RowId suffix retrieval using Server Side Iterators

We are storing RDF statement data in Accumulo in POS (Predicate, Object,
Subject) order. The table is designed to hold 100 million records.

Ex:
p1|o1|s1
p1|o1|s5
p1|o2|s3
p1|o2|s2
p2|o1|s4

The data is sorted based on the first two parts of the key (p1 & o1, etc.).

When I apply a prefix range (p1|o1 to p2|o1), I get the subjects in the
order [s1, s5, s3, s2, s4].

With that order, my scan would go back and forth on the table, so I would
like to get the list of subjects as [s1, s2, s3, s4, s5] while reading
through the iterators.

Is there any way I can get the above result?

Also, on the same table, if I apply a Range filter I get separately ordered
sets like [s2, s3, s5] and [s200, s150, s500]. Even in this case, how can I
make the scanner read the data in a single sorted order?

--
View this message in context: http://apache-accumulo.1065345.n5.nabble.com/Sorted-RowId-suffix-retrieval-using-Server-Side-Iterators-tp21787.html
Sent from the Developers mailing list archive at Nabble.com.

Re: Sorted RowId suffix retrieval using Server Side Iterators

Posted by Josh Elser <jo...@gmail.com>.
On Mon, Jul 10, 2017 at 6:15 AM, Dylan Hutchison
<dh...@cs.washington.edu> wrote:
> You might be able to take a batched approach, using server-side iterators
> to gather as many S's from POS rows as possible at each tablet server up to
> a memory budget, and then querying the SPO table from inside those
> iterators.  (With some caution to be mindful of tablet server thread
> limits, you can scan another table from inside a server-side iterator.)
>  This likely has the effect of querying the same SPO data multiple times,
> which may or may not be acceptable.
>
> Another alternative is a MapReduce job.
>
> By the way, you don't necessarily need to sort the S's in order to query
> the SPO table.  It depends on how you do the query, such as by providing a
> collection of ranges to a Scanner / BatchScanner or doing server-side
> filtering.

+1 to that. Dropping the requirement to get a sorted list of subjects
for some pair P-O would make a server-side filter much easier. You can
also play tricks like doing a "limited" deduplication server-side. You
can hold up to N subjects server-side to avoid running out of memory,
and then perform a final deduplication client-side.
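
A rough sketch of the "limited" deduplication idea, in plain Python rather
than against the Accumulo iterator API. The names bounded_dedup,
client_dedup and max_held are made up for illustration, and the policy of
clearing the set when it is full is only one possible assumption about how
the N-subject limit could be enforced.

```python
def bounded_dedup(subjects, max_held):
    """Pass subjects through, dropping repeats that are still remembered.

    At most max_held distinct subjects are remembered at any time. When the
    limit is reached the memory is cleared, so some duplicates can still
    leak through and have to be removed later.
    """
    seen = set()
    for s in subjects:
        if s in seen:
            continue
        if len(seen) >= max_held:
            seen.clear()
        seen.add(s)
        yield s


def client_dedup(partial_streams):
    """Final pass on the client: union of all partially deduplicated streams."""
    result = set()
    for stream in partial_streams:
        result.update(stream)
    return result


if __name__ == "__main__":
    server_a = bounded_dedup(["s1", "s1", "s2", "s3", "s1"], max_held=2)
    server_b = bounded_dedup(["s2", "s4", "s4"], max_held=2)
    print(sorted(client_dedup([server_a, server_b])))
```

The server-side part only reduces the volume sent back; correctness comes
from the final pass on the client, which still needs room for the distinct
subjects.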

Re: Sorted RowId suffix retrieval using Server Side Iterators

Posted by Dylan Hutchison <dh...@cs.washington.edu>.
You might be able to take a batched approach, using server-side iterators
to gather as many S's from POS rows as possible at each tablet server up to
a memory budget, and then querying the SPO table from inside those
iterators.  (With some caution to be mindful of tablet server thread
limits, you can scan another table from inside a server-side iterator.)
 This likely has the effect of querying the same SPO data multiple times,
which may or may not be acceptable.
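
The batching part of this could look roughly like the sketch below. It is
plain Python, not iterator code, and it makes two assumptions: the budget is
counted in rows rather than bytes, and rows use "|" as the separator as in
the example data. subject_of and sorted_batches are hypothetical names.

```python
from itertools import islice


def subject_of(pos_row, sep="|"):
    """Return the S part of a 'P|O|S' row."""
    return pos_row.rsplit(sep, 1)[1]


def sorted_batches(pos_rows, budget):
    """Yield sorted, de-duplicated batches of subjects.

    Each batch is built from at most `budget` consecutive POS rows, so only
    one batch is held in memory at a time.
    """
    rows = iter(pos_rows)
    while True:
        batch = sorted({subject_of(r) for r in islice(rows, budget)})
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    rows = ["p1|o1|s1", "p1|o1|s5", "p1|o2|s3", "p1|o2|s2", "p2|o1|s4"]
    for batch in sorted_batches(rows, budget=3):
        # Each batch would drive one round of SPO lookups.
        print(batch)
```

Subjects are sorted only within a batch. A subject that appears in two
batches is looked up twice, which is the repeated SPO querying mentioned
above.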

Another alternative is a MapReduce job.

By the way, you don't necessarily need to sort the S's in order to query
the SPO table.  It depends on how you do the query, such as by providing a
collection of ranges to a Scanner / BatchScanner or doing server-side
filtering.
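
To illustrate why the order of the S's need not matter, here is a small
simulation with a sorted Python list standing in for the SPO table. It does
not use the Accumulo client API; prefix_range and scan_ranges are
hypothetical helpers, and the "|" separator is taken from the example rows.

```python
from bisect import bisect_left


def prefix_range(subject, sep="|"):
    """Return (start, end) covering every row that begins with 'subject|'."""
    start = subject + sep
    # Smallest string that sorts after every string starting with `start`.
    end = start[:-1] + chr(ord(start[-1]) + 1)
    return start, end


def scan_ranges(sorted_rows, ranges):
    """Collect the rows falling in any of the half-open [start, end) ranges."""
    hits = []
    for start, end in ranges:
        lo = bisect_left(sorted_rows, start)
        hi = bisect_left(sorted_rows, end)
        hits.extend(sorted_rows[lo:hi])
    return hits


if __name__ == "__main__":
    spo = sorted(["s1|p1|o1", "s10|p1|o1", "s2|p1|o2", "s5|p1|o1"])
    unsorted = scan_ranges(spo, [prefix_range(s) for s in ["s5", "s1"]])
    ordered = scan_ranges(spo, [prefix_range(s) for s in ["s1", "s5"]])
    print(sorted(unsorted) == sorted(ordered))
```

The same set of rows comes back whichever order the ranges are supplied in;
only the arrival order differs. Including the separator in the prefix keeps
"s1" from also matching "s10".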

Cheers, Dylan

Re: Sorted RowId suffix retrieval using Server Side Iterators

Posted by "damodaram.sundaram@harman.com" <da...@harman.com>.
Thanks for your reply Dylan.

Are your range queries small enough to fit in memory? Not likely. A given
condition on the POS table might return a few hundred thousand results,
since the table holds about 100M records. So I might not be able to hold
them in memory for sorting, and I could run into memory issues.

My tables are built with the POS triple in the RowId, not in the column
family, because I map each cell of my relational data to a single row in
Accumulo.

The S values will be used to query the SPO table, which is stored as
(Subject|Predicate|Object), with a prefix filter on S. If my subjects are
in sorted order, querying with the ordered list of subjects would take much
less effort.

Re: Sorted RowId suffix retrieval using Server Side Iterators

Posted by Dylan Hutchison <dh...@cs.washington.edu>.
Let's see if I understand your question.  The queries are range queries on
P over the POS table.  Within each range, you would like to sort the S
values (a suffix of the Key) retrieved.

Are your range queries *small enough to fit in memory*?  If so, you could
gather all the entries in the range together, either at a client or in a
server-side iterator, and sort the S values.  The server-side iterator
approach will only work if your S values are stored in the Column portion
of the key (not the Row), because if they are stored in the Row then the
range query may hit multiple tablets which could be stored on separate
tablet servers.  Of course, you could construct a partial list of the S
values seen in each tablet.
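
As a sketch of the in-memory case, assuming rows shaped like the "P|O|S"
examples in the question: sort the S suffixes of each tablet's rows, then
merge the per-tablet partial lists on the client. This is plain Python, not
iterator code, and sorted_subjects and merge_partials are hypothetical names.

```python
import heapq


def sorted_subjects(pos_rows, sep="|"):
    """Sort the S suffixes of the rows seen in one tablet."""
    return sorted(row.rsplit(sep, 1)[1] for row in pos_rows)


def merge_partials(partials):
    """Merge already-sorted partial lists into one sorted list."""
    return list(heapq.merge(*partials))


if __name__ == "__main__":
    tablet_a = ["p1|o1|s1", "p1|o1|s5", "p1|o2|s3"]
    tablet_b = ["p1|o2|s2", "p2|o1|s4"]
    partials = [sorted_subjects(tablet_a), sorted_subjects(tablet_b)]
    print(merge_partials(partials))
```

The comparison here is lexicographic on strings, so "s150" sorts before
"s2".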

If your range queries exceed memory, then you might try an external sorting
method or create an index on S.
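
One possible shape for an external sort, as a minimal sketch: write sorted
runs of bounded size to temporary files, then stream a k-way merge over
them. The function name external_sort and the run_size parameter are
illustrative, and the code assumes the S values contain no newlines.

```python
import heapq
import tempfile
from itertools import islice


def external_sort(values, run_size):
    """Yield values in sorted order, sorting at most run_size at a time."""
    runs = []
    it = iter(values)
    while True:
        chunk = sorted(islice(it, run_size))
        if not chunk:
            break
        # Spill each sorted run to its own temporary file.
        run = tempfile.TemporaryFile(mode="w+")
        run.writelines(v + "\n" for v in chunk)
        run.seek(0)
        runs.append(run)
    try:
        streams = [(line.rstrip("\n") for line in run) for run in runs]
        # heapq.merge reads the runs lazily, one line per run at a time.
        yield from heapq.merge(*streams)
    finally:
        for run in runs:
            run.close()


if __name__ == "__main__":
    print(list(external_sort(["s1", "s5", "s3", "s2", "s4"], run_size=2)))
```

Memory use during the merge grows with the number of runs, not with the
number of values.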

The right choice depends on what you would like to do with the S values.
