Posted to user@accumulo.apache.org by Michael Ladakos <md...@gmail.com> on 2018/03/23 17:10:24 UTC

Fwd: Large numbers of authorizations

---------- Forwarded message ----------
From: Michael Ladakos <md...@gmail.com>
Date: Fri, Mar 23, 2018 at 12:32 PM
Subject: Large numbers of authorizations
To: user@accumulo.apache.org


I am somewhat new to Accumulo and have been experimenting with the
consequences of using large numbers of authorizations.

I found that a user with a large set of authorizations would take a great
deal of time to perform a scan. I tested at various increments up to
100,000 authorizations. At that point, it would take at least 25 seconds to
perform the scan, even if the table was newly created with no rows.

When that same user scans with only a small subset of its authorizations,
the scan is as fast as one run by a user who has only a small number of
authorizations.

I tried to find the code responsible for this, because I wanted to
understand the cause, but I wasn't able to track down the exact class.
Could someone explain what is happening or point me in the right
direction?

Thanks!

Re: Large numbers of authorizations

Posted by Keith Turner <ke...@deenlo.com>.
I found and fixed this issue.

https://github.com/apache/accumulo/pull/410


Re: Large numbers of authorizations

Posted by Keith Turner <ke...@deenlo.com>.
On Fri, Mar 23, 2018 at 4:06 PM, mdladakos <md...@gmail.com> wrote:
> Keith, thanks for your quick response!
>
> Maybe I wasn't clear enough or I am not understanding your explanation.
>
> What I was exploring was performing a scan with a large number of
> authorizations. While I did use tables with thousands of rows, I also ran
> scans against empty tables and they still took ~25 seconds. So shouldn't
> VisibilityEvaluator not be involved?

Gotcha.  So one possibility is that it's just taking a while to send the
auths from the client to the tserver.  The following code is the
thrift RPC to start a scan.  client.startScan() is passed
scanState.authorizations.getAuthorizationsBB(), which is the auths.
The getAuthorizationsBB() method does a copy.  So there is a copy,
then thrift has to serialize the auths, send them, and then deserialize
them on the server side, and this is done for each startScan RPC.  The
startScan call happens once per tablet; subsequent batches of key/vals
from a tablet are fetched using the continueScan RPC, which does not pass
the auths again.

https://github.com/apache/accumulo/blob/1e4d4827096bd0047c7de3e0b672263defe66634/core/src/main/java/org/apache/accumulo/core/client/impl/ThriftScanner.java#L429

It would be interesting to see how long the call to startScan takes
for your case.  Enabling trace logging for ThriftScanner will give
some insight into this.
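
To get a rough feel for the per-RPC cost described above (copy, then
serialize, send, and deserialize), here is a minimal Java sketch. It does
not use Accumulo or thrift; AuthCopySketch, the "auth000042"-style token
names, and the tablet count are all made up for illustration, and it only
measures the copy step and the payload size, not the network or thrift
overhead.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the client-side copy; not Accumulo's actual code.
final class AuthCopySketch {

  // Build n distinct 10-byte auth tokens such as "auth000042" (made-up naming).
  static List<byte[]> makeAuths(int n) {
    List<byte[]> auths = new ArrayList<>(n);
    for (int i = 0; i < n; i++) {
      auths.add(String.format("auth%06d", i).getBytes(StandardCharsets.UTF_8));
    }
    return auths;
  }

  // Copy every auth into a fresh ByteBuffer, as a defensive copy would.
  static List<ByteBuffer> copyToByteBuffers(List<byte[]> auths) {
    List<ByteBuffer> copies = new ArrayList<>(auths.size());
    for (byte[] auth : auths) {
      copies.add(ByteBuffer.wrap(auth.clone()));
    }
    return copies;
  }

  // Total payload bytes that would have to be serialized for one RPC.
  static long payloadBytes(List<ByteBuffer> buffers) {
    long total = 0;
    for (ByteBuffer b : buffers) {
      total += b.remaining();
    }
    return total;
  }

  public static void main(String[] args) {
    List<byte[]> auths = makeAuths(100_000);
    int tablets = 10; // assumed number of tablets, i.e. startScan calls
    long start = System.nanoTime();
    long bytes = 0;
    for (int t = 0; t < tablets; t++) {
      bytes += payloadBytes(copyToByteBuffers(auths));
    }
    long micros = (System.nanoTime() - start) / 1_000;
    System.out.println(bytes + " bytes copied for " + tablets
        + " simulated startScan calls in " + micros + " us");
  }
}
```

If the copy alone turns out to be cheap at this size, that would suggest
looking at serialization or at server-side work instead.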

>
> I don't think the actual filtering is the problem. Is there some work done
> by the tablet servers when receiving the scan request, specifically in
> regard to user authorizations?
>
> Again, if I used -s to pass a subset of authorizations for the user with
> 100000 authorizations, this increase in return time would be equivalent to a
> user with that number of authorizations (i.e.: If I scanned with 100
> authorizations out of the 100000, it would be the normal, fast speed)

Re: Large numbers of authorizations

Posted by Dong Zhou <dz...@gmail.com>.
Depending on how long each authorization string is, you might run into
the ZooKeeper znode storage limit.

jute.maxbuffer:

(Java system property: jute.maxbuffer)

This option can only be set as a Java system property. There is no
zookeeper prefix on it. It specifies the maximum size of the data that can
be stored in a znode. The default is 0xfffff, or just under 1M. If this
option is changed, the system property must be set on all servers and
clients otherwise problems will arise. This is really a sanity check.
ZooKeeper is designed to store data on the order of kilobytes in size.
https://zookeeper.apache.org/doc/r3.2.2/zookeeperAdmin.html
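
As a back-of-the-envelope check against that default, here is a small
Java sketch. It assumes (this is an assumption, not something confirmed
in this thread) that the auths for a user end up in a single znode,
joined by a one-byte separator; ZnodeSizeSketch and its method names are
hypothetical.

```java
// Rough size estimate only; the real on-disk encoding may differ.
final class ZnodeSizeSketch {

  // Default jute.maxbuffer quoted above: 0xfffff bytes, just under 1M.
  static final long DEFAULT_JUTE_MAXBUFFER = 0xfffffL;

  // Assumed encoding: auths joined by a one-byte separator.
  static long estimatedBytes(long authCount, long bytesPerAuth) {
    if (authCount <= 0) {
      return 0;
    }
    return authCount * bytesPerAuth + (authCount - 1);
  }

  // True when the estimate is larger than the default znode limit.
  static boolean exceedsDefaultLimit(long authCount, long bytesPerAuth) {
    return estimatedBytes(authCount, bytesPerAuth) > DEFAULT_JUTE_MAXBUFFER;
  }

  public static void main(String[] args) {
    long count = 100_000;
    for (long len : new long[] {8, 10, 16}) {
      System.out.println(count + " auths of " + len + " bytes: "
          + estimatedBytes(count, len) + " bytes, exceeds default limit: "
          + exceedsDefaultLimit(count, len));
    }
  }
}
```

Under that assumption, 100,000 auths of 10 bytes each come to 1,099,999
bytes, which is over the 1,048,575-byte default, while 8-byte auths stay
under it.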





Re: Large numbers of authorizations

Posted by Keith Turner <ke...@deenlo.com>.
On Fri, Mar 23, 2018 at 11:55 PM, Keith Turner <ke...@deenlo.com> wrote:
> On Fri, Mar 23, 2018 at 4:06 PM, mdladakos <md...@gmail.com> wrote:
>> Keith, thanks for your quick response!
>>
>> Maybe I wasn't clear enough or I am not understanding your explanation.
>>
>> What I was exploring was performing a scan with a large number of
>> authorizations. While I did use tables with thousands of rows, I also ran
>> scans against empty tables and they still took ~25 seconds. So shouldn't
>> VisibilityEvaluator not be involved?
>>
>> I don't think the actual filtering is the problem. Is there some work done
>> by the tablet servers when receiving the scan request, specifically in
>> regard to user authorizations?
>>
>> Again, if I used -s to pass a subset of authorizations for the user with
>> 100000 authorizations, this increase in return time would be equivalent to a
>> user with that number of authorizations (i.e.: If I scanned with 100
>> authorizations out of the 100000, it would be the normal, fast speed)
>
> I think the following code may be the problem. The collection
> userauths is a list, so performance will be O(M*N).  If M and N are
> 100K, then it's not good.  If userauths were a set this would be much
> faster for the case you are testing.
>
> https://github.com/apache/accumulo/blob/17bc708dcabd17824a8378597e0542002470ed18/server/base/src/main/java/org/apache/accumulo/server/security/handler/ZKAuthorizor.java#L166

This code is called on the server side to check whether the auths passed
by a scan are valid.


Re: Large numbers of authorizations

Posted by Keith Turner <ke...@deenlo.com>.
On Fri, Mar 23, 2018 at 4:06 PM, mdladakos <md...@gmail.com> wrote:
> Keith, thanks for your quick response!
>
> Maybe I wasn't clear enough or I am not understanding your explanation.
>
> What I was exploring was performing a scan with a large number of
> authorizations. While I did use tables with thousands of rows, I also ran
> scans against empty tables and they still took ~25 seconds. So shouldn't
> VisibilityEvaluator not be involved?
>
> I don't think the actual filtering is the problem. Is there some work done
> by the tablet servers when receiving the scan request, specifically in
> regard to user authorizations?
>
> Again, if I used -s to pass a subset of authorizations for the user with
> 100000 authorizations, this increase in return time would be equivalent to a
> user with that number of authorizations (i.e.: If I scanned with 100
> authorizations out of the 100000, it would be the normal, fast speed)

I think the following code may be the problem. The collection
userauths is a list, so performance will be O(M*N).  If M and N are
100K, then it's not good.  If userauths were a set this would be much
faster for the case you are testing.

https://github.com/apache/accumulo/blob/17bc708dcabd17824a8378597e0542002470ed18/server/base/src/main/java/org/apache/accumulo/server/security/handler/ZKAuthorizor.java#L166
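
To illustrate the difference between the list and set lookups, here is a
minimal, self-contained Java sketch. It is not the ZKAuthorizor code: it
uses plain strings instead of Accumulo's auth types, and AuthCheckSketch,
validWithList, and validWithSet are hypothetical names. M is the number
of auths passed by the scan and N is the number of auths the user holds.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustration only; names and types are not taken from Accumulo.
final class AuthCheckSketch {

  // O(M*N): each of the M requested auths is searched linearly in the list.
  static boolean validWithList(List<String> userAuths, List<String> requested) {
    for (String auth : requested) {
      if (!userAuths.contains(auth)) {
        return false;
      }
    }
    return true;
  }

  // O(M+N) expected: build a hash set once, then each lookup is constant time.
  static boolean validWithSet(List<String> userAuths, List<String> requested) {
    Set<String> userSet = new HashSet<>(userAuths);
    for (String auth : requested) {
      if (!userSet.contains(auth)) {
        return false;
      }
    }
    return true;
  }

  // Build n distinct made-up auth strings.
  static List<String> makeAuths(int n) {
    List<String> auths = new ArrayList<>(n);
    for (int i = 0; i < n; i++) {
      auths.add("auth" + i);
    }
    return auths;
  }

  public static void main(String[] args) {
    // 20,000 is used here to keep the slow path short; raise it to see the gap grow.
    List<String> user = makeAuths(20_000);
    List<String> requested = new ArrayList<>(user);

    long t0 = System.nanoTime();
    boolean listOk = validWithList(user, requested);
    long t1 = System.nanoTime();
    boolean setOk = validWithSet(user, requested);
    long t2 = System.nanoTime();

    System.out.println("list: " + listOk + " in " + (t1 - t0) / 1_000_000 + " ms");
    System.out.println("set:  " + setOk + " in " + (t2 - t1) / 1_000_000 + " ms");
  }
}
```

With a small requested subset (say 100 out of 100,000), M is small and
even the list version does little work, which would match the fast scans
seen when passing a subset with -s.
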

Re: Large numbers of authorizations

Posted by mdladakos <md...@gmail.com>.
Keith, thanks for your quick response! 

Maybe I wasn't clear enough or I am not understanding your explanation. 

What I was exploring was performing a scan with a large number of
authorizations. While I did use tables with thousands of rows, I also ran
scans against empty tables and they still took ~25 seconds. So shouldn't
VisibilityEvaluator not be involved?

I don't think the actual filtering is the problem. Is there some work done
by the tablet servers when receiving the scan request, specifically in
regard to user authorizations? 

Again, if I used -s to pass a subset of authorizations for the user with
100000 authorizations, the return time would be equivalent to that of a
user with only that number of authorizations (i.e., if I scanned with 100
authorizations out of the 100000, it would run at the normal, fast speed).



--
Sent from: http://apache-accumulo.1065345.n5.nabble.com/Users-f2.html

Re: Large numbers of authorizations

Posted by Keith Turner <ke...@deenlo.com>.
This is the code that scans use to filter based on column visibility
and authorizations.  It keeps a cache of previously seen column
visibilities and the decision that was made for each.

  https://github.com/apache/accumulo/blob/2e171cdb8420f817ff9ebeb23f9d8a70b0878ca5/core/src/main/java/org/apache/accumulo/core/iterators/system/VisibilityFilter.java

The following code does the evaluation.

  https://github.com/apache/accumulo/blob/f81a8ec7410e789d11941351d5899b8894c6a322/core/src/main/java/org/apache/accumulo/core/security/VisibilityEvaluator.java
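
The caching idea can be sketched in a few lines of Java. This is only an
illustration of a decision cache keyed by column visibility, not the
VisibilityFilter source: VisibilityCacheSketch is a hypothetical class,
visibilities are plain strings, and the evaluator is an arbitrary
predicate standing in for the real evaluation.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

// Illustration of a bounded decision cache; not Accumulo's implementation.
final class VisibilityCacheSketch {

  private final Predicate<String> evaluator; // stand-in for the real evaluation
  private final Map<String, Boolean> cache;
  int evaluations = 0; // counts how often the evaluator actually ran

  VisibilityCacheSketch(Predicate<String> evaluator, final int maxEntries) {
    this.evaluator = evaluator;
    // Access-ordered LinkedHashMap that evicts the least recently used entry.
    this.cache = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
        return size() > maxEntries;
      }
    };
  }

  // Return the cached decision if this visibility was seen before,
  // otherwise evaluate it once and remember the result.
  boolean accept(String visibility) {
    Boolean cached = cache.get(visibility);
    if (cached != null) {
      return cached;
    }
    evaluations++;
    boolean decision = evaluator.test(visibility);
    cache.put(visibility, decision);
    return decision;
  }

  public static void main(String[] args) {
    VisibilityCacheSketch filter =
        new VisibilityCacheSketch(v -> v.equals("public"), 1000);
    String[] seen = {"public", "secret", "public", "public", "secret"};
    for (String v : seen) {
      System.out.println(v + " -> " + filter.accept(v));
    }
    System.out.println("evaluations: " + filter.evaluations);
  }
}
```

Because the decision is remembered per distinct visibility string, the
evaluation cost is paid once per distinct visibility rather than once per
key.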
