Posted to user@accumulo.apache.org by "Williamson, Luke MR 1" <lu...@defence.gov.au> on 2013/08/14 03:58:31 UTC

Intersecting Iterators [SEC=UNCLASSIFIED]

UNCLASSIFIED

Hi,
 
I have field indexes that look something like this:
 
Row Id: <date>-<UUID>
CF: fi||<type>||<value>
CQ: <date>-<UUID>
 
For example: 

20130814-550e8400-e29b-41d4-a716-446655440000 fi||verb||run 20130814-550e8400-e29b-41d4-a716-446655440000
20130814-550e8400-e29b-41d4-a716-446655440000 page||58 line||16 "the boy can run up the hill"

From what I could determine from the doco and API, I am executing the following code to perform an intersecting query on two values:

// One Range per matching document ID, used for the follow-up scan.
Set<Range> shards = new HashSet<Range>();

// A row matches only if it contains both term column families.
Text[] terms = {new Text("fi||<type>||<value>"), new Text("fi||<type>||<value>")};

BatchScanner bs = conn.createBatchScanner(table, auths, 20);
bs.setTimeout(360, TimeUnit.SECONDS);

IteratorSetting iter = new IteratorSetting(20, "ii", IntersectingIterator.class);
IntersectingIterator.setColumnFamilies(iter, terms);
bs.addScanIterator(iter);

// An empty Range covers the whole table.
bs.setRanges(Collections.singleton(new Range()));

for (Entry<Key,Value> entry : bs) {
    shards.add(new Range(entry.getKey().getColumnQualifier()));
}

I then perform a second batch scan using the set of ranges returned by the above to get my actual results.

My issue is that the intersecting query takes several minutes to return, if it returns at all (in some cases it times out). Is this expected? Is there some way to improve performance? Is there a better way to do this sort of query?

Any guidance would be much appreciated.

Thanks

Luke


IMPORTANT: This email remains the property of the Department of Defence and is subject to the jurisdiction of section 70 of the Crimes Act 1914. If you have received this email in error, you are requested to contact the sender and delete the email.

Re: Intersecting Iterators [SEC=UNCLASSIFIED]

Posted by William Slacum <wi...@accumulo.net>.
Usually the intersecting iterator is used when you're modeling a
document-partitioned table. That is, you have relatively few row values
compared to the number of documents you're storing (on the order of
hundreds to millions of documents in a single row). It looks like you have
a single row for each document, with field indices stored in the same row
as the document.
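
To illustrate the idea, here is a rough plain-Java sketch of what
intersecting two terms inside one row amounts to. It is not the Accumulo
API: the class ShardIntersectionSketch and its methods are made up for
illustration, and the short document IDs are placeholders. Each term
column family holds a sorted set of document IDs, and a document matches
only if its ID appears under every term.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Plain-Java model of ONE row (shard) of a document-partitioned table.
// Hypothetical names; this does not use or mimic the Accumulo classes.
final class ShardIntersectionSketch {

    // term column family -> sorted document IDs (the column qualifiers)
    private final Map<String, NavigableSet<String>> index = new TreeMap<>();

    // Record that docId contains the given term.
    void add(String termFamily, String docId) {
        index.computeIfAbsent(termFamily, k -> new TreeSet<>()).add(docId);
    }

    // Return, in sorted order, the document IDs present under every term.
    List<String> intersect(String... termFamilies) {
        List<String> result = new ArrayList<>();
        if (termFamilies.length == 0) {
            return result;
        }
        NavigableSet<String> first = index.get(termFamilies[0]);
        if (first == null) {
            return result; // first term never seen in this row
        }
        for (String docId : first) {
            boolean inAll = true;
            for (int i = 1; i < termFamilies.length && inAll; i++) {
                NavigableSet<String> docs = index.get(termFamilies[i]);
                inAll = docs != null && docs.contains(docId);
            }
            if (inAll) {
                result.add(docId);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        ShardIntersectionSketch shard = new ShardIntersectionSketch();
        shard.add("fi||verb||run", "doc-1");
        shard.add("fi||verb||run", "doc-2");
        shard.add("fi||verb||run", "doc-3");
        shard.add("fi||noun||hill", "doc-2");
        shard.add("fi||noun||hill", "doc-3");
        shard.add("fi||noun||hill", "doc-4");
        // prints [doc-2, doc-3]
        System.out.println(shard.intersect("fi||verb||run", "fi||noun||hill"));
    }
}
```

With many documents per row there is a long list of IDs to intersect
locally. With one document per row, as in your table, each row has at most
one candidate ID, so the scan has to visit every row to find the matches.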

What I might suggest is something like:

Row: date
ColumnFamily (a): fi||field||data
ColumnQualifier (a): document-id
ColumnFamily (b): document-id
ColumnQualifier (b): field||data

I believe that having a 1:1 mapping between shards/rows and document IDs
can cause significant overhead when it comes to scanning, because it will
be constantly seeking within the same RFile blocks.
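
As a rough sketch of the layout suggested above, the following plain-Java
helper derives both entries for one field of one document. It makes two
assumptions that are not stated in this thread: that the document ID keeps
its existing <date>-<UUID> form, and that the row is the date prefix of
that ID. The class DocPartitionedLayoutSketch and the KeyParts holder are
hypothetical names, not Accumulo types.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of the suggested key layout. Hypothetical names only;
// a real ingest job would put these strings into Accumulo mutations.
final class DocPartitionedLayoutSketch {

    // Minimal stand-in for the row / column family / column qualifier of a key.
    static final class KeyParts {
        final String row;
        final String family;
        final String qualifier;

        KeyParts(String row, String family, String qualifier) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
        }
    }

    // Build the field-index entry (a) and the document entry (b).
    static List<KeyParts> entriesFor(String docId, String field, String data) {
        // Assumption: docId looks like <date>-<UUID>.
        int dash = docId.indexOf('-');
        if (dash <= 0) {
            throw new IllegalArgumentException("expected <date>-<UUID>: " + docId);
        }
        String date = docId.substring(0, dash);
        String fieldData = field + "||" + data;

        List<KeyParts> entries = new ArrayList<>();
        entries.add(new KeyParts(date, "fi||" + fieldData, docId)); // (a)
        entries.add(new KeyParts(date, docId, fieldData));          // (b)
        return entries;
    }

    public static void main(String[] args) {
        String docId = "20130814-550e8400-e29b-41d4-a716-446655440000";
        for (KeyParts k : entriesFor(docId, "verb", "run")) {
            System.out.println(k.row + " " + k.family + " " + k.qualifier);
        }
    }
}
```

Every document from the same date then lands in the same row, so the
intersecting iterator has many document IDs to compare per row instead of
one.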



RE: Intersecting Iterators [SEC=UNCLASSIFIED]

Posted by "Williamson, Luke MR 1" <lu...@defence.gov.au>.
UNCLASSIFIED

I have tried increasing the number of threads. That seems to guarantee the query returns before it hits the timeout, but it still takes approx. 7 minutes to complete. Looking at the Accumulo manager page, all the tablet servers appear to get hit equally (around 16 per node) and start to return, but a couple of tablet servers take longer than the others. The doco indicated this could happen, but I was hoping it wouldn't take this long.


Re: Intersecting Iterators [SEC=UNCLASSIFIED]

Posted by David Medinets <da...@gmail.com>.
I'm wondering about the 20 threads in the BatchScanner. Have you played
with increasing it? I've seen that number go above 15 per Accumulo node.
Are you seeing the scans in the Accumulo monitor? Are the scans progressing
through the Accumulo nodes?

