Posted to user@accumulo.apache.org by Sven Hodapp <sv...@scai.fraunhofer.de> on 2015/12/18 14:34:26 UTC

Re: IntersectingIterator and Ranges

Hi Billie,

I've read in the source code documentation the following:

    This iterator is commonly used with BatchScanner or AccumuloInputFormat, to parallelize the search over all shardIDs.

Does this mean that key1 and key2 (the shardIDs) should both be searched? Or is this a misunderstanding?
Shouldn't the IndexedDocIterator also search all shardIDs?

Thanks!

Regards,
Sven

-- 
Sven Hodapp, M.Sc.,
Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
Department of Bioinformatics
Schloss Birlinghoven, 53754 Sankt Augustin, Germany
sven.hodapp@scai.fraunhofer.de
www.scai.fraunhofer.de

----- Original Message -----
> From: "Billie Rinaldi" <bi...@gmail.com>
> To: "user" <us...@accumulo.apache.org>
> Sent: Wednesday, 18 November 2015 15:57:15
> Subject: Re: IntersectingIterator and Ranges

> Yes, that is the correct behavior. The IntersectingIterator intersects
> columns within a row, on a single tablet server. To get the results you
> want, you should make sure all the terms for a document are inserted with
> the same key / row. In this case, all the doc1 entries should have key1 as
> their row.
> 
> Billie
> On Nov 18, 2015 7:08 AM, "Sven Hodapp" <sv...@scai.fraunhofer.de>
> wrote:
> 
>> Hello everyone,
>>
>> I'm currently using Accumulo 1.7 (a single node for now) with the
>> IntersectingIterator.
>> The current index schema for the IntersectingIterator looks like this, for
>> example:
>>
>>     key1 : term1 : doc1
>>     key1 : term2 : doc1
>>     key2 : term3 : doc1
>>
>> I've noticed that I can't intersect terms that are in distinct key ranges.
>> Is that the correct behavior, or am I doing something wrong?
>>
>> An extract of my code (Scala) as an example:
>>
>>     val bs = conn.createBatchScanner(tableName, authorizations, numQueryThreads)
>>     val terms = List(new Text("term1"), new Text("term2")).toArray
>>
>>     val ii = new IteratorSetting(priority, name, iteratorClass)
>>     IntersectingIterator.setColumnFamilies(ii, terms)
>>     bs.addScanIterator(ii)
>>
>>     bs.setRanges(Collections.singleton(new Range()))  // all ranges
>>
>>     for (entry <- bs.asScala.take(100)) yield {
>>       entry.getKey.getColumnQualifier.toString
>>     }
>>
>> This will yield "doc1" as expected.
>>
>> But if I choose the terms like this:
>>
>>     // ...
>>     val terms = List(new Text("term1"), new Text("term3")).toArray
>>     // ...
>>
>> It yields "null", but I would also expect "doc1" here.
>> I've also tried setting a list of Range.exact ranges,
>> but I also get "null".
>>
>> Am I doing something wrong?
>>
>> Thank you in advance!
>>
>> Regards,
>> Sven

Re: IntersectingIterator and Ranges

Posted by Billie Rinaldi <bi...@gmail.com>.
Yes, all shardIDs will be searched to find documents containing term1 and
term2.  Data will not be passed from one shardID to another, so each
document must appear in only one shard.  You can read more about
document-partitioned indexing at [1] and [2].

[1]: https://accumulo.apache.org/1.7/accumulo_user_manual.html#_document_partitioned_indexing
[2]: http://nlp.stanford.edu/IR-book/html/htmledition/distributing-indexes-1.html
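
To make the point about shards concrete, here is a minimal sketch in plain Scala that models the layout discussed in this thread (row = shardID, column family = term, column qualifier = docID). It does not use any Accumulo classes. The names `Entry`, `shardFor`, `index` and `intersect`, the hash-based shard assignment, and the `numShards` parameter are all assumptions made for illustration, not part of the Accumulo API. The `intersect` function only mimics the behavior described above: terms are intersected independently inside each row, and nothing is passed from one shardID to another.

```scala
// Plain-Scala model of a document-partitioned index; no Accumulo classes are used.
object ShardedIndexSketch {
  // One index entry: row = shardID, column family = term, column qualifier = docID.
  final case class Entry(row: String, term: String, docId: String)

  // Hypothetical shard assignment: derive the row from the docID alone, so that
  // every term of a document lands in the same row. numShards is an assumed parameter.
  def shardFor(docId: String, numShards: Int): String =
    "shard" + Math.floorMod(docId.hashCode, numShards)

  // Build the index entries for one document; all of them share one row.
  def index(docId: String, terms: Seq[String], numShards: Int): Seq[Entry] =
    terms.map(t => Entry(shardFor(docId, numShards), t, docId))

  // Intersect the requested terms separately inside each row and merge the
  // per-row results. A document whose terms are split across rows is never found.
  def intersect(entries: Seq[Entry], terms: Set[String]): Set[String] =
    entries.groupBy(_.row).values.flatMap { rowEntries =>
      val docsPerTerm = terms.toSeq.map { t =>
        rowEntries.filter(_.term == t).map(_.docId).toSet
      }
      docsPerTerm.reduceOption(_ intersect _).getOrElse(Set.empty[String])
    }.toSet
}
```

With the schema from the original question (term1 and term2 under key1, term3 under key2), `intersect` finds doc1 for term1 and term2 but nothing for term1 and term3. If the entries are instead built with `index("doc1", Seq("term1", "term2", "term3"), 4)`, all three terms share one row and the second query finds doc1 as well.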

