
Posted to user@accumulo.apache.org by Yamini Joshi <ya...@gmail.com> on 2016/10/20 22:45:02 UTC

Iterator as a Filter

Hello all

Is it possible to configure an iterator that works as a filter? As per the
Accumulo docs:

"As such, the `Filter` class functions well for filtering small amounts of
data, but is inefficient for filtering large amounts of data. The decision to
use a `Filter` strongly depends on the use case and distribution of data being
filtered."

I have a huge corpus to filter, and only a small amount of the data will be
selected. I want to select only the column families that appear in a given
list. My rough idea is to use 'seek' to bypass column families that are not in
the list: seek the iterator to the range for each column family in the list
and check whether it exists. I am not sure whether this will work or whether
it is a good approach. Any feedback is much appreciated.

Best regards,
Yamini Joshi
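
A toy sketch of the seek idea described above, written in Python rather than
against Accumulo's Java iterator API. The sorted list, the key names and the
use of `bisect_left` as a stand-in for `seek()` are all assumptions made for
illustration; a real seek is not free, so whether skipping wins depends on how
far apart the wanted column families are.

```python
from bisect import bisect_left

# Toy stand-in for one tablet's sorted keys: (row, column family) pairs.
# 3 rows x 20 families; all names are made up for illustration.
KEYS = sorted((f"row{r}", f"cf{c:02d}") for r in range(3) for c in range(20))
WANTED = ["cf03", "cf17"]  # families to keep, in sorted order


def filter_scan(keys, wanted):
    """Filter style: read every key and keep the ones that match."""
    wanted_set = set(wanted)
    hits = [key for key in keys if key[1] in wanted_set]
    return hits, len(keys)  # every key was read


def seek_scan(keys, wanted):
    """Seek style: in each row, jump straight to each wanted family."""
    hits, seeks, pos = [], 0, 0
    while pos < len(keys):
        row = keys[pos][0]
        for cf in wanted:
            # bisect_left stands in for seek(): first key >= (row, cf)
            pos = bisect_left(keys, (row, cf), pos)
            seeks += 1
            while pos < len(keys) and keys[pos] == (row, cf):
                hits.append(keys[pos])
                pos += 1
        # Skip the rest of this row: first key whose row sorts after `row`
        pos = bisect_left(keys, (row + "\x00",), pos)
        seeks += 1
    return hits, seeks


filter_hits, keys_read = filter_scan(KEYS, WANTED)
seek_hits, seeks = seek_scan(KEYS, WANTED)
print(filter_hits == seek_hits, keys_read, seeks)  # True 60 9
```

Both functions return the same six entries. The filter reads all 60 keys,
while the seek version issues 9 seeks (two wanted families plus one
skip-to-next-row per row).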

Re: Iterator as a Filter

Posted by Yamini Joshi <ya...@gmail.com>.

Thank you for the reply! I'll try this and get back to you. Also, I found a
MultiIterator class. Does anyone know how it works? Would it work with a batch
scan and sort the data before passing it to other iterators?


Best regards,
Yamini Joshi

On Fri, Oct 21, 2016 at 6:35 AM, <dl...@comcast.net> wrote:

> So if I understand this correctly, for this use case, you could do the
> following:
>
> courseId    studentId      <list of courseIds>
>
> For either of your queries (1 and 2 below) you could use a BatchScanner
> with the set of Ranges being the course ids from input C. In your client
> you would add the resulting columnFamily (studentId) and columnQualifier
> (list of courses) to a map of studentId -> list of courses. For #1, you
> just need the size of the list of courses for each student. For #2, you can
> do the intersection for each student.
>
> Now, this does not work if you want to be able to update the student
> information in an online fashion. This should work though if you are able
> to simply reload the information when it is updated.

Re: Iterator as a Filter

Posted by dl...@comcast.net.

So if I understand this correctly, for this use case, you could do the following: 

row: courseId    column family: studentId    column qualifier: <list of courseIds>

For either of your queries (1 and 2 below) you could use a BatchScanner with the set of Ranges being the course ids from input C. In your client you would add the resulting columnFamily (studentId) and columnQualifier (list of courses) to a map of studentId -> list of courses. For #1, you just need the size of the list of courses for each student. For #2, you can do the intersection for each student. 

Now, this does not work if you want to be able to update the student information in an online fashion. This should work though if you are able to simply reload the information when it is updated. 
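
A minimal sketch of the client-side step described above, in plain Python
rather than the Accumulo client API. The sample enrollment data, the
comma-separated encoding of the course list and the `batch_scan` helper are
hypothetical; `batch_scan` only imitates a BatchScanner by returning the
matching entries in arbitrary order.

```python
import random

# Hypothetical enrollment data: student -> every course the student takes.
ENROLLMENT = {
    "s1": ["c1", "c2", "c3"],
    "s2": ["c2"],
    "s3": ["c4", "c5"],
}

# Denormalized table: row = courseId, family = studentId,
# qualifier = the student's full course list (comma-separated here).
TABLE = [
    (course, student, ",".join(courses))
    for student, courses in ENROLLMENT.items()
    for course in courses
]


def batch_scan(table, rows):
    """Stand-in for a BatchScanner: matching entries, in no particular order."""
    result = [entry for entry in table if entry[0] in rows]
    random.shuffle(result)
    return result


def summarize(table, c):
    """Map each matching student to (total courses, courses also in c)."""
    courses_by_student = {}
    for _course, student, course_list in batch_scan(table, c):
        # The same list arrives once per matching course; overwriting is fine.
        courses_by_student[student] = course_list.split(",")
    return {
        student: (len(courses), len(set(courses) & c))
        for student, courses in courses_by_student.items()
    }


C = {"c1", "c2"}
summary = summarize(TABLE, C)
print(sorted(summary.items()))  # [('s1', (3, 2)), ('s2', (1, 1))]
```

Because the result is keyed by studentId, the order in which entries come back
does not matter. The cost is that each student's course list is stored once
per course, which is why online updates are awkward with this layout.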

----- Original Message -----

From: "Yamini Joshi" <ya...@gmail.com> 
To: user@accumulo.apache.org 
Sent: Thursday, October 20, 2016 9:53:34 PM 
Subject: Re: Iterator as a Filter 

I have an input C, which is the list of courses that a student x is enrolled in. 
I am trying to do a computation that requires two things for each student enrolled in at least one of the courses in C: 
1. The total number of courses the student is enrolled in (the size of Y, the student's set of courses) 
2. The number of courses the student is enrolled in that belong to the list, i.e. cardinality(Y intersection C) 


Best regards, 
Yamini Joshi 


Re: Iterator as a Filter

Posted by Yamini Joshi <ya...@gmail.com>.

I have an input C, which is the list of courses that a student x is enrolled
in. I am trying to do a computation that requires two things for each student
enrolled in at least one of the courses in C:
1. The total number of courses the student is enrolled in (the size of Y,
   the student's set of courses)
2. The number of courses the student is enrolled in that belong to the list,
   i.e. cardinality(Y intersection C)


Best regards,
Yamini Joshi

On Thu, Oct 20, 2016 at 7:16 PM, Dave <dl...@comcast.net> wrote:

> I'm a little confused as to the use case here. Are you trying to find courses
> that students are taking where the students are in a particular class? The
> table design is going to depend on the set of questions that you want to
> answer.

Re: Iterator as a Filter

Posted by Yamini Joshi <ya...@gmail.com>.

I did use the inverted index, but I ran into trouble because I used a batch
scan, which returns unsorted data. Also, I need to do some computation
afterwards. Here is my problem definition:

The data is of the form:
studentID course|courseID [ ]  count
.
.
.
.
.
studentID np2| [ ]  count

So a student is registered in multiple courses. The query has the following
parameters:
Input: List of course Ids
Output: Computation on records that contain a course from the input
Algo:
Step1: Select rows that contain a course matching courses in the list
Step2: Count the number of such courses for each student
Step3: Do some computation

Approach1(Naive):
1. Designed a RowFilter that checks every row in the DB to see whether its
course is in the course list
2. Designed an iterator to count the number of such courses within each
student
3. Designed an iterator to do the computation

Problem: Complexity = O(n), where n = number of records in the DB, which is
bad.

Approach2(Better Lookup):
1. Created an inverted Index with:
courseID student|studentID [ ] count
.
.
.
.
2. Looked up students for courses in the list
3. Accessed the records for the studentIDs and courseIDs generated by the
index lookup, using Range objects in a batch scan
4. Designed an iterator to count courseIds within a student record
5. Designed an iterator to do the computation

Problem: A batch scan does not return records in sorted order, so step 4
does not give me the required results :\

I am not sure how to proceed now.

Best regards,
Yamini Joshi
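
A small Python illustration of the problem in Approach 2, assuming the
counting in step 4 relies on all of a student's entries arriving together.
The sample entries and function names are made up. Counting runs of
consecutive entries breaks on unsorted input, while counting into a map keyed
by studentID does not depend on order.

```python
from collections import Counter
from itertools import groupby

# Hypothetical (studentID, courseID) entries in the order a batch scan
# might return them, i.e. not sorted by student.
ENTRIES = [
    ("s2", "c2"), ("s1", "c1"), ("s1", "c3"), ("s2", "c7"), ("s1", "c2"),
]
WANTED = {"c1", "c2"}  # the input list of course IDs


def count_consecutive(entries):
    """Counts runs of equal studentIDs; only correct on sorted input."""
    return [
        (student, sum(1 for _ in run))
        for student, run in groupby(entries, key=lambda e: e[0])
    ]


def count_any_order(entries, wanted):
    """Order-independent: total and matching course counts per student."""
    total, matching = Counter(), Counter()
    for student, course in entries:
        total[student] += 1
        if course in wanted:
            matching[student] += 1
    return total, matching


runs = count_consecutive(ENTRIES)
total, matching = count_any_order(ENTRIES, WANTED)
print(runs)            # [('s2', 1), ('s1', 2), ('s2', 1), ('s1', 1)]
print(dict(total))     # {'s2': 2, 's1': 3}
print(dict(matching))  # {'s2': 1, 's1': 2}
```

The run-based count splits each student into several partial counts, which
matches the symptom described above. Sorting the entries by studentID first,
or aggregating into a map as in `count_any_order`, avoids it.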

On Thu, Oct 20, 2016 at 6:04 PM, Dylan Hutchison <dhutchis@cs.washington.edu
> wrote:

> Hi Yamini,
>
> If you have a finite, known list of column families, you can use locality
> groups
> <https://accumulo.apache.org/1.8/accumulo_user_manual#_locality_groups> to
> store them in separate files in Hadoop.   Scans that only reference the
> column families within a locality group need not open data in other
> locality groups' files.
>
> Apart from locality groups, setting "fetch column families and/or
> qualifiers" on the scanner sets up a standard Filter iterator on the scan.
> If you need to obtain these columns from every row, then the whole table is
> scanned and filtered server-side.  (Seeking will occur during the scan if
> the selected columns are far apart in the table.)  I guess that is too
> inefficient for your use case.  For reference, these iterators are here
> for families
> <https://github.com/apache/accumulo/blob/master/core/src/main/java/org/apache/accumulo/core/iterators/system/ColumnFamilySkippingIterator.java>
> and here for qualifiers
> <https://github.com/apache/accumulo/blob/master/core/src/main/java/org/apache/accumulo/core/iterators/system/ColumnQualifierFilter.java>
> .
>
> If locality groups are not an option and you must filter on families and
> columns, then you may want to consider maintaining an index table, in which
> the columns are stored as rows, or otherwise moving the columns into the
> rows.
>
> Regards, Dylan

Re: Iterator as a Filter

Posted by Dylan Hutchison <dh...@cs.washington.edu>.

Hi Yamini,

If you have a finite, known list of column families, you can use locality
groups
<https://accumulo.apache.org/1.8/accumulo_user_manual#_locality_groups> to
store them in separate files in Hadoop.   Scans that only reference the
column families within a locality group need not open data in other
locality groups' files.

Apart from locality groups, setting "fetch column families and/or
qualifiers" on the scanner sets up a standard Filter iterator on the scan.
If you need to obtain these columns from every row, then the whole table is
scanned and filtered server-side.  (Seeking will occur during the scan if
the selected columns are far apart in the table.)  I guess that is too
inefficient for your use case.  For reference, these iterators are here for
families
<https://github.com/apache/accumulo/blob/master/core/src/main/java/org/apache/accumulo/core/iterators/system/ColumnFamilySkippingIterator.java>
and here for qualifiers
<https://github.com/apache/accumulo/blob/master/core/src/main/java/org/apache/accumulo/core/iterators/system/ColumnQualifierFilter.java>
.

If locality groups are not an option and you must filter on families and
columns, then you may want to consider maintaining an index table, in which
the columns are stored as rows, or otherwise moving the columns into the
rows.

Regards, Dylan
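
A rough Python sketch of the index-table suggestion, with the columns moved
into the rows. This is not Accumulo code: the sorted lists stand in for
tables, the sample data and the "course"/"student" family names are invented,
and `bisect_left` plays the role of a range scan over one row.

```python
from bisect import bisect_left

# Hypothetical main table: row = studentID, family = "course",
# qualifier = courseID.
MAIN = sorted([
    ("s1", "course", "c1"),
    ("s1", "course", "c2"),
    ("s1", "course", "c3"),
    ("s2", "course", "c2"),
    ("s3", "course", "c4"),
])

# Index table: the column the queries filter on becomes the row.
INDEX = sorted(
    (qualifier, "student", row)
    for row, family, qualifier in MAIN
    if family == "course"
)


def students_in(index, course):
    """Range lookup over a single index row instead of a full-table filter."""
    lo = bisect_left(index, (course,))
    hi = bisect_left(index, (course + "\x00",))
    return [student for _course, _family, student in index[lo:hi]]


print(students_in(INDEX, "c2"))  # ['s1', 's2']
print(students_in(INDEX, "c9"))  # []
```

A lookup touches only the entries for the requested course, so its cost
follows the size of the answer rather than the size of the table. The price
is keeping the index in step with the main table on every write.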

On Thu, Oct 20, 2016 at 3:45 PM, Yamini Joshi <ya...@gmail.com> wrote:

> Hello all
>
> Is it possible to configure an iterator that works as a filter? As per
> Accumulo docs:
> As such, the `Filter` class functions well for filtering small amounts of
> data, but is
> inefficient for filtering large amounts of data. The decision to use a
> `Filter` strongly
> depends on the use case and distribution of data being filtered.
>
> I have a huge corpus to be filtered with a small amount of data selected.
> I want to select column families from a list of col families. I have a
> rough idea of using 'seek' to bypass cfs that don't exist in the list. I
> was hoping I could exploit the 'seek'ing in iterator and go to the range in
> the list of cf and check if it exists. I am not sure if this will work or
> if it is a good approach. Any feedback is much appreciated.
>
> Best regards,
> Yamini Joshi
>