Posted to dev@vxquery.apache.org by Menaka Madushanka <me...@gmail.com> on 2016/06/05 05:18:07 UTC

Automatically updating Index

Hi everyone,

I came up with an implementation plan for the $subject. This will be able
to detect file content changes as well as deletions and additions.

Methodology:
1. Generate a checksum (MD5/SHA) for each file. These checksum values will
be written to a single properties file in the following format.

*path_to_the_file=checksum_string*


2. On the first run, the checksums will be calculated and the
properties file will be created.

3. When running a query,

   1. The properties file will be read and loaded into memory.
   2. The checksum values will be checked for each file.
   3. If any modification is detected, the index will be updated and the
   new checksum value will be stored.

When checking checksums, the path is taken from each file actually present
on disk and used to look up that file's stored checksum in the properties
file. Because we consider the actual files first, any file insertion or
deletion can be detected as well.
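
The steps above can be sketched roughly as follows. This is only an
illustration under stated assumptions, not VXQuery code: the class name
ChecksumTracker and its methods are hypothetical, MD5 is picked from the
"MD5/ SHA" options above, and the properties file is assumed to live outside
the collection directory so that it is not checksummed itself. Deletions are
found here by looking for stored paths that no longer have a file on disk,
which is one possible reading of the plan.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper for illustration; not part of VXQuery. */
final class ChecksumTracker {

    /** Result of comparing the files on disk with the stored checksums. */
    static final class Changes {
        final List<String> added = new ArrayList<>();
        final List<String> modified = new ArrayList<>();
        final List<String> deleted = new ArrayList<>();
    }

    /** Returns the hex-encoded MD5 digest of the file's content. */
    static String checksum(Path file) throws IOException {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(Files.readAllBytes(file))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 should be available on every JVM", e);
        }
    }

    /** Loads path=checksum entries; a missing store means this is the first run. */
    static Properties load(Path store) throws IOException {
        Properties props = new Properties();
        if (Files.exists(store)) {
            try (InputStream in = Files.newInputStream(store)) {
                props.load(in);
            }
        }
        return props;
    }

    /** Compares every file under collectionDir with the store, then rewrites the store. */
    static Changes detectAndStore(Path collectionDir, Path store) throws IOException {
        Properties old = load(store);
        Properties current = new Properties();
        Changes changes = new Changes();

        List<Path> files;
        try (Stream<Path> walk = Files.walk(collectionDir)) {
            files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        // Start from the files that actually exist, as described in the plan.
        for (Path file : files) {
            String key = file.toString();
            String sum = checksum(file);
            String previous = old.getProperty(key);
            if (previous == null) {
                changes.added.add(key);
            } else if (!previous.equals(sum)) {
                changes.modified.add(key);
            }
            current.setProperty(key, sum);
        }
        // Stored paths with no file on disk are treated as deletions.
        for (String key : new TreeSet<>(old.stringPropertyNames())) {
            if (current.getProperty(key) == null) {
                changes.deleted.add(key);
            }
        }
        try (OutputStream out = Files.newOutputStream(store)) {
            current.store(out, "path_to_the_file=checksum_string");
        }
        return changes;
    }
}
```

On the first run the store does not exist yet, so every file is reported as
added and the properties file is created. The three lists are what an index
update step would consume.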

To make the process clearer, I have attached the flow diagram herewith.

I'd be very happy to have any feedback on this approach.

Thank you very much
Menaka

-- 
*Menaka Madushanka Jayawardena*
Faculty of Engineering, <http://www.pdn.ac.lk/eng>
University of Peradeniyaya.
LinkedIn <http://lk.linkedin.com/in/menakajayawardena>
TP:- 071 885 1183/ 071 350 5470

Re: Automatically updating Index

Posted by Menaka Madushanka <me...@gmail.com>.
Thank you very much Till and Michael. I'll take a look.

On 7 June 2016 at 23:58, Michael J. Carey <mj...@ics.uci.edu> wrote:

> It might also be helpful to look at the AsterixDB external data and
> indexing paper in CIKM'15 for inspiration...?




Re: Automatically updating Index

Posted by "Michael J. Carey" <mj...@ics.uci.edu>.
It might also be helpful to look at the AsterixDB external data and
indexing paper in CIKM'15 for inspiration...?
On Jun 5, 2016 11:11 AM, "Preston Carman" <pr...@apache.org> wrote:

> As we consider creating a meta data file for each index, lets consider
> what other information could be stored with the index? What are the
> types of functionality do we need to have a complete indexing story?
> As I understand it, we support creating an index and searching using
> that index. Would we want to show the user a list of indexes? Menaka's
> e-mail suggest we need a way to update an index. What other
> queries/features should we support around indexes?
>
> Indexing Features
>  * Create index
>  * Search using index
>  * Update index???
>  * List indexes???
>  * Delete index???
>
> On Sat, Jun 4, 2016 at 10:18 PM, Menaka Madushanka
> <me...@gmail.com> wrote:
> > Hi everyone,
> >
> > I came up with an implementation plan for the $subject. This will be
> able to
> > detect file content changes as well as deletions and additions.
> >
> > Methodology:
> > 1. Generate checksum (MD5/ SHA) for each file. These checksum values
> will be
> > written to a single properties file in following format.
> >
> > path_to_the_file=checksum_string
> >
>
> Is there anything else that we will eventually want in a metadata file?
>
> >
> > 2.In the first time run,  the checksum will be calculated and the
> properties
> > file will be created.
> >
>
> Sounds good.
>
> > 3. When running a query,
> >
> > The properties file will be read and loaded in to memory.
> > The checksum values will be checked for each file.
> > If any modification is detected, the index will be updated and the new
> > checksum value will be stored.
> >
> > In the process of checking the checksum, the path of the file will be
> taken
> > by the file itself and retrieve the checksum for that file from
> properties.
> > So, if any file insertion or deletion can be detected because we consider
> > the actual file first.
> >
>
> When you say run a query, is this a UPDATE query or a SEARCH query? I
> think at this point we only want to cause the update action to happen
> for a UPDATE query. The overhead of update a query before searching
> could be to much. Lets first get UPDATE working.
>
> > To make the process more clear, I have attached the flow diagram
> herewith.
> >
>
> I do not see the diagram. Apache will only forward certain types of
> attachments. Can you post a link to your diagram?
>
> > I'd be very happy to have any feedback on this approach.
> >
> > Thank you very much
> > Menaka
> >
> > --
> > Menaka Madushanka Jayawardena
> > Faculty of Engineering,
> > University of Peradeniyaya.
> > LinkedIn
> > TP:- 071 885 1183/ 071 350 5470
>

Re: Automatically updating Index

Posted by Till Westmann <ti...@apache.org>.
Hi,

we should also consider how the issue of data/index consistency is tackled
in AsterixDB [1]. It doesn't automatically update indexes, but it ensures
consistency and thus allows the optimizer to choose an index without
changing the result of the query.
The approach might not be the right one for VXQuery, but it would be good to
take a look :)

Cheers,
Till

[1] http://dl.acm.org/citation.cfm?id=2806428

On 6 Jun 2016, at 20:52, Menaka Madushanka wrote:

> Hello,
>
> I'm sorry Preston. Here is the link for the image.
> https://drive.google.com/file/d/0B-2mdAzfAj07Z0w4RVZ2SGFfTFk/view?usp=sharing
>
> I came up with this approach thinking that, the index should be updated
> automatically if any of the xml file has been changed. (Without user
> interference) And what I have added in the proposal was also updating the
> index automatically.
> I didn't saw the new issue which was added by Steven about it,
> https://issues.apache.org/jira/browse/VXQUERY-198.
>
> As Steven mentioned, the updating process should be decided where, only the
> changed files (updated, deleted or inserted) should be updated in the
> index.
>
> > Is there anything else that we will eventually want in a metadata file?
>
> I think that as we are trying to track the modified files, a content based
> checksum is the best way to do it. We can use last modified date and check
> it. But it's not fully reliable method depending only on single factor
> which can also be changed based on the time of the user's machine.
>
> Other than checksum value, I think we can store some info about the
> relevant index of that file. So when updating the index, the process will
> be very easy. (I have to look whether it is possible)
>
> > When you say run a query, is this a UPDATE query or a SEARCH query? I
> > think at this point we only want to cause the update action to happen
> > for a UPDATE query. The overhead of update a query before searching
> > could be to much. Lets first get UPDATE working.
>
> I thought this should be run in a Search query. (As I was not fully aware
> of the update index query) So, my suggestion was, when running a search
> query, it will first check for any file changes. If there were any, update
> the corresponding index and do the search on it. It's true as you mentioned
> it will have a huge overhead. So we can use this method in detecting the
> changed files and update the index in update query.
>
> Thank you very much
> Menaka
>
>

Re: Automatically updating Index

Posted by Menaka Madushanka <me...@gmail.com>.
Hello,

I'm sorry Preston. Here is the link for the image.
https://drive.google.com/file/d/0B-2mdAzfAj07Z0w4RVZ2SGFfTFk/view?usp=sharing

I came up with this approach thinking that the index should be updated
automatically (without user intervention) if any of the XML files has been
changed. What I added in the proposal was also updating the index
automatically.
I hadn't seen the new issue that Steven added about it,
https://issues.apache.org/jira/browse/VXQUERY-198.

As Steven mentioned, the update process should be designed so that only the
changed files (updated, deleted or inserted) are updated in the index.

> Is there anything else that we will eventually want in a metadata file?

I think that, as we are trying to track modified files, a content-based
checksum is the best way to do it. We could check the last-modified date
instead, but that is not a fully reliable method: it depends on a single
factor, which can also change with the clock of the user's machine.
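
To illustrate the difference between the two signals, here is a small sketch.
FileSnapshot is a hypothetical type invented for this example, and SHA-256
stands in for whichever checksum is finally chosen. It records both the
last-modified time and a content digest, so the two can be compared
side by side.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical value type pairing the two change signals for one file. */
final class FileSnapshot {
    final long lastModifiedMillis;
    final String sha256;

    private FileSnapshot(long lastModifiedMillis, String sha256) {
        this.lastModifiedMillis = lastModifiedMillis;
        this.sha256 = sha256;
    }

    /** Reads the timestamp and computes the content digest of the file. */
    static FileSnapshot of(Path file) throws IOException {
        long modified = Files.getLastModifiedTime(file).toMillis();
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(Files.readAllBytes(file));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return new FileSnapshot(modified, hex.toString());
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every JVM", e);
        }
    }

    /** True if only the timestamp check would report a change. */
    boolean timestampDiffers(FileSnapshot other) {
        return lastModifiedMillis != other.lastModifiedMillis;
    }

    /** True if the content itself changed, whatever the timestamp says. */
    boolean contentDiffers(FileSnapshot other) {
        return !sha256.equals(other.sha256);
    }
}
```

A file that is edited and then has its timestamp reset would slip past a
timestamp-only check, while a file that is merely touched would trigger a
needless index update. The digest comparison avoids both cases, at the cost
of reading every file.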

Other than the checksum value, I think we can store some info about the
relevant index of that file, so that updating the index becomes very easy.
(I have to check whether that is possible.)

> When you say run a query, is this a UPDATE query or a SEARCH query? I
> think at this point we only want to cause the update action to happen
> for a UPDATE query. The overhead of update a query before searching
> could be to much. Lets first get UPDATE working.

I thought this should run as part of a search query, as I was not fully
aware of the update index query. So my suggestion was that a search query
would first check for any file changes and, if there were any, update the
corresponding index before searching it. As you mentioned, that would
have a huge overhead. So we can use this method to detect the changed
files and update the index in the update query.

Thank you very much
Menaka


On 6 June 2016 at 03:02, Steven Jacobs <sj...@ucr.edu> wrote:

> In addition to Preston's comments, we also need to start thinking about the
> Lucene side. Once we know a file needs to be changed in the index, how does
> this change take place? Looking at how things are stored now will help with
> this.
> Steven
>




Re: Automatically updating Index

Posted by Steven Jacobs <sj...@ucr.edu>.
In addition to Preston's comments, we also need to start thinking about the
Lucene side. Once we know a file needs to be changed in the index, how does
this change take place? Looking at how things are stored now will help with
this.
Steven

On Sunday, June 5, 2016, Preston Carman <pr...@apache.org> wrote:

> As we consider creating a meta data file for each index, lets consider
> what other information could be stored with the index? What are the
> types of functionality do we need to have a complete indexing story?
> As I understand it, we support creating an index and searching using
> that index. Would we want to show the user a list of indexes? Menaka's
> e-mail suggest we need a way to update an index. What other
> queries/features should we support around indexes?
>
> Indexing Features
>  * Create index
>  * Search using index
>  * Update index???
>  * List indexes???
>  * Delete index???
>
> On Sat, Jun 4, 2016 at 10:18 PM, Menaka Madushanka
> <menaka12350@gmail.com <javascript:;>> wrote:
> > Hi everyone,
> >
> > I came up with an implementation plan for the $subject. This will be
> able to
> > detect file content changes as well as deletions and additions.
> >
> > Methodology:
> > 1. Generate checksum (MD5/ SHA) for each file. These checksum values
> will be
> > written to a single properties file in following format.
> >
> > path_to_the_file=checksum_string
> >
>
> Is there anything else that we will eventually want in a metadata file?
>
> >
> > 2.In the first time run,  the checksum will be calculated and the
> properties
> > file will be created.
> >
>
> Sounds good.
>
> > 3. When running a query,
> >
> > The properties file will be read and loaded in to memory.
> > The checksum values will be checked for each file.
> > If any modification is detected, the index will be updated and the new
> > checksum value will be stored.
> >
> > In the process of checking the checksum, the path of the file will be
> taken
> > by the file itself and retrieve the checksum for that file from
> properties.
> > So, if any file insertion or deletion can be detected because we consider
> > the actual file first.
> >
>
> When you say run a query, is this a UPDATE query or a SEARCH query? I
> think at this point we only want to cause the update action to happen
> for a UPDATE query. The overhead of update a query before searching
> could be to much. Lets first get UPDATE working.
>
> > To make the process more clear, I have attached the flow diagram
> herewith.
> >
>
> I do not see the diagram. Apache will only forward certain types of
> attachments. Can you post a link to your diagram?
>
> > I'd be very happy to have any feedback on this approach.
> >
> > Thank you very much
> > Menaka
> >
> > --
> > Menaka Madushanka Jayawardena
> > Faculty of Engineering,
> > University of Peradeniyaya.
> > LinkedIn
> > TP:- 071 885 1183/ 071 350 5470
>

Re: Automatically updating Index

Posted by Preston Carman <pr...@apache.org>.
As we consider creating a metadata file for each index, let's consider
what other information could be stored with the index. What types of
functionality do we need to have a complete indexing story?
As I understand it, we support creating an index and searching using
that index. Would we want to show the user a list of indexes? Menaka's
e-mail suggests we need a way to update an index. What other
queries/features should we support around indexes?

Indexing Features
 * Create index
 * Search using index
 * Update index???
 * List indexes???
 * Delete index???
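
As a way to think about what per-index metadata the create, update, list and
delete operations would need, here is a rough sketch. IndexCatalog, its Entry
fields and method names are all hypothetical and are not taken from VXQuery.
Searching is left out because it does not touch the metadata.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory catalog of indexes; all names are assumptions. */
final class IndexCatalog {

    /** Metadata kept per index (one possible minimal set of fields). */
    static final class Entry {
        final String collectionPath;
        final String indexPath;
        long lastUpdatedMillis;

        Entry(String collectionPath, String indexPath, long lastUpdatedMillis) {
            this.collectionPath = collectionPath;
            this.indexPath = indexPath;
            this.lastUpdatedMillis = lastUpdatedMillis;
        }
    }

    // Keyed by collection path; TreeMap keeps listings in sorted order.
    private final Map<String, Entry> byCollection = new TreeMap<>();

    /** Create index: registers metadata; returns false if one already exists. */
    boolean create(String collectionPath, String indexPath, long nowMillis) {
        if (byCollection.containsKey(collectionPath)) {
            return false;
        }
        byCollection.put(collectionPath, new Entry(collectionPath, indexPath, nowMillis));
        return true;
    }

    /** Update index: records when the index was last refreshed. */
    boolean markUpdated(String collectionPath, long nowMillis) {
        Entry entry = byCollection.get(collectionPath);
        if (entry == null) {
            return false;
        }
        entry.lastUpdatedMillis = nowMillis;
        return true;
    }

    /** List indexes: returns the indexed collection paths. */
    List<String> list() {
        return new ArrayList<>(byCollection.keySet());
    }

    /** Delete index: drops the metadata entry; returns false if unknown. */
    boolean delete(String collectionPath) {
        return byCollection.remove(collectionPath) != null;
    }

    /** Looks up the metadata for one collection, or null if there is none. */
    Entry get(String collectionPath) {
        return byCollection.get(collectionPath);
    }
}
```

Whatever ends up in the metadata file would have to be enough to rebuild a
structure like this when the system starts.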

On Sat, Jun 4, 2016 at 10:18 PM, Menaka Madushanka
<me...@gmail.com> wrote:
> Hi everyone,
>
> I came up with an implementation plan for the $subject. This will be able to
> detect file content changes as well as deletions and additions.
>
> Methodology:
> 1. Generate checksum (MD5/ SHA) for each file. These checksum values will be
> written to a single properties file in following format.
>
> path_to_the_file=checksum_string
>

Is there anything else that we will eventually want in a metadata file?

>
> 2.In the first time run,  the checksum will be calculated and the properties
> file will be created.
>

Sounds good.

> 3. When running a query,
>
> The properties file will be read and loaded in to memory.
> The checksum values will be checked for each file.
> If any modification is detected, the index will be updated and the new
> checksum value will be stored.
>
> In the process of checking the checksum, the path of the file will be taken
> by the file itself and retrieve the checksum for that file from properties.
> So, if any file insertion or deletion can be detected because we consider
> the actual file first.
>

When you say run a query, is this an UPDATE query or a SEARCH query? I
think at this point we only want to cause the update action to happen
for an UPDATE query. The overhead of updating an index before searching
could be too much. Let's first get UPDATE working.

> To make the process more clear, I have attached the flow diagram herewith.
>

I do not see the diagram. Apache will only forward certain types of
attachments. Can you post a link to your diagram?

> I'd be very happy to have any feedback on this approach.
>
> Thank you very much
> Menaka
>
> --
> Menaka Madushanka Jayawardena
> Faculty of Engineering,
> University of Peradeniyaya.
> LinkedIn
> TP:- 071 885 1183/ 071 350 5470