Posted to dev@oodt.apache.org by Thomas Bennett <tb...@ska.ac.za> on 2011/09/22 23:30:21 UTC

Catalog queries

Hi,

I have a few questions about building queries for filemgr lucene catalogs
and I was thinking someone may be able to help me.

I've ingested some files into the catalog and am now using the command line tools
(and aliases - thanks Cameron!) to query the catalog.

I'm not too familiar with writing SQL queries, but I've been able to achieve
the following types of queries:

bin$ ./query_tool --url http://localhost:9000 --sql -query "SELECT
Observer,Description,Duration,ExperimentID FROM KatFile WHERE
Observer='jasper'" --sortBy Duration

Which returns:
.....
jasper,a9909ae6-822b-11e0-a7a1-0060dd4721d8,Target track,637.841571569
jasper,47c3a4da-822a-11e0-a7a1-0060dd4721d8,Target track,565.859450817
jasper,777b0f34-8224-11e0-a7a1-0060dd4721d8,Target track,80.9798858166

bin$ ./query_tool --url http://localhost:9000 --lucene -query
'Observer:sharmila'

Which returns:
.......
ba9b292e-e506-11e0-ad74-9f1c5e7f0611
b93dbc0d-e506-11e0-ad74-9f1c5e7f0611
b7e530ec-e506-11e0-ad74-9f1c5e7f0611
b66ff60b-e506-11e0-ad74-9f1c5e7f0611
afc6556a-e506-11e0-ad74-9f1c5e7f0611

*Questions:*

   1. The SQL query does what I expect ;-) but with one problem - in what
   order will I receive the data? I can't figure out an automatic way to find
   out which column is which data.
   2. Is full SQL query syntax supported?
   3. The Lucene query returns the productID. Is there a class I can use
   that will return something similar to the sql query? (Although I should look
   at the code and find this out for myself - asking is free :-)
   4. I've not yet tested any more complex SQL and Lucene queries - I was
   just wondering if there was any useful info out there that would show me
   some more funky example queries. So far I've found lucene tutorial
   <http://www.lucenetutorial.com/lucene-query-syntax.html> and sql quick ref
   <http://www.w3schools.com/sql/sql_quickref.asp>. I'll tie this into OODT
   Filemgr User Guide
   <https://cwiki.apache.org/confluence/display/OODT/OODT+Filemgr+User+Guide>
   once I've figured these things out.
   5. I see the version of lucene being used is quite old (2.0.0 and the
   latest ver is 2.9.1). Is there any reason why OODT is using this old
   version?
   6. Should I be spending the effort to use a different (i.e. sql database)
   or are other OODT implementations using lucene?

Thanks in advance for any help.

Kind regards,
Tom

Re: Catalog queries

Posted by Cameron Goodale <si...@gmail.com>.

Tom,

I am glad to hear that someone is using the aliases to query the
FileManager.  Here is the wiki page I started to document examples of how to
use the aliases and different use cases.

https://cwiki.apache.org/OODT/bash-and-tcsh-shell-tools-for-file-manager.html

On to the questions:

The SQL query does what I expect ;-) but with one problem - in what order
will I receive the data? I can't figure out an automatic way to find out
which column is which data.
>>>  I would hope the data would be returned in the same order you asked for
it.  Given your example though the order may not be preserved.

Is full SQL query syntax supported?
>>>  I don't believe it is.  It is intended to be a simple way to form
queries.

The Lucene query returns the productID. Is there a class I can use that will
return something similar to the sql query? (Although I should look at the
code and find this out for myself - asking is free :-)
>>>  Not sure, maybe someone else can chime in.

I've not yet tested any more complex SQL and Lucene queries - I was just
wondering if there was any useful info out there that would show me some
more funky example queries. So far I've found lucene tutorial
<http://www.lucenetutorial.com/lucene-query-syntax.html> and sql quick ref
<http://www.w3schools.com/sql/sql_quickref.asp>. I'll tie this into OODT
Filemgr User Guide
<https://cwiki.apache.org/confluence/display/OODT/OODT+Filemgr+User+Guide>
once I've figured these things out.

I see the version of lucene being used is quite old (2.0.0 and the latest
ver is 2.9.1). Is there any reason why OODT is using this old version?

Should I be spending the effort to use a different (i.e. sql database) or
are other OODT implementations using lucene?
>>>  The intent is to enable simple queries into the FileManager.  Several
projects initially use Lucene here at JPL.  Sometimes though the project
outgrows the capabilities of Lucene, and we start to consider a different
catalog solution.  On LMMP we used BerkeleyDB when Paul Ramirez wrote a
custom catalog, and on RCMES Andrew Hart wrote a custom Catalog to normalize
metadata and store it directly in MySQL.

Hope that helps.


-Cameron



On Fri, Sep 23, 2011 at 6:30 AM, Thomas Bennett <tb...@ska.ac.za> wrote:

> Hi,
>
> I have a few questions about building queries for filemgr lucene catalogs
> and I was thinking someone may be able to help me.
>
> I've ingested some files into the catalog and am now using the command line tools
> (and aliases - thanks Cameron!) to query the catalog.
>
> I'm not too familiar with writing SQL queries, but I've been able to
> achieve the following types of queries:
>
> bin$ ./query_tool --url http://localhost:9000 --sql -query "SELECT
> Observer,Description,Duration,ExperimentID FROM KatFile WHERE
> Observer='jasper'" --sortBy Duration
>
> Which returns:
> .....
> jasper,a9909ae6-822b-11e0-a7a1-0060dd4721d8,Target track,637.841571569
> jasper,47c3a4da-822a-11e0-a7a1-0060dd4721d8,Target track,565.859450817
> jasper,777b0f34-8224-11e0-a7a1-0060dd4721d8,Target track,80.9798858166
>
> bin$ ./query_tool --url http://localhost:9000 --lucene -query
> 'Observer:sharmila'
>
> Which returns:
> .......
> ba9b292e-e506-11e0-ad74-9f1c5e7f0611
> b93dbc0d-e506-11e0-ad74-9f1c5e7f0611
> b7e530ec-e506-11e0-ad74-9f1c5e7f0611
> b66ff60b-e506-11e0-ad74-9f1c5e7f0611
> afc6556a-e506-11e0-ad74-9f1c5e7f0611
>
> *Questions:*
>
>    1. The SQL query does what I expect ;-) but with one problem - in what
>    order will I receive the data? I can't figure out an automatic way to find
>    out which column is which data.
>    2. Is full SQL query syntax supported?
>    3. The Lucene query returns the productID. Is there a class I can use
>    that will return something similar to the sql query? (Although I should look
>    at the code and find this out for myself - asking is free :-)
>    4. I've not yet tested any more complex SQL and Lucene queries - I was
>    just wondering if there was any useful info out there that would show me
>    some more funky example queries. So far I've found lucene tutorial
>    <http://www.lucenetutorial.com/lucene-query-syntax.html> and sql quick ref
>    <http://www.w3schools.com/sql/sql_quickref.asp>. I'll tie this into OODT
>    Filemgr User Guide
>    <https://cwiki.apache.org/confluence/display/OODT/OODT+Filemgr+User+Guide>
>    once I've figured these things out.
>    5. I see the version of lucene being used is quite old (2.0.0 and the
>    latest ver is 2.9.1). Is there any reason why OODT is using this old
>    version?
>    6. Should I be spending the effort to use a different (i.e. sql
>    database) or are other OODT implementations using lucene?
>
> Thanks in advance for any help.
>
> Kind regards,
> Tom
>



-- 

Sent from a Tin Can attached to a String

Re: Catalog queries

Posted by Thomas Bennett <lm...@gmail.com>.


> Good question! It looks like it just prints the metadata in any order, as opposed to the order that you received it. This is probably not a great thing to do, so 
> can you file an issue and we can take a look at it?

Great! I'll file an issue and start debugging it to find a solution unless you beat me to it. I'm glad that it is an issue and not just bad query syntax.

> http://oodt.apache.org/components/maven/apidocs/org/apache/oodt/cas/filemgr/util/SqlParser.html

Thanks. I'll take a look. 

> Improvements welcome! :)

Cool. 

> 
>>    • The Lucene query returns the productID. Is there a class I can use that will return something similar to the sql query? (Although I should look at the code and find this out for myself - asking is free :-)
> 
> Heh, great question, but the answer is no. We didn't really standardize on the output from these tools. I originally developed the QueryTool (which understood Lucene to begin with, and later Brian Foster added his SQL syntax to it, and the associated response format). 
> 
> Maybe we should open up an issue (and associated wiki page) on standardizing on the output. Feel free to propose something and I'll be happy to join in (hopefully others will too).

Thanks. I'm really amped to get my hands dirty with Lucene. I'll see what I can come up with once I've spent some time on Lucene.
> 
>>    • I've not yet tested any more complex SQL and Lucene queries - I was just wondering if there was any useful info out there that would show me some more funky example queries. So far I've found lucene tutorial and sql quick ref. I'll tie this into OODT Filemgr User Guide once I've figured these things out.
> 
> +1, that's the best place to start. We also only support a limited set of the Lucene syntax as well, see the following class:
> 
> http://oodt.apache.org/components/maven/apidocs/org/apache/oodt/cas/filemgr/tools/CASAnalyzer.html

Thanks. 

> 
>>    • I see the version of lucene being used is quite old (2.0.0 and the latest ver is 2.9.1). Is there any reason why OODT is using this old version?
> 
> I would *love* to upgrade to 2.9.1 or 2.9.4.
> 
> Upgrading to 3.0 will break APIs for us, b/c Lucene changed to the ScoreCollector method for getting hits back I believe in the 3.x 
> series, however we should be forwards compat to e.g., 2.9.4.
> 
> http://repo1.maven.org/maven2/org/apache/lucene/lucene-core/2.9.4/

Great. I'm happy to get involved in the upgrade.  Perhaps the best way to learn is to give it a try and ask some questions along the way when things break. 

> 
>>    • Should I be spending the effort to use a different (i.e. sql database) or are other OODT implementations using lucene?
>> Thanks in advance for any help.
> 
> Great question.
> 
> Most of the folks use Lucene to begin with, because it requires no external database or service, it just works out of the box.

Okay. Once the querytool is sorted out I can take a look at these details.  But it's good to know that there are migration options if I need to change. 

> Hope that helps explain things.

Yes! Thanks!

> These would probably be good javadocs, plus Wiki pages for these tools and migration :)

Sure. I'm happy to try to document what I can when I get there.


> Cheers,
> Chris

Cheers,
Tom.

Re: Catalog queries

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.

Hi Tom,

Thanks. Comments below:

On Sep 22, 2011, at 2:30 PM, Thomas Bennett wrote:

> Hi,
> 
> I have a few questions about building queries for filemgr lucene catalogs and I was thinking someone may be able to help me.
> 
> I've ingested some files into the catalog and am now using the command line tools (and aliases - thanks Cameron!) to query the catalog.
> 
> I'm not too familiar with writing SQL queries, but I've been able to achieve the following types of queries:
> 
> bin$ ./query_tool --url http://localhost:9000 --sql -query "SELECT Observer,Description,Duration,ExperimentID FROM KatFile WHERE Observer='jasper'" --sortBy Duration
> 
> Which returns:
> .....
> jasper,a9909ae6-822b-11e0-a7a1-0060dd4721d8,Target track,637.841571569
> jasper,47c3a4da-822a-11e0-a7a1-0060dd4721d8,Target track,565.859450817
> jasper,777b0f34-8224-11e0-a7a1-0060dd4721d8,Target track,80.9798858166
> 
> 
> bin$ ./query_tool --url http://localhost:9000 --lucene -query 'Observer:sharmila'
> 
> Which returns:
> .......
> ba9b292e-e506-11e0-ad74-9f1c5e7f0611
> b93dbc0d-e506-11e0-ad74-9f1c5e7f0611
> b7e530ec-e506-11e0-ad74-9f1c5e7f0611
> b66ff60b-e506-11e0-ad74-9f1c5e7f0611
> afc6556a-e506-11e0-ad74-9f1c5e7f0611
> 
> 
> Questions:
> 	• The SQL query does what I expect ;-) but with one problem - in what order will I receive the data? I can't figure out an automatic way to find out which column is which data.

Good question! It looks like it just prints the metadata in any order, as opposed to the order that you received it. This is probably not a great thing to do, so 
can you file an issue and we can take a look at it?
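
Until the tool prints columns in the requested order, one possible client-side workaround is to relabel the output yourself. The following is a minimal Python sketch, not part of OODT. It assumes the field order actually printed (Observer, ExperimentID, Description, Duration) has been worked out by eye from the sample output earlier in this thread, and that this order is the same on every row. If the order can change from row to row, a fixed mapping like this will not be reliable. The names REQUESTED, OBSERVED and reorder are made up for illustration.

```python
import csv
import io

# Order of the fields in the SELECT clause of the example query.
REQUESTED = ["Observer", "Description", "Duration", "ExperimentID"]

# Assumption: order the fields actually appear in, inferred by eye from the
# sample output (UUID-like value = ExperimentID, number = Duration).
OBSERVED = ["Observer", "ExperimentID", "Description", "Duration"]


def reorder(raw_output, observed=OBSERVED, requested=REQUESTED):
    """Re-emit comma-separated result rows in the order the query asked for."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_output)):
        if len(fields) != len(observed):
            continue  # skip blank lines and anything that is not a result row
        record = dict(zip(observed, fields))
        rows.append([record[name] for name in requested])
    return rows


# Two rows copied from the sample output in this thread.
sample = """.....
jasper,a9909ae6-822b-11e0-a7a1-0060dd4721d8,Target track,637.841571569
jasper,47c3a4da-822a-11e0-a7a1-0060dd4721d8,Target track,565.859450817
"""

for row in reorder(sample):
    print(",".join(row))
```

The naive comma split also assumes no metadata value contains a comma; a value such as a free-text description with commas would need a different delimiter or quoting.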

> 	• Is full SQL query syntax supported?

Nope, it's just a small subset. You can see what's supported here:

http://oodt.apache.org/components/maven/apidocs/org/apache/oodt/cas/filemgr/util/SqlParser.html

Improvements welcome! :)

> 	• The Lucene query returns the productID. Is there a class I can use that will return something similar to the sql query? (Although I should look at the code and find this out for myself - asking is free :-)

Heh, great question, but the answer is no. We didn't really standardize on the output from these tools. I originally developed the QueryTool (which understood Lucene to begin with, and later Brian Foster added his SQL syntax to it, and the associated response format). 

Maybe we should open up an issue (and associated wiki page) on standardizing on the output. Feel free to propose something and I'll be happy to join in (hopefully others will too).

> 	• I've not yet tested any more complex SQL and Lucene queries - I was just wondering if there was any useful info out there that would show me some more funky example queries. So far I've found lucene tutorial and sql quick ref. I'll tie this into OODT Filemgr User Guide once I've figured these things out.

+1, that's the best place to start. We also only support a limited set of the Lucene syntax as well, see the following class:

http://oodt.apache.org/components/maven/apidocs/org/apache/oodt/cas/filemgr/tools/CASAnalyzer.html

> 	• I see the version of lucene being used is quite old (2.0.0 and the latest ver is 2.9.1). Is there any reason why OODT is using this old version?

I would *love* to upgrade to 2.9.1 or 2.9.4.

Upgrading to 3.0 will break APIs for us, b/c Lucene changed to the ScoreCollector method for getting hits back I believe in the 3.x 
series, however we should be forwards compat to e.g., 2.9.4.

http://repo1.maven.org/maven2/org/apache/lucene/lucene-core/2.9.4/

> 	• Should I be spending the effort to use a different (i.e. sql database) or are other OODT implementations using lucene?
> Thanks in advance for any help.

Great question.

Most of the folks use Lucene to begin with, because it requires no external database or service, it just works out of the box. It 
also has a number of other advantages:

* Easy unit testing against your index
* You can copy around FM index directories and share them between machines
* You can test locally on your laptop by copying the FM index off of a server onto your laptop, and then spinning up a local FM from there. The file refs won't exist, but you can play around with the catalog and most other things work.
* You can open up the FM index in Luke http://getopt.org/luke/ and then browse and query the Index using the Full Lucene Syntax
* It's fairly scalable (up to 10s of millions of products). You can scale beyond that, but you have to get into index partitioning, backups, etc. Also, time queries at that stage cause token explosion (e.g., doing a range query for 2001-01-01T00:00:00.000Z to 2003-01-01T00:00:00.000Z will explode), mainly to do with the SerDe format for storing CAS metadata and product information that we used in the LuceneCatalog. This can be improved to scale beyond a few million products, but no one has invested the effort into that yet; they typically just use a SQL RDBMS and the DataSourceCatalog at that point.
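
To give a feel for why a wide time-range query can blow up, here is a small Python simulation. It is only one plausible illustration and does not model the LuceneCatalog SerDe format mentioned above. It assumes the behaviour of older Lucene releases, where a range query is rewritten into one boolean clause per distinct indexed term in the range, with a default limit of 1024 clauses. The hourly ingest rate and the helper names (iso, terms_in_range) are hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: default BooleanQuery clause limit in older Lucene releases.
MAX_CLAUSES = 1024


def iso(ts):
    """Format a timestamp the way the range query in this thread writes it."""
    return ts.strftime("%Y-%m-%dT%H:%M:%S.000Z")


def terms_in_range(indexed_terms, low, high):
    """Return the distinct indexed terms that sort lexically within [low, high]."""
    return sorted({t for t in indexed_terms if low <= t <= high})


# Hypothetical catalog: one product ingested per hour, starting in 2000.
start = datetime(2000, 1, 1)
indexed = [iso(start + timedelta(hours=h)) for h in range(4 * 366 * 24)]

matched = terms_in_range(
    indexed, "2001-01-01T00:00:00.000Z", "2003-01-01T00:00:00.000Z"
)
print(len(matched), "distinct terms in range; clause limit is", MAX_CLAUSES)
```

Because every timestamp is a distinct term, the number of clauses grows with the number of products in the range rather than with the size of the query text.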

To move your existing index to the DataSourceCatalog, there's a tool in FM that I wrote called ExpImpCatalog. You can find it here: http://s.apache.org/Xuq

To use the tool in an existing FM deployment, do the following:

1. Stand up a new FM that you are going to configure with your DataSourceCatalog. 
  - change the port to 9010
  - if your existing FM is in e.g., /usr/local/filemgr, put this new one in /usr/local/filemgr2
  - configure it with the DataSourceCatalog
  - set up your DB and bake in the parameters to the FM config

2. Go into /usr/local/filemgr/bin (your existing, Lucene-based FM)
    - run java -Djava.ext.dirs=../lib org.apache.oodt.cas.filemgr.tools.ExpImpCatalog with no arguments; you should see the usage message:

]$ java -Djava.ext.dirs=../lib org.apache.oodt.cas.filemgr.tools.ExpImpCatalog
ExpImpCatalog [options] 
--source <url>
--dest <url>
 --unique
[--types <comma separate list of product type names>]
[--sourceCatProps <file> --destCatProps <file>]

This tool works like the following:
   You give it either a combination of: --source and --dest OR
                                  a combination of: --sourceCatProps and --destCatProps

In the case of simply --source and --dest, it will import all of the source catalog into the dest catalog via XML-RPC, talking to
your source FM URL, and your dest FM URL. In the case of --sourceCatProps and --destCatProps, it will do the same
thing, except it won't use XML-RPC as the transport layer; it will simply instantiate a copy of the source Catalog interface object
and the dest Catalog interface object (in a single JVM), and import one product and its met at a time from source to dest. I made the
props-based portion of the tool to avoid transferring large met and product objects over XML-RPC, and to keep them
within a JVM.

The --unique parameter will not import a source product ID into a dest catalog if that product ID exists in the dest catalog. The 
--types parameter specifies a comma separated list of Product Types to export from the source catalog into the dest catalog. 
If --types is omitted all product types are assumed.
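
As a rough sketch of the URL-based mode described above, the following Python helper assembles an ExpImpCatalog command line. The ports (9000 for the existing FM, 9010 for the new one), the install path and the KatFile product type come from this thread; the helper name expimp_command is made up for illustration, and the command is only printed here, not executed.

```python
import shlex


def expimp_command(source_url, dest_url, unique=True, types=None):
    """Build the argument list for an XML-RPC (--source/--dest) migration."""
    cmd = [
        "java",
        "-Djava.ext.dirs=../lib",
        "org.apache.oodt.cas.filemgr.tools.ExpImpCatalog",
        "--source", source_url,
        "--dest", dest_url,
    ]
    if unique:
        # Skip products whose ID already exists in the destination catalog.
        cmd.append("--unique")
    if types:
        # Restrict the export to the listed product types.
        cmd += ["--types", ",".join(types)]
    return cmd


cmd = expimp_command(
    "http://localhost:9000",  # existing Lucene-based FM
    "http://localhost:9010",  # new FM configured with the DataSourceCatalog
    types=["KatFile"],
)
print(shlex.join(cmd))

# To actually run it from the existing FM's bin directory, something like:
#   import subprocess
#   subprocess.run(cmd, cwd="/usr/local/filemgr/bin", check=True)
```

Leaving out the types argument drops --types from the command, which per the description above means all product types are exported.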

So, there is an easy way to migrate from an existing Lucene index FM catalog into any other Catalog fronted by the FM.
Another thing people sometimes do, if they have the source data and the ingestion pipeline, is to just blow away
the Lucene (or whatever) Catalog, and then re-ingest using the Crawler/FM/Curation pipeline into e.g., a new DataSourceCatalog
that they configure their existing FM to now use.

Hope that helps explain things. These would probably be good javadocs, plus Wiki pages for these tools and migration :)

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
