You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@oodt.apache.org by "Yu, Richard (GSFC-474.0)[NOAA-JPSS]" <ri...@nasa.gov> on 2012/02/01 19:52:15 UTC

Extract meta data and populate database example

Hi,

We are new to OODT and would like to know if there is any example of populating the metadata into a database after successful ingest.  The database could be the following:
POSTGRESQL, or MYSQL.  Your help in this area would be most appreciated!

Richard Yu

Re: Extract meta data and populate database example

Posted by Cameron Goodale <go...@apache.org>.

Richard,

I have had success querying the FileManager via XML-RPC for ProductTypes,
and then Product Metadata in python.  Then I just collect the sub-set of
the metadata I want and re-formatted the content into GeoJSON and loaded
each one into a MongoDB.  It isn't exactly a SQL database, but the core
principles should remain the same.

The use case was that we needed to drive some website widgets (time series
plots and drop points into Google Maps) and we didn't want to directly
connect and query FileManager.

Below is a snippet of code from my python program that does the initial
setup and Query:  (hope this helps)

######## CODE BLOCK ################

'''This code will read a filemanager and query for all product types.
For each product type it will collect the OFSN, [Lat,Lon] (or geoPoint)
and the meantime of the product and create a new collection within mongoDB.

mongoDB json Doc format (GeoJSON):

Starred(*) Elements are going to be indexed for Query Support
which are coordinates, start_date, and stop_date

{
 "type":"[MultiPoint OR Point]",
 "*coordinates*":[
 [lon,lat],[lon,lat],[lon,lat]
 ],
 "properties": {
     "OFSN":"http_url_to_OSFN",
     "*start_date*": new Datetime,
     "*stop_date*": new Datetime,
 }
}

'''
# used to call the XML-RPC interface from Python
import xmlrpclib

# simple string to dates module
import datetime

# Used for Connecting to MongoDB
from pymongo import Connection, GEO2D, ASCENDING


#setup the connection to the filemanager
fm_url = "http://localhost:9000"
filemgr = xmlrpclib.ServerProxy(fm_url).filemgr

#connect to the mongoDB and use the daily_date_list collection
db = Connection('localhost', 27017)
ddl = db.product_met  # final collection where inserts will be made
#turn on safe insert mode (Comment out for DANGER mode)
ddl.set_lasterror_options(safe=True)

#Global list of productTypes
p_type_list = filemgr.getProductTypes()
''' This simple one liner will return a Python Dictionary with all of the
ProductTypes in a format they you can iterate over and pass back into the
fileManager via XML-RPC (as opposed to trying to create the Python
equivalent
of a Java object)'''


# This is just used for clearning and building a MongoDB Collection
def build_collection(p_type, main_collection=ddl):
    '''This will take in a product type and a main collection:
    1.  Create a connection to a sub-collection using the product type
    name metadata value.
    2.  ALWAYS - Drop the Collection
    3.  Create the 3 Required Indexes'''

    try:
        # 1. collection is a sub-collection
        collection = ddl[p_type['name']]
        print str(collection.count()) + ' documents found, purging...'
        # 2. Drop Collection
        collection.drop()
        # 3. Create Indexes
        collection.create_index([("coordinates", GEO2D)])
        print 'geo-index created on the coordinates element'
        ddl.create_index("properties.start_date", ASCENDING)
        ddl.create_index("properties.stop_date", ASCENDING)
        print 'Collection has been built'
        return collection
    except:
        print 'Cannot Build Collection for' + str(p_type['name'])


def build_geo_json(met):
    '''take the met dictionary and return a properly formed
    geo-json document.  Below in code the places you see
    met['KEYNAME'] is how you grab a metadata element by
    KEYNAME.  Your metadata keys will differ and so will your
    mileage.'''
    geo_json = {}  # empty dict we will populate
    raw_geo_list = []  # container to hold all [lon,lat] lists
    try:   # to get metadata elements
        # grab OFSN
        ofsn = met['OFSNFilePath'][0]
        # get the metadata and trim off the microseconds
        start = met['TimeBegin'][0].rsplit('.')[0]
        stop = met['TimeEnd'][0].rsplit('.')[0]
        # create geopoint list, and split
        raw_geo_list = met['CAS.geopoint']
        geo_list = []
        # append all points within the raw_geo_list to the new geo_list
        for point in raw_geo_list:
            # split the values up into a single list with 2 values
            string_list = point.split(' ')
            # flip the GeoPoint [LAT,LON] into [LON,LAT]
            string_list = [string_list[1], string_list[0]]
            float_list = map(float, string_list)
            geo_list.append(float_list)
    except KeyError:
        print met
    except IndexError:
        print "%s has a problem with it's geopoint metadata. CAS.geopoint =
%s, MaxLon = %s, MaxLat = %s" % (met['CAS.ProductName'][0], raw_geo_list,
met['MaxLon'][0], met['MaxLat'][0])
    # convert to datetime objects
    start = datetime.datetime.strptime(start, '%Y-%m-%dT%H:%M:%S')
    stop = datetime.datetime.strptime(stop, '%Y-%m-%dT%H:%M:%S')
    # if a single point is in the geo_list use type:point
    if len(geo_list) == 1:
        geo_json['type'] = "Point"
    else:
        geo_json['type'] = "MultiPoint"

    geo_json['coordinates'] = geo_list
    props = {"OFSN": ofsn,
             "start_date": start,
             "stop_date": stop}
    geo_json['properties'] = props
    return geo_json
    # print geo_json


def build_geo_json_list(p_type):
    # list to hold docs
    geo_json_list = []
    # get product list
    p_list = filemgr.getProductsByProductType(p_type)
    # for each product extract metadata
    for p in p_list:
        # grab the metadata and build geo_json_doc
        met = filemgr.getMetadata(p)  # this is a dictionary
        if met == {}:   # SKIP products that have NO METADATA
            pass
        else:
            try:
                geo_json_doc = build_geo_json(met)
                geo_json_list.append(geo_json_doc)
            except KeyError as err:
                print err
    return geo_json_list


for p_type in p_type_list:
    '''In this loop we iterate over each product_type from the p_type_list
    we got from the fileManager and we call getProductsByProductType to
    loop over all of the products in the catalog'''

    # grab a list of products from filemanager using ProductType
    p_list = filemgr.getProductsByProductType(p_type)
    if len(p_list) == 0:  # If no products are found, then do nothing, else
get to work
        print 'No products found for ' + str(p_type['name'])
        pass
    else:
        # build the productType based collection
        p_collection = build_collection(p_type)
        # build the entire geo_json dict
        geo_json_list = build_geo_json_list(p_type)
        # do a single insert of all geo_json using a single BULK insert
        p_collection.insert(geo_json_list)
        print str(len(geo_json_list)) + ' is the length of the geo_json
list for %s' % p_type['name']


############ END CODE BLOCK ###############


On Wed, Feb 1, 2012 at 10:52 AM, Yu, Richard (GSFC-474.0)[NOAA-JPSS] <
richard.yu@nasa.gov> wrote:

> ** **
>
> Hi,****
>
> ** **
>
> We are new to OODT and would like to know if there is any example of
> populating the metadata into a database after successful ingest.  The
> database could be the following:****
>
> POSTGRESQL, or MYSQL.  Your help in this area would be most appreciated!**
> **
>
> ** **
>
> Richard Yu****
>
> ** **
>

Re: Extract meta data and populate database example

Posted by Cameron Goodale <go...@apache.org>.

Richard,

I have had success querying the FileManager via XML-RPC for ProductTypes,
and then Product Metadata in python.  Then I just collect the sub-set of
the metadata I want and re-formatted the content into GeoJSON and loaded
each one into a MongoDB.  It isn't exactly a SQL database, but the core
principles should remain the same.

The use case was that we needed to drive some website widgets (time series
plots and drop points into Google Maps) and we didn't want to directly
connect and query FileManager.

Below is a snippet of code from my python program that does the initial
setup and Query:  (hope this helps)

######## CODE BLOCK ################

'''This code will read a filemanager and query for all product types.
For each product type it will collect the OFSN, [Lat,Lon] (or geoPoint)
and the meantime of the product and create a new collection within mongoDB.

mongoDB json Doc format (GeoJSON):

Starred(*) Elements are going to be indexed for Query Support
which are coordinates, start_date, and stop_date

{
 "type":"[MultiPoint OR Point]",
 "*coordinates*":[
 [lon,lat],[lon,lat],[lon,lat]
 ],
 "properties": {
     "OFSN":"http_url_to_OSFN",
     "*start_date*": new Datetime,
     "*stop_date*": new Datetime,
 }
}

'''
# used to call the XML-RPC interface from Python
import xmlrpclib

# simple string to dates module
import datetime

# Used for Connecting to MongoDB
from pymongo import Connection, GEO2D, ASCENDING


#setup the connection to the filemanager
fm_url = "http://localhost:9000"
filemgr = xmlrpclib.ServerProxy(fm_url).filemgr

#connect to the mongoDB and use the daily_date_list collection
db = Connection('localhost', 27017)
ddl = db.product_met  # final collection where inserts will be made
#turn on safe insert mode (Comment out for DANGER mode)
ddl.set_lasterror_options(safe=True)

#Global list of productTypes
p_type_list = filemgr.getProductTypes()
''' This simple one liner will return a Python Dictionary with all of the
ProductTypes in a format they you can iterate over and pass back into the
fileManager via XML-RPC (as opposed to trying to create the Python
equivalent
of a Java object)'''


# This is just used for clearning and building a MongoDB Collection
def build_collection(p_type, main_collection=ddl):
    '''This will take in a product type and a main collection:
    1.  Create a connection to a sub-collection using the product type
    name metadata value.
    2.  ALWAYS - Drop the Collection
    3.  Create the 3 Required Indexes'''

    try:
        # 1. collection is a sub-collection
        collection = ddl[p_type['name']]
        print str(collection.count()) + ' documents found, purging...'
        # 2. Drop Collection
        collection.drop()
        # 3. Create Indexes
        collection.create_index([("coordinates", GEO2D)])
        print 'geo-index created on the coordinates element'
        ddl.create_index("properties.start_date", ASCENDING)
        ddl.create_index("properties.stop_date", ASCENDING)
        print 'Collection has been built'
        return collection
    except:
        print 'Cannot Build Collection for' + str(p_type['name'])


def build_geo_json(met):
    '''take the met dictionary and return a properly formed
    geo-json document.  Below in code the places you see
    met['KEYNAME'] is how you grab a metadata element by
    KEYNAME.  Your metadata keys will differ and so will your
    mileage.'''
    geo_json = {}  # empty dict we will populate
    raw_geo_list = []  # container to hold all [lon,lat] lists
    try:   # to get metadata elements
        # grab OFSN
        ofsn = met['OFSNFilePath'][0]
        # get the metadata and trim off the microseconds
        start = met['TimeBegin'][0].rsplit('.')[0]
        stop = met['TimeEnd'][0].rsplit('.')[0]
        # create geopoint list, and split
        raw_geo_list = met['CAS.geopoint']
        geo_list = []
        # append all points within the raw_geo_list to the new geo_list
        for point in raw_geo_list:
            # split the values up into a single list with 2 values
            string_list = point.split(' ')
            # flip the GeoPoint [LAT,LON] into [LON,LAT]
            string_list = [string_list[1], string_list[0]]
            float_list = map(float, string_list)
            geo_list.append(float_list)
    except KeyError:
        print met
    except IndexError:
        print "%s has a problem with it's geopoint metadata. CAS.geopoint =
%s, MaxLon = %s, MaxLat = %s" % (met['CAS.ProductName'][0], raw_geo_list,
met['MaxLon'][0], met['MaxLat'][0])
    # convert to datetime objects
    start = datetime.datetime.strptime(start, '%Y-%m-%dT%H:%M:%S')
    stop = datetime.datetime.strptime(stop, '%Y-%m-%dT%H:%M:%S')
    # if a single point is in the geo_list use type:point
    if len(geo_list) == 1:
        geo_json['type'] = "Point"
    else:
        geo_json['type'] = "MultiPoint"

    geo_json['coordinates'] = geo_list
    props = {"OFSN": ofsn,
             "start_date": start,
             "stop_date": stop}
    geo_json['properties'] = props
    return geo_json
    # print geo_json


def build_geo_json_list(p_type):
    # list to hold docs
    geo_json_list = []
    # get product list
    p_list = filemgr.getProductsByProductType(p_type)
    # for each product extract metadata
    for p in p_list:
        # grab the metadata and build geo_json_doc
        met = filemgr.getMetadata(p)  # this is a dictionary
        if met == {}:   # SKIP products that have NO METADATA
            pass
        else:
            try:
                geo_json_doc = build_geo_json(met)
                geo_json_list.append(geo_json_doc)
            except KeyError as err:
                print err
    return geo_json_list


for p_type in p_type_list:
    '''In this loop we iterate over each product_type from the p_type_list
    we got from the fileManager and we call getProductsByProductType to
    loop over all of the products in the catalog'''

    # grab a list of products from filemanager using ProductType
    p_list = filemgr.getProductsByProductType(p_type)
    if len(p_list) == 0:  # If no products are found, then do nothing, else
get to work
        print 'No products found for ' + str(p_type['name'])
        pass
    else:
        # build the productType based collection
        p_collection = build_collection(p_type)
        # build the entire geo_json dict
        geo_json_list = build_geo_json_list(p_type)
        # do a single insert of all geo_json using a single BULK insert
        p_collection.insert(geo_json_list)
        print str(len(geo_json_list)) + ' is the length of the geo_json
list for %s' % p_type['name']


############ END CODE BLOCK ###############


On Wed, Feb 1, 2012 at 10:52 AM, Yu, Richard (GSFC-474.0)[NOAA-JPSS] <
richard.yu@nasa.gov> wrote:

> ** **
>
> Hi,****
>
> ** **
>
> We are new to OODT and would like to know if there is any example of
> populating the metadata into a database after successful ingest.  The
> database could be the following:****
>
> POSTGRESQL, or MYSQL.  Your help in this area would be most appreciated!**
> **
>
> ** **
>
> Richard Yu****
>
> ** **
>

RE: Extract meta data and populate database example

Posted by "Yu, Richard (GSFC-474.0)[NOAA-JPSS]" <ri...@nasa.gov>.

This is great and a very positive OODT experience.  Thanks!

Richard

-----Original Message-----
From: Mattmann, Chris A (388J) [mailto:chris.a.mattmann@jpl.nasa.gov] 
Sent: Thursday, February 02, 2012 2:42 PM
To: user@oodt.apache.org
Subject: Re: Extract meta data and populate database example

Hi Richard,

Thanks for your question!

I'm going to assume you'd like the File Manager component:

http://oodt.apache.org/components/maven/filemgr/

to store the metadata for you. Also, what'll happen if you use the File Manager component is the following:

1. You'll define the configuration inside of $FILEMGR_DEPLOY/etc/filemgr.properties
   - select as the catalog factory org.apache.oodt.cas.filemgr.catalog.DataSourceCatalogFactory
   - below the factory configs are implementation-specific config properties. There should be a set of 4 of them for JDBC DataSource catalogs requiring you to specify the username, password, JDBC URL and driver class (fully qualified class name). Specify the appropriate options for any of those JDBC databases below there, and then make sure that you drop the appropriate JDBC driver jar (the one that contains the class that you will reference in the config property) inside of FILEMGR_DEPLOY/lib/

2. You can find some examples of the needed schemas/etc., inside of:

http://svn.apache.org/repos/asf/oodt/trunk/filemgr/src/main/resources/

Hope that helps and let me know if I can provide more information or if anyone else on the list can!

Cheers,
Chris

On Feb 1, 2012, at 10:52 AM, Yu, Richard (GSFC-474.0)[NOAA-JPSS] wrote:

>  
> Hi,
>  
> We are new to OODT and would like to know if there is any example of populating the metadata into a database after successful ingest.  The database could be the following:
> POSTGRESQL, or MYSQL.  Your help in this area would be most appreciated!
>  
> Richard Yu
>  


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

RE: Extract meta data and populate database example

Posted by "Yu, Richard (GSFC-474.0)[NOAA-JPSS]" <ri...@nasa.gov>.

This is great and a very positive OODT experience.  Thanks!

Richard

-----Original Message-----
From: Mattmann, Chris A (388J) [mailto:chris.a.mattmann@jpl.nasa.gov] 
Sent: Thursday, February 02, 2012 2:42 PM
To: user@oodt.apache.org
Subject: Re: Extract meta data and populate database example

Hi Richard,

Thanks for your question!

I'm going to assume you'd like the File Manager component:

http://oodt.apache.org/components/maven/filemgr/

to store the metadata for you. Also, what'll happen if you use the File Manager component is the following:

1. You'll define the configuration inside of $FILEMGR_DEPLOY/etc/filemgr.properties
   - select as the catalog factory org.apache.oodt.cas.filemgr.catalog.DataSourceCatalogFactory
   - below the factory configs are implementation-specific config properties. There should be a set of 4 of them for JDBC DataSource catalogs requiring you to specify the username, password, JDBC URL and driver class (fully qualified class name). Specify the appropriate options for any of those JDBC databases below there, and then make sure that you drop the appropriate JDBC driver jar (the one that contains the class that you will reference in the config property) inside of FILEMGR_DEPLOY/lib/

2. You can find some examples of the needed schemas/etc., inside of:

http://svn.apache.org/repos/asf/oodt/trunk/filemgr/src/main/resources/

Hope that helps and let me know if I can provide more information or if anyone else on the list can!

Cheers,
Chris

On Feb 1, 2012, at 10:52 AM, Yu, Richard (GSFC-474.0)[NOAA-JPSS] wrote:

>  
> Hi,
>  
> We are new to OODT and would like to know if there is any example of populating the metadata into a database after successful ingest.  The database could be the following:
> POSTGRESQL, or MYSQL.  Your help in this area would be most appreciated!
>  
> Richard Yu
>  


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Re: Extract meta data and populate database example

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.

Hi Richard,

Thanks for your question!

I'm going to assume you'd like the File Manager component:

http://oodt.apache.org/components/maven/filemgr/

to store the metadata for you. Also, what'll happen if you use the File Manager component is the following:

1. You'll define the configuration inside of $FILEMGR_DEPLOY/etc/filemgr.properties
   - select as the catalog factory org.apache.oodt.cas.filemgr.catalog.DataSourceCatalogFactory
   - below the factory configs are implementation-specific config properties. There should be a set
of 4 of them for JDBC DataSource catalogs requiring you to specify the username, password, JDBC
URL and driver class (fully qualified class name). Specify the appropriate options for any of those
JDBC databases below there, and then make sure that you drop the appropriate JDBC driver jar
(the one that contains the class that you will reference in the config property) inside of FILEMGR_DEPLOY/lib/

2. You can find some examples of the needed schemas/etc., inside of:

http://svn.apache.org/repos/asf/oodt/trunk/filemgr/src/main/resources/

Hope that helps and let me know if I can provide more information or if anyone else on the list
can!

Cheers,
Chris

On Feb 1, 2012, at 10:52 AM, Yu, Richard (GSFC-474.0)[NOAA-JPSS] wrote:

>  
> Hi,
>  
> We are new to OODT and would like to know if there is any example of populating the metadata into a database after successful ingest.  The database could be the following:
> POSTGRESQL, or MYSQL.  Your help in this area would be most appreciated!
>  
> Richard Yu
>  


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Re: Extract meta data and populate database example

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.

Hi Richard,

Thanks for your question!

I'm going to assume you'd like the File Manager component:

http://oodt.apache.org/components/maven/filemgr/

to store the metadata for you. Also, what'll happen if you use the File Manager component is the following:

1. You'll define the configuration inside of $FILEMGR_DEPLOY/etc/filemgr.properties
   - select as the catalog factory org.apache.oodt.cas.filemgr.catalog.DataSourceCatalogFactory
   - below the factory configs are implementation-specific config properties. There should be a set
of 4 of them for JDBC DataSource catalogs requiring you to specify the username, password, JDBC
URL and driver class (fully qualified class name). Specify the appropriate options for any of those
JDBC databases below there, and then make sure that you drop the appropriate JDBC driver jar
(the one that contains the class that you will reference in the config property) inside of FILEMGR_DEPLOY/lib/

2. You can find some examples of the needed schemas/etc., inside of:

http://svn.apache.org/repos/asf/oodt/trunk/filemgr/src/main/resources/

Hope that helps and let me know if I can provide more information or if anyone else on the list
can!

Cheers,
Chris

On Feb 1, 2012, at 10:52 AM, Yu, Richard (GSFC-474.0)[NOAA-JPSS] wrote:

>  
> Hi,
>  
> We are new to OODT and would like to know if there is any example of populating the metadata into a database after successful ingest.  The database could be the following:
> POSTGRESQL, or MYSQL.  Your help in this area would be most appreciated!
>  
> Richard Yu
>  


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++