You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@oodt.apache.org by "Yu, Richard (GSFC-474.0)[NOAA-JPSS]" <ri...@nasa.gov> on 2012/02/01 19:52:15 UTC
Extract meta data and populate database example
Hi,
We are new to OODT and would like to know if there is any example of populating the metadata into a database after successful ingest. The database could be the following:
POSTGRESQL, or MYSQL. Your help in this area would be most appreciated!
Richard Yu
Re: Extract meta data and populate database example
Posted by Cameron Goodale <go...@apache.org>.
Richard,
I have had success querying the FileManager via XML-RPC for ProductTypes,
and then Product Metadata in python. Then I just collect the sub-set of
the metadata I want and re-formatted the content into GeoJSON and loaded
each one into a MongoDB. It isn't exactly a SQL database, but the core
principles should remain the same.
The use case was that we needed to drive some website widgets (time series
plots and drop points into Google Maps) and we didn't want to directly
connect and query FileManager.
Below is a snippet of code from my python program that does the initial
setup and Query: (hope this helps)
######## CODE BLOCK ################
'''This code will read a filemanager and query for all product types.
For each product type it will collect the OFSN, [Lat,Lon] (or geoPoint)
and the meantime of the product and create a new collection within mongoDB.
mongoDB json Doc format (GeoJSON):
Starred(*) Elements are going to be indexed for Query Support
which are coordinates, start_date, and stop_date
{
"type":"[MultiPoint OR Point]",
"*coordinates*":[
[lon,lat],[lon,lat],[lon,lat]
],
"properties": {
"OFSN":"http_url_to_OSFN",
"*start_date*": new Datetime,
"*stop_date*": new Datetime,
}
}
'''
# used to call the XML-RPC interface from Python
import xmlrpclib
# simple string to dates module
import datetime
# Used for Connecting to MongoDB
from pymongo import Connection, GEO2D, ASCENDING
#setup the connection to the filemanager
fm_url = "http://localhost:9000"
filemgr = xmlrpclib.ServerProxy(fm_url).filemgr
#connect to the mongoDB and use the daily_date_list collection
db = Connection('localhost', 27017)
ddl = db.product_met # final collection where inserts will be made
#turn on safe insert mode (Comment out for DANGER mode)
ddl.set_lasterror_options(safe=True)
#Global list of productTypes
p_type_list = filemgr.getProductTypes()
''' This simple one liner will return a Python Dictionary with all of the
ProductTypes in a format they you can iterate over and pass back into the
fileManager via XML-RPC (as opposed to trying to create the Python
equivalent
of a Java object)'''
# This is just used for clearning and building a MongoDB Collection
def build_collection(p_type, main_collection=ddl):
'''This will take in a product type and a main collection:
1. Create a connection to a sub-collection using the product type
name metadata value.
2. ALWAYS - Drop the Collection
3. Create the 3 Required Indexes'''
try:
# 1. collection is a sub-collection
collection = ddl[p_type['name']]
print str(collection.count()) + ' documents found, purging...'
# 2. Drop Collection
collection.drop()
# 3. Create Indexes
collection.create_index([("coordinates", GEO2D)])
print 'geo-index created on the coordinates element'
ddl.create_index("properties.start_date", ASCENDING)
ddl.create_index("properties.stop_date", ASCENDING)
print 'Collection has been built'
return collection
except:
print 'Cannot Build Collection for' + str(p_type['name'])
def build_geo_json(met):
'''take the met dictionary and return a properly formed
geo-json document. Below in code the places you see
met['KEYNAME'] is how you grab a metadata element by
KEYNAME. Your metadata keys will differ and so will your
mileage.'''
geo_json = {} # empty dict we will populate
raw_geo_list = [] # container to hold all [lon,lat] lists
try: # to get metadata elements
# grab OFSN
ofsn = met['OFSNFilePath'][0]
# get the metadata and trim off the microseconds
start = met['TimeBegin'][0].rsplit('.')[0]
stop = met['TimeEnd'][0].rsplit('.')[0]
# create geopoint list, and split
raw_geo_list = met['CAS.geopoint']
geo_list = []
# append all points within the raw_geo_list to the new geo_list
for point in raw_geo_list:
# split the values up into a single list with 2 values
string_list = point.split(' ')
# flip the GeoPoint [LAT,LON] into [LON,LAT]
string_list = [string_list[1], string_list[0]]
float_list = map(float, string_list)
geo_list.append(float_list)
except KeyError:
print met
except IndexError:
print "%s has a problem with it's geopoint metadata. CAS.geopoint =
%s, MaxLon = %s, MaxLat = %s" % (met['CAS.ProductName'][0], raw_geo_list,
met['MaxLon'][0], met['MaxLat'][0])
# convert to datetime objects
start = datetime.datetime.strptime(start, '%Y-%m-%dT%H:%M:%S')
stop = datetime.datetime.strptime(stop, '%Y-%m-%dT%H:%M:%S')
# if a single point is in the geo_list use type:point
if len(geo_list) == 1:
geo_json['type'] = "Point"
else:
geo_json['type'] = "MultiPoint"
geo_json['coordinates'] = geo_list
props = {"OFSN": ofsn,
"start_date": start,
"stop_date": stop}
geo_json['properties'] = props
return geo_json
# print geo_json
def build_geo_json_list(p_type):
# list to hold docs
geo_json_list = []
# get product list
p_list = filemgr.getProductsByProductType(p_type)
# for each product extract metadata
for p in p_list:
# grab the metadata and build geo_json_doc
met = filemgr.getMetadata(p) # this is a dictionary
if met == {}: # SKIP products that have NO METADATA
pass
else:
try:
geo_json_doc = build_geo_json(met)
geo_json_list.append(geo_json_doc)
except KeyError as err:
print err
return geo_json_list
for p_type in p_type_list:
'''In this loop we iterate over each product_type from the p_type_list
we got from the fileManager and we call getProductsByProductType to
loop over all of the products in the catalog'''
# grab a list of products from filemanager using ProductType
p_list = filemgr.getProductsByProductType(p_type)
if len(p_list) == 0: # If no products are found, then do nothing, else
get to work
print 'No products found for ' + str(p_type['name'])
pass
else:
# build the productType based collection
p_collection = build_collection(p_type)
# build the entire geo_json dict
geo_json_list = build_geo_json_list(p_type)
# do a single insert of all geo_json using a single BULK insert
p_collection.insert(geo_json_list)
print str(len(geo_json_list)) + ' is the length of the geo_json
list for %s' % p_type['name']
############ END CODE BLOCK ###############
On Wed, Feb 1, 2012 at 10:52 AM, Yu, Richard (GSFC-474.0)[NOAA-JPSS] <
richard.yu@nasa.gov> wrote:
> ** **
>
> Hi,****
>
> ** **
>
> We are new to OODT and would like to know if there is any example of
> populating the metadata into a database after successful ingest. The
> database could be the following:****
>
> POSTGRESQL, or MYSQL. Your help in this area would be most appreciated!**
> **
>
> ** **
>
> Richard Yu****
>
> ** **
>
Re: Extract meta data and populate database example
Posted by Cameron Goodale <go...@apache.org>.
Richard,
I have had success querying the FileManager via XML-RPC for ProductTypes,
and then Product Metadata in python. Then I just collect the sub-set of
the metadata I want and re-formatted the content into GeoJSON and loaded
each one into a MongoDB. It isn't exactly a SQL database, but the core
principles should remain the same.
The use case was that we needed to drive some website widgets (time series
plots and drop points into Google Maps) and we didn't want to directly
connect and query FileManager.
Below is a snippet of code from my python program that does the initial
setup and Query: (hope this helps)
######## CODE BLOCK ################
'''This code will read a filemanager and query for all product types.
For each product type it will collect the OFSN, [Lat,Lon] (or geoPoint)
and the meantime of the product and create a new collection within mongoDB.
mongoDB json Doc format (GeoJSON):
Starred(*) Elements are going to be indexed for Query Support
which are coordinates, start_date, and stop_date
{
"type":"[MultiPoint OR Point]",
"*coordinates*":[
[lon,lat],[lon,lat],[lon,lat]
],
"properties": {
"OFSN":"http_url_to_OSFN",
"*start_date*": new Datetime,
"*stop_date*": new Datetime,
}
}
'''
# used to call the XML-RPC interface from Python
import xmlrpclib
# simple string to dates module
import datetime
# Used for Connecting to MongoDB
from pymongo import Connection, GEO2D, ASCENDING
#setup the connection to the filemanager
fm_url = "http://localhost:9000"
filemgr = xmlrpclib.ServerProxy(fm_url).filemgr
#connect to the mongoDB and use the daily_date_list collection
db = Connection('localhost', 27017)
ddl = db.product_met # final collection where inserts will be made
#turn on safe insert mode (Comment out for DANGER mode)
ddl.set_lasterror_options(safe=True)
#Global list of productTypes
p_type_list = filemgr.getProductTypes()
''' This simple one liner will return a Python Dictionary with all of the
ProductTypes in a format they you can iterate over and pass back into the
fileManager via XML-RPC (as opposed to trying to create the Python
equivalent
of a Java object)'''
# This is just used for clearning and building a MongoDB Collection
def build_collection(p_type, main_collection=ddl):
'''This will take in a product type and a main collection:
1. Create a connection to a sub-collection using the product type
name metadata value.
2. ALWAYS - Drop the Collection
3. Create the 3 Required Indexes'''
try:
# 1. collection is a sub-collection
collection = ddl[p_type['name']]
print str(collection.count()) + ' documents found, purging...'
# 2. Drop Collection
collection.drop()
# 3. Create Indexes
collection.create_index([("coordinates", GEO2D)])
print 'geo-index created on the coordinates element'
ddl.create_index("properties.start_date", ASCENDING)
ddl.create_index("properties.stop_date", ASCENDING)
print 'Collection has been built'
return collection
except:
print 'Cannot Build Collection for' + str(p_type['name'])
def build_geo_json(met):
'''take the met dictionary and return a properly formed
geo-json document. Below in code the places you see
met['KEYNAME'] is how you grab a metadata element by
KEYNAME. Your metadata keys will differ and so will your
mileage.'''
geo_json = {} # empty dict we will populate
raw_geo_list = [] # container to hold all [lon,lat] lists
try: # to get metadata elements
# grab OFSN
ofsn = met['OFSNFilePath'][0]
# get the metadata and trim off the microseconds
start = met['TimeBegin'][0].rsplit('.')[0]
stop = met['TimeEnd'][0].rsplit('.')[0]
# create geopoint list, and split
raw_geo_list = met['CAS.geopoint']
geo_list = []
# append all points within the raw_geo_list to the new geo_list
for point in raw_geo_list:
# split the values up into a single list with 2 values
string_list = point.split(' ')
# flip the GeoPoint [LAT,LON] into [LON,LAT]
string_list = [string_list[1], string_list[0]]
float_list = map(float, string_list)
geo_list.append(float_list)
except KeyError:
print met
except IndexError:
print "%s has a problem with it's geopoint metadata. CAS.geopoint =
%s, MaxLon = %s, MaxLat = %s" % (met['CAS.ProductName'][0], raw_geo_list,
met['MaxLon'][0], met['MaxLat'][0])
# convert to datetime objects
start = datetime.datetime.strptime(start, '%Y-%m-%dT%H:%M:%S')
stop = datetime.datetime.strptime(stop, '%Y-%m-%dT%H:%M:%S')
# if a single point is in the geo_list use type:point
if len(geo_list) == 1:
geo_json['type'] = "Point"
else:
geo_json['type'] = "MultiPoint"
geo_json['coordinates'] = geo_list
props = {"OFSN": ofsn,
"start_date": start,
"stop_date": stop}
geo_json['properties'] = props
return geo_json
# print geo_json
def build_geo_json_list(p_type):
# list to hold docs
geo_json_list = []
# get product list
p_list = filemgr.getProductsByProductType(p_type)
# for each product extract metadata
for p in p_list:
# grab the metadata and build geo_json_doc
met = filemgr.getMetadata(p) # this is a dictionary
if met == {}: # SKIP products that have NO METADATA
pass
else:
try:
geo_json_doc = build_geo_json(met)
geo_json_list.append(geo_json_doc)
except KeyError as err:
print err
return geo_json_list
for p_type in p_type_list:
'''In this loop we iterate over each product_type from the p_type_list
we got from the fileManager and we call getProductsByProductType to
loop over all of the products in the catalog'''
# grab a list of products from filemanager using ProductType
p_list = filemgr.getProductsByProductType(p_type)
if len(p_list) == 0: # If no products are found, then do nothing, else
get to work
print 'No products found for ' + str(p_type['name'])
pass
else:
# build the productType based collection
p_collection = build_collection(p_type)
# build the entire geo_json dict
geo_json_list = build_geo_json_list(p_type)
# do a single insert of all geo_json using a single BULK insert
p_collection.insert(geo_json_list)
print str(len(geo_json_list)) + ' is the length of the geo_json
list for %s' % p_type['name']
############ END CODE BLOCK ###############
On Wed, Feb 1, 2012 at 10:52 AM, Yu, Richard (GSFC-474.0)[NOAA-JPSS] <
richard.yu@nasa.gov> wrote:
> ** **
>
> Hi,****
>
> ** **
>
> We are new to OODT and would like to know if there is any example of
> populating the metadata into a database after successful ingest. The
> database could be the following:****
>
> POSTGRESQL, or MYSQL. Your help in this area would be most appreciated!**
> **
>
> ** **
>
> Richard Yu****
>
> ** **
>
RE: Extract meta data and populate database example
Posted by "Yu, Richard (GSFC-474.0)[NOAA-JPSS]" <ri...@nasa.gov>.
This is great and a very positive OODT experience. Thanks!
Richard
-----Original Message-----
From: Mattmann, Chris A (388J) [mailto:chris.a.mattmann@jpl.nasa.gov]
Sent: Thursday, February 02, 2012 2:42 PM
To: user@oodt.apache.org
Subject: Re: Extract meta data and populate database example
Hi Richard,
Thanks for your question!
I'm going to assume you'd like the File Manager component:
http://oodt.apache.org/components/maven/filemgr/
to store the metadata for you. Also, what'll happen if you use the File Manager component is the following:
1. You'll define the configuration inside of $FILEMGR_DEPLOY/etc/filemgr.properties
- select as the catalog factory org.apache.oodt.cas.filemgr.catalog.DataSourceCatalogFactory
- below the factory configs are implementation-specific config properties. There should be a set of 4 of them for JDBC DataSource catalogs requiring you to specify the username, password, JDBC URL and driver class (fully qualified class name). Specify the appropriate options for any of those JDBC databases below there, and then make sure that you drop the appropriate JDBC driver jar (the one that contains the class that you will reference in the config property) inside of FILEMGR_DEPLOY/lib/
2. You can find some examples of the needed schemas/etc., inside of:
http://svn.apache.org/repos/asf/oodt/trunk/filemgr/src/main/resources/
Hope that helps and let me know if I can provide more information or if anyone else on the list can!
Cheers,
Chris
On Feb 1, 2012, at 10:52 AM, Yu, Richard (GSFC-474.0)[NOAA-JPSS] wrote:
>
> Hi,
>
> We are new to OODT and would like to know if there is any example of populating the metadata into a database after successful ingest. The database could be the following:
> POSTGRESQL, or MYSQL. Your help in this area would be most appreciated!
>
> Richard Yu
>
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
RE: Extract meta data and populate database example
Posted by "Yu, Richard (GSFC-474.0)[NOAA-JPSS]" <ri...@nasa.gov>.
This is great and a very positive OODT experience. Thanks!
Richard
-----Original Message-----
From: Mattmann, Chris A (388J) [mailto:chris.a.mattmann@jpl.nasa.gov]
Sent: Thursday, February 02, 2012 2:42 PM
To: user@oodt.apache.org
Subject: Re: Extract meta data and populate database example
Hi Richard,
Thanks for your question!
I'm going to assume you'd like the File Manager component:
http://oodt.apache.org/components/maven/filemgr/
to store the metadata for you. Also, what'll happen if you use the File Manager component is the following:
1. You'll define the configuration inside of $FILEMGR_DEPLOY/etc/filemgr.properties
- select as the catalog factory org.apache.oodt.cas.filemgr.catalog.DataSourceCatalogFactory
- below the factory configs are implementation-specific config properties. There should be a set of 4 of them for JDBC DataSource catalogs requiring you to specify the username, password, JDBC URL and driver class (fully qualified class name). Specify the appropriate options for any of those JDBC databases below there, and then make sure that you drop the appropriate JDBC driver jar (the one that contains the class that you will reference in the config property) inside of FILEMGR_DEPLOY/lib/
2. You can find some examples of the needed schemas/etc., inside of:
http://svn.apache.org/repos/asf/oodt/trunk/filemgr/src/main/resources/
Hope that helps and let me know if I can provide more information or if anyone else on the list can!
Cheers,
Chris
On Feb 1, 2012, at 10:52 AM, Yu, Richard (GSFC-474.0)[NOAA-JPSS] wrote:
>
> Hi,
>
> We are new to OODT and would like to know if there is any example of populating the metadata into a database after successful ingest. The database could be the following:
> POSTGRESQL, or MYSQL. Your help in this area would be most appreciated!
>
> Richard Yu
>
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Re: Extract meta data and populate database example
Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Richard,
Thanks for your question!
I'm going to assume you'd like the File Manager component:
http://oodt.apache.org/components/maven/filemgr/
to store the metadata for you. Also, what'll happen if you use the File Manager component is the following:
1. You'll define the configuration inside of $FILEMGR_DEPLOY/etc/filemgr.properties
- select as the catalog factory org.apache.oodt.cas.filemgr.catalog.DataSourceCatalogFactory
- below the factory configs are implementation-specific config properties. There should be a set
of 4 of them for JDBC DataSource catalogs requiring you to specify the username, password, JDBC
URL and driver class (fully qualified class name). Specify the appropriate options for any of those
JDBC databases below there, and then make sure that you drop the appropriate JDBC driver jar
(the one that contains the class that you will reference in the config property) inside of FILEMGR_DEPLOY/lib/
2. You can find some examples of the needed schemas/etc., inside of:
http://svn.apache.org/repos/asf/oodt/trunk/filemgr/src/main/resources/
Hope that helps and let me know if I can provide more information or if anyone else on the list
can!
Cheers,
Chris
On Feb 1, 2012, at 10:52 AM, Yu, Richard (GSFC-474.0)[NOAA-JPSS] wrote:
>
> Hi,
>
> We are new to OODT and would like to know if there is any example of populating the metadata into a database after successful ingest. The database could be the following:
> POSTGRESQL, or MYSQL. Your help in this area would be most appreciated!
>
> Richard Yu
>
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Re: Extract meta data and populate database example
Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Richard,
Thanks for your question!
I'm going to assume you'd like the File Manager component:
http://oodt.apache.org/components/maven/filemgr/
to store the metadata for you. Also, what'll happen if you use the File Manager component is the following:
1. You'll define the configuration inside of $FILEMGR_DEPLOY/etc/filemgr.properties
- select as the catalog factory org.apache.oodt.cas.filemgr.catalog.DataSourceCatalogFactory
- below the factory configs are implementation-specific config properties. There should be a set
of 4 of them for JDBC DataSource catalogs requiring you to specify the username, password, JDBC
URL and driver class (fully qualified class name). Specify the appropriate options for any of those
JDBC databases below there, and then make sure that you drop the appropriate JDBC driver jar
(the one that contains the class that you will reference in the config property) inside of FILEMGR_DEPLOY/lib/
2. You can find some examples of the needed schemas/etc., inside of:
http://svn.apache.org/repos/asf/oodt/trunk/filemgr/src/main/resources/
Hope that helps and let me know if I can provide more information or if anyone else on the list
can!
Cheers,
Chris
On Feb 1, 2012, at 10:52 AM, Yu, Richard (GSFC-474.0)[NOAA-JPSS] wrote:
>
> Hi,
>
> We are new to OODT and would like to know if there is any example of populating the metadata into a database after successful ingest. The database could be the following:
> POSTGRESQL, or MYSQL. Your help in this area would be most appreciated!
>
> Richard Yu
>
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++