Posted to user@oodt.apache.org by Scott Konzem <ko...@gmail.com> on 2011/01/12 01:43:58 UTC

Approaching OODT as a new user

First of all, I'd like to congratulate OODT on becoming a top level project and NASA for making this project available. Thank you!

From all the nasa.gov email addresses around here, I get the impression that in the early days of this project, most of the developers and users have been in direct contact or even within the same organization, so I'd like to share my experience as a complete outsider.  I am familiar with the challenges of managing research data at a large organization with many research groups, so I've been trying to figure out what OODT does and what it could do for me.  So far most of what I've found has been written either at a very abstract level for managers (the TLP press release and the OODT main page) or a very detailed level for developers (the javadocs). I haven't seen much so far for the "data people" in the middle -- the people who need enough technical detail to put the system into practice because they're tired of coding their own.  This is my experience trying to get that information.

The website has a lot of stub pages for the individual components, so I thought that I might be able to get some more information by downloading and running the software.  This started as a NASA project, so there have to be stacks of documentation somewhere, right?  I downloaded the trunk and built it using the instructions I eventually found on the File Manager page (http://oodt.apache.org/components/maven/filemgr/user/basic.html), but now I have a directory with a bunch of folders in it, and I have no idea what to do with them.  The only tutorial I can find is for the File Manager -- which I very much appreciate, even though it doesn't completely work for me -- and there are only two files named README.txt in the entire project.

As a result, I still have a lot of very basic questions:  What do I do with all of these components? What do they all do?  Which ones do I need, and which are optional? Are they standalone executables?  Web services that require some sort of container?  Do I interact with them using the command line, or do they have web or web services interfaces?  What are the configuration options?  What kinds of data and metadata can I manage? What kinds of roles do I need to have within my organization (administrator, content owner, metadata maintainer), and how does the software handle these? What do I want to do that this project can't? (In this type of software, there's always something that's just a little too specific to the original purpose or organization.)

OODT claims to have a large user community apart from the original developers.  How did it come to be that these organizations and individuals knew how to use the software?  What sort of documentation and support did the developers need to provide in order to get them up and running?  How can I get some of that? :)

Again, I'm very grateful that this product exists and am excited to find out more about it.  Thanks for making it available for me to puzzle over!

Sincerely,

Scott Konzem

Re: Approaching OODT as a new user

Posted by David Kale <da...@cs.stanford.edu>.
Good questions and fair points. I myself am by no means an expert, so I'm
going to let Chris or another person answer your questions, but I wanted to
give you spiritual support.  I'm a machine learning guy working at
Children's Hospital Los Angeles, and we're in the process of modifying and
deploying OODT for building and managing what amounts to a clinical research
data repository, an eye-opening experience since most of my previous research
experience has been with canned datasets or on projects in which someone
else wrote the basic data munging tools.

My own efforts to deploy OODT have depended largely on the direct support of
the folks with NASA email addresses you pointed out (we have a grant-funded
project in collaboration with JPL), because as you point out the
documentation is relatively bare.  The good news is that we are making
comprehensive documentation, tutorials, and even how-to videos a priority
(though perhaps we need to pick up the pace...I do worry that with the
recent exposure, many new users will come to the project and be turned off
by the dearth of clear documentation).

In the meantime, the even better news is that there is an active,
enthusiastic community supporting this project -- including MANY people
whose full-time jobs involve developing it.  Thus, I highly encourage you to
join the email lists and bombard us with questions, as you will get guidance
and answers relatively quickly.

gratefully,
Dave



On Tue, Jan 11, 2011 at 4:43 PM, Scott Konzem <ko...@gmail.com> wrote:

> First of all, I'd like to congratulate OODT on becoming a top level project
> and NASA for making this project available. Thank you!
>
> From all the nasa.gov email addresses around here, I get
> the impression that in the early days of this project, most of the
> developers and users have been in direct contact or even within the same
> organization, so I'd like to share my experience as a complete outsider.  I
> am familiar with the challenges of managing research data at a large organization
> with many research groups, so I've been trying to figure out what OODT
> does and what it could do for me.  So far most of what I've found has been written
> either at a very abstract level for managers (the TLP press release and the
> OODT main page) or a very detailed level for developers (the javadocs). I
> haven't seen much so far for the "data people" in the middle -- the people
> who need enough technical detail to put the system into practice because
> they're tired of coding their own.  This is my experience trying to get
> that information.
>
> The website has a lot of stub pages for the individual components, so I
> thought that I might be able to get some more information by downloading
> and running the software.  This started as a NASA project, so there have to
> be stacks of documentation somewhere, right?  I downloaded the trunk and
> built it using the instructions I eventually found on the File Manager
> page (http://oodt.apache.org/components/maven/filemgr/user/basic.html), but now
> I have a directory with a bunch of folders in it, and I have no idea what
> to do with them.  The only tutorial I can find is for the File Manager --
> which I very much appreciate, even though it doesn't completely work for
> me -- and there are only two files named README.txt in the entire project.
>
> As a result, I still have a lot of very basic questions:  What do I do with
> all of these components? What do they all do?  Which ones do I need, and
> which are optional? Are they standalone executables?  Web services that
> require some sort of container?  Do I interact with them using the command
> line, or do they have web or web services interfaces?  What are the
> configuration options?  What kinds of data and metadata can I manage? What
> kinds of roles do I need to have within my organization (administrator, content
> owner, metadata maintainer), and how does the software handle these? What do
> I want to do that this project can't? (In this type of software, there's
> always something that's just a little too specific to the original purpose
> or organization.)
>
> OODT claims to have a large user community apart from the original developers.
>  How did it come to be that these organizations and individuals knew how
> to use the software?  What sort of documentation and support did the
> developers need to provide in order to get them up and running?  How can I
> get some of that? :)
>
> Again, I'm very grateful that this product exists and am excited to find
> out more about it.  Thanks for making it available for me to puzzle over!
>
> Sincerely,
>
> Scott Konzem
>

Re: Approaching OODT as a new user

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Scott,

Thanks for your detailed and informative email, giving us the user perspective! 

My comments inline below:

On Jan 11, 2011, at 4:43 PM, Scott Konzem wrote:

> First of all, I'd like to congratulate OODT on becoming a top level project and NASA for making this project available. Thank you!

No problemo! We're very happy to be working on OODT in open source, with the rest of the community!

> 
> From all the nasa.gov email addresses around here, I get the impression that in the early days of this project, most of the developers and users have been in direct contact or even within the same organization, so I'd like to share my experience as a complete outsider.  I am familiar with the challenges of managing research data at a large organization with many research groups, so I've been trying to figure out what OODT does and what it could do for me.  So far most of what I've found has been written either at a very abstract level for managers (the TLP press release and the OODT main page) or a very detailed level for developers (the javadocs). I haven't seen much so far for the "data people" in the middle -- the people who need enough technical detail to put the system into practice because they're tired of coding their own.  This is my experience trying to get that information.

Sorry that you've had that experience so far. The guide for the file manager that you stumbled upon below is an effort to start to address some of those concerns. I agree that much of the documentation as it stands is Javadoc type documentation, or high level architecture, but I'd also point you to more guides like the one below (there are more). In fact, many of the OODT components have a few such guides that can help out at least in getting started. I'll say more about these in my reply to the next paragraph, since they are more applicable there.

> 
> The website has a lot of stub pages for the individual components, so I thought that I might be able to get some more information by downloading and running the software.  This started as a NASA project, so there have to be stacks of documentation somewhere, right?  I downloaded the trunk and built it using the instructions I eventually found on the File Manager page (http://oodt.apache.org/components/maven/filemgr/user/basic.html), but now I have a directory with a bunch of folders in it, and I have no idea what to do with them.  The only tutorial I can find is for the File Manager -- which I very much appreciate, even though it doesn't completely work for me -- and there are only two files named README.txt in the entire project.

Thanks. Can you elaborate on what part of the guide doesn't completely work? 

The filemgr, workflow, and resource components are the 3 canonical services that help you implement data processing and management. The File Manager tracks file locations and their metadata, handles data transfers, provides the ability to transform that captured metadata in a variety of ways (e.g., output it as RSS or RDF via the cas-product webapp), and delivers those files and metadata to folks who ask for them. The workflow manager is a light-weight wrapper where you can cook up control flow and data flow (sets of Tasks chained together) in XML files. You can execute those Tasks locally on a single machine, or you can plug the workflow manager into a resource manager and have those tasks distributed out onto a cluster, a cloud, a grid or whatever type of hardware you have to execute processes and jobs on. These components are useful independently of one another. In fact, they don't have any direct dependencies on one another unless you configure them to work together. What that means is that you can use the filemgr as an independent component simply to programmatically capture information about files and metadata, and never do anything with them that involves a workflow manager or resource manager. You can likewise use the workflow system independent of the filemgr or resource manager, and the same goes for the resource manager.
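
To make that separation concrete, here is a toy sketch in Python. It is not OODT code and does not use the OODT API (OODT itself is Java); FileCatalog, Workflow and run_locally are hypothetical names made up for illustration. It only shows the shape of the idea: a catalog that records file locations and metadata on its own, and a workflow that chains tasks and takes a pluggable executor, so that local execution could be swapped for something that farms tasks out elsewhere.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stand-ins for illustration; these are NOT OODT classes.


@dataclass
class FileCatalog:
    """Records where files live and the metadata captured about them."""
    records: Dict[str, dict] = field(default_factory=dict)

    def ingest(self, path: str, metadata: dict) -> None:
        # Store a copy so later changes by the caller do not alter the catalog.
        self.records[path] = dict(metadata)

    def query(self, key: str, value: str) -> List[str]:
        # Return the paths whose metadata has key == value, in sorted order.
        return sorted(p for p, m in self.records.items() if m.get(key) == value)


# A task takes a context dict and returns a (possibly updated) context dict.
Task = Callable[[dict], dict]


def run_locally(tasks: List[Task], context: dict) -> dict:
    """Executor that runs each task in order on this machine."""
    for task in tasks:
        context = task(context)
    return context


@dataclass
class Workflow:
    """A chain of tasks plus the executor that decides where they run."""
    tasks: List[Task]
    executor: Callable[[List[Task], dict], dict] = run_locally

    def run(self, context: dict) -> dict:
        return self.executor(self.tasks, context)
```

Neither class refers to the other: the catalog can be used without any workflow, and the workflow can be used without any catalog. Passing a different function as executor is the toy equivalent of plugging the workflow manager into a resource manager.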

However, when you put these 3 services together, you start to have a really powerful substrate to perform data management system functions on. For example, the crawler framework combines automatic file identification and ingestion with the file manager to rapidly build up your file manager based archive and catalog; it can also notify the workflow manager when files are ingested, to kick off tasks and processes (algorithms) associated with the ingestion of those files. The pushpull framework is a remote content acquisition system that can go get you ancillary files and metadata, pull them down locally, and feed them to the crawler for ingestion and management in your data management system. Finally, the PGE component is a specialized workflow task jar library that, when dropped into the workflow manager's lib directory, gives you a high powered workflow task that can easily communicate with the filemgr, workflow manager or resource manager, and feed your algorithm information that you'd otherwise have to write lots of specialized data management code to obtain.
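
As a rough illustration of the crawl, identify, ingest, notify pattern described above, here is a small Python sketch. Again, this is not the OODT crawler or its API: the identify rule (guessing a type from the file extension), the crawl function, and the on_ingest callback are all assumptions invented for the example.

```python
import os
from typing import Callable, Dict, Optional

# Hypothetical illustration only; none of these names come from OODT.


def identify(path: str) -> Optional[str]:
    """Guess a product type from the file extension (a made-up rule)."""
    ext = os.path.splitext(path)[1].lower()
    return {".csv": "table", ".txt": "note"}.get(ext)


def crawl(root: str,
          catalog: Dict[str, dict],
          on_ingest: Callable[[str, dict], None]) -> int:
    """Walk root, ingest recognised files into catalog, and notify per file.

    Returns the number of files ingested.
    """
    count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            product_type = identify(path)
            if product_type is None:
                continue  # unrecognised files are skipped, not ingested
            metadata = {"type": product_type, "size": os.path.getsize(path)}
            catalog[path] = metadata
            # The callback stands in for "tell the workflow side a file arrived".
            on_ingest(path, metadata)
            count += 1
    return count
```

In this sketch the catalog is a plain dict and the notification is a plain function call; in a real deployment those would be separate services, which is the part the framework provides.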

The above is a description of what *one set* of OODT components (the CAS family) does; there's a whole other set of components that handle information integration. The use case here is that you have a bunch of existing databases or data systems that you'd like to link together, but you don't control their population, schema, or the business processes associated with them. In this case, we have the profile (metadata) and product (data) server components, which expose the underlying metadata and data from these systems and make them easily available for query, representation and dissemination. Profile and product servers run on top of the web-grid WAR file, a Tomcat webapp that turns them into REST-ful services. The best place to get started here is to look at:

http://oodt.apache.org/components/maven/grid/slides.pdf

NOTE: those slides were made pre-Apache OODT, so some of them will contain old properties and paths for Web Grid, but should still give you an idea of what's going on. The Apache OODT web-grid is basically the same component that you see in those slides.

Once you are familiar with web-grid there are a few custom, extensible profile and product server handlers that we have been working on. xmlps (available as a top-level OODT module) is an XML-configurable profile/product server that can easily connect to JDBC-accessible databases and dump out the bits and metadata from them. OPeNDAPPs is an XML-configurable profile server that can connect to OPeNDAP accessible data servers and extract metadata and data from them.
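
To illustrate the profile (metadata) versus product (data) split over a database you don't control, here is a minimal Python sketch using the standard sqlite3 module in place of a JDBC-accessible database. It is not xmlps or web-grid code; profile_query and product_query are hypothetical names, and the table and column names are assumed to come from trusted configuration rather than from user input.

```python
import sqlite3
from typing import Dict, List

# Hypothetical sketch; these functions are NOT the xmlps or web-grid API.


def profile_query(conn: sqlite3.Connection, table: str) -> List[Dict[str, str]]:
    """Profile side: describe the table's columns without returning any rows."""
    # table is assumed to be a trusted, configured name (it is interpolated).
    cur = conn.execute(f'PRAGMA table_info("{table}")')
    # Each result row is (cid, name, type, notnull, dflt_value, pk).
    return [{"name": row[1], "type": row[2]} for row in cur.fetchall()]


def product_query(conn: sqlite3.Connection, table: str,
                  column: str, value: str) -> List[tuple]:
    """Product side: return the rows whose column equals value."""
    # Identifiers are trusted configuration; the value is bound as a parameter.
    cur = conn.execute(f'SELECT * FROM "{table}" WHERE "{column}" = ?', (value,))
    return cur.fetchall()
```

The point of the split is that a client can first ask what an existing system holds (the profile) and only then ask for the data itself (the product), without either query requiring changes to the underlying database.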

> 
> As a result, I still have a lot of very basic questions:  What do I do with all of these components? What do they all do?  Which ones do I need, and which are optional? Are they standalone executables?  Web services that require some sort of container?  Do I interact with them using the command line, or do they have web or web services interfaces?  What are the configuration options?  What kinds of data and metadata can I manage? What kinds of roles do I need to have within my organization (administrator, content owner, metadata maintainer), and how does the software handle these? What do I want to do that this project can't? (In this type of software, there's always something that's just a little too specific to the original purpose or organization.)

Hopefully what I mentioned above will give you a basic idea of what's going on. Apache OODT is a framework that by itself doesn't build your data system for you; it needs some TLC from a person like you that knows your data system requirements, etc., and can help map those to the specific components and resultant architecture provided by OODT to use it for your application.

Check out what I mentioned above, and then if you need more help just jump on list and let us know. At that point it would be nice if you could give us some more detail about what you are actually trying to do in terms of data management/etc., as that would give us a better idea of how to suggest help in configuring and using OODT for your specific case.

> 
> OODT claims to have a large user community apart from the original developers.  How did it come to be that these organizations and individuals knew how to use the software?  What sort of documentation and support did the developers need to provide in order to get them up and running?  How can I get some of that? :)

Like Dave Kale mentioned in his email, a lot of the work to date has come from collaborative research grants and shared effort on projects with folks working in the organizations that have used OODT. Now that it's here at Apache many of those folks are lurking on these lists, and available to help out and discuss issues with the software, etc., also in the hopes that it will help out their specific deployments.

> 
> Again, I'm very grateful that this product exists and am excited to find out more about it.  Thanks for making it available for me to puzzle over!

Thanks for your email and welcome!

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


