Posted to user@oodt.apache.org by Ivan Subotic <ib...@subotic.ch> on 2011/01/20 01:10:01 UTC

OODT for a Distributed Archiving Project

Dear list,

I'm currently working on my PhD project, where I'm building a distributed archiving solution.

Basically, the distributed archive will consist of a number of nodes (each node belonging to a different organization), where every node will store its own data locally and keep replicas on a number of selected remote nodes.

There will be a number of predefined processes (e.g., integrity checking, creating additional replicas, etc.) that will run either periodically or when some event occurs (node-lost event, corrupted-object event, etc.). The data that the system will archive will consist of RDF/XML files (metadata) + binary files (e.g., TIFF images, JPEG images, etc.; referenced from the RDF). The RDF/XML files together with the binary files will be the products (in OODT language).
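
To give an idea of the first process, here is a minimal sketch of what a periodic integrity check could compute -- recomputing a file's digest and comparing it against the checksum recorded at ingest time. The algorithm choice and the stored-checksum lookup are my own assumptions, nothing OODT-specific:

    // Minimal fixity-check sketch: recompute a SHA-256 digest and compare
    // it with the checksum recorded when the product was archived.
    // (Assumption: checksums are kept as lowercase hex strings.)
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.security.MessageDigest;

    public class FixityCheckSketch {
        static boolean verify(Path file, String expectedHex) throws Exception {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            try (InputStream in = Files.newInputStream(file)) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    md.update(buf, 0, n);
                }
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString().equals(expectedHex);
        }
    }

A corrupted-object event would then be raised whenever verify() returns false for a replica.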

I'm looking into OODT to see if it can be used to create such a system and what components I would be using.

Below is a list of components that I have identified that I could use:
- CAS Workflow (to implement the processes)
- CAS Push/Pull Component (to send products to remote nodes and to get products from remote nodes). With what does the push/pull component communicate on the other side? Another push/pull component? And from where does the push/pull component get the data that it will send? From the file manager?

What I'm missing, but should be there somewhere:
- Security Component. How do I create Virtual Organizations and manage users and groups, so that I can restrict access?

Probably also needed:
- File Manager. In my case I would have the products (RDF + binary files) and would need to create the profiles on the fly with some basic information. Do I need the file manager for anything other than letting the end user access products and profiles? Since I'm going to load the RDF files into a triple store for further use, is it possible to extend the file manager so that the profile catalog is stored in a triple store?

Thank you very much for your time :-)

Best regards,
Ivan

Re: OODT for a Distributed Archiving Project

Posted by David Kale <da...@cs.stanford.edu>.
Ivan, I'd also look at the profile and product servers (for serving metadata
and data, respectively) and query server (for querying the distributed
servers).

That said, I'm not fully versed in the completeness of OODT (particularly in
the things that haven't been ported from the NASA version), and I'm not
certain that there is a security component that does everything you're
looking for -- which, if there isn't, means there's a great opportunity for
you to contribute!  I for one am also interested in security, as my project
works with medical data, so we have to worry a LOT about privacy and
security.

Dave



On Wed, Jan 19, 2011 at 4:10 PM, Ivan Subotic <ib...@subotic.ch> wrote:

> Dear list,
>
> I'm currently working on my PhD project, where I'm building a distributed
> archiving solution.
>
> Basically, the distributed archive will consist of a number of nodes (each
> node belonging to a different organization), where every node will store its
> own data locally and keep replicas on a number of selected remote nodes.
>
> There will be a number of predefined processes (e.g., integrity checking,
> creating additional replicas, etc.) that will run either periodically or
> when some event occurs (node-lost event, corrupted-object event, etc.). The
> data that the system will archive will consist of RDF/XML files (metadata) +
> binary files (e.g., TIFF images, JPEG images, etc.; referenced from the
> RDF). The RDF/XML files together with the binary files will be the products
> (in OODT language).
>
> I'm looking into OODT to see if it can be used to create such a system and
> what components I would be using.
>
> Below is a list of components that I have identified that I could use:
> - CAS Workflow (to implement the processes)
> - CAS Push/Pull Component (to send products to remote nodes and to get
> products from remote nodes). With what does the push/pull component
> communicate on the other side? Another push/pull component? And from where
> does the push/pull component get the data that it will send? From the file
> manager?
>
> What I'm missing, but should be there somewhere:
> - Security Component. How do I create Virtual Organizations and manage
> users and groups, so that I can restrict access?
>
> Probably also needed:
> - File Manager. In my case I would have the products (RDF + binary files)
> and would need to create the profiles on the fly with some basic
> information. Do I need the file manager for anything other than letting the
> end user access products and profiles? Since I'm going to load the RDF files
> into a triple store for further use, is it possible to extend the file
> manager so that the profile catalog is stored in a triple store?
>
> Thank you very much for your time :-)
>
> Best regards,
> Ivan

Re: OODT for a Distributed Archiving Project

Posted by Ivan Subotic <ib...@subotic.ch>.
Hi Chris,

Thank you very much for your answer. And thank you for the link to your thesis.

Now I see things much more clearly. I'm sure I'll come back with more questions as I progress with the assembly of the system.

Thanks,
Ivan


 
On 21.01.2011, at 04:25, Mattmann, Chris A (388J) wrote:

> Hi Ivan,
> 
> Thanks for your email! Comments inline below:
> 
>> I'm currently working on my PhD project, where I'm building a distributed archiving solution.
> 
> Strangely familiar :)
> 
> I was doing the same thing in the context of OODT from 2003-2007, see here for the culmination:
> 
> http://sunset.usc.edu/~mattmann/Dissertation.pdf
> 
>> 
>> Basically, the distributed archive will consist of a number of nodes (each node belonging to a different organization), where every node will store its own data locally and keep replicas on a number of selected remote nodes.
> 
> Gotcha.
> 
>> 
>> There will be a number of predefined processes (e.g., integrity checking, creating additional replicas, etc.) that will run either periodically or when some event occurs (node-lost event, corrupted-object event, etc.). The data that the system will archive will consist of RDF/XML files (metadata) + binary files (e.g., TIFF images, JPEG images, etc.; referenced from the RDF). The RDF/XML files together with the binary files will be the products (in OODT language).
> 
> Okey dokey.
> 
>> 
>> I'm looking into OODT to see if it can be used to create such a system and what components I would be using.
>> 
>> Below is a list of components that I have identified that I could use:
>> - CAS Workflow (to implement the processes)
>> - CAS Push/Pull Component (to send products to remote nodes and to get products from remote nodes). With what does the push/pull component communicate on the other side?
> 
> The pull communication in PushPull is the set of protocols like FTP, SCP, HTTP, etc. The push part is its ability to accept emails "pushed" to a mailbox over IMAPS, and then to take the URLs from those emails and resolve them using the pull protocols. So it's really simulated push at this point, but it works well with systems (like NOAA's, NASA's, etc.) that deliver emails to indicate a file is ready to be pushed.
> 
>> Another push/pull component? And from where does the push/pull component get the data that it will send? From the file manager?
> 
> Push Pull acquires remote content and then hands it off to a staging area that the crawler component picks up and reads from. The crawler only handles local data (intentionally -- the complexity of acquiring remote content was large enough to warrant the creation of its own component). The crawler takes the now-local content (and any other content dropped in the shared staging area) and then ingests it into the file manager, sending along metadata + references.
> 
>> 
>> What I'm missing, but should be there somewhere:
>> - Security Component. How do I create Virtual Organizations and manage users and groups, so that I can restrict access?
> 
> There is an SSO component that is pretty lightweight at this point; it implements connections to LDAP to do single sign-on. At one point I did a RESTful implementation of the SSO interface that connected to Java's OpenSSO -- totally cleanroom, using web services and protocols to connect to an OpenSSO service. I'll create a JIRA for this and attach it in the next few days.
> 
>> 
>> Probably also needed:
>> - File Manager. In my case I would have the products (RDF + binary files) and would need to create the profiles on the fly with some basic information. Do I need the file manager for anything other than letting the end user access products and profiles?
> 
> Yep, you sure do. You'll need the file manager, along with the cas-product webapp that lives in webapp/fmprod.
> 
> 
>> Since I'm going to load the RDF files into a triple store for further use, is it possible to extend the file manager so that the profile catalog is stored in a triple store?
> 
> Sure, you could do a catalog implementation that stores the metadata in a triple store. Alternatively, you could use the fmprod webapp to deliver RDF views of the metadata that's stored per product, configuring it with the rdfconf.xml file that's part of fmprod.
> 
> Thanks!
> 
> Cheers,
> Chris
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.a.mattmann@nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 


Re: OODT for a Distributed Archiving Project

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Ivan,

Thanks for your email! Comments inline below:

> I'm currently working on my PhD project, where I'm building a distributed archiving solution.

Strangely familiar :)

I was doing the same thing in the context of OODT from 2003-2007, see here for the culmination:

http://sunset.usc.edu/~mattmann/Dissertation.pdf

> 
> Basically, the distributed archive will consist of a number of nodes (each node belonging to a different organization), where every node will store its own data locally and keep replicas on a number of selected remote nodes.

Gotcha.

> 
> There will be a number of predefined processes (e.g., integrity checking, creating additional replicas, etc.) that will run either periodically or when some event occurs (node-lost event, corrupted-object event, etc.). The data that the system will archive will consist of RDF/XML files (metadata) + binary files (e.g., TIFF images, JPEG images, etc.; referenced from the RDF). The RDF/XML files together with the binary files will be the products (in OODT language).

Okey dokey.

> 
> I'm looking into OODT to see if it can be used to create such a system and what components I would be using.
> 
> Below is a list of components that I have identified that I could use:
> - CAS Workflow (to implement the processes)
> - CAS Push/Pull Component (to send products to remote nodes and to get products from remote nodes). With what does the push/pull component communicate on the other side?

The pull communication in PushPull is the set of protocols like FTP, SCP, HTTP, etc. The push part is its ability to accept emails "pushed" to a mailbox over IMAPS, and then to take the URLs from those emails and resolve them using the pull protocols. So it's really simulated push at this point, but it works well with systems (like NOAA's, NASA's, etc.) that deliver emails to indicate a file is ready to be pushed.
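
To make that concrete, the pull half conceptually boils down to resolving a remote URL into the local staging area, something like the plain-JDK sketch below. This is NOT the push/pull API itself, and the URL and staging path are made up:

    // Conceptual sketch of the "pull" half: fetch a remote file into a
    // local staging area. Push/pull does this through its pluggable
    // protocol layer (FTP, SCP, HTTP, ...); here we just use HTTP.
    import java.io.InputStream;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    public class PullSketch {
        public static void main(String[] args) throws Exception {
            URL remote = new URL("http://archive.example.org/data/product-0001.tif");
            Path staging = Paths.get("/data/staging");
            Files.createDirectories(staging);
            try (InputStream in = remote.openStream()) {
                // Drop the file into the shared staging area; from there
                // the crawler component takes over (see below).
                Files.copy(in, staging.resolve("product-0001.tif"),
                        StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }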

> Another push/pull component? And from where does the push/pull component get the data that it will send? From the file manager?

Push Pull acquires remote content and then hands it off to a staging area that the crawler component picks up and reads from. The crawler only handles local data (intentionally -- the complexity of acquiring remote content was large enough to warrant the creation of its own component). The crawler takes the now-local content (and any other content dropped in the shared staging area) and then ingests it into the file manager, sending along metadata + references.
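
In client code, an ingest into the file manager looks roughly like the sketch below. The class and method names are from the cas-filemgr client as I recall them, so double-check against your checkout; the URL, product type, and paths are invented:

    // Rough ingest sketch against the file manager's XML-RPC client.
    // Assumptions: a file manager running on port 9000 and a product
    // type named "GenericFile" defined in its policy.
    import java.net.URL;
    import java.util.Collections;

    import org.apache.oodt.cas.filemgr.structs.Product;
    import org.apache.oodt.cas.filemgr.structs.Reference;
    import org.apache.oodt.cas.filemgr.system.XmlRpcFileManagerClient;
    import org.apache.oodt.cas.metadata.Metadata;

    public class IngestSketch {
        public static void main(String[] args) throws Exception {
            XmlRpcFileManagerClient fm =
                new XmlRpcFileManagerClient(new URL("http://localhost:9000"));

            Product product = new Product();
            product.setProductName("product-0001");
            product.setProductStructure(Product.STRUCTURE_FLAT);
            product.setProductType(fm.getProductTypeByName("GenericFile"));
            product.setProductReferences(Collections.singletonList(
                new Reference("file:/data/staging/product-0001.tif", null, 0L)));

            Metadata met = new Metadata();
            met.addMetadata("Filename", "product-0001.tif");

            // true => let the client transfer the data to the archive
            String productId = fm.ingestProduct(product, met, true);
            System.out.println("Ingested as " + productId);
        }
    }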

> 
> What I'm missing, but should be there somewhere:
> - Security Component. How do I create Virtual Organizations and manage users and groups, so that I can restrict access?

There is an SSO component that is pretty lightweight at this point; it implements connections to LDAP to do single sign-on. At one point I did a RESTful implementation of the SSO interface that connected to Java's OpenSSO -- totally cleanroom, using web services and protocols to connect to an OpenSSO service. I'll create a JIRA for this and attach it in the next few days.
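
Conceptually, the LDAP side of it boils down to a bind check like this plain-JNDI sketch (this is not the sso component's own API, and the host and DN layout are hypothetical):

    // Authenticate a user by attempting an LDAP bind with their
    // credentials: a successful bind means the password is valid.
    import java.util.Hashtable;
    import javax.naming.Context;
    import javax.naming.NamingException;
    import javax.naming.directory.InitialDirContext;

    public class LdapAuthSketch {
        static boolean authenticate(String uid, String password) {
            Hashtable<String, String> env = new Hashtable<String, String>();
            env.put(Context.INITIAL_CONTEXT_FACTORY,
                    "com.sun.jndi.ldap.LdapCtxFactory");
            env.put(Context.PROVIDER_URL, "ldap://ldap.example.org:389");
            env.put(Context.SECURITY_AUTHENTICATION, "simple");
            env.put(Context.SECURITY_PRINCIPAL,
                    "uid=" + uid + ",ou=people,dc=example,dc=org");
            env.put(Context.SECURITY_CREDENTIALS, password);
            try {
                new InitialDirContext(env).close();
                return true;
            } catch (NamingException e) {
                return false;
            }
        }
    }

Groups and virtual organizations would then map onto LDAP groups that you query after the bind.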

> 
> Probably also needed:
> - File Manager. In my case I would have the products (RDF + binary files) and would need to create the profiles on the fly with some basic information. Do I need the file manager for anything other than letting the end user access products and profiles?

Yep, you sure do. You'll need the file manager, along with the cas-product webapp that lives in webapp/fmprod.


> Since I'm going to load the RDF files into a triple store for further use, is it possible to extend the file manager so that the profile catalog is stored in a triple store?

Sure, you could do a catalog implementation that stores the metadata in a triple store. Alternatively, you could use the fmprod webapp to deliver RDF views of the metadata that's stored per product, configuring it with the rdfconf.xml file that's part of fmprod.
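
For the triple-store route, the core of such a catalog implementation is just mapping a product's key/value metadata onto RDF triples, e.g. with Jena. A sketch only -- not a full catalog implementation (a real one would implement the file manager's Catalog interface and persist the model into your store); the namespace and keys are invented:

    // Map product metadata onto RDF triples with Jena and serialize
    // them; a catalog backend would persist the model instead.
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;
    import com.hp.hpl.jena.rdf.model.Resource;

    public class TripleSketch {
        static final String NS = "http://example.org/archive#";

        public static void main(String[] args) {
            Model model = ModelFactory.createDefaultModel();
            Resource product = model.createResource(NS + "product-0001");
            product.addProperty(model.createProperty(NS, "filename"),
                    "product-0001.tif");
            product.addProperty(model.createProperty(NS, "productType"),
                    "GenericFile");
            model.write(System.out, "RDF/XML");
        }
    }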

Thanks!

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

