Posted to user@oodt.apache.org by Thomas Bennett <lm...@gmail.com> on 2012/03/16 14:57:00 UTC

Data transfer questions

Hi,

I have a few questions about data transfer and thought I would roll it into
one email:

1) Local and remote data transfer with the same file manager

   - I see that when configuring a cas-crawler, one specifies the data
   transfer factory by using --clientTransferer
   - However in etc/filemgr.properties the data transfer factory
   is specified with filemgr.datatransfer.factory.

Does this mean that if I specify a local transfer factory I cannot use a
crawler with a remote data transferer?

I'm wanting to cater for a situation where files could be ingested locally
as well as remotely using a single file manager. Is this possible?

2) Copying an ingested product to a backup archive

For backup (and access purposes), I'm wanting to ingest the product into an
off-site archive (at our main engineering office) with its
own separate catalogue.
What is the recommended way of doing this?

The way I currently do this is by replicating the files using rsync (but I'm
then left to find a way to update the catalogue). I was wondering if
there is a neater (more OODT) solution?

I was thinking of perhaps using the functionality described in OODT-84
(Ability for File Manager to stage an ingested Product to one of its
clients) and then having a second crawler on the backup archive which will
then update its own catalogue.

I just thought I would ask the question in case anyone has tried something
similar.

Cheers,
Tom

Re: Data transfer questions

Posted by Thomas Bennett <lm...@gmail.com>.
Thanks Chris - wiki page on its way :)

On 19 March 2012 22:52, Mattmann, Chris A (388J) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> Hey Tom,
>
> AWESOME. I smell Wiki page :)
>
> Read on below:
>
> On Mar 19, 2012, at 8:18 PM, Thomas Bennett wrote:
>
> >
> > Versioner schemes
> >
> > The Data Transferers have an acute coupling with the Versioner scheme,
> case in point: if you are doing InPlaceTransfer,
> > you need a versioner that will handle file paths that don't change from
> src to dest.
> >
> > The Versioner is used to describe how a target directory is created for
> a file to archive, i.e., a directory structure where the data will be placed.
> So if I have an archive root at /var/kat/archive/data/ and I use a basic
> versioner it will archive a file called 1234567890.h5 at
> /var/kat/archive/data/1234567890.h5/1234567890.h5. So this would describe
> the destination for a local data transfer.
> >
> > I have the following versioner set in my policy/product-types.xml.
> >
> > policy/product-types.xml
> > <versioner
> class="org.apache.oodt.cas.filemgr.versioning.BasicVersioner"/>
>
> Ah, gotcha. You may consider using the MetadataBasedFileVersioner. It lets
> you define a filePathSpec,
> e.g., /[PrincipalInvestigator]/[Project]/[AcquisitionDate]/[Filename]
>
> And then versions or "places" the resulting product files in that
> specified structure.
>
> To create the above, you would simply subclass the Versioner like so:
>
> public class KATVersioner extends MetadataBasedFileVersioner {
>   String filePathSpec =
> "/[PrincipalInvestigator]/[Project]/[AcquisitionDate]/[Filename]";
>
>   public KATVersioner(){
>     setFilePathSpec(filePathSpec);
>   }
> }
>
> You can even refer to keys that don't exist yet, and then dynamically
> generate them (and their
> values) by overriding the createDataStoreReferences method:
>
> @Override
>  public void createDataStoreReferences(Product product, Metadata met){
>     // derive the acquisition date for this product here, e.g.:
>     String acqdate = extractAcquisitionDate(product); // hypothetical helper
>     met.replaceMetadata("AcquisitionDate", acqdate);
>     super.createDataStoreReferences(product, met);
>   }
>
>
> >
> > Just out of curiosity... why is this called a versioner?
>
> Hehe, if it's weird in OODT, it most likely resulted from me :) I
> originally saw
> this as a great tool to "version" or allow for multiple copies of a file
> on disk, e.g., with different
> file (or directory-based) metadata to delineate the versions. Over time
> it really grew to be a
> "URIGenerationScheme" or "ArchivePathGenerator". Those would be better
> names, but Versioner
> stuck, so here we are :)
>
> >
> > Using the File Manager as the client
> >
> > Configuring a data transfer in filemgr.properties, and then not using
> the crawler, but, e.g., using the XmlRpcFileManagerClient directly,
> > you can tell the server (on the ingest(...) method) to handle all the
> file transfers for you. In that case, the server needs a
> > Data Transferer configured, and the above properties apply, with the
> caveat that the FM server is now the "client" that is transferring
> > the data to itself :)
> >
> > If I set the following property in the etc/filemgr.properties file
> >
> >
> filemgr.datatransfer.factory=org.apache.oodt.cas.filemgr.datatransfer.RemoteDataTransferFactory
> >
> > I did a quick try of this today, ingesting on my localhost (to
> avoid any sticky network issues), and I was able to perform an ingest.
> >
> > I see you can specify the data transfer factory to use, so I assume then
> that the filemgr.datatransfer.factory setting is just the default if none
> is specified on the command line. Is this true?
>
> It's true, if you are doing server-based transfers (by calling the
> filemgr-client --ingestProduct method directly, without specifying the
> data transfer factory on the command line,
> yep).
>
> >
> > I ran a version of the command line client (my own version of
> filemgr-client with abs paths to the configuration files):
> >
> > cas-filemgr-client.sh --url http://localhost:9101 --operation
> --ingestProduct --refs /Users/thomas/1331871808.h5 --productStructure Flat
> --productTypeName KatFile --metadataFile /Users/thomas/1331871808.h5.met
> --productName 1331871808.h5 --clientTransfer --dataTransfer
> org.apache.oodt.cas.filemgr.datatransfer.RemoteDataTransferFactory
> >
> > With the data transfer factory also spec'ed as:
> >
> > etc/filemgr.properties
> >
> filemgr.datatransfer.factory=org.apache.oodt.cas.filemgr.datatransfer.RemoteDataTransferFactory
> >
> > And the versioner set as:
> >
> > policy/product-types.xml
> > <versioner
> class="org.apache.oodt.cas.filemgr.versioning.BasicVersioner"/>
> >
> > And it ingested the file. +1 for OODT!
>
> WOOT!
>
> >
> > Local and remote transfers to the same filemgr
> >
> > One way to do this is to write a facade Java class, e.g.,
> MultiTransferer, that can e.g., on a per-product type basis,
> > decide whether to call and delegate to LocalDataTransfer or
> RemoteDataTransfer. If written in a configurable way, that would be
> > an awesome addition to the OODT code base. We could call it
> ProductTypeDelegatingDataTransfer.
> >
> > I'm thinking I would prefer to have some crawlers specifying how files
> should be transferred. Is there any particular reason why this would not be
> a good idea - as long as the client specifies the transfer method to use?
>
> Yeah this is totally acceptable -- you can simply tell the crawler which
> TransferFactory to use. If you wanted the crawlers to sense it
> automatically based on Product Type (which also has to be provided), then
> you could use a method similar to the above.
>
> >
> > Getting the product to a second archive
> >
> > One way to do it is to simply stand up a file manager (and catalog) at
> the remote site, and then do remote data transfer (and met transfer) to take
> care of that.
> > Then as long as your XML-RPC ports are open both the data and metadata
> can be backed up by simply doing the same ingestion mechanisms. You could
> > wire that up as a Workflow task to run periodically, or as part of your
> std ingest pipeline (e.g., a Crawler action that on postIngestSuccess backs
> up to the remote
> > site by ingesting into the remote backup file manager).
> >
> > Okay. Got it! I'll see if I can wire up both options!
>
> AWESOME.
>
> >
> > I'd be happy to help you down either path.
> >
> > Thanks! Much appreciated.
> >
> > > I was thinking, perhaps using the functionality described in OODT-84
> (Ability for File Manager to stage an ingested Product to one of its
> clients) and then have a second crawler on the backup archive which will
> then update its own catalogue.
> >
> > +1, that would work too!
> >
> > Once again, thanks for the input and advice - always informative ;)
>
> Haha anytime dude. Great work!
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.a.mattmann@nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>

Re: Data transfer questions

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hey Tom,

AWESOME. I smell Wiki page :)

Read on below:

On Mar 19, 2012, at 8:18 PM, Thomas Bennett wrote:

> 
> Versioner schemes
> 
> The Data Transferers have an acute coupling with the Versioner scheme, case in point: if you are doing InPlaceTransfer,
> you need a versioner that will handle file paths that don't change from src to dest.
> 
> The Versioner is used to describe how a target directory is created for a file to archive, i.e., a directory structure where the data will be placed. So if I have an archive root at /var/kat/archive/data/ and I use a basic versioner it will archive a file called 1234567890.h5 at /var/kat/archive/data/1234567890.h5/1234567890.h5. So this would describe the destination for a local data transfer.
> 
> I have the following versioner set in my policy/product-types.xml.
> 
> policy/product-types.xml
> <versioner class="org.apache.oodt.cas.filemgr.versioning.BasicVersioner"/>

Ah, gotcha. You may consider using the MetadataBasedFileVersioner. It lets you define a filePathSpec, 
e.g., /[PrincipalInvestigator]/[Project]/[AcquisitionDate]/[Filename]

And then versions or "places" the resulting product files in that specified structure.

To create the above, you would simply subclass the Versioner like so:

public class KATVersioner extends MetadataBasedFileVersioner {
   private String filePathSpec = "/[PrincipalInvestigator]/[Project]/[AcquisitionDate]/[Filename]";

   public KATVersioner(){
     setFilePathSpec(filePathSpec);
   }
}
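
Wiring it in is then just a matter of pointing the versioner entry in policy/product-types.xml at the subclass (the package name below is illustrative, not part of OODT):

<versioner class="org.example.kat.KATVersioner"/>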

You can even refer to keys that don't exist yet, and then dynamically generate them (and their
values) by overriding the createDataStoreReferences method:

@Override
 public void createDataStoreReferences(Product product, Metadata met){
     // derive the acquisition date for this product here, e.g.:
     String acqdate = extractAcquisitionDate(product); // hypothetical helper
     met.replaceMetadata("AcquisitionDate", acqdate);
     super.createDataStoreReferences(product, met);
  }


>  
> Just out of curiosity... why is this called a versioner?

Hehe, if it's weird in OODT, it most likely resulted from me :) I originally saw 
this as a great tool to "version" or allow for multiple copies of a file on disk, e.g., with different
file (or directory-based) metadata to delineate the versions. Over time it really grew to be a
"URIGenerationScheme" or "ArchivePathGenerator". Those would be better names, but Versioner
stuck, so here we are :)

> 
> Using the File Manager as the client
> 
> Configuring a data transfer in filemgr.properties, and then not using the crawler, but, e.g., using the XmlRpcFileManagerClient directly,
> you can tell the server (on the ingest(...) method) to handle all the file transfers for you. In that case, the server needs a
> Data Transferer configured, and the above properties apply, with the caveat that the FM server is now the "client" that is transferring
> the data to itself :)
> 
> If I set the following property in the etc/filemgr.properties file
> 
> filemgr.datatransfer.factory=org.apache.oodt.cas.filemgr.datatransfer.RemoteDataTransferFactory
> 
> I did a quick try of this today, ingesting on my localhost (to avoid any sticky network issues), and I was able to perform an ingest.
> 
> I see you can specify the data transfer factory to use, so I assume then that the filemgr.datatransfer.factory setting is just the default if none is specified on the command line. Is this true?

It's true, if you are doing server-based transfers (by calling the filemgr-client --ingestProduct method directly, without specifying the data transfer factory on the command line,
yep).

> 
> I ran a version of the command line client (my own version of filemgr-client with abs paths to the configuration files):
> 
> cas-filemgr-client.sh --url http://localhost:9101 --operation --ingestProduct --refs /Users/thomas/1331871808.h5 --productStructure Flat --productTypeName KatFile --metadataFile /Users/thomas/1331871808.h5.met --productName 1331871808.h5 --clientTransfer --dataTransfer org.apache.oodt.cas.filemgr.datatransfer.RemoteDataTransferFactory
> 
> With the data transfer factory also spec'ed as:
> 
> etc/filemgr.properties
> filemgr.datatransfer.factory=org.apache.oodt.cas.filemgr.datatransfer.RemoteDataTransferFactory
> 
> And the versioner set as:
> 
> policy/product-types.xml
> <versioner class="org.apache.oodt.cas.filemgr.versioning.BasicVersioner"/>
> 
> And it ingested the file. +1 for OODT!

WOOT!

> 
> Local and remote transfers to the same filemgr 
>  
> One way to do this is to write a facade Java class, e.g., MultiTransferer, that can e.g., on a per-product type basis,
> decide whether to call and delegate to LocalDataTransfer or RemoteDataTransfer. If written in a configurable way, that would be
> an awesome addition to the OODT code base. We could call it ProductTypeDelegatingDataTransfer.
> 
> I'm thinking I would prefer to have some crawlers specifying how files should be transferred. Is there any particular reason why this would not be a good idea - as long as the client specifies the transfer method to use?

Yeah this is totally acceptable -- you can simply tell the crawler which TransferFactory to use. If you wanted the crawlers to sense it
automatically based on Product Type (which also has to be provided), then you could use a method similar to the above.

> 
> Getting the product to a second archive
> 
> One way to do it is to simply stand up a file manager (and catalog) at the remote site, and then do remote data transfer (and met transfer) to take care of that.
> Then as long as your XML-RPC ports are open both the data and metadata can be backed up by simply doing the same ingestion mechanisms. You could
> wire that up as a Workflow task to run periodically, or as part of your std ingest pipeline (e.g., a Crawler action that on postIngestSuccess backs up to the remote
> site by ingesting into the remote backup file manager).
> 
> Okay. Got it! I'll see if I can wire up both options!

AWESOME.

>  
> I'd be happy to help you down either path.
> 
> Thanks! Much appreciated.
>  
> > I was thinking, perhaps using the functionality described in OODT-84 (Ability for File Manager to stage an ingested Product to one of its clients) and then have a second crawler on the backup archive which will then update its own catalogue.
> 
> +1, that would work too!
> 
> Once again, thanks for the input and advice - always informative ;) 

Haha anytime dude. Great work!

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: Data transfer questions

Posted by Thomas Bennett <lm...@gmail.com>.
Hey Chris,

Thanks for your reply, much appreciated. You've cleared up a few issues
in my understanding.

I've gone through your reply and just added a few notes for completeness.
*Crawler data transfer, i.e. not using the File Manager as a client*

there are 2 ways to configure data transfer. If you are using a Crawler,
> the crawler is going to
> handle client side transfer to the FM server. You can configure Local,
> Remote, or InPlace transfer at the moment,
> or roll your own client side transfer and then pass it via the crawler
> command line or config.


1) Local data transfer


> Local means that the
> source and dest file paths need to be visible from the crawler's machine
> (or at least "appear" that way. A common pattern
> here is to use a Distributed File System like HDFS or GlusterFS to
> virtualize local disk, and mount it at a global virtual
> root. That way even though the data itself is distributed, to the Crawler
> and thus to LocalDataTransfer, it looks like
> it's on the same path).


2) Remote data transfer


> Remote means that the dest path can live on a different host, and that the
> client will work
> with the file manager server to chunk and transfer (via XML-RPC) that data
> from the client to the server.



3) In place data transfer

> InPlace means
> that no data transfer will occur at all.
>

(Great explanations - thanks!)

*Versioner schemes*

> The Data Transferers have an acute coupling with the Versioner scheme,
> case in point: if you are doing InPlaceTransfer,
> you need a versioner that will handle file paths that don't change from
> src to dest.
>

The Versioner is used to describe how a target directory is created for a
file to archive, i.e., a directory structure where the data will be placed. So
if I have an archive root at /var/kat/archive/data/ and I use a basic
versioner it will archive a file called 1234567890.h5 at
/var/kat/archive/data/1234567890.h5/1234567890.h5. So this would describe
the destination for a local data transfer.

I have the following versioner set in my policy/product-types.xml.

policy/product-types.xml
<versioner class="org.apache.oodt.cas.filemgr.versioning.BasicVersioner"/>

Just out of curiosity... why is this called a versioner?
*Using the File Manager as the client*

Configuring a data transfer in filemgr.properties, and then not using the
> crawler, but, e.g., using the XmlRpcFileManagerClient directly,
> you can tell the server (on the ingest(...) method) to handle all the file
> transfers for you. In that case, the server needs a
> Data Transferer configured, and the above properties apply, with the
> caveat that the FM server is now the "client" that is transferring
> the data to itself :)


If I set the following property in the etc/filemgr.properties file

filemgr.datatransfer.factory=org.apache.oodt.cas.filemgr.datatransfer.RemoteDataTransferFactory

I did a quick try of this today, ingesting on my localhost (to
avoid any sticky network issues), and I was able to perform an ingest.

I see you can specify the data transfer factory to use, so I assume then
that the filemgr.datatransfer.factory setting is just the default if none
is specified on the command line. Is this true?

I ran a version of the command line client (my own version of
filemgr-client with abs paths to the configuration files):

cas-filemgr-client.sh --url http://localhost:9101 --operation
--ingestProduct --refs /Users/thomas/1331871808.h5
--productStructure Flat --productTypeName KatFile
--metadataFile /Users/thomas/1331871808.h5.met --productName 1331871808.h5
--clientTransfer --dataTransfer
org.apache.oodt.cas.filemgr.datatransfer.RemoteDataTransferFactory

With the data transfer factory also spec'ed as:

etc/filemgr.properties
filemgr.datatransfer.factory=org.apache.oodt.cas.filemgr.datatransfer.RemoteDataTransferFactory

And the versioner set as:

policy/product-types.xml
<versioner class="org.apache.oodt.cas.filemgr.versioning.BasicVersioner"/>

And it ingested the file. +1 for OODT!

*Local and remote transfers to the same filemgr*


> One way to do this is to write a facade Java class, e.g., MultiTransferer,
> that can e.g., on a per-product type basis,
> decide whether to call and delegate to LocalDataTransfer or
> RemoteDataTransfer. If written in a configurable way, that would be
> an awesome addition to the OODT code base. We could call it
> ProductTypeDelegatingDataTransfer.
>

I'm thinking I would prefer to have some crawlers specifying how files
should be transferred. Is there any particular reason why this would not be
a good idea - as long as the client specifies the transfer method to use?

*Getting the product to a second archive*

> One way to do it is to simply stand up a file manager (and catalog) at the
> remote site, and then do remote data transfer (and met transfer) to take
> care of that.
> Then as long as your XML-RPC ports are open both the data and metadata can
> be backed up by simply doing the same ingestion mechanisms. You could
> wire that up as a Workflow task to run periodically, or as part of your
> std ingest pipeline (e.g., a Crawler action that on postIngestSuccess backs
> up to the remote
> site by ingesting into the remote backup file manager).
>

Okay. Got it! I'll see if I can wire up both options!


> I'd be happy to help you down either path.
>

Thanks! Much appreciated.

> I was thinking, perhaps using the functionality described in OODT-84
(Ability for File Manager to stage an ingested Product to one of its
clients) and then have a second crawler on the backup archive which will
then update its own catalogue.

>
> +1, that would work too!


Once again, thanks for the input and advice - always informative ;)

Cheers,
Tom

Re: Data transfer questions

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hey Tom,

On Mar 16, 2012, at 6:57 AM, Thomas Bennett wrote:

> Hi,
> 
> I have a few questions about data transfer and thought I would roll it into one email:
> 
> 1) Local and remote data transfer with the same file manager
> 	• I see that when configuring a cas-crawler, one specifies the data transfer factory by using --clientTransferer 
> 	• However in etc/filemgr.properties the data transfer factory is specified with  filemgr.datatransfer.factory.
> Does this mean that if I specify a local transfer factory I cannot use a crawler with a remote data transferer?

Basically it means that there are 2 ways to configure data transfer. If you are using a Crawler, the crawler is going to 
handle client side transfer to the FM server. You can configure Local, Remote, or InPlace transfer at the moment, 
or roll your own client side transfer and then pass it via the crawler command line or config. Local means that the
source and dest file paths need to be visible from the crawler's machine (or at least "appear" that way. A common pattern
here is to use a Distributed File System like HDFS or GlusterFS to virtualize local disk, and mount it at a global virtual
root. That way even though the data itself is distributed, to the Crawler and thus to LocalDataTransfer, it looks like
it's on the same path). Remote means that the dest path can live on a different host, and that the client will work
with the file manager server to chunk and transfer (via XML-RPC) that data from the client to the server. InPlace means
that no data transfer will occur at all.
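
Concretely, the two configuration points look like this (the factory class names are the ones used later in this thread):

# server-side default, in etc/filemgr.properties
filemgr.datatransfer.factory=org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory

# per-crawler override, on the crawler command line
--clientTransferer org.apache.oodt.cas.filemgr.datatransfer.RemoteDataTransferFactory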

The Data Transferers have an acute coupling with the Versioner scheme, case in point: if you are doing InPlaceTransfer,
you need a versioner that will handle file paths that don't change from src to dest.

Configuring a data transfer in filemgr.properties, and then not using the crawler, but, e.g., using the XmlRpcFileManagerClient directly,
you can tell the server (on the ingest(...) method) to handle all the file transfers for you. In that case, the server needs a 
Data Transferer configured, and the above properties apply, with the caveat that the FM server is now the "client" that is transferring
the data to itself :)
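
In code, that server-side path looks roughly like the sketch below. This is hedged: it assumes the client exposes ingestProduct(Product, Metadata, boolean), where the boolean selects client-side transfer; double-check the signature against your OODT version.

import java.net.URL;

import org.apache.oodt.cas.filemgr.structs.Product;
import org.apache.oodt.cas.filemgr.system.XmlRpcFileManagerClient;
import org.apache.oodt.cas.metadata.Metadata;

public class ServerSideIngest {

  // hands the file transfer off to the FM server itself
  public static String ingest(URL fmUrl, Product product, Metadata met)
      throws Exception {
    XmlRpcFileManagerClient client = new XmlRpcFileManagerClient(fmUrl);
    // clientTransfer == false: the server moves the data itself, using the
    // filemgr.datatransfer.factory configured in etc/filemgr.properties
    return client.ingestProduct(product, met, false);
  }
}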


> 
> I'm wanting to cater for a situation where files could be ingested locally as well as remotely using a single file manager. Is this possible?

Sure can. One way to do this is to write a facade Java class, e.g., MultiTransferer, that can e.g., on a per-product type basis,
decide whether to call and delegate to LocalDataTransfer or RemoteDataTransfer. If written in a configurable way, that would be
an awesome addition to the OODT code base. We could call it ProductTypeDelegatingDataTransfer.
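
A rough sketch of that facade follows. Hedged assumptions: the DataTransfer interface of this era exposes just setFileManagerUrl(URL) and transferProduct(Product), and the hard-coded map stands in for real configuration; the "KatFile" wiring is purely illustrative.

import java.io.IOException;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

import org.apache.oodt.cas.filemgr.datatransfer.DataTransfer;
import org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory;
import org.apache.oodt.cas.filemgr.datatransfer.RemoteDataTransferFactory;
import org.apache.oodt.cas.filemgr.structs.Product;
import org.apache.oodt.cas.filemgr.structs.exceptions.DataTransferException;

public class ProductTypeDelegatingDataTransfer implements DataTransfer {

  // per-product-type delegates; a real version would populate this from config
  private final Map<String, DataTransfer> delegates =
      new HashMap<String, DataTransfer>();
  private final DataTransfer fallback =
      new LocalDataTransferFactory().createDataTransfer();

  public ProductTypeDelegatingDataTransfer() {
    // illustrative wiring: remote transfer for KatFile, local for the rest
    delegates.put("KatFile",
        new RemoteDataTransferFactory().createDataTransfer());
  }

  public void setFileManagerUrl(URL url) {
    for (DataTransfer dt : delegates.values()) {
      dt.setFileManagerUrl(url);
    }
    fallback.setFileManagerUrl(url);
  }

  public void transferProduct(Product product)
      throws DataTransferException, IOException {
    DataTransfer dt = delegates.get(product.getProductType().getName());
    (dt != null ? dt : fallback).transferProduct(product);
  }
}

A matching DataTransferFactory returning this class would then let it be plugged in via filemgr.datatransfer.factory like any other transferer.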

> 
> 2) Copying an ingested product to a backup archive
> 
> For backup (and access purposes), I'm wanting to ingest the product into an off-site archive (at our main engineering office) with its own separate catalogue.
> What is the recommended way of doing this? 

One way to do it is to simply stand up a file manager (and catalog) at the remote site, and then do remote data transfer (and met transfer) to take care of that.
Then as long as your XML-RPC ports are open both the data and metadata can be backed up by simply doing the same ingestion mechanisms. You could
wire that up as a Workflow task to run periodically, or as part of your std ingest pipeline (e.g., a Crawler action that on postIngestSuccess backs up to the remote
site by ingesting into the remote backup file manager). 
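
A sketch of that postIngestSuccess action is below. Hedged assumptions: cas-crawler's CrawlerAction base class with a performAction(File, Metadata) hook, StdIngester's ingest(URL, File, Metadata) signature, and a placeholder backup URL; verify these against your OODT version.

import java.io.File;
import java.net.URL;

import org.apache.oodt.cas.crawl.action.CrawlerAction;
import org.apache.oodt.cas.crawl.structs.exceptions.CrawlerActionException;
import org.apache.oodt.cas.filemgr.ingest.StdIngester;
import org.apache.oodt.cas.metadata.Metadata;

public class RemoteBackupAction extends CrawlerAction {

  // placeholder; set via the crawler's action (Spring) config in practice
  private String backupFmUrl = "http://backup.example.org:9000";

  @Override
  public boolean performAction(File product, Metadata metadata)
      throws CrawlerActionException {
    try {
      // re-ingest into the backup FM; RemoteDataTransferFactory pushes the
      // bytes over XML-RPC, so the backup site gets both data and metadata
      StdIngester ingester = new StdIngester(
          "org.apache.oodt.cas.filemgr.datatransfer.RemoteDataTransferFactory");
      ingester.ingest(new URL(backupFmUrl), product, metadata);
      return true;
    } catch (Exception e) {
      throw new CrawlerActionException(e.getMessage());
    }
  }

  public void setBackupFmUrl(String backupFmUrl) {
    this.backupFmUrl = backupFmUrl;
  }
}

Registered in the crawler's action beans under the postIngestSuccess phase, it only fires after the primary ingest succeeds.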

I'd be happy to help you down either path.

> 
> The way I currently do this is by replicating the files using rsync (but I'm then left to find a way to update the catalogue). I was wondering if there is a neater (more OODT) solution?

I think a good solution might be to run a remote Backup File Manager and just ingest it again. Another option would be to use the File Manager ExpImpCatalog tool to replicate the
metadata out to your remote site, and then to rsync the files. That way you get files + met.
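
The file half of that split is then a plain mirror of the archive root, e.g. (backup-host is a placeholder):

rsync -av /var/kat/archive/data/ backup-host:/var/kat/archive/data/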

> 
> I was thinking, perhaps using the functionality described in OODT-84 (Ability for File Manager to stage an ingested Product to one of its clients) and then have a second crawler on the backup archive which will then update its own catalogue.

+1, that would work too!

> 
> I just thought I would ask the question in case anyone has tried something similar.

Let me know what you think of the above and we'll work it out!

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

