Posted to dev@oodt.apache.org by Thomas Bennett <lm...@gmail.com> on 2015/04/22 12:09:10 UTC

Datastore references for duplicate products

Hi,

Okay - so a 'flat' product has a single datastore reference.

So, how do you handle redundant copies of products? At the moment I have an
independent catalogue at each site.

I was thinking of a site metadata key, so multiple products can be
filtered, but I thought I would see what other people are doing and if
there is any interest in perhaps getting a product (flat or hierarchical)
with multiple references, i.e. beyond originalReference and
datastoreReference.
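
For the site key, concretely I'm picturing something like this at ingest
(a sketch only - "Site" is a made-up element name that would also need to be
declared in elements.xml, and I'm assuming the standard cas-metadata
Metadata class):

    import org.apache.oodt.cas.metadata.Metadata;

    public class SiteTagExample {

        public static void main(String[] args) {
            // Build up the product metadata as a crawler/extractor normally would.
            Metadata met = new Metadata();
            met.addMetadata("Filename", "obs_12345.tar"); // placeholder file name

            // Tag this copy with the site that holds it.
            met.addMetadata("Site", "SiteA");

            System.out.println(met.getMetadata("Site")); // SiteA
        }
    }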

Or does that totally break the OODT model? It probably does...

Also - does anyone store data on a tape library and index it with OODT? I'm
talking basic tar on a tape. This obviously breaks the file retrieval, if
used, but I'm thinking of how this could be included in the OODT framework
and maybe develop some methods.

But before I go too deep I thought I would ask.

Cheers,
Tom

Re: Datastore references for duplicate products

Posted by Thomas Bennett <lm...@gmail.com>.
Hey Lewis,

> Well do you want to retain them or discard them?


Yeah.  We keep the copy while we still can. I'm more interested in tying
our current databases together so that I have one place I can query for
duplicates - stuff could also be on tape.


> Are you interested in provenance tracking?


Not at the moment - not needed.


> I'm going to start a thread on schema evolution in parallel which may help
> on the data modeling front.
>

w00t. I'll take a look.

Cheers,
Tom

Re: Datastore references for duplicate products

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Thomas,

On Wednesday, April 22, 2015, Thomas Bennett <lm...@gmail.com> wrote:

> Hi,
>
> Okay - so a 'flat' product has a single datastore reference.
>
> So, how do you handle redundant copies of products?


Well do you want to retain them or discard them? Are you interested
in provenance tracking?
If not, then you just make updates to the data assuming the data model does
not change.
I'm going to start a thread on schema evolution in parallel which may help
on the data modeling front.
Lewis



> At the moment I have an
> independent catalogue at each site.
>
> I was thinking of a site metadata key, so multiple products can be
> filtered, but I thought I would see what other people are doing and if
> there is any interest in perhaps getting a product (flat or hierarchical)
> with multiple references, i.e. beyond originalReference and
> datastoreReference.
>
> Or does that totally break the OODT model? It probably does...
>
> Also - does anyone store data on a tape library and index it with OODT? I'm
> talking basic tar on a tape. This obviously breaks the file retrieval, if
> used, but I'm thinking of how this could be included in the OODT framework
> and maybe develop some methods.
>
> But before I go too deep I thought I would ask.
>
> Cheers,
> Tom
>


-- 
*Lewis*

Re: Datastore references for duplicate products

Posted by Chris Mattmann <ch...@gmail.com>.
Hey Tom,


-----Original Message-----
From: Thomas Bennett <lm...@gmail.com>
Reply-To: <de...@oodt.apache.org>
Date: Thursday, April 23, 2015 at 8:08 AM
To: OODT <de...@oodt.apache.org>
Subject: Re: Datastore references for duplicate products

>Hey Chris,
>
>Thanks for your reply :)
>>[..snip..]

And yours!

>> Well yep we do this - one way to do it would be to add a
>> TapeBasedDataTransferer,
>> and/or utilities to handle waiting for data e.g., from Tape. You could
>>also
>> use the InPlaceDataTransferer, and then come up with a Versioner that
>> easily
>> supports having Tape-based URIs.
>>
>
>Great! That's where I'm heading :). How did you guys handle the URIs for
>tapes? I'm basically just using the tape name - dataStoreReference =
>tape://MK0001L6

Yep something like that is what we did. Should work fine!

Cheers,
Chris



Re: Datastore references for duplicate products

Posted by Thomas Bennett <lm...@gmail.com>.
Hey Chris,

Thanks for your reply :)

>Or does that totally break the OODT model? It probably does...
>
> Nope doesn’t break the model at all - the Product with structure
> ProductStructure.HIERARCHICAL works fine with this - data transfers
> work fine - metadata retrieval works fine (try ingesting a directory),
> and other things work fine too.
>
> However, I would seriously advise *against* using this model ^_^
> Mostly b/c no one really does it that way - and the code isn’t actively
> being developed and maintained. At one point I thought it would be a
> great thing to support and the last I heard of someone using this structure
> natively was OCO from 2007/2008 (a NASA mission).
>

Thanks Chris. That confirms what I was thinking. I think I'll just keep it
simple.

>Also - does anyone store data on a tape library and index it with OODT? I'm
> >talking basic tar on a tape. This obviously breaks the file retrieval, if
> >used, but I'm thinking of how this could be included in the OODT framework
> >and maybe develop some methods.
>
> Well yep we do this - one way to do it would be to add a
> TapeBasedDataTransferer,
> and/or utilities to handle waiting for data e.g., from Tape. You could also
> use the InPlaceDataTransferer, and then come up with a Versioner that
> easily
> supports having Tape-based URIs.
>

Great! That's where I'm heading :). How did you guys handle the URIs for
tapes? I'm basically just using the tape name - dataStoreReference =
tape://MK0001L6
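
In case it is useful to anyone else, the versioner side of it looks roughly
like this for me (a sketch only - "TapeLabel" is a metadata key I'm assuming
gets set at ingest, and the interface/method names are the standard Versioner
contract as I understand it, so worth checking against your OODT version):

    import java.util.List;

    import org.apache.oodt.cas.filemgr.structs.Product;
    import org.apache.oodt.cas.filemgr.structs.Reference;
    import org.apache.oodt.cas.filemgr.structs.exceptions.VersioningException;
    import org.apache.oodt.cas.filemgr.versioning.Versioner;
    import org.apache.oodt.cas.metadata.Metadata;

    // Sketch: rewrite each reference's datastore location so it points at the
    // tape holding the tar. Appending the file name keeps references unique
    // per file; dropping it gives the plain tape://MK0001L6 form.
    public class TapeVersioner implements Versioner {

        public void createDataStoreReferences(Product product, Metadata metadata)
                throws VersioningException {
            String tapeLabel = metadata.getMetadata("TapeLabel"); // assumed met key
            if (tapeLabel == null) {
                throw new VersioningException("No TapeLabel metadata for product: "
                        + product.getProductName());
            }
            List<Reference> refs = product.getProductReferences();
            for (Reference ref : refs) {
                // e.g. file:///staging/obs_12345.tar -> tape://MK0001L6/obs_12345.tar
                String orig = ref.getOrigReference();
                String fileName = orig.substring(orig.lastIndexOf('/') + 1);
                ref.setDataStoreReference("tape://" + tapeLabel + "/" + fileName);
            }
        }
    }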

Cheers,
Tom

Re: Datastore references for duplicate products

Posted by Thomas Bennett <lm...@gmail.com>.
Hi Bruce,

Interesting!

> As a note on practice, NASA's ERBE and CERES projects
> (the latter continuing on JPSS) use a hierarchical categorization
> of files.


Just so that I understand - do you add to the hierarchy over time, i.e.
like version control, where each added reference to the same product is a
newer version? Or are you using it, as I understand it, as a collection of
files in a directory that makes up a product?


> CERES, in particular, is likely to continue indefinitely.
>
> Note also that sites archiving rock core samples fit into this
> category, as does the NOAA Emergency Responder Imagery
> Collection (ERIC).  ERIC only picks up one or two storms per
> year and they already handle very short turnaround on 100,000
> images taken a day or two after a major storm.
>
> Bruce B.


Cheers,
Tom Bennett.

Re: Datastore references for duplicate products

Posted by Bruce Barkstrom <br...@gmail.com>.
As a note on practice, NASA's ERBE and CERES projects
(the latter continuing on JPSS) use a hierarchical categorization
of files.  CERES, in particular, is likely to continue indefinitely.

Note also that sites archiving rock core samples fit into this
category, as does the NOAA Emergency Responder Imagery
Collection (ERIC).  ERIC only picks up one or two storms per
year and they already handle very short turnaround on 100,000
images taken a day or two after a major storm.

Bruce B.

On Thu, Apr 23, 2015 at 1:11 AM, Mattmann, Chris A (3980) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> Hey ThomasB (we have 2 “TomB”s now, so I’ll fully use your
> name ;) )
>
> Great questions. My thoughts below:
>
>
>
> -----Original Message-----
> From: Thomas Bennett <lm...@gmail.com>
> Reply-To: "dev@oodt.apache.org" <de...@oodt.apache.org>
> Date: Wednesday, April 22, 2015 at 6:09 AM
> To: OODT <de...@oodt.apache.org>
> Subject: Datastore references for duplicate products
>
> >Hi,
> >
> >Okay - so a 'flat' product has a single datastore reference.
> >
> >So, how do you handle redundant copies of products? At the moment I have
> >an
> >independent catalogue at each site.
> >
> >I was thinking of a site metadata key, so multiple products can be
> >filtered, but I thought I would see what other people are doing and if
> >there is any interest in perhaps getting a product (flat or hierarchical)
> >with multiple references, i.e. beyond originalReference and
> >datastoreReference.
>
> My personal opinion is that tagging each Product with a “Site” metadata
> field would be a great idea and a great way to filter this later.
>
> >
> >Or does that totally break the OODT model? It probably does...
>
> Nope doesn’t break the model at all - the Product with structure
> ProductStructure.HIERARCHICAL works fine with this - data transfers
> work fine - metadata retrieval works fine (try ingesting a directory),
> and other things work fine too.
>
> However, I would seriously advise *against* using this model ^_^
> Mostly b/c no one really does it that way - and the code isn’t actively
> being developed and maintained. At one point I thought it would be a
> great thing to support and the last I heard of someone using this structure
> natively was OCO from 2007/2008 (a NASA mission).
>
> >
> >Also - does anyone store data on a tape library and index it with OODT? I'm
> >talking basic tar on a tape. This obviously breaks the file retrieval, if
> >used, but I'm thinking of how this could be included in the OODT framework
> >and maybe develop some methods.
>
> Well yep we do this - one way to do it would be to add a
> TapeBasedDataTransferer,
> and/or utilities to handle waiting for data e.g., from Tape. You could also
> use the InPlaceDataTransferer, and then come up with a Versioner that
> easily
> supports having Tape-based URIs.
>
> Does that make sense?
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattmann@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>

Re: Datastore references for duplicate products

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Hey ThomasB (we have 2 “TomB”s now, so I’ll fully use your
name ;) )

Great questions. My thoughts below:



-----Original Message-----
From: Thomas Bennett <lm...@gmail.com>
Reply-To: "dev@oodt.apache.org" <de...@oodt.apache.org>
Date: Wednesday, April 22, 2015 at 6:09 AM
To: OODT <de...@oodt.apache.org>
Subject: Datastore references for duplicate products

>Hi,
>
>Okay - so a 'flat' product has a single datastore reference.
>
>So, how do you handle redundant copies of products? At the moment I have
>an
>independent catalogue at each site.
>
>I was thinking of a site metadata key, so multiple products can be
>filtered, but I thought I would see what other people are doing and if
>there is any interest in perhaps getting a product (flat or hierarchical)
>with multiple references, i.e. beyond originalReference and
>datastoreReference.

My personal opinion is that tagging each Product with a “Site” metadata
field would be a great idea and a great way to filter this later.
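
Filtering later is then just a term query against that field - something like
the below (rough sketch; "Site", "SiteA" and the product type name are
placeholders, and the client calls are from memory, so double check them
against your OODT version):

    import java.net.URL;
    import java.util.List;

    import org.apache.oodt.cas.filemgr.structs.Product;
    import org.apache.oodt.cas.filemgr.structs.ProductType;
    import org.apache.oodt.cas.filemgr.structs.Query;
    import org.apache.oodt.cas.filemgr.structs.TermQueryCriteria;
    import org.apache.oodt.cas.filemgr.system.XmlRpcFileManagerClient;

    public class SiteFilterExample {

        public static void main(String[] args) throws Exception {
            // File manager URL and product type name are placeholders.
            XmlRpcFileManagerClient fm =
                new XmlRpcFileManagerClient(new URL("http://localhost:9000"));
            ProductType type = fm.getProductTypeByName("RawObservation");

            // Only the copies catalogued at one site.
            Query query = new Query();
            query.addCriterion(new TermQueryCriteria("Site", "SiteA"));

            List<Product> atSiteA = fm.query(query, type);
            System.out.println(atSiteA.size() + " products at SiteA");
        }
    }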

>
>Or does that totally break the OODT model? It probably does...

Nope doesn’t break the model at all - the Product with structure
ProductStructure.HIERARCHICAL works fine with this - data transfers
work fine - metadata retrieval works fine (try ingesting a directory),
and other things work fine too.
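
(Concretely, the structure is just a flag on the Product record - a rough
sketch, using what I believe are the Product.STRUCTURE_* constants:)

    import org.apache.oodt.cas.filemgr.structs.Product;

    public class HierarchicalProductSketch {

        public static void main(String[] args) {
            // One Product record standing for a whole directory of files;
            // its references get filled in at ingest time.
            Product dirProduct = new Product();
            dirProduct.setProductName("obs_20150422"); // placeholder name
            dirProduct.setProductStructure(Product.STRUCTURE_HIERARCHICAL);

            System.out.println(dirProduct.getProductStructure());
        }
    }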

However, I would seriously advise *against* using this model ^_^
Mostly b/c no one really does it that way - and the code isn’t actively
being developed and maintained. At one point I thought it would be a
great thing to support and the last I heard of someone using this structure
natively was OCO from 2007/2008 (a NASA mission).

>
>Also - does anyone store data on a tape library and index it with OODT? I'm
>talking basic tar on a tape. This obviously breaks the file retrieval, if
>used, but I'm thinking of how this could be included in the OODT framework
>and maybe develop some methods.

Well yep we do this - one way to do it would be to add a
TapeBasedDataTransferer,
and/or utilities to handle waiting for data e.g., from Tape. You could also
use the InPlaceDataTransferer, and then come up with a Versioner that
easily
supports having Tape-based URIs.
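
The retrieval side then just has to resolve the tape reference before it can
stage anything. A tiny sketch of that step (pure illustration - the
tape://<label>/<path> layout is only an assumption):

    import java.net.URI;

    // Sketch: split a datastore reference like tape://MK0001L6/obs_12345.tar
    // into the pieces a tape-based transferer would need to stage the file.
    public class TapeReferenceResolver {

        public static class TapeLocation {
            public final String tapeLabel;   // e.g. MK0001L6
            public final String memberPath;  // path of the tar on that tape

            public TapeLocation(String tapeLabel, String memberPath) {
                this.tapeLabel = tapeLabel;
                this.memberPath = memberPath;
            }
        }

        public static TapeLocation resolve(String dataStoreRef) {
            URI uri = URI.create(dataStoreRef);
            if (!"tape".equals(uri.getScheme())) {
                throw new IllegalArgumentException("Not a tape reference: " + dataStoreRef);
            }
            // The authority carries the tape label; the path names the tar on the tape.
            return new TapeLocation(uri.getAuthority(), uri.getPath());
        }

        public static void main(String[] args) {
            TapeLocation loc = resolve("tape://MK0001L6/obs_12345.tar");
            // A tape-based transferer could now ask the library to mount MK0001L6
            // and pull /obs_12345.tar before handing the file back to the client.
            System.out.println(loc.tapeLabel + " -> " + loc.memberPath);
        }
    }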

Does that make sense?

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++



Re: Datastore references for duplicate products

Posted by Bruce Barkstrom <br...@gmail.com>.
As a variant, what happens if you discover that several of
the files have been corrupted (by errors in software or by
hardware malfunction during the write process)?  If the files
are part of a series (such as a time series) where missing
data could cause a serious increase in uncertainty, do you
replace the erroneous members of the series - and if so,
how do you identify the replacement values?  You should
also consider how you deal with redoing files when the
corrupted data was used to create other files in a
complex workflow.

Incidentally, we had a case where a router hardware failure
corrupted a large number of transmitted files.  Those had
to be replaced.  The replacement files then had to be input
to a fairly complex production workflow, followed by reinsertion
into the data stores we were using.  Not fun!

Some groups concerned with long-term archival don't
permit deletion of data - even if erroneous.  However,
the users really need a homogeneous data record with
as few gaps as possible.  They need to be informed of
the availability of the replacements.

Along this line, in one case we had built a very stringent
consistency check that time always increased in both
inputs to a process that had to merge data sources.
Specifically, data from our instruments came in one
data stream; data on the satellite position (ephemeris)
came in another; data on the satellite attitude came in
a third.  When we got data, one of the sources had put
time-reversed "tape recorder" data on one of the input
files - for about a two week period.  Trying to rewrite the
error checking would have been so complicated that
we just dropped those two weeks of data and had to live
with the gap.  Personally, I was glad the error checking
worked as intended.

Bruce B.

On Wed, Apr 22, 2015 at 8:29 AM, Bruce Barkstrom <br...@gmail.com>
wrote:

> What happens to references to duplicate files stored in an online backup
> directory, as well as ones stored in a remote backup
> location?  In more complicated versions of this question,
> how would a federated archive handle replicas stored
> in other archives?
>
> Then, would it matter if the archive decided to do a slight
> reformat of the data that merely rearranged the numerical
> values without adding or deleting any of them?
>
> Bruce B.
>
> On Wed, Apr 22, 2015 at 6:09 AM, Thomas Bennett <lm...@gmail.com>
> wrote:
>
>> Hi,
>>
>> Okay - so a 'flat' product has a single datastore reference.
>>
>> So, how do you handle redundant copies of products? At the moment I have
>> an
>> independent catalogue at each site.
>>
>> I was thinking of a site metadata key, so multiple products can be
>> filtered, but I thought I would see what other people are doing and if
>> there is any interest in perhaps getting a product (flat or hierarchical)
>> with multiple references, i.e. beyond originalReference and
>> datastoreReference.
>>
>> Or does that totally break the OODT model? It probably does...
>>
>> Also - does anyone store data on a tape library and index it with OODT? I'm
>> talking basic tar on a tape. This obviously breaks the file retrieval, if
>> used, but I'm thinking of how this could be included in the OODT framework
>> and maybe develop some methods.
>>
>> But before I go too deep I thought I would ask.
>>
>> Cheers,
>> Tom
>>
>
>

Re: Datastore references for duplicate products

Posted by Bruce Barkstrom <br...@gmail.com>.
What happens to references to duplicate files stored in an online backup
directory, as well as ones stored in a remote backup
location?  In more complicated versions of this question,
how would a federated archive handle replicas stored
in other archives?

Then, would it matter if the archive decided to do a slight
reformat of the data that merely rearranged the numerical
values without adding or deleting any of them?

Bruce B.

On Wed, Apr 22, 2015 at 6:09 AM, Thomas Bennett <lm...@gmail.com> wrote:

> Hi,
>
> Okay - so a 'flat' product has a single datastore reference.
>
> So, how do you handle redundant copies of products? At the moment I have an
> independent catalogue at each site.
>
> I was thinking of a site metadata key, so multiple products can be
> filtered, but I thought I would see what other people are doing and if
> there is any interest in perhaps getting a product (flat or hierarchical)
> with multiple references, i.e. beyond originalReference and
> datastoreReference.
>
> Or does that totally break the OODT model? It probably does...
>
> Also - does anyone store data on a tape library and index it with OODT? I'm
> talking basic tar on a tape. This obviously breaks the file retrieval, if
> used, but I'm thinking of how this could be included in the OODT framework
> and maybe develop some methods.
>
> But before I go too deep I thought I would ask.
>
> Cheers,
> Tom
>