Posted to dev@oodt.apache.org by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov> on 2011/05/05 04:11:01 UTC

Re: NCT3 Data ingest - revisit aggregation of IPs and eliminate unwanted data types?

[replying to dev@oodt.apache.org, since this conversation I think could help some users who are thinking about similar things]

> Ok, since you jumped in, maybe you can elaborate.
> 
> How would we implement a process in the crawler to perform a 200-1 down
> sample of push-pull downloaded files to aggregated, ingested products,
> without involving other PCS components, e.g., filemgr, workflow, etc.?
> 
> The production rule would be to gather and wait for all (or maybe just
> select the optimal set of) temporally coincident files (in this case 16 ~30
> sec files spanning 8 min), simultaneously corresponding to ~12 different
> file types, using some rule-based modulo-time boundary.

What would the down select involve? Throwing out the files that don't meet the criteria? Or still archiving them, just not treating them together as a single aggregate?
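
Either way, the modulo-time boundary itself is cheap to compute. Here's a throwaway sketch (plain Java, made-up class name, not an OODT component) of finding the 8-minute window a granule's start time falls into:

import java.time.Instant;

// Hypothetical helper, not part of OODT: computes the 8-minute aggregation
// window (modulo-time boundary) that a granule's start time falls into.
public class AggregateWindow {

    private static final long WINDOW_SECONDS = 8 * 60L;

    // Start of the 8-minute window containing the given granule start time.
    public static Instant windowStart(Instant granuleStart) {
        long epochSec = granuleStart.getEpochSecond();
        return Instant.ofEpochSecond(epochSec - (epochSec % WINDOW_SECONDS));
    }

    public static void main(String[] args) {
        Instant t = Instant.parse("2011-05-05T04:11:01Z");
        System.out.println(t + " -> window starting " + windowStart(t));
    }
}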

> 
> Perhaps one simplification to this problem would be to trigger the
> processing (and even better, derive the time boundaries) based on crawling a
> separate file type that we expect would be delivered at the desired temporal
> resolution.

Yep that's one way to do it. That's how we created the FTS pipeline in OCO, by having a separate ("mock") product, called FTSSavesetDir that we ingested (and on ingest, notified the WM that processing should occur). We controlled how and when these FTSSavesetDirs were made, and when they got moved into the appropriate staging area with the appropriate FTSSavesetDirCrawler watching for them.

>  After the aggregate product is generated, the executing process
> would need to move all its 200 input files out of the push pull staging area
> to a separate disk area for storage. But we would still want this process to
> wait on executing until it got all its expected input files (or reached some
> appropriate time out) before creating its product.

Wouldn't one way to do this just be to do it with Versioning? It sounds like you have sets of files that you'd like to be archived to the "nominal" archive (aka defined in a versioner, maybe your NPP PEATE std one) -- and then a set of them that you still want archived but to a separate disk area for storage, correct?

One way to do this would be simply to create a ShadowProduct (aka one you don't care about from an Ops Perspective), and then archive the 200 input files as this "ShadowProduct" with a versioner that dumps them into the separate disk for storage, outside of your std product archive.
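
If you go the ShadowProduct route, the versioner itself stays tiny. Here's a rough sketch -- the class name and shadow root path are made up, and I'm writing the Versioner's createDataStoreReferences(Product, Metadata) signature from memory, so double-check it against your OODT version:

import java.io.File;
import java.util.List;

import org.apache.oodt.cas.filemgr.structs.Product;
import org.apache.oodt.cas.filemgr.structs.Reference;
import org.apache.oodt.cas.filemgr.structs.exceptions.VersioningException;
import org.apache.oodt.cas.filemgr.versioning.Versioner;
import org.apache.oodt.cas.metadata.Metadata;

/**
 * Rough sketch of a "ShadowProduct" versioner: every reference gets dumped
 * under a separate disk area, outside the standard product archive. The class
 * name and SHADOW_ROOT path are illustrative only.
 */
public class ShadowProductVersioner implements Versioner {

    private static final String SHADOW_ROOT = "file:///shadow/archive/";

    public void createDataStoreReferences(Product product, Metadata metadata)
            throws VersioningException {
        List<Reference> refs = product.getProductReferences();
        for (Reference r : refs) {
            // keep only the file name; drop the push-pull staging directory layout
            String fileName = new File(r.getOrigReference()).getName();
            r.setDataStoreReference(SHADOW_ROOT + product.getProductName() + "/" + fileName);
        }
    }
}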

> 
> Of course at this point it seems to me we are basically buying into
> duplicating most of the basic filemgr and workflow capabilities, without
> using either.

Yeah -- it's been a careful tradeoff. I fought long and hard to keep the crawler from evolving into its own WM. As part of that, I think its simple phase model was the right tradeoff. You can do some phase-based actions, and customize behavior, but it's not full-out control or data flow, which is a win in my mind.

> 
> ps. A separate concept that we kicked around with Brian at one time was to
> have the PCS track not single files but directories (aggregations) of files
> that could be continually ingested into (along with appropriate metadata
> updates), each time another matching file arrived.  But we never fleshed out
> the details of how this would be implemented.

It's pretty much there with the Hierarchical product concept, but the only catch is that all the refs slow things down sometimes. But I have some ideas (aka lazy loading of refs on demand) that may help there.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: NCT3 Data ingest - revisit aggregation of IPs and eliminate unwanted data types?

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Albert,

Thanks. Sorry it took me so long to get back to you on this, replies inline below:

> 
> Perhaps it would be useful to summarize/rewind to the beginning.
> 
> Summary:
> 
> The underlying discussion is whether we want to aggregate incoming data
> files both in time (32 sec -> 8 min) and across data types (something like
> 10 -> 1) in order to reduce the number of mission data files we need to
> handle in our system.
> 
> Specifically the proposal is that we need to do the above 150-1 file
> aggregation WITHOUT ever ingesting the 150 input files into our file
> catalog.

To achieve this, why not maintain a separate FM instance (call it "shadow"), configured as follows:

1. NoOp FM catalog implementation - in essence, implement the o.a.oodt.cas.filemgr.catalog.Catalog interface, but make it do nothing, i.e., no storage of product references or metadata, no querying, nada.
2. Data Transfer and Versioning work as usual, and place the 150 files you want in your desired archive location.

And then have a crawler PreConditionComparator that ignores all files but the one file you care about, and an associated crawler action that separately ingests the rest into your "shadow" FM (i.e., move, but don't catalog).
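
Conceptually the NoOp catalog is nothing more than this -- a toy sketch with made-up method names, not the actual Catalog interface (which has a lot more methods, but you'd stub every one of them the same way):

import java.util.Collections;
import java.util.List;

// Toy illustration of the "NoOp" idea -- NOT the actual
// o.a.oodt.cas.filemgr.catalog.Catalog interface. Every write silently does
// nothing and every read returns nothing, so ingestion through a filemgr
// wired with this catalog moves/archives files but records no products or
// metadata.
interface MiniCatalog {
    void addProduct(Object product);
    void addMetadata(Object metadata, Object product);
    List<Object> getProducts();
}

public class NoOpCatalog implements MiniCatalog {
    public void addProduct(Object product) { /* deliberately a no-op */ }
    public void addMetadata(Object metadata, Object product) { /* deliberately a no-op */ }
    public List<Object> getProducts() { return Collections.emptyList(); }
}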

> 
> This is because there is concern about overloading the PCS.
> 
> This is your "jumping in" point, where you suggested this be done as a
> crawler action. 

Yep, sorry, I neglected to explain my original intent which was Crawler action *and* 1-2 above.

> 
> I can see how we can easily plug in a 1-1 transform action into the crawler,
> but I don't see how we can simply do the 150-1, especially if we need to
> build in all the requisite clean-up, tracking, and error and off-nominal
> condition handling to achieve a reliable system.

I think if you implement the NoOp catalog from 1 above and run a separate FM instance, then configuring the CrawlerAction that runs on postIngestSuccess to ingest the other 150 files into some other location would be as simple as giving it the "shadow" FM instance URL.
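
In rough terms (placeholder class, not the real cas-crawler action API; the shadow-FM call is stubbed), the action's job boils down to:

import java.io.File;
import java.io.FileFilter;
import java.net.URL;

// Placeholder sketch: after the one product we care about ingests
// successfully into the "real" FM, hand every other file from the same
// staging directory to the "shadow" FM, whose NoOp catalog means they get
// moved/archived but never cataloged.
public class ShadowIngestAction {

    private final URL shadowFmUrl;

    public ShadowIngestAction(URL shadowFmUrl) {
        this.shadowFmUrl = shadowFmUrl;
    }

    public void onPostIngestSuccess(final File ingestedProduct) {
        File stagingDir = ingestedProduct.getParentFile();
        File[] companions = stagingDir.listFiles(new FileFilter() {
            public boolean accept(File f) {
                return f.isFile() && !f.equals(ingestedProduct);
            }
        });
        if (companions == null) {
            return; // staging dir vanished or isn't a directory
        }
        for (File f : companions) {
            ingestToShadowFm(f);
        }
    }

    private void ingestToShadowFm(File f) {
        // Stub: a real action would call the shadow FM's ingest API here.
        System.out.println("Would ingest " + f + " via shadow FM at " + shadowFmUrl);
    }
}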

> 
> I personally believe we haven't gotten to the point that we've justified the
> need that the above file reduction (if truly needed for science) be done
> pre-ingest, rather than through introducing another PGE workflow within the
> PCS, which we understand how to do.

Yep that's another way to do it.

> 
> But again, if anyone has a good idea on this, we'd like to hear it.
> 
> 
> More detailed background:
> 
> 1) This thread started because we've recently learned our primary data
> provider is going to provide many (but not all) of our desired file types at
> 32 sec, instead of 8 minute, temporal granularity, which was our original
> design point.
> 
> 2) Based on a round of interface testing with the data provider a few months
> ago, the concern was raised that we may not be able to keep up with the
> incoming data flow.  Based on earlier testing, the main culprit seems to be
> the crawler's ability to keep up with the sheer number of files as opposed
> to the data volume.

Do you have some #s on this? They would be great to share. Can you explain what not being able to "keep up" entails?

> 
> Without going into the details, Brian did make changes to push-pull to tidy
> up the directory structure meant to address this problem.  So whether this
> will still be an issue or not is now a matter of speculation [and we as a
> team are quite good at speculation].

:-)

> 
> The obvious next step is to try to repeat the test (or simulate it as best
> we can with ourselves as the data provider). I believe there is a space of
> relatively simple optimizations here that we have yet to explore (for
> example, running multiple crawlers).

Yep that's one way to do it too, just run another crawler (or set of them) to move or clean up those files.

> 
> 3) A related concern is whether the system can handle a ~20 fold increase in
> the total number of files in the file catalog.  An underlying issue from the
> very beginning of this project is whether we should design the system based
> on the fundamental instrument data granularity (32 sec) or a larger
> "aggregation" granularity based on our choosing (8 min).  This is because
> upstream data provider has always advertised that they could provide all
> customers with aggregated products at any granularity that was desired.  But
> based our understanding of their system, we always had a lingering concern
> about their ability (and reliability) to do this per our specifications.
> Some of us argued that if the system could handle it, we should just adopt
> the naïve 32 sec granularity across the board for all the incoming data,
> because then everything would be under our control.
> 
> Nevertheless, because of earlier assurances from our data provider, and
> because we haven't ever had numbers demonstrating PCS performance for high
> file loading, and because of science team aversion to the smaller 32 sec
> file size, we designed the system to 8 min.  And now the data provider is
> partially backtracking.
> 
> Before he left, Brian worked with our DBA to redesign the file manager
> backend to store file metadata as explicit DB typed data columns (rather
> than through Strings referenced generically through name, value). So this
> should give us significant growth potential in terms of DB/filemgr
> performance.  Thus there is reason to believe the system scaling concern will no
> longer be an issue.  But we don't know how we can be sure.

Best way to find out is to test! Also, I'd love to see the ColumnBasedCatalog that Brian worked on in a branch get backported to trunk. Any takers on your end? :)

> 
> Of course, if the number of files really proves to be a problem, we
> always have the relatively straight-forward option of having the operator
> (or some process) routinely delete (or un-catalog) the 32 sec files when
> we're done aggregating them.

+1

> 
> 4) Because of above concerns, one solution put forward is that we NEVER
> ingest the 32 sec files into the file catalog, but rather do the 150-1
> aggregation before ingestion.  This is where you suggested a crawler
> action... 

No probs. See my reply above (1, 2, specifically) and let's take it from there.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: NCT3 Data ingest - revisit aggregation of IPs and eliminate unwanted data types?

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hahah, thanks BFost!

I have some comments too, which I'll reply with later after I finish some telecons today. Hope all is going well!

Cheers,
Chris


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: NCT3 Data ingest - revisit aggregation of IPs and eliminate unwanted data types?

Posted by holenoter <ho...@me.com>.
hey guys! . . . a few things:

you could probably figure out a way to get a crawler action to pull it off, however there are a few limitations you would be imposing by doing it this way:
- limiting yourself to creating one aggregate at a time per crawler instance (how much time will the aggregator take to run?)
- you will pretty much be plugging cas-pge into the crawler (i imagine you will want the config, sci_log, cas_log, etc. ingested as well as the aggregated file)
- also if the crawler can't keep up already, adding a PGE-like action to it is just gonna make it slower.

recommendation:
crawler has a skipIngest option which turns off ingest . . . so, have the pushpull download these 32 sec files to a separate staging area where you will run a crawler that skips ingest . . . write a metadata extractor for this crawler which for a given 32 sec file determines which 8 min aggregate it belongs in and moves the file to an NFS mounted directory where the files are grouped into directories by which aggregate they will end up in . . . create a workflow task which monitors these directories and triggers a workflow (which performs the aggregation) if it believes that all the 32 sec files are present to create an 8 min aggregate file.
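
fwiw, the grouping/completeness logic is tiny . . . here's a throwaway sketch (made-up names, not an oodt metadata extractor; 16 files per aggregate taken from the "16 ~30 sec files spanning 8 min" figure upthread):

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

// Throwaway sketch: drop each 32-sec file into the NFS directory for the
// 8-minute aggregate it belongs to, and report whether that aggregate looks
// complete enough to trigger the aggregation workflow.
public class AggregateGrouper {

    private static final long WINDOW_SECONDS = 8 * 60L;
    private static final int FILES_PER_AGGREGATE = 16; // adjust to the real granule count

    // Group a file by its granule start time (epoch seconds), returning the
    // aggregate directory it was moved into.
    public static File moveIntoAggregate(File granule, long startEpochSec, File nfsRoot)
            throws IOException {
        long windowStart = startEpochSec - (startEpochSec % WINDOW_SECONDS);
        File aggregateDir = new File(nfsRoot, "agg-" + windowStart);
        if (!aggregateDir.isDirectory() && !aggregateDir.mkdirs()) {
            throw new IOException("could not create " + aggregateDir);
        }
        Files.move(granule.toPath(), new File(aggregateDir, granule.getName()).toPath(),
                StandardCopyOption.REPLACE_EXISTING);
        return aggregateDir;
    }

    // Crude completeness check a monitoring task could run before kicking off
    // the aggregation workflow.
    public static boolean looksComplete(File aggregateDir) {
        String[] names = aggregateDir.list();
        return names != null && names.length >= FILES_PER_AGGREGATE;
    }
}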

however, with ColumnBasedDataSourceCatalog i don't think you will have db overload problems . . . john can partition the metadata table for these products and index for the typical queries you will be doing . . . i would work on getting that ColumnBasedDataSourceCatalog up and running somewhere on peate or acos so you can test against the volume you expect . . . john might even be able to create a mock metadata table in the database with dummy metadata for you, so a filemgr would just have to be pointed at it and queried against -- will save you having to ingest a billion files just to test it.

hope all is going well!
-brian


Re: NCT3 Data ingest - revisit aggregation of IPs and eliminate unwanted data types?

Posted by "Chang, Albert Y (388D)" <al...@jpl.nasa.gov>.
Hi Chris:

Thanks for the reply.

Perhaps it would be useful to summarize/rewind to the beginning.

Summary:

The underlying discussion is whether we want to aggregate incoming data
files both in time (32 sec -> 8 min) and across data types (something like
10 -> 1) in order to reduce the number of mission data files we need to
handle in our system.

Specifically the proposal is that we need to do the above 150-1 file
aggregation WITHOUT ever ingesting the 150 input files into our file
catalog.

This is because there is concern about overloading the PCS.

This is your "jumping in" point, where you suggested this be done as a
crawler action. 

I can see how we can easily plug in a 1-1 transform action into the crawler,
but I don't see how we can simply do the 150-1, especially if we need to
build in all the requisite clean-up, tracking, and error and off-nominal
condition handling to achieve a reliable system.

I personally believe we haven't gotten to the point that we've justified the
need that the above file reduction (if truly needed for science) be done
pre-ingest, rather than through introducing another PGE workflow within the
PCS, which we understand how to do.

But again, if anyone has a good idea on this, we'd like to hear it.


More detailed background:

1) This thread started because we've recently learned our primary data
provider is going to provide many (but not all) of our desired file types at
32 sec, instead of 8 minute, temporal granularity, which was our original
design point.

2) Based on a round of interface testing with the data provider a few months
ago, the concern was raised that we may not be able to keep up with the
incoming data flow.  Based on earlier testing, the main culprit seems to be
the crawler's ability to keep up with the sheer number of files as opposed
to the data volume.

Without going into the details, Brian did make changes to push-pull to tidy
up the directory structure meant to address this problem.  So whether this
will still be an issue or not is now a matter of speculation [and we as a
team are quite good at speculation].

The obvious next step is to try to repeat the test (or simulate it as best
we can with ourselves as the data provider). I believe there is a space of
relatively simple optimizations here that we have yet to explore (for
example, running multiple crawlers).

3) A related concern is whether the system can handle a ~20 fold increase in
the total number of files in the file catalog.  An underlying issue from the
very beginning of this project is whether we should design the system based
on the fundamental instrument data granularity (32 sec) or a larger
"aggregation" granularity of our choosing (8 min).  This is because our
upstream data provider has always advertised that they could provide all
customers with aggregated products at any granularity that was desired.  But
based on our understanding of their system, we always had a lingering concern
about their ability (and reliability) to do this per our specifications.
Some of us argued that if the system could handle it, we should just adopt
the naïve 32 sec granularity across the board for all the incoming data,
because then everything would be under our control.

Nevertheless, because of earlier assurances from our data provider, and
because we haven't ever had numbers demonstrating PCS performance for high
file loading, and because of science team aversion to the smaller 32 sec
file size, we designed the system to 8 min.  And now the data provider is
partially backtracking.

Before he left, Brian worked with our DBA to redesign the file manager
backend to store file metadata as explicit DB typed data columns (rather
than through Strings referenced generically through name, value). So this
should give us significant growth potential in terms of DB/filemgr
performance.  Thus there is reason to believe the system scaling concern will no
longer be an issue.  But we don't know how we can be sure.

Of course, if the number of files really proves to be a problem, we
always have the relatively straight-forward option of having the operator
(or some process) routinely delete (or un-catalog) the 32 sec files when
we're done aggregating them.

4) Because of above concerns, one solution put forward is that we NEVER
ingest the 32 sec files into the file catalog, but rather do the 150-1
aggregation before ingestion.  This is where you suggested a crawler
action... 

Thanks,

-Albert

