Posted to user@oodt.apache.org by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov> on 2013/03/20 00:51:33 UTC

My Hadoop Summit Talk: NASA+BigData

Hey Guys,

I proposed a talk for NASA and Big Data at the Hadoop Summit:

http://hadoopsummit2013.uservoice.com/forums/196822-future-of-apache-hadoop/suggestions/3733470-nasa-science-and-technology-for-big-data-junkies-


If you still have votes, and would like to support my talk, I'd certainly
appreciate it!

Thank you for considering.

Cheers,
Chris Mattmann 
Vote Herder

Re: My Hadoop Summit Talk: NASA+BigData

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.

Hey Bruce,

On 3/20/13 7:56 AM, "Bruce Barkstrom" <br...@gmail.com> wrote:

>I'll subside after one minor note on the "sky is the archive."

Don't ever subside! I appreciate your feedback and commentary and
wholly look up to you for advice and help.

Your cynicism about the conference is totally understood given, as you
mention, your ability to download the conference (or something similar ^_^)
off of your Gmail web page :)

>
>I once had a course from W. W. Morgan, the U. Chicago prof who
>developed the atlas of stellar types (A, O, B, etc.).  He had
>the spectrum of a "standard type R".  As I recall, two weeks
>after he published his atlas with the spectra, the star defining
>the type became a variable.

Precisely.

>
>Also, I note that on this very Google Mail page, I can get
>a "Free Guide to Big Data", as well as the "IBM Big Data
>Free eBook".  I suppose I don't need to go to a conference
>to become informed.

Nah, but it would be less fun without you there! Who else will represent
the society of troublemakers, and scientific reality, that is,
the people actually doing the work?!!

Take care my friend.

Cheers,
Chris



Re: My Hadoop Summit Talk: NASA+BigData

Posted by Bruce Barkstrom <br...@gmail.com>.

I'll subside after one minor note on the "sky is the archive."

I once had a course from W. W. Morgan, the U. Chicago prof who
developed the atlas of stellar types (A, O, B, etc.).  He had
the spectrum of a "standard type R".  As I recall, two weeks
after he published his atlas with the spectra, the star defining
the type became a variable.

Also, I note that on this very Google Mail page, I can get
a "Free Guide to Big Data", as well as the "IBM Big Data
Free eBook".  I suppose I don't need to go to a conference
to become informed.

Bruce B.

On Wed, Mar 20, 2013 at 10:21 AM, Mattmann, Chris A (388J) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> Hey Bruce,
>
> A couple points:
>
> On 3/20/13 5:46 AM, "Bruce Barkstrom" <br...@gmail.com> wrote:
>
> >That may be a bit better.
> >
> >However, it still isn't clear to me how the physics of the instruments
> >and of the data processing gets into what users understand they
> >can do with the data.
>
> Yeah agreed. At the same time, this is kind of difficult to throw into
> a 45 min with 15 mins "techie talk" that I haven't even prepared yet,
> and even harder to throw in to a 100 word (what you see on the website)
> and 200 word (longer, what I sent you) abstract that they requested.
>
> >
> >As I understand Big Data and analytics, it usually appears to be using
> >a lot of statistics to find unexpected correlations in the data, but
> >the techniques aren't looking for causation.  If you're dealing with
> >scientific data, you're usually trying to get to physical causation.
> >That means, I think, that users need to understand how the
> >physics and math constrain what they can do.
>
> ++50 agreed.
>
> >
> >Let me see if I can identify a more concrete example of a
> >concern.  Usually, when we want to deal with physically
> >connected phenomena, we want disparate data to be
> >observing the same chunk of space at the same time.
> >If the Big Data user picks up one piece of data from region
> >X_1 and t_1 and then develops a correlation with observations
> >with data from X_2 and t_2, where X_1 /= X_2 and t_1 /= t_2,
> >it isn't clear why that correlation has anything to do with
> >physical causation.  Or, to put it another way, Big Data
> >may just give more examples of the "cherry picking"
> >climate deniers do when they select data without
> >paying attention to the statistical and physical significance
> >of their "results".
>
> Totally agree. A lot of the time this is the big difference between
> card-carrying statisticians and *computer science*
> oriented *machine learning* people.
>
> >
> >So, even though the data rates are large by today's
> >standards, I'm not sure that, by itself, is impressive.
>
> Well I have to say it is impressive. Can you show me a disk
> that can today write 700 TB of data per second? Or the filesystem
> drivers and parallel I/O software necessary to support them? Imagine
> astronomy, where they are moving into the time domain, and
> away from the "sky is the archive", "so just reobserve next
> time" mentality. Triage, which is super important, is thus
> no longer the main driver; archival is now becoming important,
> and necessary, in these eventually 700TB/sec producing systems.
>
> There are all sorts of IO, hardware, computer science, and
> other advances that we don't have that are needed, and that
> these types of examples like the SKA will drive.
>
> OTOH, the sheer infrastructure, domestic and international policy,
> investment, and excitement and sense of nationality that many of
> these new Big Data systems (especially the SKA) are creating in
> their respective countries (e.g., in South Africa), is enough
> to at least suggest to my evidence based mind that there is
> something impressive here.
>
> >Maybe the relevant example would be all those statistics
> >on dams built or tons of steel produced by the Soviet
> >Union.  The hype would be more interesting if it could
> >talk about what new phenomena or understanding
> >these techniques will produce - not just the data rate
> >or the total amount of data being produced.
>
> Agreed, lots of data has been generated for a while. However,
> the volume (total and discrete), velocity, and variety (in
> data types, metadata, etc.) are certainly such that they are
> worthy of current study, at least in the area of data management.
>
> >
> >Maybe it's just a glorified popularity contest; if so,
> >it would seem to be at about the level of interest
> >of the new season of "Dancing with the Stars".
>
> Perhaps, but I know you guys are interested in that show :)
> Who's not?
>
> >I suppose the hype is necessary to generate the
> >funding (which has its uses), but I'm not sure it
> >will do as much as a few million sent to appropriate
> >super PACs to move the politics of climate change
> >along.
>
> Think of this as an IT super PAC for next generation data management
> techniques and systems to deal with data volumes and varieties that
> we don't have hardware or CS tools to manage yet. I'm not talking
> about writing to tape and letting it die in the morgue. I'm talking about
> even simple things like making it available after you write it to spinning
> disk.
>
> Cheers,
> Chris

Re: My Hadoop Summit Talk: NASA+BigData

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.

Hey Bruce,

A couple points:

On 3/20/13 5:46 AM, "Bruce Barkstrom" <br...@gmail.com> wrote:

>That may be a bit better.
>
>However, it still isn't clear to me how the physics of the instruments
>and of the data processing gets into what users understand they
>can do with the data.

Yeah agreed. At the same time, this is kind of difficult to throw into
a 45 min with 15 mins "techie talk" that I haven't even prepared yet,
and even harder to throw in to a 100 word (what you see on the website)
and 200 word (longer, what I sent you) abstract that they requested.

>
>As I understand Big Data and analytics, it usually appears to be using
>a lot of statistics to find unexpected correlations in the data, but
>the techniques aren't looking for causation.  If you're dealing with
>scientific data, you're usually trying to get to physical causation.
>That means, I think, that users need to understand how the
>physics and math constrain what they can do.

++50 agreed.

>
>Let me see if I can identify a more concrete example of a
>concern.  Usually, when we want to deal with physically
>connected phenomena, we want disparate data to be
>observing the same chunk of space at the same time.
>If the Big Data user picks up one piece of data from region
>X_1 and t_1 and then develops a correlation with observations
>with data from X_2 and t_2, where X_1 /= X_2 and t_1 /= t_2,
>it isn't clear why that correlation has anything to do with
>physical causation.  Or, to put it another way, Big Data
>may just give more examples of the "cherry picking"
>climate deniers do when they select data without
>paying attention to the statistical and physical significance
>of their "results".

Totally agree. A lot of the time this is the big difference between
card-carrying statisticians and *computer science*
oriented *machine learning* people.

>
>So, even though the data rates are large by today's
>standards, I'm not sure that, by itself, is impressive.

Well I have to say it is impressive. Can you show me a disk
that can today write 700 TB of data per second? Or the filesystem
drivers and parallel I/O software necessary to support them? Imagine
astronomy, where they are moving into the time domain, and
away from the "sky is the archive", "so just reobserve next
time" mentality. Triage, which is super important, is thus
no longer the main driver; archival is now becoming important,
and necessary, in these eventually 700TB/sec producing systems.

There are all sorts of IO, hardware, computer science, and
other advances that we don't have that are needed, and that
these types of examples like the SKA will drive.

OTOH, the sheer infrastructure, domestic and international policy,
investment, and excitement and sense of nationality that many of
these new Big Data systems (especially the SKA) are creating in
their respective countries (e.g., in South Africa), is enough
to at least suggest to my evidence based mind that there is
something impressive here.

>Maybe the relevant example would be all those statistics
>on dams built or tons of steel produced by the Soviet
>Union.  The hype would be more interesting if it could
>talk about what new phenomena or understanding
>these techniques will produce - not just the data rate
>or the total amount of data being produced.

Agreed, lots of data has been generated for a while. However,
the volume (total and discrete), velocity, and variety (in
data types, metadata, etc.) are certainly such that they are
worthy of current study, at least in the area of data management.

>
>Maybe it's just a glorified popularity contest; if so,
>it would seem to be at about the level of interest
>of the new season of "Dancing with the Stars".

Perhaps, but I know you guys are interested in that show :)
Who's not?

>I suppose the hype is necessary to generate the
>funding (which has its uses), but I'm not sure it
>will do as much as a few million sent to appropriate
>super PACs to move the politics of climate change
>along.

Think of this as an IT super PAC for next generation data management
techniques and systems to deal with data volumes and varieties that
we don't have hardware or CS tools to manage yet. I'm not talking
about writing to tape and letting it die in the morgue. I'm talking about
even simple things like making it available after you write it to spinning
disk.

Cheers,
Chris

>
>Bruce B.
>
>On Wed, Mar 20, 2013 at 1:16 AM, Mattmann, Chris A (388J) <
>chris.a.mattmann@jpl.nasa.gov> wrote:
>
>> Hey Bruce,
>>
>> Hah!
>>
>> Unfortunately all you get is the short summary through
>> the website, which does make it scientifically hard to
>> judge. Then again, this isn't science; it's a
>> glorified popularity contest.
>>
>> I have a little bit more detailed abstract that I wrote up,
>> pasted below (of course the part that they don't use to solicit votes):
>>
>> ---longer abstract
>> The NASA Jet Propulsion Laboratory, California Institute of
>> Technology contributes to many Big Data projects for Earth science such
>> as the U.S. National Climate Assessment (NCA) and for astronomy such as
>> next generation astronomical instruments like the Square Kilometre Array
>> (SKA) that will generate unprecedented volumes of data (700TB/sec!).
>>
>> Through these projects, we are addressing four key challenges critical
>> for the Hadoop community and broader open source Big Data community to
>> consider: (1) unobtrusively integrating science algorithms into large
>> scale processing systems; (2) selecting and deploying high powered data
>> movement technologies for data staging and remote data acquisition,
>> processing, and delivery to our customers and users; (3) better
>> leveraging of cloud computing (storage and processing) technologies in
>> NASA missions; and (4) technologies for automatically and rapidly
>> extracting text and metadata from the file formats, by some estimates
>> ranging from a few thousand to over fifty thousand in total.
>>
>> This talk will focus on those Big Data challenges, how NASA JPL is
>> addressing them both technologically (Hadoop, OODT, Tika, Nutch, Solr)
>> and from a community standpoint (Apache, interacting with open source,
>> etc.). I'll also discuss the future of Big Data at JPL and NASA and how
>> others can get involved.
>> -----
>>
>> You can think of that as the longer version of what I submitted. *grin*
>>
>> Cheers,
>> Chris
>>
>>
>>
>> On 3/19/13 7:20 PM, "Bruce Barkstrom" <br...@gmail.com> wrote:
>>
>> >OK, so you've got a three-word summary of some
>> >hyperbole with Dumbo, the Flying Elephant.
>> >How are you going to deal with the real
>> >scientific constraints on the physics of combining real
>> >measurement technologies and "mashing stuff together"?
>> >
>> >You need to remember that imaging instruments integrate
>> >radiances with spectral responses and Point Spread Function
>> >weighted averages over the FOV of whatever the instrument
>> >was looking at - and that's just the instantaneous (L1 measurement).
>> >If you do orthorectification, you've got variations in the
>> >uncertainties across the image: in the parts of the image where
>> >you've increased the resolving power (by putting interpolated points
>> >closer together), you have also increased the noise, because the
>> >orthorectification process acts as a noise multiplier.
>> >
>> >Next, you've got stuff like cloud identification (and rejection or
>> >acceptance) - which depends on spectral response, solar illumination
>> >(during the day) and temperature and cloud property stuff during
>> >the night - and finally, you've got temporal interpolation (not just
>> >creating an average through emission driven by solar illumination
>> >during the day and IR cooling at night.  Where (the hell) is
>> >the physics that deals with this stuff?  If you do get some
>> >statistical stuff, why should anyone believe it contributes to
>> >our understanding of climate change?
>> >
>> >I won't vote, but you can think of this as my input to your
>> >scientific conscience.
>> >
>> >Bruce B.
>> >
>> >On Tue, Mar 19, 2013 at 7:51 PM, Mattmann, Chris A (388J) <
>> >chris.a.mattmann@jpl.nasa.gov> wrote:
>> >
>> >> Hey Guys,
>> >>
>> >> I proposed a talk for NASA and Big Data at the Hadoop Summit:
>> >>
>> >> http://hadoopsummit2013.uservoice.com/forums/196822-future-of-apache-hadoop/suggestions/3733470-nasa-science-and-technology-for-big-data-junkies-
>> >>
>> >> If you still have votes, and would like to support my talk, I'd
>> >> certainly appreciate it!
>> >>
>> >> Thank you for considering.
>> >>
>> >> Cheers,
>> >> Chris Mattmann
>> >> Vote Herder
>> >>
>> >>
>>
>>

Re: My Hadoop Summit Talk: NASA+BigData

Posted by Bruce Barkstrom <br...@gmail.com>.

That may be a bit better.

However, it still isn't clear to me how the physics of the instruments
and of the data processing gets into what users understand they
can do with the data.

As I understand Big Data and analytics, it usually appears to be using
a lot of statistics to find unexpected correlations in the data, but
the techniques aren't looking for causation.  If you're dealing with
scientific data, you're usually trying to get to physical causation.
That means, I think, that users need to understand how the
physics and math constrain what they can do.

Let me see if I can identify a more concrete example of a
concern.  Usually, when we want to deal with physically
connected phenomena, we want disparate data to be
observing the same chunk of space at the same time.
If the Big Data user picks up one piece of data from region
X_1 and t_1 and then develops a correlation with observations
with data from X_2 and t_2, where X_1 /= X_2 and t_1 /= t_2,
it isn't clear why that correlation has anything to do with
physical causation.  Or, to put it another way, Big Data
may just give more examples of the "cherry picking"
climate deniers do when they select data without
paying attention to the statistical and physical significance
of their "results".

So, even though the data rates are large by today's
standards, I'm not sure that, by itself, is impressive.
Maybe the relevant example would be all those statistics
on dams built or tons of steel produced by the Soviet
Union.  The hype would be more interesting if it could
talk about what new phenomena or understanding
these techniques will produce - not just the data rate
or the total amount of data being produced.

Maybe it's just a glorified popularity contest; if so,
it would seem to be at about the level of interest
of the new season of "Dancing with the Stars".
I suppose the hype is necessary to generate the
funding (which has its uses), but I'm not sure it
will do as much as a few million sent to appropriate
super PACs to move the politics of climate change
along.

Bruce B.

On Wed, Mar 20, 2013 at 1:16 AM, Mattmann, Chris A (388J) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> Hey Bruce,
>
> Hah!
>
> Unfortunately all you get is the short summary through
> the website which does make it scientifically hard to
> judge, however, then again this isn't science, it's a
> glorified popularity contest.
>
> I have a little bit more detailed abstract that I wrote up,
> pasted below (of course the part that they don't use to solicit votes):
>
> ---longer abstract
> The NASA Jet Propulsion Laboratory, California Institute of
> Technology contributes to many Big Data projects for Earth science such as
> the
> U.S. National Climate Assessment (NCA) and for astronomy such as next
> generation astronomical instruments like the Square Kilometre Array (SKA)
> that
> will generate unprecedented volumes of data (700TB/sec!).
>
> Through these projects, we are addressing four key
> challenges critical for the Hadoop community and broader open source Big
> Data
> community to consider: (1) unobtrusively integrating science algorithms
> into
> large scale processing systems; (2) selecting and deploying high powered
> data
> movement technologies for data staging and remote data acquisition;
> processing,
> and delivery to our customers and users; (3) better leveraging of cloud
> computing (storage and processing) technologies in NASA missions; and (4)
> technologies for automatically and rapidly extracting text and metadata
> from
> the file formats, by some estimates ranging from a few thousand to over
> fifty
> thousand in total.
>
> This talk will focus on those Big Data challenges, how NASA
> JPL is addressing them both technologically (Hadoop, OODT, Tika, Nutch,
> Solr)
> and from a community standpoint (Apache, interacting with open source,
> etc.).
> I'll also discuss the future of Big Data at JPL and NASA and how others
> can get involved.
> -----
>
> You can think of that as the longer version of what I submitted. *grin*
>
> Cheers,
> Chris
>
>
>
> On 3/19/13 7:20 PM, "Bruce Barkstrom" <br...@gmail.com> wrote:
>
> >OK, so you've got a three-word summary of some
> >hyperbole with Dumbo, the Flying Elephant.
> >How are you going to deal with the real
> >scientific constraints on the physics of combining real
> >measurement technologies and "mashing stuff together"?
> >
> >You need to remember that imaging instruments integrate
> >radiances with spectral responses and Point Spread Function
> >weighted averages over the FOV of whatever the instrument
> >was looking at - and that's just the instantaneous (L1 measurement).
> >If you do orthorectification, you've got variations in the uncertainties
> >across the image where the parts of the image where you've
> >increased the resolving power (by putting interpolated points
> >closer together) and have also increased the noise from the
> >orthorectification process that acts as a noise multiplier.
> >
> >Next, you've got stuff like cloud identification (and rejection or
> >acceptance) - which depends on spectral response, solar illumination
> >(during the day) and temperature and cloud property stuff during
> >the night - and finally, you've got temporal interpolation (not just
> >creating an average through emission driven by solar illumination
> >during the day and IR cooling at night.  Where (the hel)l is
> >the physics that deals with this stuff?  If you do get some
> >statistical stuff, why should anyone believe it contributes to
> >our understanding of climate change?
> >
> >I won't vote, but you can think of this as my input to your
> >scientific conscience.
> >
> >Bruce B.
> >
> >On Tue, Mar 19, 2013 at 7:51 PM, Mattmann, Chris A (388J) <
> >chris.a.mattmann@jpl.nasa.gov> wrote:
> >
> >> Hey Guys,
> >>
> >> I proposed a talk for NASA and Big Data at the Hadoop Summit:
> >>
> >>
> >>
> >> http://hadoopsummit2013.uservoice.com/forums/196822-future-of-apache-hadoop
> >> /suggestions/3733470-nasa-science-and-technology-for-big-data-junkies-
> >>
> >>
> >> If you still have votes, and would like to support my talk, I'd certainly
> >> appreciate it!
> >>
> >> Thank you for considering.
> >>
> >> Cheers,
> >> Chris Mattmann
> >> Vote Herder
> >>
> >>
>
>

Re: My Hadoop Summit Talk: NASA+BigData

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.

Hey Bruce,

Hah!

Unfortunately all you get is the short summary through
the website, which does make it scientifically hard to
judge. Then again, this isn't science; it's a
glorified popularity contest.

I have a little bit more detailed abstract that I wrote up,
pasted below (of course the part that they don't use to solicit votes):

---longer abstract
The NASA Jet Propulsion Laboratory, California Institute of Technology
contributes to many Big Data projects for Earth science such as the
U.S. National Climate Assessment (NCA) and for astronomy such as next
generation astronomical instruments like the Square Kilometre Array (SKA)
that will generate unprecedented volumes of data (700TB/sec!).

Through these projects, we are addressing four key challenges critical
for the Hadoop community and broader open source Big Data community to
consider: (1) unobtrusively integrating science algorithms into large
scale processing systems; (2) selecting and deploying high powered data
movement technologies for data staging and remote data acquisition,
processing, and delivery to our customers and users; (3) better
leveraging of cloud computing (storage and processing) technologies in
NASA missions; and (4) technologies for automatically and rapidly
extracting text and metadata from the file formats, by some estimates
ranging from a few thousand to over fifty thousand in total.

This talk will focus on those Big Data challenges, how NASA JPL is
addressing them both technologically (Hadoop, OODT, Tika, Nutch, Solr)
and from a community standpoint (Apache, interacting with open source,
etc.). I'll also discuss the future of Big Data at JPL and NASA and how
others can get involved.
-----

You can think of that as the longer version of what I submitted. *grin*

Cheers,
Chris



On 3/19/13 7:20 PM, "Bruce Barkstrom" <br...@gmail.com> wrote:

>OK, so you've got a three-word summary of some
>hyperbole with Dumbo, the Flying Elephant.
>How are you going to deal with the real
>scientific constraints on the physics of combining real
>measurement technologies and "mashing stuff together"?
>
>You need to remember that imaging instruments integrate
>radiances with spectral responses and Point Spread Function
>weighted averages over the FOV of whatever the instrument
>was looking at - and that's just the instantaneous (L1 measurement).
>If you do orthorectification, you've got variations in the uncertainties
>across the image where the parts of the image where you've
>increased the resolving power (by putting interpolated points
>closer together) and have also increased the noise from the
>orthorectification process that acts as a noise multiplier.
>
>Next, you've got stuff like cloud identification (and rejection or
>acceptance) - which depends on spectral response, solar illumination
>(during the day) and temperature and cloud property stuff during
>the night - and finally, you've got temporal interpolation (not just
>creating an average through emission driven by solar illumination
>during the day and IR cooling at night.  Where (the hel)l is
>the physics that deals with this stuff?  If you do get some
>statistical stuff, why should anyone believe it contributes to
>our understanding of climate change?
>
>I won't vote, but you can think of this as my input to your
>scientific conscience.
>
>Bruce B.
>
>On Tue, Mar 19, 2013 at 7:51 PM, Mattmann, Chris A (388J) <
>chris.a.mattmann@jpl.nasa.gov> wrote:
>
>> Hey Guys,
>>
>> I proposed a talk for NASA and Big Data at the Hadoop Summit:
>>
>> 
>> http://hadoopsummit2013.uservoice.com/forums/196822-future-of-apache-hadoop
>> /suggestions/3733470-nasa-science-and-technology-for-big-data-junkies-
>>
>>
>> If you still have votes, and would like to support my talk, I'd certainly
>> appreciate it!
>>
>> Thank you for considering.
>>
>> Cheers,
>> Chris Mattmann
>> Vote Herder
>>
>>

Re: My Hadoop Summit Talk: NASA+BigData

Posted by Bruce Barkstrom <br...@gmail.com>.

OK, so you've got a three-word summary of some
hyperbole with Dumbo, the Flying Elephant.
How are you going to deal with the real
scientific constraints on the physics of combining real
measurement technologies and "mashing stuff together"?

You need to remember that imaging instruments integrate
radiances with spectral responses and Point Spread Function
weighted averages over the FOV of whatever the instrument
was looking at - and that's just the instantaneous (L1) measurement.
If you do orthorectification, you've got variations in the uncertainties
across the image: in the parts of the image where you've
increased the resolving power (by putting interpolated points
closer together), you have also increased the noise, because the
orthorectification process acts as a noise multiplier.

Next, you've got stuff like cloud identification (and rejection or
acceptance) - which depends on spectral response, solar illumination
(during the day) and temperature and cloud property stuff during
the night - and finally, you've got temporal interpolation (not just
creating an average through emission driven by solar illumination
during the day and IR cooling at night).  Where (the hel)l is
the physics that deals with this stuff?  If you do get some
statistical stuff, why should anyone believe it contributes to
our understanding of climate change?

I won't vote, but you can think of this as my input to your
scientific conscience.

Bruce B.

On Tue, Mar 19, 2013 at 7:51 PM, Mattmann, Chris A (388J) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> Hey Guys,
>
> I proposed a talk for NASA and Big Data at the Hadoop Summit:
>
> http://hadoopsummit2013.uservoice.com/forums/196822-future-of-apache-hadoop
> /suggestions/3733470-nasa-science-and-technology-for-big-data-junkies-
>
>
> If you still have votes, and would like to support my talk, I'd certainly
> appreciate it!
>
> Thank you for considering.
>
> Cheers,
> Chris Mattmann
> Vote Herder
>
>