Posted to dev@oodt.apache.org by Bruce Barkstrom <br...@gmail.com> on 2013/03/20 17:39:33 UTC

Re: My Hadoop Summit Talk: NASA+BigData - A Faustian Bargain

As I was mulling over our conversation, I was coming off some
writing on the history of cataloging - which includes the Fundamental
Requirements for Bibliographic Records.  Back before 1900, libraries
began to do their own cataloging, but the catalogs were often
so different that there was no easy way to exchange catalogs.
The Library of Congress began to publish cards (subject, author,
title, and a couple of other categories).  This cut the labor
required of smaller libraries and improved the standardization
of user book searches.  At the same time, cataloging became
a highly skilled activity and one that was fairly expensive
(see Adam Smith's notions of division of labor in The Wealth
of Nations).

In practical terms, if it takes a cataloger half an hour
to create a bibliographical record for a book, then that
cataloger can create about 3,000 bibliographical records
per year (allowing 1,500 hours of work over a year).
The US produces about 300,000 new titles every year,
so the LoC would have to have a workforce of about
100 professional catalogers to cover that classification
work.  The number of new titles produced globally is
apparently about 12,000,000.  To do that cataloging
would require forty times as many people - about
4,000 full time people (hmm - at $100k per person,
that would be about $400M per year, although maybe one
might expect a certain amount of off-shoring to occur).
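
A minimal Python sketch of that back-of-the-envelope
arithmetic (nothing here beyond the rough figures already
assumed above):

    hours_per_record = 0.5       # half an hour per bibliographic record
    hours_per_year = 1500.0      # working hours per cataloger per year
    records_per_cataloger = hours_per_year / hours_per_record  # ~3,000
    us_titles = 300000
    world_titles = 12000000
    us_catalogers = us_titles / records_per_cataloger           # ~100
    world_catalogers = world_titles / records_per_cataloger     # ~4,000
    annual_cost = world_catalogers * 100000                      # ~$400M/yr
    print(us_catalogers, world_catalogers, annual_cost)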

Where this gets interesting is in the "Big Data" realm.
If I assume that a file of data is 100 MB (meaning there
might be a use for Google's FiberNet), then 1 TB of
data is going to create 10,000 files.  OK - then it looks
like we'd need about three cataloger-years
to create the "bibliographic records" for that TB.  I think
that creates a rather compelling case for a very different
and very highly automated approach to developing
a completely different kind of catalog.  Note also that
we probably can't afford to do anything with XML because
of the overhead of parsing XML records (see my article
in ESI on an experiment I did about the overhead).
Also, I doubt anyone in his or her right mind would
expect the data from the kind of missions you mentioned
to be in anything but binary.
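
A similarly rough sketch of the per-terabyte burden, under the
same assumptions (100 MB per file, about 3,000 records per
cataloger-year):

    file_size_bytes = 100 * 10**6       # assumed 100 MB per file
    tb_bytes = 10**12                   # one (decimal) terabyte
    files_per_tb = tb_bytes / file_size_bytes         # 10,000 files
    cataloger_years_per_tb = files_per_tb / 3000.0    # ~3.3 years
    print(files_per_tb, cataloger_years_per_tb)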

There's a second issue - equivalence testing.  Some of
the folks with a library science background expect to use
cryptographic digests and standardization of file formats
to allow data files to be uniquely identified and stable.
I've got some fairly important data sets that are updated
daily to monthly (NOAA's GHCN and HCN, with the basic
temperature and humidity from the network of ground
stations, as well as the IGRA data collection with radiosonde
data), where the freely available data are in ASCII characters.
If I want to do statistical testing for climate change, I convert
the ASCII data to double precision - and then sort the
data to create a cumulative probability distribution for,
say, precipitation.  Under the "librarian" proposal, my sorted
data isn't "authentic" because the data has been changed
from its original format - and permuted.  Seems strange that
just by rearranging the numbers I've made the data "inauthentic".
But, then, how could I test whether my rearranged data were
still "authentic".  At least from my perspective, I think what
I'd do is to develop a translator that could covert my transformed
data back into the original form - or could rearrange the original
into my form - and basically test each number for equality.
Again, on the scale of Big Data, that's a fair amount of
computation.
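
A rough sketch of that equality test (the function name and the
whitespace splitting are only illustrative - the real GHCN/IGRA
records use fixed-width fields, so an actual translator would
parse those instead):

    def values_match(original_ascii_lines, transformed_values):
        # Parse the original ASCII numbers into doubles and sort them,
        # then compare against the sorted, transformed copy as multisets.
        original = sorted(float(token)
                          for line in original_ascii_lines
                          for token in line.split())
        transformed = sorted(transformed_values)
        # Exact equality is safe here only because both sides come from
        # parsing the same ASCII text; independently processed data
        # would need a tolerance instead.
        return len(original) == len(transformed) and all(
            a == b for a, b in zip(original, transformed))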

In other words, these two kinds of issues might actually
make for a Big Data project in their own right - and have
probably not been mentioned in the Big Data hype.

Now - why the Faustian Bargain?  Basically, I've given you
a couple of ideas I think are novel - and maybe would be
interesting for the conference.  What I'd like to do is to put
together an article for ESI - but find funding to spring the
copyright from Springer so that I could do a "derivative
work" that I could put into the books I'm writing.  The bill
for this is $3,000 which is a bit steep for our family budget.
Can you help?

I also don't know if this could lead to a Sloan Fellowship.
Again, because I'm unaffiliated and unfunded, I don't think
I'd have much success at applying.  If interested, JPL
might be a suitable location - assuming I could be taken
on as a consultant.

As a minor caveat, while I'm aware of the usual cautions
about e-mail, I'll formally include the note
Copyright 2013, B. R. Barkstrom - all rights reserved
except that the addressee has the right to read (but not
to copy or distribute) this note.

Bruce B.

On Wed, Mar 20, 2013 at 10:56 AM, Bruce Barkstrom <br...@gmail.com> wrote:

> I'll subside after one minor note on the "sky is the archive."
>
> I once had a course from W. W. Morgan, the U. Chicago prof who
> developed the atlas of stellar types (O, B, A, etc.).  He had
> the spectrum of a "standard type R".  As I recall, two weeks
> after he published his atlas with the spectra, the star defining
> the type became a variable.
>
> Also, I note that on this very Google Mail page, I can get
> a "Free Guide to Big Data", as well as the "IBM Big Data
> Free eBook".  I suppose I don't need to go to a conference
> to become informed.
>
> Bruce B.
>
>
> On Wed, Mar 20, 2013 at 10:21 AM, Mattmann, Chris A (388J) <
> chris.a.mattmann@jpl.nasa.gov> wrote:
>
>> Hey Bruce,
>>
>> A couple points:
>>
>> On 3/20/13 5:46 AM, "Bruce Barkstrom" <br...@gmail.com> wrote:
>>
>> >That may be a bit better.
>> >
>> >However, it still isn't clear to me how the physics of the instruments
>> >and of the data processing gets into what users understand they
>> >can do with the data.
>>
>> Yeah agreed. At the same time, this is kind of difficult to throw into
>> a 45 min with 15 mins "techie talk" that I haven't even prepared yet,
>> and even harder to throw in to a 100 word (what you see on the website)
>> and 200 word (longer, what I sent you) abstract that they requested.
>>
>> >
>> >As I understand Big Data and analytics, it usually appears to involve using
>> >a lot of statistics to find unexpected correlations in the data, but
>> >the techniques aren't looking for causation.  If you're dealing with
>> >scientific data, you're usually trying to get to physical causation.
>> >That means, I think, that users need to understand how the
>> >physics and math constrain what they can do.
>>
>> ++50 agreed.
>>
>> >
>> >Let me see if I can identify a more concrete example of a
>> >concern.  Usually, when we want to deal with physically
>> >connected phenomena, we want disparate data to be
>> >observing the same chunk of space at the same time.
>> >If the Big Data user picks up one piece of data from region
>> >X_1 at time t_1 and then develops a correlation with
>> >observations from X_2 at t_2, where X_1 /= X_2 and t_1 /= t_2,
>> >it isn't clear why that correlation has anything to do with
>> >physical causation.  Or, to put it another way, Big Data
>> >may just give more examples of the "cherry picking"
>> >climate deniers do when they select data without
>> >paying attention to the statistical and physical significance
>> >of their "results".
>>
>> Totally agree. This is, a lot of the time, the big difference
>> between card-carrying statisticians and *computer science*
>> oriented *machine learning* people.
>>
>> >
>> >So, even though the data rates are large by today's
>> >standards, I'm not sure that, by itself, is impressive.
>>
>> Well I have to say it is impressive. Can you show me a disk
>> that can write 700 TB of data per second today? Or the filesystem
>> drivers and parallel I/O necessary to support them? Imagine
>> astronomy, where they are moving into the time domain and
>> away from the "sky is the archive", "so just reobserve next
>> time" mentality; triage, which is super important, is no longer
>> the main driver, and archival is now becoming important and
>> necessary in these eventually 700TB/sec-producing systems.
>>
>> There are all sorts of IO, hardware, computer science, and
>> other advances that we don't have that are needed, and that
>> these types of examples like the SKA will drive.
>>
>> OTOH, the sheer infrastructure, domestic and international policy,
>> investment, and excitement and sense of nationality that many of
>> these new Big Data systems (especially the SKA) are creating in
>> their respective countries (e.g., in South Africa), are enough
>> to at least suggest to my evidence-based mind that there is
>> something impressive here.
>>
>> >Maybe the relevant example would be all those statistics
>> >on dams built or tons of steel produced by the Soviet
>> >Union.  The hype would be more interesting if it could
>> >talk about what new phenomena or understanding
>> >these techniques will produce - not just the data rate
>> >or the total amount of data being produced.
>>
>> Agreed, lots of data has been generated for a while. However,
>> the volume (total and discrete), velocity, and variety (in
>> data types, metadata, etc.) are certainly such that they are
>> worthy of current study, at least in the area of data management.
>>
>> >
>> >Maybe it's just a glorified popularity contest; if so,
>> >it would seem to be at about the level of interest
>> >of the new season of "Dancing with the Stars".
>>
>> Perhaps, but I know you guys are interested in that show :)
>> Who's not?
>>
>> >I suppose the hype is necessary to generate the
>> >funding (which has its uses), but I'm not sure it
>> >will do as much as a few million sent to appropriate
>> >super PACs to move the politics of climate change
>> >along.
>>
>> Think of this as an IT super PAC for next generation data management
>> techniques and systems to deal with data volumes and varieties that
>> we don't have hardware or CS tools to manage yet. I'm not talking
>> about writing to tape and letting it die in the morgue. I'm talking about
>> even simple things like making it available after you write it to spinning
>> disk.
>>
>> Cheers,
>> Chris
>>
>> >
>> >Bruce B.
>> >
>> >On Wed, Mar 20, 2013 at 1:16 AM, Mattmann, Chris A (388J) <
>> >chris.a.mattmann@jpl.nasa.gov> wrote:
>> >
>> >> Hey Bruce,
>> >>
>> >> Hah!
>> >>
>> >> Unfortunately all you get is the short summary through
>> >> the website which does make it scientifically hard to
>> >> judge, however, then again this isn't science, it's a
>> >> glorified popularity contest.
>> >>
>> >> I have a little bit more detailed abstract that I wrote up,
>> >> pasted below (of course the part that they don't use to solicit votes):
>> >>
>> >> ---longer abstract
>> >> The NASA Jet Propulsion Laboratory, California Institute of
>> >> Technology contributes to many Big Data projects for Earth science
>> >> such as the U.S. National Climate Assessment (NCA) and for astronomy
>> >> such as next generation astronomical instruments like the Square
>> >> Kilometre Array (SKA) that will generate unprecedented volumes of
>> >> data (700TB/sec!).
>> >>
>> >> Through these projects, we are addressing four key challenges
>> >> critical for the Hadoop community and broader open source Big Data
>> >> community to consider: (1) unobtrusively integrating science
>> >> algorithms into large scale processing systems; (2) selecting and
>> >> deploying high powered data movement technologies for data staging
>> >> and remote data acquisition, processing, and delivery to our
>> >> customers and users; (3) better leveraging of cloud computing
>> >> (storage and processing) technologies in NASA missions; and (4)
>> >> technologies for automatically and rapidly extracting text and
>> >> metadata from the file formats, by some estimates ranging from a
>> >> few thousand to over fifty thousand in total.
>> >>
>> >> This talk will focus on those Big Data challenges, how NASA JPL is
>> >> addressing them both technologically (Hadoop, OODT, Tika, Nutch,
>> >> Solr) and from a community standpoint (Apache, interacting with
>> >> open source, etc.). I'll also discuss the future of Big Data at JPL
>> >> and NASA and how others can get involved.
>> >> -----
>> >>
>> >> You can think of that as the longer version of what I submitted. *grin*
>> >>
>> >> Cheers,
>> >> Chris
>> >>
>> >>
>> >>
>> >> On 3/19/13 7:20 PM, "Bruce Barkstrom" <br...@gmail.com> wrote:
>> >>
>> >> >OK, so you've got a three-word summary of some
>> >> >hyperbole with Dumbo, the Flying Elephant.
>> >> >How are you going to deal with the real
>> >> >scientific constraints on the physics of combining real
>> >> >measurement technologies and "mashing stuff together"?
>> >> >
>> >> >You need to remember that imaging instruments integrate
>> >> >radiances with spectral responses and Point Spread Function
>> >> >weighted averages over the FOV of whatever the instrument
>> >> >was looking at - and that's just the instantaneous (L1) measurement.
>> >> >If you do orthorectification, you've got variations in the
>> >> >uncertainties across the image: in the parts of the image
>> >> >where you've increased the resolving power (by putting
>> >> >interpolated points closer together), you've also increased
>> >> >the noise, because the orthorectification process acts as a
>> >> >noise multiplier.
>> >> >
>> >> >Next, you've got stuff like cloud identification (and rejection or
>> >> >acceptance) - which depends on spectral response, solar illumination
>> >> >(during the day) and temperature and cloud property stuff during
>> >> >the night - and finally, you've got temporal interpolation (not just
>> >> >creating an average through emission driven by solar illumination
>> >> >during the day and IR cooling at night).  Where (the hell) is
>> >> >the physics that deals with this stuff?  If you do get some
>> >> >statistical stuff, why should anyone believe it contributes to
>> >> >our understanding of climate change?
>> >> >
>> >> >I won't vote, but you can think of this as my input to your
>> >> >scientific conscience.
>> >> >
>> >> >Bruce B.
>> >> >
>> >> >On Tue, Mar 19, 2013 at 7:51 PM, Mattmann, Chris A (388J) <
>> >> >chris.a.mattmann@jpl.nasa.gov> wrote:
>> >> >
>> >> >> Hey Guys,
>> >> >>
>> >> >> I proposed a talk for NASA and Big Data at the Hadoop Summit:
>> >> >>
>> >> >>
>> >> >>
>> >>
>> >>
>> >> >> http://hadoopsummit2013.uservoice.com/forums/196822-future-of-apache-hadoop/suggestions/3733470-nasa-science-and-technology-for-big-data-junkies-
>> >> >>
>> >> >>
>> >> >> If you still have votes, and would like to support my talk, I'd
>> >> >>certainly
>> >> >> appreciate it!
>> >> >>
>> >> >> Thank you for considering.
>> >> >>
>> >> >> Cheers,
>> >> >> Chris Mattmann
>> >> >> Vote Herder
>> >> >>
>> >> >>
>> >>
>> >>
>>
>>
>