What do you do with a petabyte of data?

The question came up during lunch today with two NASA computing people, one in IT and the other in supercomputing.  Modern satellites are returning petabytes of data, and there are many satellites.  That is far more than any human can expect to look at personally, and certainly more than they can fit on their local machine.  How do we make these huge amounts of data useful?

We can't ship it to the user's desktop-- there's no room, it'd take forever, and the user doesn't have tools that can browse massive data sets.

One concept in cloud computing is that the user runs their program where the data is, rather than bringing the data to their program.  This works well when the scientist already knows what the data is and simply needs to run an analysis on it.

For example, Facebook lets anyone write a program that runs against the entire Facebook population.  A user can write a program to fetch the subset of the data that fits certain criteria, then do interesting stuff with it.  The data resides on Facebook's servers, and the user only accesses the subset they need.  This works because Facebook exposes relatively few kinds of data items to the programmer, so a simple relational query can fetch everything relevant.
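
To make that pattern concrete, here is a minimal Python sketch of the "send the program to the data" idea, using an in-memory SQLite table as a stand-in for a remote archive.  The table name, columns, and rows are all invented for illustration; the point is that the filtering runs where the data lives, and only the small matching subset travels back to the user.

    import sqlite3

    # Toy stand-in for a remote data archive; in a real compute-to-data
    # setup this database lives next to the data, not on the user's desktop.
    # Table name, columns, and rows are invented for illustration.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE granules (id TEXT, sensor TEXT, region TEXT, size_gb REAL)")
    db.executemany(
        "INSERT INTO granules VALUES (?, ?, ?, ?)",
        [("g001", "sensor_A", "pacific", 1.2),
         ("g002", "sensor_A", "atlantic", 0.9),
         ("g003", "sensor_B", "pacific", 2.4)],
    )

    # The "program" the user sends to the data: a query selecting only the
    # subset that matches their criteria.  Only these rows come back.
    subset = db.execute(
        "SELECT id, size_gb FROM granules WHERE sensor = ? AND region = ?",
        ("sensor_A", "pacific"),
    ).fetchall()
    print(subset)  # [('g001', 1.2)]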

But in science data sets, you don't know what the relevant data is until after you've solved the problem.  You have to do data exploration first.  And NASA has (for example) Earth data across multiple wavelengths, sensor types, resolutions, and time scales.  If you're not an Earth science expert, you may not yet know which data is important.

We can't ask the user to run their analysis remotely as a 'batch' job, because then the scientist has to decide in advance which data they need.  You can't explore the data in a series of request-and-wait runs; that'd take forever.  It'd be like watching a movie in slide-show format, one slide an hour.

There needs to be a way for a user to a) explore, b) select, then c) work with massive data sets in an interactive fashion.

Let's look at the example of the starship Enterprise from Star Trek.  It has a nigh-omniscient computer and a vast array of sensors.  Commander Data wants to find a hidden Romulan ship, and he talks to the computer to do it.  Here is his technique [paraphrased]:

"Computer, visual onscreen now."  [no ship seen]
"Computer, show thermal signatures."  [still no ship seen]
"Computer, show subspace anomalies." [voila!  the outline of a Romulan ship appears!]

This is a terrible model for attacking a large data set!  A good system would work like this:

"Computer, show me all ship-like objects, in any profile.  Ah, there it is."

Why should a scientist have to drill down through all possibilities?  The problem becomes worse when scientists access data from different domains.  A sociologist may want to explore 'temperature' with 'crime rate' for different regions of a city, to see if there is a connection.

But where is 'local temperature' in the petabytes of the Earth Observing System (EOS)?  There are many temperatures... water temperatures, air temperatures, buildings radiating heat.  There's humidity to factor in ("it's not the heat, it's the humidity").  You need your temperature data to match the timing of your crime stats.  If you have only one crime figure per day, what do you use as a daily temperature-- an average, a peak, a single sample?
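
Just to show how much that choice matters, here is a small Python sketch with made-up hourly readings for one day.  The numbers are invented, but the three "daily temperature" values they produce -- an average, a peak, and a fixed-hour sample -- would each pair differently with a single daily crime figure.

    from statistics import mean

    # Hypothetical hourly air temperatures (degrees C) for one day; a real
    # study would pull these from an actual EOS product for a chosen location.
    hourly = [18.2, 17.9, 17.5, 17.3, 17.8, 19.0, 21.4, 23.8,
              25.9, 27.3, 28.1, 28.6, 28.4, 27.7, 26.2, 24.5,
              22.8, 21.5, 20.6, 19.9, 19.4, 19.0, 18.7, 18.4]

    # Three defensible "daily temperature" numbers to pair with one crime figure:
    daily_mean = mean(hourly)   # average over the whole day
    daily_peak = max(hourly)    # afternoon high
    daily_noon = hourly[12]     # a single sample at noon

    print(round(daily_mean, 1), daily_peak, daily_noon)  # 22.1 28.6 28.4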

To answer these questions, you need to fully understand all the possible data sets, and then be able to judge which are relevant.  But you may want to try different assumptions, because (again) you don't know the answer until you get it.  Having so much data is a blessing, but it creates a very large task, and one made worse if computers and IT aren't your field.

And really, the user doesn't need a petabyte of data.  They need only a subset of it, and then an analysis run on that subset-- a distilled product.  The problem is, they don't know which subset they need until they've done some data exploration.

We've talked about how there is too much data to just explore it all frame by frame, so some additional structure is needed.

One approach is semantic data.  This is metadata, a description language stored alongside the actual data (the images and whatnot), that describes what the data is and what it can be used for.  In theory, you can do a semantic query and never have to specify a particular dataset at all.
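
As a toy illustration only (the dataset names and tags below are invented, not real EOS metadata), a semantic catalog might let a user ask for a concept rather than a dataset:

    # Invented semantic catalog: each dataset carries descriptive tags so a
    # user can query by concept instead of knowing dataset names up front.
    catalog = {
        "land_temp_daily":   {"measures": "land surface temperature", "units": "K"},
        "air_temp_daily":    {"measures": "air temperature", "units": "K"},
        "reflectance_daily": {"measures": "surface reflectance", "units": "unitless"},
    }

    def semantic_query(concept):
        """Return datasets whose metadata says they measure the given concept."""
        return [name for name, tags in catalog.items()
                if concept in tags["measures"]]

    print(semantic_query("temperature"))  # ['land_temp_daily', 'air_temp_daily']

The catch is that whoever wrote those "measures" tags chose one vocabulary; a user hunting for a different word, or thinking in a different field's terms, may come up empty.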

The difficulty with semantic data is that it tags the data from one perspective, which may not match what users (particularly those from other fields) want.  In the end, semantic data is still a schema imposed from outside, and it may not map to how people think.

Another approach is what I'll call the liaison approach, based on library science (or systems analysis, if you remember that subfield).  At the library, you can search the catalog directly or browse the shelves.  If you need help, you ask a person who understands both what people want and how libraries are set up.  These three approaches are:

1) search the catalog for titles and keywords that match = semantic data
2) browse the shelf that has books on that topic = data exploration
3) ask a librarian  = liaison

The advantage of the librarian is that they know books and they know people, so they can connect the two.  If I am looking for a book on, say, Asperger's Syndrome, a librarian can help me narrow down whether I want to read about possible causes (medical), diagnose someone (psychology), find out if there is legal accommodation (law), read about people who have it (biography), or read fiction with aspie protagonists (fiction).  None of these are directly addressed by a keyword search, and there isn't a single shelf that holds all these categories.  The research librarian provides the interface between need and data.

The final method is what I'll call building a map.  For this, a dedicated scientist formulates a problem, gets up to speed on all the possible data, decides what is relevant, then solves the problem.  She publishes her method and results, and from that point on, others can do similar analyses using her map.

This is how science works at present.  It requires individual scientists blessed with wisdom who are (by magic, usually) able to secure funding for unproven cross-disciplinary work.  It also has a time lag, since a research project can easily take years from funding to publication.  But we are acquiring data faster than we are creating new maps for analyzing it.

If this were an academic paper, I could end now, having raised the questions and offered possible solutions.  But this is ScientificBlogging, and we are allowed opinions.  So here goes.

To solve the petabyte problem, I would set up a data researcher position, akin to a reference librarian.  Then I would ship those data-savvy souls to conferences in different fields, to jam with colleagues on what new problems could be solved using cross-disciplinary approaches.  Together they'd define how to approach and tackle the problems.  Said colleagues could then get proposals funded and create the maps for others to follow.

And, I want a pony.

Alex, the Daytime Astronomer

The Daytime Astronomer, Tues&Fri here, via RSS feed, and twitter @skyday