S. George Djorgovski, Astronomer, Data Scientist, and Pioneer of Astroinformatics
Today, artificial intelligence and machine learning influence our lives in ways both obvious and unseen. Perhaps as much as any other sector, scientific research is undergoing an AI revolution, one that has already profoundly influenced physics, chemistry, and biology, and the prevailing assumption is that these changes are only the beginning. Given the significance of these developments, a hugely important question for the history of science arises: when and why did researchers begin to understand and embrace the value of artificial intelligence?
In the discussion below, George Djorgovski provides a powerful perspective on this question. As he explains, the advent in the 1990s of digital detectors for sky survey projects began to yield volumes of data too massive for humans to sift through. Both aspects of this development explain the exponential rise in captured data. The detectors, called charge-coupled devices, used technology akin to video recorders and could capture the night sky essentially as a moving image; and the goal of astronomical survey projects is to cover massive areas of the sky over long periods of time. Together, these advances presented astronomers with a challenging but very productive problem: how to separate the signal from the noise? Or, put another way, how to sift through all the data to find potentially interesting objects to send to larger, more focused telescopes, in the hopes of making new discoveries in every aspect of astronomy?
In his own career, often relying on these research methods, Djorgovski has made hugely important contributions across a wide range of astronomy sub-disciplines, including galaxy formation and evolution, quasars, and gravitational lenses. Djorgovski reflects on the significance of his work on gamma-ray bursts, and he is recognized in the field as one of the pioneering figures in astroinformatics, which, as the name suggests, is a new interdisciplinary approach that combines data science and astronomy. With the launch of the National Virtual Observatory and the Center for Data-Driven Discovery at Caltech, Djorgovski ensured that these advances would reverberate globally, as astronomy is a truly multinational endeavor. Soon enough, scientists from other fields began to appreciate the universal applicability of machine learning to their own work. To take one example, even twenty years ago most of biology was considered "small science"; today, advances in drug delivery and genomics are possible only because of massively improved computational and AI capacities that allow biologists to make sense of the data they collect.
Reflecting on his own trajectory and upbringing in Yugoslavia, where opportunities to learn about the universe were few and far between, Djorgovski has also become a committed science communicator. His spearheading of the "Big Picture" at Griffith Observatory, the largest and likely most viewed astronomical image in the world, is but one example of his drive to connect scientific discovery with a broad public audience that clamors to understand our universe.
Interview Transcript
DAVID ZIERLER: This is David Zierler, Director of the Caltech Heritage Project. It is Friday, January 27, 2023. I am delighted to be here with Professor George Djorgovski. George, it's great to be with you. Thank you so much for joining me today.
GEORGE DJORGOVSKI: My pleasure. I'm happy to be here to help.
ZIERLER: To start, you go by your middle name. Have you always gone by George?
DJORGOVSKI: Well, since I arrived in the US, basically. In Yugoslavia, where I'm from, there are no middle names. In some other Slavic countries, the middle name is a patronymic, derived from the father's name. But we just have the first name and last name. And people found it hard to pronounce Djorgovski, so somehow, that became George, and I got used to it. Then, when I became a US citizen, I added it as a middle name since it was de facto what people called me.
ZIERLER: Oh, wow, that's great. [Laugh] On a more official level, what are your titles here at Caltech?
DJORGOVSKI: I'm a Professor of Astronomy and Data Science. I'm also the Director of the Center for Data-Driven Discovery.
ZIERLER: Astronomy and data science. Is that two affiliations or one affiliation within one job title?
DJORGOVSKI: It's just one job title. As my interests and research evolved over the last couple of decades, I just requested it. I've been de facto doing data science, or really, astroinformatics, mostly, so I asked if I could be called Professor of Astroinformatics. The administration was reluctant for some reason to use that word, and since I was doing things outside astronomy as well, they chose to give me the title Professor of Astronomy and Data Science, which is a perfectly good description of what I do.
ZIERLER: Is there any other professor with a similar formulation in their title, given how expansive data science has become? Do we see other Caltech professors who are professors of X and data science?
DJORGOVSKI: Not that I know of. Although, it's probably just a matter of time. Another thing is that the methods of data science are becoming a standard part of data analytics in every field, so it would be simply assumed that everybody does it. It's already the case, I think. But academia changes its ways very slowly.
ZIERLER: You mentioned starting off in astrophysics. I'm curious, from your perspective: there's astronomy, there's astrophysics, and your work has certainly touched on cosmology. Where do you see the boundaries of these fields? And given your unique perspective on big data, data science, and astroinformatics, how might those technologies influence where the boundaries between these disciplines lie?
DJORGOVSKI: There are no real boundaries. I think astronomy in the 20th century transitioned from being a separate science to a branch of physics. There's almost nothing in astronomy that doesn't involve physics in some way. I compare it sometimes to condensed-matter physics, in that it's messy, phenomenological, and complex. But the underlying apparatus, both for doing the measurements and for understanding what's going on, is physics. In my mind, astronomy equals astrophysics. It's a branch of physics. Cosmology is specifically studying the universe at large. It used to be just about global geometry and dynamics of the expanding universe, but it gradually started including understanding the ingredients of the universe on large scales, galaxies, large-scale structures, and whatnot.
There's a very fuzzy boundary between what we call extragalactic astronomy, anything outside the Milky Way, and observational cosmology. Now, physicists who study cosmology tend to deal with the much earlier universe. As I was telling our students, people coming from astronomy to cosmology think that cosmology begins here and ends at the cosmic microwave background. And physicists think that it begins at the cosmic microwave background and goes towards the big bang. But that's rather arbitrary, and it's just a question of what you're really interested in. Are you interested in an inflationary scenario or how galaxies form? There are no sharp boundaries there.
From Belgrade to Berkeley
ZIERLER: Let's take it back to the beginning. Briefly, I'd like to learn a little bit about your background. Let's start with your parents. Tell me where they're from.
DJORGOVSKI: I was born and grew up in Yugoslavia. I studied physics at the University of Belgrade, then went to graduate school at Berkeley for astronomy. I got my degrees there, then went to Harvard as a Harvard Junior Fellow for two years, before I was hired as a faculty member at Caltech.
ZIERLER: And your family background. Yugoslavia, of course, was made up of several different ethnicities or nationalities. Where did your family come from?
DJORGOVSKI: From Serbia, or Belgrade, the capital city. I was always interested in how the world works. That's why I wanted to be a physicist. Somewhere along the way, I switched to astrophysics and cosmology, and I think that was the right choice for me.
ZIERLER: Were you always interested in astronomy, even when you were a kid?
DJORGOVSKI: Well, as much as any science-curious kid would be. But until I was maybe freshman or sophomore at the university, I thought I would be a theoretical particle physicist because that sounded fundamental and glamorous. As I learned more, I realized that I was actually more interested in things that were going on in the universe, in cosmology, quasars, black holes, and stuff like that. I also found out that my scientific temperament is more suited to experimental or observational work than to pure theory, although I try to be conversant in theory as well.
ZIERLER: Tell me the kinds of schools you went to growing up.
DJORGOVSKI: There was nothing special about them, it was just the European public education system. There are no special schools, there are no schools for advanced children. Everybody was the same, comrade. [Laugh] Anyway, I was largely self-taught, and bored in school, so I would just read and learn about whatever I found interesting.
ZIERLER: And going to the University of Belgrade, was that the best school in Yugoslavia?
DJORGOVSKI: Probably. A couple of other regional capital cities, like maybe Zagreb, had excellent old universities also. These are from the old European academic tradition, and they actually provide a really excellent, rigorous education, in physics, mathematics, and so on. More so than I thought. When I got to the US, I found out I wasn't the dumbest kid on the block. In fact, I knew more physics and math than most of my peers in grad school who came from Princeton, Harvard, or wherever.
ZIERLER: As an undergraduate, did you have access to telescopes? Did you have a sense of what observational astronomy was?
DJORGOVSKI: Not so much. Yugoslavia did not have many observational facilities. Very modest ones, I would say. And no modern instrumentation. We were trained in some of that stuff, but really, my preparation was largely in understanding the theoretical background. We had no computers, either. I think I'd seen a computer once through a glass door while at the University of Belgrade. I never touched one. The first time I actually got to use one was in grad school.
ZIERLER: Was there a professor, or do you have a specific memory of when you were encouraged to think about graduate school, applying abroad?
DJORGOVSKI: No, that was always my plan, if you will. I knew since I was maybe in elementary school that I wanted to be a scientist, and I wanted to go to the United States. That was pretty much how I always viewed things. The first time I could actually do that was to go to grad school. As you know, in the US, you have to pay a lot of money to be an undergraduate student, and in grad school, they pay you. Slave wages, but nevertheless. Realistically, the first time I could really go to study abroad was in grad school. And I went as far west as was practical, so I ended up in Berkeley.
ZIERLER: What programs did you apply to? What was interesting for you, coming from Yugoslavia?
DJORGOVSKI: Well, I was applying for astronomy and astrophysics PhD programs, maybe a dozen of them. All the usual big names, plus a couple of backup schools, because I had no idea how good I really was. I got into all of them except the University of Hawaii, because their bureaucracy wanted our Ministry of Agriculture to vouch that I wouldn't bring in snakes, snails, and stuff like that – the usual tourist form. That's not how things worked in Yugoslavia. But that's fine. Among other schools, I got into Berkeley and Caltech. In the end, I pretty much tossed a coin to decide which one to attend, and I went to Berkeley. Now, I'm a professor at Caltech, which I think is a lot better than the other way around.
ZIERLER: Why do you say that? That's interesting.
DJORGOVSKI: Caltech has always been the mecca, or a leading institution, for astronomy. Maybe less so now, but in terms of ownership of big telescopes, hiring the best people, and so on. From a student's viewpoint, the social life was better at Berkeley. Caltech was small, and there still weren't very many women students at that time. That's changed completely since then. Overall, even though most of the time I just worked, it was probably a better experience to be a student at Berkeley at that time. I was offered a faculty position at Berkeley at the same time, but I knew I wanted to come to Caltech because, to me, that was clearly the better choice.
ZIERLER: When you arrived in Berkeley, how was your English? Had you learned it in school in Yugoslavia?
DJORGOVSKI: No, I learned Russian in school, and I had no choice about that. I learned English on my own. I never had formal instruction in English in my life, so I learned it from books, movies, from traveling with a backpack around Europe, and I just picked it up. As it turned out, it was quite serviceable. My peers at Berkeley were surprised at how good my English was at the time. That's perhaps because it wasn't learned from the books, but learned on the streets if you will.
Foundations in Galaxy Evolution
ZIERLER: When you got to the department, what were the big exciting things happening in astronomy at that time?
DJORGOVSKI: There was a lot of different stuff happening. I was interested in extragalactic astronomy and cosmology. The big thing back then was studies of galaxy evolution. Originally, cosmology was all about finding the geometry, dynamics, and evolution of the universe as a whole, but cosmological tests up to that time had proven to be unreliable because they were using galaxies as probes, and galaxies change in time. There was no good standard candle or standard ruler, things that you need for cosmological tests. People had almost given up on measuring cosmological parameters because of these difficulties, but it became clear that galaxy evolution is a really interesting subject in and of itself. Where does the content of the universe come from? I got to work with Professor Hy Spinrad, who was a well-known observer of very distant galaxies, and I learned my observational trade from him. But I also found other things I was really interested in, like the dynamics of star clusters, with Professor Ivan King, who was probably one of the biggest intellectual influences on me at the time and stayed a friend until the end of his life. He passed away recently. I also worked with Professor Marc Davis, who, again, is an expert in galaxies and large-scale structure, and my thesis was officially under him, as well as with other professors. People said that I did three theses at once, and just one happened to be the official one.
ZIERLER: The three parts, are they related to the three advisors or mentors you noted?
DJORGOVSKI: Discovering the most distant galaxies known at the time and finding the first evidence for galaxy evolution at very large redshifts was the one with Hy Spinrad. With Ivan King, it was discovering a phenomenon in globular clusters called core collapse that is important for the dynamical evolution of those systems. That was very interesting for a while. And with Marc Davis, the original idea was to find a better distance-indicator relationship for galaxies in order to measure the large-scale structure and peculiar velocities of galaxies as they move around. In that process, we discovered what's called the fundamental plane of elliptical galaxies, a set of bivariate correlations that tells us something about how galaxies form, how they work, and can also be used for the original purpose we had, which was to measure the distances and velocities of galaxies. A lot of my subsequent work involved studies of these fundamental correlations, and their uses to gain a better understanding of the formation and evolution of elliptical galaxies. Maybe in the long term, that might have been my most important contribution to astronomy at that time.
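For readers who want a concrete sense of what fitting such a bivariate correlation involves, here is a minimal sketch in Python. It fits a fundamental-plane-style relation to synthetic data; the coefficients, scatter, and variable ranges are invented for the illustration and are not Djorgovski's actual measurements.

```python
import numpy as np

# Illustrative sketch only: fit a fundamental-plane-style bivariate correlation
#   log Re = a * log sigma + b * log <I>e + c
# to synthetic data. All numbers below are made up for the demo.
rng = np.random.default_rng(0)
n = 500
log_sigma = rng.normal(2.2, 0.15, n)     # velocity dispersion (log km/s)
log_Ie = rng.normal(3.0, 0.3, n)         # mean surface brightness (arbitrary log units)
log_Re = 1.2 * log_sigma - 0.8 * log_Ie + 0.5 + rng.normal(0, 0.05, n)  # assumed "true" plane

# Ordinary least squares for the plane coefficients
A = np.column_stack([log_sigma, log_Ie, np.ones(n)])
coeffs, *_ = np.linalg.lstsq(A, log_Re, rcond=None)
a, b, c = coeffs
scatter = np.std(log_Re - A @ coeffs)
print(f"fitted plane: log Re = {a:.2f} log sigma + {b:.2f} log Ie + {c:.2f}  (rms {scatter:.3f} dex)")
```

In practice, work of this kind pays careful attention to measurement errors and selection effects, which the toy fit above ignores.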
ZIERLER: What telescopes or datasets were most important for your thesis work?
DJORGOVSKI: Most of the data for the thesis itself was taken with a homebuilt one-meter telescope at Lick Observatory. Not a very big telescope. And that was fine because I was studying nearby galaxies. But most of my interesting observational work was done with the biggest telescopes we had access to at the time, mostly the four-meter telescopes at Kitt Peak and at Cerro Tololo Inter-American Observatory, and the three-meter telescope at Lick Observatory. The Hale 200-inch Telescope at Palomar was still the largest, but the rest of the astronomical community was catching up and producing better instruments at the time. I pretty much observed with all the major and semi-major telescopes in operation back then and developed various observational and data skills with those.
ZIERLER: You emphasized in particular King's role in your intellectual development. In what way?
DJORGOVSKI: He was a scientist of the old school, in some sense, very rigorous, incredibly honest, and with very high standards. He also had a good sense of humor and didn't suffer fools gladly. He was one of the people involved in the creation of the Hubble Space Telescope. He spent a lot of time and effort on that. I did some work for him in preparation for Hubble before it was launched. Overall, he had a very interesting intellect, more multifaceted than most other faculty. A lot of people in our field tend to be very one-sided astronomy nerds. I just enjoyed interacting with him more than with the others.
ZIERLER: What was some of your work with globular clusters around this time in the early 1990s?
DJORGOVSKI: It really started in the mid-1980s while I was in grad school, with almost a throwaway project that Ivan King gave me. He was probably the world's leading expert in globular star clusters. To me, what was interesting about star clusters was the intersection between gravity and thermodynamics. Essentially, you're looking at this self-gravitating ball of stars as test particles in a gravitational field. Just like molecules of gas, they were moving around, bouncing around, and so on. There's a whole field of gravo-thermodynamics, the thermodynamics of stellar systems bound by gravity. I thought there was some fascinating physics there. Ivan King gave me this project to digitize some of the photographic plates he took of globular clusters and look at their cores.
There was one cluster that always seemed to have too bright a core relative to what was expected at the time. Ivan King made his name by coming up with the first dynamical models of star clusters, and they're still useful. But there was another phenomenon predicted by theorists called core collapse. The interesting thing about systems bound only by attractive forces, like gravity, is that they have negative specific heat. That's not the case with a gas, which has what's called van der Waals forces: molecules may attract each other from large distances, but when they get close, they start repelling each other. That doesn't happen with gravity. When you have one of those self-gravitating balls of stars, it will tend to pull together under its own gravity, and then some stars in the middle will be accelerated by gravitational encounters, so fast that they'll leave the cluster and take some kinetic energy with them, which will make the cluster shrink some more. And that's a runaway process.
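The negative-specific-heat property mentioned here follows from the virial theorem. The short derivation below is a standard textbook sketch added for illustration, assuming a bound, relaxed cluster; it is not a quotation from the interview.

```latex
% Virial equilibrium for a bound, self-gravitating system:
2K + U = 0 \;\Rightarrow\; U = -2K, \qquad E = K + U = -K
% Identifying a "temperature" with the mean kinetic energy per star, T \propto K:
C \;\equiv\; \frac{dE}{dT} \;\propto\; \frac{dE}{dK} \;=\; -1 \;<\; 0
% Losing energy (E decreases) therefore raises K and T: the core contracts
% and heats up, which is the runaway collapse described above.
```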
At first, people thought that was how you could make big black holes, by the gravitational collapse of star clusters. In fact, that could happen. But sometime around the early 1980s, there was another interesting finding, from the first applications of computers to study these processes. What happens is, you have a binary star formed in the middle of a cluster, and binaries that are very tightly bound can serve as a source of energy for kicking out other stars. The binaries shrink, and the orbital energy they give up goes into the motions of the other stars, which stabilizes the cluster against gravitational collapse. There was a prediction that there would be these post-core-collapse structures, and their shape in the middle would be different. That was very much a theoretical prediction that was not known to be borne out by observations. The project Ivan gave me was to check this out, and lo and behold, we found exactly what the theory predicted.
There were clusters that had undergone gravitational core collapse, then recovered, and they look a little different from the others. That was an important finding. Then, I decided that instead of just looking at 10 or so, let's do all of the globular clusters in the Milky Way. We did a survey and found that 20% of globular clusters have gone through this process. There's all manner of interesting stellar dynamics involved. Then, we started looking at whether this is affecting the stellar populations, because of all these close encounters and collisions of stars, and so on. That tied together the stellar dynamics with stellar evolution, which was interesting. When the Hubble Space Telescope was finally launched, it enabled us to look at clusters with much better resolution than we could from the ground, and a whole bunch of interesting scientific results came out of that.
ZIERLER: Either scientifically or intellectually, what do you see as the connecting points between these three discrete topics in your thesis?
DJORGOVSKI: It's not really obvious. They're all really interesting topics. The one I was interested in during my prior education was very distant galaxies, the distant universe, and cosmology. The dynamics of stellar systems was just not on my radar at all, but then I discovered that it is very interesting, too. Regarding nearby galaxies, I also thought that they were well understood and that the exciting stuff was far away. That turns out not to be the case; we can learn a lot about galaxies, how they work, and where they come from by studying galaxies nearby. Today, this is a booming field. They call it galactic archeology. Just from the properties of the Milky Way, its constituents, or nearby galaxies, you can learn a lot about what happened in the past.
To me, this was a new thing. One aspect of this, which again, I was led to first by Ivan King, is looking at highly dimensional data spaces. If you measure a bunch of different properties of something, say, galaxies, then each quantity that you're measuring is a new axis of some data space. Those abstract spaces can have not just two or three, but tens of dimensions, depending on how many things you can measure well. Somewhere in that multidimensional data space, there may be interesting correlations that you can then use to understand better what you're studying, use to measure their distances, and things like that. And that, to me, was fascinating. It connected with a purely intellectual interest I had earlier from the studies of linear algebra in physics, in Hilbert spaces. I was always intrigued by multidimensional spaces, not just mathematical, but in this case, multidimensional data spaces. That remained one of the core intellectual touchstones throughout my professional life. A lot of the stuff we do today, analyzing big sky surveys with machine learning, pretty much plays out in that multidimensional framework.
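As a rough illustration of what "interesting correlations hiding in a multidimensional data space" can mean in practice, the sketch below builds a synthetic ten-parameter catalog driven by two hidden variables and recovers that low-dimensional structure with principal component analysis. All names and numbers are invented; this is not any particular survey analysis.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic "catalog": ten measured quantities controlled by only two
# underlying physical parameters, so the data occupy a thin plane inside
# the ten-dimensional data space. Everything here is invented for the demo.
rng = np.random.default_rng(1)
n_objects, n_features = 1000, 10
latent = rng.normal(size=(n_objects, 2))        # two hidden physical parameters
mixing = rng.normal(size=(2, n_features))       # how they map to observables
catalog = latent @ mixing + 0.05 * rng.normal(size=(n_objects, n_features))

pca = PCA().fit(catalog)
print("variance explained by each principal axis:")
print(np.round(pca.explained_variance_ratio_, 3))
# The first two components carry nearly all the variance: the "interesting
# correlation" lives on a low-dimensional surface inside the larger data space.
```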
Early Embrace of Computation
ZIERLER: Was there anything specifically computational about your thesis research, just to foreshadow ahead?
DJORGOVSKI: Yes, but it was mostly traditional image processing. The datasets were not large, and the digital images were not large, because the detectors were small. Astronomy as a whole really got into digital imaging detectors around the 1970s, first with detectors that were like TV cameras, or similar to that, then with solid-state devices, CCDs, which were just coming into use in astronomy in the early 1980s as I was starting my graduate work. Those detectors, even in their state back then, which was not nearly as nice as today, were still superior in terms of sensitivity and noise properties to photographic plates. The drawback was, at the time, they only viewed small fields. But it also required astronomers to develop image processing tools and measurement tools to extract quantitative information from those relatively small images. I wrote a lot of software to do exactly that. You can follow the history of data-rich astronomy in terms of how much data is collected. Back in the 1980s, we were talking about megabytes. We had to develop image analysis tools to extract knowledge from megabyte-scale datasets. Then, as things kept going up and up into gigabytes and terabytes, machine learning had to be used because there was just too much data for people to process by hand. We had to automate a lot of data processing tasks, and that was really how machine learning entered astronomy.
ZIERLER: Did you work on gravitational lensing as a graduate student?
DJORGOVSKI: Yes. In fact, I found what we thought was the third known gravitational lens. Actually, it may have been the first example of a binary quasar, not a lens, but it wasn't clear for a while. In my subsequent work, I found a bunch more. I also discovered the first unambiguous case of a binary quasar, two physically distinct quasars in close proximity. Just like you can have binary stars, you can have binary galactic nuclei and binary quasars. In the subsequent work, I found several tens more of either lenses or binary quasars, including the first known triple quasar, which was an interesting thing in and of itself. In fact, one of my main projects now with collaborators, notably Alberto Krone-Martins at UC Irvine, Daniel Stern at JPL, Matthew Graham here at Caltech, and a whole lot of other people, mostly in Europe, was to use data from the Gaia Satellite with very innovative approaches, using machine learning, to discover gravitational lenses in a systematic fashion, and then use those lenses for cosmology to measure the Hubble constant, to better understand dark matter, and so on. I thought that was really a good synergy of big data, machine learning tools, and interesting science, and that's where pretty much the regime of astronomy is today.
ZIERLER: Given the significance of the discovery, I wonder if you can describe in some technical detail how you discovered a binary quasar. What did that mean?
DJORGOVSKI: We didn't really anticipate binary quasars, although, in retrospect, it's clear they should've been there. But I was just looking for gravitational lenses, and I had some ideas on how to do that. The first binary quasar I found by just looking at photographic sky survey prints, looking for quasars that seemed to have extended images, which could be two objects close together, and following them up with a spectrograph. This one was at the Multiple Mirror Telescope in Arizona. We thought that we had found another gravitational lens, which was interesting, but not a big deal. And then, something really interesting happened. We did follow-up using the Very Large Array radio telescope in New Mexico, and instead of seeing two quasars, we saw only one. That can't be if they're gravitationally lensed, because gravitational lensing is wavelength independent. You should see the exact same thing replicated in visible light or in the radio. That immediately told us we weren't looking at a gravitational lens, we were looking at a physical pair of quasars.
Quasars tend to be fueled by interactions of galaxies, collisions of galaxies, and most or all large galaxies have supermassive black holes in their centers. Sometimes, instead of just fueling one of them, the interaction fuels both of them. The number of binary quasars that were subsequently found was much higher than you'd expect by chance from galaxy clustering. It had to do with the way that quasar activity is triggered, and we understood very quickly why this was. We looked at other data, found out that some other systems we thought were gravitational lenses were probably not, but more examples of these binary quasars. We did surveys of binary quasars later on. They are interesting probes of both clustering of galaxies far away, and also how galaxy interactions are responsible for the fueling of quasar activity.
ZIERLER: When you said the field didn't know that binary quasars were out there, is that to say there was some deficiency in the theory?
DJORGOVSKI: No, it's just one of those things that nobody bothered to consider. This happens more often than you'd think. You find something surprising, then you go back and say, "Oh, yeah, we should've found that". Another case of that, more recently, in surveys we've done here at Caltech, is the discovery of candidates for supermassive black hole binaries. These would be the analogs of the black hole binaries that LIGO is finding, which eventually merge, except these would be maybe 10 million times bigger. The story here is that we know from a variety of studies that nearly every massive galaxy, including the Milky Way, develops a supermassive black hole in the middle. In the Milky Way, it's four million times more massive than the Sun. In the M87 galaxy, shown in the famous image from the Event Horizon Telescope, it is several billion times the mass of the Sun. Those are the black holes that, when you dump some fuel into them, create the quasar phenomenon.
We also know that galaxies merge, a lot, as the universe evolves. They collide, they lose their kinetic energy to tidal friction, and if most of them have big black holes in the middle, eventually you end up with a binary supermassive black hole in the center of the resulting merger remnant galaxy. At first, they orbit for a while, losing some of their orbital energy by transferring it into the motions of stars, but when they get close enough, they start losing their orbital energy to gravitational waves, exactly what happens with the stellar-mass black holes that LIGO finds. They spiral in and eventually merge, making an even bigger black hole. Theory totally expected this to happen, in what we call hierarchical structure formation in the universe. Galaxies merge. Black holes will merge, and they will grow in time. If you look at theoretical models of how many we should see, it would be roughly one in 10,000 quasars or so. There just weren't sufficient data to look for this phenomenon.
The first time we actually had adequate datasets to check for something like this was with one of these modern sky surveys. In the sky survey we used, the Catalina Real-Time Transient Survey, we collected data that monitored the brightness of something like 350,000 quasars, eventually maybe a million or so. In terms of the numbers and the temporal coverage, such data had never existed before. We didn't actually set out to look for the supermassive black hole binaries. We did this almost on a lark, to see if we could find periodic signals. Quasars vary in a chaotic fashion, similar to the variations of stock markets. You can see bumps and wiggles, and people sometimes see fake periods. People talk about cycles, which is actually not true. But it was an interesting challenge, whether we could dig up periodic signals that are buried in this type of noise, which can sometimes fool you.
We used new types of machine learning and statistical analysis tools to do that, and to our initial surprise, we found them. But as soon as we started thinking about it, we realized that we should have found them. Remarkably enough, we found exactly the numbers that theory predicted we should see. That, I would say, was one of the more satisfying scientific results we obtained in all of our sky survey work. It's by no means fully established yet. These things take a while. The periods are several years long, the way we observe them, so you need to keep collecting lots and lots of data to obtain statistically significant results, implying that this is not some chance noise thing, but that we actually do see a signature of two big black holes orbiting each other.
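As a toy example of searching for a periodic signal in an irregularly sampled light curve, the sketch below uses a Lomb-Scargle periodogram from astropy on invented data. It is not the actual CRTS analysis, which relied on more sophisticated statistics and machine learning to guard against the red, stochastic variability of quasars mimicking periods.

```python
import numpy as np
from astropy.timeseries import LombScargle

# Toy illustration only: all numbers below are invented.
rng = np.random.default_rng(2)
t = np.sort(rng.uniform(0, 9 * 365.25, 400))   # ~9 years of irregular sampling (days)
true_period = 1200.0                            # days (assumed)
mag = 0.1 * np.sin(2 * np.pi * t / true_period) + 0.05 * rng.normal(size=t.size)
mag_err = np.full_like(t, 0.05)

frequency, power = LombScargle(t, mag, mag_err).autopower(
    minimum_frequency=1 / 3000, maximum_frequency=1 / 100)
best_period = 1 / frequency[np.argmax(power)]
print(f"best-fit period: {best_period:.0f} days (injected: {true_period:.0f})")
# Caveat from the interview: stochastic quasar variability can mimic periods,
# so a peak like this needs careful significance testing against red noise.
```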
ZIERLER: You mentioned the triple quasar. Is there any theoretical limit on how high that number could go? Quadruple, quintuple?
DJORGOVSKI: These systems form from gravitational interactions of their host galaxies. Galaxies cluster, and we find a lot of cases where two of them, in the process of merging, will both trigger activity in their central black holes. We can compute the probabilities of that happening just from the clustering of galaxies. We understand very well how galaxies cluster and can quantify that. Then, occasionally, you're going to see not just two galaxies close together but three, and we can estimate the probability of that happening. And the probability of any one of those galaxies hosting an active quasar multiplies that; if you have three, it's the cube of that number. Long story short, among the quasars we should see over the entire sky, there won't be more than a handful of these triple systems at any given time. We happened to find one, and there was another one found later by a different group.
That is just from the statistics of galaxy clustering, and you also have to catch them while they're in their quasar phase, which is not all the time. Galaxies have been around for about 12 or 13 billion years, and quasar episodes last tens of millions of years, maybe a hundred million at most. It's a relatively rare thing, and we have a pretty good idea how many there are. I would say the chances of finding one with four quasars in an interacting galaxy system are pretty small. Not impossible, it may happen, but it has never been seen, and I wouldn't be surprised if there just isn't one within the observable universe.
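The scaling argument can be made concrete with a back-of-the-envelope calculation. The numbers below are invented placeholders; the point is only how each additional active member multiplies in another small probability.

```python
# Back-of-the-envelope version of the scaling argument above. All numbers are
# invented; only the way the probabilities multiply is the point.
p_extra = 1e-4          # assumed chance that a quasar has one more interacting,
                        # simultaneously active quasar companion
n_quasars = 1_000_000   # quasars in a hypothetical survey

expected_binaries = n_quasars * p_extra          # one extra active companion
expected_triples = n_quasars * p_extra ** 2      # two extra active companions

print(f"binaries: ~{expected_binaries:.0f}, triples: ~{expected_triples:.2f}")
# 1e6 * 1e-4 = ~100 binaries, but 1e6 * 1e-8 = ~0.01 triples: each additional
# member costs another small factor, so quadruples are vanishingly unlikely.
```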
ZIERLER: Of all the kinds of astronomy, at the end of your thesis research, what kind of astronomer did you think of yourself as?
DJORGOVSKI: As an observational extragalactic astronomer, which is how I'd seen things since even before I started grad school. I'd been working in many, many different fields and subfields of astronomy, and published papers on everything from asteroids and comets to dark energy. But some themes were more present in my work, and we just talked about most of them.
Astronomy at Harvard
ZIERLER: After you defended, what opportunities did you have? What post-docs were you looking at?
DJORGOVSKI: I actually started looking at junior faculty jobs right away; I thought that I could get a faculty job straight out of grad school. And I actually had some potential leads. But then, my advisors persuaded me: "You really want to do a post-doc first because it really is the best time of your career. You get to do your research, you know what you're doing, you don't have a lot of obligations that faculty have, and so on." I said, "You know, you're right." I applied for a bunch of prestigious fellowships, and I got, among others, a Harvard Junior Fellowship, which a number of Caltech faculty went through as well, especially in physics.
That was a very interesting thing intellectually. I wasn't quite applying for faculty jobs, it was only my second year of the three-year post-doc, but I was asked to apply for a Caltech job, a Berkeley job, a Johns Hopkins job, and a bunch of others. Of those, I was only really interested in Caltech and Berkeley. The Space Telescope Science Institute is what made Johns Hopkins astronomy great. But I just didn't think I wanted to live in Baltimore or anywhere on the East Coast. Interestingly enough, while I was at Harvard, they also had a faculty opening for somebody with exactly my profile, but I was not interested in Harvard at all. They were really upset that I didn't even bother to apply. But that's a different story. When Caltech asked me to apply, I said, "Yes, I'm definitely going to do that." It was an interesting year. There were three people in our business on the market: Shri Kulkarni, myself, and a former student of Wal Sargent, Alex Filippenko, who was a postdoc at Berkeley. I think all three of us got offers from both Caltech and Berkeley. Filippenko decided to stay in Berkeley, probably correctly, Shri was here already as a Millikan Fellow, and in my mind, Caltech was the most desirable of those places. After only two years as a postdoc at Harvard, I came here as faculty. And Shri was hired to be faculty at the same time.
ZIERLER: What was your research focus during your time at Harvard?
DJORGOVSKI: A direct continuation of all the things I'd been doing as a graduate student, pursuing where those research avenues went. Fundamental properties of elliptical galaxies, those correlations I mentioned, globular star clusters, observations of those with the Hubble Space Telescope, and always looking for ever more distant galaxies. Those were all very interesting. I had plenty of time and energy. I even took on many more projects, which distracted me from those. But I was so curious about all of that stuff, I almost couldn't say no to an interesting project. Those were years of very hard work. Later on, I decided to focus more on some projects. The sky surveys were, again, a new thing. I did not foresee myself doing digital sky surveys while I was doing all these other projects.
That became an interesting subject to follow up on, and that's pretty much where I've stayed for the rest of my career. It was always driven by a scientific goal of some kind or another, but also always with an understanding that we have, for the first time, a means of systematically exploring the sky. Finding not only interesting things that we know are there, but potentially finding new things we didn't know about before. To me, that was the most interesting possibility of them all. And the tools were there, machine learning, and so on. I realized that astronomy wasn't the only science doing this and that all sciences were going through pretty much the same process. Some were doing it earlier than others. Astronomy was one of the very first, if not the first, to really get into this mode of research. Nowadays, biological sciences are pretty much leading the way in that arena. We've seen this whole transformation of science in general driven by the computing and information technology revolution. I just couldn't resist doing that because to me this was absolutely fascinating.
We were always driven by technological advancements, mostly in terms of new detectors and new wavelengths we could explore, radio telescopes from electronics, infrared from infrared detectors, spaceflight gave us X-ray and gamma-ray astronomy, and so on. But this was a different kind of exploration, the cyberspace, the information space, where it didn't matter what wavelength it was. It so happens that optical astronomy still produces most of the data. It enabled us to explore this informational image of the physical universe, and to do so in ways we could never do before, that we can now do systematically. My core intellectual interest is how computing and information technology is changing the way we do science. I still like doing science, but these technologies are changing everything, in all aspects of modern society, economy, science, scholarship, education, and so on. I think that we're still largely fumbling with how to do that well. The technology that's enabling all this new stuff is developing so fast, and it's so new, that a lot of people just can't adapt to it. The developments are too fast, and that's an interesting challenge in and of itself.
ZIERLER: When you were at Harvard with that affiliation, did you have access to different telescopes than you had on the West Coast?
DJORGOVSKI: Oh, sure. Yes. They had their own, and I certainly used them. I also used telescopes at various national and international facilities. The number of private telescopes had been increasing at first, then started decreasing at some point. Interestingly enough, when I got to Caltech, the amount of observing time I was getting went down. The astronomy community tends to be a little jealous of Caltech. I'd be told, whether or not it was legally correct, "You're at Caltech. You don't need time on this telescope." That's not how it works. [Laugh] You need a particular instrument for a particular science. In principle, national facilities should not discriminate against people who have access to big telescopes at their own institutions. But they do, even though they'd never admit to it. Anyway, I got plenty of observing time all told, especially moving into this new arena, where you do a survey with a smaller telescope, then use the big telescopes just to follow up. I always had my hands full with interesting things to do.
ZIERLER: Without getting into any sensitive details, your lack of interest in applying at Harvard, was that more about the research culture, the climate? Did you just not want to be in the Northeast?
DJORGOVSKI: A combination of everything. I always call myself a born-again Californian because once you get used to California, you don't really want to live anywhere else. Sometimes, I jokingly say my two years at Harvard were my exile east. And since I behaved so well, they cut my sentence and let me go back to California after only two years. But it was a combination of things, the academic culture, the overall culture. Harvard is the butt of many jokes about their institutional arrogance, and all of them are correct. I also found the overall culture of the Boston area to be actually very parochial compared to something like New York. I would sometimes go to New York just to kind of get out of there. And the climate, of course. Overall, I would say that it was a combination of everything. Quality of life, culture in general, climate. I think that California was the only place to be.
ZIERLER: Were there any faculty members or post-docs at Harvard that you worked closely with?
DJORGOVSKI: It was more like professional friendships and sometimes collaborations. The late John Huchra, who was one of the people who pioneered a lot of redshift surveys, and mapping of large-scale structures, was a good friend of mine. His untimely passing really saddened us all. Bob Kirshner was also a Harvard professor at the time. Now, of course, he's leading the TMT Observatory. And there were others. But largely, I would say, I was continuing my previous collaborations, developing new ones, especially in Europe. I worked a lot, for example, with Italian astronomers, and I still work with some of them. There was never a shortage of interesting people with whom to work. I tried to spend as much time as I could not in Cambridge, but in California, Arizona, South America, Europe, to the point that they were joking about putting my picture on a milk carton as a missing postdoc.
ZIERLER: 1987, you arrive in Pasadena. What were some of your early impressions?
DJORGOVSKI: I can't really think of anything in particular. I knew what California was like, I'd been to Pasadena before. And frankly, I was just so busy. I was working 14-hour days, seven days a week because I just loved doing that stuff. For years, people would come and visit, "Oh, we went to see this or that," some local attraction, and I'd say, "Gosh, I should go see that." Because you can always go see it and do it at some point, so why tomorrow? Tomorrow, I'd rather work on this project. I was a little too nerdy, perhaps, in the beginning. But I love being in California. The combination of not just the weather, but the intellectual environment at Caltech, the overall open and friendly culture in California, and the beautiful coast. What's not to like? It's very cosmopolitan. A lot of people in California came from somewhere else, and not just in the US, but elsewhere in the world. Look at the Asian populations we have, the different cultures, cuisines, and everything. You just don't see that on the East Coast, which I found to be very closed and parochial, thinking very highly of themselves but not noticing that a lot of action has moved further west.
Joining the Caltech Faculty
ZIERLER: Did joining the faculty change your research at all? Did you come in with an idea of what you wanted to do, and that changed as a result of what was going on at Caltech?
DJORGOVSKI: No, I naturally just continued working on projects I had in hand at the time, and then new projects would come in. What influenced my change in moving into sky surveys and all that was this: when I arrived, they were starting the second Palomar Sky Survey using photographic plates, trying to repeat the great success of the first Palomar Sky Survey from the 1950s, which was the first time a really good road map for the northern sky was produced. At face value, it's a good idea, but I didn't think doing it photographically was very smart, because it was already clear that CCDs, charge-coupled devices, were getting bigger and better. And a former Caltech professor, Jim Gunn, who moved to Princeton, was building a camera for the Sloan Digital Sky Survey, which involved a bunch of large CCDs to do a sky survey using these new detectors.
In my mind, that was clearly the right way to go because already, we were analyzing all of our data on computers, so you had to have data in computer form. Why not generate it in a digital form? Well, all right, they were taking photographic plates, the project was well underway, and I thought the only sensible thing to do about it, because there was a lot of excellent information in these photographic images, was to digitize them.
ZIERLER: To clarify, when you say you were dismayed at the prospect of letting all this data go to waste, is that to say that it would disappear into the ether, or in its analog form, it simply would not be accessible because there was just too much there?
DJORGOVSKI: Well, it would not be accessible because just looking at pictures with a magnifier doesn't get you very much. To really do science, you needed to convert it into digital quantitative images for a proper analysis. In the olden days, before we could create large digital databases, there was no choice. You had to really look at them, and it was done by people looking with magnifiers for clusters of galaxies or whatever, but that was what you could do back then. We were then on the cusp of a technological transition, where we could move from the old style of eyeballing photographic plate surveys into the fully digital, much more quantitative, much more data-rich and information-rich datasets, with which we can do other stuff that simply wasn't possible before. That was roughly a decade, maybe even less, where those digital plate surveys were really useful. Now, they've been completely replaced with the surveys done with CCDs, first the Sloan, then many others. We've done digital sky surveys from Palomar as well. After the last plates were taken, CCD cameras were developed for the same telescope. Now, the Zwicky Transient Facility is a prime example of how things are and should be done. This was all just riding on the exponentially growing and improving technology described by Moore's Law, and based on the VLSI integration that Carver Mead and others pioneered earlier.
ZIERLER: I wonder if you can explain just how revolutionary VLSI was for astronomy. What exactly did Carver contribute?
DJORGOVSKI: The same technology that produces CPUs and other chips for computers is the technology that produces these detectors, charge-coupled devices. Once people realized they weren't just memory storage devices but were also light sensitive, that we could make imaging detectors out of them, that really changed things. In fact, the inventors of CCDs were given a Nobel Prize in Physics for that. Astronomers, of course, instantly recognized that this would be the detector we wanted. The size of the detector tells you how much sky you can cover with it. The big prize was making them as big as possible, to capture as much of the focal plane of the telescope as possible, because telescopes are expensive, and computer chips are relatively cheap. It was obvious that we had some telescopes already, we were building new ones, and the optics tell you just how much of the focal plane you have.
Photons came in whether or not you captured them, so you wanted to have the largest digital detectors possible, which is where the whole VLSI technology comes in. The first CCD chips were a half inch or a quarter of an inch in size, or so. Now, they're more like two or three inches in size. And the square of that is what matters. But it's also the quality of the devices. The first devices had a lot of flaws in them; not all of the pixels were very good. In fact, the yield of usable devices was very small. But the advances in photolithographic techniques for producing VLSI circuits for computers address exactly the same demands. You want as many flawless chips as you can get, you want to make them as big as possible in order to have more transistors on them, and so on. Essentially, they were parallel demands requiring the same type of hardware development.
ZIERLER: Just as a thought experiment, with the imperative to digitize the second Palomar Sky Survey, do you think that was central to ensuring Palomar would remain relevant in astronomy?
DJORGOVSKI: No, the relevance of Palomar was largely driven by the 200-inch. The first Palomar Sky Survey, as I said, was really a milestone sky-mapping exercise, and it continued to be useful. But most of the interesting science was coming from the 200-inch. As a survey telescope, the Oschin Schmidt Telescope is now adding to the relevance of Palomar. After we built the Keck telescopes, the 200-inch, while still useful and productive, was not as important anymore. It was competitive with the four-meter-class telescopes built by various national and international observatories, but in the era of eight- and ten-meter telescopes, it plays a supporting role, by and large. All the action was with the biggest telescopes.
ZIERLER: Did you see these developments as happening side-by-side with new capabilities in machine learning? In other words, the imperative to digitize the second sky survey at Palomar, was that only possible because of advances in computation and machine learning?
DJORGOVSKI: No, but the analysis of the data was totally dependent on those advances. At first, we used machine learning to basically outsource repetitive, menial tasks to a machine, like separating images of stars and images of galaxies, which is a nontrivial but not all that difficult task. But as the data extracted from sky surveys got to be more complex, and more information-rich, we had to start using machine learning to actually discover interesting things in these data, not just for pipeline processing, but to actually make scientific discoveries. And that's really where things are today. We still use machine learning in part of the pipeline from raw images to articulated catalogs, where machine learning is telling us, "These are stars, these are galaxies, these are likely quasars, this is a supernova of this kind or that kind," and so on. But also, to do science that is inherently almost impossible to do without using machine learning or artificial intelligence, and that's where things get really interesting.
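A minimal sketch of the kind of supervised classification described here, star-galaxy separation from catalog features, is shown below using a random forest on synthetic data. The features, labels, and numbers are invented; real survey pipelines use many more measured parameters and carefully curated training sets.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic catalog features (magnitude, concentration, ellipticity) for a toy
# star/galaxy separation problem. Everything here is invented for the demo.
rng = np.random.default_rng(3)
n = 5000
is_galaxy = rng.integers(0, 2, n)
magnitude = rng.normal(20, 1.5, n)
concentration = np.where(is_galaxy, rng.normal(3.0, 0.5, n), rng.normal(2.0, 0.3, n))
ellipticity = np.where(is_galaxy, rng.uniform(0, 0.6, n), rng.uniform(0, 0.1, n))
X = np.column_stack([magnitude, concentration, ellipticity])

X_train, X_test, y_train, y_test = train_test_split(X, is_galaxy, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")
```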
The role of machine learning in astronomical surveys, and probably elsewhere as well, changed from automating something that humans can do but machines can be taught to do as well, to more and more complex requirements and tasks. Now, we are moving towards genuine human-AI collaborative discovery. We've seen probably the first example of that in biology with AlphaFold and the solution of the protein-folding problem. I'd say that is probably the first one of more interesting things to come.
We've been playing with other techniques, with something called symbolic regression, which can discover analytical formulas that describe something in data that may reflect the laws of nature or a phenomenology that derives from the laws of nature, but that we didn't notice before. It's still in the sense of machines discovering a pattern, and humans interpreting it. I think that we'll slowly move more and more to where more nontrivial tasks will be accomplished by machine intelligence, but it will be a truly symbiotic relationship, a collaboration between the carbon-based computers in our heads and silicon-based computers that do machine learning.
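To illustrate what symbolic regression does, the sketch below uses the gplearn library, one possible open-source tool and an assumption here rather than necessarily what Djorgovski's group uses, to recover an invented analytic relation from sampled data.

```python
import numpy as np
from gplearn.genetic import SymbolicRegressor  # one possible tool; an assumption,
                                               # not necessarily the group's choice

# Toy sketch of symbolic regression: sample a hidden relation and let the
# machine search for an analytic formula. The "law" is invented for the demo:
#   y = x0**2 + 2*x1
rng = np.random.default_rng(4)
X = rng.uniform(-2, 2, size=(300, 2))
y = X[:, 0] ** 2 + 2 * X[:, 1]

model = SymbolicRegressor(population_size=2000, generations=20,
                          function_set=("add", "sub", "mul"),
                          random_state=0)
model.fit(X, y)
print(model._program)  # prints the discovered expression, ideally equivalent to x0*x0 + x1 + x1
```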
ZIERLER: In order to understand these historical developments, we need to take a tour of some of the advances in technology and computation that allowed for the rise of astroinformatics. Let's start on the instrumentation side. What were some advances in instrumentation that portended this massive amount of data that astronomy was creating?
DJORGOVSKI: We could already generate massive amounts of data, even with the old analog detectors, like photographic plates, after they were digitized, which required computer technology. In radio astronomy, they went from analog to digital receivers. First, you collect lots and lots of bits, then you have to transform them, and you save only a tiny fraction of them. That process was going on through the 1980s. Sometime in the early 1980s, imaging detectors started coming in. Actually, imaging detectors started with TV-type cameras using vacuum tubes and things like that back in the 1960s, but those are awkward, there's a high voltage involved, they don't have very good noise properties, and so on.
The moment it became possible to use solid-state detectors like charge-coupled devices (CCDs), astronomers eagerly switched to those. The first CCDs were not nearly as nice as those we have today. They required a lot of care and feeding, and laborious calibration and extraction of data. But they were much more sensitive, they had a much larger dynamic range between, say, the faintest and the brightest objects that you can detect, and they also had much better noise properties than analog detectors like photographic plates. In a digital detector, you have pixels, and however many photons fall in a pixel, that's what you count; then you get shot noise. But with a photographic plate, you also get extra scattering through the grains and the emulsion, so the type of noise you get in the output is not as benign and easy to deal with as it is with a purely digital detector.
All of this computer and information technology really rode on Moore's Law and very-large-scale integration, in which Carver Mead has played an absolutely crucial role. You want the largest detectors you can possibly have, just like people building CPUs and related computing devices want the largest, flawless silicon chips that can be made. The demands for computation and for the use of these devices as detectors are pretty much the same. You want purity, a small number of defects, a high production rate, and so on. The quality and the size of the detectors kept going up, and they still are. And the prices went down, of course. It is exactly how Moore's Law works. The goal was always to have enough CCDs, or big enough CCDs, to pave the focal planes of the telescopes we use, and the sizes of the focal planes, which are really the regions in which images are of high quality and in good focus, were pretty much set by the photographic plates, because those were the detectors when the telescopes were built. They were not smaller than an inch; they were several inches to several tens of inches. The Palomar Schmidt photographic plates are 14 inches square. You knew you got good images over that much area in the focal plane of that telescope, and ideally, you'd like to catch every photon that falls on it. At first, we simply couldn't do that.
There were no devices, or they were too expensive, and computing the data stream was out of reach. Even at Palomar, the photographic plates were replaced first with a small number of CCDs, then bigger and bigger ones, until the ZTF camera, which covers almost the entire focal plane of the 48-inch telescope. That's been the general trend, not just for us, but for everybody else. This was kind of an obvious thing to do, so astronomy worldwide was all moving in the same direction. You want to gather all of the light your telescopes can provide, and not just visible light; the same goes for radio telescopes and so on. And you want to be able to process the data. That's how progress in computing and information technology was directly driving progress in observational astronomy.
Digitization Comes to Observational Astronomy
ZIERLER: Do you have a sense of the history of the drive to digitize the photographic plates? Was that an individual? Was it happening at a specific institution?
DJORGOVSKI: No, I don't think so. I think that the digitization of photographic films was first done for the military or intelligence, which had the first surveying satellites. They were actually taking pictures on photographic film and dropping them down to the Earth. I think that's where it began first. But everybody knew, as computers first started showing up, that we'd like to digitize astronomical photographs. When I was a grad student at UC Berkeley, there was a plate scanner. It couldn't do a whole Schmidt plate, but it could digitize small pieces of photographic plates taken on big telescopes, and I've done a lot of that. I said small pieces not because the machine was limited, even though it wasn't all that great, but because we couldn't process the images that were larger than a certain size. The computer memory was really the bottleneck then. Of course, as computers and memory improved, we could analyze larger and larger images, store the data, and analyze it. Really, our ability to create data in a digital form and analyze it was driven directly by the progress in computing technology.
ZIERLER: There were no sky surveys in the early 1980s, this is a later development?
DJORGOVSKI: No, there have been sky surveys since the Herschel family and friends, or even earlier, just by the naked eye. But the first large digital sky surveys came in the early 1990s, and those were from digitizing previous photographic sky surveys, which was the imaging technique of astronomy until digital detectors came about. Surveys themselves have been growing in importance and scientific power for many years, even decades. Usually, people would do surveys when something was completely unknown. The Herschels were doing their sky surveys, making the first galaxy catalogs, although they didn't know what the galaxies were. And there have been star catalogs since Hipparchus and Ptolemy. It's a basic scientific approach to first map out the observed phenomenology.
Then, in the 20th century, as new wavelength regimes were opening up, say, radio astronomy, then later X-ray astronomy, far infrared, gamma ray, and so on, the obvious first goal was to find out what's out there in the sky in this previously unexplored regime. Another aspect of surveys, at least for optical astronomy, which was still leading the way at the time, was that sky surveys would serve as a road map for finding potentially interesting things to be studied in more detail with bigger telescopes. For most of the history of astronomy, people were dealing with so-called pointed observations of their favorite objects or samples, typically consisting of tens of galaxies or stars, or maybe hundreds. But with sky surveys, you could find rare, unusual types of objects, or even kinds you didn't know existed, and then observe them with large telescopes to learn more about them.
That was the logic behind the Schmidt telescope at Palomar, later called the Samuel Oschin Telescope, which was used to conduct the first Palomar Sky Survey. Even before then, in the 1930s, Fritz Zwicky built the first telescope at Mount Palomar, the 18-inch-diameter Schmidt camera. Schmidt refers to a particular optical design that gives you a wide field of view. With these wide-field cameras, Fritz Zwicky was observing nearby galaxies, looking for supernovae, which were a new thing back then, for his work with Walter Baade. He found, I don't know, several tens or a hundred in his lifetime. We now find that many and more every night. The same telescope was later used to look for Earth-crossing asteroids.
The 200-inch telescope was conceived and started around the same time, but it was only finished in 1948 because World War II intervened. Caltech astronomers knew that with a big telescope, like the 200-inch, you only have a small field of view – it was smaller than the size of the full moon. The idea was to map the sky visible from Palomar, to be able to look for interesting targets to study with the large telescope. That model has really become very powerful in modern times. As new sky surveys became more data-rich and of higher data quality, and as data-analysis techniques to do proper data mining matured, this became the established mode for much of astronomical research. Sky surveys produce data that are interesting on their own, but can also be used to find optimal targets to be observed with big telescopes. The time on big telescopes is very precious and highly competitive, so you want to use that resource on an optimal selection of interesting scientific targets, which are now largely provided by sky surveys. That's been an evolving story for a while. It really became very interesting once the digital sky surveys came about, with their ever-increasing information content and quality, which really happened in the 1990s.
ZIERLER: We touched on it, but tell me the origin story of how the digital version of the second Palomar Sky Survey came about. Just as a matter of chronology, when did you and your colleagues start thinking about doing this survey?
DJORGOVSKI: Well, the photographic survey, the second Palomar Sky Survey, was initiated before I arrived at Caltech, but just shortly before then. That would be the mid-1980s to late 1980s. I don't know when they actually started taking the photographic plates, but Jean Mueller would know all about that part. Also, Robert Brucato, who's a retired deputy director of Palomar, would be a good source. Anyway, when I arrived and found out they were doing this, one thought I had was that it's good to do surveys, but not photographic ones. The old Palomar Sky Survey, the first one from the 1950s, was definitely getting a little dated. But on the other hand, the era of astronomical photography was clearly over. They knew that digital detectors were coming, but CCDs were still small, so we couldn't put lots of them in the focal plane of the Palomar Schmidt telescope, but that was just a matter of time. In fact, around the same time, planning for the Sloan Digital Sky Survey started with Jim Gunn in Princeton, and they were going to build a CCD camera to do a survey themselves.
Their motivation was a little different. Whereas at Palomar, the idea was to just basically repeat the glory of the first Palomar Sky Survey, create the photographic map of the entire Northern sky for whatever scientific purposes people wanted to use it for, Sloan began purely with the idea of mapping the large-scale structure of the nearby universe by measuring redshifts of many, many galaxies. Redshift surveys were a big fad back then. But in order to do that, they wanted a really good quality catalog of galaxies to be observed, so they decided to do an imaging survey as well. Probably the imaging survey turned out to be more scientifically useful in the long run, but both produced great science. At the time, the Sloan team was just building their hardware. Sloan was not on the sky for maybe another 10 years.
Anyway, I found out there was this whole new sky survey, and one could do a lot of science with it. But in order to do science–we're now talking late 1980s, early 1990s–you had to have data in a digital form. I just couldn't bear to see this opportunity go away, so we started talking about how to digitize the plates. There were already some precedents. Palomar had already done the so-called Quick V (for visual band) Survey for the Space Telescope Science Institute. The Space Telescope Science Institute in Baltimore had plate-scanning machinery, which they originally bought as a commercial machine and then greatly improved. Those scanners were probably the best in the business. They did that to create a guide star catalog for pointing the Hubble. But, as the name implies, Quick V wasn't going very deep. It was fine for the early days of the Space Telescope, but they wanted to have a somewhat deeper survey.
At around the same time, there were also surveys of the Southern hemisphere done by the European Southern Observatory and the Anglo-Australian Observatory. I think those were already done at that time, or maybe they were just in the process of completion. In any case, the Palomar one was to cover the Northern sky, and the Northern sky only, because the Europeans and Australians were doing the Southern sky with comparable quality material. The first Palomar Sky Survey covered two-thirds of the sky, as far south as we can point from Palomar. But then, the Southern portions were really superseded by the Australian and European efforts. The first question for us was to decide who should digitize it. There was a parallel project with the US Naval Observatory, and they wanted to digitize every sky survey plate Palomar had ever taken in order to obtain a new astrometric catalog of stars in the sky, and measure the positions and motions of the stars as precisely as they could.
The way they were going to do this was optimized for their own purpose, but the scanner they built was not producing material that would be useful for much other science. They did digitize every sky survey plate there was over the course of years, and they produced exactly what they wanted to produce, a catalog of a billion or two stars with their precise positions and proper motions, because the survey plates spanned decades, and stars move. That was a very useful data product until the Gaia satellite came about. Gaia, of course, has completely revolutionized that field, but that's a whole other story. Anyhow, that was a good use of the photographic sky survey plates taken at Palomar, both the first and the second surveys, and anything in between, including a lot of special projects that were done as well.
ZIERLER: You mentioned that the Palomar survey came about actually before you came to Caltech. Was that one of the many reasons that attracted you to Caltech? Did you assume that you'd become heavily involved?
DJORGOVSKI: No, not at all. I was a big telescope guy at the time, working on the very distant universe, star clusters, and elliptical galaxies. It was all digital detectors for me. I wasn't really thinking much in terms of surveys at all. But I quickly figured out this would be an interesting dataset to have to support my science and everybody else's science. The first task was to figure out whom we wanted to partner with to scan these plates for our purposes. We even briefly considered building a plate-scanner machine ourselves, which thank God we didn't do, because that would've been a nightmare.
ZIERLER: I wonder if you can explain why that would've been such a big job. What does the machine look like?
DJORGOVSKI: Those scanners are a couple of yards across, and they have a stage on which you mount a plate, which has to be moved very precisely. A light beam is focused to pass through a spot that's about 15 microns on a side, with a detector on the other side. They measure the plate transparency from the light that goes through, and that's converted into a digital image. The plate has to be moved very precisely, in a raster scan. I don't know how long it takes, my guess is maybe an hour or two to do one of those. In the case of the Digital Palomar Sky Survey, one of those plates, which is physically about 14 inches on a side, spans an area of the sky that corresponds to about six and a half degrees on a side, which is about 13 times the diameter of the full moon, so the square of that in area.
That, with a sampling of 15 microns, produces an image of about one gigabyte in size, which was very large at the time. In fact, we couldn't process those images one by one, we had to cut them up into many smaller sub-images, process those, then stitch them back together in the catalog domain. Nowadays, this would be no big deal at all, but back then, it was really pushing the technology. Anyway, to make a machine that could do this precisely and consistently, with a very stable measurement and all that, and then to process those scans, is a really big deal. It was clear to me quickly that we didn't want to do that, because there were already people with the right machinery and the right knowledge in processing who would do it at no cost to us, including the Space Telescope Science Institute. There were a couple of other relevant groups in the world. There was scanning machinery at the Royal Observatory in Edinburgh in the UK that did the Southern sky surveys, and there was another group at the University of Minnesota with an ancient machine that was built for a very special purpose.
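The arithmetic behind that one-gigabyte figure is easy to reconstruct; a rough back-of-the-envelope sketch, assuming 16-bit pixels (an assumption for illustration, not a documented DPOSS parameter):

    # Rough size of one digitized 14-inch Schmidt plate, sampled at 15 microns.
    plate_inches = 14
    microns_per_inch = 25_400
    pixel_microns = 15
    bytes_per_pixel = 2                                                  # assuming 16 bits per pixel

    pixels_per_side = plate_inches * microns_per_inch / pixel_microns   # ~23,700
    total_pixels = pixels_per_side ** 2                                  # ~5.6e8
    print(total_pixels * bytes_per_pixel / 1e9)                          # ~1.1 gigabytes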
It was very quickly clear to us that the Space Telescope Science Institute was the obvious partner for us. They liked it, they wanted to scan the new sky survey, too, for their own purpose, the new Space Telescope guide star catalog. The people with whom I worked there were a pleasure to work with. Mike Shara was one of them, also Barry Lasker, and there were a few others. And it was a really wonderful, friendly collaboration. We were all on the same page, we agreed on everything. The director of the Space Telescope Science Institute at the time was Riccardo Giacconi, and he was a big personality. I liked him. He got a Nobel Prize for his pioneering work in X-ray astronomy, including the discovery of the first cosmic X-ray sources. The director of Palomar was Gerry Neugebauer, who also had a big personality. Fortunately, we managed to kind of ignore the two directors, just give them the good news, and make everything work.
That was how it all started. I just couldn't bear to see the richness of this data go to waste, which is what would have happened in my mind unless we converted it into a digital form. We did a lot of science with it, then the Sloan Digital Sky Survey finally geared up, and they started obtaining higher quality multi-color images of a portion of the sky, not the entire Northern sky, and they started creating a lot of excellent science. I was not sorry to see that. I thought that was exactly what had to happen. I always thought these photographic sky surveys were a thing of the past, and that their digital version was just an interim step as we moved to the fully digital, higher quality, higher information content datasets.
The Palomar Sky Survey
ZIERLER: Some questions as they relate to funding and the second sky survey at Palomar. Specifically, with the partnership with Space Telescope, was NASA a viable partner? Were they under consideration? I understand the binary, that NSF generally funds ground-based observing projects, and NASA, of course, is in space. But with the institutional connections with Space Telescope, did NASA ever get involved in sky surveys?
DJORGOVSKI: Well, they did, but not because of this. I believe that both the first and the second Palomar Sky Surveys, certainly the second one, had some funding from NASA, because they understood that this was a legacy dataset on which you could then build and interpret space-based observations. This is why the Space Telescope Science Institute was scanning the plates to begin with. I think that was already well understood. The fact that we collaborated so closely with the Space Telescope Science Institute didn't hurt. We did get some funding from NASA to do what we'd now call data science applications for the sky surveys and other things, as regular peer-reviewed projects. The story of NASA in this is certainly a very interesting and good one. There was also a very good understanding between NASA and the NSF as we moved into the whole virtual observatory framework; they understood that everybody would benefit from this and that it really had to be done. But that's a whole other story in and of itself.
ZIERLER: What about the institutional connections with JPL? Would that have made NASA more involved than it otherwise might have been?
DJORGOVSKI: There wasn't an obvious enhancement of the funding stream if that's what you're asking. We write proposals like everybody else, and they're judged on their scientific merits. As time went on, we were doing more and more science directly based on the processing and exploitation of sky surveys.
ZIERLER: What about in the world of private benefactors? For example, I understand the Norris Foundation supported the Palomar survey.
DJORGOVSKI: Yes, and the Oschin Foundation as well. They made a grant to Palomar, which was used to do some necessary refurbishments of the old 48-inch Schmidt telescope, which we now call the Samuel Oschin Telescope. But the telescope was already there. They just needed to improve the optics, and the mechanics, and stuff like that. That's why it got renamed after Sam Oschin. Norris Foundation supported several different projects. I think mostly, it was for the Millimeter Observatory at OVRO, and then CARMA. They gave us a modest grant to do the cataloging of the sky, and we did the job and made the data public for everybody to enjoy. But most of our work on DPOSS and its processing and the science it enabled was done through a number of separate grants, most of them from the NSF, some from NASA, and some private.
DJORGOVSKI: One problem I had to deal with, once we decided that they would produce the scans, was where to store all this information. We're talking about the early 1990s. With three filters and a thousand fields of the northern sky, 3,000 gigabytes, or three terabytes, was just an unheard-of dataset. For computer storage back then, there were still floppies and nine-track tapes, which would store megabytes, but not gigabytes or terabytes. The first homework I had to do was figuring out the right storage technology. We ended up with these little cassette tapes called Exabyte tapes, and we went through thousands of those. We had roomfuls of them. Then, we had to develop a software pipeline, because the Space Telescope pipeline was optimized for their own purpose, making the guide star catalog. What we wanted to do was essentially what Sloan wanted to do later on with superior digital material: make measurements of all objects detected in the sky, measure a whole bunch of their properties, match them together, and make sure the photometry was calibrated right, that we knew exactly what the brightness was, even though each plate was a little different.
That required a special calibration effort, which was done with the Palomar 60-inch telescope, which at that time was just used for small student projects. This was before time-domain astronomy became a major driver. I had a brilliant graduate student at the time, Nick Weir, who was very much a computer whiz, and might've been the best student I've ever had, actually. To show that he was actually smarter than all the rest of us, he went to work for Goldman Sachs straight out of graduate school and never looked back. More great graduate and undergraduate students and postdocs got involved later on, including Julia Kennefick, Roy Gal, Reinaldo de Carvalho, Steve Odewahn, Robert Brunner, Ashish Mahabal, Sandra Castro, and many others. We also had a number of excellent Caltech undergraduates working on DPOSS, and many of them later went on to illustrious scientific careers of their own.
That was basically what I really wanted, to have a legacy dataset that would be as useful as the original first Palomar Sky Survey, but in the digital era. We did a number of projects with it. In some sense, it was the project on which we learned how to work with big data. We were one of the first groups that used proper databases in astronomy, one of the first to use machine learning, and maybe the first to use it in kind of an industrial fashion. We did that for the problem of star-galaxy separation. The images of stars and galaxies are a little different, galaxies are a little fuzzier, but for most scientific projects, you need either stars or objects that look like stars, such as quasars, or galaxies. If you study, say, the structure of the Milky Way, you need stars. If you want to look for quasars, you need star-like objects. If you want to study, say, the large-scale structure of the universe, you need galaxies.
That was a big problem. Usually, a human can look at a plate under magnification and say, "It looks a little fuzzy. It's probably a galaxy." But it's a little bit subjective, and you cannot do this for something like a billion or so objects. It was clear to us that it had to be automated and made uniform and objective, which is where machine learning comes in. Once you train machine learning tools, things like neural networks and decision trees, on the right kind of dataset–and we had plenty, of course–they will do the exact same thing without getting tired, over the entire Northern sky. That might've been the first really major use of machine learning for astronomical data analysis.
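For a sense of what such a classifier looks like in practice, here is a minimal sketch in the spirit of the approach described above, using a modern random forest (a descendant of the decision trees mentioned here). The features, labels, and training data are hypothetical stand-ins, not the original SKICAT pipeline:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Hypothetical training set: one row per detected object, with a few
    # morphological features (e.g., magnitude, image sharpness, ellipticity,
    # concentration), and a label from an expert or from deeper CCD images:
    # 0 = star, 1 = galaxy.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 4))                    # stand-in for real measured features
    y = (X[:, 1] + 0.5 * X[:, 3] > 0).astype(int)     # stand-in for expert labels

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X_train, y_train)

    # Once trained, the classifier is applied uniformly to the whole catalog,
    # returning a star/galaxy probability for every detection.
    print("held-out accuracy:", clf.score(X_test, y_test))
    p_galaxy = clf.predict_proba(X_test)[:, 1]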
I went to my computer science colleagues at Caltech, and they were just not interested. For them, this was a trivial project. They were more theoretically inclined. Then, I went to JPL because they hire computer scientists to solve real-world problems, not prove theorems. I established my first connection with a colleague, Richard Doyle, who was spearheading a lot of machine learning at JPL at the time. It was an uphill battle for them, too, because back then, JPL was all about the spaceflight hardware. The whole idea of the importance of software and machine learning hadn't quite sunk in yet. It did later, but Rich had his own hands full trying to change the culture of the place. That established a collaboration, which, in one form or another, has continued until the present day. Richard just retired recently, and I've worked with many other people who originally worked for him or in their own groups at JPL. One of the people he hired was a freshly minted Ph.D. from Michigan named Usama Fayyad, who was a machine learning guy, very capable, very talented. He made a big name for himself later on. He eventually left JPL, worked for Microsoft Research in a big capacity, started multiple companies, and held multiple advisory jobs. He's now at, I think, Northeastern University. He's quite famous in those circles.
We got him straight out of grad school. He helped us with the particular task of object classification. What I mean by this, is you have pictures, and most things are very faint, little tiny dots and blobs. It's not like the glamorous-looking big galaxies that you see in popular pictures. The vast majority of objects detected are very faint. Some of them are stars or things that look like stars, like quasars, or galaxies. When they're relatively bright, you can tell stars from galaxies. Galaxies are a little fuzzy, and stars are sharp. When it gets to be really faint, it's hard to tell. A well-trained expert can do this just by eyeballing pictures and giving a probable answer. But we're talking about two billion detections. Clearly, this was not going to be done by hand, so to speak.
Partnering with JPL on Machine Learning
ZIERLER: A specific question. When you started to get involved, you appreciated the advances in machine learning at JPL. The Sky Image Cataloging and Analysis Tool, or SKICAT, was that already in progress, or did that come about as a result of this partnership?
DJORGOVSKI: The latter. Doyle and Fayyad certainly got excited, and there was also a code developer working for them, Joe Roden, who was very deeply involved in this. They recognized that this was an excellent application of machine learning in and of itself. In fact, our work with them was the model or precursor of our subsequent collaborations–not just for our group, but also for other people in big-data astronomy–for how to work with computer scientists. We recognized that this was a genuine intellectual scientific partnership: we gave them interesting challenges on which to sharpen and develop their own machine learning tools, and we then benefited from those tools in doing our science. Those people are just as good scientists and just as interested in science as we are. They are not programmers to be hired to do your bidding. I've seen a lot of non-CS people who suddenly decided they had big-data problems, and who were thinking of computer scientists essentially as code developers, and that's not how it works. But in our case, it was a very fruitful, mutually respectful collaboration from the get-go. In fact, I would say that SKICAT was what really launched Fayyad's career, and he's done extremely well since then.
There were other aspects of the project that clearly demanded a certain amount of automation, processing, or pipelines. Back then, astronomy didn't have sufficient amounts of data to require anything special. People would kind of just deal with images one-by-one on their desktops. Images might have been a couple of Megabytes in size, and astronomical software for processing the images was geared toward that kind of data. And there weren't very many of those images either, so you could just do them by hand, so to speak, individually. Once we started producing this pipeline, it generated tens of millions of images, so we had to change the way we did it. We went into the databases first, which was essential. Until then, people just had a directory structure, folders, subfolders, and so on. But that doesn't work when you have tens of millions of files.
We were one of the first groups in astronomy to use proper databases. Independently, other people doing surveys came to the same conclusion. Certainly, NASA data centers, like IPAC, had to go in that direction from the get-go because they had to have high-quality, completely reliable systems to serve data to the community. The Sloan Digital Sky Survey did as well. The person who led that development was my good friend Alex Szalay, a professor at Johns Hopkins University, and he plays a big role in this story later on. There were two things we had to do. First, we had to do data farming, if you will. What do you do with all this data? How do you access and organize it? Then, you need to do data mining and figure out how to extract knowledge from the data.
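To make the "data farming" point concrete, the shift was from directories full of files to a catalog you can query. A toy sketch using Python's built-in sqlite3 module, purely illustrative; the actual survey databases were far larger and more capable, and the column names and cuts here are made up:

    import sqlite3

    # A toy catalog database: one row per detected source (illustrative only).
    con = sqlite3.connect("catalog.db")
    con.execute("""
        CREATE TABLE IF NOT EXISTS sources (
            id INTEGER PRIMARY KEY,
            ra REAL,            -- right ascension, degrees
            dec REAL,           -- declination, degrees
            mag REAL,           -- brightness
            p_galaxy REAL       -- star/galaxy classifier output
        )
    """)
    con.execute("CREATE INDEX IF NOT EXISTS idx_pos ON sources (ra, dec)")

    # Instead of walking directories of files, you ask the catalog questions:
    # e.g., all faint, galaxy-like sources in a small patch of sky.
    rows = con.execute("""
        SELECT id, ra, dec, mag FROM sources
        WHERE ra BETWEEN 150.0 AND 150.5
          AND dec BETWEEN 2.0 AND 2.5
          AND mag > 19.0 AND p_galaxy > 0.9
    """).fetchall()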
The first use of machine learning was to automate the process of classifying each object image, whether it is a star or a galaxy, or to give some probability that it's one or the other. That was the first task we did, and we used the machine learning tools available then: neural networks, which later evolved into deep learning methods, and decision trees, which later became methods called random forests. Back then, this was still relatively simple stuff. We essentially outsourced a tedious, repetitive job that also had to be done uniformly and objectively over the entire dataset to machine learning tools. That was some of the first use of machine learning in astronomy. We generated catalogs and calibrated them. With datasets like that, the work is never done, and at some point, you just have to say it is good enough and move on. The same experience has been repeated in every sky survey since.
ZIERLER: I wanted to ask specifically on that, to foreshadow where this field is going, why do you see this framework as sort of the originating point of machine learning for sky surveys?
DJORGOVSKI: Machine learning is a well-established field of its own, and sky surveys were evolving constantly to measure more quantities for detected objects, stars, galaxies, and so on. Most astronomers weren't really thinking in terms of multidimensional data spaces. Until we actually had to process data from the first digital sky surveys, most astronomers didn't really think about machine learning applications in astronomy either. But the connection was very natural once it became obvious that was what we had to deal with.
ZIERLER: Is that to say that if the machine learning gets better, you can go back to older datasets and mine interesting things that you may have missed the first time around?
DJORGOVSKI: At first, we didn't really use machine learning to find interesting things, it was just to pipeline-process everything. Then, we had to figure out ways to find interesting things in the catalogs that we generated from the images. The steps are, you have a photographic image, then you digitize it, then you process the digital image to turn it into a catalog of all the sources detected. And for each one of them, you measure a whole bunch of different properties: brightness in different apertures, shape, size, and stuff like that. It was a few tens of different numbers we measured for every single detected object. The mathematical representation of data like this is what they call feature vectors. Each quantity you measure is one new dimension of a data space. And that data space, in our case, may have had 30 dimensions or something like that.
Each object is represented as a vector in the data space. That's a standard computer science approach. Because the dimensionality was high, it was nontrivial to decide which of the measurements, combined in which way, was really giving you the most useful answer, and that's where machine learning came in. And because there were a couple of billion feature vectors, we needed to automate it. Figure out the optimal algorithm, then let it rip through all of the data. We did that, calibrated it with lots and lots of CCD digital images taken at the 60-inch telescope, then we started finding interesting things to follow up on.
One of the first projects we did was to look for very distant quasars at redshifts three and four, when the universe was about 20% of its present size. At that time, that was pretty much the high-redshift frontier, with the most distant objects known. We can use them to probe the distant universe, learning not just how quasars evolve, but also using them as probes of the universe between us and them, through the absorption by intergalactic clouds. That was a big industry at Palomar, too. Maarten Schmidt and his collaborators were also looking for these distant quasars using different equipment and different approaches. At the time we started, there may have been, like, a dozen of them. We approached this in a systematic fashion, by devising a method to find quasars. Roughly, for every million stars, there's one of them, so you have to figure out how to find that one in a million and do this over the entire sky-survey region. This was the Ph.D. thesis of my student Julia Kennefick, who's now a professor at the University of Arkansas. We ended up with several hundred of them, and our search was an order of magnitude more efficient than what was then the state of the art.
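Schematically, that kind of selection amounts to picking star-like objects with unusual colors in the catalog's feature space. A sketch with made-up column names, magnitudes, and thresholds, which are assumptions for illustration and not the actual DPOSS selection criteria:

    import numpy as np

    # Hypothetical catalog columns: calibrated magnitudes in three bands
    # (called g, r, i here for concreteness) plus a star/galaxy classification.
    catalog = {
        "g": np.array([19.2, 20.1, 18.7, 21.0]),
        "r": np.array([18.0, 19.9, 18.5, 19.1]),
        "i": np.array([17.8, 19.8, 18.4, 18.9]),
        "is_starlike": np.array([True, True, False, True]),
    }

    g_r = catalog["g"] - catalog["r"]
    r_i = catalog["r"] - catalog["i"]

    # High-redshift quasars drop out of the bluest band, so they sit in an
    # unusual corner of color space: very red in g-r, but not in r-i.
    candidates = catalog["is_starlike"] & (g_r > 1.0) & (r_i < 0.5)
    print(np.nonzero(candidates)[0])   # indices of candidate quasars for follow-up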
We were surpassed by Sloan Digital Sky Survey later, which found thousands, not hundreds, of such high redshift quasars, which is exactly how things should be. Equipment gets better, science moves forward. That was just one example. We also did projects to find clusters of galaxies. Back in the 1950s, one of the first scientific projects that was done with the first Palomar Sky Survey was to come up with a catalog of clusters of galaxies, and that was done by the then-graduate student, George Abell, who was later a professor at UCLA. He created the first proper galaxy cluster catalog, called the Abell catalog, which was a mainstay of all galaxy cluster astronomy, although much better catalogs exist today. Fritz Zwicky actually made the first cluster catalog, but his was not nearly as well-defined as George Abell's. Abell did that by looking at photographic plates with a magnifying glass and counting galaxies by hand. He came up with a catalog of a couple thousand clusters, and later on, with some collaborators, they did the southern sky. It was about 4,000 clusters or so, all told.
One of my graduate students, Roy Gal, who's now a professor in Hawaii, did a modern equivalent of the Abell catalog by designing an automated, objective, repeatable way of defining clusters, rather than relying on a subjective judgment of whether a collection of galaxies is or isn't a cluster. He came up with a catalog of about 20,000 clusters in the Northern sky alone, and similar projects are now being done using different machine learning approaches applied to more modern catalogs, like Pan-STARRS, the Dark Energy Survey, and so on. This did enable us, and others, to do a lot of interesting science, but the information content of the data was still relatively limited, and the quality of these data was not as good as modern digital images.
There are several reasons for this. One is that modern imagers, like CCDs, are vastly more sensitive. What we call the quantum efficiency of an astronomical photographic plate is maybe a couple of percent, meaning 1 in 50 or 100 photons actually gets absorbed. All the rest are lost. The second thing is, they don't have what we call a good dynamical range, the span in the brightness between the faintest things we can detect and the brightest before the detector saturates completely. In the case of a photographic plate, that may be a factor of 100-ish, maybe a little more. With CCDs, it's easily many thousands. It's more powerful in that way. Finally, the noise in the images is much more benign with a purely digital detector.
What happens with a photographic plate is that the images are formed by so-called exposed grains of silver halide, suspended in an emulsion that's been painted on glass, or on film for commercial uses. Photons deposit their energy into those grains, and then, with the chemical development processing, they turn black. Your negative photograph is black where the light hit and white where it didn't. You can then convert the negatives into a positive picture. Well, those little grains that are actually detecting the light have a finite size, a few microns, and also scatter light. They're sort of like raisins in a layer of photographic emulsion. Remember, plate scanning involves a beam of light that's been focused through the plate, and the transmission is measured on the other side.
There's a lot more scattering going on because of all these grains. That simply doesn't happen with a purely digital detector. A photon hits a spot on the silicon, and that's where the electron is collected, and that's all. The noise in the images is much more benevolent with digital detectors than with photographic plates. For all of those reasons, CCD data are better. We knew from the get-go that was how it was going to be. But we had this photographic material in hand, and we had several years to actually do something good with it.
ZIERLER: With the CCDs themselves, Moore's Law applies? It gets better every year?
DJORGOVSKI: Right. Bigger, better, all that. It's the same technology that's used to build CPUs, GPUs, and all the other components that go into big computers. Anyway, we had some time from, say, the early-ish through late-ish 1990s to actually do our science before better data came in, and we actually kept doing this science into the early 2000s because there still was a lot of good stuff to be found. Sloan was still progressing and building up. But that's basically the story about DPOSS or the Digital Palomar Observatory Sky Survey.
ZIERLER: So the bigger question there, as you alluded, when you came to Caltech, it was not on the basis that you would get so fully involved in the sky surveys. As you characterized yourself, you were a big-telescope guy. In the first maybe decade or so at Caltech, how did you manage that duality in your research? Having so much interest in this new endeavor, in the sky survey, while remaining engaged in what big telescopes could do, how did you manage your research in that regard?
DJORGOVSKI: I don't know, working 14 hours a day, seven days a week, pretty much.
ZIERLER: It was all good, you didn't want to let go of any of it?
DJORGOVSKI: Yeah. I was motivated by curiosity. I couldn't say no to a good project.
ZIERLER: Where does this fit into your work on the Tolman test for understanding universal expansion?
DJORGOVSKI: Oh, not in this at all. That's a whole separate story.
ZIERLER: Chronologically, though, you're working on this around the same time, in the 1990s?
DJORGOVSKI: Yes, I'd still been doing the other stuff I was doing before. Specifically, one line of my research then was using the Fundamental Plane correlations for elliptical galaxies to try to understand their structure and physics and deduce from that how they got to be that way. Today, people talk about galactic archaeology, that you can look for things in the Milky Way, stellar streams, and stuff like that, and deduce how the Milky Way formed on that basis. Well, we can also do this for the nearby galaxies, and one approach is what we were doing with the elliptical galaxies. This was a part of the thesis work of my student Mike Pahre, for which he won the Trumpler Prize, which is given annually for the best astronomy Ph.D. thesis in North America. One of the results that came out of that was the Tolman test you just mentioned. But that is a whole other interesting branch of research that we should probably cover some other time.
Also, I was working on globular star clusters, both with ground-based measurements and space telescopes, and that's yet another big branch of research. Another research direction was to look for the young and forming galaxies at large redshift, using their Lyman-alpha line emission. That was the Ph.D. thesis of my student David Thompson, who is now a staff astronomer at the Large Binocular Telescope Observatory in Arizona. I was doing all of those things at the same time. Then, later on, I started narrowing down my non-sky-survey-related work because there was a lot more exciting science coming out of these new big datasets. That was not something I planned on, it just happened that that was how things were going.
Solving the Mystery of Gamma Ray Bursts
ZIERLER: Tell me about your work in the late 1990s with Shri Kulkarni on measuring the redshift of a gamma-ray burst. I understand this created quite a stir in the field, particularly among the theorists.
DJORGOVSKI: Gamma-ray bursts had been one of the great mysteries of astrophysics since their discovery was announced in 1973. At one point there were about 150 theories about their nature. The problem was that in order to understand such a phenomenon you need to observe it at other wavelengths, especially in visible light, so that you can take a spectrum and determine the distance. However, their positions could not be measured in gamma rays with sufficient precision to determine the right counterpart, and they are also very short, typically lasting seconds. Getting the redshifts, and therefore the distances and luminosities, of these objects was the key to the understanding of this phenomenon.
The breakthrough came with the launch of the Italian satellite BeppoSAX, which had X-ray cameras that could pinpoint any possible X-ray counterparts with sufficient precision to identify possible optical counterparts. I got involved in this search with Dale Frail, a radio astronomer from NRAO, Shri Kulkarni, and initially, Mark Metzger, who was an Assistant Professor at the time; later we were joined by Fiona Harrison and Re'em Sari, who moved to Israel since then. Also, many postdocs and students were involved, and many of them went on to successful careers in astronomy.
The BeppoSAX team would communicate their X-ray positions to various groups of interested ground-based observers, us included. The first such identification of a possible optical counterpart was done by a Dutch group for a burst detected in February of 1997, but too late to obtain its spectrum; we obtained the redshift of its host galaxy later on. We finally succeeded in May of 1997. We obtained images of the field at the Palomar 200-inch, determined which possible sources were changing in brightness between different nights, and finally determined the actual counterpart. We asked Chuck Steidel, who was observing at the Keck that night, to obtain a spectrum for us, which he did. Mark Metzger analyzed it, and there were absorption lines that I instantly recognized as due to ionized Magnesium, which are very common in the spectra of distant quasars. That was the final proof that the bursts were at cosmological distances, and thus the puzzle was solved.
A lot of work followed, by us and other groups, which made it possible to converge on the right theoretical models. Basically, they are due to the explosions and collapse of very massive stars that produce stellar black holes. That is for the so-called long-soft bursts; there is another type, short-hard, that is produced by mergers of neutron stars. We suspected that, but it was only proven with the LIGO detection of a neutron star merger, which also had an electromagnetic counterpart.
ZIERLER: I saw a quote from you that a burst you observed was, "for about 1 or 2 seconds, as luminous as all the rest of the entire universe." How can one event be so powerful?
DJORGOVSKI: Actually, that was incorrect, since it assumed that the burst radiated equally in all directions, which was a reasonable assumption at the time, in the absence of evidence to the contrary. We found such evidence in 2001, that the bursts are actually narrowly beamed, which was also predicted by some theoretical models. If we are in the beam, we detect the burst, but the rest of the universe doesn't. So when we make that correction, the energies involved are comparable to those expected from their progenitor supernovae.
Overall, I would say that this early work on gamma-ray bursts was one of the main scientific achievements of my career.
ZIERLER: Going back to the sky surveys, where were the opportunities for symbiosis between the surveys and the big telescopes? Where could you see the obvious points of connection between your new research and your extant research?
DJORGOVSKI: Well, surveys can support other research by enabling you to find interesting things, like the high-redshift quasars. A more recent example is the work I do with Matthew Graham and others on understanding how the variability of quasars can teach us things we cannot learn in any other way. Those would be good examples. Once you find distant quasars, you can go to the Keck Telescope and obtain high-quality spectra that can be used to probe the evolution of the intergalactic medium, which also connects with the evolution of galaxies. You can use those quasars as pointers to where there might be other galaxies in the early universe. One thing about galaxies is, they cluster, and the best place to look for a galaxy is next to another galaxy. All quasars basically sit in big galaxies somewhere, and they may have companions and neighbors. We did that to look for proto-clusters of galaxies around distant quasars, for example, and for probing the intergalactic medium absorption by neutral hydrogen, which at some point thickens.
The general evolution of cosmic structure, as we understand it now, starts with the cosmic microwave background: dark matter halos collapse, gas pours into them and eventually starts making stars, and the first stars and galaxies form. The universe was filled with a fully ionized gas until the microwave background was released, then it was fully neutral until the first stars formed, and then those stars ionized it again, which made it transparent to the UV light. That's what we call the reionization era, when the first galaxies form. You can figure out where that is by looking at the spectra of ever-more-distant quasars and seeing where the absorption by the intergalactic hydrogen changes from being spotty to continuous. Essentially, all the light blueward of some wavelength is absorbed by that neutral hydrogen.
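The wavelength in question is just the redshifted Lyman-alpha line of hydrogen, whose rest wavelength is 1216 Angstroms; a one-line version of the arithmetic:

    # Observed wavelength of the hydrogen Lyman-alpha line for a source at redshift z.
    def lyman_alpha_observed(z, rest_angstroms=1216.0):
        return (1 + z) * rest_angstroms

    print(lyman_alpha_observed(6))   # ~8500 Angstroms for a quasar at z ~ 6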
That happens around a redshift of six or so. Quasars were found by surveys, not just ours, mostly by Sloan, and then we studied them and used them as probes of something else. We use them to find binary quasars, gravitational lenses, and so on. The way we use surveys has really evolved. As the data quality and data information content kept increasing, it became possible to do nontrivial, ever-more-interesting projects, and they were also ever more demanding, requiring use of ever-more-sophisticated AI and machine learning techniques. That's what's happening in the entire field of astronomy now.
ZIERLER: It's a very basic question, but I think it's important just so we understand the larger story here. Can you explain the importance of the development of large digital sky surveys in the creation or the need for an approach of data-driven astronomy?
DJORGOVSKI: Yes. Surveys really serve a dual purpose. First, there's science you can do with the survey data and only the survey data, for the kind of projects that require large coverage. Say, if you're mapping the large-scale structure of the universe. That was the original motivation for the Sloan Digital Sky Survey. Or if you're looking for rare, unusual kinds of objects, you have to have lots of sources to go through. But surveys can also serve as the basis for observations with large and space-based telescopes because those telescopes have very small fields of view. They can collect a lot of photons, but over a very small area of the sky. You can use surveys, coupled with a data-mining analysis, to identify potentially interesting targets that are worth spending time on with these very expensive, large and/or space telescopes.
You can't just go and survey the sky with one of those large telescopes. It would take forever, and it would be a waste of their time. You should be using these large or space-based telescopes where you know you're going to get some interesting science out. Surveys guide you to those objects or parts of the sky, so there's a very good synergy there. Astronomy used to be mainly for what we call targeted observations, individual objects, or small sets of objects. Surveys were not a very respectable activity unless you're opening a new wavelength domain, where we don't know what's out there in the sky. But modern surveys changed that, and they changed it because the information content of the data was vastly higher and better than the surveys in the past. You may conduct a survey of the sky to just get galaxy catalogs to take spectra and map the large-scale structure, but the data are so good that you can do hundreds of other projects with it, which the people who envisioned or led the survey never even thought of.
That was a big lesson. This became clear circa the turn of the millennium, largely thanks to the Sloan Digital Sky Survey. There were a lot of smart people working there. There were also many others, including infrared surveys, in which, again, Caltech and JPL have played a leading role. Now we see a major rise in radio astronomy surveys. The Owens Valley Radio Observatory (OVRO) experiments are now playing some of the leading roles in that domain. There is the same trend among different fields in astronomy and different parts of different fields, but it always goes in the same direction. We are doing ever more survey-specific science, and at the same time also enabling follow-up of very interesting sources that we find in the sky.
Separating the Signals from the Noise
ZIERLER: I wonder if you can explain what happens when machine learning detects a signal in all the noise from these sky surveys, and what the mechanism is to get that interesting signal to a higher-power telescope, something that has more resolution, to look more closely at what that signal may be.
DJORGOVSKI: That is, in fact, a very big area of work, especially in what's now called time-domain astronomy. That was another important transition. At first, observing the sky with panoramic photography was costly and took a long time. But with these new detectors, it's so efficient that you can do it very quickly, then do it again and again. That opened up a whole time axis of studying the sky: how it's changing, and detecting moving objects, like the potential Earth-crossing asteroids, and all kinds of variable astronomical objects, stars, quasars, cosmic explosions, etc. If, in these time-domain surveys, a new source appears in the sky, then you really want to know right away whether it is something that's worth spending expensive resources on, because it's perishable. It may not be there tomorrow. That actually turned out to be an extremely challenging task, because you have very little data to go on, the data keep changing, and they are heterogeneous.
ZIERLER: By the early and mid-1990s, did you see your work in any kind of evangelical role in terms of convincing the astronomy community broadly about the importance of digitizing the sky? Or was everybody basically on board at that point?
DJORGOVSKI: Oh, very few people were on board with that. And yes, I would talk about this, we'd talk about our project, the Digital Palomar Observatory Sky Survey, DPOSS, the digital version of the second Palomar Sky Survey. I'd give talks about it, and usually, they'd want to hear about the science, but I'd tell them about the technology as well. And that was kind of okay, but I did actually have to play a proselytizing role once the Virtual Observatory framework came about. I got to chair the National Virtual Observatory Science Definition Team that was charged by NASA and NSF to provide a road map for developing this framework. As a part of that, I had to go around and give talks about it, and persuade people why this was going to be excellent for them. And it was interesting. Some people kind of got it, most people didn't care one way or the other. Sometimes, people would say, "But if you make all this data available, people who don't understand the data will do something stupid with it." To me, the first response was, "Oh? How is that different from what we have now?" And the second response was, "You can't stop stupid people from doing stupid things. What I want to do is enable smart people to do things they couldn't do before." That's what that was all about.
I would say it took a while. For most of the scientific community, the fundamental nature and depth of this transformation still hasn't quite sunk in. Even in astronomy at large, the whole issue of having to do something about big data is no more than several years old. In 2010, we had the first astroinformatics conference here at Caltech, which I organized. It was a very good conference. A lot of people who were getting into this understood and came by. We had great discussions on what to do next.
At one point, I pointed out that here we are at Caltech, arguably the world's leading astronomy department, but the number of Caltech faculty in the auditorium, aside from myself, was zero. My colleagues at the time still didn't quite get it yet. They do now, but it took a while, and it took some persuading. Generally, scientists respond to two kinds of stimuli: resources (is there some funding agency I can get money from?), and results (if my competitors are getting some cool new stuff, I want to do the same thing). That is what can ignite a real interest. Now we see a lot of smart people who know what to do with the data, especially young people who don't have a vested interest in the old ways of doing things, producing excellent new results. That's essentially how a community changes. I'm sure this is how it's always been. It is often a small number of people who recognize something and lead in a new direction, and everybody else follows.
ZIERLER: This idea that there were people in the field who were concerned that the uneducated, the citizen scientists, would do crazy things…
DJORGOVSKI: No, no, they were talking about other astronomers.
ZIERLER: That's even worse.
DJORGOVSKI: Yes, it is, and unfortunately, it happens all the time. When the data become very complex, and maybe not fully documented, people who don't read the manual will make assumptions about the data that are simply not true and get wrong results.
ZIERLER: Perhaps it's an ignorant question, but isn't science self-correcting? Even if people, even experts, get their hands on this data and don't do the correct things with it, isn't it worth it? First of all, they get corrected in peer review, and then maybe they hit on something important.
DJORGOVSKI: Sure. That's exactly how science works. All of that happens. In the end, there is progress. Science is self-correcting, by and large. But I was just bemused by this puritanical fear that somebody was going to do something wrong. People do wrong stuff all the time.
ZIERLER: When Sloan Survey comes online, is there planned obsolescence for the second Palomar Sky Survey? In other words, when the second Sky Survey's being conceptualized, is there an assumption that increasing technology will render the second Palomar Sky Survey not cutting edge?
DJORGOVSKI: Absolutely. That was clear to me from the get-go. But it was still worth doing. We still had close to a decade of good use of that sky survey. It's still used for finding charts through the digital sky survey servers of STScI, and it is also now useful for time-domain astronomy. Because now, we have these exposures that were done over the span of a couple of decades, and we can attach them to the modern CCD sky surveys that we use today, CRTS, ZTF, and others. That gives us additional time baselines in the past. The First Palomar Sky Survey as well. That gave us some measurements from the 1950s, and then DPOSS gave us measurements from the 1990s, and then all of the CCD stuff came in the 2000s. That's useful information for time-domain astronomy studies because it probes time scales that we do not have with the CCD data. An even better example of that is the DASCH project at Harvard that Josh Grindlay's doing, digitizing all of the Harvard plate collection that spans a century. Obviously, those plates are not nearly the same quality as the sky survey plates we used at Palomar, but nevertheless, there's information there that's simply not available any other way and gives us insights into the variability of stars, quasars, and other objects over a century of observations. These datasets are no longer a cutting edge for most of the original intended science but still retain their value as legacy or archival datasets for studies that do require the time span.
ZIERLER: Moving into the mid-1990s, as the internet was becoming more widely adopted, what did that mean for you? What did that mean for where you wanted to see astronomy go?
DJORGOVSKI: At that time, I didn't think yet about the general transformative power of the internet, or even the Web, which is just a subset thereof. But purely from a practical viewpoint, I thought that this is going to be a much more effective way of sharing the data and moving the data around. We already had data centers, mostly from NASA missions, and their mandate was to preserve the data, and make them available. Before the internet really got going, you had to mail them a blank computer tape, and they'd put what you wanted on it, then mail it back. That wasn't terribly effective. Certainly, the internet changed how that was done.
ZIERLER: Let's move on to the computation side. What were some of the advances around the turn of the century that made it apparent just how revolutionary computation and machine learning would be for astronomy?
DJORGOVSKI: There was no step function there. Moore's Law is a continuous exponential. Every once in a while, there's a new process involved, and things jump a little bit, but overall, it's a continuous trend. Astronomy latched onto this trend in the early 1980s, and data volumes just got bigger and bigger. Realizing that there's a qualitative change driven by this quantitative change, that more data isn't just more data, but that it will let you do things you couldn't do before, started forming in people's minds in the mid-1990s, in some cases earlier, and in some cases later. That's where the idea of the virtual observatory originated. By the mid-1990s, every NASA mission had some kind of data archive and was obliged to provide it in a publicly accessible format. At first, people would actually mail large computer tapes and get them back with some data. This was before the web really took off. Sky surveys were also collecting big archives of digital information.
You had to access every one of those archives separately. There was one good image format standard called FITS, and that was quite visionary at the time. You could exchange images, and if you knew how to read them, you could do it on any computer, no matter whether they were radio, optical, or X-ray images. But it became clear that having more data brought new challenges. First, find out what's available out there. I want all the data, say, on my favorite galaxy. Where can I find it? Or I want all the data for this patch of sky that I want to study. Where can I find it? Then, you want to bring the data together, you want to cross-correlate between, say, different telescopes, filters, wavelengths, and so on. And when you're talking about small fields, small images, that's no big deal. But once you start getting into the Terabytes, it is a big deal. We had to learn this whole new set of skills that the industry was already developing about how to handle big data.
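As one concrete example of that cross-correlation step, the modern way to combine catalogs from different telescopes or wavelengths is a positional cross-match. A minimal sketch using the astropy library; the positions and the 2-arcsecond tolerance are made up for illustration, not a description of how it was done at the time:

    from astropy.coordinates import SkyCoord
    import astropy.units as u

    # Two small, made-up catalogs from different telescopes, each just a list
    # of sky positions in degrees.
    optical = SkyCoord(ra=[150.10, 150.45, 151.02] * u.deg,
                       dec=[2.20, 2.31, 2.05] * u.deg)
    radio = SkyCoord(ra=[150.0995, 151.0210, 152.50] * u.deg,
                     dec=[2.2004, 2.0497, 1.90] * u.deg)

    # For each optical source, find the nearest radio source and its separation.
    idx, sep2d, _ = optical.match_to_catalog_sky(radio)

    # Accept matches closer than a chosen tolerance, here 2 arcseconds.
    matched = sep2d < 2 * u.arcsec
    for i, (j, ok) in enumerate(zip(idx, matched)):
        if ok:
            print(f"optical {i} matches radio {j} ({sep2d[i].arcsec:.2f} arcsec)")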
By the end of the 1990s, it was clear that we needed to do something about the exponential data growth, that our existing structures and tools were just not scalable to ever larger data volumes. We were seeing data volumes doubling every year and a half, like Moore's Law says. That's where the idea of the virtual observatory came in.
ZIERLER: Let's go into some detail now about the origins of NVO. When did those conversations start, and who was involved?
DJORGOVSKI: It started in the late 1990s at various conferences. There were maybe a dozen or so people who were into this kind of stuff in the 1990s, who really recognized that we had to treat the exponential data flood as an opportunity, and that we needed to do something different from the way we normally dealt with data in astronomy. There were informal conversations at conferences and so on, and certainly, I would say, the core of that discussion might have been between Alex Szalay, Tom Prince, myself, and a number of other people across the community as well. As I recall, the three of us were once having a beer at the AAS meeting in San Diego and discussing this. That's where we came up with the moniker of the National Virtual Observatory. We were laughing about it because we said, like the Holy Roman Empire, it was neither national, nor virtual, nor an observatory. But it was a catchy phrase.
On the Sloan side, Alex Szalay really spearheaded the development of their software, archive planning, and all that. Up until then, I think they had some efforts at Princeton and Fermilab to build the software and archives for Sloan, and I don't think those worked out very well. Basically, Alex took it over to Johns Hopkins, and they really made it work. A crucial part of that was his friendship and collaboration with Jim Gray, a computer scientist who was working for Microsoft Research. He was a great computer scientist who got the Turing Award for inventing most of the technologies underlying databases today. Every time you buy anything online or use an ATM, you're using something he invented. He was also one of the nicest people I've ever met. Super smart, super accomplished, and very constructive in every way. Just to give you an idea of how well-regarded he was, he used to work for Digital Equipment Corporation, DEC, and then Microsoft hired him. But he liked living in San Francisco, and he wasn't going to move to Seattle, so they built him a branch of Microsoft Research in San Francisco because they really wanted him there.
Anyway, he helped us enormously by giving us this absolutely cutting-edge expertise on building databases and so on. He was a crucial player in the early days. The first thing you need to do when you get into a big data game is, you need to get your data house in order. You have to have data archived and documented, have all the metadata, and have it all findable, which is a nontrivial thing. The whole virtual observatory really began, and kind of ended, as a way of organizing astronomy's data, and it succeeded in that. What the VO didn't do, and should have, as envisioned, was to serve as the engine of discovery by actually having tools to discover things in these datasets that we put in a proper form. That didn't pan out, which is another subject, which is where astroinformatics comes in.
Alex and Tom got to be on the theory and computation panel for the 2000 Academy decadal survey. As you know, the National Academy does this every 10 years. The astronomical community gets organized, has different panels, working groups, etc., and they produce a report, which is desiderata for the next decade. Which projects NSF, NASA, and now DOE should invest in, both from space and from the ground. And usually, the big-ticket items are space observatories, gigantic telescopes, and stuff like that. There's no problem coming up with a big wish list from the observational side. But theorists traditionally didn't really have much to ask for, a bigger computer, more post-docs, that kind of stuff. I think that Alex and Tom were getting bored with this, and they decided to float this idea of the National Virtual Observatory in the theory and computation panel, and they found a very willing accomplice in Charlie Alcock from Harvard, who led the MACHO project.
People on that panel thought that this was a wonderful idea. They put that as their top recommendation for the theory and computation panel. And then, a greater panel that synthesizes the whole thing also thought that it was a wonderful idea. That became the top-ranked project in what we call the small projects category, which is defined as costing less than $100 million. That was the first time we actually had some kind of community validation that what we were talking about was really important and should be invested in. As a consequence of that, suddenly, a whole lot of people across the community discovered that they were interested in it, and that it was something we needed to do. Certainly, all of the data centers, both for ground-based observatories and all of the NASA data centers, recognized that this validated them as well and gave them an opportunity to develop their business further.
We had this kind of grassroots group of maybe a couple of dozen people who started talking to each other. We organized the first Virtual Observatory conference at Caltech in June 2000, "Virtual Observatories of the Future". We knew that the recommendation was coming out, so we already planned the conference, and I think that it was out by then. That was the foundational conference for the NVO. Program directors from both NASA and NSF were there, and they were 100% buying into it.
The Long Reach of the Virtual Observatory
ZIERLER: You mentioned that there's no variant of astronomy that would not be touched by the virtual observatory, and that's obvious when you look at the kinds of presentations at this conference. What were some of the themes that jump out in your memory, the kinds of things people were talking about that could be achieved as a result of having the virtual observatory in cosmology, astrophysics, astronomy, or all of the above?
DJORGOVSKI: Well, yeah, all of the above. I don't really remember what exactly was done at the conference itself, but in general, the story that emerged is that on the scientific side, the types of science that the virtual observatory would enable would be of different kinds. The first one was extracting knowledge from very large datasets, like survey datasets, that would do some kind of statistical study. For example, the large-scale structure of the universe or the structure of the Milky Way, that kind of thing, where you need a lot of objects to actually achieve a certain degree of statistical accuracy, square-root-of-N type of errors. That was the most obvious one, and that was why people were already doing large surveys. The second one was a little less obvious, that new knowledge can come from data federation, from a fusion of datasets, for example, in astronomy at different wavelengths.
Astronomers had been doing that for a while, but not on the grand scale that's demanded by the sky surveys. The way I phrased it was that there's knowledge present in all of these ingredient data sets that cannot be recognized in any data set separately, only when those data sets are combined. The obvious examples are from multi-wavelength astronomy. When, say, radio astronomy first started, people found out that there were radio sources, and with some effort, they were identified in visible light. Some were identified with galaxies that looked to be merging or something like that, and some were identified with these star-like objects, which then turned out to be quasars, which was a previously unpredicted or unappreciated phenomenon of nature. When the IRAS satellite was flown, it was found that there were ultra-luminous starbursts, that half the star formation in the universe seemed to be obscured by dust, and that some of the most luminous objects in the universe we could detect were not obvious in visible light; you had to use far infrared to find them. That, again, was a new phenomenology that was recognized by data fusion. And so, in high-energy astrophysics, for example, gamma-ray bursts are an obvious example. Until there were optical counterparts, we couldn't figure out what they were. That's another kind of science.
Another one is looking for very rare types of things, whether you know they exist or not. For example, rare kinds of things would be quasars at very large redshifts, or quasars in general. They look just like stars, but, like, one in a million objects that look like stars actually turns out to be a quasar. Finding those things, or brown dwarfs, which were also a relatively new thing at the time. Nowadays, we're talking about finding very unusual types of supernovae, things like that. There was always a hope that there would be something genuinely new that nobody expected to find, which would be found just because we had so much data to sift through, and those things are rare. I don't think that really happened in some pure sense. What we found a lot of were very rare, unusual subtypes of known phenomena, whether it was an unusual kind of quasar, an unusual kind of supernova, that kind of thing. But genuinely new phenomena that nobody has anticipated are rare. In fact, the only one that comes to mind in recent memory is the fast radio bursts. Eventually, you figure out what they are. But nobody expected them when they were first found.
ZIERLER: I'll see if I can jog your memory because you made many presentations at the conference. You already mentioned rare objects. What were some of the thoughts about what the National Virtual Observatory could do in terms of capturing things that were simply unknown to astronomy?
DJORGOVSKI: That's exactly what I'm talking about. I always thought of it as a systematic exploration of observable parameter spaces, or really organized serendipity. And in the past, people didn't fully realize that you could find such things in a systematic fashion. People just found them by sheer dumb luck or stumbled upon them. For a long time, people referred to that approach as a fishing expedition: by definition, you don't know what you're going to find, but you can be systematic about it. If something is rare, you're unlikely to find it by chance. But if you have lots and lots of data that are relevant, you might be able to find things that normally would be missed. In a sky survey where you detect two billion stars and galaxies, chances are you might find something so rare that nobody really identified it before.
ZIERLER: What about something like gravitational waves, which, of course, were theorized but not detected at that point? Would you consider those a new type of object?
DJORGOVSKI: Not really because they're purely predicted by theory. They were just extremely difficult to detect. I forget what stage of LIGO it was back then, as you know, it took decades to build those detectors, and those are some of the most precise measurements humanity's ever done of anything. But the sources of gravitational waves were predicted and understood. The challenge was just finding them because the signals are so weak. And then, interpreting them by matching them with other observations. But that was 15 years later, after the virtual observatory. What we had back then was an obvious issue of having to deal with terascale datasets coming from the first digital sky surveys and extracting knowledge from them. A lot of effort back then was focused on purely data farming aspects of it, understanding databases, standards for data formats, interoperability, things that most people couldn't care less about, but they're essential cyber-infrastructure in order to do any science with the big data.
Personally, I was always much more interested in new knowledge discovery, so my thoughts at the time, which I actually described at the conference, were about looking for rare and unusual types of objects through a systematic exploration of large datasets. I do recall two thoughts that came to my mind back then. First, let me back off a little bit. Any astronomical observation has its own limits in terms of how much sky you're covering, how deep you go, which wavelengths you're covering, the angular resolution, the spectroscopic resolution, and so on. All those parameters that describe any given observation, depth, flux, color, and so on, form an observable parameter space. Any observation, whether you take a single picture or a whole large survey, carves out some multidimensional volume in that parameter space of all observable quantities. Different astrophysical objects will populate that parameter space depending on where they emit most of their energy, how big they are, and so on.
The first operation we do is to get the data. In the case of sky surveys, that would be panoramic imagery. A data processing pipeline tries to separate actual sources in the sky from just background noise, and then, for each of the sources, we measure a whole bunch of parameters, fluxes in different ways, shapes, sizes, etc., match that with other surveys, and so on. You create this high-dimensional parameter space of the properties of sources detected in the sky. The vast majority are things you know about, and they form clusters, correlations, and so on. The hope is that there may be things that stand out, things you wouldn't recognize in any one of these observable parameters, but that may stand out when you look in this multidimensional view. That was the basic idea that I was toying with, and I looked at what was observable back then, what different sky surveys had covered, and I concluded that there were clearly two interesting new arenas to explore.
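The outlier-hunting idea can be illustrated with off-the-shelf tools. The sketch below runs scikit-learn's IsolationForest on synthetic catalog features; it shows the general approach, not the actual analysis Djorgovski's group performed.

    # Flag objects that sit far from the bulk of the population in a
    # multidimensional parameter space. Feature values here are synthetic.
    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.default_rng(0)
    normal = rng.normal(0.0, 1.0, size=(10000, 4))   # the "known" population
    oddballs = rng.normal(6.0, 0.5, size=(5, 4))     # a handful of rare objects
    features = np.vstack([normal, oddballs])

    clf = IsolationForest(contamination=0.001, random_state=0)
    labels = clf.fit_predict(features)               # -1 = outlier, +1 = inlier

    print("flagged as outliers:", np.where(labels == -1)[0])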
One was what we called a low-surface-brightness universe. When you take pictures, there's always some kind of sky noise, so you're not only limited by the total flux that comes from your object in the sky but also by how it's distributed spatially. If you smear it over a large area, it may be lost in the noise – the contrast is too low. We knew there were objects with very low surface brightness that had previously been missed; galaxies had been found not long before then that were much, much dimmer than those you see in the typical pretty pictures. We looked at that and tried a few experiments, and there just weren't any survey data back then that we could really use for such a search. Since then, actually, much more recently, people have developed telescopes or modes of observing that can pull out these very low-contrast objects in the sky. For example, Pieter van Dokkum, formerly a Caltech post-doc who's now a professor at Yale, has a specially built small telescope array that is optimized to look precisely for those very diffuse, very low-surface-brightness objects, and he's been finding some exciting and interesting new things. But back then, in 2000, we simply didn't have that kind of data.
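The contrast argument can be made concrete with a rough background-limited signal-to-noise estimate; the numbers below are arbitrary and only illustrate how spreading the same total flux over more pixels lowers its detectability against sky noise.

    # Same total flux, different spatial extent: background-limited S/N scales
    # roughly as flux / (sky_sigma * sqrt(number of pixels covered)).
    import numpy as np

    sky_sigma = 10.0     # background noise per pixel (arbitrary counts)
    total_flux = 2000.0  # total source counts (arbitrary)

    for n_pix in (4, 100, 2500):  # compact source ... very diffuse source
        snr = total_flux / (sky_sigma * np.sqrt(n_pix))
        print(f"spread over {n_pix:5d} pixels -> S/N ~ {snr:.1f}")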
The other domain of this observable parameter space that struck me almost right away is what we call the time domain. Up until then, most of the surveys were just one pass over the sky in some filters, radio frequencies, or something. Variability in the sky was usually studied by just selecting some targets and then following them up. There may be a few, a few tens, or maybe a few hundreds, but not over the entire sky.
In principle, there could be phenomena or objects that manifest themselves through their variability. Maybe they only appear once, or maybe they're there all the time but vary in some unusual fashion. And they may be so rare that, because nothing more than very small samples of known variable types of objects, variable stars or quasars, had ever been studied before, you'd miss them. There was already a budding attempt to do that for supernova surveys, looking for supernovae in nearby galaxies largely for cosmological purposes. It was very focused on doing that one thing, looking for a known kind of supernova in order to do, say, dark energy studies with it. But we did not actually know much about the variable sky in general. We just knew about things that stood out so much that people instantly recognized them and followed them up, like the various kinds of variable stars.
Around the same time, an astronomer at Princeton, Bohdan Paczynski, a stellar astronomer in both senses of the word, advocated very much the same idea of monitoring the sky for variability, and he was partly motivated by what we call the gravitational microlensing experiments. Charles Alcock at Harvard was also motivated by that. He led one of those first searches for gravitational microlensing over a small area in the sky. What we've done, just to kind of nibble at it, was to use our Digital Palomar Sky Survey photographic plates, which was not designed to be a multi-epoch sky survey, just a one-pass thing. But the photographic plates were taken at different times, and they were taken with a considerable overlap, so that we could kind of stitch the whole mosaic together. We looked in the overlap regions for sources that would only appear once, or that maybe were there all the time but showed very high amplitude variability.
We followed up on those, and we did find that over the entire sky at any given time, there may be about 1,000 objects down to the limits of our survey that are only visible then and then disappear, fade out. We tried to follow them up, but this was years after the data were taken, so they were gone. That was clearly an interesting thing to follow up. What was the nature of these objects we call optical transients that only appeared once, as far as we can tell? It doesn't mean they're not there, it could just be that they're in their quiet state, below the limit of our survey. You can think of, say, a supernova in a dwarf galaxy far away. Normally, you just don't see that host galaxy, but when a supernova goes off, there is a star-like source there.
Or there could be something completely new, in the same sense that gamma-ray bursts were completely new when they were found in 1973. That was the idea I really pushed at the time, and we did a limited experiment and found that there were a lot of highly variable objects in the sky. We got the spectra of those that were always present but varying by a large amount, and we found that they're largely a combination of so-called cataclysmic variables, a known type of object, and blazars, a known type of quasar already known to be highly variable; those accounted for most of the sources that are always there but highly variable. We didn't look for very low-level variability because there were just too many of those.
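A minimal sketch of this kind of two-epoch comparison is shown below, with toy catalogs and illustrative 1-arcsecond and 1-magnitude thresholds; it is meant to convey the logic, not to reproduce the actual DPOSS analysis.

    # Compare two epochs of the same field: flag sources seen only once, and
    # matched sources whose brightness changed by a large amount. Toy data only.
    import numpy as np
    import astropy.units as u
    from astropy.coordinates import SkyCoord

    ra1, dec1 = np.array([10.0, 10.1, 10.2]), np.array([5.0, 5.0, 5.1])
    mag1 = np.array([18.2, 19.5, 17.0])
    ra2, dec2 = np.array([10.0, 10.2]), np.array([5.0, 5.1])
    mag2 = np.array([18.3, 15.9])

    epoch1 = SkyCoord(ra=ra1 * u.deg, dec=dec1 * u.deg)
    epoch2 = SkyCoord(ra=ra2 * u.deg, dec=dec2 * u.deg)

    idx, sep, _ = epoch1.match_to_catalog_sky(epoch2)
    has_match = sep < 1.0 * u.arcsec

    transients = np.where(~has_match)[0]               # seen in epoch 1 only
    dmag = np.abs(mag1 - mag2[idx])
    variables = np.where(has_match & (dmag > 1.0))[0]  # large brightness change

    print("one-epoch-only sources:", transients)
    print("high-amplitude variables:", variables)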
It was obvious that we would actually have to do a survey where you find them in real time with an imaging survey and follow them up while they're still bright with a spectrograph in order to find out what was going on. That directly served as part of the motivation for the fully digital sky surveys with CCDs, including Palomar-Quest that we did at Palomar, Palomar Transient Factory after that, and ZTF as well. That all came out of the idea that we move from panoramic cosmic photography to panoramic cosmic cinematography, that we could now explore the time domain because there are a lot of phenomena out there that really only manifest themselves in the temporal sense. If you were to take a picture, and there was a supernova in the picture, you'd have no idea it was a supernova explosion; it looks just like a star.
In order to learn anything about objects that do dramatic things like that, you have to follow them in time. And because they can fade very quickly, you have to react to them right away. That led to this whole rebirth, or an explosion, of time-domain astronomy, which is now this huge, vibrant field. Interestingly enough, the LSST, the Legacy Survey of Space and Time, which used to stand for the Large Synoptic Survey Telescope and is now conducted at the Vera Rubin Observatory, originally had nothing to do with the time domain. It was all about weak gravitational lensing and mapping the dark matter. In fact, the original name for that project was the Dark Matter Telescope. Once that science had already been done by people using big telescopes elsewhere–you don't have to cover the whole sky to do it–and the whole time domain started coming up on everybody's radar as a potentially really interesting thing, that completely changed the scientific orientation and design of that survey. The time domain is now seen as the dominant role for it, which I have some doubts about, but that's a whole other story. Anyway, the technology in terms of CCD detectors got good enough for us to cover the sky not just once in a given filter, but do it again, and again, and again over a variety of different time scales.
The Digital Detector Revolution
ZIERLER: I wonder if you can explain what CCDs do that allows that to happen.
DJORGOVSKI: They're much more sensitive than photographic plates, almost a factor of 100. They're much higher quality data with a large dynamical range. They're superior optical detectors in every way. But basically, because they're so sensitive, you can reach interesting depths already in a very short exposure, minutes, or even seconds if you have big telescopes. Then, if you just build the whole survey around short, repeated exposures covering as much area of the sky as you can, then you really are doing cosmic cinematography. And things that change on whatever time scales you're probing suddenly pop up. That, again, turns out to touch upon just about every field of astronomy.
In the Solar system, it's not that things change brightness so much as they move. People who look for planetary-hazard asteroids do exactly that. They cover the sky and look for something that has moved between different exposures. Eventually, they determine their orbits. And even solar astronomy is largely about the variability of one source, our Sun, and so on.
If you're interested in stellar astrophysics, you want certain types of variable stars, some of which may be very rare. If you want to map our galaxy, there are types of pulsating variables that serve as excellent yardsticks to measure distances within the Milky Way. If you want quasars, quasars vary. You want supernovae because you're interested in the death of massive stars, or you want to use them to measure dark energy, that's another thing you can only find if you repeat observations in the same patch of the sky again and again. And so, that really opens a whole wealth of new information that's simply not available in any other way.
DJORGOVSKI: The National Virtual Observatory, as we were joking, was not an observatory. It was a way of organizing our data to enable some science. NVO does not create or store data, it just organizes data people have obtained for whatever other reason and enables everybody else to access those data to do their own science. Essentially, it was always a big data game. The only big data in town were sky surveys. To a large extent, this is still true.
What happened with all of the other digital sky surveys before was that they were usually done for some purpose, whether it's to map the large-scale structure of the universe, to find supernovae, or what have you, but other people used the exact same data to do other science that the originators of the data never thought of or weren't interested in. For example, our own surveys, CRTS, Catalina Real-Time Transient Survey, simply piggybacked on a data stream obtained by astronomers in Arizona to look for the Earth-crossing asteroids. We approached them and asked if we could use their data to look for things outside the Solar system that show variability. They thought that this was a really good idea as it immediately multiplied the scientific impact of their survey. They were the leading group in finding Earth-crossing asteroids, or at least they were at the time.
Data reuse for different purposes is really one of the major byproducts of getting our data house in order. Now, funding agencies are very keen on data reuse. That was significantly helped by the introduction of finite proprietary periods. You obtain data with some national facility, or get money from the National Science Foundation to use with your own facility, like Palomar, but you have to make the data available to the whole world after, say, 18 months or something like that, typically. That, I think, is what really enabled this major impact we talked about earlier, that anybody with an internet connection anywhere in the world now has a level playing field. They can look at the same data as the astronomers at Caltech, Harvard, Princeton, or any other major institution. If they have a clever idea and know what they're doing, they can do great science.
Essentially, the data quality and information content that's enabled by high-quality digital detectors lets you map the physical sky into a digital archive, at least some aspects of it. And the better the data you have, the more information content you have, the better the reflection of the real world you're now creating as an archive. In observational astronomy, you observe the real sky with some instruments attached to telescopes. But here, you've already observed the whole sky with some different instruments, and now you're using algorithms and machine-learning tools to "observe" the archive as a representation of the real world. And so, just like you find things in the real sky by observing them, you can find new things by "observing" the archive data with some software instrument.
That was really a new thing. Prior to these large digital sky surveys, there wasn't a data source that could enable really profitable data mining in astronomy. People would take the data for a particular purpose, and that's all you could do with it. They'd publish papers, and that was that. Sometimes, people would reuse them, but not in any spectacular fashion.
This was driven home largely by the Sloan Digital Sky Survey because they, too, were obliged to release their data to the whole world. They did a fantastic job with that, and uncounted numbers of papers have been published, the vast majority of which had nothing to do with what that survey was originally for, including the survey team themselves, once they did what they originally set out to do. People who had nothing to do with that survey team, having good ideas, could observe their archive using software instruments to make new discoveries.
ZIERLER: What guidance, if at all, did the decadal report provide on how both the NSF and NASA would support this initiative?
DJORGOVSKI: The decadal report is simply advisory to the funding agencies, saying that this is what the community thinks are the most important things to do. Then, they customize their grant programs or space missions accordingly. The big-ticket items tend to be space missions. In ground-based astronomy, it's promoting new sky surveys, sometimes new facilities that can do it, and so on. Mind you, this was the top recommendation in what they call the small projects category. The software was not yet getting the respect it deserved. But it started us thinking along these lines. I remember, there was some kind of strategic activity just a few years after that, and I went and asked people involved in these big data projects, sky surveys, not just in astronomy but also in physics and so on, how the total cost was divided between, say, the hardware and data-gathering on one side, and the archives and doing science with the data on the other.
Roughly speaking, typically, maybe 80% of the total cost was in the data space, building and maintaining the archives, and extracting knowledge from the data archive. The telescopes we think of as big-cost items are just the hardware front end. All of the real action is on the computer. That really changes the landscape dramatically. But it took a long time for people to really understand this, and many people still don't understand it. People used to think that their students were going to write software for free and that they didn't have to be experts in writing software. Those days are long gone. It's now increasingly understood that software development is a major component of the scientific cost, and if you don't plan on it, then you are slowing yourself down. Because sooner or later, you have to pay that price. Otherwise, you won't be able to do science with whatever data you got, however expensive it was to get that data. Without investing in the software infrastructure, which still does not get the respect it deserves, nothing's going to happen. This is a common issue for all sciences, and not just the sciences, but anything in the real world that generates great amounts of data, in commerce, national security, you name it.
Anyhow, as soon as the conference was over, the two agencies decided to commission a road map and asked me to form what's called the NVO Science Definition Team. We had a lot of people who were really involved in this talking about it, coming up with use cases, all manner of stuff. That was a report you may have seen by now. And that's really what defined the way forward. While the NVO was recommended in the "up to $100 million" project category, it never came up to that much; it was maybe some tens of millions of dollars. NVO took off in the early 2000s, then was succeeded by a second phase called the Virtual Astronomical Observatory. The first part was to build this virtual facility, and the second was to make sure that it worked well. Most of the people doing the heavy lifting were people from data centers, and some were from university-based groups, including Caltech, Johns Hopkins, Harvard, and a few others.
Around the same time, circa 2000-ish, the NSF was really getting into the big data transformation of science. That's when they formed the Office of Cyberinfrastructure, which is exactly what we're talking about. Astronomy was seen as a great model, but there were smaller, similar examples in climate, ecology, distributed virtual laboratories, and things like that. It was pretty much the same idea in every field. But in terms of scalability, astronomy was at the forefront. High-energy physics was already handling larger amounts of data, thinking about distributing them, and all that. But that was for one gigantic use case, the Large Hadron Collider, not all of physics.
It was really fascinating to see that happen. Some funding agencies didn't really understand this new type of scientific organization that was not a brick-and-mortar place. It's inherently something that's distributed on the web. There isn't one building that has a National Virtual Observatory sign on it. It doesn't exist. It's a community of collaborating institutions that really represents this new type of organization.
The NSF understood all this, and they never had a problem with it. The NSF is staffed by academic scientists who left their institutions to work at the NSF. NASA, on the other hand, also has academics, people who came from research and all that, but that's a relatively minor part of their mission. NASA, as a whole, is largely an engineering and procurement organization that also does some science, which helps justify the work they do. The culture of that agency–not of their astrophysics directorate but the agency as a whole–is really an engineering, procurement culture. Scientists who work for NASA also have to kind of fight with that all the time. For them, it was a little difficult to understand exactly how to deal with this new type of distributed virtual scientific organization. NASA is very much specific-project- or mission-oriented. I think that still happens in different fields, which could be another interesting thing to be informed by a history of data-driven astronomy.
ZIERLER: In the way that the decadal survey recommended the support, how did that break down in terms of the ways NSF and NASA decided to help with this endeavor?
DJORGOVSKI: The natural division is, NASA will pay for archives that are archives of NASA data, and NSF will pay for archives from the ground-based observations. That was the easy part.
ZIERLER: Is that roughly an even split?
DJORGOVSKI: I don't know. My guess is that the NASA-based archives were more expensive, in part because they have very stringent requirements in data quality, access, and all that. And ground-based observatories still weren't quite doing this yet. They were just surveys done by individual groups like us doing DPOSS, a consortium doing the Two-Micron All Sky Survey, or the Sloan. Surveys since then gradually increased in importance in the ground-based observatory world. Now, people are building observatories to do surveys, and everything else is going to come out of that. Then again, the question is, how do you fund science? You can pay for the observatory, for the detectors, even an archive, but then somebody needs to do the science with it. That's still funded through individual PI grants through general competition.
There's a real concern that by investing most of the money into the big observatory facilities, including, say, the Vera Rubin Observatory, which is going to generate vast amounts of data, and certainly radio observatories, and not planning sufficiently ahead for all of the data analytics issues and challenges, essentially paying for the students, post-docs, and software engineers we'll need to convert these expensively obtained data into actual knowledge, we may be setting ourselves up for a real problem. That's another one of those things that big data changed across the board in all sciences. Because that's really something new in the world that we never had to deal with, and it's developing faster than anybody has had a chance to really understand or react to.
ZIERLER: To go back to the joke, "It's not national, it's not virtual, it's not an observatory," what actual infrastructure was created once this was up and running? What did that look like?
DJORGOVSKI: Some of it has nothing to do with astronomy, but is simply the improved capacity and bandwidth of the internet. Most of it is software that lets archives interoperate, so that you can actually move data from one place to another to combine them or to do whatever you want with them, so there had to be a lot of protocols and standards agreed upon worldwide. Then, there's how to deal with metadata. If you are interested in one particular galaxy or a piece of sky, you want to know where all the data available to you, at all frequencies, can be found. That's a nontrivial job to figure out. You need to document the data with the metadata, make it findable, and make access to it efficient. And that's all working with the existing hardware but coming up with new protocols, software, testing, and all that to make sure it actually works.
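As one concrete example of such a standard, the IVOA Simple Cone Search protocol is essentially an HTTP GET with a position and search radius that returns a VOTable. The sketch below uses a placeholder service URL; it illustrates the protocol itself, not any particular archive's interface.

    # Query a (hypothetical) Simple Cone Search service: RA and DEC in degrees
    # (J2000) and a search radius SR in degrees, returning an XML VOTable.
    import requests

    service_url = "https://example.org/vo/conesearch"  # placeholder, not a real endpoint
    params = {"RA": 187.7, "DEC": 12.4, "SR": 0.1}     # a 0.1-degree cone

    response = requests.get(service_url, params=params, timeout=30)
    response.raise_for_status()

    # The reply is a VOTable listing catalog rows inside the cone; a library
    # such as astropy.io.votable can parse it into a table.
    print(response.text[:200])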
Internationalizing the Virtual Observatory
ZIERLER: The idea of it being national, does that directly lead to the need for an International Virtual Observatory?
DJORGOVSKI: Yes, of course. Astronomy is a global enterprise. The sky doesn't know about country boundaries, or NASA and NSF, or the distinction between ground- and space-based. We knew from the get-go that it had to be completely panchromatic, incorporating data from the ground, from space, and from any country. Since in every country the scientists get funding from their national organization, their equivalent of the NSF or NASA, you always have to sell it as, "Our national observatory." There was a slew of initial National Virtual Observatories, German, Japanese, etc. The Europeans already had the European Southern Observatory and the European Space Agency. After the US made the first move toward the US National Virtual Observatory, almost immediately, the Europeans started moving in the same direction. There was nothing but friendly cooperation in all of that. The meeting at the European Southern Observatory in Garching in 2001 or 2002 was when the International Virtual Observatory Alliance was put together. Because every country had its own funding mechanism, and we all still wanted to coordinate the activities, have common standards for everything, and make all data available to everybody in the world, not just to the countries that participated in this, the Alliance essentially served as a kind of coordinating organization, and it still does.
We were really building a global data commons for astronomy, across the spectrum, across ground- and space-based astronomy. And that's exactly what we built. As the home organization for this global project, we formed the International Virtual Observatory Alliance. That serves essentially as a coordinating body for the individual efforts in different countries or consortia of countries. In general, the whole virtual observatory framework became the data commons for astronomy worldwide. And in that, it succeeded. Not perfectly – huge projects like this are not always working as smoothly as one would hope. Nevertheless, astronomy was still the first science to create such a global data commons. I've been hearing from people in other fields how astronomy's done so well in this. I was partly proud of this and partly horrified because I knew how that sausage was made. I was pretty sure that people in other fields were going to run into the exact same problems that we've had. But we've done it, and I think that we were sort of a role model.
That essentially was not a qualitatively new thing that was prompted by NVO. Astronomers often collaborate internationally, and have for a long, long time. The new thing, in some sense, was that once you actually conducted these large surveys, you really needed the follow-up resources that require spanning the globe. For that matter, that also happened with large telescopes, looking at, say, the Thirty-Meter Telescope, the way that became a big international collaboration as well. Astronomers understood the benefits of collaboration with other groups, whether national or international, for a long time. Again, because astronomy has no commercial value, no privacy issues, no government regulation, it's relatively easy to do that. You can't have things like that happen just as easily in a field that, say, would have commercial value, personnel records, privacy, or anything like that.
ZIERLER: What were some of the considerations? Absent a brick-and-mortar organization, did the National Virtual Observatory need a staff, an executive administrator? Or was even that distributed among institutions?
DJORGOVSKI: Initially, we decided that what we needed was a pair of Co-PIs, one from astronomy, one from computer science. Alex Szalay was an obvious choice for the astronomy part. We had here at Caltech a great applied computer scientist, Paul Messina, who was the director of the Center for Advanced Computing Research (CACR). He was extremely good at finding resources, doing the necessary politics, and so on. We decided to have him as the computer science Co-PI. That was a very powerful combination, those two people. Unfortunately, a couple of years later, Paul decided to retire, and another scientist from CACR, Roy Williams, took his place. He also contributed a lot to the development of the NVO. Later on, Bob Hanisch, who was at the Space Telescope Science Institute, was the director for a long time.
But in the end, grants went to the individual institutions, and then they do what they normally do. When the Virtual Astronomical Observatory succeeded the National Virtual Observatory in the US, I think it was the same way. That's another issue of trying to understand how these new virtual scientific organizations, which are community-based and internet-based as opposed to institution-based, work and should be managed. Because of course, every participant wanted a big slice for their organization, so I'm sure there was a lot of negotiation over budgets and so on, and things were not always coordinated as well as they could've been. You can't have a community-wide scientific organization with everybody reporting to the same executive structure in the same way as they would if something were, say, only within Caltech. Those are all changes that are interesting to study from the history of science and sociology of science viewpoints.
ZIERLER: Why did the VAO succeed the NVO? Why did the NVO not simply modernize and adapt?
DJORGOVSKI: I don't know why they had to change the name for it. Perhaps there was some strategic or administrative reason, maybe a project like this lasts five years, then it becomes a new project, and it's called something else.
ZIERLER: It was more of a name change than a change of institutions.
DJORGOVSKI: Yes. Nothing really changed. It was just people continuing what they used to be doing.
ZIERLER: Was NVO conceived to be all-inclusive of every kind of astronomy, from radio, to gamma waves, to gravitational waves? Was everything involved?
DJORGOVSKI: Well, there was no gravitational-wave astronomy back then, but yes. All of astronomy, both by wavelength and by scientific topic.
ZIERLER: And why? Is that just because that's the mission, that it should be all-inclusive? Scientifically, what might explain that?
DJORGOVSKI: Wavelength divisions are partly arbitrary, but some wavelengths do not pass through Earth's atmosphere, so we have to observe them from space, and so on. But objects in the universe don't care what we call radio waves, or gamma rays, or visible light. The same piece of sky at different wavelengths looks completely different because different physical processes dominate at different wavelengths. You don't get the complete picture unless you consider it all at once. Also, the universe doesn't know about different government agencies, the NSF, NASA, or anything like that. In some sense, astronomy is really well-positioned to serve as a kind of international enterprise, especially because it doesn't have a commercial value and doesn't involve human subjects. That makes our life a little easier than in many other fields. It was always understood that it is one universe, over the full electromagnetic spectrum.
ZIERLER: Tell me about your collaborations with Giuseppe Longo and if that was useful in terms of internationalizing the virtual observatory.
DJORGOVSKI: It's been a great collaboration over many years, and he is one of my closest friends. I worked with him on a variety of things, including DPOSS and a lot of astroinformatics projects. He's another person who instantly recognized that this is the future, and this is how things are changing. Just like I shifted my scientific research from very distant galaxies and elliptical galaxies, he did as well and organized one of the leading astroinformatics groups in Europe at the University of Naples, and we did a lot of work together as well. He is another one of the founders of astroinformatics as a discipline, and he has produced a number of excellent students.
ZIERLER: When does Charlie Baltay enter the picture? Is that with Palomar-Quest? Or he's involved earlier?
DJORGOVSKI: The photographic survey ended sometime in the late 1990s. I forget exactly when the last plates were taken. A group at JPL looking for the Earth-crossing asteroids built the first CCD camera for the Schmidt telescope at Palomar, called the Three-Shooter because there were three devices side-by-side. CCDs were not yet big enough and cheap enough to pave the whole focal plane. They started using it to look for Earth-crossing asteroids. Before then, they were using photographic films on the 18-inch Schmidt, the first telescope at Palomar that Fritz Zwicky built. That lasted a few years, but wasn't really scalable. I forget exactly how we got in touch with Charlie Baltay. He led a group at Yale, and they decided they were going to look for quasars in order to find gravitational lenses, and from the statistics of gravitational lenses, you can constrain cosmology. They were going to look for quasars using variability. They built a camera they took to the Schmidt telescope in Venezuela that I'd never heard of before.
As you can guess, it's probably not a great astronomical site. They'd done some work there, I'm not sure how much science came out of it, then they figured out that a better site, like Palomar, would be the right place to go. There was an opportunity there, and they could build a bigger CCD camera to do surveys for quasars. Well, we also wanted a bigger CCD camera, so we thought that this can be an interesting partnership. The camera was somewhat similar to the Sloan Digital Sky Survey camera. There were CCDs lined up on what they call fingers, covered by different filters, and there were four of them. They would let the sky drift, and they would just read out the devices at the same rate. Pictures of the sky in these ribbons were taken in four different filters in succession, and they'd just stitch them together later. That's perfectly reasonable, and Sloan was the first project to pioneer that approach for sky surveys. We thought that this was what we want, a proper digital sky survey, maybe our mini version of Sloan.
We submitted a joint proposal to the NSF, which got funded and renewed, so we started working with the Yale group. It turned out that the camera they built was not a very good one by astronomical standards. In fact, Charlie Baltay bragged about how he got his CCDs really cheap. In this business, you get what you pay for. They are good experimental particle physicists who knew how to do electronics for detectors in accelerators and stuff like that, but that's not quite the same thing as the electronics for astronomical cameras. Long story short, it turned out that the data quality from that camera was, well, better than the photographic plates, but nowhere near what, say, Sloan was producing. We had to spend vast amounts of effort trying to fix in software the problems that were caused by the inferior hardware. That really slowed things down.
ZIERLER: What does that look like? How do you go about doing that?
DJORGOVSKI: There's a lot of different stuff you have to do. But one thing that really made us figure out how to do it well was the production of The Big Picture at the Griffith Observatory. Griffith Observatory was undergoing a major renovation in the early 2000s, and as part of that, they were going to make a big exhibit hall underground, under the parking lot, because Griffith Observatory is a historically protected building, a beautiful piece of art deco California architecture, so they couldn't change that. The plan was that on one whole wall of this giant underground exhibit space, now called the Gunther Depths of Space–Gunther was the donor–they would put an image of the actual sky; the wall was going to be 150 feet long and 20 feet high, something like that. They really wanted a real image, a real sky from a real sky survey. Then, they would have people observe it with little telescopes, come close to it, be able to touch it, and so on. Anyhow, it was a wonderful public outreach project. It still is.
By now, The Big Picture, as they imaginatively called it, has been seen by about 10 million visitors at least. A lot of kids, especially from minorities. It's a wonderful thing to do with your family, especially if you don't have much money, since access is free to all. I thought that it was a really worthwhile thing to do, so we pulled out all the stops, and we really worked hard. We scanned a particular piece of sky specifically for this. We used those data for our science anyhow. Then we had to clean the data very carefully, remove all of the artifacts, and add them all up in just the right way to create this one big multicolor image of the sky. And that might've been actually the best thing that Palomar-Quest Survey has done.
ZIERLER: Because of its scientific significance?
DJORGOVSKI: No, because of its public outreach value. That's a valuable thing, too. Inspire some kids to go into science.
ZIERLER: Why has it resonated so well?
DJORGOVSKI: Well, everybody loves astronomy and pictures of the sky, and wonders about the universe. That's been there forever. The whole public outreach aspect was something we were keenly aware of in the virtual observatory as well. It wasn't originally a part of our plan with the Palomar-Quest Survey, but the Griffith people came to us even before Palomar-Quest; they actually wanted something from the Digital Palomar Sky Survey, DPOSS, to use as an image for this wall. By the time they actually got their project funded, we were doing this purely digital survey. We told them that this was actually better, with higher quality data than photographic plates would allow. We did that, which required a lot of work, but it was very rewarding overall. And it's still there. The Big Picture was reproduced on steel plates with porcelain, using different inks to produce colors, and it's meant to last centuries. It probably will. It's no longer state-of-the-art astronomical imaging, but it still serves its outreach purpose.
On the scientific front with Palomar-Quest, the original idea was to do a survey just like the Sloan. As I mentioned, you had these lines of CCDs with different filters that would scan across the sky. If you had high-quality CCDs, that's a really well-designed strategy, a very efficient way to gather lots of exposure. But if you had poor-quality detectors, they'd smear all the bad stuff as well. That was why it really took vast amounts of effort to deal with this data. In the meantime, Baltay's group got a separate deal with the Lawrence Berkeley Lab. They were looking for supernovae to do cosmology. That was the origin of the supernova cosmology project that later led to a Nobel Prize.
ZIERLER: This is Saul Perlmutter you're talking about?
DJORGOVSKI: Right, but before Saul, there were other people running that. They were just looking for supernovae. When you look for supernovae, you don't care about multicolor photometry, you just want to see if something gets brighter in some galaxy. And you don't necessarily want to cover the whole sky uniformly. It doesn't matter whether there are gaps in pictures, missing columns, and stuff like that. They would just do traditional pointed observations. Point, expose, move to another spot, expose. And only with a single filter, so you don't have all the information you would in a proper multicolor survey like Sloan. But you don't care because all you're looking for is things that appear in some galaxy.
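The underlying idea can be sketched as simple image differencing: subtract a reference image from a new image of the same field and look for residuals well above the noise. Real pipelines also align the images and match their point-spread functions; the arrays below are synthetic and only illustrate the principle.

    # Toy single-filter transient search by image subtraction.
    import numpy as np

    rng = np.random.default_rng(42)
    reference = rng.normal(100.0, 5.0, size=(64, 64))         # old image: sky + noise
    new_image = reference + rng.normal(0.0, 5.0, size=(64, 64))
    new_image[30, 40] += 200.0                                 # something brightened

    difference = new_image - reference
    noise = np.std(difference)
    candidates = np.argwhere(difference > 5 * noise)           # 5-sigma residuals

    print("candidate transient pixels (row, col):", candidates)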
At the same time, Mike Brown here at Caltech, who was looking for the trans-Neptunian objects or planets, realized that he could work with that, too. He didn't care about different colors, he just wanted to see whether some faint objects that looked like stars moved a little bit between different observations, so he started gathering the data. That cut down on our survey time substantially, but obviously, Mike did fantastic work in finding Sedna and a whole bunch of other dwarf planets, or trans-Neptunian objects. That might have been actually the biggest scientific return from this project.
The supernova work done by LBL never really converged. What they do now with the ZTF, and what they did with the Palomar Transient Factory, was vastly higher quality stuff. That was what they really wanted to do. But they, too, were limited by the poor quality of the Palomar-Quest data.
In retrospect, I wish that we never started it because we spent too much of our time and effort on fixing the problems that shouldn't have been there in the first place. We should've just waited and got a proper high-quality CCD camera, which is exactly what Shri Kulkarni did by buying the camera that was decommissioned from the Canada-France-Hawaii Telescope, to use for the Palomar Transient Factory, a survey that replaced Palomar-Quest, and then by building the ZTF camera.
ZIERLER: This is kind of a chicken-and-the-egg question, but with the conceptualization of Palomar-Quest, was that done as a result of, or with it in mind, that the National Virtual Observatory was already up and running? Or did Palomar-Quest sort of serve as an intellectual inspiration to get NVO going?
DJORGOVSKI: It had nothing to do with NVO. NVO was really inspired and based on sky surveys we had or that were being conceptualized in the 1990s. We had DPOSS, Sloan was being planned for, and there was the Two-Micron All Sky Survey that ended up at IPAC, and a bunch of others. All across astronomy, we saw this exponential growth of data taking off. That's what motivated the whole virtual observatory movement. Frankly, I don't think that Charlie was necessarily even aware of the existence of the NVO until we started talking. He just had this project to find a bunch of gravitational lenses, which never panned out, but that's okay.
We planned on it, knowing what NVO was doing. We counted on it to be fully compatible with all of the standards, data archiving, interchange, and all of that stuff. I was the co-PI of that survey, so of course it had to be that way. In fact, the emerging existence of NVO as a cyberinfrastructure was a factor in not having to reinvent that set of wheels. It was happening community-wide, which was going to make our data available using whatever NVO mechanisms were in place. The exact same thing happened with any number of other survey projects.
ZIERLER: If Palomar-Quest was in some ways a proof of concept that a sky survey could be conceptualized with the virtual observatory in mind, what did we learn, or what was made more efficient as a result of making this new survey with the National Virtual Observatory up and running?
DJORGOVSKI: It's hard to tell. It also depends on what you mean by the NVO being up and running. It took years to debug it and agree on things, and it was always less efficient than one wanted, partly because you'd have to satisfy everybody, and it's all committee-designed stuff. I was personally never happy with what the NVO was delivering in terms of even data interchange and access because there were always some issues that had to be solved. But you've got to start somewhere, and it gets better in time, and eventually, it becomes invisible. You just count on it being there. But certainly, one thing that really was important in astronomy was understanding that having common formats for your data is really, really important.
People were already sensitized to that with the preexisting FITS image standard and understood that if we all save our images in this format, then we don't have to write new software to read and write images every time we do any software development. We can just count on it. There's a library of routines we can use. That made a huge difference, and people were already getting used to it. But in the 1980s, that wasn't yet fully realized. People were just inventing their own formats. You couldn't read tapes from one telescope anywhere else. Having common standards was a good thing, and there are many other standards underlying the virtual observatory framework now, and most people don't even know they exist. They just go and get their data, and that's that.
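A small illustration of why a common format matters: with a standard like FITS, the same few lines of Python read an image from any telescope, using a shared library instead of one-off readers. The filename below is a placeholder.

    # Read a FITS image and its metadata with astropy; this works the same way
    # regardless of which instrument produced the file.
    from astropy.io import fits

    with fits.open("survey_image.fits") as hdul:   # placeholder filename
        hdul.info()                                # list the extensions in the file
        header = hdul[0].header                    # metadata: telescope, filter, WCS...
        data = hdul[0].data                        # the image itself, as a numpy array
        print(header.get("TELESCOP"), header.get("FILTER"),
              None if data is None else data.shape)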
The Big Picture at the Griffith Observatory
ZIERLER: Tell me how you got involved in the creation of The Big Picture at Griffith Observatory.
DJORGOVSKI: That goes back to the Palomar-Quest Survey. Actually, even before then. Griffith Observatory was planning a major renovation, even circa 2000, very early 2000s. They had the idea of The Big Picture as a single continuous image of real sky on the back wall of their new great exhibit hall, what they call the Gunther Depths of Space. The physical dimensions were 20 feet high by 150 feet long or something like that. Of course, since Palomar was where modern sky surveys were first really done, they came to us. At first, we were actually thinking about DPOSS images. But since it took them so long to get the money and get going, by the time they were ready, we were already doing Palomar-Quest. I thought this would be an excellent thing to do for numerous reasons, public outreach, that's part of our social contract anyhow. They gave us samples of the sky, saying, "This galaxy, that galaxy, or these coordinates."
And they did the same with the Sloan Digital Sky Survey, and we were to submit prints of what things would look like from our survey. They had a committee of people, including some astronomers, to compare, and we won. They judged what we could produce to be better than what the Sloan Sky Survey could produce. I think it was really a toss-up, honestly, but I was very pleased that we could do this. We agreed on a piece of sky, which goes through the middle of the Virgo cluster of galaxies, the nearest rich cluster, with a lot of photogenic galaxies in it. They had this idea of having a statue of Albert Einstein putting his finger in front of his face like that. How much sky is covered by that angle? That turned out to be 2 degrees by 15 degrees on the sky. We decided on the exact slice of the sky it was going to be, and we obtained extra data. We were already observing that piece of sky with Palomar-Quest with different filters, but we got extra data to make sure we had enough good data to cover all the blemishes and everything.
It really pushed us to improve our data cleaning because professionals can recognize, "Ah, this is just an artifact," and ignore it. But the general public doesn't know CCD artifacts, so we had to clean it pretty much perfectly. That pushed us to really do better with our own data processing, and in the end, we did produce this big sky mosaic. It was then reproduced in porcelain on steel plates, a process that is also used to make very durable outdoor signs. You can use different minerals as inks and have layers of porcelain that are just like pictures. Normally, you see that on a much smaller scale on your coffee mug or something like that. These were about seven-foot-tall panels. There's a company that produced them in Washington state. It was the biggest thing they'd ever done. We had to make sure all the images were really perfectly matched, that black sky was equally black on all of them. We couldn't have some panels dimmer or brighter than others. It had to be consistent across the whole thing.
My wife, Leslie Maxfield, who is a Photoshop expert, removed any leftover artifacts that were there so that people were not baffled by them. One thing about CCDs, which are wonderful detectors, is that they do saturate. If you have a really bright star, charges from the detector are going to kind of streak out in what looks like a tail. Of course, in real life, stars don't have tails. We decided we weren't going to mess with the data at all except to remove things you just couldn't explain to a child. That's the only really artificial thing we did with that picture. It's been there now for over 10 years, seen by, I would say, more than 10 million visitors so far. It is still physically the biggest astronomical image ever made. I think it's been a fantastic outreach experience. We were very happy to do this, especially since Griffith Observatory is open to the general public for free, and it's a nice location. It could be a weekend destination for a lot of people who can't afford to buy tickets for an expensive museum with their kids. That may be, actually, the most valuable outcome that came out of the Palomar-Quest Survey.
ZIERLER: The theme here in this answer is the social contract, as you called it, of public outreach. Is that also the motivation for you to do online open courses?
DJORGOVSKI: No, not originally. I always wanted to experiment with recording my lectures. Again, that's another aspect of how we can use computing and information technology in our work. I had the foolish idea that if I do this once, I don't have to do the same lectures ever again. Which was, of course, foolish because it takes 10 times longer to produce one of these MOOCs than doing a real-life class, and you still want to regularly update the material, and so on. But anyway, just around that time, MOOCs became big hype, and Caltech hopped on it. Jean-Lou Chameau was the president then. Three of us on the faculty, one in biology, one in economics, and me, were chosen to do the first set of Caltech MOOCs, with great help from our Academic Media Technologies and the Center for Teaching, Learning, and Outreach.
Of course, I wanted to teach a Caltech class, and that was what Coursera wanted as well, to have actual university classes. I started realizing what a fantastic impact it could have, because the class I was teaching, which I'm also teaching right now, on galaxies and cosmology, is a sophomore class for our astrophysics majors. Astrophysics is a small major in a very small university, so typically, I would have between 6 and 10 students or something like that. In the first week alone, 28,000 people signed up for this class. And it kept going up, and up, and up, and that was astonishing to me, because it wasn't an easy class; it required some calculus, a good understanding of physics, and so on. We had an estimate later that it would probably take the typical person about nine hours per week to follow it. The real surprise to me was who was taking it.
First, I thought it would be just students like our students, but from different universities, using it as a complementary source of material, something of that nature. It turned out that only 20% of my online students were actual students somewhere, a lot of them from places like India. Another 20% were what I call science or education professionals, high school teachers who wanted to expand their knowledge, post-docs in astronomy who wanted to learn a new field. And fully 60% were just general public interested in science at a level where they would have to spend this much effort. On average, they were middle-aged people. One of the students I had was 80 years old, a Caltech alum, and he was sharp as a tack. Who knew these people existed? It's not like watching a science program on TV for one hour a week. It's a serious commitment of time. And they wanted it hard. They didn't want pretty pictures, they actually wanted it Caltech-hard. They used those very words.
I was surprised by that – there's a real need out there that we just didn't know about. That was very interesting. I then did two more MOOCs, one for my Introductory Astronomy class, one about data science. To date, probably over half a million students have seen these worldwide, and they came from all continents except Antarctica. Maybe even Antarctica now. The students were from about 150 countries, something like that. A lot from India, because they really want to have access to this type of material.
I had a Facebook page for the class, just experimenting to see what works. At one point, I noticed on the Facebook page that a lot of people asking questions had Middle Eastern sounding names. I commented that I was curious about it. I got answers from several of them, from Pakistan, from Saudi Arabia, and elsewhere, and they all basically said the same thing: that in their countries education is so bad that the only way they can learn anything is from these online classes. I thought that was amazing. Never mind the cosmology class and galaxies, what about classes in political science, in sociology, in history, and so on? There's a very, very powerful role for online education that's been neglected because we think in terms of what we're doing, which is mostly selected US kids. The world out there really needs this kind of stuff. I think that MOOCs have done something really good. They've basically allowed scaling up of education, along with other things like YouTube instructional videos, though those are not vetted in any manner. Coursera actually gives real classes. On YouTube, you can learn real stuff, and you can also learn a lot of bogus stuff. And there are other excellent learning platforms like Khan Academy and so on.
I think that's a great thing that really allows essentially free education that can be optimized for individuals anywhere in the world to learn what they need to learn. We also talk about continuous education. You have to keep learning your entire life to keep up with your own field. You learn new skills. That, I think, is still going to have a huge impact. The thing that MOOCs don't provide is the personal interaction component of learning, student-to-student, student-to-teacher, because human time doesn't scale. I can have half a million people watch my lectures, but I can't talk to half a million students. That, I think, has been gradually solved with various forms of telepresence, including what we're doing right now, Zoom. But someday, there will be immersive virtual spaces as well. That will facilitate the human interaction component of the learning process, and the gelling of knowledge.
Also, MOOCs are built for video platforms. What they're lacking is the ability to do hands-on labs. People are thinking about how to have real-life labs somewhere to complement online classes. But we've also been playing with developing fully digital labs people can access on the computer, which can teach them the same things we have our students do, and safely. Moreover, you can do things online that you cannot do in real life. You can have students be bigger than galaxies or smaller than molecules to explore those aspects of reality. There's been some work along those lines already. I would say that information technology is going to really, truly completely revolutionize education, and it's long overdue in doing so, not only because education is too expensive in the US, but also because it gives anyone with an internet connection access to top-notch educational materials.
From Palomar to Catalina Surveys
ZIERLER: Tell me about how the Catalina Real-Time Transient Survey got started.
DJORGOVSKI: We were slowly sunsetting Palomar-Quest. My colleague, Shri Kulkarni, who always recognizes a good idea when he sees one, decided he wanted to do one of these, too. He basically wanted to move Palomar-Quest off the 48-Inch and put his own camera on it to do what became the Palomar Transient Factory, which was very successful. But even while Palomar-Quest was still going on, we were looking for something else. Not just because we would lose access to Palomar, but also because the data quality wasn't exactly what we were looking for, our scientific goals had changed, and so on. We knew about the Catalina Sky Survey looking for asteroids in Tucson, and there was a workshop in Tucson having to do with the transient universe in general. We went to talk to the PIs of the survey, saying, "If we can have access to your data, we can do all this other science with it."
They agreed immediately; they thought it was a great idea, and it came at no cost to them. They were funded by NASA to look for Earth-crossing asteroids, and if we had access to the data, we could do everything else. They'd get to be coauthors on all the papers we wrote, so that worked out beautifully. We built a software pipeline that tapped into their data stream. They were looking for little things that moved, and we were looking for things that varied. We decided from the get-go that all the data would be public immediately to the whole world, which I think was maybe the first time that was done in astronomy, but I'm not 100% sure. We already understood that there was no need to clutch to your data. There was just so much data out there. Even though we'd get scooped on some things by people who tapped into our products before we did, that was fine. There was new stuff coming all the time. We were limited by our research time rather than by exclusive access to the data.
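As a purely illustrative sketch of the kind of variability check such a pipeline performs (not CRTS's actual software; the column names, pre-matched catalogs, and threshold are assumptions), one can compare new detections against a reference catalog and flag large brightness changes:

```python
# Illustrative sketch of a variability check: compare new detections against a
# reference catalog and flag sources whose brightness changed significantly.
# Both tables are assumed to have columns source_id, mag, mag_err (hypothetical
# names) and to have been cross-matched by sky position beforehand.
import numpy as np
import pandas as pd

def flag_variables(new_detections: pd.DataFrame,
                   reference: pd.DataFrame,
                   threshold_sigma: float = 5.0) -> pd.DataFrame:
    """Return detections whose magnitude differs from the reference value by
    more than threshold_sigma times the combined photometric error."""
    merged = new_detections.merge(reference, on="source_id", suffixes=("_new", "_ref"))
    delta_mag = merged["mag_new"] - merged["mag_ref"]
    combined_err = np.sqrt(merged["mag_err_new"] ** 2 + merged["mag_err_ref"] ** 2)
    significance = delta_mag.abs() / combined_err
    return merged[significance > threshold_sigma]
```

Candidates flagged this way would then typically be screened for artifacts and prioritized for follow-up observations.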
ZIERLER: You mentioned there were some bugs that needed to be worked out for the National Virtual Observatory. Once the Catalina Survey was up and running in 2008, was the NVO basically fully operational at that point?
DJORGOVSKI: Depends on what you mean by fully operational. Things like that are never fully operational. Mind you, the idea was always that the NVO doesn't store anybody's data. Everybody gets to take care of their own data. They just need to talk to the rest of the framework. We were serving our own data, just like ZTF serves their own data and Sloan has been serving theirs. It just has to be compatible so that other people can access it without doing anything super special. Of course, we've done that. But the NVO is a meta-organization. It doesn't have its own instruments or staff. There were maybe, like, two jobs involved, a director and an assistant. But all the real work was done in the participating organizations that contribute. Why would astronomers do that? Because they benefit from sharing. If they make their data available, and everybody else does, the synergy of being able to extract new knowledge from different datasets, which you couldn't do from any one of them individually, helps everyone. The second thing is, people who use your data usually give you credit, so the scientific impact is much higher than if they were to clutch to their data and make it impossible for anyone else to access it. It was understood that playing well with others and sharing was to everybody's benefit.
ZIERLER: What were the key scientific objectives of the Catalina Survey?
DJORGOVSKI: If you separate the asteroid work, which was done in Arizona, which was the original motivation, we approached it in a very general sense, understanding variability in the sky in a systematic fashion. There were individual projects we were interested in doing. For example, Andrew Drake, who was the co-PI and ran the survey on a day-to-day basis, was interested in certain types of variable stars. I was interested in quasars, as was Matthew Graham. Other team members included Ashish Mahabal, Ciro Donalek (a Machine Learning expert who later became the CTO of Virtualitics), Eilat Glikman, a postdoc who is now a professor at Middlebury College, and others.
Scientifically, CRTS was much more productive than Palomar-Quest, and even DPOSS. There was more than enough science that we could do. And probably, more papers were written with Catalina data by other people than by us. I think that we've written 40 or 50 papers from those data, but I'm sure there are at least that many, probably many more, done by people who had nothing to do with the survey, but used it for their own purposes.
ZIERLER: Do you recall a specific moment or conversation when you and your colleagues in astronomy realized the value of big data and machine learning, not just for your field, but for all of science?
DJORGOVSKI: There was no specific moment or a discussion. I think it's something that kind of gradually became clear to all of us who were paying any attention to it. I mentioned my friend Alex Szalay. He was certainly one of the leaders in this transformation of science, and he has also worked with biologists, medical researchers, and others in connecting their domains with powerful tools from computing and data science. I certainly understood this very clearly by about the year 2000. I went to then-President Baltimore with the idea that we should form a center that would support that general transformation across the board, and he just didn't get it, as smart a person as he is. Most biologists then were really in the mindset of in-vitro, molecular biology. Biology hadn't quite been hit with the big data wave yet. It happened shortly thereafter, and now biologists are all for it. But I would say that 20 years ago, it was not really widely understood that we all have the same challenges and same opportunities in the era of big data. That understanding crystallized roughly through the first decade of the century.
ZIERLER: In more recent years, in the last few decades, as these developments have been taking place, has your research shifted to some degree, such that your work in computation and data science no longer has a specific anchor point in astronomy?
DJORGOVSKI: I wouldn't phrase it that way. Because of this commonality of the methodology that underlies knowledge discovery, we started talking about data science methodology transfer, the idea that we could recognize a problem in one field that we had already solved in a different field, in a different packaging. We could, in principle, save the effort, development, and mistakes in this new field by repurposing the data science solutions that were developed somewhere else. That was one of the key motivations and goals for the Center for Data-Driven Discovery, precisely to enable the transfer of these ideas and methodologies from one field to another. We've done a lot of that, also in collaboration with JPL, transferring data science solutions, say, from astronomy and space science to medical research, like early detection of cancer, for example.
Medicine, as a field, has not yet educated sufficient numbers of its own specialists in data science and computational skills, so they pretty much have to acquire them through collaborations, or just hire them from different fields. That will change in due time as well, I'm sure. But it's a field that is also becoming immensely data-rich without much data-science expertise or maybe even data culture. It's a very complicated situation because there are also privacy, regulatory, and commercial issues. As the late Jim Gray used to say, astronomy is so exciting because it's worthless – he meant commercially. You can't sell astronomical data. It's not like data on new drugs or on people.
ZIERLER: It can't be compromised, to some degree.
DJORGOVSKI: Right. In fact, part of our success, why we were one of the leading fields, if not the leading field, in this transformation, was because our data do not have commercial value, and we do not have privacy issues at all. There are other factors. The field of astronomy was always computationally savvy, or at least since the early 1980s, and always looking for a better way of doing our science, eager to adopt new technologies. It also helped that it's a small field. There are probably, I would say, less than 20,000, maybe less than 10,000, active research astronomers worldwide, which is not a lot compared to most other fields. A dentist convention in Las Vegas may have 30,000 people, more than all astronomers in the world. It's not like everybody in astronomy knew everybody else, but a lot of people knew a lot of other people, which made it easier to communicate, share ideas, and so on.
What happened with the initial NVO project was that it really got taken over by the data centers. The work at the university groups was supported, but all the big money really went to the data centers. On some level, this was understandable. You had to do that first. We had to develop all the interoperability standards, formats, protocols, and so on. But as time went on, it really became something of a cash cow for the data centers. The knowledge-discovery component was losing to the data-farming aspect of it. Our motivation was both to archive the data for posterity as well as to do science with these big data now. Not much of that was happening at the time. All of the progress in actually developing the knowledge discovery tools from big data, most of which involve machine learning, was really happening in university groups, at Caltech, JHU, and at many other universities. That's where the whole concept of astroinformatics really bubbled up.
ZIERLER: How important was what the decadal survey was saying about astroinformatics and machine learning for the creation of the NVO?
DJORGOVSKI: I don't think anybody even talked about astroinformatics at the time. I'm not sure the word as such existed yet. They were all talking about data, the demands of doing proper stewardship and use of data. Sad to say, it's still pretty much the case that most astronomers just think in those terms. The importance of knowledge discovery tools, machine learning and so on, is something that's only been penetrating the collective consciousness of astronomy over the last several years. We who have been at this from the get-go understood that that was the real goal. But unfortunately, the 2010 decadal report took a step back in terms of promoting this type of research. And the 2020 decadal report reiterated the importance of taking care of the data, especially with multi-messenger astronomy, but still without showing much understanding of how the methodology of science is changing so profoundly.
This is not just some subsidiary thing. In the early 2000s, there was a lot of thinking, and people saying things like, it's just archiving, intellectually inferior work, subsidiary to the real work that happens with pen and paper or at the blackboard. People by and large did not understand that this was not trivial work, something like software written by graduate students at night, with no training needed for it. That still wasn't understood at all. And in fact, the whole intellectual potential and core of astroinformatics, which incorporates the VO as its data-farming part, still hasn't really been understood by the community. This perception is something that changes slowly, partly as a generational thing. It's always younger people who are really up to snuff with the latest technology, especially when it comes to computing, since it develops so quickly.
As I told you earlier, I think that people and institutions evolve on time scales comparable to their age, and you have to be in the tail of the distribution to really lead the pack, saying, "No, here is the future. That's where we want to go." Being at Caltech offered no special benefit in that regard because our business model for astronomy was to own the world's biggest telescope. That's a really good thing to have, but now the way you really win is with this emerging synergy: surveys to find interesting things in a systematic fashion, and then big telescopes to follow up the most interesting ones. Understanding that is just now starting to make some inroads.
ZIERLER: In the planning discussions that led to the National Virtual Observatory, were you and your colleagues talking about the obvious value this would have ultimately to other sciences, or did that only occur to you and your colleagues as this was developing?
DJORGOVSKI: Well, we were aware that probably all sciences were going through the same thing. We were concentrating on our own field because that alone was more than enough of a challenge for us to push through. But yes, we were always aware that this was not astronomy-specific at all. Once we got things really rolling, then I personally started thinking more in terms of the rest of the sciences. They, too, have the exact same challenge and opportunity here. It was interesting to see how different fields responded in different ways. Back 20 years ago, biology was just starting to look into the onslaught of genomics information. That has changed completely over the past two decades, and now biology is in the lead in having big data, and applying new AI tools, and so on.
Different fields respond to these challenges and opportunities on different time scales, but ultimately, everything goes in the same direction because the technology is all-pervasive. It does the exact same thing for everyone, creating an exponential data flood of higher quality and higher complexity, which implies the need for machine learning and AI tools to actually do what science is supposed to do, discover new stuff in these great new datasets and data streams. I was always saying that this whole business about e-science, cyber-science, the fourth paradigm, different words that were used to describe it, is a temporary stage. This is just going to be the standard way of doing things in science, and in maybe 10 or 20 years, nobody's going to talk about bioinformatics or astroinformatics. Those are just the tools that we use. Machine learning is one set of tools, and statistics is another. I'd say that these X-informatics fields will actually become obsolete within a couple of decades because that will just be the way science is done.
The Rise of Data-Driven Astronomy
ZIERLER: Tell me about the origins of the Center for Data-Driven Discovery and what role developments for the virtual observatory may have played in that.
DJORGOVSKI: Astronomers are obviously not the only people in this position. In the early 2000s, it was becoming clear that there was a general need everywhere to do something about this, not so much data archiving and handling, but the methodology for knowledge discovery. I talked to a number of colleagues in different fields; high-energy physics was getting a huge data flood from the Large Hadron Collider, biology was starting to get its hands dirty with genomics, proteomics, and so on, and somehow or other, I assembled three dozen faculty from all Divisions to agree that we absolutely needed to have something like this. We wrote a proposal, which we then gave to the Institute's administration. As you probably know, it's hard enough to do things between two different Divisions, but between three or more, the success stories are actually fairly rare. And here, we had all of them. Plus JPL.
At that time, we still had the Center for Advanced Computing Research (CACR), which was kind of drifting. It was formed in the 1980s, I think, largely to develop novel types of hardware, massively parallel computing in particular, then grid computing, and so on. It was one of the leading places for that type of research. But then some of the key people went elsewhere, so it kind of lost the real drivers. Certainly, the loss of Paul Messina as the director was very important. But the Center was still doing useful things, working with different groups on some large projects, especially in applied physics, astronomy, and biology. But it was becoming clear that high-performance computing alone wasn't really what we were after. It wasn't just about hardware. Hardware is now a commodity business. The basement of Powell-Booth, which is where the Center was, is full of expensive computer machinery, different clusters, etc. For a short period of time, we had the world's fastest supercomputer there, the Intel Delta. It had nice blinking lights in the front, so it was really fun to watch.
Data-driven computing is something different. We proposed to transform the existing center, CACR, into something that would be really data-science-oriented. We didn't call it data science then, but data-enabled, or computing-enabled science. It did not really get the administration's support, even though the administration often says that they love collaborative efforts between different divisions, between Caltech and JPL, and this was exactly what we were doing. JPL was deeply involved on a collaborative basis. We even had a memorandum of understanding that they have their own center, and we have our own, but it was really one center in two pieces, with Dan Crichton leading the JPL effort. This has been a great, mutually beneficial collaboration between the campus and the Lab.
In order to get some startup funding, I wrote endless blurbs, proposals, and one-pagers for the Development office, and talked to them. And they thought that this was great. You'd think that the Caltech donor base would respond well to something like this. But it didn't go past the gatekeepers. If something belongs to everyone, it belongs to no one. Each Division has its own priority list, and somehow, they never really felt they should stand together for something like this. Also, I don't think the Provost at the time ever understood what this was all about, and he seemed almost hostile to the idea. When the gatekeepers don't really see you as a priority, it can't go very far. The development officers were all for it because they knew that this was something that would sell, but they had to follow the priorities given by the Provost. We pretty much subsisted on the grants we were bringing in, and we were very good at that for many years. By now, I'd say there's no real need for it because every Division does have its own related efforts. Never enough, for sure. But it's becoming a standard part of doing science.
There was an opportunity cost, and our peer institutions were much wiser about that. For example, they got grants from the Moore Foundation to establish these kinds of initiatives, along with other gifts and funding. You can see centers for some kind of data science popping up all over the map. Some universities were extremely successful at that, like the University of Washington, which has distinguished itself in computer science and data science, going from a good state university to a true leadership position. Our computer science faculty offers were sometimes declined because young faculty would rather go to Washington. I think that's what helped change the culture here a little bit, the realization that we may no longer be as competitive as we thought. Now, we have some fantastic young people doing computer science. Anyway, I think that Caltech as an institution did not really avail itself of the opportunities provided by the early rise of data science and data-driven computing. It fell to individual faculty and groups of faculty to carry the torch, which is kind of how we do things anyhow.
ZIERLER: Just to clarify, obviously you recognized the ultimate value that this would have for all of science, but did you see astronomy as sort of first-in, early adopters to embrace what machine learning could do for science?
DJORGOVSKI: Yes. There are certainly people in different fields, high-energy physics, for example, who were thinking along the same lines, but they were narrowly focused on whatever they were working on, or even a big project, like the Large Hadron Collider, rather than changing the way we do all of physics. Physicists are notorious for thinking that whatever they do is exactly how everybody else should do it, and that's not how the world works. In terms of astronomy as a science on the whole, maybe we were the first in, in an overall astronomy community sense. Of course, not the entire community always agreed. But many people across astronomy were leaning in this direction in ways that we're discussing.
ZIERLER: Do you have a clear sense of when the term astroinformatics came into use?
DJORGOVSKI: I would say somewhere between 2000 and 2010. And certainly no later than that because we had the very first astroinformatics conference here at Caltech in 2010. The field de facto existed already since, I would say, at least the late 90s. The initial focus was really on getting our data house in order, getting the data properly indexed, annotated, and findable, and being able to put it together between different surveys and observations. That's where the whole Virtual Observatory concept came in. But the data, as I always say, is just an incidental part of this. The real goal is to discover knowledge in the data. Yes, you need the data. You need to organize the data in a proper way so you can deploy knowledge discovery tools. But it wasn't just all about archiving things, it was really about making data accessible so that one can deploy tools. When you have lots of data, it inevitably involves machine learning in order to find things about the universe.
The NSF Astronomy now has panels on astroinformatics, which is a relatively new thing. I think the number of projects that involve applications of machine learning and AI is growing very rapidly, and these agencies respond to their customer demand.
ZIERLER: Tell me about your collaboration at JPL. What made that such a fruitful enterprise for you, working with people like Richard Doyle, Dan Crichton? Why was that such an asset?
DJORGOVSKI: In the beginning, it was simply because we didn't have the machine-learning expertise we needed to process even DPOSS data. We were gaining that expertise as we went along, but having highly competent people, complementary expertise, and just more people to work on stuff was always good. That was how it all started. And that continued, not just for our group; the ZTF also has very fruitful collaborations with JPL computer scientists in developing their software. My collaboration with Dan Crichton and Richard Doyle, before he retired, took a little different turn. It had to do with my change from just big-data astronomy to big-data science, period, and realizing that it's not just the data that can be reused, but also the methodology we create to extract knowledge from the data.
Once you have data in a form that computer scientists and machine-learning algorithms can understand, like feature spaces and so on, it doesn't matter whether the data came from a telescope, a gene-sequencing machine, a network of seismographs, and so on. They all pose the same type of data science challenges, using the same kind of machine learning algorithms, although it always has to be optimized for whatever problem you're dealing with. But there's a great deal of common underlying methodology, which is what's now called data science, and it is then applied in different fields, astronomy, biology, etc. That became the intellectual focus, understanding how we can reuse the methodology, not just the data, and have the solutions we develop in one field tackle an equivalent problem in a completely different field.
The longest step is the initial one of understanding each other's language, because each field has its own jargon, way of thinking about things, and so on. But once you get past that through lots of discussions and constructive interactions, then you actually can reuse data science tools. Dan Crichton, for example, was already working with medical researchers on providing them with a cyber-infrastructure for the early detection of cancer with biomarkers. We joined that effort, then we did a bunch of other projects along similar lines. This spans all of data science, from what I call data farming, getting the whole system architecture in place, getting data-interchange protocols, and so on, to having access to data analytics, through actually trying different AI and machine learning methods to extract some interesting knowledge from the data. Reusing, for example, some of the automated classification methods we developed for sky images to look at pathology images for cancer detection.
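To make the feature-space point concrete, here is a minimal, domain-agnostic sketch using scikit-learn (the features and labels are synthetic stand-ins, not data from any of the projects mentioned): the same classifier code applies whether each row describes a sky source or a region of a pathology image, because the algorithm only ever sees numerical feature vectors.

```python
# Minimal sketch of domain-agnostic classification on feature vectors.
# Whether each row describes a transient on the sky or a patch of a pathology
# image, the classifier only sees numbers; the features below are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))                  # 8 measured features per object
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)   # stand-in labels for illustration

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```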
And that's essentially why data science is now becoming so essential and in high demand everywhere because it's the set of methodological skills you need in order to extract knowledge from the data in whatever domain or problem you're working on. In the same sense, statistics is used as a methodological skill that you need to extract some knowledge from the data. You can think of statistics as the early form of data science. Now, we use statistics, of course, but also a lot of other stuff, algorithms, machine learning, AI, workflows, the whole archive story, and so on.
ZIERLER: Given all of the interest that students, undergraduates, exhibit in computation, what do you see as the value educationally of this focus, not just on data-driven astronomy, but the way that astronomy has shown the way for the other fundamental sciences?
DJORGOVSKI: Well, I'm sure people in the other fundamental sciences figured it out on their own. Students understand how this technology is fundamentally changing everything, not just in science, but in commercial applications as well. They know that if they have these skills, they have a meal ticket for life, and they'll do something intellectually challenging and interesting. There's a bit of confusion, I would say, in that students think, "Computer science is what I need to study." There's nothing wrong with studying computer science, but all of the fun stuff is in its applications in different fields. It's kind of the nature of Caltech to do precisely that, so I would say as we really move into using data science tools in different fields, somebody who's keen on learning all that stuff would say, "Look, they're using the same ideas and methods that I'm interested in, in biology, astronomy, high-energy physics." As I said, this will become the standard way of doing everything in science, and in that sense, seeing powerful new applications and results coming out of huge datasets and knowledge discovery tools will inevitably inspire and excite students in general, not just at Caltech.
ZIERLER: In 2007, 2008, were there advances in virtual reality software that made the MICA experiment, the Meta Institute for Computational Astrophysics, feasible to try it out?
DJORGOVSKI: That's a whole other story, but it is part of my general interest of understanding how we can use computing and information technology for science, education, and scholarship. Since I was always looking at different technologies that came out, I noticed that an old friend of mine, Piet Hut, a professor at the Institute for Advanced Study at Princeton, put a couple preprints on the arXiv server, which is like the publishing house of all physical sciences, about his experiments in using primitive, by the present standards, virtual worlds as a collaboration platform. He did a lot of collaboration with Japanese scientists, for example. They could meet in virtual spaces, write papers together, and so on. I thought, "That's interesting."
We started talking, and then we had a little meeting in Princeton with a couple of other astrophysicists who work on similar problems, computational astrophysics, and some of the IBM scientists who were working on the early virtual worlds, and we decided to explore this a little more. We put in a proposal to the NSF, which, remarkably enough, got funded, thanks to a very enlightened program manager, Bill Bainbridge, who's a very interesting person. That was how it started. First, we tried a simple virtual world platform originally known as Qwaq, and then some marketing person decided that maybe that wasn't such a good name, so it became Teleplace. Then, we found that Second Life, which was kind of at its peak circa 2008-ish, was really a superior platform to do everything we wanted to do. The idea behind it was to form the first professional science organization based in virtual worlds.
The only real-world part was us. There were no real buildings, but there was a virtual building where we would meet; we would have seminars just like we have seminars in the real world, with guest speakers who would import their PowerPoint slides, and we would sit around in avatar form and interact, just as we do in regular seminars. Same for the group meetings, or even classes and public lectures, which were a great success at the time. We explored what the technology could do. Also, data visualization is yet another story, which, for me, is probably the most useful thing that came out of all this. After about four years of this, the NSF money ran out. We learned what we could, and there was not much of an uptake because a lot of people thought that this was just too weird.
Now the world's still slowly moving into what I think of as a 3D web. But those were our first learning experiences. Basically, anything that involves human interaction, collaboration, teaching, and so on is better done in one of those immersive or semi-immersive virtual spaces than anything we have otherwise, including platforms like Zoom or any other teleconferencing platform. There's a lot of research into why this is so effective. We're still moving in that direction. That's a whole other interesting story in and of itself. The upshot for me was that this is going to be a really powerful platform for visualizing highly complex data, data with multiple dimensions, where you measure not one, two, or three things for whatever you're studying, but tens or hundreds of different things, so you can actually visualize data in 10 dimensions, stuff like that. That's what eventually, with subsequent work and development, led to a startup company, Virtualitics, that combines machine learning and extended-reality visualization as a very powerful way of understanding both the data and what AI is telling you about the data.
ZIERLER: Why specifically can visualization be so much more successful in this platform? What does that look like?
DJORGOVSKI: Ultimately, I think, it boils down to the fact that we're biologically optimized to live in a 3D world and interact with 3D objects in 3D spaces and with each other. Our pattern recognition is really optimized to deal with 3D scenery. But in addition to three spatial dimensions, you can also use things like colors, shapes of data points, put textures on them, or animate them. With various tricks, you can encode up to 10 dimensions of data space in a pseudo-3D scene. Your brain's already trained to recognize patterns in 3D. That really helps. It's also easier to remember what you saw in these virtual 3D spaces than what you see on a flat screen. Numerous experiments have been done about this in a variety of different fields, all yielding more or less the same result. As extended reality gets better and more ubiquitous, leading to what I think is going to be a 3D-enabled web, this will be a commonplace thing to do.
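As a small, static illustration of that encoding idea (a matplotlib sketch with synthetic data, not an immersive platform): three spatial coordinates plus marker color and size already carry five dimensions per point, and immersive environments extend this with shape, texture, and motion.

```python
# Sketch: encoding five data dimensions in one scene.
# x, y, z give three spatial dimensions; color carries a fourth and
# marker size a fifth. All data below are synthetic, for illustration only.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x, y, z, d4, d5 = rng.random((5, 300))

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
sc = ax.scatter(x, y, z, c=d4, s=20 + 200 * d5, cmap="viridis")
fig.colorbar(sc, ax=ax, label="dimension 4")
ax.set_xlabel("dim 1"); ax.set_ylabel("dim 2"); ax.set_zlabel("dim 3")
plt.show()
```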
We're not there yet; we still have these headsets that are very clunky and not quite as powerful as one wishes, but this will all be resolved in the next 5, 10, maybe 20 years. The evolution of our human-computer interfaces has always gone from simpler, lower-information content to higher-dimensional, more immediate, higher-fidelity information content. From linear text, to 2D images, to videos; the next obvious step is 3D. As the technology becomes capable of delivering 3D visual stimuli, for which our brains are optimized, that's what people are going to use. Which doesn't mean you don't use one- and two-dimensional data. We still read text, we still look at pictures and watch movies. It just adds another level of perceptual power to what we can do in interacting with the world and with each other.
ZIERLER: Do you see Zoom as being a positive step towards these virtual worlds? Or has it slowed that down? Or is it separate?
DJORGOVSKI: It's separate, and it's slowed it down, because it was easy and available. After the pandemic hit, people wanted an immediate solution, so they went to different video-conferencing platforms, WebEx, Zoom, Microsoft Teams, etc., and they all basically do the same thing. And they're fine. Skype before that. But they're essentially just one step above talking on the phone in terms of perception and interaction. It was noticed by many people, myself included, that when I was talking to a colleague about something, if the information came via email or a phone call, I stored it in one part of my brain, but if it came in either a physical space or a virtual space, it was stored in a different part of my brain. Our brains interpolate over all the imperfections of pseudo-3D displays and classify those experiences together with experiences from the real world. That's why they're easier to remember, among other things. That's just one of the aspects of why I think this will be yet another enabling technology. As you know, in science and scholarship, interaction is where all of the ideas really start, whether you interact with another person, a dataset, text, a paper, or something like that. If you can have a better interaction, or an interaction that stimulates your innate intuition or pattern recognition better than what you had before, that's going to help.
ZIERLER: In reflecting on all of the research that you've done, one perspective might be that the era of big data, data science, has changed the kind of astronomy you do, that you've done different kinds of things as a result. Or, perhaps, you still do the things you've always done, but you do them in better or different ways. I wonder if you can compare and contrast those perspectives.
DJORGOVSKI: Both are true, at some level, absolutely. There are some things for which you must have a big telescope. You just need that many photons, and it doesn't matter how you select them. That's not something that's been replaced; it's just that the information-rich sky surveys have given us a new, powerful tool to study the universe, one that guides us faster to more interesting results. Observational astronomy transitioned from the traditional targeted studies of small samples of galaxies or stars into looking at the whole sky, and doing the science for which you do need to look at much or all of the sky, or to find the very rare things that you really want to know about. There are many interesting methodological and sociological issues there, and we're learning how to do it as we go along.
Historicizing Astronomy and AI
ZIERLER: Moving our conversation closer to the present and the topic that really brought us together, when did you and Ashish start to think in a systematic way about embarking on this history project?
DJORGOVSKI: It was his initiative, really. Of course, I thought that this is a really good idea, for the reasons we discussed. When you're part of doing something, you don't think of it as part of history someday, and you don't bother preserving the records and information on how you got to those ideas, what you did, where you failed, all that. I thought that we still have essentially all of the originators of this transformation of astronomy around. It would be a good idea to try to really document and preserve that knowledge for the benefit of future big-data astronomers, historians of science, or anybody else. And because astronomy is just one field in which this transformation's happened, ostensibly, it could help people understand more broadly how big data and computing is transforming the way we do science in the 21st century in general. To my mind, there was a real intellectual value to that.
ZIERLER: We're not putting a proposal together in our conversation, but if I could just capture in your own words when you look out one, five, ten years, what are your goals and aspirations for what this project will accomplish?
DJORGOVSKI: Hopefully, it will lay the foundations for more systematic documentation of this unprecedented transformation of science. Because it really is, I think, historically unprecedented. Computing and information technology are so empowering and developing so fast, at a Moore's-law pace, that I don't think we've really fully understood yet just how profoundly they are changing our world. We can at least start documenting this history in real time. Then, hopefully, other people will benefit from it and keep adding to it. Because I think that it's really important to understand just how profound this change in science and scholarship is, in every field, and optimize things accordingly. Because all of our institutions, the funding agencies, funding mechanisms, professional reward mechanisms, universities, curricula, and all that, are really based on what we used to do decades ago. A lot of that stuff just does not work in this information-rich, computing-enabled world of the 21st century.
ZIERLER: This is really overdue when you put it like that.
DJORGOVSKI: Oh, very much so. And this is the case in every field, not just in academia. In the commercial world, it's a little more dramatic, the internet sometimes completely destroys some fields and replaces them with something else. But the purpose of education is to preserve and disseminate knowledge, and that need will always be with us, so we may as well learn how to do this effectively in the ways that technology enables.
ZIERLER: To wrap up this incredible series of conversations we've had, if I may, I'd like to ask one overall retrospective question about your career, and then we can end by looking to the future. If you survey all of your accomplishments in astronomy, from radio galaxies, quasars, supermassive black hole binaries, gravitational lensing, the list goes on and on in terms of the depth and breadth of things you've worked on, I wonder if you've ever thought about organizing those accomplishments into the era before machine learning and what became possible as a result of machine learning, whether you've ever divided it up into a chronology.
DJORGOVSKI: I don't think it's possible to do this in any meaningful fashion because it's a gradual change, and there's no clean-cut boundary. You just use the best tools you have at your disposal at any given time. As time goes on, these new tools are more important. But it's a transitional period of at least 20 years, so that's a good chunk of a career. I don't think of my work as before or after this broader transformation, it's just a natural evolution.
ZIERLER: Do you see that the impact or the pace of discovery once machine-learning tools became available is greater?
DJORGOVSKI: Oh, I'm sure. But again, it's hard to kind of grasp these things when you're in the middle of them. But looking backward, you can say, "This is why we started having much more interesting stuff circa the turn of the millennium." Maybe that's some of what we can learn from this project in the history of data-driven astronomy.
ZIERLER: Is there a discovery in astronomy that has given you the most personal satisfaction, either in terms of the way it changed the field or simply that was just the most fun to be involved in?
DJORGOVSKI: I have to think about that. There are easily at least half a dozen potential candidates. Let's leave it for another time.
ZIERLER: Finally, looking to the future, in the way that Moore's Law has made all of these advancements so phenomenally productive, is there a Moore's Law for machine learning that will get us to a place that we can't even currently imagine?
DJORGOVSKI: Well, not Moore's Law, but yes, there's an accelerating pace of progress. Although, I don't think that I can put a doubling time on it. If you look at what's happened with artificial intelligence in the last few years, there's been a phenomenal amount of progress. Part of the issue here is that algorithms scale in a more complex way with the complexity of the problem, not just with the amount of data. Moore's law is all about quantity. More transistors, more bits per second, etc. Algorithms are much more complex, and it depends on what exactly you're asking for, but for sure, there's great and accelerating progress as well, and I think we're just in the early phases of it.
ZIERLER: As an addendum to that, among the known unknowns, if you will, are there particular areas in astronomy that you're most optimistic will achieve breakthroughs as a result of these advances in computation and machine learning?
DJORGOVSKI: It's hard to pinpoint. Again, this affects all of it. Maybe at a different pace, but all of it.
ZIERLER: I want to thank you for spending this time with me. This has been tremendous for Caltech history, for the history of astronomy, computation, and perhaps most immediately, for our DDA project together, so I want to thank you so much for this.
DJORGOVSKI: Absolutely, thank you for doing it.
[END]
Interview Highlights
- From Belgrade to Berkeley
- Foundations in Galaxy Evolution
- Early Embrace of Computation
- Astronomy at Harvard
- Joining the Caltech Faculty
- Digitization Comes to Observational Astronomy
- The Palomar Sky Survey
- Partnering with JPL on Machine Learning
- Solving the Mystery of Gamma Ray Bursts
- Separating the Signals from the Noise
- The Long Reach of the Virtual Observatory
- The Digital Detector Revolution
- Internationalizing the Virtual Observatory
- The Big Picture at the Griffith Observatory
- From Palomar to Catalina Surveys
- The Rise of Data-Driven Astronomy
- Historicizing Astronomy and AI