Giuseppe Longo, Astrophysicist and Research Leader in Computation and the Natural Sciences
In an earlier era, Giuseppe Longo perhaps would have been called a natural philosopher, due to his expansive research interests and capacity for deep thinking across astrophysics, mathematics, biology, computer science, and even the philosophy of knowledge.
The discussion below covers one aspect of Longo's long and rich academic career, and the topics put him in close proximity with one of the most profound developments in modern astronomy and the leading position Caltech played in it; namely, the rise of big data in astronomy and the embrace of machine learning and high-powered computation to assess amounts of data far too enormous for human brains alone. Longo recounts meeting Caltech professor George Djorgovski and their collaborations in the burgeoning field of Astroinformatics, which, as the name suggests, is a merging of data or information science with astronomy. Longo explains why digital detectors, especially on sky survey telescopes, capture so much data, and how computation can minimize the signals missed amid the noise - simply, all of the astronomical phenomena that captivate astronomers but might otherwise be lost in terabytes or petabytes of data.
In a powerful scientific development, Longo also explains how big data methodology has migrated to other fundamental sciences (especially biology), and he considers the varying ways to assess one's impact, from fundamental research - nothing is more "basic" or removed from direct societal benefit than astronomy - to translational science, where today advances in biology made possible by machine learning and computation can be directly tied to helping people. At the end of the discussion, Longo reveals a paradox of achieving greatness in science. What makes for such greatness is a love of learning, but ironically, success is especially burdensome on one's capacity to sit quietly to study and learn. Such are the consequences of a schedule packed with conferences, paper deadlines, and mentoring students (whom he loves dearly). As he looks to retirement one day, Longo relishes the thought of going back to his roots as a student, where the learning never ends.
Interview Transcript
DAVID ZIERLER: This is David Zierler, Director of the Caltech Heritage Project. It is Friday, March 3, 2023. I am delighted to be here with Dr. Giuseppe Longo. Giuseppe, great to be with you. Thank you so much for joining me today.
GIUSEPPE LONGO: Thank you for inviting me for this rather prestigious initiative. Congratulations, by the way. I have been surfing a little your site. It's really interesting and informative.
ZIERLER: Oh, thank you so much. I appreciate that. To start, would you please tell me your title and institutional affiliation?
LONGO: I am Professor of Astrophysics and Data Science and coordinator of the master's program in Data Science at the University of Napoli Federico II, which is the oldest non-confessional university in the world. It was established in 1224 by Frederick II, emperor of the Holy Roman Empire. He was an enlightened and controversial man, well beyond the boundaries of his time, who spoke many languages including Arabic and really loved art and culture.
Early Interests in Astronomy
ZIERLER: Is your training in astrophysics, and you took on data science later on?
LONGO: Yes. I decided to become an astronomer when I was very young, 5 years old I guess, and I was lucky enough to succeed. I graduated in physics in the same university where I teach now, and since at that time in Italy there was no PhD program, I first spent two years at the Kapteyn Lab of the Groningen University in the Netherlands, and then two and a half years at the University of Texas at Austin, where I collaborated with Gérard de Vaucouleurs, even though it would be more appropriate to say that I worked under his supervision since he was a very strongly opinionated person. Then, I went back to Italy, where for many years, I was an astronomer at the Astronomical Observatory of Capodimonte. Finally, in 2000, I was appointed professor at the University Federico II where I still am. As for my shift from astronomy to data science and machine learning, I must say that it is very much a Caltech-related story.
Around 1995, I believe, I had the fortune to meet George Djorgovski, with whom over the years I have become very close friends. At that time, George was pioneering the field of digital surveys, coordinating a large project for the calibration of the first digitized Palomar Sky Survey. At that time, it was a huge enterprise. Basically, it consisted of transforming thousands and thousands of wide-field photographic plates taken with the Oschin Telescope at Mount Palomar into digital images. It was the first large-scale attempt to build a digital catalog of the sky available to the whole community. In order to process that huge amount of data, we began to enter into the field of automatic data reduction and processing. As it often happens in science, one thing generates the other and, in a few years, I started using machine learning, i.e. the same methodology which nowadays is called artificial intelligence, which, to tell you the truth, is a name I don't like very much.
ZIERLER: Why not? What is problematic about it?
LONGO: In my opinion, the way it is used today, the term artificial intelligence is just a novel branding of traditional machine learning. The thing I don't like is that they're trying to sell artificial intelligence as something new, while from the mathematical and algorithmic point of view, there's not much difference with what was done already 20 years ago. What makes these methods much more effective than in the past is the enormous increase in computing power and the availability of huge repositories of data. Basically, the availability of big data and high-performance computing together allowed old tools, like neural networks, to become more complex and perform more complex tasks. In some sense, from my point of view, artificial intelligence is a misleading, "spooky" term which leverages sociological and psychological implications for marketing purposes. In my opinion, in spite of the recent amazing successes of the large language models such as ChatGPT, we're still very, very far from artificial intelligence. We have these algorithms, which are a smart way to capture the complexity present in the data more efficiently than what our brain can do. They're just very good at solving very specific tasks, even complex tasks, but that's it. We're far away from the idea of general purpose artificial intelligence. But you know these things better than me.
ZIERLER: Tell me about the kind of astronomy and astrophysics you did before you got into data science. What was your point of entree with George?
LONGO: In the early years of my career I worked mainly on the photometric and dynamical properties of galaxies. When I got back to the Observatory of Napoli there was no one else working on the topic and I had to build a small group, leveraging the experience of Massimo Capaccioli, who at the time was at the University of Padova. Basically, this brings us to the point where I met George and our collaborations started. He was already a famous astronomer and I was honoured when he offered me and my group the opportunity to become the NA in the CRONARIO project (Caltech-Rome-Naples-Rio de Janeiro), which aimed to calibrate the already mentioned Palomar Digital Sky Survey. This somehow shaped all my future work and even now, some people in my former and present group keep working on the exploitation of the large digital surveys produced by the new generations of instruments. In this context, it is important to notice that Capaccioli, who in the meanwhile had moved to Naples as Director of the observatory where I worked, succeeded in financing and building the VLT Survey Telescope or VST, which is currently located on Cerro Paranal, next to the four domes of the Very Large Telescope. VST was the first wide-field telescope built to perform exclusively digital surveys of the sky. Digital surveys are the reason why I entered the field of machine learning. The data produced by these surveys are so large that using traditional interactive pipelines would be preposterous. As I used to say, machine learning was no longer a choice; it had become indispensable. More or less around that time, I think, my group published online the first paper on the application of machine learning to the evaluation of photometric redshifts of galaxies. You know the Hubble correlation? How much astronomy do you know?
ZIERLER: I've picked up quite a bit, but we publish these discussions so that they should be accessible to a wide audience, so please explain.
LONGO: One of the most relevant pieces of information you need to study the structure of the universe is the distance of galaxies. Beyond the local universe, let's say, for galaxies more distant than the Virgo cluster, the only way we have to estimate their distance is through the Hubble law, which tells you that the recession velocity of a galaxy is directly proportional to its distance from the observer. Basically, the more distant a galaxy is, the faster it recedes from us. One way to measure this velocity is to use the shift in wavelength of spectral lines, but spectroscopy is time-consuming, and in order to obtain the large statistical samples you need for modern cosmology - we're talking hundreds of millions of galaxies - the spectroscopic approach is not feasible. As an alternative to spectroscopic observation, in the early '80s a technique called photometric redshift was devised, based on a very simple idea. Instead of observing one galaxy at a time, I can observe a large chunk of sky in three, or four, or more different wavelengths; then I can use the luminosity of these galaxies in the different bands to invert the problem and try to recover, from the differences of luminosity in the various bands, the redshift and hence the distance.
With some colleagues, we had the idea to use machine learning. At the time, there were many surveys available with many, many tens of thousands of spectroscopic redshifts. And the idea was, "…let's use these objects as a training set for our neural network, and let's see if we can train a neural network to measure the photometric redshift of a galaxy." This worked really well. However, I want to specify that ours was not the first paper on the application of neural networks to astronomy. That one had already appeared a few years before, by Steve Odewahn, who used neural networks for galaxy classification.
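The scheme Longo describes - train a regressor on objects that already have spectroscopic redshifts, then predict redshifts from photometry alone - can be sketched in a few lines. This is a minimal illustration on synthetic data, not the pipeline his group actually used; the band coefficients, network size, and noise levels are arbitrary assumptions.

```python
# Minimal photo-z sketch: learn redshift from multi-band photometry.
# Everything here is synthetic; a real training set would pair survey
# photometry with spectroscopic redshifts.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000

# Fake "spectroscopic" sample: redshifts plus magnitudes in 4 bands whose
# colors drift with redshift (the coefficients are invented).
z_true = rng.uniform(0.0, 1.0, n)
base = rng.normal(20.0, 1.0, n)                      # overall brightness
mags = np.stack([base + k * z_true + rng.normal(0, 0.05, n)
                 for k in (0.0, 0.5, 1.0, 1.5)], axis=1)

# Train on 1500 objects with "known" redshifts, test on the held-out 500.
model = make_pipeline(StandardScaler(),
                      MLPRegressor(hidden_layer_sizes=(32, 32),
                                   max_iter=3000, random_state=0))
model.fit(mags[:1500], z_true[:1500])

z_photo = model.predict(mags[1500:])
scatter = float(np.std(z_photo - z_true[1500:]))
print(f"photo-z scatter on held-out objects: {scatter:.3f}")
```

In real applications the training set comes from spectroscopic surveys, and much of the work lies in controlling the biases of that training set rather than in the network itself.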
To tell you the truth, at that time I would have never suspected that after these pioneering papers, the field would explode. Nowadays, every year, you have scores of papers from many different groups around the world who try to improve on these neural algorithms to measure photometric redshifts. For me, this was the beginning of my fascination with everything surrounding machine learning. In collaboration with Massimo Brescia, a former student of mine who is now a professor in my Department, I began to work also in other fields, like geology, for instance, trying to use the thickness and colors of sedimentary layers to measure the Milankovitch cycles, which as you know are a consequence of the fact that the orbit of the Earth breathes on different time scales, with periods of 400,000, 100,000, 40,000, and 20,000 years. Basically, four main periods. And these periods are reflected in the appearance of these sedimentary layers in rocks. We began to use neural networks to analyze these characteristics of the sedimentary layers both to confirm the Milankovitch periods and to establish an accurate time scale.
For me it is difficult to summarize in a few minutes 20 years in which I slowly grew less interested in astrophysics and much more interested in machine learning itself which, by the way, is one of the most fascinating fields I have ever encountered in my life. The amount of implications is huge. At the same time, for similar reasons, George was also having a similar evolution. George has pioneered many things, both in astrophysics and in other fields, and in my opinion he has been one of the gurus behind many things which have happened in the last 30 years. For instance, George was among the first people to promote and pioneer the need for a virtual observatory or the exploitation of virtual reality for teaching and science.
Sky Surveys and the Dawn of Big Data
ZIERLER: On that point, I want to ask you, in the 1990s, what was it about digital detectors and sky surveys that mandated this need to embrace big data and machine learning? What did that look like to you?
LONGO: Big data is an ill-defined concept which strongly depends on the epoch: 20 years ago 1 terabyte of data was "big data," while nowadays it is almost nothing. This said, big data was already present in astronomy. A first example of big data was the already mentioned Digital Palomar Sky Survey, which was hosted at Caltech in a database in the basement of the old astronomy building which now, I think, has become the geophysics department. In any case, you can still see the domes of the telescopes above it. This was a first example of publicly accessible astronomical big data. The DPOSS was based on wide-field photographic plates, and even though it suffered from the limitations of photographic material, for quite some time it was the only way to obtain information on huge chunks of the sky. It needs to be said that the digital detectors of the time had a huge limitation in their size, which allowed only for a very small field of view. It was only at the end of the '90s that the size of these digital detectors started to increase and the first wide-field mosaic detectors became available. At that point, the amount of data collected every day underwent an explosive growth. And due to the way astronomy works, these data needed to be preserved and shared within the community. Physics does experiments. Astronomers do observations. Preserving observations for as long as possible is needed because you can never be sure that there's not something interesting that you have overlooked in your data.
Also, we still know little about the temporal evolution of many astronomical phenomena, and it is important to be able to go back to older observations of a given object. I could mention many, many discoveries made by exploiting data stored in the archives. For instance, think of supernova 1987A, which was one of the most important phenomena in astrophysics: a star which suddenly blew up in the Magellanic Cloud, the closest supernova to the Earth since the invention of the telescope. A lot of stellar astrophysics was done there. But the fact that astronomers never throw away data allowed people to go back to some photographic plates which were taken 30 years before, thus allowing us to see for the first time in history what the progenitor of a supernova looks like. Furthermore, since our understanding of the physics behind the evolution of celestial objects changes rapidly, very often you need to go back to the same data to answer different questions. This is also why astronomy pioneered the field of making all data distributable to the community. To the best of my knowledge, this happened in astronomy well before any other discipline.
Also, another important factor is that astronomical data has no intrinsic economic value. Not that it costs nothing, but basically, there are no economic strings attached to it. It's one of the few cases where you can access huge datasets for free. This made astronomy very appealing to companies like Oracle and Google that have based their economic strategy on the exploitation of big data. Astronomical data has been used to test infrastructure, to test algorithms, and so on. Therefore, big data is inherent in astrophysics. In 2000, there was a meeting at Caltech in Pasadena, Virtual Observatories of the Future, which was the apex of a long discussion on these problems within the international community. In the United States, this discussion had led the National Science Foundation to indicate in its decadal report, as its top-priority low-budget project, the implementation of the Virtual Observatory, which basically was the federation and interoperation of all astronomical databases available in the world. And this started a huge international effort. George and I were among the promoters of these efforts in our respective countries. This required a huge amount of work in building an ontology of astronomy and defining through unique keywords all possible characteristics of the astronomical data life cycle. It slowly became clear that astronomers had greatly underestimated the amount of work which was needed to do something like this. So, rather naturally, a lot of emphasis was put on federating the data and making them interoperable, and relatively less was invested in creating tools capable of extracting information from these unprecedented and complex datasets. And this marked my definitive transition to the field of data science, because it was clear that the most interesting part of having all this data federated was not in the possibility to download the data, but in exploiting them and extracting useful information from them.
Due to their sheer size and complexity, this could not be done with the traditional tools.
After starting a knowledge discovery in databases working group within the Virtual Observatory, we understood that we had to move outside of astronomy and leverage the huge potential offered by the fast growth taking place in other fields. This is when Astroinformatics as a discipline was invented. I remember the first time I heard this term was at a dinner at Smitty's in Pasadena after a meeting. At the table there were Alex Szalay from JHU, George, Robert Hanisch, at that time at the Hubble Space Telescope, and I. We were discussing the need to move in a different direction more focused on the development of machine learning based tools. Since at that time bioinformatics had become very popular, and geoinformatics was beginning to appear, George suggested X-informatics, where X stands for whatever you want, and so the word astroinformatics entered the arena.
I have stolen that idea. I think the only course of X-informatics in the world is the one we offer at the University of Naples. I must say that the more time passes, the more I like the term X-informatics. Machine learning, in fact, is a new science, where the same methodologies can be applied to all domains with just a little fine-tuning to take into account the specificities of a given type of data. Immediately after that dinner we started a series of yearly international workshops on Astroinformatics which still continues. Basically, this is it. I told you the story of 40 years of my life in a few minutes.
ZIERLER: I wonder if you can talk about the two-way street in advances between astronomy and computation. In what ways was computation ready to deal with all of this big data, and in what ways did computation need to catch up based on what all of the data from the sky surveys was taking in?
LONGO: I think we, I mean the astronomers, have always been at the cutting edge of the computational capabilities of our time. But when we entered the time of the digital sky surveys, problems with both storage and computation became humongous. As a matter of fact, a true revolution in the field of database structure, and also in the way data are distributed, came from the collaboration between Alex Szalay and Jim Gray, the Turing Award winner who was in charge of databases for Microsoft. They collaborated on the Sloan Digital Sky Survey, which in astronomy has been the most important turning point of the last half century, I would say. It was the first time there was a large-scale experiment aimed at mapping the whole sky with high resolution in several bands, processing the data, and making the processed data available in real time to the community.
Because it was a public database from the beginning, the Sloan has produced more than any other scientific experiment ever, a little short of 100,000 papers. However, going back to your question, even nowadays, the data throughput of astronomical instruments is at the limit of the available technology. In this respect, I think that astronomy is still a driving force. If you look at the Vera Rubin Observatory, the former Large Synoptic Survey Telescope, it will produce - I don't remember the right figure - but on the order of 30 terabytes of data per day. Data which needs to be stored, calibrated, processed, and therefore, it really puts a huge stress not only on the storage, but also on the computational capability and the network, since you cannot do this in only one place. But to understand the scale of the problem you must think of the radio interferometer SKA, the Square Kilometre Array, which is being built and will produce many petabytes of raw data every day. People still don't know how to handle that amount of data. In fact, we are far from achieving that level of speed. However, I must admit that real life sometimes surprises you and technological advances are much faster than what you can foresee.
I think we're still at the beginning of the story, which is also what really motivates me. I'm not an engineer. I'm not interested in the results of the application of artificial intelligence. The fact that you can train a deep-learning network to recognize the label of a can of tomatoes or to understand a legal text is interesting, but it is just a scratch on the surface of a much more interesting field. A question I always ask my students is, "Tell me any law of physics which depends on more than three independent variables." In 10 years, no one has ever been able to provide me with an answer, simply because there is not one. Therefore, the question arises: is the universe unrealistically simple? Or, rather, is the fact that our scientific understanding of the world allows only three degrees of freedom just a bias introduced by our brain? As Carnap said, the laws of physics and all analytical laws we know of almost always derive from patterns or structures we observe in data. We measure the pressure, volume, and temperature of many gases, and we derive the law for the ideal gases: P times V equals a constant times the temperature. Again, three variables (two independent). Therefore, since we are living beings who have evolved in a three-dimensional world, with a three-dimensional perspective of the world, our brain can visualize only three dimensions, so we can spot patterns only in three dimensions, and this implies that the fact that our physics is basically based on three degrees of freedom is not a property of the universe but a limitation of our brain.
Machines do not suffer from the same limit. If you feed them with proper data and use proper algorithms, a machine can spot trends and patterns in spaces of a much higher dimension. They can look for correlations or trends even in spaces of a thousand dimensions. Therefore, I think these machine-learning/artificial-intelligence algorithms, whatever you call them, will prove crucial. They already do, but they will become more and more crucial in the years to come in order to extract these higher-complexity structures which are hidden in the data. This is why we talk about the science of science, in some sense. Basically, big data plus artificial intelligence will become a sort of prosthetic arm for our brain that will allow us to see physical laws and correlations which we otherwise could never discover. And this is beginning to happen. In fact, if you go into the literature, you start finding that, for instance, artificial intelligence has been able to solve the protein-folding problem, which is a problem so complex that no biologist or molecular biologist could cope with it. But a couple of years ago, a deep-learning algorithm was able to model the folding for many, many proteins. And the same is happening in astronomy; look for instance at the discovery of binary quasars by Matthew Graham and other people at Caltech. We are beginning to use this big data with these new tools to produce a science of a higher complexity, which would not have been possible without these tools.
The Astronomical Impact of Machine Learning
ZIERLER: I wonder if you can discuss what has been made possible by machine learning in astronomy and what has been made simply more efficient by machine learning.
LONGO: Wonderful question. From a scientific point of view, 85% is just higher efficiency. Just because machine learning has enabled us to do the same stuff we have done in the past but with a higher level of efficiency. Which, in some sense, also means that without machine learning, basically, it would be a waste of time to collect so much data because we would not be able to use it. It would sit in our drawers and be of no use.
From the point of view of enabling new things, we are at the beginning of it. Because we've just realized that some problems can be tackled in a completely different way with machine learning, and, as I already mentioned, the first results are coming even though it is difficult. In this respect I would like to add something based on my experience as coordinator of the data-science master at my University. I interact a lot with companies and people who tackle industrial application of machine learning. These are profit-driven, and this usually calls for a very short time scale: tools must produce a profit for the company, and usually, the implementation and testing time is very short, between three and six months, seldom more.
On the other hand, in astronomy, physics, science in general, you are not profit-driven and you look for accuracy and robustness. To make machine-learning tools that are very accurate, robust, and capable of grasping all the subtleties of the problem and the subtleties of the biases in the data can take many years, and progress is slow. Therefore, at the moment, we still have only a few new things which have been made possible only because there was artificial intelligence.
For instance, we have a much higher number of supernovas nowadays, and the use of machine learning allows us to pinpoint new types of objects. We are also discovering new types of objects, especially at the faint-surface-brightness limits, by using image segmentation methods capable of unveiling these low-brightness details which would otherwise escape detection. We have also understood that the problems are much more complex than what we thought in the past. One good example, which I always use as a reference, is an experiment by the group led by Kai Polsterer at the Heidelberg Institute for Theoretical Studies, where they used a machine-learning approach to see how many parameters are needed to really constrain the problem of the estimation of photometric redshifts we were discussing before. Previously, we had been using a human-expert-driven approach, in the sense that the astronomer's experience was telling us which parameters to feed to the network. Kai Polsterer adopted a completely brute-force, data-driven approach, and he proved that in order to pinpoint the photometric redshift problem, you need at least 10 parameters. I think this was the first confirmation that by exploiting the power of machine-learning algorithms combined with a large dataset, you can really make significant progress in understanding what's happening. It is also worth mentioning that the 10 parameters selected by the machine would never have been selected by a human being. In other words, the machine was capable of capturing details in the data, such as correlations, biases, and small defects, which the human being would basically ignore, because we always somehow constrain our measures to follow a-priori defined models which often do not match the true complexity of the data.
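The data-driven parameter selection Longo credits to Polsterer's group can be illustrated, in spirit, by letting a model rank many candidate inputs instead of having an expert pre-select them. This sketch uses synthetic data and a random forest as a stand-in; it is not the method of the Heidelberg experiment, just the general idea.

```python
# Data-driven feature ranking: let the model decide which of many candidate
# parameters matter, instead of an expert pre-selecting them. Synthetic data;
# only the first four features actually carry signal.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, (1000, 20))                 # 20 candidate features
y = (X[:, 0] + 0.5 * X[:, 1] - X[:, 2] + 0.3 * X[:, 3]
     + rng.normal(0, 0.1, 1000))                     # target uses only 0-3

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Rank features by impurity-based importance; informative ones rise to the top.
ranking = np.argsort(forest.feature_importances_)[::-1]
print("top-ranked features:", ranking[:4])
```

On real survey data the interesting outcome is exactly what Longo notes: the features the model ranks highest are often not the ones an expert would have chosen.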
As I said, we are just at the beginning of the story and in the coming years, I am sure that this approach will really lead to outstanding results by fully exploiting the complexity of data.
ZIERLER: Even with all of the advances in machine learning, are you concerned that we're still missing important signals, just given the sheer size of the data?
LONGO: Let me put things like this: it depends on what you're looking for. Because deep learning is something where you are teaching the machine what you want to see. Therefore, the machine becomes very good at finding what you are asking for. To give you an example, with a student of mine, Michele delli Veneri, who has now become a colleague because he just finished his PhD and is in his first post-doc, we have been working for the last three years on analyzing the simulated data cubes for SKA, basically. We're using ALMA (Atacama Large Millimeter Array) data at the moment, but the idea is to prepare for SKA. In order to train the deep-learning algorithm, you have to run simulations. You train the algorithm on simulated data. Therefore, somehow, you are putting in what you want to see. In some sense, depending on the question you ask, there's still a lot of stuff in the real data that you don't detect or see unless you are aware of it.
ZIERLER: It's also difficult because you don't know what you don't know.
LONGO: Yes. Donald Rumsfeld's famous epistemology: "…known unknowns, unknown unknowns," I guess. That's that. But there are techniques like outlier detection, which allow you to look for signals which are unexpected or outside of the mainstream. But again, you need to look for the outliers. Most of the applications so far - this is also why I think we're still in the infancy of what we can do with these methods - don't do it. For instance, people know that the theory tells them that there must be a morphological evolution of the galaxies with redshift, and then they train the network to measure the morphology for hundreds, or millions, or billions of objects. But basically, the network has learned how to do what you taught it to do. However, not all objects are properly classified. Sometimes this is due to noise, wrong observations, or artifacts in the data but, among these objects, very likely, there are also new objects, objects which have different properties. But if you're not looking specifically for them, you will never notice they're there.
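The outlier-detection idea Longo mentions - deliberately searching for objects that fall outside the bulk of the population, rather than only for what the network was trained to find - might be sketched like this. The data and thresholds are synthetic assumptions, not a real catalog.

```python
# Outlier detection sketch: flag objects whose features fall outside the bulk
# of the population. Synthetic "catalog": 990 ordinary objects plus 10 oddballs.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(2)
ordinary = rng.normal(0.0, 1.0, (990, 5))            # the main population
oddballs = rng.normal(8.0, 1.0, (10, 5))             # far from the bulk
features = np.vstack([ordinary, oddballs])

detector = IsolationForest(contamination=0.01, random_state=0)
labels = detector.fit_predict(features)              # -1 marks an outlier

flagged = np.where(labels == -1)[0]
print(f"{len(flagged)} objects flagged for inspection")
```

Objects flagged this way become candidates for human inspection - precisely the kind of unexpected sources a task-trained network would otherwise misfile as noise.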
ZIERLER: To flip the question around, for all of the things that machine learning has made possible to discover, what's been most exciting for you? What do we know about the universe now as a result of these advances?
LONGO: To tell you the truth, I think that in observational cosmology, most of the advances that have taken place over the last 10 years, in one way or another, are related to machine learning, since precision cosmology requires data for a huge number of objects which could never have been derived without machine learning. We have a much better understanding of galaxy evolution at high redshift. We also have a much better understanding of the temporal behavior of sources, but still, we are in that 85% we discussed a little earlier. If I should tell you what brand-new things machine learning has done, they're still minor things. The best is to come. I have no doubts about that. Because we are just beginning to do it.
ZIERLER: Do you see LIGO, and Virgo, and the detection of gravitational waves as a machine-learning success story?
LONGO: I don't know much about the data reduction of LIGO. But I think that machine learning, at least in the detection of the first gravitational wave, didn't play a big role. I know colleagues in my department who belong to the Virgo collaboration who are beginning, together with colleagues, to apply, in a systematic way, neural networks to Virgo data only now.
ZIERLER: Now, the way that you say we're just getting started, there's so much discovery ahead of us, what role do you see for machine learning in some of the most vexing challenges in astronomy and astrophysics, like dark energy, dark matter, the Big Bang? What are the possibilities there?
LONGO: Again, 85% support. In order to constrain dark matter or dark energy, you need very high-precision data, and this can come only from huge samples. For instance, now, we have the Euclid satellite, which I think will play an important role in measuring the distribution and the properties of dark matter and dark energy in the intermediate universe, or intermediate redshift. There, you cannot do weak lensing without the distances of hundreds of millions of objects. Most of the pipeline for the data reduction of Euclid would not be possible without machine learning. But if we talk of precision science or new science coming out of machine learning, as I said, it takes time to pick the right method, to test it, to explore all its implications, and so on.
And time is what truly affects the vast majority of the applications of artificial intelligence which surround us in our everyday life. This is also what worries me about the hype around artificial intelligence in the world, to tell you the truth. In most real-life scenarios, these algorithms are taken blindly off the shelf, implemented, tested on uncontrolled and often bad data, and applied. Since they're quite plastic, I would say that 90 to 95% of the time, they're right. But it is the remaining 5 or 10% which is incredibly dangerous.
Big Data for All Science
ZIERLER: Tell me about how you got involved in bringing data science to fields beyond astronomy and astrophysics. You mentioned geology, biology. What opportunities did you see in that regard?
LONGO: Out of sheer curiosity. You teach a network using data. At the beginning, especially when I began to understand how to use these methods, I was looking for data; if you had given me a phone book, I would have applied machine learning to it. In doing so, I realized that there are many similarities. In fact, at the Astroinformatics meetings, we always paid a lot of attention to methodological transfer, because machine learning is little more than statistics on steroids. That's the point: it is a new way to derive useful information from incomplete and very large datasets. Just as there is no such thing as statistics for biologists, there is no such thing as machine learning for astrophysics, or machine learning for geology, or machine learning for biology. The same tools we use in astronomy can be applied in bioinformatics and in geoinformatics. Obviously, there's some fine-tuning to do, because the problems are slightly different and the data have different properties. But you use the same ideas and methods, encoders or transformers for instance, in all fields, with very small differences.
We are using the same auto-encoder to extract the significant features for galaxy morphology and to extract the stylistic features from the paintings of Vermeer. We use the same method, the same machine. We do not change one thing. In one case, we feed it images of galaxies; in another case, we feed it images of Vermeer; and in yet another case, we may feed it images of breast cancer.
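[Longo's "same machine, different images" point can be illustrated with a toy autoencoder: the code that compresses the data to a few latent features and reconstructs it never needs to know whether the rows depict galaxies, paintings, or medical scans. This is a hypothetical sketch in plain NumPy with synthetic 16-pixel "images", not the actual auto-encoder his group uses.]

```python
import numpy as np

def train_autoencoder(X, n_hidden=2, lr=0.01, epochs=500, seed=0):
    """Tiny linear autoencoder: compress X (n_samples x n_features)
    into n_hidden latent features and reconstruct it by gradient
    descent on the squared reconstruction error. The code is agnostic
    to what the rows depict: galaxies, Vermeers, or mammograms."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    W_enc = rng.normal(0, 0.1, (n_features, n_hidden))
    W_dec = rng.normal(0, 0.1, (n_hidden, n_features))
    for _ in range(epochs):
        Z = X @ W_enc                  # encode to latent features
        err = Z @ W_dec - X            # reconstruction error
        # gradient steps for decoder and encoder weights
        W_dec -= lr * Z.T @ err / len(X)
        W_enc -= lr * X.T @ (err @ W_dec.T) / len(X)
    final_loss = ((X @ W_enc @ W_dec - X) ** 2).mean()
    return W_enc, final_loss

rng = np.random.default_rng(1)
# stand-in "images": 100 samples of 16 pixels driven by 2 latent factors
latent = rng.normal(size=(100, 2))
images = latent @ rng.normal(size=(2, 16))
features, loss = train_autoencoder(images)
```

[After training, `loss` is far below the variance of the raw pixels: the network has discovered the two latent features that generate the data, which is exactly the feature-extraction step Longo reuses unchanged across domains.]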
Then, everything lies in the way you do the analysis and interpretation of the results. In my opinion, if you have a fascination for machine learning and data science, it's like knowing statistics and applying it to whatever problem, in that moment, makes you most curious. You can use statistics to analyze the weather, political elections, and so on.
ZIERLER: Let me ask you this. Among the different disciplines, the domains, as you call them, were there particular groups that were more amenable to these methodologies than others? Were biologists faster or slower than the geologists to appreciate it?
LONGO: Now, I run the risk of being politically incorrect but, as George knows well, like most Italians, I believe that being politically incorrect is at the basis of democracy and social growth. Obviously, I judge from my point of view. Let me put it like this. Astronomers had some resistance, but nothing in comparison to what you encounter in other fields. Somehow, astrophysicists and physicists are used to delegating some part of the discovery process to mathematics. In using machine learning, astronomers, even the old-fashioned ones, have some sort of uneasiness at the beginning, but when they look at the results and perform some statistical tests, they trust them. In other domains, it is not so. In other domains, people have the preconceived idea that machines can never do better than a human being, and no evidence can convince them.
Take the case of medicine, where two factors are competing. On the one hand, there are the companies that work in healthcare, which basically are pushing for the adoption of these new methodologies, for instance, for X-rays and for tomographic radiography: new types of sensors, new types of devices, which are largely based on the adoption of machine learning.
But on the other hand, there is the community of medical doctors, who, at least in Italy, know little or no mathematics, not to mention linear algebra or advanced statistics, and have not the slightest clue of what these methods are about or how they work. Then you really get a lot of resistance. This is slowing down a lot of the progress in the field of medical applications of machine learning. At the moment, for instance, we are trying to convince a hospital to use machine learning for the analysis of epileptic seizures. Not so much for detection, but to perform a sort of classification of patients as a function of the type of disorder they have, in order to optimize the treatment. It works well, but it is not easy.
Another huge problem in the medical domain is the privacy constraint, which I understand from the point of view of human and individual rights. But, for instance, in the whole of Italy, at some point, there were only a few hundred patients who had granted permission to use their radiographies, their X-rays of the breast, for testing algorithms. This is far too few, and medical doctors, at least in my country, do not understand the importance of asking a patient, "Do you give me permission to use your data? We anonymize it and everything." They just do not understand the relevance. When you try to apply these techniques in fields outside the hard sciences, a lot of time is spent convincing the domain experts that it is worth doing.
ZIERLER: For you, after spending so much of your career in a fundamental discipline, learning about astronomy and astrophysics, what have been some of the greatest pleasures in advancing machine learning in more applied areas, where it helps people?
LONGO: You hit the right spot, even though it is a complicated answer. For sure, I love astronomy; I'm fascinated by astronomy. I understand the whole Neil deGrasse Tyson approach to astronomy, that it's important to know your role in the universe, and so on. But still, in some sense, when you see the war in Ukraine, when you see people dying of starvation in Africa, when you look at what is happening with pollution, climate change, etc., when you see all these terrible problems, you end up asking yourself, "Well, am I not using my life for something a little too far away from the real world in which I live?" I think that this played a role. The fact that I could use my curiosity, because I'm a curiosity-driven person, and my energies to do something that would have a practical impact played a role. Also, there is a sort of personal satisfaction, and please don't use this against me, but I think one of the greatest fascinations of science is that it basically matches the "nerdish video game attitude" which is inside all of us. Allow me to use a metaphor: science is like a fantastic video game. You play a game, in this case you run an experiment, you get a result, and you are satisfied by the fact that you have won or, to stick to the metaphor, that you have advanced one level. That is the gratification, the reward mechanism. When I apply machine learning to a new problem and I see that it is working, it is extremely gratifying for me. That really gives me huge pleasure.
ZIERLER: Because, as you say, the field is really just getting started, we're only beginning to appreciate what machine learning can do for astronomy and astrophysics, is that to say that the infrastructure or the framework of the National Virtual Observatory needs updating?
LONGO: The Virtual Observatory, I am afraid to say, is only the big-data part of the story. Basically, it is a federation of databases and, as such, it did a great job. The machine-learning part, however, is more in the hands of users distributed all over the world. One of the big mistakes which was made at the beginning of this field, and still is made in physics, is that physicists always have the feeling that they can do everything, in the sense that they're the best scientists, they are the ones who explore the mysteries of the universe, and so on. And sometimes, they are. This, however, leads to a bad attitude. "I read in a paper that machine learning is very good at solving a given problem," the professor tells the PhD student. "Use machine learning to classify the shapes of galaxies." The students usually have little training in statistics and go to a repository from which they download software, and on GitHub, you can find thousands of these codes. They install it on their computer, run it, get some results, and think they're doing machine learning.
This, I would say, rather arrogant approach very often leads to miserable results. What I'm saying is that experts in machine learning are experts in machine learning. They cannot be astronomers. They cannot be physicists, they cannot be medical doctors, they cannot be biologists. They're what we call data scientists. They're people who have professional training. They do a master's, they do a PhD, they follow the literature in the field regularly, and they keep active all their life in that field. An astronomer cannot do that. An astronomer can be a practitioner. A physicist can be a practitioner. A biologist can be a practitioner, someone who is capable of interpreting the results which are provided by the data scientist. Basically, what I'm trying to say is that if you want to get truly reliable, robust results, you cannot do it by yourself. You need to be in an interdisciplinary team, where you have a data scientist who understands the problem and knows about the domain of application, and the domain expert, in this case, the astronomer, who is capable of explaining the problem to the data scientist. But they're two different professional fields. And I think that this is one of the main limitations we need to overcome.
One needs to keep in mind, however, that machine learning is a recent thing, at least usable machine learning. It basically started in the new millennium, when big data and GPUs came along. Therefore, it is rather normal that there's still a lot of confusion in all disciplines on how to make the best of it. And only now are we beginning to have people who have been trained as data scientists, who have all the needed background in linear algebra, statistics, and computation, and who can interact proficiently with the domain expert. Through this transition period, the domain experts are beginning to understand that this type of interaction is needed. But I think that as time progresses, the problem will be solved.
ZIERLER: For the last part of our talk, I'd like to ask a few questions about where we are currently, and then we shall end looking to the future. One aspect we haven't covered yet is your role as a professor, as a mentor to students. Students as the up-and-coming generation, generationally, have grown up with computers in a way perhaps you did not. What are you seeing in terms of their interests and skillsets that are informing how you understand where these things are headed?
LONGO: I'm 66. I live with my students, basically. I am not kidding. I spend most of my time with my students. I love them. There are good things and bad things. At least in Europe, but I think this is also true in the United States, therefore let me speak generally. During the last two years, remote teaching has deeply damaged the level of preparation of the average student. This generation comes out of high school or out of a bachelor's degree with a much lower level of understanding. I think the didactic interaction is crucial, if not with the professor, at least among your peers. Speaking in general, however, I would say that the new generation is much more at ease with computers and with the fact that they can play a very important role. And they are much faster in grasping the details of how to write a program, or how to code. They're much faster. On the other hand, they don't have the critical mindset, which was much more present in the previous generations. On average, they tend to treat things as much easier than they actually are. I don't know how to explain it better. They're smart. On average, they're at least as smart as they used to be in the past, but they're definitely different and less prone to going in depth. It's like the internet: wider, but much more superficial. And when they need to go deeper, usually they have problems.
Prognosticating the Role of Artificial Intelligence
ZIERLER: At least in the United States, every day, there's more and more news about ChatGPT and OpenAI. What are you seeing in Italy, what are some of the concerns you have, and what may be some of the major public misperceptions about what this technology is?
LONGO: Well, ChatGPT came out very recently. I know there has been an explosion in Italy also, even though there is a little difference. Since, again, the performance of these methods depends on the amount of data you use for training, ChatGPT performs incredibly well in English, but as soon as you go to another language, the performance drops a lot. For instance, if you ask it to write you an essay on Newton's laws in English, you'll get outstanding results. If you ask it to do it in Italian, or even worse, in German, the level is much lower. The training data were not large enough to allow the algorithm to become as proficient in Italian as it is in English. In Italy, a professor can still recognize a text written by ChatGPT from one written by a student, while in English, I'm afraid, very often it is quite difficult because the outcome is almost perfect. I'm impressed by it.
How do I see it? It's a new tool. I am not scared by it. I always go back to the fact that when printing with movable type was invented by Gutenberg, many intellectuals of the time screamed that it would mark the end of human culture, because by delegating facts to books, humans would lose their memory. That is a typical human reaction any time something potentially disruptive is introduced. ChatGPT is a new, extraordinary tool, because it is extraordinary, I must admit. I'm fascinated by it. But it is just a new tool. I'm using it in my lectures. I'm asking my students, "Ask ChatGPT to write an essay of 2,000 characters on these topics," and they bring it to class, and we compare them and discuss them. It's a new tool we must learn how to use. It's not going to cause havoc, like most of these things.
I'm more worried about the fake news produced by some of these methods. You had the experience in the United States during the Trump era; we had the experience in Europe with Cambridge Analytica and its role in Brexit. The tools have no ethics, and that, basically, is where the problem with artificial intelligence lies. It's not in the algorithms; it's in the people who use the algorithms. The problem, in the end, is always the human being.
ZIERLER: That's good perspective to have.
LONGO: For me, every technological advancement is an advancement. Physicists invented nuclear fission, then somebody invented the atomic bomb and used it to blow up human beings. The problem is not in knowledge but in ethics.
ZIERLER: Finally, looking to the future, for all of your accomplishments in astronomy and astrophysics, all of your accomplishments in data science, and all the ways that you've combined those expertises for new areas of science, what's left for you to do? What have you not taken on yet that you want to for however long you want to remain active?
Longing to Return to Full-Time Study
LONGO: Once I retire, I want to have the time to study, which is something I have not been doing as I used to. In the last 20 years, the profession of scientist has changed. When I was a kid, a scientist could spend even a year publishing just one paper, but that paper was worth reading. In recent years, the mantra has become "publish or perish." In order to be funded, you have to publish a lot. As a result, a large fraction of what is published today is not worth reading. If you want to do quality work, you cannot produce 10-15 papers per year and, for sure, regardless of who you are, you have no time to study. What I really want to accomplish in the rest of my life is to be able to go back and study, without the pressure of publishing or writing proposals for funding.
ZIERLER: You want to be a student again.
LONGO: Yeah. I want to understand things in much more depth than I have in the last few years. And there are aspects of machine learning, for instance, by which I'm really fascinated. I'm sure I will devote the rest of my years to simple problems such as understanding how information is represented inside the layers of a deep neural network, what the real meaning is of the features identified as the most relevant by a neural network, what role is played by entropy in the models, etc. It's something related to the inner workings of deep learning. There is one problem, though. Finally, I'll have the time to do it, but I will no longer have the brain I had 30 years ago. In some sense, it's a lost game to begin with. But at least I will enjoy it.
ZIERLER: But it indicates your love of learning. You'll do it anyway.
LONGO: Yeah, definitely. Knowledge is the only gratification that I think is worth pursuing. Knowledge, I think, is the thing which makes the difference. Let me finish with two verses from the "Divina Commedia," written by one of the greatest intellectuals who ever existed, Dante Alighieri, in the early decades of the XIV century: "…Nati non foste per viver come bruti, ma per seguir virtute e conoscenza" (you were not made to live as brutes, but to follow virtue and knowledge). I believe this is what defines our role as a species on this planet.
ZIERLER: Giuseppe, it's been a great pleasure spending this time with you. I'm so glad we were able to do this. Thank you so much.
LONGO: It was my pleasure, really.
[END]

