On January 11, 2007, Jim Gray, in his last talk to the Computer Science and Telecommunications Board (CSTB)[2], presented his vision of what he called “the fourth paradigm of scientific research”, whereby researchers start by collecting massive amounts of data and analyze them afterwards. In fact, Jim advocated a new scientific methodology built on the power of data-intensive science. And some five and a half years later, this vision came true when physicists working at the Large Hadron Collider (LHC) at the CERN laboratory near Geneva announced the discovery of the Higgs boson, on July 4, 2012, during a seminar at the lab.
The Higgs boson adventure started back in 1964, when, independently and almost simultaneously, three groups of physicists hypothesized the existence of the Higgs boson in order to explain, within the Standard Model, the lack of symmetry between the particles that carry the electromagnetic interaction (photons, whose mass is zero) and those that carry the weak interaction (the W and Z bosons). This lack of symmetry hindered the unification of these two forces within the same theoretical framework, named the electroweak theory.
Without going into further details about the Higgs boson and all its theoretical background, it is important to understand that the Higgs particle is very difficult to find because (1) it is very hard to detect (indeed, it has no spin, electric charge, or color charge), and (2) it is very unstable, decaying into other particles almost immediately (for a Higgs boson with a mass of 126 GeV/c², the Standard Model predicts a mean lifetime of about 1.6×10⁻²² seconds[3]).
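As an aside (my own back-of-the-envelope calculation, not in the original text), this tiny lifetime follows from the Standard Model decay width through the uncertainty relation; taking the commonly cited theoretical width Γ ≈ 4.1 MeV for a Higgs boson of that mass:

$$\tau = \frac{\hbar}{\Gamma} \approx \frac{6.58 \times 10^{-22}\ \mathrm{MeV \cdot s}}{4.1\ \mathrm{MeV}} \approx 1.6 \times 10^{-22}\ \mathrm{s}$$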
So, as you can imagine, searching for the Higgs boson is not a trivial endeavor. You have to accelerate two beams of particles to very high energies and make them collide inside a particle detector. And if you are lucky enough, a Higgs boson will be produced. But, because the Higgs boson decays almost instantaneously, it is not possible to detect it directly; therefore, you have to place highly sensitive detectors that look for the signatures of the several ways the Higgs boson might have decayed. “If the observed decay products match a possible decay channel of the Higgs boson, this indicates that a Higgs boson may have been created. In practice, there are many processes that may produce similar decay signatures. Fortunately, the Standard Model precisely predicts the likelihood of each process occurring. So, if the detector detects more decay signatures that could have been a Higgs boson than are predicted by the Standard Model assuming that there is no Higgs boson, then this is strong evidence that the Higgs boson exists.[4]”
“Given that the Higgs boson production in a particle collision is a very rare event (1 in 10 billion at the LHC) and that there are a lot of other possible collision events with similar decay signature, the data of hundreds of trillions of collisions needs to be analyzed before a conclusion about the existence of the Higgs boson can be reached. To conclude that a new particle has been found, particle physicists require that the statistical analysis of the data of two independent particle detectors each indicate that there is less than a 1 in a million chance that the observed decay signatures are due to just background Standard Model events (i.e. that the observed number of events is more than 5 standard deviations (sigma) away from the expectation if there was no new particle). By accumulating more collision data, the physical properties of the new particle may be inferred, telling us if the observed particle is indeed the Higgs boson as described by the Standard Model or some other hypothetical new particle.[5]”
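To make the quoted statistics concrete, here is a minimal sketch of what the 5-sigma discovery criterion means numerically. This is my own illustration, not code from the experiments, and the event counts in the toy example are purely hypothetical:

```python
# Minimal sketch of the 5-sigma discovery criterion (illustrative only).
from scipy.stats import norm, poisson

# One-sided p-value corresponding to a 5-sigma excess: the chance that
# background alone fluctuates this high is well below 1 in a million.
p_5sigma = norm.sf(5)  # survival function: P(Z > 5)
print(f"5-sigma p-value: {p_5sigma:.2e}")  # ~2.9e-07

# Toy counting experiment: hypothetical background expectation b and
# observed count n in a signal region (numbers invented for illustration).
b, n = 10_000, 10_550
p_obs = poisson.sf(n - 1, b)  # P(N >= n | background only)
sigma = (n - b) / b**0.5      # naive significance: excess over sqrt(b)
print(f"Excess of {sigma:.1f} sigma, p-value {p_obs:.2e}")
```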
This is why the Fourth Paradigm and the power of data-intensive science are worth mentioning here: there is indeed a lot of data to analyze. So let’s have a look at this Fourth Paradigm.
The Fourth Paradigm
In his talk to the CSTB, Jim Gray introduced the four paradigms that have made science what it is today. The first paradigm was prevalent a thousand years ago: at that time, science was mostly empirical, describing natural phenomena; it was based on experiment and measurement, and from their observations scientists defined models in terms of logical, physical or mathematical representations. Then appeared the second paradigm, which introduced theoretical science, with Kepler’s Laws, Newton’s Laws of Motion, Maxwell’s equations, and so on. The goal was to break an observation or theory down into simpler concepts in order to understand it. “Then, for many problems, the theoretical models grew too complicated to solve analytically, and people had to start simulating. These simulations have carried us through much of the last half of the last millennium[6]”. This became the third paradigm, in which one attempts to simulate an abstract model of a particular system. Then, in recent decades, came the fourth paradigm: “The world of science has changed, and there is no question about this. The new model requires data to be captured by instruments or generated by simulations before being processed by software and for the resulting information or knowledge to be stored in computers. Scientists only get to look at their data fairly late in this pipeline. The techniques and technologies for such data-intensive science are so different that it is worth distinguishing data-intensive science from computational science as a new, fourth paradigm for scientific exploration.[7]”
Some words about Jim Gray
I had the pleasure of some email interactions with Jim a long time ago, when we were both working for Digital Equipment Corporation. I also had the pleasure of interacting with him when he came to Paris in July 2004 to be awarded an honoris causa doctorate from Paris-Dauphine University. Jim joined Microsoft in 1995 as a Senior Researcher, later becoming a Technical Fellow and managing the Bay Area Research Center (BARC). Initially, Jim’s research interests were focused on large databases and transaction processing systems. In 1998, he received the ACM A.M. Turing Award, the most prestigious award in computer science, for “seminal contributions to database and transaction processing research and technical leadership in system implementation.”
Then, after 2002, he started to focus on eScience, trying to apply computer science to solving data-intensive scientific problems. Hence his deep contribution to the WorldWide Telescope, which lets anyone explore the universe by bringing together imagery from the best ground- and space-based telescopes in the world and combining it with 3D navigation, and to the concept of the fourth paradigm.
On January 28, 2007, Jim took his sailboat out for a trip to the Farallon Islands. He has been lost at sea ever since, and was declared deceased on May 16, 2012.
Following Jim’s passing, Microsoft Chairman Bill Gates summed up his legacy in the following way: “The impact of his thinking is continuing to get people to think in a new way about how data and software are redefining what it means to do science.” “Redefining what it means to do science”: this is what the fourth paradigm is all about.
The fourth paradigm: a way to achieve breakthroughs
As Jim stated in his last talk to the Computer Science and Telecommunications Board, science, through its first three paradigms, has been able to carry us to where we are in history. And it is almost certain that, if we continue to rely on those paradigms, we will continue to make incremental progress. But if we wish to achieve dramatic breakthroughs, we will need new approaches embracing the fourth paradigm. One element which is key to this fourth-paradigm approach is obviously the importance of data. We are looking here at “big data”. As mentioned by Gordon Bell in his foreword to the “Fourth Paradigm” book (freely available here): “In many instances, science is lagging behind the commercial world in the ability to infer meaning from data and take action based on that meaning. However, commerce is comparatively simple: things that can be described by a few numbers or a name are manufactured and then bought and sold. Scientific disciplines cannot easily be encapsulated in a few understandable numbers and names, and most scientific data does not have a high enough economic value to fuel more rapid development of scientific discovery.”
If we look at the LHC, the breakthrough of the Higgs boson discovery was achieved by producing over 300 trillion (3×10¹⁴) proton-proton collisions in the LHC, with an energy of up to 8 TeV, and analyzing all this information through a worldwide network of computing facilities. This represents a huge amount of data. According to CERN’s deputy head of IT, David Foster: “When the LHC is working, there are about 600 million collisions per second. But we only record here about one in 10¹³ (ten trillion). If you were to digitize all the information from a collision in a detector, it’s about a petabyte a second or a million gigabytes per second. There is a lot of filtering of the data that occurs within the 25 nanoseconds between each bunch crossing (of protons). Each experiment operates their own trigger farm – each consisting of several thousand machines – that conduct real-time electronics within the LHC. These trigger farms decide, for example, was this set of collisions interesting? Do I keep this data or not? The non-interesting event data is discarded, the interesting events go through a second filter or trigger farm of a few thousand more computers, also on-site at the experiment. [These computers] have a bit more time to do some initial reconstruction – looking at the data to decide if it’s interesting. Out of all of this comes a data stream of some few hundred megabytes to 1 GB per second that actually gets recorded in the CERN datacenter, the facility we call ‘Tier Zero’.[8]”
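As a back-of-envelope check (my arithmetic, using only the figures quoted above), the scale of the reduction performed by the trigger farms can be made explicit:

```python
# Back-of-envelope arithmetic on the data-rate figures quoted above.
raw_rate = 1e15                # ~1 petabyte/s if every collision were digitized
recorded_rate = 1e9            # ~1 GB/s actually written at Tier Zero
collisions_per_second = 600e6  # ~600 million collisions per second

# Average raw size of a single collision event (derived, not quoted).
bytes_per_collision = raw_rate / collisions_per_second
print(f"~{bytes_per_collision / 1e6:.1f} MB of raw data per collision")  # ~1.7 MB

# Overall reduction achieved by the two tiers of trigger farms.
print(f"Data reduction factor: ~{raw_rate / recorded_rate:.0e}")  # ~1e+06
```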
The Fourth Paradigm makes a big deal of analyzing data. Data analysis encompasses a broad range of activities throughout the workflow pipeline, including the use of databases (structured, semi-structured, or unstructured), analysis and modeling, data visualization in a multidimensional environment, and more.
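As a purely illustrative sketch of such a pipeline (synthetic data, no real instrument or CERN dataset involved), the capture–filter–analyze–visualize loop might look like this in Python:

```python
# Toy fourth-paradigm workflow: capture -> filter -> analyze -> visualize.
# The data is synthetic; this only illustrates the shape of the pipeline.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# "Capture": a smooth background plus a small, rare bump (the toy signal).
background = rng.normal(100, 15, 100_000)
signal = rng.normal(126, 2, 300)
events = np.concatenate([background, signal])

# "Filter": keep only a region of interest, as a trigger farm would.
roi = events[(events > 110) & (events < 140)]

# "Analyze and visualize": histogram the reduced data, look for an excess.
plt.hist(roi, bins=60)
plt.xlabel("reconstructed mass (arbitrary units)")
plt.ylabel("events")
plt.title("Toy bump hunt on synthetic data")
plt.show()
```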
So, the Fourth Paradigm is about data. But not only: the Fourth Paradigm is also about a multidisciplinary approach where people are able to assemble around the same “virtual table” – in the case of the Higgs boson: physicists, statisticians and mathematicians, information and computer scientists, database engineers, software developers, and more.
As mentioned in the Fourth Paradigm book: “At the heart of scientific computing in this age of the Fourth Paradigm is a need for scientists and computer scientists to work collaboratively—not in a superior/subordinate relationship, but as equals—with both communities fueling, enabling, and enriching our ability to make discoveries that can bring about productive and positive changes in our world”.
Fourth Paradigm: what’s next?
In the mid-1990s, Jim Gray envisioned that the next “big data” challenges for database technology would come from science and not from traditional commerce. One could argue that e-commerce activities, notably all the activities related to social networks, already present challenging “big data” scenarios in the commercial sector. Jim also zeroed in on the technical challenges that such data-intensive science would pose for scientists in their fields and on the key role that computer science could play in enabling future scientific discoveries. Today, we are starting to experiment with data-driven science. In this new world, it becomes feasible to unify theory, experiment and simulation using data exploration and data mining, whether this data is captured by instruments, generated by humans, by simulations or by sensor networks. This new data-driven science will enable us to embrace complex models using multidisciplinary interactions across wide temporal and spatial scales. Most of the time, that data will come from various teams working through distributed organizations within virtual communities.
Let’s try to build a non-exhaustive list of potential problems that might be solved through this approach:
- In the search for dark matter: “In astronomy and cosmology, dark matter is a type of matter hypothesized to account for a large part of the total mass in the universe. Dark matter cannot be seen directly with telescopes; evidently it neither emits nor absorbs light or other electromagnetic radiation at any significant level. Instead, its existence and properties are inferred from its gravitational effects on visible matter, radiation, and the large scale structure of the universe. Dark matter is estimated to constitute 84% of the matter in the universe and 23% of the mass-energy.[9]” Although there are other hypotheses, one of them is that the dark matter within our galaxy is made up of Weakly Interacting Massive Particles (WIMPs): with the right set of experiments and a big data approach, it might be possible to detect the potentially thousands of WIMPs that might pass through every square centimeter of the Earth each and every second;
- The Super-Kamiokande neutrino detector: this detector, which sits under Mount Kamioka in Japan, has been designed to “search for proton decay, study solar and atmospheric neutrinos, and keep watch for supernovae in the Milky Way Galaxy[10]”;
- Modeling the fusion reaction: “ITER (originally an acronym of International Thermonuclear Experimental Reactor) is an international nuclear fusion research and engineering project, which is currently building the world’s largest and most advanced experimental tokamak nuclear fusion reactor at the Cadarache facility in the south of France”[11]. Controlled thermonuclear fusion requires massive computational models; the ITER reactor is expected to require a 100-teraflop computing facility;
- SKA - Square Kilometer Array: “The Square Kilometre Array (SKA) is a radio telescope in development in Australia and South Africa which will have a total collecting area of approximately one square kilometer. It will operate over a wide range of frequencies and its size will make it 50 times more sensitive than any other radio instrument. It will require very high performance central computing engines and long-haul links with a capacity greater than the current global Internet traffic. It will be able to survey the sky more than ten thousand times faster than ever before”[12].
Obviously, this list is far from exhaustive. As a conclusion, let’s take some other examples from the “Fourth Paradigm” book:
“Data-intensive science promises breakthroughs across a broad spectrum. As the Earth becomes increasingly instrumented with low-cost, high-bandwidth sensors, we will gain a better understanding of our environment via a virtual, distributed whole-Earth “macroscope.” Similarly, the night sky is being brought closer with high-bandwidth, widely available data-visualization systems. This virtuous circle of computing technology and data access will help educate the public about our planet and the Universe at large—making us all participants in the experience of science and raising awareness of its immense benefit to everyone.
In healthcare, a shift to data-driven medicine will have an equally transformative impact. The ability to compute genomics and proteomics will become feasible on a personal scale, fundamentally changing how medicine is practiced. Medical data will be readily available in real time—tracked, benchmarked, and analyzed against our unique characteristics, ensuring that treatments are as personal as we are individual. Massive-scale data analytics will enable real-time tracking of disease and targeted responses to potential pandemics. Our virtual “macroscope” can now be used on ourselves, as well as on our planet. And all of these advances will help medicine scale to meet the needs of the more than 4 billion people who today lack even basic care.[13]”
Through the microscope, Anton van Leeuwenhoek was able to discover red blood cells, bacteria, and protozoa in 1683. However, he did not realize that microorganisms could cause disease. Later, in the 19th century, Louis Pasteur finally proved that microscopic organisms could cause disease.
We are now entering the 21st century at full speed; this century will be the Fourth Paradigm century. Using a “macroscope”, we will enter a world “where no one has gone before”[14]…
[3] “Observation of a new boson at a mass of 125 GeV with the CMS experiment at the LHC”. Physics Letters B 716 (1): 30–61. arXiv:1207.7235. doi:10.1016/j.physletb.2012.08.021
[6] Based on the transcript of a talk given by Jim Gray to the NRC-CSTB in Mountain View, CA, on January 11, 2007 - http://research.microsoft.com/en-us/collaboration/fourthparadigm/4th_paradigm_book_jim_gray_transcript.pdf.
[7] Idem.
[13] The Fourth Paradigm – The way forward, page 224.
[14] Star Trek introductory sequence.