Large Volumes of Data Are Challenging Open Science

The 'data explosion' of the past 20 years undermines a basic scientific principle, argues Geoffrey Boulton.

Open data and open science are not new concepts. Arguably they were introduced by Henry Oldenburg, the first secretary of the United Kingdom's newly created Royal Society in the 1660s.

Oldenburg frequently corresponded on scientific matters and persuaded the new society to publish the 'letters' he received – provided that a novel concept was accompanied by the evidence (the data) on which it depended.

Oldenburg's innovation ushered in an era of 'open science', meaning the concurrent publication of both concept and evidence. This allowed scientists to scrutinise each other's logic, and replicate or refute observations or experiments.

It ensured that science was 'self-correcting' and therefore cumulative, which was the foundation for the scientific revolutions of the eighteenth and nineteenth centuries.

Self-correction was recently exemplified when a beam of neutrinos fired from CERN (the European Organization for Nuclear Research) to a laboratory 730 kilometres away seemed to travel faster than the speed of light.

The detailed results were made openly available, resulting in the discovery of a timing error and a repeat experiment that respected the universal speed limit.

But the 'data explosion' of the past 20 years poses severe challenges to the principle of self-correction.

The volume and complexity of the data that can be acquired, stored and manipulated, coupled with ubiquitous technologies for instant communication, have created a flood of data: by one 2013 estimate, 90 per cent of the world's data had been generated in the preceding two years. [1]

Conventional, printed journals can no longer contain the data on which a paper they publish is based, which means losing the vital link between concept and evidence.

A consequence of this was highlighted in 2012, when attempts to replicate the findings of 53 benchmark papers in preclinical oncology succeeded in only 11 per cent of cases. [2] This was mainly because of missing data, in addition to the common cause of erroneous analysis.

The next scientific revolution?

The data explosion has also created unprecedented opportunities for scientific discovery that some have argued place us on the verge of another scientific revolution.

The opportunities lie in the potential of large volumes of complex data, integrated from different sources, to reveal deeper, previously hidden relationships in phenomena. For example, identifying varying patterns in the human genome carries great potential in areas such as personalised medicine.

But to seize the opportunities and address the problems created by the data explosion, scientists need to recognise the essential attributes of scientific openness. This is because openness in itself has no value unless it is, as a 2012 Royal Society report calls it, "intelligent openness". [3]

This means that published data should be accessible (can they be readily located?), intelligible (can they be understood?), assessable (can their source and reliability be evaluated?) and reusable (do the data have all the associated information required for reuse?).

Scientific claims and concepts that are published without making the underlying data available in ways that satisfy these norms are the equivalent of adverts for a product rather than the product itself.

The power of collaboration

In practice, open data depends on a willingness to share data, which can be both highly efficient and highly creative.

There was a powerful example of open data in action in May 2011, when a severe gastrointestinal infection spread rapidly from Hamburg, Germany. The laboratory dealing with the outbreak shared data and samples with bioinformatics groups on four continents, leading to assembly of the offending microorganism's genome within 24 hours. The analyses provided crucial information in time to help contain the outbreak.

An analogous example of the power of using many brains rather than one was an open collaboration led by mathematician Tim Gowers that solved a long-standing mathematical problem in about 30 days. Reflecting on the experience, Gowers said that "this process is to normal research as driving is to pushing a car". [4]

Given their obvious power, why are open collaboration and sharing uncommon? It is simply because the criteria for promoting and assessing the merit of scientists are based on independent work.

But if scientific understanding is a public good, these criteria need to change to recognise the generation of important new data, data sharing and novel means of collaboration.

Data for society

Publicly funded scientists need to regard the data that they acquire as held in trust on behalf of society and abandon the view that the data are theirs.

Funders of public research need to mandate open data as a condition of funding. Scientific publishers need to require concurrent publication of the data on which articles are based in electronic databases.

And universities and research institutes need to recognise the importance of intelligent openness to the future of science.

This form of open science has benefits other than efficiency. Increasing numbers of citizens are no longer willing to accept scientists' conclusions about important issues that affect them or society. They should be able to scrutinise the evidence behind scientific claims.

The integration of data also offers opportunities, especially for developing countries.

For example, bringing data on soil types together with readily available satellite-based means of assessing land surface conditions can improve soil fertility monitoring. The exploitation of such opportunities depends on developing data science skills locally, and linking the data to policymakers.

Science's capacity to monitor and integrate observations from an increasing variety of natural and anthropogenic processes – from those related to infectious disease to those governing the global climate system – is growing.

Open data sharing can help ensure that this process is efficient and speedy, and that the resulting knowledge is readily available to those who need it.

Geoffrey Boulton is Regius professor of geology emeritus at the University of Edinburgh, United Kingdom. He can be contacted at g.boulton@ed.ac.uk

References

[1] Big Data, for better or worse: 90% of world's data generated over last two years (ScienceDaily, 2013)

[2] C. Glenn Begley and Lee M. Ellis. Drug development: Raise standards for preclinical cancer research (Nature, 2012)

[3] Science as an open enterprise (The Royal Society, June 2012)

[4] Tim Gowers. Comment on the blog: Questions of procedure (WordPress.com, 2 February 2009)

Credit: Scidev.net