Life Expectancy at Birth: the Numbers Behind the Means

Datasets spanning long periods of time are crucial to our understanding of a range of phenomena, both within and outside social and economic history. Various processes, from social inequality to climate change, change only gradually, so that long periods of time are required to observe and explain change. Moreover, data spanning longer periods of time help us to unravel causality issues in the processes we study. Finally, time series data help us to put contemporary phenomena in perspective.

The dataset discussed in this article featured in the OECD's 'How was life?' publication, 1 where it was used to describe and explain changes in health across the globe for nearly 200 years. More specifically, the dataset describes period life expectancy at birth: the number of years a newborn would live if the circumstances at the time of birth were to remain the same throughout life.
As this is a Clio Infra dataset, the ultimate goal is to cover the globe for the period ca. 1500-2000. At the moment, however, the data range from 1543 (for the UK) to 2010 for nearly all countries in the world. The further back in time, the more limited the number of countries for which data are available. Moreover, for many countries life expectancy rates are only available after 1950. The data were gathered between September 2013 and April 2014. The dataset is stored at the Dataverse instance of the International Institute of Social History (http://datasets.socialhistory.org) and is available via a persistent identifier: http://hdl.handle.net/10622/LKYT53.
The OECD article by Zijdeman and Ribeiro da Silva (2014) provides quite detailed information on which sources were used in case of overlapping sources. 2 In total, data from seven different data providers were used:

Figure 2. Absolute differences in total life expectancy at birth between two sources for a number of Western countries
Where data sources overlapped, the main principle was to favour datasets that spanned longer periods of time, for reasons of consistency. The data are accompanied by a codebook providing an R script that shows exactly how the data were derived. In addition to the article and codebook, Table 1 in the Appendix provides an overview of the sources that were used for all countries and time periods in the dataset.
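The principle of favouring the longest-spanning source can be illustrated with a minimal sketch. This is not the R script from the codebook: the records, the source names and the function names below are hypothetical, and the sketch assumes that "longest" is measured per country as the distance between a source's first and last observation.

```python
from collections import defaultdict

# Hypothetical records: (country, year, source, life expectancy).
records = [
    ("A", 1950, "short_series", 65.0),
    ("A", 1960, "short_series", 68.0),
    ("A", 1970, "short_series", 70.1),
    ("A", 1900, "long_series", 48.0),
    ("A", 1950, "long_series", 65.4),
    ("A", 1960, "long_series", 68.3),
]

def source_spans(rows):
    """Return {(country, source): years between first and last observation}."""
    years = defaultdict(list)
    for country, year, source, _ in rows:
        years[(country, source)].append(year)
    return {key: max(v) - min(v) for key, v in years.items()}

def resolve_overlaps(rows):
    """Keep one value per country-year, preferring the longest-spanning source."""
    spans = source_spans(rows)
    best = {}
    for country, year, source, value in rows:
        key = (country, year)
        current = best.get(key)
        if current is None or spans[(country, source)] > spans[(country, current[0])]:
            best[key] = (source, value)
    return best

resolved = resolve_overlaps(records)
# The overlapping years 1950 and 1960 are taken from "long_series";
# 1970 is only covered by "short_series" and is therefore kept from there.
```

Keeping the source label next to each retained value, as `resolved` does here, is what allows a later robustness check on the choice of source.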
A first issue with long time series data is that the use of multiple sources, needed to construct such long time series, is obscured. Sources are seldom uniform in the way they acquired their data. Variation in source data can range from differences in data acquisition methods (e.g. census takers registering data in 'their own way') to different ways of measuring the concept at hand (different instructions). Obviously, such differences could lead to biases over time as well as between different regional units (countries).
Users of datasets are seldom presented with information on the use (and consequences) of different sources directly inside the dataset. At best, there are some notes in a codebook, but this would require the researcher herself to create variables for robustness checks, a quite cumbersome task judging by the size of Table 1 in the Appendix. Research journals also seldom allow for more rigorous notes on data gathering due to space restrictions. As a result, users need to rely on a combination of the codebook and different patterns in the data, such as the change around 1840 in the UK in Figure 1.
As an illustration of how different sources may relate to one another, Figure 2 presents data on life expectancy after World War II from two of the biggest data agencies, the Organisation for Economic Co-operation and Development (OECD) and the World Bank. A value of zero on the y-axis indicates that there is no difference in life expectancy between the OECD and the World Bank, while a positive value indicates that the OECD has a higher estimated life expectancy than the World Bank. Reassuringly, for most countries we see hardly any differences between the two sources, nor systematic changes over time. However, for a number of countries there are differences, most notably for Greece, Israel, Mexico and Turkey. Researchers with a particular focus on any of these countries might want to do robustness checks on the use of the OECD and World Bank data.
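The comparison behind such a figure amounts to subtracting one source from the other for every country-year that both sources cover. The following is a minimal sketch of that calculation; the values, the country labels and the one-year threshold for flagging a country are hypothetical and serve only as an illustration.

```python
# Hypothetical values; keys are (country, year).
oecd = {
    ("X", 1960): 70.0, ("X", 1970): 72.0,
    ("Y", 1960): 60.5, ("Y", 1970): 64.0,
}
world_bank = {
    ("X", 1960): 70.0, ("X", 1970): 71.9,
    ("Y", 1960): 58.0, ("Y", 1970): 62.5,
    ("Z", 1970): 55.0,  # only in one source, so it cannot be compared
}

def source_differences(a, b):
    """Return {(country, year): a - b} for country-years present in both sources."""
    return {key: a[key] - b[key] for key in a.keys() & b.keys()}

def countries_to_check(diffs, threshold=1.0):
    """Countries whose largest absolute difference exceeds the threshold (in years)."""
    worst = {}
    for (country, _), d in diffs.items():
        worst[country] = max(worst.get(country, 0.0), abs(d))
    return sorted(c for c, w in worst.items() if w > threshold)

diffs = source_differences(oecd, world_bank)    # positive: first source is higher
flagged = countries_to_check(diffs)             # candidates for a robustness check
```

With these made-up numbers only country "Y" is flagged, since the two sources differ by more than one year there.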
Another instance of multiple data sources causing irregularities in data patterns is the use of complementary datasets. In the life expectancy dataset, data sources complement each other mostly over time, which may lead to sudden 'shifts' in levels of life expectancy, for example because of differences in registration methods.
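One simple way to look for such shifts is to inspect the change in life expectancy at exactly those points where one source hands over to the next. The sketch below assumes a single country's series sorted by year, with a source label per observation; the data and the function name are hypothetical.

```python
def jumps_at_source_changes(series):
    """Return (year, old source, new source, change) at each change of source.

    `series` is a list of (year, source, value) tuples sorted by year.
    """
    jumps = []
    for (_, s0, v0), (y1, s1, v1) in zip(series, series[1:]):
        if s0 != s1:
            jumps.append((y1, s0, s1, v1 - v0))
    return jumps

# Hypothetical series for one country.
series = [
    (1948, "historical", 66.0),
    (1949, "historical", 66.5),
    (1950, "agency", 69.5),
    (1951, "agency", 70.0),
]

jumps = jumps_at_source_changes(series)
# Here the handover in 1950 coincides with a rise of 3.0 years,
# against 0.5 years in the adjacent within-source steps.
```

A jump that is large relative to the within-source changes around it does not prove a source effect, but it marks a point worth checking against the codebook.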
Figure 3 shows how the OECD data are complemented by data from various sources. Each line represents a different data source. Overall, the patterns appear to be surprisingly homogeneous across datasets. Moreover, historical datasets that extend beyond the 1950s again show levels of life expectancy remarkably similar to those reported by the OECD. Thus, while a potential risk, differences between temporally complementary sources do not appear to affect the life expectancy data too much.
Data sources that are regionally aggregated may also distort genuine patterns in life expectancy. This could be the result of how the data were constructed (e.g. multiple sources being combined into country-level data), but also of the way the data are presented. Given the number of countries for which data are available, it is often tempting to aggregate data to some supra-national level in order to make the data more 'comprehensive'. Zijdeman and Ribeiro da Silva (2014) did so as well, aggregating the data to 8 major world regions and visualizing changes in life expectancy by lines, each representing a single major world region.
There are two issues to remember when visualizing datasets such as the Clio Infra Life Expectancy dataset in an aggregated way. First, given the historical nature of the data, the mean is bound to be calculated from a different number of countries over time. The mean is thus more representative of later periods, and fluctuations in the mean can also be the result of the expanding number of countries from which the mean is calculated. Secondly, the mean by definition does not allow one to interpret the variation in scores across countries. Countries could be similar to each other, with values close to the mean, but the mean could also be less representative if some countries perform considerably better or worse than others.
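The first issue can be made concrete with a small constructed example in which the regional mean changes only because a second country enters the data. The countries and values below are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical region: country "A" is observed in both periods,
# country "B" only enters the data in the later period.
observations = [
    ("A", 1900, 50.0),
    ("A", 1950, 50.0),
    ("B", 1950, 40.0),
]

def mean_with_count(rows):
    """Return {period: (mean life expectancy, number of countries)}."""
    by_period = defaultdict(list)
    for _, period, value in rows:
        by_period[period].append(value)
    return {p: (mean(v), len(v)) for p, v in sorted(by_period.items())}

summary = mean_with_count(observations)
# summary == {1900: (50.0, 1), 1950: (45.0, 2)}
```

The regional mean falls from 50 to 45 years although life expectancy in country "A" did not change at all. Reporting the number of countries alongside each mean is the minimum needed to make such composition effects visible.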
Figures 4 and 5 provide alternative visualizations to the ones used by Zijdeman and Ribeiro da Silva (2014) for life expectancy at birth, for Western Europe and Sub-Saharan Africa respectively. Each figure consists of a jittered scatterplot overlaid by a boxplot. The scatterplot shows the average value of life expectancy by country. From the figures it is easy to see how the number of countries representing a region increases over time. The jitter function adds some random error to the position of each point in order to reduce overlap between data points (countries). The boxplots provide guidance on interpreting the variation in life expectancy across countries. The larger the box and/or the 'whiskers' above and below the box, the more variation there is and the less the mean is a proper representation of the countries at hand.
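The two ingredients of such a figure, the box statistics and the jitter, can be sketched as follows. This is an illustration under stated assumptions rather than the code behind Figures 4 and 5: the whiskers follow the common convention of 1.5 times the interquartile range, jitter is applied to the horizontal position only, and the values are hypothetical.

```python
import random
from statistics import quantiles

def box_stats(values):
    """Quartiles plus whisker ends under the 1.5 x IQR convention (an assumption)."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    inside = [v for v in values if q1 - 1.5 * iqr <= v <= q3 + 1.5 * iqr]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whisker_low": min(inside), "whisker_high": max(inside),
    }

def jitter(x, width=0.2, rng=random):
    """Shift a plotting position by a small uniform random amount."""
    return x + rng.uniform(-width, width)

# Hypothetical life expectancies for five countries in one period.
values = [40.0, 45.0, 50.0, 55.0, 60.0]
stats = box_stats(values)

# Jitter only the horizontal (period) position; the values stay untouched.
rng = random.Random(1)
points = [(jitter(1950, rng=rng), v) for v in values]
```

In this example the box runs from 45 to 55 years, so the middle half of the countries lies within a 10-year band. Because the jitter affects only where a point is drawn along the time axis, the life expectancy values read from the vertical axis remain exact.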
The benefit of visualizing data using a combination of scatter- and boxplot over a single line representing the mean becomes evident when thinking about the different conclusions one could draw from the alternative graphs. If Figures 4 and 5 represented only means, one would conclude that the world is becoming a better place, for in both Western Europe and Sub-Saharan Africa mean life expectancy has been rising over the course of the twentieth century.
The scatter- and boxplots show a more nuanced picture, though. First of all, they show that for earlier time periods fewer countries represent each global region, and any claims on life expectancy at the aggregate level are thus more uncertain for earlier time periods. Moreover, we see that in Western Europe the variation in life expectancy is declining, while in Sub-Saharan Africa there is, at best, no evidence for convergence. A more appropriate conclusion then appears to be that in both Western Europe and Sub-Saharan Africa life expectancy has increased, but that inequality in life expectancy within Western Europe has strongly decreased to a maximum of 5 years, while in Sub-Saharan Africa inequality did at best not lessen, showing a 45-year difference at the extremes, and even a 10-year difference between the middle 50 per cent of the countries. Zijdeman and Ribeiro da Silva (2014) reach a similar conclusion, but the reader has no way to corroborate their claims based on the visualization of the means as presented in their chapter.
To summarise, the Clio Infra dataset on life expectancy at birth is a dataset created from multiple data sources, and in this 'data stage' article I have illustrated potential issues with this type of dataset. Data sources covering the same time periods and regions might be incongruent, and complementary datasets may cause sudden changes in trends over time. The life expectancy dataset at hand proves to be remarkably resilient to both issues. A second issue raised was that datasets like the life expectancy dataset are often used to literally draw comparisons between world regions using line graphs of means over time. I have suggested that a combination of scatter- and boxplot is more appropriate, as it helps the reader to assess changes in the richness of data over time, as well as to draw substantively more interesting (and more appropriate) conclusions.
To my knowledge, the construction of datasets has received much more attention in recent years. While debates on 'openness of data' often result in discussions on principles of ownership, the one principle that we share as researchers is that research should be replicable. That applies not only to regression analysis on a specific dataset, but also to the construction of those datasets themselves. In journal articles, and even in codebooks, there is little space to go into great detail on the particulars of the creation of a dataset, or to highlight particular issues that the researcher had to deal with. It is my hope that the space provided in this journal to write data review articles or data stage articles like this one will be used to raise awareness of the peculiarities of datasets. Not only would that provide common ground for replication and robustness checks of datasets, it would also enhance our use of those data as we gain a better understanding of the datasets at hand.

About the author
Richard Zijdeman obtained his PhD in sociology and focuses on long-term patterns of occupational stratification in Western countries over the past 200 years. Methodologically, he specializes in historical measures of occupational status and multilevel models accounting for complex variance structures. Currently his main roles are Chief Data Officer at the International Institute of Social History and project lead for the structured data component of the Common Lab Research Infrastructure for the Arts and Humanities (CLARIAH). For the latter, his team is building an infrastructure to transpose historical datasets (including GIS) to Linked Open Data, enhancing the connectivity of datasets as well as the reproducibility of research. E-mail: richard.zijdeman@iisg.nl

1 J.L. van Zanden, c.s. (eds.), How was life? Global well-being since 1820 (Paris 2014). doi: 10.1787/9789264214262-en.
2 R.L. Zijdeman and F. Ribeiro da Silva, 'Life Expectancy since 1820', in How was life? Global well-being since 1820, edited by J.L. van Zanden, c.s. (Paris 2014) 101-116.

Figure 1. An example of long-term research data being shared via social media

Figure 3. An illustration of various sources used to depict total life expectancy at birth in a number of European countries

Figure 4. A box- and scatterplot representation of total life expectancy at birth in countries in Western Europe

Figure 5. A box- and scatterplot representation of total life expectancy at birth in countries in Sub-Saharan Africa