 Research article
 Open Access
 Open Peer Review
 Published:
An infectious disease model on empirical networks of human contact: bridging the gap between dynamic network data and contact matrices
BMC Infectious Diseasesvolume 13, Article number: 185 (2013)
Abstract
Background
The integration of empirical data in computational frameworks designed to model the spread of infectious diseases poses a number of challenges that are becoming more pressing with the increasing availability of highresolution information on human mobility and contacts. This deluge of data has the potential to revolutionize the computational efforts aimed at simulating scenarios, designing containment strategies, and evaluating outcomes. However, the integration of highly detailed data sources yields models that are less transparent and general in their applicability. Hence, given a specific disease model, it is crucial to assess which representations of the raw data work best to inform the model, striking a balance between simplicity and detail.
Methods
We consider highresolution data on the facetoface interactions of individuals in a pediatric hospital ward, obtained by using wearable proximity sensors. We simulate the spread of a disease in this community by using an SEIR model on top of different mathematical representations of the empirical contact patterns. At the most detailed level, we take into account all contacts between individuals and their exact timing and order. Then, we build a hierarchy of coarsegrained representations of the contact patterns that preserve only partially the temporal and structural information available in the data. We compare the dynamics of the SEIR model across these representations.
Results
We show that a contact matrix that only contains average contact durations between role classes fails to reproduce the size of the epidemic obtained using the highresolution contact data and also fails to identify the most atrisk classes. We introduce a contact matrix of probability distributions that takes into account the heterogeneity of contact durations between (and within) classes of individuals, and we show that, in the case study presented, this representation yields a good approximation of the epidemic spreading properties obtained by using the highresolution data.
Conclusions
Our results mark a first step towards the definition of synopses of highresolution dynamic contact networks, providing a compact representation of contact patterns that can correctly inform computational models designed to discover risk groups and evaluate containment policies. We show in a typical case of a structured population that this novel kind of representation can preserve in simulation quantitative features of the epidemics that are crucial for their study and management.
Background
Computational models and multiscale numerical simulations represent essential tools in the understanding of the epidemic spread of infections, in particular to study scenarios, and to design, evaluate and compare containment strategies. The applicability of such computational studies crucially depends on informing transmission models with actual data. We are currently witnessing an important evolution as more and more data on human mobility and behavioral patterns become accessible [1–9]. For instance, data on human travel patterns and mobility have been fed into largescale models of epidemic spread at regional or planetary scale, providing important modeling and prediction tools [10–14].
When considering smaller scales, the knowledge of contact patterns among individuals becomes relevant for identifying transmission routes, recognizing specific transmission mechanisms, and targeting groups of individuals at risk with appropriate prevention strategies or interventions such as prophylaxis or vaccination [15]. Several properties of the contact patterns are known to bear a strong influence on spreading patterns, such as the topological structure of the contact network, the presence of individuals with a particularly large number of contacts, the frequency and duration of contacts, and the existence of communities [16–24].
The simplest approach to the description of contact patterns is the homogeneous mixing assumption, which postulates that each individual has an equal probability of having contacts with any other individual in the population [25]. A widely used refinement of this approach consists in dividing individuals into classes (corresponding for instance to different age groups or to different social activities), and defining a contact matrix between classes in terms of the average number (or duration) of the contacts that individuals in one given class have with individuals in another given class. These matrices are constructed using data from questionnaires or diaries, timeuse data [23, 26–31], and more recently by sensing room copresence and closerange proximity between individuals [32, 33].
The use of models with structured populations can help to define more refined analytical approaches [34]. In addition, it allows to perform numerical simulations of epidemic spread in synthetic structured populations with fixed rates of contacts between individuals belonging to different classes, given by the (possibly empirical) contact matrix elements [10, 12, 35–37]. These simulations can also be used to compare vaccination strategies targeting specific groups and to estimate which strategies are most effective [35–41]. The use of contact matrices for modeling contact patterns relies on a set of restricted homogeneous mixing assumptions within each class and on the representativeness of the average mixing behavior between classes. However, such an approach neglects the strong fluctuations that are usually observed in the distributions of the numbers and durations of contacts between two individuals of given classes, which are often modeled with negative binomial distributions [23, 29, 31, 42, 43].
Overall, in the context of a specific modeling problem, little is known about the level of detail that should be incorporated in modeling contact patterns. Coarse representations such as the homogeneous mixing assumption leave out crucial elements, but are analytically tractable and can provide a coarse understanding of epidemic processes. More realistic approaches are however needed when the aim is to predict with quantitative accuracy the outcomes of specific scenarios, to target groups of individuals who are most at risk, and to quantitatively evaluate and rank interventions and containment strategies according to their effectiveness. In these cases it is crucial to achieve realistic simulations to estimate quantities such as the extinction probability or the attack rate. However, very detailed representations may critically lack transparency about the specific role of the many modeling assumptions they incorporate, and might yield unnecessarily finegrained predictions. For instance, in order to evaluate the relative efficacy of targeted vaccination of different groups of persons, we do not need to know the risk of infection of each individual — it is sufficient to estimate the average attack rate in each group.
As technological advances keep enhancing our ability to gather highresolution data on proximity and facetoface interactions [4–9] and to integrate such data in computational models, the issue of understanding what is the most adequate data representation becomes therefore crucial, and the answer may depend on the specific epidemic process under study as well as on the specific goals of the modeling approach. This points to general problems for datarich scenarios such as the ones involving wearable sensors in realworld settings: What is the right amount of information about individual interactions that is appropriate for a given modeling task, so that the relevant information is retained but the model stays as parsimonious as possible [44]? What are the most useful synopses of highresolution contact network data?
Several studies have started to tackle such issues within the framework of the description of human interactions as static or dynamic contact networks, in the case of unstructured populations [22, 24, 45–48]. It is known that the heterogeneity of contact durations is a crucial element that needs to be taken into account. On the other hand, in a particular case [48] it was shown that the dynamics of an SEIR (Susceptible, Exposed, Infectious, Recovered) process over an aggregated network that only takes into account daily contact durations achieves a good approximation of the transmission dynamics over the full timevarying contact network with a temporal resolution of the order of minutes.
To date, the more complex case of a structured population has not been investigated under this perspective of data summarization. Here we address this problem motivated by the needs of applications and interventions aimed at the containment of diseases, such as identifying groups of individuals that need to be prioritized in deploying containment strategies such as vaccination and prophylaxis.
The aim of the present paper is therefore to understand whether and how different representations of the contact patterns within a structured population might lead, in simulation, to different estimates for the outcomes of an epidemic process and for the attack rates within groups of individuals: How faithful to the highresolution empirical data are the summarized representations of contact patterns, in terms of the spreading process they yield and of the risk groups they identify? Given a specific epidemic process to be modeled, what is the optimal tradeoff between the compactness of the data representation and the amount of detail retained in the model? The answer to these questions lies in designing compact yet informative representations of contact patterns that can be used to model epidemics in structured communities (e.g., the case of nosocomial infections in hospitals). In such contexts, the design and evaluation of prevention measures or containment policies is naturally based on targeting specific classes of individuals. For example, it is possible to prioritize the vaccination of a certain class of individuals (e.g., nurses in a hospital [35]), whereas no simple policy can be based on a list of specific individuals. Therefore, it is important to build data representations that yield accurate results at the level of groups of individuals and that, at the same time, can be generalized in order to support generic conclusions on the relative efficacy of different strategies, and in particular of strategies targeting only high risk groups.
Here we put forward such a representation and we validate its appropriateness and usefulness in a case study of structured population: to this aim, we leverage a highresolution dataset on the facetoface proximity of individuals in a hospital ward, collected by the SocioPatterns collaboration by using wireless wearable sensors. The population is structured in different classes that correspond to different roles in the hospital ward. The collected data, described in detail in Ref. [33], include the detailed timeordered sequence of individuals’ contacts between all subject pairs.
We consider various representations of the hospital ward data, corresponding to different types and degrees of aggregation. At the most detailed level, the raw data can be viewed as a dynamic contact network, where all the available information on the interactions between pairs of individuals (contact times and durations) is explicit. We then construct static networks that project away the temporal structure of the contacts but retain the identity of each individual and the structure of her contact network. At a coarser level, we consider the customary contact matrix representation of the data, which aggregates along the class dimension: All individuals belonging to the same role class are grouped together, and the contact rate between two individuals of given classes is given by the corresponding element of the contact matrix. This representation, by definition, discards many heterogeneities of contact behavior at the individual level as well as the heterogeneities among contacts. Between the contact matrix view and the individualbased network view, other intermediate representations are possible. Here we introduce a novel representation based on a contact matrix of distributions, designed to retain the heterogeneous properties of contact durations among pairs of individuals belonging to given role classes.
We then use an SEIR process to model the spread of an infectious disease in a structured hospital population whose contact patterns are described by using all of the above representations, computed from the raw proximity data. We simulate the distribution of attack rates for the various role classes and compare these outcomes against individualbased simulations that take into account the most detailed description of the empirical contacts. Our results highlight similarities and differences in the dynamics obtained for various aggregation levels of the contact pattern representation. In particular, they show that coarse representations that do not take into account the heterogeneity of contact durations, such as the usual contact matrix, yield rather bad estimates for the global attack rate. Such representations also lead to a wrong classification of role classes in terms of their relative risk and could therefore lead to incorrect prioritization for interventions such as targeted vaccination or prophylaxis. The novel representation that we put forward is shown to strike a promising balance between simplicity and generalizability on the one hand, and on the other hand the ability to yield accurate evaluations of risk and a correct prioritization of groups of individuals in the design of targeted measures.
Methods
Data
The dataset we use [33] describes the contacts of 119 individuals in a general pediatric ward of the Bambino Gesù Children’s Hospital in Rome, Italy, over a period of one week (916 November 2009). The ward under study is located in the Department of Pediatrics and is physically separated from other wards and facilities of the Hospital. It has 44 beds arranged in 22 rooms with 2 beds each and mostly admits children with acute diseases who do not require intensive care or surgery. Admitted patients are accompanied by one person (parent or tutor) who spends the night in the same room. Contacts, defined as close proximity interactions, were measured by using wearable badges that engage in bidirectional lowpower radio communication [7, 8] and can assess the facetoface proximity of two individuals.
The study was approved by the Ethical Committee of the Bambino Gesù Hospital [33]. As described in [33], during the data collection period all healthcare workers (HCWs) on duty, patients, parents or tutors were invited to participate in the measurements. HCWs participated in several meetings to receive information about the study and all procedures were approved by the Ethical Committee of the hospital. Only 5 HCWs, 1 patient and 1 parent declined. Some proximity sensing badges assigned to patients, parents or tutors were not continuously detected by the receiving infrastructure and were missing for more than 25% of the time they were assigned to the individuals. These badges were excluded from the dataset, after assessment of the robustness of the data statistical properties with respect to this filtering procedure (see [33] for details). Overall, the data describes the contacts of 119 individuals, among which more than 90% of the HCWs who were on duty in the ward during the chosen week. These 119 participants were associated with the following role classes: 37 patients (P), 20 physicians (D), 21 nurses (N), 10 ward assistants (A), i.e., health care workers in charge of cleaning the ward and distributing the meals, and 31 caregivers (C), who comprise tutors and nonprofessional visitors. As data was collected during the pandemic period when several patients with H1N1 infection were hospitalized, HCWs applied standard operating procedures to prevent transmission of infection including hand hygiene, face mask, wearing gowns and gloves. Specific guidelines are however permanently adopted in the study hospital to prevent cross transmission of respiratory infections, so that the contact patterns were not specifically altered during the data collection period.
Empirical contact patterns
An analysis of the recorded contact patterns is described in Ref. [33]. At the most detailed level the data provide an explicit representation of the timevarying contact network of individuals, resolved at the fastest available temporal scale (20 seconds). As discussed in the Introduction, our aim here is to compare different numerical simulations of the spread of an infectious disease, where each simulation is constructed on top of a specific mathematical representation of contact patterns, and all these representations are derived from the same empirical data, summarized or modeled at different levels of detail (e.g., individualbased contact network vs contact matrices). In the following we describe the contact pattern representations we will use.
The representation that most closely corresponds to the raw empirical data, here forth named “DYN”, is the timevarying individualbased contact network, where for each pair of individuals the precise starting and ending times of individual contact events are available with the finest temporal resolution of 20 seconds. When simulating a process, such as the spread of an infectious disease, characterized by longer time scales than the available interval of one week, we simply play back the recorded contact history over and over. Whereas other procedures are possible for artificially extending the time span of the empirical data, it was shown in [48] that the simple repetition procedure gives robust results.
The DYN representation is thus the one that contains the most detailed information on the empirical contact patterns, and therefore we will consider it as the reference against which all the other representations have to be compared.
Starting from the dynamical contact network representation (DYN), different aggregation procedures are possible that retain only selected features of the contact network while aggregating over the others. At the most aggregated level we consider, we disregard the topology of the empirical contact network as well as the contact heterogeneities and assume that each individual is in contact with all others for the same average time. We refer to this very coarse representation as “FULL”, because its contact structure is a fully connected graph, with homogeneous contact durations.
As pointed out above, several intermediate representations can be designed between the finegrained DYN representation and the coarsegrained FULL one. We define a space of possible representations by considering two qualitative features: the amount of information on the network structure, and the amount of information on contacts (times, durations, etc.). Figure 1 and Table 1 illustrate where the representations we study, described in the rest of this section, lie with respect to one another, and which properties of the empirical data they preserve.
The first two representations we consider are designed to preserve the empirical structure of the contact network (who has met whom). The first representation (HET in Figure 1) is a static and weighted contact network: individuals who have been in contact during the data collection are linked by an edge that is weighted by the total time they have spent in facetoface proximity. All the dynamical contact events are thus summarized into a single contact graph that includes information on the observed contacts between individuals (who has met whom) and on the total (heterogeneous) duration of these contacts, i.e., how long A and B have been in contact. HET disregards information on the time of occurrence and on the temporal ordering of contacts. A less informative static network representation can be obtained from HET by removing the weight heterogeneity: this homogeneous network (HOM) is constructed by retaining the topology of the HET network, while giving all links the same weight, set as the average total duration of contacts between two individuals that have met at least once.
The DYN, HET and HOM representations of the raw data describing the contacts between the attendees of a conference were compared in Ref. [48]. In the present case of a hospital ward, however, role classes are present, and the structure of the contact network is known to be very different [33] because of the role structure. It is therefore interesting to first verify the validity of the results of Ref. [48] in the case of a structured population.
The role structure of a hospital ward naturally lends itself to the definition of contact matrices that encode in a very compact way the class heterogeneity of contact patterns. Rows and columns of a contact matrix correspond to role classes. The matrix entry for classes X and Y gives the average contact time between any two members of those respective classes. The contact matrix description is not individualbased, hence in order to make contact with the individualbased representations introduced above, as well as with the empirical data, we define a network representation of the contact matrix (CM) as follows. We build a complete contact graph where each pair of individuals (x,y), with x in class X and y in class Y, is linked by an edge of weight w_{XY} that depends only on the respective classes of the individuals. We set w_{XY} = W_{XY}/(N_{X}N_{Y}), where N_{X} is the number of individuals in class X, N_{Y} the number of individuals in class Y, and W_{XY} is the total time spent in contact by any individual of class X with any individual of class Y. In the case X = Y, we set w_{XX} = W_{XX}/(N_{X}(N_{X}1)/2)), where the denominator is the number of withinclass edges. Similarly to the FULL representation, the CM representation does not preserve the topology of the empirical contact network, but the more coarsegrained contact structure between classes of individuals is retained by associating pairs of individuals with weights that depend on their classes.
As already mentioned above, it is known [33] that the distribution of cumulative contact durations between pairs of individuals belonging to given categories displays large heterogeneities. In fact, as reported in other studies [23, 29, 31, 42, 43], such distributions can often be modeled as negative binomial distributions, which can account for the broad fluctuations of contact durations as well as for the comparatively high fraction of missing links across the chosen classes. In order to account for these heterogeneities, we define a new representation based on a contact matrix of distributions. The matrix is defined so that the entry for role classes X and Y consists of the parameters of the negative binomial distribution that fits the empirical distribution of contact durations between all pairs of individuals x in class X and y in class Y.
Similarly to the customary contact matrices, the contact matrix of distributions is not an individualbased representation. Hence, in analogy to what we have done above, we start from the Contact Matrix of Distributions and define a corresponding network representation (“CMD”, in the following): we build a complete contact graph, where each pair of individuals (x,y), with x in class X and y in class Y, is linked by an edge whose weight w _{ xy } is a random variable obtained by sampling the negative binomial distribution with the parameters specified by the (X,Y) entry of the contact matrix of distributions. A weight w _{ xy } = 0 indicates that there is no link between individuals x and y. We remark that the detailed structure of the empirical contact network is not retained in this representation, but that the average number of links with nonzero weight between each pair of classes is preserved.
Finally, we also report on an intermediate contact matrix representation designed to capture the link density and the average weights of the contact subnetwork connecting classes X and Y, without taking into account the corresponding variance of edge weights. This is built as a contact matrix of bimodal distributions (CMB, “B” for “bimodal”) that are used in place of the negative binomial distributions. As in the CMD case, the matrix is defined so that the entry for role classes X and Y consists of the parameters of a bimodal distribution that summarizes the empirical contacts between all pairs of individuals x in class X and y in class Y. Specifically, let us denote by E_{XY} the number of observed edges between individuals of classes X and Y, and by W_{XY} the total time spent in contact by any individual of class X with any individual of class Y. The (X,Y) entry is a bimodal distribution that assumes the value W_{XY}/E_{XY} (i.e., the average observed weights between classes X and Y) with probability E_{XY}/(N_{X}N_{Y}), and the value 0 with probability 1 E_{XY} /(N_{X}N_{Y}). On building a network representation from the matrix, for each pair of individuals (x,y), with x in class X and y in class Y, we accordingly set w _{ xy } = W_{XY}/E_{XY} with probability E_{XY}/(N_{X}N_{Y}), and w _{ xy } = 0 with probability 1 E_{XY} /(N_{X}N_{Y}). This representation is discussed in the Additional file 1.
Infectious disease model
To simulate the spread of an epidemic we use an SEIR model without births, deaths or introductions of individuals. In this model, individuals can be in one of 4 distinct states: susceptible (S), exposed (E), infectious (I) or recovered (R). Susceptible individuals can become exposed with a given rate β, when they are in contact with infectious individuals. Exposed individuals do not transmit the disease through contacts with others, but become infectious with a rate σ, where 1/σ is the average duration of the latent period of the disease. Infectious individuals can transmit the disease through contacts with susceptible ones. Finally, infectious individuals become recovered with rate v and acquire permanent immunity to the disease. The initial conditions we use correspond to a population of susceptible individuals and a single infectious individual chosen at random.
The transitions from the exposed to the infectious state and from the infectious to the recovered state do not depend on the contact behavior of each individual. On the other hand, the probability that an infectious individual transmits the disease to a susceptible one does depend on the duration of their contacts. In order to compare the simulated spreading dynamics of an epidemic for different representations of the contact patterns we need to adequately define the rate of infection β for an infectioussusceptible pair, so that in all cases we have the same average infection probability. In particular, care must be taken in comparing timevarying network representations with static network representations such as HOM or HET. The corresponding procedure is detailed in [48] and in the Supporting Text (Additional file 1).
We consider two different scenarios for the simulation of epidemic spread: in the first scenario, the latent period is set to 2 days and the average infectious period is 1 day, whereas in the second scenario we halve these intervals in order to better probe the interplay with the fast temporal scale of human contacts. The parameters we use are:

(i)
. scenario 1: 1/σ = 1 day, 1/ν = 2 days and β = 6.9 10^{4} s^{1}(short latent and infectious periods)

(ii)
. scenario 2: 1/σ = 0.5 day, 1/ν1 day and β = 2.8 10^{3} s^{1} (very short latent and infectious periods).
For each scenario and each data representation (DYN, HOM, HET), we performed 16,000 simulation runs of the epidemic spread. In the CMD and CMB cases we also performed 16,000 runs, obtained by generating 1,600 different contact network samples from the fitted contact matrices of distributions, and by simulating for each of them 10 realizations of the epidemic spread.
In the following we will keep the above “microscopic” epidemiological parameters fixed, and we will simulate an SEIR model over different representations of contact patterns. An alternative approach to carry out this comparison consists in tuning the model parameters so that the spread on the different contact pattern representations leads to the same basic reproductive number R_{0}. This procedure is considered in the Additional file 1 and yields results that are similar to the ones reported below.
Analysis
The empirical contact data are analyzed and characterized in Ref. [33]. In order for the present paper to be selfcontained, we report here the contact matrix measured in the hospital ward as well as the newly introduced contact matrix of distributions (CMD).
The simulated properties of epidemic spread, based on different contact pattern representations, are quantified and compared by means of several indicators that we compute over an ensemble of stochastic realizations of the epidemic process. We focus in particular on the number of recovered individuals at the end of the spreading process: its distribution provides important information on the expected magnitude of the epidemics and can hence be used to evaluate the effectiveness of interventions. We study the impact on this distribution of the choice of the initial seed (initial infectious individual) and compute the distributions of attack rates for each role class, as such measures can be used to design and evaluate containment policies targeting specific roles or groups of individuals. Finally, we compute the extinction probability of the epidemics, as well as the fraction of stochastic realizations that yield a global attack rate lower than 10%.
For completeness, in the Additional file 1 we also consider the peak time and the duration of the epidemic. We report as well results on the expected number of secondary infections from an initially infected individual in a completely susceptible host population (i.e., the basic reproduction number [25]).
In all cases, the results of simulations are compared with the spread simulated on the DYN representation, i.e., the representation that incorporates the most detailed information on the precise timing and duration of contacts. For our purposes, the faithfulness of a data representation is measured by how close the outcomes of the epidemics spread based on that representation are to those of the simulations based on the DYN representation.
Results
The dataset we use provides information on approximately 16,000 contact events, recorded during one week, among 119 individuals: 37 patients (31.1% of the total number of individuals), 31 caregivers (26.1%), 20 physicians (16.8%), 21 nurses (17.6%), and 10 ward assistants (8.4%).
Table 2 reports the contact matrix defined on role classes: the element at row X and column Y is the average over all pairs of individuals x in X and y in Y of the total time that x has spent in contact with y during the measurement week, normalized to obtain daily contact durations.
Table 3 reports the parameters of the contact matrix of distributions introduced above. As shown in the Additional file 1, we fit a negative binomial distribution to each empirical distribution of cumulated contact durations between members of classes X and Y and set the (X,Y) entry of the matrix to the values of the two fitted parameters. The negative binomial form allows the variance of the distribution to be much larger than its average, thus capturing in a simple way the breadth of the empirical distribution. We remark that no zeroboosting of the negative binomial distributions was necessary in order to reproduce the fraction of missing links from the members of class X and class Y. In other words, the fraction of pairs of individuals x in X and y in Y for which no contact was recorded (cumulated contact durations equal to 0) is naturally reproduced by the simple form of the negative binomial.
In the following we systematically discuss the results corresponding to the second set of parameter values for the SEIR model (very short latent and infectious periods) and, when appropriate, point to differences with the results from the first parameter set, which are reported in the Additional file 1. Moreover, since the FULL case corresponds to a very crude assumption of global homogeneous mixing that yields a strong overestimation of the epidemic size, we include the corresponding results in the Additional file 1 only.
Figure 2 and Table 4 summarize the distributions of the final number of cases for the various contact pattern representations.
When the epidemic reaches more than 10% of the population, the distributions of its final size, shown in Figure 2A, can be approximately grouped in two classes. The distributions for the DYN, HET and CMD cases overlap (and they are also very similar to one another for the other parameter set, see Additional file 1). Conversely, the distributions for the HOM and CM cases display a large overestimation of the final epidemic size, with the CM case yielding a higher number of cases than the HOM one.
The probability of extinction and the fraction of stochastic runs that result in an attack rate lower than 10% (Table 4) are much smaller in the HOM and CM cases. The HET and CMD cases are comparable with one another and both are still lower than the DYN case, but closer to it (remember that DYN is the most informative representation of all). The difference between the DYN and the HET/CMD cases is smaller for the other parameter set (see Additional file 1), while even in that case HOM and CM lead to a strong underestimation of the extinction probability and of the fraction of runs with an attack rate lower than 10%. Spreading simulations based on the CMB representation (see Additional file 1) have an intermediate performance but still provide a rather bad approximation of the spreading patterns obtained from the heterogeneous contact network (HET) and from the CMD representation, which do include more information about the statistical heterogeneity of contact durations.
Table 4 also shows that the role class of the seed individual strongly influences the probability of extinction and of obtaining an attack rate lower than 10%. For larger epidemics, however, the choice of the seed does not have a strong impact on the final epidemic size, as seen in Figure 2B. For all contact representations, epidemics starting from a ward assistant are the most likely to spread. On the other hand, epidemics starting from a patient, a caregiver or a physician have a systematically larger probability of affecting only a small fraction of the population, except for the CM case in which this probability is strongly underestimated and comparable to the case in which the seed is a nurse.
Figure 3 shows the distribution of the fraction of individuals of each role class reached by the epidemic in the various scenarios. The DYN, HET and CMD representations give very similar results for both sets of parameter values: assistants and nurses are particularly affected, while doctors, patients and caregivers have smaller attack rates. On the other hand, for the HOM and CM representations the distributions are very different, with a strong global overestimation of the attack rates in every class and an incorrect ranking of the risk probabilities for the various role classes. In particular, in the CM case the attack rate for patients and caregivers is as high as that for nurses and much higher than the one for doctors, in contrast to the reference case of the DYN representation.
Finally, Figure 4 and Table 5 report the temporal evolution of the number of infectious and recovered individuals for the various contact pattern representations. The curves show the simulated values averaged over the stochastic realizations that result in an attack rate higher than 10%. The epidemic peak is higher for the HOM and CM cases and very similar for the DYN, HET and CMD representations. Consistently with the findings of Ref. [48], instead, the time of the epidemic peak only slightly depends on the specific representation (see Additional file 1).
The above comparison assumes that the epidemic parameters of the SEIR model are known and that their values are characteristic of the “microscopic” dynamics of the infectious disease, at the level of individual facetoface contacts. As discussed in the Methods section, an alternative approach to carry out this comparison is to tune the model parameters for each case (contact pattern representation) so that all cases yield, in simulation, the same average number of secondary infections from an initially infected individual. The above comparison can then be repeated. In the Additional file 1 we show that the results reported above stay qualitatively similar.
Discussion
We have introduced and validated in a case study a novel way of summarizing and using in simulation highresolution contact data that describe the interactions between individuals in a structured population. We based our work on a highresolution timeresolved dataset on the facetoface contacts of individuals in a hospital ward. We have defined several representations of the contact data that correspond to different levels of aggregation, from the highresolution dynamical information on all individual contacts (DYN), passing through several intermediate representation that include individualbased static networks (HOM and HET) and contact matrices (CM, CMB and CMD), to the coarse homogeneous mixing description (FULL). In particular, we have defined a new representation in terms of a contact matrix of negative binomial distributions (CMD) that has the simplicity of a contact matrix formulation but takes into account the heterogeneity of contact durations between individuals.
To compare these representations and assess their ability to correctly inform public health policies, such as targeted vaccination strategies, we used each of the representations to simulate the spread of an infectious disease, modeled by an SEIR model, and we compared the results with the outcome of simulations based on the most detailed representation (DYN), which we regard as a gold standard.
The possibility to use a certain data representation depends on the data at hand, and each representation has advantages and limitations. In particular, both dynamical and heterogeneous network representations (DYN and HET, respectively) rely on a rather large amount of information on contact patterns. Although emerging pervasive technologies allow to gather such information in diverse contexts [4, 5, 7, 8], the data currently available remain scarce and there are no standard modeling procedures to generalize contact measurements concerning a specific population during a specific period to different periods or to larger populations. In this respect, contact matrix representations are very appealing in the case of structured populations. This is precisely the case of the data used here, in which individuals are grouped according to their known role in the hospital ward. The usual contact matrix representation gives, for each pair of individuals, an average contact rate that only depends on the respective roles of those individuals. Therefore, it can be computed for one studied population and easily generalized to larger ones (under the condition that the set of roles stays similar). However, the customary contact matrix representation suffers from an important limitation: it only retains information on the average contact durations of different role pairs, discarding all the heterogeneities that are known to exist in the contact properties of individuals that belong to given role classes [33].
These considerations have prompted us to move beyond the customary contact matrix representation, in particular to take into account three important features of the empirical data: 1) cumulative contact durations are known to display important temporal and rolerelated heterogeneities, 2) empirical contact networks are sparse, and 3) the density of links in the subnetworks restricted to given role pairs strongly depends on the specific role pair [33]. We have therefore introduced a novel data representation based on a matrix of contact duration distributions. For each role pair, we have fitted the distribution of the measured cumulative contact durations using a negative binomial distribution. Interestingly, no zeroboosting was needed to reproduce the fraction of links with zero weights, i.e., the sparseness of the contact subnetworks that relate one role class to another. This distributional representation thus aggregates the individuals into role classes, but takes into account the contact heterogeneities and the sparseness of the contact network. At the same time, it retains the simplicity and versatility of the usual contact matrix representation, as it can be easily used to construct the contact patterns for a different or larger population.
We validated the usefulness of the proposed contact matrix representation using data from a realworld environment, namely a highresolution record of the contacts within the structured population of a hospital ward. We described these highresolution contact patterns using both the standard contact matrix representation and the representation we introduced here. We then used the two representation to simulate the epidemic spread within the hospital community. Each representation was evaluated by comparing the final size of the epidemics, the impact of the seed role class, and the final attack rates in each role class against simulations based on the much richer networkbased representations (HET and DYN), which retain several heterogeneities and correlations of the empirical dataset. Our results show that the usual contact matrix representation leads to an overall strong overestimation of the outbreak probability and of the attack rate. Moreover, the relative risks of individuals belonging to different classes are not correctly estimated from the simulations performed with the usual contact matrix. On the other hand, the contact matrix of distributions (CMD) affords a very good qualitative and quantitative agreement with the spreading patterns obtained with the heterogeneous network representation (HET), even though the CMD representation contains significantly less information on the contact network structure than HET (see Table 1). The agreement is significantly better for the CMD representation than for CMB, as CMD better takes into account the heterogeneity of contact durations. The CMD representation, therefore, provides an important improvement with respect to the usual contact matrix description, as it allows to correctly rank the groups of individuals according to their risk of being infected and therefore to correctly inform policies based on targeted vaccination of role classes.
We also note that the spreading dynamics over the (static) network representation with heterogeneous weights (HET) yields here a smaller probability of extinction than observed for the full dynamical contact data (DYN). When restricting the comparison to outbreaks with final attack rates higher than 10%, a small but systematic difference in the number of observed cases is observed, while similar timings are obtained. These results are somewhat different from the ones obtained in Ref [48] for the case of contacts at a conference. This difference might stem from various causes: first, conference attendees mix more than individuals in a hospital and the structure of the corresponding contact network is thus different from the hospital case [33]. Moreover, the present hospital dataset is longer than the 2days dataset used in Ref [48], hence dynamical effects at multiple timescales may play a stronger role. The difference between the simulations performed on the HET and DYN representations are larger for faster propagations, consistently with the results of Ref. [49], according to which a very fast (deterministic) spreading phenomenon can result in very different patterns on static and dynamic network representations.
In both cases (HET and DYN), the role class of the initial seed has a strong impact on the extinction probability and on the probability of observing a large outbreak: if the seed is a ward assistant or a nurse, the probability of a large outbreak is much larger. In addition, assistants and nurses have an overall larger risk compared to the other role classes. These results are consistent with literature that highlights the crucial importance of prioritizing nurses for local infection control interventions [35].
In our analysis we have only considered two extreme cases of temporal aggregation of the data. The DYN representation retains the full temporal information, i.e., the precise starting and ending time of each contact event. Conversely, the static network and contact matrix representations aggregate the contact behavior of individuals over a whole week. It is clearly possible to consider intermediate temporal aggregation levels that include some information on the activity schedules, for instance at the daily scale. We have thus built daily contact networks, as well as daily contact matrices and daily contact matrices of distributions: these representations take into account the fact that not all individuals are present every day, and that the contact patterns can vary from one day to the next. The results of SEIR spreading simulations, presented in the Additional file 1, are interestingly very close to the ones of representations aggregated at a weekly timescale, showing that weekly averaged data contain substantially enough information to correctly evaluate mitigation and containment measures.
We finally note that the study of timeresolved networks is still in its infancy [50]. Recent studies [48, 49, 51–53] have investigated models of epidemic spread on temporal networks and studied the role of the networks’ temporal properties, sometimes with contrasting results, as discussed in Ref. [54]. The comparison is however difficult, because different datasets, different epidemic models, and different parameters values have been used [54]. In particular, many studies have focused on processes that unfold rather fast with respect to the temporal scale of the network dynamics, so that the precise ordering of individual interactions is important, as well as the distribution of time intervals between contacts. In our case, conversely, the timescales of the spreading process and of the network dynamics are well separated, which explains why static representations yield results that are similar to the timeresolved representations like DYN. More systematic investigations on the interplay between the involved timescales, as well as studies using different epidemic models on the same datasets would be of strong interest for future research [54].
Conclusions
We have shown in a case study that the usual contact matrix representation of empirical contact patterns in a structured population can fail to correctly inform models of epidemic spreading and of the ensuing relative risk for groups. We have introduced an alternative representation of the empirical contact data based on a contact matrix of negative binomial distributions. This representation affords a tradeoff between too coarse and too detailed representations of the data. We have shown in a specific case that it allows to correctly model important features of epidemic spread, and in particular to estimate the relative risk of individuals in different role classes, while simultaneously maintaining a compact form that retains very little information from the timevarying contact network it summarizes. This is a step in the direction pointed by Ref. [44], as this representation provides a synopsis that combines simplicity, compactness of representation and modeling power, all of which are essential features for guiding decision making in public health contexts. These results provide insight about the level of detail that should be sought when performing measurement campaigns aimed at informing epidemic models and at understanding the efficacy of interventions. On the one hand, contact matrices containing only average quantities are not refined enough, and it is crucial to measure the heterogeneities of contact durations between individuals of given classes; on the other hand, the exact knowledge of who has been in contact with whom, over time, and of the precise structure of the contact network yields a description that is unnecessarily detailed and complex to manage for most practical cases.
Some limitations of the present study need to be discussed explicitly. First of all, the study was performed based on data for a relatively small population (nearly one hundred individuals), taken at one point in time, and considered diseases with short latent periods. On studying the DYN case, the long but limited temporal span of the data set required the generation of artificially extended activity timelines, obtained by repeating the empirical timeline. The static representations we considered do not take into account behavioral variations across days of the week, nor the heterogeneous amount of time spent by the various individuals in the ward. It is for instance possible that larger differences between the results of the simulations on the CMD and HET representations would be observed in larger populations because of specific network effects.
These limitations point to natural ways of extending the work reported here. In order to validate our approach in other contexts and for different categories of diseases, e.g. with longer latent periods, it is crucially important to measure highresolution contact patterns in different structured populations, in larger populations, and over longer time scales. Moreover it would be important to check the robustness of the fitting procedure based on negative binomial distributions and to test other possible functional forms that can fit the empirical distribution.
Other perspectives include testing datadriven containment strategies based on the various contact patterns representations and understanding how the evaluation of their performance depends on the representation, in particular for the case of the novel representation defined by the contact matrix of distributions we introduced here. A particularly interesting validation would be to extract a contact matrix of distributions from one set of data, e.g. in a hospital, to use it to design efficient containment strategies, and then to apply these strategies to a simulated epidemic on another dataset for a similar environment, for instance another hospital.
References
 1.
Brockmann D, Hufnagel L, Geisel T: The scaling laws of human travel. Nature. 2006, 439: 462465. 10.1038/nature04292.
 2.
González MC, Hidalgo CA, Barabási AL: Understanding individual human mobility patterns. Nature. 2008, 453: 779782. 10.1038/nature06958.
 3.
Song C, Qu Z, Blumm N, Barabási AL: Limits of Predictability in Human Mobility. Science. 2010, 327: 10181021. 10.1126/science.1177170.
 4.
Eagle N, Pentland A: Reality mining: Sensing complex social systems. Pers Ubiquit Comput. 2006, 10: 255268. 10.1007/s0077900500463.
 5.
O’Neill E, Kostakos V, Kindberg T, Fatah G, Schieck A, Penn A: Instrumenting the city: developing methods for observing and understanding the digital cityscape. Lect Notes Comput Sc. 2006, 4206: 315332. 10.1007/11853565_19.
 6.
Pentland A: Honest Signals: how they shape our world. 2008, Cambridge MA: MIT Press
 7.
Cattuto C, Van den Broeck W, Barrat A, Colizza V, Pinton JF: Dynamics of persontoperson interactions from distributed RFID sensor networks. PLoS One. 2010, 5 (7): e1159610.1371/journal.pone.0011596.
 8.
THe SocioPatterns collaboration: http://www.sociopatterns.org,
 9.
Salathé M, Kazandjieva M, Lee JW, Levis P, Feldman MW, Jones JH: A highresolution human contact network for infectious disease transmission. Proc Natl Acad Sci (USA). 2010, 107: 2202022025. 10.1073/pnas.1009094108.
 10.
Germann TC, Kadau K, Longini IM, Macken CA: Mitigation strategies for pandemic influenza in the United States. Proc Natl Acad Sci (USA). 2006, 103: 59355940. 10.1073/pnas.0601266103.
 11.
Atti ML Cd, Merler S, Rizzo C, Ajelli M, Massari M: Mitigation measures for pandemic influenza in Italy: an individual based model considering different scenarios. PLoS One. 2008, 3: e179010.1371/journal.pone.0001790.
 12.
Halloran ME, Ferguson NM, Eubank S, Longini IM, Cummings DA: Modeling targeted layered containment of an influenza pandemic in the United States. Proc Natl Acad Sci (USA). 2008, 105 (12): 46394644. 10.1073/pnas.0706849105.
 13.
Balcan D, Gonçalves B, Hu H, Ramasco JJ, Colizza V: Modeling the spatial spread of infectious diseases: The GLobal Epidemic and Mobility computational model. J Comput Sci. 2010, 1: 132145. 10.1016/j.jocs.2010.07.002.
 14.
Van den Broeck W, Gioannini C, Gonçalves B, Quaggiotto M, Colizza V: The GLEaMviz computational tool, a publicly available software to explore realistic epidemic spreading scenarios at the global scale. BMC Infect Dis. 2011, 11: 3710.1186/147123341137.
 15.
Read JM, Edmunds WJ, Riley S, Lessler J, Cummings DAT: Close encounters of the infectious kind: methods to measure social mixing behaviour. Epidemiol Infect. 2012, 10.1017/S0950268812000842. FirstView Articles
 16.
Liljeros F, Edling CR, Amaral LA, Stanley HE, Aberg Y: The web of human sexual contacts. Nature. 2001, 411: 907908. 10.1038/35082140.
 17.
Lloyd AL, May RM: Epidemiology. How viruses spread among computers and people. Science. 2001, 292: 13161317. 10.1126/science.1061076.
 18.
LloydSmith JO, Schreiber SJ, Kopp PE, Getz WM: Superspreading and the effect of individual variation on disease emergence. Nature. 2005, 438: 355359. 10.1038/nature04153.
 19.
PastorSatorras R, Vespignani A: Epidemic spreading in scalefree networks. Phys Rev Lett. 2001, 86: 32003203. 10.1103/PhysRevLett.86.3200.
 20.
Eames KTD: Modelling disease spread through random and regular contacts in clustered populations. Theor Popul Biol. 2008, 73: 104111. 10.1016/j.tpb.2007.09.007.
 21.
Keeling MJ: The effects of local spatial structure on epidemiological invasions. Proc Biol Sci. 1999, 266: 859867. 10.1098/rspb.1999.0716.
 22.
Smieszek T, Fiebig L, Scholz RW: Models of epidemics: when contact repetition and clustering should be included. Theor Biol Med Model. 2009, 6: 1110.1186/17424682611.
 23.
Mossong J, Hens N, Jit M, Beutels P, Auranen K: Social contacts and mixing patterns relevant to the spread of infectious diseases. PLoS Med. 2008, 5: e7410.1371/journal.pmed.0050074.
 24.
Read JM, Eames KT, Edmunds WJ: Dynamic social networks and the implications for the spread of infectious disease. J R Soc Interface. 2008, 5: 10011007. 10.1098/rsif.2008.0013.
 25.
Anderson R, May R: Infectious Diseases of Humans: Dynamics and Control. 1991, Oxford: Oxford University Press
 26.
Beutels P, Shkedy Z, Aerts M, Van Damme P: Social mixing patterns for transmission models of close contact infections: exploring selfevaluation and diarybased data collection through a webbased interface. Epidemiol Infect. 2006, 134: 11581166. 10.1017/S0950268806006418.
 27.
Edmunds WJ, O’Callaghan CJ, Nokes DJ: Who mixes with whom? A method to determine the contact patterns of adults that may lead to the spread of airborne infections. Proc Biol Sci. 1997, 264: 949957. 10.1098/rspb.1997.0131.
 28.
Del Valle SY, Hyman JM, Hethcote HW, Eubank SG: Mixing patterns between age groups in social networks. Soc Networks. 2007, 29: 539554. 10.1016/j.socnet.2007.04.005.
 29.
Wallinga J, Teunis P, Kretzschmar M: Using data on social contacts to estimate agespecific transmission parameters for respiratoryspread infectious agents. Am J Epidemiol. 2006, 164: 936944. 10.1093/aje/kwj317.
 30.
Zagheni E, Billari FC, Manfredi P, Melegaro A, Mossong J: Using timeuse data to parameterize models for the spread of closecontact infectious diseases. Am J Epidemiol. 2008, 168: 10821090. 10.1093/aje/kwn220.
 31.
Hens N, Goeyvaerts N, Aerts M, Shkedy Z, Van Damme P: Mining social mixing patterns for infectious disease models based on a twoday population survey in Belgium. BMC Infect Dis. 2009, 20 (9): 5
 32.
Stehlé J, Voirin N, Barrat A, Cattuto C, Isella L: Highresolution measurements of facetoface contact patterns in a primary school. PLoS One. 2011, 6 (8): e2317610.1371/journal.pone.0023176.
 33.
Isella L, Romano M, Barrat A, Cattuto C, Colizza V: Close encounters in a pediatric ward: measuring facetoface proximity and mixing patterns with wearable sensors. PLoS One. 2011, 6 (2): e1714410.1371/journal.pone.0017144.
 34.
Pellis L, Ball F, Trapman P: Reproduction numbers for epidemic models with households and other social structures. I. Definition and calculation of R_{0}. Math Biosci. 2012, 235: 8597. 10.1016/j.mbs.2011.10.009.
 35.
Polgreen PM, Tassier TL, Pemmaraju SV, Segre AM: Prioritizing Healthcare Worker Vaccinations on the Basis of Social Network Analysis. Infect Control Hosp Epidemiol. 2010, 31 (9): 893900. 10.1086/655466.
 36.
Medlock J, Galvani AP: Optimizing influenza vaccine distribution. Science. 2010, 325: 17041708.
 37.
Wallinga J, Van Boven M, Lipsitch M: Optimizing infectious disease interventions during an emerging epidemic. Proc Natl Acad Sci (USA). 2010, 107: 923928. 10.1073/pnas.0908491107.
 38.
Earn DJD, He D, Loeb MB, Fonseca K, Lee BE: Effects of school closure on incidence of pandemic influenza in Alberta, Canada. Ann Int Med. 2012, 156: 173181. 10.7326/00034819156320120207000005.
 39.
Van Kerkhove MD, Asikainen T, Becker NG, Bjorge S, Desenclos JC: Studies Needed to Address Public Health Challenges of the 2009 H1N1 Influenza Pandemic: Insights from Modeling. PLoS Med. 2010, 7 (6): e100027510.1371/journal.pmed.1000275.
 40.
Goldstein E, Paur K, Fraser C, Kenah E, Wallinga J: Reproductive numbers, epidemic spread and control in a community of households. Math Biosci. 2009, 221 (1): 1125. 10.1016/j.mbs.2009.06.002.
 41.
Goldstein E, Apolloni A, Lewis B, Miller JC, Macauley M: Distribution of vaccine/antivirals and the ‘least spread line’ in a stratified population. J Roy Soc Interface vol. 2010, 7: 755764. 10.1098/rsif.2009.0393.
 42.
Goeyvaerts N, Hens N, Ogunjimi B, Aerts M, Shkedy Z: Estimating infectious disease parameters from data on social contacts and serological status. J R Stat Soc Series C (Applied Statistics). 2010, 59: 255277. 10.1111/j.14679876.2009.00693.x.
 43.
Mikolajczyk RT, Akmatov MK, Rastin S, Kretzschmar M: Social contacts of school children and the transmission of respiratoryspread pathogens. Epidemiol Infect. 2008, 136: 813822.
 44.
Blower S, Go MH: The importance of including dynamic social networks when modeling epidemics of airborne infections: does increasing complexity increase accuracy?. BMC Med. 2011, 9: 8810.1186/17417015988.
 45.
Eames KTD, Keeling MJ: Modeling dynamic and network heterogeneities in the spread of sexually transmitted disease. Proc Natl Acad Sci (USA). 2002, 99: 1333013335. 10.1073/pnas.202244299.
 46.
Eames KTD, Read JM, Edmunds WJ: Epidemic prediction and control in weighted networks. Epidemics. 2009, 1: 7076. 10.1016/j.epidem.2008.12.001.
 47.
Potter GE, Handcock MS, Longini IM, Halloran ME: Estimating withinschool contact networks to understand influenza transmission. Ann Appl Stat. 2012, 6: 126. 10.1214/11AOAS505.
 48.
Stehlé J, Voirin N, Barrat A, Cattuto C, Colizza V: Simulation of a SEIR infectious disease model on the dynamic contact network of conference attendees. BMC Med. 2011, 9: 8710.1186/17417015987.
 49.
Isella L, Stehlé J, Barrat A, Cattuto C, Pinton JF: What’s in a crowd? Analysis of facetoface behavioral networks. J Theor Biol. 2011, 271: 166180. 10.1016/j.jtbi.2010.11.033.
 50.
Holme P, Saramäki J: Temporal networks. Physics Reports. 2012, 519: 97125. 10.1016/j.physrep.2012.03.001.
 51.
Miritello G, Moro E, Lara R: Dynamical strength of social ties in information spreading. Phys Rev E. 2011, 83: 045102
 52.
Karsai M, Kivelä M, Pan RK, Kaski K, Kertész J, Barabási AL, Saramäki J: Small but slow world: how network topology and burstiness slow down spreading. Phys Rev E. 2011, 83: 025102
 53.
Rocha LEC, Liljeros F, Holme P: Simulated epidemics in an empirical spatiotemporal network of 50,185 sexual contacts. PLoS Comput Biol. 2011, 7: e100110910.1371/journal.pcbi.1001109.
 54.
Masuda N, Holme P: Predicting and controlling infectious disease epidemics using temporal networks. F1000 Prime Reports. 2013, 5: 6
Prepublication history
The prepublication history for this paper can be accessed here:http://www.biomedcentral.com/14712334/13/185/prepub
Acknowledgments
This work has been partially supported by the Italian Ministry of Health (Grant n. 1M22) to CR, AET and CC, by the French ANR project HarMSflu (ANR12MONU0018) to AB, and by the EU FET project Multiplex 317532 to AB and CC. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. CC acknowledges a stimulating discussion with John Edmunds and inspiring discussions at the Data Science and Epidemiology workshop organized by Marcel Salathé.
Author information
Additional information
Competing interests
The authors declare no competing interest.
Authors’ contributions
AB and CC designed the study. AB, CC, FG, CR, AET designed and executed the contact network measurements in the hospital. FG and AET managed the field operations, collected participant metadata and cleaned the data. AB, CC, AM analyzed the data. AM performed the simulations. AB, CC, AM wrote the manuscript. All Authors contributed to the final version of the manuscript.
Electronic supplementary material
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
About this article
Received
Accepted
Published
DOI
Keywords
 Attack Rate
 Negative Binomial Distribution
 Contact Network
 Contact Pattern
 Contact Duration