# Seminari del DiSIA

## Abstract

18/06/2018 ore 12.00

### Data-driven transformations in small area estimation: An application with the R-package emdi

### Timo Schmid (Freie Universität Berlin)

Small area models typically depend on the validity of model assumptions. For example, a commonly used version of the Empirical Best Predictor relies on the Gaussian assumptions of the error terms of the linear mixed regression model, a feature rarely observed in applications with real data. The present paper proposes to tackle the potential lack of validity of the model assumptions by using data-driven scaled transformations as opposed to ad-hoc chosen transformations. Different types of transformations are explored, the estimation of the transformation parameters is studied in detail under the linear mixed regression model, and transformations are used in small area prediction of linear and non-linear parameters. Mean squared error estimation that accounts for the uncertainty due to the estimation of the transformation parameters is explored using bootstrap approaches. The proposed methods are illustrated using real survey and census data for estimating income deprivation parameters for municipalities in Mexico with the R-package emdi. The package enables the estimation of regionally disaggregated indicators using small area estimation methods and includes tools for (a) customized parallel computing, (b) model diagnostic analyses, (c) creating high-quality maps and (d) exporting the results to Excel and OpenDocument spreadsheets. Simulation studies and the results from the application show that using carefully selected, data-driven transformations can improve small area estimation.

14/06/2018 ore 12.00

### A brief history of linear quantile mixed models and recent developments in nonlinear and additive regression

### Marco Geraci (University of South Carolina)

What follows is a story that began about sixteen years ago in Viale Morgagni (with some of the events taking place in a cottage of the Montalve’s estate). In this talk, I will retrace the steps that led me to develop linear quantile mixed models (LQMMs). These models have found application in public health, preventive medicine, virology, genetics, anesthesiology, immunology, ophthalmology, orthodontics, cardiology, pharmacology, biochemistry, biology, marine biology, environmental, climate and marine sciences, psychology, criminology, gerontology, economics and finance, linguistics and lexicography. Supported by a grant from the National Institute of Child Health and Human Development, I recently extended LQMMs to nonlinear and additive regression. I will present models, estimation algorithms and software, along with a few applications.

18/05/2018 ore 14.30

### Transcompiling and Analysing Firewalls

### Letterio Galletta (IMT Lucca)

Configuring and maintaining a firewall configuration is notoriously hard. On the one hand, network administrators have to know in detail the policy meaning, as well as the internals of the firewall systems and of their languages. On the other hand, policies are written in low-level, platform-specific languages where firewall rules are inspected and enforced along non-trivial control flow paths. Further difficulties arise from Network Address Translation (NAT), an indispensable mechanism in IPv4 networking for performing port redirection and translation of addresses. In this talk, we present a transcompilation pipeline that helps system administrators reason about policies, port a configuration from one system to another, and perform refactoring, e.g., removing useless or redundant rules. Our pipeline and its correctness are based on IFCL, a generic configuration language equipped with a formal semantics. Relying on this language, we decompile a real firewall configuration into an abstract specification, which exposes the meaning of the configuration and enables us to carry out analysis and recompilation.

18/05/2018 ore 12.00

### Empirical Bayes Estimation of Species Distribution with Overdispersed Data

### Fabio Divino (Dipartimento di Bioscienze e Territorio, Università del Molise)

The estimation of species distributions is fundamental in the assessment of biodiversity and in the monitoring of environmental conditions. In this work we present preliminary results in Bayesian inference of multivariate discrete distributions with applications in biomonitoring. In particular, we consider the problem of the estimation of species distributions when the collected data are affected by overdispersion. Ecologists often have to deal with data which exhibit a variability that differs from what they expect on the basis of the model assumed to be valid. The phenomenon is known as overdispersion if the observed variability exceeds the expected variability, or underdispersion if it is lower than expected. Such differences between observed and expected variation in the data can be interpreted as failures of some of the basic hypotheses of the model. The problem is very common when dealing with count data, for which the variability is directly connected with the magnitude of the phenomenon. Overdispersion is more common than underdispersion and can originate from several causes. Among them, the most important and relevant in ecology is overdispersion due to heterogeneity of the population with respect to the assumed model. An interesting approach to accounting for overdispersion by heterogeneity is the method based on compound models. The idea of compound models originated in the 1920s and concerns the possibility of mixing a model of interest with a mixing probability measure. The mixture resulting from the integration generates a new model that accounts for larger variation than the reference model alone. A typical application of this approach is the well-known Gamma-Poisson compound model, which generates the Negative Binomial model. In this work we present the use of a double compound model, a multivariate Poisson distribution combined with Gamma and Dirichlet models, in order to account for the presence of overdispersion.
Some simulation results will compare the Gamma-Dirichlet-Poisson model with the reference Multinomial model. Further, some applications in the biomonitoring of aquatic environments are presented. Acknowledgement: this work is part of joint research with Salme Karkkainen, Johanna Arje and Antti Penttinen (University of Jyväskylä) and Kristian Meissner (SYKE, Finland).
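The Gamma-Poisson compounding described above can be checked numerically. The following is a minimal illustrative sketch (not part of the talk's material): rates are drawn from a Gamma distribution and counts from a Poisson with those rates, so the marginal distribution is Negative Binomial and its variance exceeds its mean.

```python
import numpy as np

rng = np.random.default_rng(42)

# Mix a Poisson model with a Gamma mixing measure:
# lambda ~ Gamma(shape, scale), y | lambda ~ Poisson(lambda).
# Marginally, y follows a Negative Binomial distribution.
shape, scale = 2.0, 3.0
lam = rng.gamma(shape, scale, size=100_000)
y = rng.poisson(lam)

# For a plain Poisson, mean == variance.  Here the theoretical moments
# are E[y] = shape*scale = 6 and Var[y] = shape*scale + shape*scale**2 = 24,
# so the sample shows clear overdispersion.
print(y.mean(), y.var())
```

The gap between the two printed numbers is exactly the extra variation contributed by the heterogeneity of the Gamma-distributed rates.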

11/05/2018 ore 14.30

### From sentiment to superdiversity on Twitter

### Alina Sirbu (Dip. Informatica, Università di Pisa)

Superdiversity refers to large cultural differences in a population due to immigration. In this talk we introduce a superdiversity index based on Twitter data and lexicon-based sentiment analysis, using ideas from epidemic spreading and opinion dynamics models. We show how our index correlates with official immigration statistics available from the European Commission’s Joint Research Centre, and we compare it with various other measures computed from the same Twitter data. We argue that our index has predictive power in regions where exact data on immigration is not available, paving the way for a nowcasting model of immigration.

18/04/2018 ore 12.00

### Convolution Autoregressive Processes and Excess Volatility

### Umberto Cherubini (University of Bologna)

We discuss the economic model and the econometric properties of the Convolution Autoregressive Process of order 1 (C-AR(1)), with focus on the simplest Gaussian case. This is a first-order autoregressive process in which the innovations are dependent on the lagged value of the process. We show that the model may be generated by the presence of extrapolative bias in expectations. Extrapolative expectations bring about excess volatility and excess persistence in the dynamics of the variable. While excess volatility cannot be identified if one only observes the time series of the variable, identification can be achieved if the expectations of the variable are observed in the forward markets. We show that the model is well suited to generate the excess variance of long-maturity prices documented in the Giglio-Kelly variance ratio test. We finally discuss possible extensions of the model beyond the Gaussian case, both by changing the specification of the expectations model and by setting a non-linear data generating process for the fundamental process.

16/03/2018 ore 12.45 - Aula Magna 327 - Polo Morgagni (V.le Morgagni 40)

### Heterogeneous federated data center for research: from design to operation

### Sergio Rabellino (University of Torino)

Bringing a large datacenter from design to operation is complex anywhere in the world. In Italy it is a challenge that starts with a game of snakes and ladders against bureaucracy and colleagues to shape the European tenders, the governance board, the price list, the access policy and, eventually, all the software needed to operate the datacenter. The talk reviews all these aspects through the experience of designing the University of Torino Competency Center in Scientific Computing, serving over 20 departments with its 1M€ OCCAM platform, and of writing the HPC4AI project charter, recently funded with 4.5M€ by the Piedmont Region and serving over 10 departments in two universities (University and Technical University of Turin).

16/03/2018 ore 12.00 - Aula Magna 327 - Polo Morgagni (V.le Morgagni 40)

### Designing a heterogeneous federated data center for research

### Marco Aldinucci (Computer Science Department, University of Torino)

The advance of high-speed networks and virtualization techniques makes it possible to leverage economies of scale and consolidate all the servers of a large organisation into a few powerful, energy-efficient data centers, which also work synergistically with public clouds. In this respect, research organisations such as universities exhibit distinguishing features of both a technical and a sociological nature. Firstly, there is hardly a compute workload and resource usage pattern that is dominant across the many different disciplines and departments. This makes the design of the datacenter and of its access policy delicate enough to raise open research questions. Secondly, scientists are generally inclined to complain about anything short of total freedom to do what they urgently want to do, including behaviours that are strange, greedy, or dangerous for the data and the system themselves.

22/02/2018 ore 12.00

### Job Instability and Fertility during the Economic Recession: EU Countries

### Isabella Giorgetti (Dep. of Economics and Social Sciences, Università Politecnica delle Marche)

Trends of decline in the total fertility rate (TFR) varied widely across EU countries. Exploiting individual data from the longitudinal EU-SILC dataset (2005-2013), this study investigates the cross-country effect of job instability on a couple’s choice of having one (more) child. I build a job instability measure for both partners from the first-order lag of each partner's economic activity status in the labour market (holding a temporary or permanent contract, or being unemployed). In order to account for unobserved heterogeneity and the potential presence of endogeneity, I estimate, under sequential moment restrictions, a Two Stage Least Squares (2SLS) model in first differences. I then group European countries according to six different welfare regimes and estimate the heterogeneous effects of labour market instability on childbearing in a comparative framework. The principal result is that the cross-country average effect of job instability on couples’ fertility decisions is not statistically significant, owing to large country-specific fixed effects. Distinguishing between welfare regimes, institutional settings and active social policies reveals varying family fertility behaviour. In low-fertility countries, however, it is confirmed that the impact of parents’ successful labour market integration might be ambiguous, possibly due to the scarcity of child care options and/or cultural norms.

15/02/2018 ore 11.30

### Data Science and Our Environment

### Francesca Dominici (Harvard T.H. Chan School of Public Health)

What if I told you I had evidence of a serious threat to American national security: a terrorist attack in which a jumbo jet will be hijacked and crashed every 12 days. Thousands will continue to die unless we act now. This is the question before us today, but the threat doesn’t come from terrorists. The threat comes from climate change and air pollution. We have developed an artificial neural network model that uses on-the-ground air-monitoring data and satellite-based measurements to estimate daily pollution levels across the continental U.S., breaking the country up into 1-square-kilometer zones. We have paired that information with health data contained in Medicare claims records from the last 12 years, covering 97% of the population aged 65 or older. We have developed statistical methods and computationally efficient algorithms for the analysis of over 460 million health records. Our research shows that short- and long-term exposure to air pollution is killing thousands of senior citizens each year. This data science platform is telling us that federal limits on the nation’s most widespread air pollutants are not stringent enough. This type of data is the sign of a new era for the role of data science in public health, and also for the associated methodological challenges. For example, with enormous amounts of data, the threat of unmeasured confounding bias is amplified, and causality is even harder to assess with observational studies. These and other challenges will be discussed.

17/01/2018 ore 15.30

### Presentation seminar: main research lines and results obtained

### Francesca Giambona (DiSIA)

The seminar will present the main research lines and the main topics analysed over my scientific career. In particular, the main statistical methodologies used to analyse the phenomena under study will be described, and the most important empirical results obtained will be presented. Finally, the research topics currently under study will be briefly introduced.

17/01/2018 ore 14.30

### Latent variable modelling for dependent observations: some developments

### M. Francesca Marino (DiSIA)

When dealing with dependent data, standard statistical methods cannot be directly used for the analysis, as they may produce biased results and lead to misleading inferential conclusions. In this framework, latent variables are frequently used as a tool for capturing dependence and describing association. During the seminar, some developments in the context of latent variable modelling will be presented. Three main areas of research will be covered, namely quantile regression, social network analysis, and small area estimation. The main results will be presented and some hints on current research and future developments will be given.

12/01/2018 ore 12.00

### SAM Based Analysis of the Impact of VAT Rate Cut on the Economic System of China after the Tax Reform

### Ma Kewei (Shanxi University of Finance and Economics)

In May 2016 China completed its tax reform, moving from a bifurcated system based on a business tax (BT) and a Value Added Tax (VAT) to a system entirely based on VAT. The effects of this reform are still under analysis, regarding both the alleviation of the tax burden on the industry and service sectors and the economic improvements. Currently, there seems to be common agreement that, except for some minor cases worthy of further investigation, these effects are positive in both respects. It is interesting to analyze the impact that a reduction of the VAT rate on some key industries would have in a VAT-based system. To our knowledge, no analysis in this direction has been performed so far. In this paper we try to fill this gap and analyze the effect of VAT rate cuts on selected industries on the whole economic system, using an impact multiplier model based on a purposely elaborated Social Accounting Matrix (SAM) 2015 for China, which also allows a preliminary analysis of the structure of China’s economic system. (Joint work with Guido Ferrari and Zichuan Mi)

29/11/2017 ore 12.00

### Probabilistic Distance Algorithm: some recent developments in data clustering

### Francesco Palumbo (Università di Napoli Federico II)

Distance-based clustering methods, which belong to the non-model-based family, optimize a global criterion based on the distances among clusters. The most widely known distance-based method is k-means clustering (MacQueen, 1967), and several extensions of this method have recently been proposed (Vichi and Kiers, 2001; Rocci et al., 2011; Timmerman et al., 2013). These extensions overcome issues arising from correlation between variables. However, k-means clustering, and extensions thereof, can fail when clusters do not have a spherical shape and/or some extreme points are present, which tend to affect the group means. Iyigun (2007) and Ben-Israel and Iyigun (2008) propose a non-hierarchical distance-based clustering method, called probabilistic distance (PD) clustering, that overcomes these issues. Tortora et al. (2016) propose a factor version of the method to deal with high-dimensional data. In PD-clustering, the number of clusters K is assumed to be known a priori, and wide reviews of how to choose K are available in the literature. Given some initial centres, the probability of any point belonging to a cluster is assumed to be inversely proportional to the distance from the centre of that cluster (Iyigun, 2007). The aim of the seminar is to illustrate the PD algorithm and some recent applications in the data clustering framework, from both the model-based (Rainey et al., 2017) and non-model-based perspectives.
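The membership rule at the heart of PD-clustering (probability inversely proportional to the distance from each centre) can be sketched in a few lines. This is an illustrative fragment, not the implementation discussed in the talk, and the helper name is hypothetical.

```python
import numpy as np

def pd_memberships(X, centers, eps=1e-12):
    """PD-clustering membership probabilities: for each point x,
    p_k(x) is inversely proportional to the distance d_k(x) from
    centre k, normalised so the K probabilities sum to one."""
    # d[i, k] = Euclidean distance of point i from centre k
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + eps
    inv = 1.0 / d
    return inv / inv.sum(axis=1, keepdims=True)
```

A point lying on a centre receives a membership probability close to one for that cluster, while a point equidistant from all centres receives uniform probabilities; unlike a k-means hard assignment, every point keeps a probability for every cluster.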

15/11/2017 ore 12.00

### Bayesian Multilevel Latent Class Models for the Multiple Imputation of Nested Categorical Data

### Davide Vidotto (Tilburg University)

Multiple imputation of multilevel data (i.e., data collected from different groups) requires not only taking correlations among variables into account, but also considering possible dependencies between units coming from the same group. While a number of imputation models have been proposed in the literature for continuous data, existing methods for multilevel categorical data, such as the JOMO imputation method, still have limitations. For instance, JOMO only considers pairwise relationships between variables, and uses default priors that can affect the quality of the imputations in the case of small sample sizes. In the present work, we propose using Multilevel Latent Class models to perform multiple imputation of missing multilevel categorical data. The model is flexible enough to retrieve the original (complex) associations of the variables at hand while respecting the data hierarchy. The model is implemented in a Bayesian framework and estimated via Gibbs sampling, a natural choice for multiple imputation applications. After formally introducing the model, we will show the results of a simulation study in which model performance is assessed and compared with the listwise deletion and JOMO methods. Results indicate that the Bayesian Multilevel Latent Class model is able to recover unbiased and efficient parameter estimates of the analysis model considered in our study.

11/07/2017 ore 15.00 - aula 303 del Plesso Didattico Morgagni

### Resilient and Secure Cyber Physical Systems: Matching the Present and the Future!

### Andrea Bondavalli (DiMAI)

A cyber-physical system (CPS) is a system in which computational elements interact closely with physical entities through sensors and actuators, thereby controlling individual, organisational or mechanical processes by means of information and communication technologies (computers, software and networks). Such systems are typically automated, intelligent and collaborative, and many of them require high levels of resilience and security to ensure their survival in the presence of random faults, deliberate attacks and, in general, unforeseen critical events.
The seminar will be organised in two parts:
- The first part will present the main characteristics of cyber-physical systems, focusing in particular on aspects related to their resilience and security, and will illustrate concrete examples of CPSs in several application domains, from the Internet of Things (IoT) to Systems of Systems and Industry 4.0.
- The second part will present the new curriculum in Resilient and Secure Cyber Physical Systems of the Master's Degree in Computer Science, clarifying its distinguishing features, learning objectives and career opportunities.

Participation is free for all; however, registration is required via the following form:
https://docs.google.com/forms/d/e/1FAIpQLScrONY0ixHxA190hIk04g1VQjd_4btdRbXtY1GkuGQGSWQSdg/viewform

21/06/2017 ore 11.30

### Stronger Instruments and Refined Covariate Balance in an Observational Study of the Effectiveness of Prompt Admission to the ICU

### Luke Keele (Georgetown University)

Instrumental Variable (IV) methods, subject to appropriate identification assumptions, allow for consistent estimation of causal effects in observational data in the presence of unobserved confounding. Near-far matching has been proposed as one analytic method to improve inference by strengthening the effect of the instrument on the exposure and balancing observable characteristics between groups of subjects with low and high values of the instrument. However, in settings with hierarchical data (e.g. patients nested within hospitals), or where several covariate interactions must be balanced, conventional near-far matching algorithms may fail to achieve the requisite covariate balance. We develop a new matching algorithm that combines near-far matching with refined covariate balance, to balance large numbers of nominal covariates while also strengthening the IV. This extension of near-far matching is motivated by a UK case study that aims to identify the causal effect of prompt admission to the Intensive Care Unit on 7-day and 28-day mortality.

16/06/2017 ore 11.30

### Assessing the Efficacy of Intrapartum Antibiotic Prophylaxis for Prevention of Early-Onset Group B Streptococcal Disease through Propensity Score Design

### Elizabeth R. Zell (Centers for Disease Control and Prevention)

Observational data can assist in answering hard questions about disease prevention. Early-onset neonatal group B streptococcal disease (EOGBS) can be prevented by intrapartum antibiotic prophylaxis (IAP). Clinical trials demonstrated the efficacy of beta-lactam agents for a narrow population of women. Questions remain about the effectiveness of antibiotic durations of less than 4 hours and of agents appropriate for penicillin-allergic women. We applied propensity score design methods to sample survey data on EOGBS cases and over 7000 non-cases from ten US states in 2003-2004, to match infants exposed to IAP with infants who were not exposed. Antibiotic efficacy was estimated for different antibiotic classes and durations before delivery. Our analysis supports the recommendation that beta-lactam intrapartum prophylaxis of at least four hours before delivery remain the primary treatment; prophylaxis of less than four hours and clindamycin prophylaxis are not as effective in preventing EOGBS.

15/06/2017 ore 15.30 - Aula 13

### Spatial Chaining of Price Indexes to Improve International Comparisons of Prices and Real Incomes

### D.S. Prasada Rao (School of Economics, The University of Queensland, Brisbane, Australia)

The International Comparisons Program (ICP) compares the purchasing power of currencies and real income of almost all countries in the world. An ICP multilateral comparison uses as building blocks bilateral comparisons between all possible pairs of countries. These are then combined to obtain the overall global comparison. One problem with this approach is that some of the bilateral comparisons are typically of lower quality, and their inclusion therefore undermines the integrity of the multilateral comparison. Formulating multilateral comparisons as a graph theory problem, we show how quality can be improved by replacing bilateral comparisons with their shortest path spatially chained equivalents. We consider a number of different ways in which this can be done, and illustrate these methods using data from the 2011 round of ICP. We then propose criteria for comparing the performance of competing multilateral methods, and using these criteria demonstrate how spatial chaining improves the quality of the overall global comparison.
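The shortest-path chaining idea can be illustrated with a toy sketch: countries are nodes, each bilateral link carries a quality cost and a bilateral price ratio, and a low-quality direct comparison is replaced by the product of ratios chained along the minimum-cost path. This is a hypothetical illustration of the general idea, not the authors' procedure; the function name, weights and data are invented.

```python
import heapq

def chained_comparison(graph, ratio, src, dst):
    """Replace the direct bilateral comparison src->dst with the
    product of bilateral price ratios chained along the path of
    minimum total quality cost (Dijkstra's algorithm)."""
    dist = {src: 0.0}   # best known cost to reach each country
    prod = {src: 1.0}   # chained ratio along that best path
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            return prod[u]
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, cost in graph[u]:
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prod[v] = prod[u] * ratio[(u, v)]
                heapq.heappush(pq, (nd, v))
    return None  # dst unreachable

# The direct A-C link is low quality (cost 5); chaining A-B-C is
# cheaper, so the A-C comparison becomes ratio(A,B) * ratio(B,C).
graph = {"A": [("B", 1.0), ("C", 5.0)],
         "B": [("A", 1.0), ("C", 1.0)],
         "C": [("A", 5.0), ("B", 1.0)]}
ratio = {("A", "B"): 2.0, ("B", "C"): 3.0, ("A", "C"): 10.0}
print(chained_comparison(graph, ratio, "A", "C"))   # 6.0
```

In the toy example the chained value (6.0) replaces the noisy direct ratio (10.0), which is the sense in which spatial chaining substitutes low-quality bilateral links with their shortest-path equivalents.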

15/06/2017 ore 14.30 - Aula 101

### Netflix and Deep Learning

### P. Crescenzi (Università di Firenze)

An outreach seminar on some "hot" topics in computer science:
- How, among tens of thousands of films, someone can suggest what to watch, and get it right: the magic of recommender systems.
- How the behaviour of neurons in the human brain has been studied and simulated through neural networks, and how these networks, once trained, can behave intelligently: the magic of artificial intelligence.

Participation is free for all; however, registration is required via the following form: https://docs.google.com/forms/d/e/1FAIpQLSd8O2B_cNeNy3zMCKB6nHUDjxZ9OVx55e0H9VFdpmVi6p9MXA/viewform

09/06/2017 ore 11.30

### Exact P-values for Network Interference

### Guido Imbens (Stanford)

We study the calculation of exact p-values for a large class of non-sharp null hypotheses about treatment effects in a setting with data from experiments involving members of a single connected network. The class includes null hypotheses that limit the effect of one unit's treatment status on another according to the distance between units; for example, the hypothesis might specify that the treatment status of immediate neighbors has no effect, or that units more than two edges away have no effect. We also consider hypotheses concerning the validity of sparsification of a network (for example based on the strength of ties) and hypotheses restricting heterogeneity in peer effects (so that, for example, only the number or fraction treated among neighboring units matters). Our general approach is to define an artificial experiment, such that the null hypothesis that was not sharp for the original experiment is sharp for the artificial experiment, and such that the randomization analysis for the artificial experiment is validated by the design of the original experiment. (with Susan Athey and Dean Eckles)

27/04/2017 ore 14.30

### Introducing AWMA and its activities to promote mathematics among African women

### Kifle Yirgalem Tsegaye (Department of Mathematics, Addis Ababa University, Ethiopia)

Archaeological findings and their interpretations persuade us to think that Africa is not only the cradle of humankind but, along with ancient Mesopotamia and others, also a birthplace of mathematical and technological ideas. It is natural to ask: what happened to civilization, mathematics and the technological sciences in contemporary Africa? A million-euro question, though this talk is about what African women in mathematics are doing to change, or at least improve, their realities. Just like women around the world, African women have long been denied access to important components of development such as education, in particular mathematics and the technological sciences, through impositions and restrictions of the kind "these are strictly meant for men". Many had no choice but to believe this and to collaborate in making only "good wives" or "attractive women" of themselves. Things are changing in this regard, but not strongly enough to free African women from the stereotypical depiction of themselves and let them thrive in the world of science. Role models in every profession are important to convince youngsters that it is possible for them to be whoever they want to be. Having nation-wide, continent-wide or world-wide networks of women in mathematics and the technological sciences enables us to destroy such stereotypes and open the gate wide enough for our girls to engage with all the sciences, above all mathematics and the technological sciences. The talk is a brief introduction to: - the African Women in Mathematics Association (AWMA) and its activities; - Ethiopian girls in the fields of science, mathematics, engineering and technology.

29/03/2017 ore 14.30

### The challenges of extreme risks: theoretical models and financial-actuarial tools

### Marcello Galeotti (University of Florence)

1. A dynamic definition of economic risk. Risk measures: VaR and Expected Shortfall. The problem of computing risk measures.
2. Extreme value theory. Light- and heavy-tailed distributions. Hazard rate. Fluctuations of sums and maxima. Mean excess. The generalized Pareto distribution.
3. Innovative financial instruments for managing environmental risks. Project options and catastrophe bonds. Interactive dynamics: the possibility of "virtuous" outcomes and of sub-optimal equilibria. A case study: flood risk of the Arno river in the area and city of Florence.
4. An evolutionary model for healthcare risks. Defensive medicine, health insurance, legal actions. An evolutionary game model. The role of insurance premiums. Asymptotic behaviours depending on the premium calculation principles.

23/03/2017 ore 14.45

### Endogenous Significance Levels in Finance and Economics

### Alessandro Palandri (DiSIA - University of Florence)

This paper argues that rational agents who do not know the model's parameters and have to estimate them will treat the significance level as a choice variable. Calculating the costs associated with Type I and Type II errors, rational agents will choose significance levels that maximize expected utility. The misalignment of investigators' standard statistical significance levels with those that are optimal for agents has profound implications for empirical tests of the model itself. Specifically, empirical studies could reject models on the basis of statistically significant mispricings when in fact the models are true: the mispricings are not significant for the agents, so there is no expected profitable intervention and no force bringing the system back to equilibrium.

23/03/2017 ore 14.00

### Bayesian methods in Biostatistics

### Francesco Stingo (University of Florence)

In this talk I will review some Bayesian methods for bio-medical applications that I have developed in recent years. These methods can be classified into four research areas: 1) graphical models for complex biological networks, 2) hierarchical models for data integration (integromics and imaging genetics), 3) power-prior approaches for personalized medicine, 4) change-point models for cancer early detection.

09/03/2017 ore 14.30

### Validity of case-finding algorithms for diseases in database and multi-database studies

### Rosa Gini (Agenzia Regionale di Sanità della Toscana, Firenze)

In database studies in pharmacoepidemiology and health services research, variables that identify a disease are derived from existing data sources by means of data processing. The ‘true’ variables should be conceptualized as unobserved quantities, and the study variables entering the actual analysis as measurements, resulting from case-finding algorithms (CFAs) applied to the original data. The validity of a CFA is the difference between its result and the true variable. The science of estimating the validity of CFAs is still developing. Stemming from the methodology of validating diagnostic algorithms, it nevertheless has specific hurdles and opportunities. In this talk we will introduce the concept of *component CFA*. We will show the results from a large validation study of CFAs for type 2 diabetes, hypertension, and ischaemic heart disease in Italian administrative databases, using primary care medical records as a gold standard, and propose further research to generalize and apply its results. We will then show how component analysis may support the estimation of the validity of CFAs in multi-database, multi-national studies.

09/02/2017 ore 15.15 - (Aula Anfiteatro, viale Morgagni 65)

### Introduction to Riordan graphs

### Gi-Sang Cheon (Department of Mathematics, Sungkyunkwan University (Korea))

There are many reasons to define new classes of graphs. Generalizing the Pascal matrix construction, we consider the n×n symmetric Riordan matrix modulo 2 and define the Riordan graph RG(n) of order n. In this talk, we study some basic properties of Riordan graphs such as the number of edges, the degree sequence, the matching number, the clique number, the independence number, and so on. Moreover, several examples of Riordan graphs are given, including Pascal graphs, Catalan graphs, Fibonacci graphs and many others.

09/02/2017 ore 14.00 - (Aula Anfiteatro, viale Morgagni 65)

### Pascal Graphs and related open problems

### Gi-Sang Cheon (Department of Mathematics, Sungkyunkwan University (Korea))

In 1983, Deo and Quinn introduced Pascal graphs in their search for a class of graphs with certain desired properties for use as computer networks. One of the desired properties is that the design be simple and recursive, so that when a new node is added the entire network does not have to be reconfigured. Another is that one central vertex be adjacent to all others. The third requirement is that there exist several paths between each pair of vertices (for reliability) and that some of these paths be short (to reduce communication delays). Finally, the graphs should have good cohesion and connectivity. A Pascal matrix PM(n) of order n is defined to be an n×n symmetric binary matrix whose main diagonal entries are all 0 and whose lower triangular part consists of the first n-1 rows of Pascal's triangle modulo 2. The graph with adjacency matrix PM(n) is called the Pascal graph of order n.
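The definition of PM(n) above is concrete enough to sketch in code. The following is a minimal illustration (not from the talk), assuming the convention that row i of the matrix, for i = 1, …, n-1, holds row i-1 of Pascal's triangle reduced modulo 2:

```python
import numpy as np

def pascal_matrix(n):
    """Build PM(n): an n x n symmetric binary matrix with zero main
    diagonal whose strictly lower-triangular part consists of the
    first n-1 rows of Pascal's triangle modulo 2."""
    PM = np.zeros((n, n), dtype=int)
    row = [1]  # row 0 of Pascal's triangle
    for i in range(1, n):
        PM[i, :len(row)] = [c % 2 for c in row]
        # next row of Pascal's triangle
        row = [1] + [row[j] + row[j + 1] for j in range(len(row) - 1)] + [1]
    PM += PM.T  # symmetrize; the diagonal stays zero
    return PM

# The adjacency matrix of the Pascal graph of order 4:
print(pascal_matrix(4))
```

Since every row of Pascal's triangle starts with 1, the first column (and row) of PM(n) is all ones off the diagonal, which is exactly the "one central vertex adjacent to all others" property mentioned above.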

26/01/2017 ore 14.00

### Uncertainty in Propensity Score Estimation: Bayesian Methods for Variable Selection and Model-Averaged Causal Effects

### Corwin M. Zigler (Harvard University)

Causal inference with observational data frequently relies on the notion of the propensity score (PS) to adjust treatment comparisons for observed confounding factors. As comparative effectiveness research in the era of "big data" increasingly relies on large and complex collections of administrative resources, researchers are frequently confronted with decisions regarding which of a high-dimensional covariate set to include in the PS model in order to satisfy the assumptions necessary for estimating average causal effects. Typically, simple or ad-hoc methods are employed to arrive at a single PS model, without acknowledging the uncertainty associated with the model selection. We propose Bayesian methods for PS variable selection and model averaging that 1) select relevant variables from a set of candidate variables to include in the PS model and 2) estimate causal treatment effects as weighted averages of estimates under different PS models. The associated weight for each PS model reflects the data-driven support for that model’s ability to adjust for the necessary variables. We illustrate features of our proposed approaches with a simulation study, and ultimately use our methods to compare the effectiveness of treatments for brain tumors among Medicare beneficiaries.
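As background to the abstract, a minimal sketch of the basic propensity-score idea it builds on (plain logistic-regression PS with inverse-probability weighting, not the authors' Bayesian model-averaging method; the data and coefficients are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
C = rng.normal(size=n)                       # observed confounder
p_true = 1 / (1 + np.exp(-0.8 * C))          # treatment probability depends on C
T = rng.binomial(1, p_true)                  # treatment indicator
Y = 2.0 * T + 1.5 * C + rng.normal(size=n)   # true average causal effect = 2

# Naive difference in means is biased because C affects both T and Y
naive = Y[T == 1].mean() - Y[T == 0].mean()

# Estimate the propensity score by logistic regression (Newton-Raphson)
X = np.column_stack([np.ones(n), C])
beta = np.zeros(2)
for _ in range(25):
    ps = 1 / (1 + np.exp(-X @ beta))
    W = ps * (1 - ps)
    beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (T - ps))
ps = 1 / (1 + np.exp(-X @ beta))

# Inverse-probability-weighted estimate of the average causal effect
ipw = (np.sum(T * Y / ps) / np.sum(T / ps)
       - np.sum((1 - T) * Y / (1 - ps)) / np.sum((1 - T) / (1 - ps)))
print(naive, ipw)  # naive is biased upward; ipw is close to 2
```

The uncertainty the abstract targets enters one step earlier: which covariates go into the logistic PS model in the first place, a choice the proposed Bayesian approach averages over rather than fixing.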

26/01/2017 ore 11.00

### Bayesian Effect Estimation Accounting for Adjustment Uncertainty

### Corwin M. Zigler (Harvard University)

Model-based estimation of the effect of an exposure on an outcome is generally sensitive to the choice of which confounding factors are included in the model. We propose a new approach, which we call Bayesian adjustment for confounding (BAC), to estimate the effect of an exposure of interest on the outcome, while accounting for the uncertainty in the choice of confounders. Our approach is based on specifying two models: (1) the outcome as a function of the exposure and the potential confounders (the outcome model); and (2) the exposure as a function of the potential confounders (the exposure model). We consider Bayesian variable selection on both models and link the two by introducing a dependence parameter denoting the prior odds of including a predictor in the outcome model, given that the same predictor is in the exposure model. In the absence of dependence, BAC reduces to traditional Bayesian model averaging (BMA). In simulation studies, we show that BAC with dependence can estimate the exposure effect with smaller bias than traditional BMA, and improved coverage. We compare BAC with other methods, including traditional BMA, in a time series data set of hospital admissions, air pollution levels, and weather variables in Nassau, NY for the period 1999–2005. Using each approach, we estimate the short-term effects of PM2.5 on emergency admissions for cardiovascular diseases, accounting for confounding. This application illustrates the potentially significant pitfalls of misusing variable selection methods in the context of adjustment uncertainty.
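The two-model structure described above can be written out schematically; the notation here is illustrative (inclusion indicators and the dependence parameter are named for this sketch, not taken from the paper):

```latex
% Outcome model: indicators \alpha^Y_j select which confounders C_j enter
Y_i = \beta T_i + \textstyle\sum_j \alpha^Y_j \gamma_j C_{ij} + \varepsilon_i
% Exposure model: indicators \alpha^X_j select confounders of the exposure
T_i = \textstyle\sum_j \alpha^X_j \delta_j C_{ij} + \eta_i
% Dependence prior linking the two selections via the prior odds \omega:
\frac{P(\alpha^Y_j = 1 \mid \alpha^X_j = 1)}{P(\alpha^Y_j = 0 \mid \alpha^X_j = 1)} = \omega
```

Setting the prior odds to express no dependence recovers traditional BMA, as stated in the abstract; larger values push confounders of the exposure into the outcome model.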

Last updated 5 June 2018.