# DiSIA Seminars

## Abstract

23/09/2019 at 12.00

### Identification of good educational practices in schools with high added value using Data Mining Techniques

### Fernando Martínez Abad (University of Salamanca)

The main objective of this research project, funded by the BBVA Foundation, was to identify factors associated with performance in schools with high added value, in order to produce a catalogue of good educational practices and disseminate it to the educational community. Based on the results of the Spanish sample from the PISA 2015 assessment, this objective was developed in 3 stages: 1) Application of hierarchical linear models for identifying schools with high and low effectiveness: To quantify the effectiveness of every school, we isolated the average performance of schools not attributable to the effect of the contextual variables in order to obtain an average residual. Thus, we selected and characterized schools whose residual average performance is systematically maintained at higher (high effectiveness) and lower levels (low effectiveness); 2) Application of Data Mining techniques for identifying non-contextual factors associated with high and low effectiveness: We applied decision trees, which classify schools based on their scores on the explanatory variables, taking as a reference the given criterion (schools considered to have high and low effectiveness); 3) Design of a catalogue of good educational practices: this stage was developed with the aim of disseminating the results to different groups that may be interested in the project.

27/06/2019 at 14.00

### Dimensionality reduction via the identifications of the data intrinsic dimensions

### Antonietta Mira (Università della Svizzera Italiana)

Even if they are defined on a space with a large dimension, data points usually lie on a hypersurface, or manifold, with a much smaller intrinsic dimension (ID). The recent TWO-NN method (Facco et al., 2017, Scientific Reports) allows estimating the ID when all points lie on a single manifold. TWO-NN makes only a fairly weak assumption: that the density of points is approximately constant in a small neighborhood around each point. Under this hypothesis, the ratio of the distances from a point to its second and first nearest neighbours follows a Pareto distribution that depends parametrically only on the ID, allowing for an immediate estimation of the latter. We extend the TWO-NN model to the case in which the data lie on several manifolds with different IDs. While the idea behind the extension is simple (the Pareto distribution is just replaced by a mixture of K Pareto distributions), a non-trivial Bayesian scheme is required for correctly estimating the model and assigning each point to the correct manifold. Applying this method, which we dub Hidalgo (heterogeneous intrinsic dimension algorithm), we uncover a surprising ID variability in several real-world datasets such as fMRI, protein folding, financial data, gene expression and basketball data. The Hidalgo model obtains remarkable results, but it has the limitation that the number of components in the mixture must be fixed a priori. To adopt a fully Bayesian approach, a possible extension would be the specification of a prior distribution for the parameter K. Instead, with even greater flexibility, we let K go to infinity, using a Bayesian nonparametric approach, and model the data as an infinite mixture of Pareto distributions. This approach also takes into account the uncertainty about the number of mixture components.
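As a rough illustration of the single-manifold case, the sketch below computes a maximum-likelihood ID estimate from nearest-neighbour distance ratios. It assumes the ratio is taken as mu = r2 / r1 (second over first neighbour distance), so that log(mu) is exponential with rate equal to the ID. The function name `two_nn_id`, the brute-force distance computation and the synthetic test data are illustrative assumptions, not the authors' implementation, and the Hidalgo mixture extension is not covered.

```python
import numpy as np


def two_nn_id(points: np.ndarray) -> float:
    """Maximum-likelihood intrinsic-dimension estimate from the ratio of
    second to first nearest-neighbour distances (hypothetical helper)."""
    # Pairwise squared Euclidean distances (brute force, fine for a few thousand points).
    sq_norms = np.sum(points ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    np.fill_diagonal(sq_dists, np.inf)    # a point is not its own neighbour
    # After partitioning, columns 0 and 1 hold the smallest and second-smallest values.
    nearest = np.partition(sq_dists, 1, axis=1)[:, :2]
    r1 = np.sqrt(nearest[:, 0])
    r2 = np.sqrt(nearest[:, 1])
    mu = r2 / r1
    # If mu is Pareto with shape d on [1, inf), the MLE of d is N / sum(log mu).
    return len(mu) / np.sum(np.log(mu))


# Synthetic check: a flat 2-dimensional sheet rotated into 5 dimensions.
rng = np.random.default_rng(0)
latent = rng.uniform(size=(1500, 2))
padded = np.hstack([latent, np.zeros((1500, 3))])
rotation, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # random orthogonal matrix
estimated_id = two_nn_id(padded @ rotation)
print(f"estimated intrinsic dimension: {estimated_id:.2f}")
```

The estimate should be close to 2 even though the points have 5 coordinates, because only the two nearest neighbours of each point are used and the local density is roughly constant.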

26/06/2019 at 15:00

### B. Arpino: Challenges of causal inference in demographic observational studies

### R. Guetto: The growth of mixed unions in Italy: a marker of immigrant integration and societal openness?

### Welcome seminar: Bruno Arpino & Raffaele Guetto

B. Arpino: I will start the seminar by summarising my research interests and academic trajectory. Then I will present the following work-in-progress study, which combines my main methodological and applied research interests:

Demographers are often confronted with the goal of establishing a causal link between demographic events (e.g., fertility, union formation and dissolution) and socio-economic, health and other types of outcomes. Since experiments are commonly not a feasible strategy, demographic research often relies on observational studies. Not being able to manipulate the treatment assignment, demographers have to deal with several issues, such as omitted variable bias and reverse causality. The aims of this paper are to review the methods commonly used by demographers to estimate causal effects in observational studies and to discuss the strengths and limitations of these methods. Motivated by the estimation of the causal effect of grandparental childcare on health and using simulations mimicking the Survey of Health, Ageing and Retirement in Europe (SHARE), I will compare propensity score matching, fixed effects and instrumental variables regression. The goal of these simulations is to highlight the consequences of violating the assumptions underlying each method, which also depend on the type of data available to the researcher.
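To make the omitted-variable problem concrete, here is a minimal simulation sketch. It is not the SHARE-based design of the talk: the linear data-generating process, the continuous treatment, the coefficients and the function name `simulate_and_estimate` are all assumptions chosen for illustration. It contrasts a naive regression slope with a simple instrumental-variable (Wald) estimate when a confounder is unobserved.

```python
import numpy as np


def simulate_and_estimate(n: int = 20_000, true_effect: float = 1.0, seed: int = 0):
    """Return (naive OLS slope, IV estimate) for a hypothetical linear model."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)                 # unobserved confounder
    z = rng.normal(size=n)                 # instrument: affects y only through d
    d = 0.8 * z + u + rng.normal(size=n)   # treatment depends on z and u
    y = true_effect * d + u + rng.normal(size=n)
    # Naive slope of y on d, ignoring u: biased because cov(d, u) != 0.
    ols = np.cov(d, y)[0, 1] / np.var(d, ddof=1)
    # Wald / IV estimator: cov(z, y) / cov(z, d).
    iv = np.cov(z, y)[0, 1] / np.cov(z, d)[0, 1]
    return ols, iv


ols_estimate, iv_estimate = simulate_and_estimate()
print(f"naive OLS: {ols_estimate:.3f}  IV: {iv_estimate:.3f}  truth: 1.000")
```

Under these assumptions the naive slope overstates the effect, while the IV estimate is close to the true value. If the instrument were made to affect `y` directly, the IV estimate would become biased as well, which is the kind of assumption violation such simulations are meant to expose.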

I will conclude the seminar with my plans for future research in the short run.

R. Guetto: I will start the seminar by presenting my academic background, my research interests and plans for future research. I will then present an overview of the results of the research on native-immigrant unions and immigrant socioeconomic integration that I have carried out in recent years.

Mixed unions, i.e. unions between natives and immigrants, which in Italy primarily involve Italian men partnered with foreign women, are commonly understood as the height of immigrant integration and societal openness. However, consistent with the status exchange theory, such unions are more likely when less-educated, older native men marry better educated, younger immigrant women, especially when the latter originate from non-Western countries. I highlight the existence of a multiplicity of factors underlying such mating patterns. From the standpoint of the foreign partner, I discuss the relevance of immigrants’ economic circumstances and provide causal evidence on the role played by the possibility of obtaining Italian/EU citizenship through marriage. The analysis of the Italian partner’s perspective points to the increasing crowding out of low-educated men on the native marriage market. Cultural factors also need to be considered, as foreign women are usually more compliant with traditional gender roles. Overall, the results reject a simplistic interpretation of the growth of mixed unions as an indicator of increased immigrant integration and societal openness.

14/06/2019 at 11.00

### A causal inference approach to evaluate the health impacts of air quality regulations: The health benefits of the 1990 Clean Air Act Amendments

### Rachel Nethery (Harvard University)

In evaluating the health effects of previously implemented air quality regulations, the US Environmental Protection Agency first predicts what air pollution levels would have been in a given year under the counterfactual scenario of no regulation and then inserts these predictions into a health impact function to predict the corresponding counterfactual number of various types of health events (e.g., deaths and cardiovascular events). These predictions are then compared to the number of health events predicted for the observed pollutant levels under the regulations. This procedure is generally carried out for each pollutant separately. In this paper, we develop a causal inference framework to estimate the number of health events prevented by air quality regulations via the resulting changes in exposure to multiple pollutants simultaneously. We introduce a causal estimand called the Total Events Avoided (TEA), and we propose both a matching method and a Bayesian machine learning method for estimation. In simulations, we find that both the matching and machine learning methods perform favorably in comparison to standard parametric approaches, and we evaluate the impacts of tuning parameter specifications. To our knowledge, this is the first attempt to perform causal inference in the presence of multiple continuous exposures. We apply these methods to investigate the health impacts of the 1990 Clean Air Act Amendments (CAAA). In particular, we seek to answer the question "How many mortality events, cardiovascular hospitalizations, and dementia-related hospitalizations were avoided in the Medicare population in 2001 thanks to CAAA-attributable changes in pollution exposures in the year 2000?".
For each ZIP code in the US, we have obtained (1) pollutant exposure levels with the CAAA in place in 2000, (2) the observed count of each health event in the Medicare population in 2001, and (3) estimated counterfactual pollutant exposure levels under a no-CAAA scenario in 2000. Without relying on modeling assumptions, our matching and machine learning methods use confounder-adjusted relationships between observed pollution exposures and health outcomes to inform estimation of the number of health events that would have occurred in the same population under the counterfactual, no-CAAA pollution exposure levels. The TEA is computed as the difference between the estimated no-CAAA counterfactual event count and the observed event count. This approach could be used to analyze any regulation, any set of pollutants, and any health outcome for which data are available. This framework improves on the current regulatory evaluation protocol in the following ways: (1) the causal inference approach clarifies the question under study, the statistical quantity being estimated, and the assumptions of the methods; (2) statistical models and resulting estimates are built on real health outcome data; (3) the results do not rely on dubious parametric assumptions; and (4) all pollutants are evaluated concurrently so that any synergistic effects are accounted for.
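In symbols, the verbal definition of the TEA given above might be written as follows. The notation is assumed here for illustration and is not taken from the paper: $i$ indexes the $N$ ZIP codes, $Y_i^{\text{obs}}$ is the observed event count in 2001, and $\widehat{Y}_i^{\text{no-CAAA}}$ is the estimated count under the counterfactual exposure levels. Summing over ZIP codes is likewise an assumption about how the total is formed.

```latex
\mathrm{TEA} \;=\; \sum_{i=1}^{N} \left( \widehat{Y}_i^{\,\text{no-CAAA}} - Y_i^{\,\text{obs}} \right)
```

A positive value would indicate that more events were expected without the regulation than were actually observed.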

07/06/2019 at 15.00

### Sustainable Development Goals, Climate Change and Hazardous events: statistical measures, challenges and innovations

### Angela Ferruzza (ISTAT)

The Sustainable Development Goals, Climate Change and Hazardous events frameworks will be presented, considering their interactions. Challenges and innovations in the process of building capacity to increase the available statistical measures will be considered. The 2019 SDGs Istat National Report will be presented.

06/06/2019 at 11.00

### Causal inference methods in environmental studies: challenges and opportunities

### Francesca Dominici (Harvard University)

What if I told you I had evidence of a serious threat to American national security: a terrorist attack in which a jumbo jet will be hijacked and crashed every 12 days? Thousands will continue to die unless we act now. This is the question before us today, but the threat doesn't come from terrorists. The threat comes from climate change and air pollution. We have developed an artificial neural network model that uses on-the-ground air-monitoring data and satellite-based measurements to estimate daily pollution levels across the continental U.S., breaking the country up into 1-square-kilometer zones. We have paired that information with health data contained in Medicare claims records from the last 12 years, covering 97% of the population aged 65 or older. We have developed statistical methods and computationally efficient algorithms for the analysis of over 460 million health records. Our research shows that short- and long-term exposure to air pollution is killing thousands of senior citizens each year. This data science platform is telling us that federal limits on the nation's most widespread air pollutants are not stringent enough. This type of data is the sign of a new era for the role of data science in public health, and also for the associated methodological challenges. For example, with enormous amounts of data, the threat of unmeasured confounding bias is amplified, and causality is even harder to assess with observational studies. These and other challenges will be discussed.

03/06/2019 at 14.30

### Efficient Data Processing in High Performance Big Data Platforms

### Nicola Tonellotto (ISTI CNR)

Abstract: Web search engines are among the largest and most widely used big data platforms today. With the ever-growing amount of data produced daily, all Web search companies must rely on distributed data storage and processing mechanisms hosted on computer clusters composed of thousands of processors and provided by large data centers. This distributed infrastructure makes it possible to execute a vast amount of complex data processing that provides effective insights for users in near real time, with sub-second response times. Moreover, the most recent scientific advances in big data analytics exploit machine learning and artificial intelligence solutions. These solutions are particularly computationally expensive, and their energy consumption has a great impact on overall energy consumption at a global scale.
In this seminar we will discuss some recent investigations into (i) novel efficient algorithmic solutions to improve the usage of hardware resources (e.g., reducing response times and increasing throughput) when complex machine-learned models are used to process large data collections and (ii) online management of computational load to reduce energy consumption by automatically switching among the available CPU frequencies to adapt to external operational conditions.

Short bio: Dr. Nicola Tonellotto (http://hpc.isti.cnr.it/~khast/) is a researcher within the High Performance Computing Lab at the Information Science and Technologies Institute of the National Research Council of Italy. His main research interests include high performance big data platforms and information retrieval, focusing on efficiency aspects of query processing and resource management. Nicola has co-authored more than 60 papers on these topics in peer-reviewed international journals and conferences. He lectures on Computer Architectures for BSc students and Distributed Enabling Platforms for MSc students at the University of Pisa. He was co-recipient of the ACM SIGIR 2015 Best Paper Award for the paper entitled “QuickScorer: a Fast Algorithm to Rank Documents with Additive Ensembles of Regression Trees”.

30/05/2019 at 12.00

### Alternatives to Aging Alone?: "Kinlessness" and the Potential Importance of Friends

### Christine Mair (University of Maryland)

Increasing numbers of older adults cross-nationally are without children or partners in later life and may have greater reliance on non-kin (e.g., friends), although these patterns likely vary by country context. This paper hypothesizes that those without traditional kin and who live in countries with a stronger emphasis on friendship will have more friends. While these hypothesized patterns are consistent with interdisciplinary literatures, they have not been tested empirically and remain overlooked in current narratives on "aging alone." This study combines individual-level data from the Survey of Health, Ageing, and Retirement in Europe (SHARE, Wave 6) with aggregate nation-level data to estimate multilevel negative binomial models exploring the number of friends among those aged 50+ across 17 countries. Those who lack kin report more friends, particularly in countries with a higher percentage of people who believe that friends are "very important" in life. This paper challenges dominant assumptions about "aging alone" that rely on lack of family as an indicator of "alone." Future studies should investigate how friendship is correlated with lack of kin, particularly in wealthier nations. Previous research may have overestimated risk in wealthier nations, but underestimated risk in less wealthy nations and/or more family-centered nations.

28/05/2019 at 14.00

### S. Bacci: Developments in the context of Item Response Theory models

### A. Magrini: Linear Markovian models with distributed-lags to assess the economic impact of investments

### Welcome seminar: Silvia Bacci & Alessandro Magrini

S. Bacci: Latent variable models are a wide family of statistical models based on the use of unobservable (i.e., latent) variables for multiple aims, such as measuring unobservable traits, accounting for measurement errors, and representing unobserved heterogeneity that arises with complex data structures (e.g., multilevel and longitudinal data). The class of latent variable models includes the Item Response Theory (IRT) models, which are adopted to measure latent traits when individual responses to a set of categorical items are available.
In such a context, the multidimensional Latent Class IRT (LC-IRT) models extend traditional IRT models to allow for multidimensionality (i.e., presence of multiple latent traits) and discreteness of latent traits, which allows us to cluster individuals into unobserved homogeneous groups (latent classes).
The general formulation of this class of models is illustrated, with details concerning the model specification, the estimation, and the model selection. Furthermore, some useful extensions to deal with real data are discussed: (i) introduction of individual covariates, (ii) model formulation to account for multilevel data structures, and (iii) model formulation to deal with missing item responses. The estimation of these models may be accomplished through two specific R packages. Finally, some applications in the educational setting are illustrated.

A. Magrini: Linear regression with temporally delayed covariates (distributed-lag linear regression) is a standard approach to assess the impact of investments on economic outcomes through time. Typically, constraints on the lag shapes are required to meet domain knowledge. For instance, the effect of an investment may be small at first, then it may reach a peak before diminishing to zero after some time lags. Polynomial lag shapes with endpoint constraints are typically exploited to represent such a feature. A deeper analysis is directed towards the decomposition of the 'overall' impact into intermediate contributions and considers multiple outcomes. Linear Markovian models are proposed for this task, where a set of distributed-lag linear regressions are recursively defined according to causal assumptions of the problem domain. The theory and the software are presented, several real-world applications are illustrated, and open issues are discussed.
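The following sketch shows one way a polynomial lag shape with endpoint constraints can be fitted. It assumes a quadratic shape that vanishes one step before lag 0 and one step after the maximum lag, so that the lag coefficients are `theta * (l + 1) * (L + 1 - l)` and only the scale `theta` needs to be estimated. The function names, the single-covariate setting and the simulated data are illustrative assumptions, not the speaker's software, and the recursive (Markovian) structure is not covered.

```python
import numpy as np


def quadratic_lag_weights(max_lag: int) -> np.ndarray:
    """Quadratic lag shape that is zero at lag -1 and at lag max_lag + 1."""
    lags = np.arange(max_lag + 1)
    return (lags + 1.0) * (max_lag + 1.0 - lags)


def fit_constrained_dlag(y: np.ndarray, x: np.ndarray, max_lag: int):
    """Least-squares fit of y_t = a + sum_l beta_l * x_{t-l} + e_t under the
    constraint beta_l = theta * w_l; returns (intercept, beta vector)."""
    w = quadratic_lag_weights(max_lag)
    n = len(y)
    # The constraint collapses all lags into one covariate z_t = sum_l w_l * x_{t-l}.
    z = np.array([np.dot(w, x[t - max_lag:t + 1][::-1]) for t in range(max_lag, n)])
    design = np.column_stack([np.ones_like(z), z])
    coef, *_ = np.linalg.lstsq(design, y[max_lag:], rcond=None)
    intercept, theta = coef
    return intercept, theta * w


# Simulated example: the effect rises, peaks at the middle lag, then fades.
rng = np.random.default_rng(1)
max_lag = 6
x = rng.normal(size=500)
true_betas = 0.05 * quadratic_lag_weights(max_lag)
y = np.zeros_like(x)
for t in range(max_lag, len(x)):
    y[t] = 1.0 + np.dot(true_betas, x[t - max_lag:t + 1][::-1]) + 0.1 * rng.normal()

intercept, estimated_betas = fit_constrained_dlag(y, x, max_lag)
print(np.round(estimated_betas, 3))
```

Because the constraint leaves a single free parameter, the fit is an ordinary regression on one constructed covariate. Less restrictive polynomial shapes would add further constructed covariates in the same way.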

23/05/2019 at 15.00

### Couples' transition to parenthood in Finland: A tale of two recessions

### Chiara Comolli (University of Lausanne)

The question of how fluctuations in the business cycle and fertility are linked resurfaced in the aftermath of the Great Recession of 2008-09, when birth rates started declining in many countries. Finland, although affected to a much lesser extent than other regions of Europe, is no exception to this decline. However, previous macro-level research on the much stronger recession in Finland in the 1990s shows that, contrary to other developed countries, the typical pro-cyclical behavior of fertility in relation to the business cycle was absent. The objective of this paper is to test how a typical feature of both recessions at the individual level, labor market uncertainty, is linked to childbearing risk in Finland. In particular, I focus on the transition to first birth and on the explicit period comparison between the 1990s and the 2000s. I use Finnish population registers (1988-2013) and adopt a dyadic couple perspective to assess the association between each partner's employment status and the transition to parenthood. Finally, I investigate how this relationship changes with aggregate labor market conditions in each of the two periods, to test whether there was a change over time from counter- to pro-cyclicality of fertility in Finland.

13/05/2019 at 12.00

### Simple Structure Detection Through Bayesian Exploratory Multidimensional IRT Models

### Lara Fontanella (Università di Chieti-Pescara)

In modern validity theory, a major concern is the construct validity of a test, which is commonly assessed through confirmatory or exploratory factor analysis. In the framework of Bayesian exploratory Multidimensional Item Response Theory (MIRT) models, we discuss two methods aimed at investigating the underlying structure of a test, in order to verify whether the latent model adheres to a chosen simple factorial structure. This purpose is achieved without imposing hard constraints on the discrimination parameter matrix to address the rotational indeterminacy. The first approach prescribes a 2-step procedure: the parameter estimates are obtained through an unconstrained MCMC sampler, and the simple structure is then inspected with a post-processing step based on the Consensus Simple Target Rotation technique. In the second approach, both rotational invariance and simple structure retrieval are addressed within the MCMC sampling scheme, by introducing a sparsity-inducing prior on the discrimination parameters. Through simulation as well as real-world studies, we demonstrate that the proposed methods are able to correctly infer the underlying sparse structure and to retrieve interpretable solutions.

08/05/2019 at 14.00

### Incorporating information and assumptions to estimate validity of case identification in healthcare databases and to address bias from measurement error in observational studies: a research plan to su

### Rosa Gini (Agenzia Regionale di Sanità della Toscana)

Observational studies based on healthcare databases have become increasingly common. While methods to address confounding have been tailored to the nature of this data, methods to address bias from measurement error are less developed. However, it is acknowledged that study variables are not measured exactly in databases, since data is collected for purposes other than research, and the variables are measured based on recorded information only. Estimating indices of validity of a measurement, such as sensitivity, specificity, positive and negative predictive value, is commonly recommended, but rarely accomplished. In this talk we introduce a methodology for measuring variables, called component strategy, that has the potential to address this problem when multiple sources of data are available. We illustrate the examples of a chronic disease (type 2 diabetes), an acute disease (acute myocardial infarction) and an infectious disease (pertussis) measured in multiple European databases and we describe the effect of different measurements. We introduce some formulas that make it possible to obtain analytically some validity indices based on other validity indices and on observed frequencies. We sketch a research plan based on this strategy that aims at understanding when partial information on validity can be exploited to provide a full picture of measurement error, at quantifying the dependence on assumptions, at measuring associated uncertainty, and at addressing bias produced by quantified measurement error. We introduce the ConcePTION project, a 5-year project funded by the Innovative Medicines Initiative, that aims at building an ecosystem for better monitoring the safety of medicine use in pregnancy and breastfeeding, based on procedures and tools to transform existing data into actionable evidence, resulting in better and more timely information. The aim of the talk is to discuss the research plan and foster a collaboration that may benefit, in particular, the ConcePTION project.
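As an example of deriving validity indices from other indices and observed frequencies, the sketch below uses the standard textbook identities linking sensitivity, specificity, predictive values and prevalence. These are not necessarily the formulas presented in the talk, and the function names and numbers are hypothetical.

```python
import math


def ppv_from(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def npv_from(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Negative predictive value via Bayes' rule."""
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    return true_neg / (true_neg + false_neg)


def prevalence_from(observed_positive_fraction: float, ppv: float, sensitivity: float) -> float:
    """True prevalence from the observed fraction of database-identified cases.

    Uses the identity (true positives) = observed * PPV = prevalence * sensitivity.
    """
    return observed_positive_fraction * ppv / sensitivity


# Hypothetical numbers: sensitivity 0.80, specificity 0.95, true prevalence 0.10.
ppv = ppv_from(0.80, 0.95, 0.10)
npv = npv_from(0.80, 0.95, 0.10)
observed = 0.80 * 0.10 + 0.05 * 0.90  # fraction flagged as cases in the database
recovered_prevalence = prevalence_from(observed, ppv, 0.80)
print(ppv, npv, recovered_prevalence)
```

The last function illustrates the general point: when only the PPV has been validated, an external estimate of sensitivity is enough to correct the observed frequency of cases.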

04/04/2019 at 12.00

### Statistical Learning with High-dimensional Structurally Dependent Data

### Tapabrata Maiti (Michigan State University)

Rapid development of information technology is making it possible to collect massive amounts of multidimensional, multimodal data with high dimensionality in diverse fields of science and engineering. New statistical learning and data mining methods have been developed accordingly to solve challenging problems arising out of these complex systems. In this talk, we will discuss a specific type of statistical learning, namely the problem of feature selection and classification when the features are high dimensional and structured; specifically, they are spatio-temporal in nature. Various machine learning techniques are suitable for this type of problem, although the underlying statistical theories are not well established. We will discuss some recently developed techniques in the context of specific examples arising in neuroimaging studies.

18/03/2019 at 14.30

### Analysis and automatic detection of hate speech: from pre-teen Cyberbullying on WhatsApp to Islamophobic discourse on Twitter

### Rachele Sprugnoli

The widespread use of social media yields a huge number of interactions on the Web. Unfortunately, social media messages are often written to attack specific groups of users based on their religion, ethnicity or social status. Due to the massive rise of hateful, abusive, offensive messages, platforms such as Twitter and Facebook have been searching for solutions to tackle hate speech. As a consequence, the amount of research targeting the detection of hate speech, abusive language and cyberbullying has also increased. In this talk we will present two projects in the field of hate speech analysis and detection: CREEP, which aims at identifying and preventing the possible negative impacts of cyberbullying on young people, and HateMeter, whose goal is to increase the efficiency and effectiveness of NGOs in preventing and tackling Islamophobia at EU level. In particular, we will describe the language resources and technologies under development in the two projects and we will show two demos based on Natural Language Processing tools.

05/03/2019 at 12.00

### Using marginal models for structure learning

### Sung-Ho Kim (Dept of Mathematical Sciences, KAIST)

Structure learning for Bayesian networks is usually carried out heuristically, searching for an optimal model while avoiding an explosive computational burden. A structural error that occurs at one point of structure learning may degrade the subsequent learning. In the talk, a remedial approach to this error-for-error process will be introduced by using marginal model structures. The remedy consists of fixing local errors in the structure with reference to the marginal structures. In this sense, the remedy is called a marginally corrective procedure. A new score function is also introduced for the procedure, consisting of two components: the likelihood function of a model and a discrepancy measure in marginal structures. The marginally corrective procedure compares favorably with one of the most popular algorithms in experiments with benchmark data sets.

26/02/2019 at 12.30

### Evidence of bias in randomized clinical trials of hepatitis C interferon therapies

### Massimo Attanasio (Università degli Studi di Palermo)

Introduction: Bias may occur in randomized clinical trials in favor of the new experimental treatment because of unblinded assessment of subjective endpoints or wish bias. Using results from published trials, we analyzed and compared the treatment effect of hepatitis C antiviral interferon therapies when labeled experimental or control. Methods: Meta-regression of trials enrolling naive hepatitis C virus patients who underwent four therapies including interferon alone or plus ribavirin in past years. The outcome measure was the sustained response evaluated by transaminases and/or hepatitis C virus-RNA serum load. Data on the outcome across therapies were collected according to the assigned arm (experimental or control) and to other trial and patient-level characteristics. Results: The overall difference in efficacy between the same treatment labeled experimental or control had a mean of +11.9% (p < 0.0001). The unadjusted difference favored the experimental therapies of group IFN-1 (+6%) and group IFN-3 (+10%), while there was no difference for group IFN-2 because of success rates from large multinational trials. In a meta-regression model with trial-specific random effects including several trial and patient-level variables, treatment and arm type remained significant (p < 0.0001 and p = 0.0009 respectively) in addition to drug-schedule-related variables. Conclusion: Our study indicates that the same treatment is more effective when labeled “experimental” than when labeled “control” in a setting of trials using an objective endpoint and even after adjusting for patient and study-level characteristics. We discuss several factors related to the design and conduct of hepatitis C trials as potential explanations of the bias toward the experimental treatment.

14/02/2019 at 11.30 - Aula Anfiteatro

### A new aspect of Riordan arrays - II

### Gi-Sang Cheon (Department of Mathematics, Sungkyunkwan University (Korea))

Let $(S,\star)$ be a semigroup. A semigroup algebra ${\mathbb K}[S]$ of $S$ over a field ${\mathbb K}$ is the set of all linear combinations of finitely many elements of $S$ with coefficients in ${\mathbb K}$:

$${\mathbb K}[S]=\left\{\sum_{\alpha\in S}c_\alpha \alpha \;\middle|\; c_\alpha\in {\mathbb K}\right\}.$$

The ring of formal power series ${\mathbb K}[[t]]$ over a field ${\mathbb K}$, together with convolution, is an example of a semigroup. This suggests that Riordan arrays over ${\mathbb K}[[t]]$ can be generalized to a semigroup algebra ${\mathbb K}[S]$. Furthermore, by using the fact that a lattice is a partially ordered set and a semigroup, the notion of *semi-Riordan arrays* over a semigroup algebra will be introduced in connection with lattices and posets. Then we will see that a Riordan array is the semi-Riordan array over the semigroup algebra ${\mathbb K}[S]$ where $S=\{0,1,2,\ldots\}$ is a semigroup under the usual addition and is a totally ordered set.

14/02/2019 at 10.30 - Aula Anfiteatro

### A new aspect of Riordan arrays - I

### Gi-Sang Cheon (Department of Mathematics, Sungkyunkwan University (Korea))

Let $R[[t]]$ be the ring of formal power series over a ring $R$. A Riordan array $(g,f)$ is an infinite lower triangular matrix constructed out of two functions $g,f\in R[[t]]$ with $f(0) = 0$ in such a way that its $k$th column generating function is $gf^k$ for $k\ge0$. The set of all invertible Riordan arrays forms a group called the *Riordan group*. In many contexts we see that Riordan arrays are used as a machine to generate new approaches in combinatorics and its applications. Throughout this talk we will look at the Riordan group and Riordan arrays from several different points of view, e.g. group theory, combinatorics, graph theory, matrix theory, topology and Lie theory. In addition, we will see how Riordan arrays have been generalized and where they have been applied.
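The column-by-column definition above can be sketched directly with truncated power series. In the sketch below, series are represented as coefficient lists, and the helper names `series_mul` and `riordan_array` are hypothetical. The example uses the classical pair $g = 1/(1-t)$, $f = t/(1-t)$, whose Riordan array is Pascal's triangle.

```python
from math import comb


def series_mul(a, b, n):
    """Product of two power series (coefficient lists), truncated to n terms."""
    out = [0] * n
    for i, ai in enumerate(a[:n]):
        if ai == 0:
            continue
        for j, bj in enumerate(b[:n - i]):
            out[i + j] += ai * bj
    return out


def riordan_array(g, f, n):
    """Top-left n x n block of the Riordan array (g, f).

    Column k holds the first n coefficients of g * f**k; f must have
    zero constant term, which makes the matrix lower triangular.
    """
    if f[0] != 0:
        raise ValueError("f must satisfy f(0) = 0")
    column = list(g[:n]) + [0] * (n - len(g[:n]))
    rows = [[0] * n for _ in range(n)]
    for k in range(n):
        for i in range(n):
            rows[i][k] = column[i]
        column = series_mul(column, f, n)  # advance from g*f^k to g*f^(k+1)
    return rows


n = 6
g = [1] * n              # 1/(1-t) = 1 + t + t^2 + ...
f = [0] + [1] * (n - 1)  # t/(1-t) = t + t^2 + ...
pascal = riordan_array(g, f, n)
for row in pascal:
    print(row)
```

Each entry in row $i$ and column $k$ of this array equals the binomial coefficient $\binom{i}{k}$, which gives a quick sanity check on the construction.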

13/02/2019 at 14.00 - Aula Magna D6 - Polo delle Scienze Sociali

### Statistics, new empiricism and society in the era of Big Data

### Giuseppe Arbia (Univ. Cattolica Sacro Cuore Roma, Univ. Svizzera Italiana, College of William & Mary, Williamsburg)

Recent decades have seen a formidable explosion in the collection of data and in their dissemination and use across all sectors of human society. This phenomenon is due above all to the increased capacity to collect and store information automatically through very diverse sources such as sensors of various kinds, satellites, mobile phones, the internet, drones and many others. This is the phenomenon known as the "big data revolution". The aim of the seminar is to describe, in terms accessible to non-specialists as well, the big data phenomenon and its possible repercussions on everyday life. The consequences for Statistics, the art of knowing reality and making decisions on the basis of empirical, observational data, will also be discussed. Presentation of the book "Statistica, nuovo empirismo e società nell'era dei Big Data" by Giuseppe Arbia, Edizioni Nuova Cultura (2018).

01/02/2019 ore 12.00

### Cattuto: High-resolution social networks: measurement, modeling and applications

### Paolotti: It takes a village - how collaborations in data science for social good can make a difference

### Double seminar: Ciro Cattuto & Daniela Paolotti (ISI Foundation)

Ciro Cattuto: Digital technologies provide the opportunity to quantify specific human behaviors with unprecedented levels of detail and scale. Personal electronic devices and wearable sensors, in particular, can be used to map the network structure of human close-range interactions in a variety of settings relevant for research in computational social science, epidemiology and public health. This talk will review the experience of the SocioPatterns collaboration (www.sociopatterns.org), an international effort aimed at measuring and studying high-resolution human and animal social networks using wearable proximity sensors. I will discuss technology requirements and measurement experiences in diverse environments such as schools, hospitals and households, including recent work in low-resource rural settings in Africa. I will discuss the complex features found in empirical temporal networks and show how methods from network science and machine learning can be used to detect structures and to understand the role they play for dynamical processes, such as epidemics, occurring over the network. I will close with an overview of future research directions and applications.

Paolotti: The unprecedented opportunities provided by data science in all areas of human knowledge become even more evident when applied to the fields of social innovation, international development and humanitarian aid. Using social media data to study malnutrition and obesity in children in developing countries, using mobile phone digital traces to understand women's mobility for safety and security, harvesting search engine queries to study suicide among young people in India: these are only a few examples of how data science can be exploited to address many social problems and support global agencies and policymakers in implementing better and more impactful policies and interventions. Nevertheless, data scientists alone cannot be successful in this complex effort. Greater access to data, more collaboration between public and private sector entities, and an increased ability to analyze datasets are needed to tackle society's greatest challenges.
In this talk, we will cover examples of how actors from different entities can join forces around data and knowledge to create public value with an impact on global societal issues and set the path to accelerate the harnessing of data science for social good.

24/01/2019 ore 12.00

### CONJUGATE BAYES FOR PROBIT REGRESSION VIA UNIFIED SKEW-NORMALS

### Daniele Durante (Department of Decision Sciences, Bocconi University)

Regression models for dichotomous data are ubiquitous in statistics. Besides being useful for inference on binary responses, such methods are also fundamental building-blocks in more complex classification strategies covering, for example, Bayesian additive regression trees (BART). Within the Bayesian framework, inference proceeds by updating the priors for the coefficients, typically set to be Gaussians, with the likelihood induced by probit or logit regressions for the binary responses. In this updating, the apparent absence of a tractable posterior has motivated a variety of computational methods, including Markov Chain Monte Carlo (MCMC) routines and algorithms which approximate the posterior. Despite being routinely implemented, current MCMC strategies face mixing or time-inefficiency issues in large p and small n studies, whereas approximate routines fail to capture the skewness typically observed in the posterior. In this seminar, I will prove that the posterior distribution for the probit coefficients has a unified skew-normal kernel, under Gaussian priors. Such a novel result allows efficient Bayesian inference for a wide class of applications, especially in large p and small-to-moderate n studies where state-of-the-art computational methods face notable issues. These advances are outlined in a genetic study, and further motivate the development of a wider class of conjugate priors for probit models along with methods to obtain independent and identically distributed samples from the unified skew-normal posterior. Finally, these results are also generalized to improve classification via BARTs.

22/01/2019 ore 12.00

### Heterogeneity in dynamics of risk accumulation: the case of unemployment

### Raffaele Grotti (European University Institute)

The paper aims to contribute to the study of socioeconomic risks. It studies different mechanisms that account for the stickiness of the unemployment condition and, relatedly, for the longitudinal accumulation of unemployment experiences. In particular, the paper disentangles two mechanisms, ‘genuine state dependence’ dynamics and unobserved characteristics at the individual level, and accounts both for their relative weight and for their possible interplay in shaping the accumulation of unemployment risks over time. Dynamics of accumulation are investigated, showing their distribution among different workforce segments defined both in terms of observable and unobservable characteristics. This is done by applying correlated dynamic random-effects probit models and providing statistics for protracted unemployment exposure. The analysis makes use of EU-SILC data from 2003 to 2015 for four European countries (DK, FR, IT and UK). Empirical results indicate that both unobserved heterogeneity and genuine state dependence are relevant and partly independent factors in explaining the reiteration and accumulation of unemployment and long-term unemployment risks over time. The analysis shows how the weight of these two components varies at both macro and micro level, according to different labour market and institutional settings and depending on individual endowments. Finally, the paper discusses how the distinction between unobserved heterogeneity and genuine state dependence and the evaluation of their possible interplay can provide useful insights with respect to theories of cumulative advantages and with respect to an efficient design of policy measures aimed at counteracting the accumulation of occupational penalties over time.

16/01/2019 ore 14.30

### Lupparelli: On log-mean linear regression graph models

### Bocci: Statistical modelling of spatial data: some results and developments

### Welcome Seminars: Monia Lupparelli & Chiara Bocci

Lupparelli: This talk aims to illustrate the log-mean linear parameterization for multivariate Bernoulli distributions, which represents the counterpart for marginal modelling of the well-established log-linear parameterization. In fact, the log-mean transformation is defined by the same log-linear mapping applied to the space of the mean parameter (the marginal probability vector) rather than to the simplex (the space of the joint probability vector). The class of log-mean linear models, under suitable zero constraints, corresponds to the class of discrete bi-directed graph models, just as the class of log-linear models is used to specify discrete undirected graph models. Moreover, the log-mean linear transformation provides a novel link function in multivariate regression settings with discrete response variables. The resulting class of log-mean linear regression models is used for modelling regression graphs via a sequence of marginal regressions where the coefficients are linear functions of log-relative risk parameters. The class of models will be illustrated in two different contexts: (i) for assessing the effect of HIV-infection on multimorbidity and (ii) to derive the relationship between marginal and conditional relative risk parameters in regression settings with multiple intermediate variables.

Bocci: TBA

19/12/2018 ore 10.00

### Flexible and Sparse Bayesian Model-Based Clustering

### Bettina Grün (Johannes Kepler University, Linz, Austria)

Finite mixtures of multivariate normal distributions constitute a standard tool for clustering multivariate observations. However, selecting a suitable number of clusters, identifying cluster-relevant variables and accounting for non-normal shapes of the clusters are still challenging issues in applications. Within a Bayesian framework we indicate how suitable prior choices can help to solve these issues. We achieve this by considering only prior distributions that are conditionally conjugate or can be reformulated as hierarchical priors, thus allowing for simple estimation using MCMC methods with data augmentation.

18/12/2018 ore 11.30

### Bayesian Structure Learning in Multi-layered Genomic Networks

### Min Jin Ha (UT MD Anderson Cancer Center)

Integrative network modeling of data arising from multiple genomic platforms provides insight into the holistic picture of the interactive system, as well as the flow of information across many disease domains. The basic data structure consists of a sequence of hierarchically ordered datasets for each individual subject, which facilitates integration of diverse inputs, such as genomic, transcriptomic, and proteomic data. A primary analytical task in such contexts is to model the layered architecture of networks where the vertices can be naturally partitioned into ordered layers, dictated by multiple platforms, and exhibit both undirected and directed relationships. We propose a multi-layered Gaussian graphical model (mlGGM) to investigate conditional independence structures in such multi-level genomic networks. We use a Bayesian node-wise selection (BANS) framework that coherently accounts for the multiple types of dependencies in mlGGM, and using variable selection strategies, allows for flexible modeling, sparsity, and incorporation of edge-specific prior knowledge. Through simulated data generated under various scenarios, we demonstrate that BANS outperforms other existing multivariate regression-based methodologies. We apply our method to estimate integrative genomic networks for key signaling pathways across multiple cancer types, find commonalities and differences in their multi-layered network structures, and show translational utilities of these integrative networks.

04/12/2018 ore 12.00

### "Exit this way": Persistent Gender & Race Differences in Pathways Out of In-Work Poverty in the US

### Emanuela Struffolino (WZB - Berlin Social Science Center)

We analyze the differences by gender and race in long-term pathways out of in-work poverty. Such differences are understood as "pathway gaps", analogous with gender and racial income gaps studied in labor-market economics and sociology. We combine data from three high-quality data sources (NLSY79, NLSY97, PSID) and apply sequence analysis multistate models to 1) empirically identify pathways out of in-work poverty, 2) estimate the associations between gender and race with each distinct pathway, and 3) attempt to account for these gender and race differences. We identify five different pathways out of in-work poverty. While men and non-Hispanic whites are most likely to experience successful long-term transitions out of poverty within the labor market, women and African Americans are more likely to only temporarily exit in-work poverty, commonly by exiting the labor market. These "pathway gaps" persist even after controlling for selection into in-work poverty, educational attainment, and family demographic behavior.

21/11/2018 ore 12.00

### Estimating Causal Effects On Social Networks

### Laura Forastiere (Yale Institute for Network Science - Yale University)

In most real-world systems units are interconnected and can be represented as networks consisting of nodes and edges. For instance, in social systems individuals can have social ties, family or financial relationships. In settings where some units are exposed to a treatment and its effects spill over to connected units, estimating both the direct effect of the treatment and spillover effects presents several challenges. First, assumptions on the way and the extent to which spillover effects occur along the observed network are required. Second, in observational studies, where the treatment assignment is not under the control of the investigator, confounding and homophily are potential threats to the identification and estimation of causal effects on networks. Here, we make two structural assumptions: i) neighborhood interference, which assumes interference to operate only through a function of the immediate neighbors’ treatments, ii) unconfoundedness of the individual and neighborhood treatment, which rules out the presence of unmeasured confounding variables, including those driving homophily. Under these assumptions we develop a new covariate-adjustment estimator for treatment and spillover effects in observational studies on networks. Estimation is based on a generalized propensity score that balances individual and neighborhood covariates across units under different levels of individual treatment and of exposure to neighbors’ treatment. Adjustment for the propensity score is performed using a penalized spline regression. Inference capitalizes on a three-step Bayesian procedure which allows taking into account the uncertainty in the propensity score estimation and avoiding model feedback. Finally, correlation of interacting units is taken into account using a community detection algorithm and incorporating random effects in the outcome model.
All these sources of variability, including variability of treatment assignment, are accounted for in the posterior distribution of finite-sample causal estimands. This is joint work with Edo Airoldi, Albert Wu and Fabrizia Mealli.

11/10/2018 ore 12.00

### Time-varying survivor average causal effects with semicompeting risks

### Leah Comment (Department of Biostatistics, Harvard T.H. Chan School of Public Health)

In semicompeting risks problems, non-terminal time-to-event outcomes such as time to hospital readmission are subject to truncation by death. These settings are often modeled with parametric illness-death models, but evaluating causal treatment effects with hazard models is problematic due to the evolution of incompatible risk sets over time. To combat this problem, we introduce two new causal estimands: the time-varying survivor average causal effect (TV-SACE) and the restricted mean survivor average causal effect (RM-SACE). These principal stratum causal effects are defined among units that would survive regardless of assigned treatment. We adopt a Bayesian estimation procedure that is anchored to parameterization of illness-death models for both treatment arms but maintains causal interpretability. We outline a frailty specification that can accommodate within-person correlation between non-terminal and terminal event times, and we discuss potential avenues for adding model flexibility. This research is joint work with Fabrizia Mealli, Corwin Zigler, and Sebastien Haneuse.

01/10/2018 ore 12.30

### Estimation of Multivariate Factor Stochastic Volatility Models by Efficient Method of Moments

### Christian Muecher (University of Konstanz)

We introduce a frequentist procedure to estimate multivariate factor stochastic volatility models. The estimation is done in two steps. First, the factor loadings, idiosyncratic variances and unconditional factor variances are estimated by approximating the dynamic factor model with a static one. Second, we apply the Efficient Method of Moments with GARCH(1,1) as an auxiliary model to estimate the stochastic volatility parameters governing the dynamic latent factors and idiosyncratic noises. Based on various simulations, we show that our procedure outperforms existing approaches in terms of accuracy and efficiency and it has clear computational advantages over the existing Bayesian methods.

09/07/2018 ore 15.00

### Graph Algorithms for Data Analysis

### Andrea Marino (Università di Pisa)

Real world data can be very often modelled with networks whose aim is to represent relationships among real world entities. In this talk we will discuss efficient algorithmic tools for the analysis of big real world networks. We will overview some algorithms for the analysis of huge graphs focused on data gathering, degrees of separation computation, centrality measures, community discovery, and novel similarity measures among entities.

18/06/2018 ore 12.00

### Data-driven transformations in small area estimation: An application with the R-package emdi

### Timo Schmid (Freie Universität Berlin)

Small area models typically depend on the validity of model assumptions. For example, a commonly used version of the Empirical Best Predictor relies on the Gaussian assumptions of the error terms of the linear mixed regression model, a feature rarely observed in applications with real data. The present paper proposes to tackle the potential lack of validity of the model assumptions by using data-driven scaled transformations as opposed to ad-hoc chosen transformations. Different types of transformations are explored, the estimation of the transformation parameters is studied in detail under the linear mixed regression model and transformations are used in small area prediction of linear and non-linear parameters. Mean squared error estimation that accounts for the uncertainty due to the estimation of the transformation parameters is explored using bootstrap approaches. The proposed methods are illustrated using real survey and census data for estimating income deprivation parameters for municipalities in Mexico with the R-package emdi. The package enables the estimation of regionally disaggregated indicators using small area estimation methods and includes tools for (a) customized parallel computing, (b) model diagnostic analyses, (c) creating high quality maps and (d) exporting the results to Excel and OpenDocument Spreadsheets. Simulation studies and the results from the application show that using carefully selected, data-driven transformations can improve small area estimation.

14/06/2018 ore 12.00

### A brief history of linear quantile mixed models and recent developments in nonlinear and additive regression

### Marco Geraci (University of South Carolina)

What follows is a story that began about sixteen years ago in Viale Morgagni (with some of the events taking place in a cottage of the Montalve’s estate). In this talk, I will retrace the steps that led me to develop linear quantile mixed models (LQMMs). These models have found application in public health, preventive medicine, virology, genetics, anesthesiology, immunology, ophthalmology, orthodontics, cardiology, pharmacology, biochemistry, biology, marine biology, environmental, climate and marine sciences, psychology, criminology, gerontology, economics and finance, linguistics and lexicography. Supported by a grant from the National Institute of Child Health and Human Development, I recently extended LQMMs to nonlinear and additive regression. I will present models, estimation algorithms and software, along with a few applications.

18/05/2018 ore 14.30

### Transcompiling and Analysing Firewalls

### Letterio Galletta (IMT Lucca)

Configuring and maintaining a firewall configuration is notoriously hard. On the one hand, network administrators have to know in detail the policy meaning, as well as the internals of the firewall systems and of their languages. On the other hand, policies are written in low-level, platform-specific languages where firewall rules are inspected and enforced along non-trivial control flow paths. Further difficulties arise from Network Address Translation (NAT), an indispensable mechanism in IPv4 networking for performing port redirection and translation of addresses. In this talk, we present a transcompilation pipeline that helps system administrators reason about policies, port a configuration from one system to another and perform refactoring, e.g., removing useless or redundant rules. Our pipeline and its correctness are based on IFCL, a generic configuration language equipped with a formal semantics. Relying on this language we decompile a real firewall configuration into an abstract specification, which exposes the meaning of the configuration and enables us to carry out analysis and recompilation.

18/05/2018 ore 12.00

### Empirical Bayes Estimation of Species Distribution with Overdispersed Data

### Fabio Divino (Dipartimento di Bioscienze e Territorio, Università del Molise)

The estimation of species distributions is fundamental in the assessment of biodiversity and in the monitoring of environmental conditions. In this work we present preliminary results in Bayesian inference of multivariate discrete distributions with applications in biomonitoring. In particular, we consider the problem of the estimation of species distributions when collected data are affected by overdispersion. Ecologists often have to deal with data which exhibit a variability that differs from what they expect on the basis of the model assumed to be valid. The phenomenon is known as overdispersion if the observed variability exceeds the expected variability or underdispersion if it is lower than expected. Such differences between observed and expected variation in the data can be interpreted as failures of some of the basic hypotheses of the model. The problem is very common when dealing with count data, for which the variability is directly connected with the magnitude of the phenomenon. Overdispersion is more common than underdispersion and can arise from several causes. Among them, the most important and relevant in ecology is overdispersion due to heterogeneity of the population with respect to the assumed model. An interesting approach to account for overdispersion due to heterogeneity is the method based on compound models. The idea of compound models originated in the 1920s and concerns the possibility of mixing a model of interest with a mixing probability measure. The mixture resulting from the integration generates a new model that can account for larger variation than that allowed by the reference model. A typical application of this approach is the well-known Gamma-Poisson compound model, which generates the Negative Binomial model. In this work we present the use of a double compound model, a multivariate Poisson distribution combined with Gamma and Dirichlet models, in order to account for the presence of overdispersion.
Some simulation results will compare the Gamma-Dirichlet-Poisson model with the reference Multinomial model. Further, some applications in biomonitoring of aquatic environments are presented. Acknowledgement: this work is part of joint research with Salme Karkkainen, Johanna Arje and Antti Penttinen (University of Jyväskylä) and Kristian Meissner (SYKE, Finland).
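The Gamma-Poisson construction mentioned above can be written out explicitly. The following is a standard textbook derivation given only as an illustration; the symbols $\alpha$ (shape) and $\beta$ (rate) are notation assumed here and do not come from the abstract. If $Y\mid\lambda\sim\mathrm{Poisson}(\lambda)$ and $\lambda\sim\mathrm{Gamma}(\alpha,\beta)$, integrating out $\lambda$ gives

$$
P(Y=y)=\int_0^\infty \frac{\lambda^{y}e^{-\lambda}}{y!}\,\frac{\beta^{\alpha}}{\Gamma(\alpha)}\,\lambda^{\alpha-1}e^{-\beta\lambda}\,d\lambda
=\frac{\Gamma(y+\alpha)}{y!\,\Gamma(\alpha)}\left(\frac{\beta}{1+\beta}\right)^{\alpha}\left(\frac{1}{1+\beta}\right)^{y},\qquad y=0,1,2,\dots
$$

which is a Negative Binomial distribution. Its moments are

$$
E[Y]=\frac{\alpha}{\beta},\qquad \mathrm{Var}[Y]=E[\lambda]+\mathrm{Var}[\lambda]=\frac{\alpha}{\beta}+\frac{\alpha}{\beta^{2}}>E[Y],
$$

so the compound model always has a variance larger than its mean, whereas the Poisson reference model forces the two to be equal.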

11/05/2018 ore 14.30

### From sentiment to superdiversity on Twitter

### Alina Sirbu (Dip. Informatica, Università di Pisa)

Superdiversity refers to large cultural differences in a population due to immigration. In this talk we introduce a superdiversity index based on Twitter data and lexicon based sentiment analysis, using ideas from epidemic spreading and opinion dynamics models. We show how our index correlates with official immigration statistics available from the European Commission’s Joint Research Center, and we compare it with various other measures computed from the same Twitter data. We argue that our index has predictive power in regions where exact data on immigration is not available, paving the way for a nowcasting model of immigration.

18/04/2018 ore 12.00

### Convolution Autoregressive Processes and Excess Volatility

### Umberto Cherubini (University of Bologna)

We discuss the economic model and the econometric properties of the Convolution Autoregressive Process of order 1 (C-AR(1)), with a focus on the simplest Gaussian case. This is a first-order autoregressive process in which the innovations depend on the lagged value of the process. We show that the model may be generated by the presence of extrapolative bias in expectations. Extrapolative expectations bring about excess volatility and excess persistence in the dynamics of the variable. While excess volatility cannot be identified if one only observes the time series of the variable, identification can be achieved if the expectations of the variable are observed in the forward markets. We show that the model is well suited to generate the excess variance of long maturity prices documented in the Giglio-Kelly variance ratio test. We finally discuss possible extensions of the model beyond the Gaussian case, both by changing the specification of the expectations model and by setting a non-linear data generating process for the fundamental process.

16/03/2018 ore 12.45 - Aula Magna 327- Polo Morgagni (V.le Morgagni 40)

### Heterogeneous federated data center for research: from design to operation

### Sergio Rabellino (University of Torino)

Bringing a large datacenter from design to operation is complex anywhere in the world. In Italy it is a challenge that starts with playing a game of snakes and ladders against bureaucracy and colleagues to shape the European tenders, the government board, the price list, the access policy and eventually all the software needed to operate the datacenter. The talk reviews all these aspects, drawing on the experience of designing the University of Torino Competency Center in Scientific Computing, serving over 20 departments with its 1M€ OCCAM platform, and of writing the HPC4AI project charter, recently funded with 4.5M€ by the Piedmont Region and serving over 10 departments in two universities (University and Technical University of Turin).

16/03/2018 ore 12.00 - Aula Magna 327- Polo Morgagni (V.le Morgagni 40)

### Designing a heterogeneous federated data center for research

### Marco Aldinucci (Computer Science Department, University of Torino)

Advances in high-speed networks and virtualization techniques make it possible to leverage economies of scale and consolidate all servers of a large organisation in a few powerful, energy-efficient data centers, which also work in synergy with the public cloud. In this respect, research organisations such as universities exhibit distinguishing features of both a technical and a sociological nature. Firstly, there is hardly any compute workload or resource usage pattern that is dominant across many different disciplines and departments. This makes the design of the datacenter and of the access policy so delicate that it requires addressing open research questions. Secondly, scientists are generally inclined to complain about anything short of total freedom to do what they urgently want to do, including strange, greedy and dangerous behaviours for the data and the systems themselves.

22/02/2018 ore 12.00

### Job Instability and Fertility during the Economic Recession: EU Countries

### Isabella Giorgetti (Dep. of Economics and Social Sciences, Università Politecnica delle Marche)

The declining trends in the total fertility rate (TFR) varied widely across EU countries. Exploiting individual data from the longitudinal EU-SILC dataset (2005-2013), this study investigates the cross-country effect of job instability on the couple’s choice of having one (more) child. I build a job instability measure for both partners using the first-order lag of each partner's economic activity status in the labour market (holding a temporary or permanent contract, or being unemployed). In order to account for unobserved heterogeneity and the potential presence of endogeneity, I estimate, under sequential moment restrictions, a Two-Stage Least Squares (2SLS) model in first differences. I then group European countries according to six different welfare regimes and estimate the heterogeneous effects of instability in the labour market on childbearing in a comparative framework. The principal result is that the cross-country average effect of job instability on couples’ fertility decisions is not statistically relevant, owing to the large country-specific fixed effects. Distinguishing between welfare regimes, institutional settings and active social policies reveals varying family behaviour in fertility. In low-fertility countries, however, it is confirmed that the impact of parents’ successful labour market integration might be ambiguous, and this might be due to the scarcity of child care options and/or cultural norms.

15/02/2018 ore 11.30

### Data Science and Our Environment

### Francesca Dominici (Harvard T.H. Chan School of Public Health)

What if I told you I had evidence of a serious threat to American national security--a terrorist attack in which a jumbo jet will be hijacked and crashed every 12 days? Thousands will continue to die unless we act now. This is the question before us today--but the threat doesn’t come from terrorists. The threat comes from climate change and air pollution. We have developed an artificial neural network model that uses on-the-ground air-monitoring data and satellite-based measurements to estimate daily pollution levels across the continental U.S., breaking the country up into 1-square-kilometer zones. We have paired that information with health data contained in Medicare claims records from the last 12 years, covering 97% of the population aged 65 or older. We have developed statistical methods and computationally efficient algorithms for the analysis of over 460 million health records. Our research shows that short- and long-term exposure to air pollution is killing thousands of senior citizens each year. This data science platform is telling us that federal limits on the nation’s most widespread air pollutants are not stringent enough. This type of data is the sign of a new era for the role of data science in public health, and also for the associated methodological challenges. For example, with enormous amounts of data, the threat of unmeasured confounding bias is amplified, and causality is even harder to assess with observational studies. These and other challenges will be discussed.

17/01/2018 ore 15.30

### Introductory seminar: main lines of research and results obtained

### Francesca Giambona (DiSIA)

The seminar will present the main lines of research and the main topics analysed and investigated over the speaker's scientific career. In particular, the main statistical methodologies used to analyse the phenomena under study will be described and the most important empirical results obtained will be presented. Finally, the research topics currently under study will be briefly introduced.

17/01/2018 ore 14.30

### Latent variable modelling for dependent observations: some developments

### M. Francesca Marino (DiSIA)

When dealing with dependent data, standard statistical methods cannot be directly used for the analysis as they may produce biased results and lead to misleading inferential conclusions. In this framework, latent variables are frequently used as a tool for capturing dependence and describing association. During the seminar, some developments in the context of latent variable modelling will be presented. Three main areas of research will be covered, namely quantile regression, social network analysis, and small area estimation. The main results will be presented and some hints on current research and future developments will be given.

12/01/2018 ore 12.00

### SAM Based Analysis of the Impact of VAT Rate Cut on the Economic System of China after the Tax Reform

### Ma Kewei (Shanxi University of Finance and Economics)

China completed its tax reform in May 2016, moving from a bifurcated system based on business tax (BT) and Value Added Tax (VAT) to an entirely VAT-based system. The effects of this reform are still under analysis, regarding both the alleviation of the tax burden on the industry and service sectors and the economic improvements. Currently, there seems to be general agreement that, except for some minor cases worth further investigation, these effects are positive in both respects. It is interesting to analyze the impact that a reduction of the VAT rate on some key industries would have in a VAT-based regime. To our knowledge, no analysis in this direction has been performed so far. In this paper, we try to fill this gap and analyze the effect of VAT rate cuts in selected industries on the whole economic system, using an impact multiplier model based on a purposely constructed 2015 Social Accounting Matrix (SAM) for China, which also allows a preliminary analysis of the structure of China’s economic system. (Joint work with Guido Ferrari and Zichuan Mi)

Last updated 16 September 2019.