Seminars - Abstracts
Living Standards and Fertility in Indonesia: A Bayesian Analysis
We investigate the relationship between living standards and fertility, using a three-wave panel dataset from Indonesia to provide information on women's fertility histories and the levels of consumption expenditure in the households to which they belong. We adopt a Bayesian approach to estimation and exploit the dynamically recursive structure implied by gestation lags to identify causal effects of living standards on fertility and vice versa.
A new Data Mining approach for impact evaluation dealing with selection bias
This paper presents an algorithmic approach to the selection bias problem. Selection bias may be the most vexing problem in program evaluation, or in any line of research that attempts to assert causality. The main problem of causal inference is essentially one of missing data: in order to know whether some variable causes a change in another variable for some unit, it is necessary to observe that unit in both its treated and untreated states. This observation is never possible. In other words, the missing data are the counterfactual outcomes, defined as what would have happened in the absence of the intervention and vice versa. Researchers have taken various approaches to resolving the missing counterfactual problem; the most widely applied is the potential outcome approach, pioneered principally by Rubin (1974; 1978), which attempts to address such selection bias via the propensity score (PS) method. The PS is computed as a function of a set of covariates potentially related to the selection process. In the literature, the PS is commonly used operationally as a one-dimensional summary. In a classical binning procedure, for example, treated and control units with similar propensity scores are compared, and when the balancing property holds an unbiased estimate of the treatment effect can be obtained. However, it is not clear by which fitting criterion the best model for the PS should be chosen, and when some fitting criterion is maximized, so that a good model is found, the balancing property cannot be tested. Aiming at eliminating the PS tautology (Ho et al., 2007), our strategy explicitly excludes any analysis that requires access to outcome data or to a model for the selection mechanism. The fundamental belief underlying this paper is that any researcher influence affects the results, so that multiple solutions arise simply by virtue of the researcher's choice of model.
Large variation in estimates across choices of control variables, functional forms and other modeling assumptions cannot ensure objective results. In brief, the underlying paradigm is that the problem at hand should define the approach. Taking an automatic algorithmic approach and capitalizing on the known treatment-associated variance in the X matrix - no outcome in sight - we propose a data transformation that allows unbiased treatment effects to be estimated. The approach involves the construction of a multidimensional de-conditioned space in which the bias associated with treatment assignment has been eliminated. The missing counterfactual can then be computed, given that treated and control units behave as if they arose from the same population: no difference due to selection into treatment exists anymore. The proposed approach does not call for modeling the data on the basis of some underlying theory or assumption about the selection process; instead it calls for using the existing variability within the data and letting the data speak. Specifically, it is a two-stage procedure: the original pre-treatment variables are transformed using specific eigenvalues and eigenvectors to derive a factorial de-conditioned space, and the counterfactual is then computed using the de-conditioned variables obtained in the first stage as input.
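The two-stage procedure described above can be illustrated in miniature. This is a hedged sketch, not the authors' algorithm: a plain PCA-style whitening of the pre-treatment covariates stands in for their specific eigenvalue/eigenvector transformation, the data are simulated, and the nearest-neighbour step is only one conceivable way to compute the counterfactual in the de-conditioned space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-treatment covariates: treated units are shifted,
# so selection into treatment is associated with X (no outcomes used).
n, p = 500, 3
X_c = rng.normal(0.0, 1.0, (n, p))          # controls
X_t = rng.normal(0.5, 1.0, (n, p))          # treated (selection bias)
X = np.vstack([X_c, X_t])

# Stage 1: eigen-decompose the covariance of X and whiten, a stand-in
# for the paper's factorial "de-conditioned" space.
Xc = X - X.mean(axis=0)
evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
Z = Xc @ evecs / np.sqrt(evals)             # de-conditioned coordinates

# Stage 2: impute each treated unit's counterfactual from its nearest
# control in the de-conditioned space (one conceivable rule).
Z_c, Z_t = Z[:n], Z[n:]
nn = np.argmin(((Z_t[:, None, :] - Z_c[None, :, :]) ** 2).sum(-1), axis=1)

# In the whitened space each coordinate has unit sample variance.
print(np.allclose(Z.std(axis=0, ddof=1), 1.0, atol=1e-6))
```

Note that no outcome variable appears anywhere in the sketch, in keeping with the "no outcome in sight" principle stated above.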
The non-rejection rate for structural learning of gene transcriptional networks from E.coli microarray data
Structural learning of transcriptional regulatory networks in silico using microarray data is an important and challenging problem in bioinformatics. Several solutions to this problem have been proposed, from both a statistical and a machine learning approach. Statistical approaches typically rely on graphical models: they assume that the available data constitute a random sample from a multivariate distribution and aim at identifying a network where missing edges are interpreted as conditional independence relationships between genes. Machine learning approaches also usually describe their results by means of a network, but in fact the primary aim of such procedures is the identification of (some) transcriptional regulatory interactions with high confidence. Empirical evidence shows that in order to achieve this it is convenient to apply such procedures to a compendium of microarray experiments. Within a statistical approach, Castelo and Roverato (2006) proposed a procedure to learn a Gaussian graphical model from data. Here we show how this procedure can be extended to a meta-analysis context where the available data come from a compendium of different microarray experiments. This is joint work with Robert Castelo (Pompeu Fabra University, Spain).
A new approach to the measure of concentration: ABC (Area, Barycentre and Concentration)
Gustavo De Santis
Gini's index of concentration may be viewed from a different, and simpler, angle by considering where the barycentre falls in an ordered, but not cumulated, distribution of the quantitative variable possessed (on the y axis) among its owners (on the x axis - the poorest on the left, the richest on the right). The abscissa of the barycentre (relative to its maximum and minimum) provides a measure of concentration that coincides with Gini's G. Several empirical applications and a few theoretical considerations show that the ABC approach performs at least as well as - and sometimes better than - the traditional version of Gini's index.
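The barycentre reading of Gini's index lends itself to a quick numerical check. The sketch below is illustrative (function names are mine, not the paper's): it computes the normalized barycentre abscissa of the ordered, non-cumulated distribution and compares it with the classical mean-difference form of Gini's index, to which it is proportional up to the small-sample factor n/(n-1).

```python
import numpy as np

def abc_concentration(y):
    """Normalized barycentre abscissa of the ordered (not cumulated)
    distribution: 0 when everyone owns the same, 1 when one unit owns all."""
    y = np.sort(np.asarray(y, dtype=float))
    n = len(y)
    ranks = np.arange(1, n + 1)                 # poorest .. richest
    bary = (ranks * y).sum() / y.sum()          # barycentre abscissa
    # rescale between its minimum (n+1)/2 (equality) and its maximum n
    return (bary - (n + 1) / 2) / (n - (n + 1) / 2)

def gini_mean_difference(y):
    """Classical Gini index via the mean absolute difference."""
    y = np.asarray(y, dtype=float)
    mad = np.abs(y[:, None] - y[None, :]).mean()
    return mad / (2 * y.mean())

y = np.array([1.0, 2.0, 4.0, 8.0, 15.0])
n = len(y)
# The normalized barycentre equals the small-sample-corrected Gini G*n/(n-1).
print(abc_concentration(y), gini_mean_difference(y) * n / (n - 1))
```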
Size, Innovation and Internationalization: A Survival Analysis of Italian Firms
Margherita Velucchi (Università di Firenze)
Firms’ survival is often seen as crucial for economic growth and competitiveness. This paper focuses on the business demography of Italian firms, using an original database obtained by matching and merging three firm-level datasets (ICE-Reprint, Capitalia, AIDA) and taking their intersection. This database allows us to simultaneously consider the effects of size, technology, trade, foreign direct investment, and innovation on firms’ survival probability. We use a semiparametric Cox model to show that size and technological level positively affect the likelihood of survival. Internationalized firms show a higher failure risk: on average, competition is stronger in international markets, forcing firms to be more efficient. However, large internationalized firms are more likely to ‘survive’. To be successful and to survive, an internationalized Italian firm should be high-tech, large and innovative. (joint work with G. Giovannetti and G. Ricchiuti)
Matching for Causal Inference Without Balance Checking (joint with G. King and G. Porro)
We address a major discrepancy in matching methods for causal inference in observational data. Since these data are typically plentiful, the goal of matching is to reduce bias and only secondarily to keep variance low. However, most matching methods seem designed for the opposite problem, guaranteeing sample size ex ante but limiting bias by controlling for covariates through reductions in the imbalance between treated and control groups only ex post and only sometimes. (The resulting practical difficulty may explain why many published applications do not check whether imbalance was reduced and so may not even be decreasing bias.) We introduce a new class of "Monotonic Imbalance Bounding" (MIB) matching methods that enables one to choose a fixed level of maximum imbalance, or to reduce maximum imbalance for one variable without changing it for the others. We then discuss a specific MIB method called "Coarsened Exact Matching" (CEM) which, unlike most existing approaches, also explicitly bounds through ex ante user choice both the degree of model dependence and the causal effect estimation error, eliminates the need for a separate procedure to restrict data to common support, meets the congruence principle, is approximately invariant to measurement error, works well with modern methods of imputation for missing data, is computationally efficient even with massive data sets, and is easy to understand and use. This method can improve causal inferences in a wide range of applications, and may be preferred for simplicity of use even when it is possible to design superior methods for particular problems. We also make available open-source software for R and Stata which implements all our suggestions.
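The coarsening-and-matching idea behind CEM can be conveyed with a toy sketch. This is my own minimal illustration, not the authors' R or Stata software: covariates are coarsened into user-chosen bins ex ante, units are exactly matched on the coarsened signature, and strata lacking either treated or control units are pruned.

```python
from collections import defaultdict
import numpy as np

rng = np.random.default_rng(1)

# Toy data: one continuous and one binary covariate, biased assignment.
n = 200
age = rng.uniform(20, 60, n)
female = rng.integers(0, 2, n)
treat = (rng.random(n) < 0.3 + 0.01 * (age - 40).clip(0)).astype(int)

# Step 1: coarsen each covariate ex ante (the bin widths are the user's
# choice and directly bound the remaining imbalance within a stratum).
age_bin = np.digitize(age, bins=[30, 40, 50])

# Step 2: exact match on the coarsened signature; prune strata lacking
# either treated or control units (this restricts to common support).
strata = defaultdict(lambda: {0: [], 1: []})
for i in range(n):
    strata[(age_bin[i], female[i])][treat[i]].append(i)

matched = {k: v for k, v in strata.items() if v[0] and v[1]}
kept = sum(len(v[0]) + len(v[1]) for v in matched.values())
print(f"{len(matched)} matched strata, {kept} of {n} units retained")
```

The sample size that survives pruning is an output, not an input, which is the trade-off the abstract contrasts with methods that fix sample size ex ante.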
The role of women and the use of the water resource in the wadi Laou valley (Morocco): a case study from the WADI project
Lucia Fanini (Università di Firenze)
The Euro-Mediterranean WADI project (2006-2008) aimed to build scenarios for the sustainable human use of ecological resources. The variety and complexity of the Mediterranean environment were approached through study sites, so as to start from a local base and arrive at scenarios that incorporated the contribution of real communities from the outset. Water-related conflicts were the theme common to the selected study sites, with problems concerning quality and management choices rather than the quantity of this resource. In rural areas the problem is particularly relevant, since part of the population depends directly on environmental goods for its survival and is therefore more exposed than others to the effects of management decisions. This study was carried out in an area that was historically extremely isolated and is now in rapid transition, particularly as regards the pressure toward tourism development and the growth of infrastructure. The wadi Laou valley is rich in springs and hosts a traditional irrigation system (saquìa), but the management of the water resource is passing from the traditional village authorities to public bodies. In this context, gender analysis is necessary to obtain information on the female part of the population which, although it may own land, is not represented at any level (not even that of traditional village authority) yet carries out much of the domestic and agricultural work. To take into account the SEAGA methodologies (proposed by FAO for gender analysis at the field-community level), but above all to fit the actual situation, the choice of the sample and the survey methodology were non-probabilistic, involving inhabitants of rural and urban areas of the valley.
Fifty-two households were involved in the analysis; in each, the man and the woman holding the greatest power were interviewed simultaneously. The questionnaire contained four general sections: socioeconomic data on the household, access to basic services, division of labour (domestic and external), and perception of problems and representation at the decision-making level. The data collected were analysed descriptively by means of cluster analysis and nonmetric Multi-Dimensional Scaling, considering socioeconomic conditions and access to basic services at the household level, and perception of problems and the possibility of being represented in power groups at the gender level. The results highlighted the dynamism of the context with regard to access to services (increases in schooling and in connection to the water network), as well as the emergence of specific problems (the lack of a sewerage network even in the most recent urban areas, and the payment now required for the use of water, which was previously free), with a main difference between the single "historical" urban area and the rest of the valley. As for the analysis of the perception of problems, differences linked to the area of residence emerged only within the female part of the sample, while the male part turned out to be little differentiated. Applied to the wadi Laou valley case study, this type of analysis, besides complementing the formulation of future scenarios as the project intended, raised some questions about the practical application of gender-mainstreaming methodologies and highlighted the need for considerable prior knowledge of the context in order to design effective sampling strategies.
A Bayesian Calibration Model for Combining Different Preprocessing Methods in Affymetrix Chips
In gene expression studies a key role is played by the so-called “pre-processing”, a series of steps designed to extract the signal and account for sources of variability due to the technology used rather than to biological differences between the RNA samples. Many studies have shown how this choice can affect the results of subsequent analyses carried out to measure the influence of biological contrasts on differential expression. At the moment there is no commonly agreed gold-standard method, and each researcher has the responsibility of choosing one pre-processing method, incurring the risk of false positives and false negatives associated with the chosen method. We propose a Bayesian model that combines several pre-processing methods to assess the “true” unknown differential expression between two conditions, and show how to estimate the posterior distribution of the differential expression values of interest. The model is tested both on simulated data and on a spike-in dataset, and its biological interest is demonstrated through a real example on publicly available data.
Bayesian hierarchical model for the prediction of football results
The problem of modelling football data has become increasingly popular in the last few years, and many different models have been proposed with the aim of estimating the characteristics that bring a team to lose or win a game, or of predicting the score of a particular match. We propose a Bayesian hierarchical model to address both of these aims and test its predictive strength on data from the Italian Serie A 1991-1992 championship. To overcome the issue of overshrinkage produced by the Bayesian hierarchical model, we specify a more complex mixture model that results in a better fit to the observed data. We test its performance using an example from the Italian Serie A 2007-2008 championship. (jointly with M. Blangiardo)
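A standard Poisson scoring formulation in the spirit of such football models can be sketched as follows. This is illustrative only: the team names and parameters below are made up, and the talk's actual hierarchical and mixture specifications are richer than this single layer.

```python
import numpy as np

rng = np.random.default_rng(2)

# Each team t has attack and defence effects drawn from a common prior
# (the hierarchical layer); home goals ~ Poisson(exp(home + att_h - def_a)).
teams = ["Milan", "Juventus", "Inter", "Roma"]   # illustrative names
home_adv = 0.25
att = rng.normal(0.0, 0.2, len(teams))
dfc = rng.normal(0.0, 0.2, len(teams))

def simulate_match(h, a):
    """Simulate one scoreline for home team h vs away team a."""
    mu_home = np.exp(home_adv + att[h] - dfc[a])
    mu_away = np.exp(att[a] - dfc[h])
    return rng.poisson(mu_home), rng.poisson(mu_away)

# Predictive check: repeatedly simulate one fixture and summarise.
scores = [simulate_match(0, 1) for _ in range(1000)]
home_wins = sum(hg > ag for hg, ag in scores)
print(f"P(home win) for this fixture: about {home_wins / 1000:.2f}")
```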
Forecasting a macroeconomic aggregate which includes components with common trends
Antoni Espasa (Univ. Carlos III de Madrid)
The empirical literature on forecasting an aggregate by forecasting its disaggregates has usually used low disaggregation levels. The components at the highest level of the breakdown of a macroeconomic aggregate (full disaggregation) often show features which are shared by a significant proportion of them. In this paper the disaggregation level is treated as a statistical problem that can be approached by estimating common trends. This information on common trends makes it possible to define a disaggregation scheme by selecting from the full disaggregation all the components, say m, which share a common trend (these components define the set B) and grouping the remaining components into a sub-aggregate R. The m components of the set B are identified by a testing procedure carried out over all possible pairs of components from the full disaggregation. This approach obtains a parsimonious breakdown of the data into m+1 components. A forecasting strategy is proposed, consisting in forecasting each of the components in B - taking into account the common trend they share, through a cointegration mechanism between each component and the common trend - forecasting the sub-aggregate R independently, and then aggregating all these forecasts. This strategy is applied to forecasting Euro area and US inflation, where 37% and 43% of the CPI weight, respectively, share a common trend. It is shown that this strategy significantly improves the forecasting accuracy of the corresponding aggregate at all horizons from 1 to 12; the improvement increases with the length of the horizon in the Euro area case and is constant across horizons for US inflation. Additionally, we argue that the out-of-sample accuracy gains implied by our procedure increase with the number of basic components which are fully cointegrated.
In this approach it is important to work with the full disaggregation, because official breakdowns consist of sub-aggregates which include components from both sets, B and R, so the cointegrating relationships between the official sub-aggregates can be unstable. Finally, it is also shown that this procedure performs better than one based on dynamic factor analysis. (Joint work with Iván Mayo)
Clustering of Curves
Silvia Liverani (Univ. of Warwick)
An increasing number of microarray experiments produce time series of expression levels for many genes. Some recent clustering algorithms respect the time ordering of the data and are, importantly, extremely fast. The aim is to cluster and classify the expression profiles in order to identify genes potentially involved in, and regulated by, the circadian clock. In this presentation we report a new development associated with this methodology. The partition space is intelligently searched, placing most effort on refining the partition where genes are likely to be of most scientific interest. This utility-based Bayesian search algorithm can be shown, both theoretically and practically, to outperform the greedy search algorithm, which does not use contextual information to guide the search.
Design and analysis of teaching experiments for course quality in the academic setting
The never-ending debate on quality involves every institution devoted to higher education. It is undeniable that, with massive and fast globalisation causing student flows in every direction, and under the constraints of the limited economic resources devoted to higher education, the performance and quality of any academic system must be constantly monitored and improved. Thus, whoever researches and teaches the scientific foundations of quality has at the same time the right and the duty to provide his/her opinion and, first of all, to assure the quality of the processes for which he/she is responsible. In the recent past the authors started focusing on teaching, as a relevant aspect of the transmission of knowledge to an audience that constantly experiences a rapidly changing technological environment. We recently proposed a methodology named TESF: Teaching Experiments and Student Feedback. It is aimed at designing, monitoring, and continuously improving (according to Deming’s cycle) the quality of a university course. The TESF methodology is based on the concurrent adoption of Design of Experiments and the SERVQUAL model. The experiments are essentially “teaching experiments” performed by the teacher according to a predefined plan. The teacher is therefore the designer, the experimenter and part of the experimental unit. The other part of the experimental unit is constituted by a predefined sample of students attending the course (the student evaluators’ sample), whose feedback is carefully studied. We have shown, through a preliminary application of TESF, that one need not be an experienced statistician to apply this methodology, and even the description of the model is kept at the most general level. In fact the methodology, initially conceived for the academic environment, could easily be applied to any educational context.
On the other hand, it is evident that expert statisticians can be stimulated by this approach, so a scientific discussion can be opened and further substantial improvements can be gained. This seminar aims to give an overview of TESF, emphasising the statistical aspects involved and the delicate experimental and measurement issues. An interesting upgrade concerning the analysis of the data (student feedback) will also be mentioned. Results from the application of the methodology in three consecutive editions of a Statistics course at the University of Palermo will be presented.
Conjoint analysis and response surface methodology: searching the full optimal profile by status quo and optimization
Rossella Berni (Università di Firenze)
Standard conjoint analysis is a multi-attribute quantitative method for studying how a consumer/user evaluates a new product/service. In this seminar we present a proposal for a modified conjoint analysis (CA) to evaluate a new or revised product/service through a generic consumer or user. The proposal is based on the application of Response Surface Methodology (RSM) to determine the best preference for a sample of respondents by evaluating both the quantitative judgements about the full profiles and the judgements about the current situation, or status quo; in addition, baseline variables of the respondents are considered. The optimal solution for the new or revised product/service is therefore obtained by computing the optimal hypothetical solution through the experienced status quo. Note that the estimated model is subsequently optimized to determine the best preference on the basis of the factors involved in the experimental design and the judgements collected. Our proposal is applied to University students in their second and third year; the aim is the evaluation of an interdisciplinary degree course at the University of Florence, identifying the best degree course configuration according to the considered factors and the students' judgements.
Another look into the effect of premarital cohabitation on duration of marriage: an approach based on matching
Stefano Mazzuco (Università di Padova)
The paper proposes an alternative approach to studying the effect of premarital cohabitation on subsequent duration of marriage on the basis of a strong ignorability assumption. The approach is called propensity score matching and consists of computing survival functions conditional on a function of observed variables (the propensity score), thus eliminating any selection that is derived from these variables. In this way, it is possible to identify a time varying effect of cohabitation without making any assumption either regarding its shape or the functional form of covariate effects. The output of the matching method is the difference between the survival functions of treated and untreated individuals at each time point. Results show that the cohabitation effect on duration of marriage is indeed time varying, being close to zero for the first 2–3 years and rising considerably in the following years.
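The two stages described above - propensity-score estimation followed by matching and a time-point-by-time-point comparison of survival functions - can be sketched as follows. Everything here is a simplified stand-in (simulated data, no censoring, a hand-rolled logistic fit, plain nearest-neighbour matching on the score), not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy cohort: x drives both cohabitation (treatment) and marriage duration.
n = 400
x = rng.normal(size=n)
treat = (rng.random(n) < 1 / (1 + np.exp(-x))).astype(int)
durations = rng.exponential(scale=np.exp(0.5 - 0.2 * x), size=n)

# Stage 1: estimate the propensity score by logistic regression
# (a few Newton-Raphson steps; any GLM routine would do in practice).
X = np.column_stack([np.ones(n), x])
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    W = p * (1 - p)
    beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (treat - p))
ps = 1 / (1 + np.exp(-X @ beta))

# Stage 2: match each treated unit to the control nearest in PS, then
# compare the two empirical survival functions at each time point.
t_idx, c_idx = np.where(treat == 1)[0], np.where(treat == 0)[0]
nn = c_idx[np.argmin(np.abs(ps[t_idx][:, None] - ps[c_idx][None, :]), axis=1)]

grid = np.linspace(0, 3, 31)
surv_t = [(durations[t_idx] > t).mean() for t in grid]
surv_c = [(durations[nn] > t).mean() for t in grid]
effect = np.array(surv_t) - np.array(surv_c)   # time-varying effect
print(effect[:5].round(3))
```

The output is exactly the object the abstract describes: a difference between treated and matched-control survival curves at each time point, with no shape imposed on how the effect varies over time.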
Pension Issues in Japan: How Can We Cope with the Declining Population?
After a brief sketch of Japanese demography and its impact on financing social security, I turn to explaining the Japanese social security pension program and summarize Japan’s major pension problems. I further examine the 2004 pension reform and use the balance sheet approach to analyze its economic implications. I also discuss future policy options on pensions. Financial sustainability of social security pensions is often not attained even when their income statement enjoys a surplus. The balance sheet approach is an indispensable tool for understanding the long-run financial sustainability of social security pensions and for evaluating the varying financial impacts of different reform alternatives. When it comes to social security pensions, the most important question is whether or not they are worth buying. Contributions need to be much more directly linked with old-age pension benefits, while an element of social adequacy has to be incorporated in a separate tier of pension benefits financed by sources other than contributions. It is also shown that a shift to a consumption-based tax to finance the basic pension in Japan would induce smoother increases in pension burdens across different cohorts.
Interdependencies between fertility and women's labour supply in Europe: how can a multi-process hazard model help us model this relationship?
Anna Matysiak (Warsaw School of Economics)
The paper discusses the state of current research on the interdependencies between fertility and women’s labour supply in Europe. It outlines a theoretical model of decision-making with respect to childbearing and the economic activity of women, and formulates conditions that should be met for a proper assessment of the time conflict between the two activities. Against this theoretical background, studies researching the association between fertility and women’s labour supply are critically evaluated. The paper then discusses how a multi-process hazard model can help eliminate some of the shortcomings of current research. The model is estimated for Poland and its findings are discussed within the Polish socio-economic context. The paper concludes with suggestions for further research on the interdependencies between fertility and women’s labour supply.
A segmented regression model for event history data: an application to the fertility patterns in Italy
Massimo Attanasio (Università di Palermo)
We propose a segmented discrete-time model for the analysis of event history data in demographic research. Through a unified regression framework, the model provides estimates of the effects of explanatory variables and flexibly accommodates non-proportional differences via segmented relationships. Its main appeal lies in the ready availability of parameters, changepoints, and slopes, which may provide meaningful and intuitive information on the topic. Furthermore, specific linear constraints on the slopes may be set to investigate particular patterns. We investigate the intervals between cohabitation and first childbirth and from first to second childbirth, using individual data on Italian women from the Second National Survey on Fertility. The model provides insights into the dramatic decrease in fertility experienced in Italy, in that it detects a 'common' tendency to delay the onset of childbearing among the more recent cohorts and a 'specific' postponement strictly dependent on educational level and age at cohabitation.
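For a fixed changepoint, a segmented relationship is linear in a broken-line basis, which is the core of what such models estimate (in the paper the changepoints themselves are also estimated, and the response is a discrete-time hazard rather than the Gaussian outcome used here). A minimal sketch on simulated data, with names of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical segmented relationship: the effect of duration t changes
# slope at a changepoint psi (here taken as known for simplicity).
t = np.linspace(0, 10, 200)
psi = 4.0
y = 0.2 + 0.5 * t - 0.4 * np.maximum(t - psi, 0) + rng.normal(0, 0.1, 200)

# With psi fixed, the model is linear in the basis [1, t, (t - psi)_+],
# so the slope before psi and the change in slope after psi are ordinary
# regression coefficients.
B = np.column_stack([np.ones_like(t), t, np.maximum(t - psi, 0)])
coef, *_ = np.linalg.lstsq(B, y, rcond=None)
print(coef.round(2))   # intercept, slope before psi, change in slope
```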
Bayesian CAR Models for Syndromic Surveillance on Multiple Data Streams: Theory and Practice
Gauri Datta (University of Georgia)
Syndromic surveillance has, so far, considered only simple models for Bayesian inference. This lecture details the methodology for a serious, scalable solution to the problem of combining symptom data from a network of U.S. hospitals for the early detection of disease outbreaks. The approach requires high-end Bayesian modeling and significant computation, but the strategy described here appears to be feasible and offers attractive advantages over the methods currently used in this area. The method is illustrated by application to ten quarters' worth of data on opioid drug abuse surveillance from 636 reporting centers, and then compared to two other syndromic surveillance methods using simulation to create a known signal in the drug abuse database.
Mixture Priors for Bayesian Variable Selection
Marina Vannucci (Rice University)
In this talk I will review Bayesian methods for variable selection that use spike and slab priors. Specific interest will be towards high-dimensional data. Linear and nonlinear models will be considered, with continuous, categorical and survival responses. Applications will be to genomics data from DNA microarray studies. The analysis of the high-dimensional data generated by such studies often challenges standard statistical methods. Models and algorithms are quite flexible and allow us to incorporate additional information, such as data substructure and/or knowledge on gene functions and on relationships among genes.
Direct and Indirect Effects: An Unhelpful Distinction?
Donald B. Rubin (Harvard University)
The terminology of direct and indirect causal effects is relatively common in causal conversation, as well as in some more formal language. In the context of real statistical problems, however, I do not think that this terminology is helpful to clear thinking; rather, it leads to confused thinking. This presentation will give several real examples where this point arises, as well as one which illustrates that even Sir Ronald Fisher was vulnerable to such confusion.
Factor Scores as Proxies for Latent Variables in Structural Equation Modelling (SEM)
Anders Skrondal (Norwegian Inst. of Public Health)
Structural equation models with latent variables are sometimes estimated using an intuitive approach where factor scores are plugged in for latent variables. Ordinary regression analysis is then performed with the factor scores simply treated as observed variables. Not surprisingly, we show that this approach in general produces inconsistent estimates of the parameters of main scientific interest. Rather remarkably, consistent estimates for all parameters can however be obtained if the factor scoring methods are judiciously chosen.
Improved Regression Calibration
Anders Skrondal (Norwegian Inst. of Public Health)
The joint likelihood for generalized linear models with covariate measurement error cannot in general be expressed in closed form and maximum likelihood estimation is hence taxing. A popular alternative is regression calibration which is computationally efficient at the cost of potentially inconsistent parameter estimation. We propose an improved regression calibration approach, based on an approximate decomposed form of the joint likelihood, which is both consistent and computationally convenient. It produces point estimates and estimated standard errors which are practically identical to those obtained by maximum likelihood. Simulations suggest that improved regression calibration, which is easy to implement in standard software, works well in a range of situations.
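For contrast with the improved variant proposed here, the classical regression-calibration step can be sketched in a few lines: the error-prone measurement W is replaced by E[X|W] before the outcome model is fitted. The sketch assumes the measurement-error variance is known; in practice it would be estimated from replicates or validation data, and the paper's improved version refines this step further.

```python
import numpy as np

rng = np.random.default_rng(4)

# True covariate x, error-prone measurement w = x + u, outcome y.
n = 5000
x = rng.normal(0, 1, n)
w = x + rng.normal(0, 0.8, n)          # measurement error, var(u) = 0.64
y = 1.0 + 2.0 * x + rng.normal(0, 1, n)

# A naive fit on w is attenuated toward zero.
b_naive = np.polyfit(w, y, 1)[0]

# Regression calibration: replace w by E[x | w], here the linear shrinkage
# lambda * w with lambda = var(x) / (var(x) + var(u)).
lam = 1.0 / (1.0 + 0.64)
x_hat = lam * w
b_rc = np.polyfit(x_hat, y, 1)[0]

print(f"naive: {b_naive:.2f}  calibrated: {b_rc:.2f}  (truth: 2.00)")
```

In this linear-Gaussian toy case regression calibration is consistent; the inconsistency the abstract refers to arises in nonlinear generalized linear models, which is where the improved decomposed-likelihood approach matters.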
Combining Duration and Intensity of Poverty: Proposal of a New Index of Longitudinal Poverty
Daria Mendola (Università di Palermo)
Traditional measures of poverty persistence, such as the “poverty rate” or the “persistent-risk-of-poverty rate”, do not devote sufficient attention to the sequence of poverty spells. In particular, they fail to distinguish the different effects associated with occasional, single spells of poverty and with consecutive years of poverty. Here we propose a new index which measures the severity of poverty taking into account the way poverty and non-poverty spells follow one another along individual life courses. The index is normalised and increases with the number of consecutive years in poverty along the sequence, while it decreases as the distance between two years of poverty increases. All the years spent in poverty contribute to the measurement of persistence in poverty, but with a decreasing contribution as the distance between two years of poverty becomes longer. A weighted version of the index is also proposed, explicitly taking into account the distance of poor people from the poverty line. Both indexes are supported by a conceptual framework and characterised via properties and axioms. They are validated according to content, construct and criterion validity assessment, and tested on a sample of young European adults participating in the European Community Household Panel survey.
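One hypothetical instantiation consistent with the stated properties - normalised, increasing with consecutive poverty years, with contributions that decay as the gap between poverty years grows - might look as follows. This is emphatically not the authors' formula, just an illustration of the axioms.

```python
# Hypothetical scoring rule, NOT the index proposed in the paper: each
# poverty year following a gap of g non-poor years contributes decay**g,
# so consecutive years count fully and scattered years count less.
def persistence(seq, decay=0.5):
    """seq: 1 = poor in that year, 0 = not poor. Returns a score in [0, 1]."""
    score, last_poor = 0.0, None
    for t, poor in enumerate(seq):
        if poor:
            gap = 0 if last_poor is None else t - last_poor - 1
            score += decay ** gap
            last_poor = t
    return score / len(seq)            # normalise by panel length

runs = persistence([1, 1, 1, 0, 0, 0])     # one consecutive spell
spread = persistence([1, 0, 0, 1, 0, 1])   # same poverty count, scattered
print(runs, spread)                        # the consecutive spell scores higher
```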
Last updated 12 January 2010.