Research

The faculty members of the Department of Statistics and Data Science are prominent scholars, researchers, and consultants, as well as dedicated teachers.

All are actively engaged in research that is published in professional journals. Their research has been funded by major grants from private organizations and governmental agencies, including the Department of Energy, the Advanced Research Projects Agency, the National Science Foundation, the Office of Naval Research, the Department of Education, the Air Force Office of Scientific Research, the National Institutes of Health, the Department of Veterans Affairs, and the National Aeronautics and Space Administration.

Faculty members are actively working in the following areas:

Advances in statistical education

Students need to practice to learn new material, and the passive format of most lecture styles does not facilitate active learning. Just-in-time teaching (JiTT), a particular form of flipping the classroom, is one method of active learning. Students are asked to review class materials prior to a class, and then answer questions on those materials in class. This method gives students multiple exposures to important concepts. In addition, Dr. McGee has experimented with different grading systems, such as contract grading, specifications grading, and menu grading, to improve student learning and decrease grade anxiety. In contrast to traditional grading (two exams and a final plus 10% homework), these other methods allow students to learn from their mistakes or make up for a slow start.

Faculty: Monnie McGee
Analysis and methodology for compositional data

Compositional data consist of discrete outcomes (called components) that are proportions of a whole. For example, daily activity can be composed of sleep, work, exercise, and leisure time. In each 24-hour period, the times spent in these activities sum to 24 hours and can easily be converted to percentages of time spent in each activity. These data cannot be analyzed by classical statistical methods because the components are necessarily dependent. Spending more time sleeping, for example, means less time for the other three activities. Monnie McGee and her students have developed methods based on the nested Dirichlet distribution to test for differences in components for G independent groups, where G ≥ 2 (e.g., time spent for men vs. women, or older people versus middle-aged people versus younger people). Applications include comparison of components of cell populations in lupus vs. healthy patients, time spent in a water maze for normal vs. impaired mice, phylogenetic trees for populations of microbiome data, and outcomes of at-bats from young, mid-career, and experienced baseball players. This is continuing work with two former students, Jacob Turner (SFA) and Bianca Luedeker (NAU).
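
The dependence among components can be seen in a small simulation. The following is a minimal sketch, not the nested Dirichlet methodology described above: it assumes an ordinary Dirichlet distribution with made-up concentration parameters (`alpha`) for sleep, work, exercise, and leisure, converts simulated hours to proportions, and prints the correlation matrix of the components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical daily time budgets: Dirichlet draws scaled to 24 hours.
# Columns: sleep, work, exercise, leisure. The alpha values are assumptions
# chosen only for illustration.
alpha = np.array([8.0, 8.0, 1.0, 7.0])
hours = 24.0 * rng.dirichlet(alpha, size=500)

# Convert hours to proportions of the day; each row sums to 1.
proportions = hours / hours.sum(axis=1, keepdims=True)

# Because the components must sum to a constant, more time in one
# activity leaves less for the others, which shows up as negative
# correlation (for example, between sleep and work).
corr = np.corrcoef(proportions, rowvar=False)
print(np.round(corr, 2))
```

Under these assumed parameters the sleep and work proportions are clearly negatively correlated, which is the kind of built-in dependence that classical methods for independent outcomes do not account for.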

Faculty: Monnie McGee
Data integration with probability and nonprobability samples

Declining response rates and rising costs make probability sampling increasingly difficult, while large nonprobability data sources are more accessible but often suffer from substantial selection bias. This project focuses on integrating probability and nonprobability samples when auxiliary information is available in both sources. We develop a unified semiparametric propensity score framework that relaxes ignorability and mitigates parametric misspecification risk, providing valid inference whether selection is ignorable or not. Estimation proceeds via a pseudo profile likelihood approach with profiling for nonparametric components. We establish asymptotic properties and derive variance estimators. Ongoing work extends these ideas to administrative record linkage in early childhood education evaluations, where linkage can be outcome-informative and linked outcomes may suffer from mismatch error.

Faculty: Danhyang Lee
Missing data and selective nonresponse

Real-world datasets often contain missing values, and the missingness is frequently informative, meaning that response can depend on unobserved outcomes. This research develops semiparametric methods for missing data, especially under missing-not-at-random (MNAR) mechanisms, to reduce sensitivity to response-model misspecification while retaining efficiency. We study flexible nonresponse models and propose profile-based efficient estimators, including profile maximum likelihood and profile calibration, with theoretical guarantees and asymptotic variance estimation.

Faculty: Danhyang Lee
Bayesian methods and applications

Bayesian hierarchical models offer flexibility in model construction and can accommodate complex data structures. The research in this area spans a diversity of applications, ranging from psychological and behavioral science studies to high-throughput data analysis and spatial-temporal analysis of longitudinal and survival data. Currently, we are constructing spatial-temporal predictive models to study intervention effects in economic observational studies with small sample sizes and heterogeneous spatial structure. Another project studies prediction and uncertainty quantification for high-dimensional multi-type data with block-wise missing structure.

Faculty: Jing Cao
Causal inference

In clinical trials with perfect adherence to assigned treatment and no missing data, it is straightforward to make causal inferences from study data. In all other contexts, it is not possible to make causal inferences without making some assumptions about the treatment assignment mechanism — i.e., the dependence of the treatment assignment on baseline factors. Together with several recent graduates of the biostatistics and statistics PhD programs, we have developed methods to extract causal inferences in situations where the treatment assignment mechanism is complex, as is typically the case in clinical research in surgery. We have also explored sensitivity of various clinical trial estimands to correlation of individual compliance behavior and causal effects. Our papers on these topics have appeared in Biometrics, Statistics in Medicine, and International Journal of Biostatistics.

Faculty: Daniel Heitjan
Clinical trial design and analysis

Randomized trials with correlated outcomes are widely employed in medical, epidemiological, and behavioral studies. Correlated outcomes are usually categorized into two types: clustered and longitudinal. The former arises from trials where randomization is performed at the level of some aggregate (e.g., clinics) of research subjects (e.g., patients). The latter arises when the outcome is measured multiple times during follow-up from each subject. In addition, missing data are a common issue, leading to the challenge of “partial” observations. This research develops sample size methods based on generalized estimating equations (GEE) that cover various types of correlated outcomes (continuous, binary, and count) and accommodate missing data, correlation structures, and financial constraints. Trials based on such outcomes, although more complicated than conventional randomized trials, offer greater flexibility and efficiency in practice.
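
To illustrate why correlation matters for sample size, the sketch below applies the standard design effect for cluster randomized trials, 1 + (m - 1)ρ, where m is the cluster size and ρ is the intracluster correlation. This is a textbook simplification that assumes equal cluster sizes, an exchangeable correlation structure, and no missing data; it is not the GEE-based methodology described above, and the function name and input values are hypothetical.

```python
import math

def cluster_trial_sample_size(n_individual: int, cluster_size: int, icc: float) -> int:
    """Inflate an individually randomized sample size by the design effect.

    Assumes equal cluster sizes and an exchangeable within-cluster
    correlation (icc); a simplification used only for illustration.
    """
    design_effect = 1.0 + (cluster_size - 1) * icc
    return math.ceil(n_individual * design_effect)

# Hypothetical inputs: 128 subjects would suffice under independence,
# clinics enroll 20 patients each, and the intracluster correlation is 0.05.
n_total = cluster_trial_sample_size(128, 20, 0.05)
print(n_total)  # 250
```

Even a small intracluster correlation nearly doubles the required number of subjects in this example, which is why sample size methods for such trials need to model the correlation explicitly.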

Faculty: Jing Cao
DNA sequence analysis via Fourier transforms and abstraction augmented Markov models

Current metagenomic methods require alignment of reads to a database of genomes. Alignment is not only computationally expensive; it also ties identification of a microbial population to the presence of a sequenced genome in the database. It is possible that novel microbial populations that have yet to be sequenced exist in samples from the mouth, lungs, and other human environments. To identify them, it is necessary to employ alignment-free techniques, such as abstraction augmented Markov models (AAMM), to identify the genera and species of the organisms. Related to this, we are exploring the use of Fourier coefficients as a signature for DNA sequences. We have used Fourier coefficients to classify the geographic location of DNA sequences from coronavirus samples.
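
One possible way to turn a DNA sequence into Fourier coefficients is sketched below. The source does not say how the coefficients are computed, so everything here is an assumption: the sequence is encoded as four 0/1 indicator sequences (one per base), each is passed through a discrete Fourier transform, and the magnitudes of the first few coefficients are concatenated into a fixed-length feature vector. The function name `fourier_signature` and the example sequence are hypothetical.

```python
import numpy as np

def fourier_signature(sequence: str, n_coeffs: int = 8) -> np.ndarray:
    """Return magnitudes of the lowest-frequency DFT coefficients per base.

    Assumes an indicator encoding of the bases A, C, G, T; the sequence
    should be long enough to supply n_coeffs coefficients per base.
    """
    seq = sequence.upper()
    features = []
    for base in "ACGT":
        # 1.0 where the base occurs, 0.0 elsewhere.
        indicator = np.array([1.0 if ch == base else 0.0 for ch in seq])
        # Normalize by length so sequences of different lengths are comparable.
        spectrum = np.abs(np.fft.rfft(indicator)) / len(seq)
        features.append(spectrum[:n_coeffs])
    return np.concatenate(features)

sig = fourier_signature("ATGCGATACGCTTGACGATCGATCGTAGCTAGCTA")
print(sig.shape)  # (32,)
```

No alignment to a reference genome is needed to compute such a vector, and it could be fed to any standard classifier. With this encoding, the zero-frequency coefficient of each base is simply that base's relative frequency in the sequence.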

Faculty: Monnie McGee
Dynamic risk prediction triggered by intermediate event

The availability of extensive data from electronic health records and registry databases has sparked considerable interest in incorporating time-varying patient information to enhance risk prediction. Unlike static risk prediction methods that provide a conditional survival function based on baseline predictors, dynamic landmark prediction focuses on the survival function conditioned on the patient's predictor history up to a specific landmark time. However, incorporating the complex stochastic processes involved in the patient's history information poses challenges when applying tree-based methods. To address this, we propose a unified framework for landmark prediction using survival tree ensembles, allowing for updated predictions as new information becomes available. Additionally, we represent the patient's history information as a fixed-length predictor vector, enabling the application of recursive partitioning techniques to exploit the growing amount of predictor information over time.

Faculty: Steven Chiou
Geometric and topological data analysis

A defining characteristic of many modern data applications is their unstructured nature. The basic unit of analysis may be something other than the traditional regular array with fixed numbers of rows and columns and a single observation in each cell. Such data are not amenable to traditional statistical procedures based on simple array-structured data. Geometric and topological data analysis provides a mathematical representation of the shape of data and extracts structural information from a complex data set. We have developed statistical approaches for geometric and topological data analysis that provide direct inference on the shape of data.

Faculty: Chul Moon
Human mobility patterns

The near-universal adoption of smartphones gives researchers unprecedented opportunities to collect data on human mobility. It has been shown that insights gained through such data can inform treatment of certain medical conditions or provide early warning about others. We research new models to represent such data and propose statistical methods to deal with some of the challenges they present, such as incomplete or missing observations and measurement error.

Faculty: Marcin Jurek
Machine learning and text sentiment analysis

Machine learning (ML) has enjoyed great success in prediction and classification using big data. However, the high accuracy of ML algorithms often goes hand in hand with a lack of interpretability due to their complex inner workings. In practice, this can be a limitation because interpretability is crucial for understanding and acceptance of prediction or classification outcomes. To meet this challenge, we incorporate machine learning components, such as the attention mechanism, into a relatively simple parametric statistical model structure. We have applied this idea to text sentiment analysis. By combining the attention mechanism from ML (which is capable of providing meaningful word embedding vectors) with a relatively simple interpretable statistical model, we are able to get the best of both worlds: the interpretability of a statistical model and the high predictive performance of ML algorithms.

Faculty: Jing Cao
Mixed-valued time series analysis

Multivariate time series are routinely modeled and analyzed with the well-known vector autoregressive (VAR) models. The main reasons are that the imposed linearity makes them easy to compute, they are easily understood by a wide audience, and they provide predictions. Though VAR models are well understood from a theoretical and methodological point of view and are quite useful for the analysis of continuous-valued data, they are inappropriate for multivariate time series in which some components are integer-valued, such as the daily number of new patient admissions to a hospital, the number of crimes in a particular region, or the trading volume during a time period. The goal is to develop new statistical tools and models for analyzing multivariate mixed-valued time series data. This is significant because multivariate time series data, both discrete and continuous-valued, are collected in diverse scientific areas such as demography, econometrics, sociology, public health, and neurobiology for the purposes of forecasting, planning, and informing policy.

Faculty: Raanju Sundararajan
Multivariate and high-dimensional time series analysis

Time series data from various sources often appear in multivariate and high-dimensional form. Numerous important problems from application areas such as neuroscience, finance, environmental science, and engineering involve analyzing time series data. As an example in neuroscience, functional MRI (fMRI) data are recorded as a high-dimensional time series with signals from several thousand spatial locations in the brain. The interest here is in understanding time-varying interactions between different brain locations and in relating these interactions to neurological disorders. As an example in engineering, power systems operations of renewable energy sources like wind depend on modeling and forecasting of multivariate time series data. Managing the renewable energy grid is critical to utilizing that energy source effectively, and time series methods play a central role in the decision-making process of these systems. The problems identified in the above-mentioned areas require new time series methods that are computationally feasible and grounded in theory. Ongoing research focuses on developing such methods, which have theoretical and methodological importance in time series analysis.

Faculty: Raanju Sundararajan
Measuring sensitivity to nonignorability

Most data sets of any consequence have some missing observations. When the propensity to be missing is associated with the values of the observations, we say that the data are nonignorably missing. Nonignorability can lead to bias and other problems when one applies standard statistical analyses to the data. In principle, one can eliminate such problems by estimating models that account for nonignorability, but these models are notoriously non-robust and difficult to fit. An alternative approach is to measure the sensitivity to nonignorability, that is, to evaluate whether nonignorability, if it exists, is sufficient to change parameter estimates from their values under standard ignorable models. A primitive version of this idea is to tally the fraction of missing observations in a univariate data set; if the fraction is small, then presumably the potential bias arising from nonignorability is also small. We have developed methods and software to measure sensitivity for a broad range of data structures, missingness types, and statistical models.
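
The primitive check mentioned above can be sketched in a few lines. The example below tallies the fraction of missing values in a univariate sample and, as an added illustration that is not the sensitivity measure developed in this research, computes worst-case bounds on the sample mean under the assumption that every missing value lies in a known range. The data values and the range [0, 100] are hypothetical.

```python
import numpy as np

def missing_fraction(x: np.ndarray) -> float:
    """Fraction of observations that are missing (coded as NaN)."""
    return float(np.mean(np.isnan(x)))

def mean_bounds(x: np.ndarray, lower: float, upper: float) -> tuple[float, float]:
    """Range of the full-sample mean if each missing value lies in [lower, upper]."""
    observed = x[~np.isnan(x)]
    n_missing = x.size - observed.size
    total = observed.sum()
    return ((total + n_missing * lower) / x.size,
            (total + n_missing * upper) / x.size)

# Hypothetical test scores on a 0-100 scale, two of them missing.
scores = np.array([72.0, 85.0, np.nan, 90.0, 64.0, np.nan, 78.0, 81.0, 88.0, 70.0])

print(missing_fraction(scores))          # 0.2
print(mean_bounds(scores, 0.0, 100.0))   # approximately (62.8, 82.8)
```

When the missing fraction is small, the two bounds are close together, so no missingness mechanism, ignorable or not, can move the estimate far. When it is large, as here, the bounds are wide and a more refined sensitivity analysis is needed.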

Faculty: Daniel Heitjan
Nonparametric statistics

Nonparametric statistics aims to infer an unknown quantity while making few underlying assumptions. Because nonparametric methods make fewer assumptions, they can be useful when existing information about the application is insufficient. In many cases, nonparametric methods can provide simpler and more robust inference than parametric methods. Empirical likelihood is an example of a nonparametric method of inference.

Faculty: Chul Moon
Order statistics

An order statistic is the realized ranked value of a random variable in a sample. The study of order statistics can be useful in a range of problems, such as evaluating the reliability of a manufacturing system that depends on performance of many similar parts or the risk to a life insurance company for its portfolio of policies. Inference from order statistics can provide robust and cost-effective testing and estimation. An example of efficient estimation using the theory of order statistics is ranked set sampling.
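
The efficiency gain from ranked set sampling can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: the population is normal with made-up parameters, ranking within each set is perfect, and the names `ranked_set_sample` and `draw` are hypothetical. In each cycle, `set_size` sets of `set_size` units are drawn, each set is ranked, and only the unit with the designated rank is measured.

```python
import numpy as np

rng = np.random.default_rng(1)

def draw(n: int) -> np.ndarray:
    """Draw n units from an assumed normal population."""
    return rng.normal(loc=10.0, scale=2.0, size=n)

def ranked_set_sample(set_size: int, cycles: int) -> np.ndarray:
    """Return set_size * cycles measured values from a balanced ranked set sample."""
    measured = []
    for _ in range(cycles):
        for rank in range(set_size):
            # Rank a fresh set (perfect ranking assumed) and keep one unit.
            candidates = np.sort(draw(set_size))
            measured.append(candidates[rank])
    return np.array(measured)

set_size, cycles, reps = 3, 4, 2000
n_measured = set_size * cycles  # both designs measure 12 units

rss_means = np.array([ranked_set_sample(set_size, cycles).mean() for _ in range(reps)])
srs_means = np.array([draw(n_measured).mean() for _ in range(reps)])

# Both estimators target the population mean; compare their variability.
print("RSS variance of mean:", rss_means.var())
print("SRS variance of mean:", srs_means.var())
```

Both designs measure the same number of units, but the ranked set sample mean varies less across replications because the measured units are spread across the order statistics. This is what makes the approach cost-effective when ranking is cheap and measurement is expensive.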

Faculty: Chul Moon
Ranking and selection

Decision-makers are frequently confronted with the problem of selecting from among a set of possible choices. Ranking and selection addresses the problem of how to choose the best among a group of items, where the quality of those items is measured imperfectly. Another aspect of the problem that we have studied is how to assess the quality of the measures themselves; i.e., ranking the rankers. Our approaches have included various ways of modeling the evaluation process. Applications have been wide-ranging, from wine-tasting, to proposal evaluation, to diving scores.

Faculty: Jing Cao, Monnie McGee
Real-time prediction in clinical trials

Clinical trial planning involves the specification of a projected duration of enrollment and follow-up needed to achieve the targeted study power. If pre-trial estimates of enrollment and event rates are inaccurate, projections can be faulty, leading to inadequate power or other mis-allocation of resources. We have developed an array of methods that use the accruing trial data to efficiently and correctly predict future enrollment counts, times of occurrence of landmark events, estimated final treatment effects, and ultimate significance of the trial.

Faculty: Daniel Heitjan
Recurrent event analysis

Recurrent event analyses have wide-ranging applications in biomedicine, public health, and engineering, among other fields, where subjects experience a sequence of events of interest during follow-up. However, simple survival methods that focus solely on the first event can overlook valuable information on subsequent events, leading to bias and potentially misleading results. Consequently, considerable attention has been given to approaches that address the sequential nature of recurrent event times without information loss. Since recurrent events can be terminated by informative censoring or a terminal event, the use of frailty models to relax the assumption of conditionally independent censoring and to jointly model the recurrent event process and the terminal event has gained significant interest. We have developed a general scale-change joint model that encompasses the popular Cox-type model, the accelerated rate model, and the accelerated mean model as special cases, accommodating informative censoring through subject-specific frailty without any need for parametric specification.

Faculty: Steven Chiou
Scalable Gaussian processes with applications to earth sciences

Modern remote sensing technology has been producing an incredible amount of environmental data which helps to produce new insights into the mechanisms governing the Earth's ecosystem. Many of the popular tools used to model such data are based on Gaussian processes (GPs). This versatile and analytically tractable approach provides a natural way to quantify uncertainty yet often suffers from computational problems. In this line of research we explore new ways to exploit the favorable properties of GPs in the analysis of environmental data while making sure that they can be scaled to massive data sets.

Faculty: Marcin Jurek
Sports analytics: swimming, diving, track and field

Sports analytics is a big business, particularly in football, basketball, baseball, soccer, and hockey. McGee concentrates her sports analytics work on “individual-team sports”, such as track and field and swimming and diving, in which individual performance is critical to team scores. Recent work showing that girls and boys do not tend to plateau in performance in running events during high school, as was previously believed, has been published in The American Statistician. An article showing that judges’ rankings of divers in a regional diving competition are compatible with measurements taken from videos of the same dives has been published in PLoS One. Data on scores from a diving competition for educational use have been published in the Journal of Statistics and Data Science Education.

Faculty: Monnie McGee