Research

The faculty members of the Department of Statistics and Data Science are prominent scholars, researchers, and consultants, as well as dedicated teachers.

All are actively engaged in research which is being published in professional journals. Their research has been funded by major grants from private organizations and governmental agencies, including the Department of Energy, Advanced Research Projects Agency, National Science Foundation, Office of Naval Research, Department of Education, Air Force Office of Scientific Research, the National Institutes of Health, Department of Veteran Affairs, and the National Aeronautics and Space Administration.

Faculty members are actively working in the following areas:

Advances in statistical education

Students need to practice to learn new material, and the passive format of most lecture styles does not facilitate active learning. Just in time teaching (JiTT), a particular form of flipping the classroom, is one method of active learning. Students are asked to review class materials prior to a class, then in class they answer questions on the materials. This method gives students multiple exposures to important concepts. In addition, Dr. McGee has experimented with different grading systems, such as contract grading, specifications grading, and menu grading, to improve student learning and decrease grade anxiety. In contrast to traditional grading (two exams and a final plus 10% homework), these other methods allow students to learn from their mistakes or make up for a slow start.

Faculty: Monnie McGee
Analysis and methodology for compositional data

Compositional data consist of discrete outcomes (called components) that are proportions of a whole. For example, daily activity can be composed of sleep, work, exercise, and leisure time. In each 24-hour period, the time spent in each one of these activities should sum to 24 and can easily be converted to percentages of time spent in each activity. These data cannot be analyzed by classical statistical methods because they are forcibly dependent. Spending more time sleeping, for example, means less time for the other three activities. Monnie McGee and her students have developed methods based on the nested Dirich let distribution to test for differences in components for G independent groups, where G>2 (e.g., time spent for men vs. women, or older people versus middle-aged people versus younger people). Applications include comparison of components of cell populations in Lupus vs. Healthy patients, time spent in a water maze for normal vs. impaired mice, phylogenetic trees for populations of microbiome data, and outcomes of at-bats from young, mid-career and experienced baseball players. This is continuing work with two former students, Jacob Turner (SFA) and Bianca Luedeker (NAU).

Faculty: Monnie McGee
Bayesian methods and applications

Bayesian hierarchical models enjoy flexibility in model construction and accommodation of complex data structures. The research in this area spans a diversity of applications, ranging from psychological and behavioral science studies, high throughput data analysis, spatial-temporal analysis of longitudinal and survival data. Currently, we are constructing spatial-temporal predictive models to study intervention effect in economic observational studies with small sample size and heterogeneous spatial structure. Another project is to study prediction and uncertainty quantification of high-dimensional multi-type data with block-wise missing structure.

Faculty: Jing Cao
Causal inference

In clinical trials with perfect adherence to assigned treatment and no missing data, it is straightforward to make causal inferences from study data. In all other contexts, it is not possible to make causal inferences without making some assumptions about the treatment assignment mechanism — i.e., the dependence of the treatment assignment on baseline factors. Together with several recent graduates of the biostatistics and statistics PhD programs, we have developed methods to extract causal inferences in situations where the treatment assignment mechanism is complex, as is typically the case in clinical research in surgery. We have also explored sensitivity of various clinical trial estimands to correlation of individual compliance behavior and causal effects. Our papers on these topics have appeared in Biometrics, Statistics in Medicine, and International Journal of Biostatistics.

Faculty: Daniel Heitjan
Clinical trial design and analysis

Randomized trials with correlated outcomes are widely employed in medical, epidemiological, and behavioral studies. Correlated outcomes are usually categorized into two types: clustered and longitudinal. The former arises from trials where randomization is performed at the level of some aggregate (e.g., clinics) of research subjects (e.g., patients). The latter arises when the outcome is measured multiple times during follow-up from each subject. In addition, missing data is a common issue which leads to the challenge of “partial” observations. The research is to develop GEE-based sample size methods that cover various types of correlated outcomes (continuous, binary, and count) and accommodate missing data, correlation structures, and financial constraints. Trials based on such outcomes, although more complicated than the conventional randomized trials, offer greater flexibility and efficiency in practice.

Faculty: Jing Cao
DNA sequence analysis via Fourier transforms and abstraction augmented Markov models

Current metagenomic methods require alignment of reads to a database of genomes. This is not only computationally expensive, but also identification of a microbial population is tied to the presence of a sequenced genome in a database. It is possible that novel microbial populations that have yet to be sequenced exist in samples from the mouth, lungs, and other human environments. To identify them, it is necessary to employ alignment-free techniques, such as abstraction augmented Markov models (AAMM) to identify the genera and species of the organisms. Related to this, we are exploring the use of Fourier coefficients as a signature for DNA sequences. We have used FCs to classify geographic location of DNA sequences from coronavirus samples.

Faculty: Monnie McGee
Dynamic risk prediction triggered by intermediate event

The availability of extensive data from electronic health records and registry databases has sparked considerable interest in incorporating time-varying patient information to enhance risk prediction. Unlike static risk prediction methods that provide a conditional survival function based on baseline predictors, dynamic landmark prediction focuses on the survival function conditioned on the patient's predictor history up to a specific landmark time. However, incorporating the complex stochastic processes involved in the patient's history information poses challenges when applying tree-based methods. To address this, we propose a unified framework for landmark prediction using survival tree ensembles, allowing for updated predictions as new information becomes available. Additionally, we represent the patient's history information as a fixed-length predictor vector, enabling the application of recursive partitioning techniques to exploit the growing amount of predictor information over time.

Faculty: Steven Chiou
Complex-valued data analysis

An emerging field of statistics driven by applications in wireless communication is the analysis of data derived from digital communication systems, radar, sonar, etc. which are not real-valued but rather complex-valued. When statistical models are used to describe the transmission of data across 5G communication networks with multiple input and multiple outputs (MIMO), the raw data are complex responses so each datum response has both amplitude and phase. All transmission systems and receiving antennas have responses that are not in phase and which need to be put into phase. Distribution theory for such data poses special challenges to statistical theory particularly since the parameters of interest in such models are themselves typically complex-valued parameters. Standard distributions such as multivariate normal, Wishart, etc. need to be generalized to create models for complex multivariate normal, complex Wishart, and many other complex-valued distributions. A common misconception in Statistics is that time series and Kalman filtering models when applied in high tech contexts are used in their real response versions as typically taught in statistics classes - they are not. Rather models with complex-valued responses using complex distribution theory are used for dealing with microwave transmission and for the most commonly used example of filtering theory: radar and sonar tracking of moving objects. The likelihood functions derived from such complex data are necessarily real-valued functions as a function of complex-valued parameters and this ensures that they are not complex-analytic functions. This complicates the development of statistical likelihood theory which requires differentiation of such likelihoods. The emergence of this subject requires a substantial rewriting of the basics in mathematical statistics to accommodate inference for complex parameters from complex data.

Faculty: Ron Butler
Geometric and topological data analysis

A defining characteristic of many modern data applications is their unstructured nature. The basic unit of analysis could be something other than a traditional observation, such as regular arrays with fixed numbers of rows and columns and a single observation in each cell. Such questions are not amenable to traditional statistical procedures based on simple array-structured data. Geometric and topological data analysis provides a mathematical representation of the shape of data and extracts structural information from a complex data set. We have developed statistical approaches for geometric and topological data analysis that provide a direct inference on the shape of data.

Faculty: Chul Moon
Human mobility patterns

The near-universal adoption of smartphones gives researchers unprecedented opportunities to collect data on human mobility. It has been shown that insights gained through such data can inform treatment of certain medical conditions or provide early warning about others. We research new models to represent such data and propose statistical methods to deal with some of the challenges they present such as incomplete or missing observations or measurement error.

Faculty: Marcin Jurek
Laplace and Fourier transform inversion

The physical and engineering sciences have traditionally used Laplace, Fourier, and z-transforms as a mean for analyzing the behavior of complex random systems. Such transforms underlie most of the study of systems theory but typically it is the inverses of such transforms as time-domain functions which are of greater interest. For example, in any stochastic network or electric circuit, the equivalent transmittance between any two nodes in the system is a Laplace or z-transform for an associated impulse response function but it is rather the impulse response function in time that is of more practical interest. Thus the inversion of such transforms becomes an important mathematical concern. Numerical inversion of such transforms draws from the theory of complex variables, numerical analysis, and the mathematics of computation as it relates to floating point arithmetic.

Faculty: Ron Butler
Machine learning and text sentiment analysis

Machine learning (ML) has enjoyed great success in prediction and classification using big data. However, the desirable accuracy from ML algorithms often goes hand in hand with lack of interpretability due to their complex inner-workings. In practice, it can be a limitation because interpretability is crucial for understanding and acceptance of prediction or classification outcomes. To meet this challenge, we try to incorporate machine learning components, such as the attention mechanism, into a relatively simpler parametric statistical model structure. We have applied the idea in the text sentiment analysis. By combining the attention mechanism from ML (which is capable of providing meaningful word embedding vectors) with a relatively simple interpretable statistical model, we are able to get the best of both worlds: the interpretability of a statistical model and the high predictive performance of ML algorithms.

Faculty: Jing Cao
Mixed-valued time series analysis

Multivariate time series are routinely modeled and analyzed by the well-known vector autoregressive (VAR) models. The main reasons are ease in computation arising from the imposed linearity, easily understood by a wide audience, and provide predictions. Though VAR models are well understood from a theoretical and methodological point of view, and are quite useful for analysis of continuous-valued data, they are inappropriate when dealing with multivariate time series when some of its components are integer-valued such as the daily number of new patient admissions to a hospital, the number of crimes in a particular region, trading volume during a time period. The goal is to develop new statistical tools and models for analyzing multivariate mixed-valued time series data. This is significant because multivariate time series data, discrete and continuous-valued, is collected in diverse scientific areas such as demography, econometrics, sociology, public health, and neurobiology for the purpose of forecasting, planning and informing policy.

Faculty: Raanju Sundararajan
Multivariate and high-dimensional time series analysis

Time series data from various sources appear often in multivariate and high-dimensional form. Numerous important problems from application areas such as neuroscience, finance, environmental science and engineering involve analyzing time series data. As an example in neuroscience, functional MRI (fMRI) data from neuroscience experiments are recorded as a high-dimensional time series with signal from several thousand spatial locations in the brain. The interest here is in understanding time-varying interactions between different brain locations and also assist relating this with neurological disorders. As an example in engineering, power systems operations of renewable energy sources like wind depend on modeling and forecasting of multivariate time series data. Managing the renewable energy grid is critical in utilizing that energy source effectively and time series methods play a central role in the decision making process of these systems. The problems identified in the above mentioned areas require new time series methods that are computationally feasible and grounded in theory. Ongoing research focuses on developing such methods and these have theoretical and methodological importance in time series analysis.

Faculty: Raanju Sundararajan
Measuring sensitivity to nonignorability

Most data sets of any consequence have some missing observations. When the propensity to be missing is associated with the values of the observations, we say that the data are nonignorably missing. Nonignorability can lead to bias and other problems when one applies standard statistical analyses to the data. In principle, one can eliminate such problems by estimating models that account for nonignorability, but these models are notoriously non-robust and difficult to fit. An alternative approach is to measure the sensitivity to nonignorability, that is, to evaluate whether nonignorability, if it exists, is sufficient to change parameter estimates from their values under standard ignorable models. A primitive version of this idea is to tally the fraction of missing observations in a univariate data set; if the fraction is small, then presumably the potential bias arising from nonignorability is also small. We have developed methods and software to measure sensitivity for a broad range of data structures, missingness types, and statistical models.

Faculty: Daniel Heitjan
Nonparametric statistics

Nonparametric statistics aim to infer an unknown quantity while making a few underlying assumptions. Because nonparametric methods make fewer assumptions, they can be useful when existing information about the application is insufficient. Nonparametric methods could provide more robust and simpler inference than parametric methods for various cases. Empirical likelihood is an example of the nonparametric method of inference.
Faculty: Chul Moon
Non-probability sampling

Lynne Stokes directs a team of Ph.D. students working on two projects related to Gulf of Mexico fisheries. The first, funded through a contract with NOAA, is developing and evaluating new methods for estimating catch of recreational anglers. These methods augment data from traditional surveys of anglers with real-time electronic self-reports. These new methods are being considered as replacements or supplements for NOAA’s current data collection methods. The second project, The Great Red Snapper Count, is a two-year $10 million effort by a multi-disciplinary team of 21 researchers who will provide a fisheries independent estimate of the abundance of red snapper in the Gulf of Mexico. The SMU team provides statistical support for the project, which will require integrating a variety of data collection and estimation strategies across the Gulf.

Faculty: Lynne Stokes
Order statistics

An order statistic is the realized ranked value of a random variable in a sample. The study of order statistics can be useful in a range of problems, such as evaluating the reliability of a manufacturing system that depends on performance of many similar parts or the risk to a life insurance company for its portfolio of policies. Inference from order statistics can provide robust and cost-effective testing and estimation. An example of efficient estimation using the theory of order statistics is ranked set sampling.

Faculty: Xinlei Wang, Lynne Stokes, Chul Moon
Ranking and selection

Decision-makers are frequently confronted with the problem of selecting from among a set of possible choices. Ranking and selection addresses the problem of how to choose the best among a group of items, where the quality of those items is measured imperfectly. Another aspect of the problem that we have studied is how to assess the quality of the measures themselves; i.e., ranking the rankers. Our approaches have included various ways of modeling the evaluation process. Applications have been wide-ranging, from wine-tasting, to proposal evaluation, to diving scores.

Faculty: Jing Cao, Lynne Stokes, Monnie McGee
Real-time prediction in clinical trials

Clinical trial planning involves the specification of a projected duration of enrollment and follow-up needed to achieve the targeted study power. If pre-trial estimates of enrollment and event rates are inaccurate, projections can be faulty, leading to inadequate power or other mis-allocation of resources. We have developed an array of methods that use the accruing trial data to efficiently and correctly predict future enrollment counts, times of occurrence of landmark events, estimated final treatment effects, and ultimate significance of the trial.

Faculty: Daniel Heitjan
Recurrent event analysis

Recurrent event analyses have wide-ranging applications in biomedicine, public health, and engineering, among other fields, where subjects experience a sequence of events of interest during follow-up. However, simple survival methods that focus solely on the first event can overlook valuable information on subsequent events, leading to bias and potentially misleading results. Consequently, there has been considerable attention given to approaches that address the sequential nature of recurrent event times without information loss. Since recurrent events can be terminated by informative censoring or a terminal event, the use of frailty models to relax the assumption of conditional independent censoring and jointly model the recurrent event process and the terminal event has gained significant interest. We have developed a general scale-change joint model that encompasses the popular Cox-type model, the accelerated rate model, and the accelerated mean model as special cases, accommodating informative censoring through subject-specific frailty without any need for parametric specification.

Faculty: Steven Chiou
Saddlepoint approximations and higher-order asymptotic theory

Modern methods used in statistics and probability often require the compu- tation of probabilities from complicated models in which what is know is the underlying transform theory for the distributions of interest rather than their explicit expressions. It is in this context that saddlepoint methods aid in the computations of such probabilities. Of particular relevance are the majority of probability computations used in stochastic modeling. The companion subject of higher-order asymptotic theory provides tools for making more precise com- putations than those normally derived from using central limit theory as based on the theory of weak convergence.

Faculty: Ron Butler
Scalable Gaussian processes with applications to earth sciences.

Modern remote sensing technology has been producing an incredible amount of environmental data which helps to produce new insights into the mechanisms governing the Earth's ecosystem. Many of the popular tools used to model such data are based on Gaussian processes (GPs). This versatile and analytically tractable approach provides a natural way to quantify uncertainty yet often suffers from computational problems. In this line of research we explore new ways to exploit the favorable properties of GPs in the analysis of environmental data while making sure that they can be scaled to massive data sets.

Faculty: Marcin Jurek
Sports analytics: swimming, diving, track and field

Sports analytics is a big business, particularly in football, basketball, baseball, soccer, and hockey. McGee concentrates her sports analytics work on “individual-team sports”, such as track and field and swimming and diving, which individual performance is critical to team scores. Recent work showing that girls and boys do not tend to plateau in performance in running events during high school, as was previously believed, has been published in The American Statistician. An article showing that judges’ rankings of divers in a regional diving competition are compatible with measurements taken from videos of the same dives has been published in PLoS One. Data on scores from a diving competition for educational use have been published in Journal of Statistics and Data Science Education.

Faculty: Monnie McGee
Stochastic processes, feedback systems and networks

This subject involves the study and modeling of random phenomena over space and time with particular emphasis on how components of the systems interact to create the dynamics of stochastic phenomenon. Feedback processes and mechanisms are an integral part of this subject. Such models include Markov chains, semi-Markov processes, diffusion processes, and their underlying renewal theory. This subject body represents the majority of mathematical models used in the physical sciences, engineering sciences, and stochastic finance.

Faculty: Ron Butler