Statistical Learning and Data Science

Group leader

Interim Dean, School of Business, Social and Decision Sciences; Associate Professor of Statistics

Specific themes and goals

Prof. Adalbert Wilhelm’s research group has a long history of working with and advising experts from various research fields. We focus on incorporating domain knowledge into statistical modeling by developing visual representations of both the raw data and the models.

Ice sheets: We are currently undertaking research within MarData, the Helmholtz School for Marine Data Science. In one project, we are developing an algorithm for the automatic analysis and categorisation of radio-echo sounding data in order to detect ice layers in Antarctica and Greenland. Traditionally, this has mostly been done with a flawed semi-automatic method. By using automatic image processing, we aim to significantly improve the efficiency and accuracy of layer detection. A performance analysis of existing semi-automatic algorithms will show us how to develop hybrid algorithms that combine statistical properties and image processing. Furthermore, we will use the available manually annotated data to improve the evaluation of competing machine-learning approaches.
Phytoplankton: In another project, we are developing a complete data processing chain for combining various Phytoplankton Functional Type (PFT) datasets and their associated uncertainties at various spatial and temporal scales. Evaluating the distribution of phytoplankton in space and time is critical for assessing the impact of climate change on marine biogeochemistry and food webs. Researchers can detect and quantify phytoplankton abundance and composition with optical sensors carried by different platforms, such as ship-towed undulators, ship-based inline systems, or autonomous platforms such as satellites and profiling floats. Combining these disparate data sources remains a major difficulty because of their varying temporal and spatial resolution and insufficiently defined uncertainties.
Cocoa: In another research theme, we are combining econometric models and machine learning to investigate price formation on the futures and physical markets for cocoa. The project is run jointly with the RWI - Leibniz Institute for Economic Research in Essen.
Adaptive learning systems: We are also designing and implementing an adaptive learning system for statistics to reduce gaps in statistics education. With our system, we intend to offer an innovative tool to complement traditional courses in Statistics. It will offer personalized learning paths to students, particularly students from non-mathematical fields with a strong empirical focus, such as psychology, political science, humanities and social sciences.
Unsupervised learning: We are also developing basic methods in the area of unsupervised learning, in particular cluster analysis of mixed-type data. One of the few approaches to clustering mixed-type data, and probably the best known, is the k-prototypes algorithm. In this project, we aim to expand the scope of this clustering algorithm by addressing validation of the number of clusters, variable selection, imputation of incomplete data, and initialisation of the algorithm. By means of simulation studies on artificially generated data, we will gain a deeper understanding of mixed-type cluster analysis and be able to suggest improvements that make the algorithm more practical and, ultimately, enable targeted data analyses that were previously not possible with sufficient quality.
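For readers unfamiliar with clustering mixed-type data, the following minimal Python sketch illustrates the general idea behind the k-prototypes algorithm: each cluster prototype holds means for the numeric variables and modes for the categorical ones, and the distance adds a weighted count of categorical mismatches to the squared Euclidean distance. This is an illustration under stated assumptions, not the group's implementation: the function names, the weight gamma, the toy data, and the simplified batch update of the prototypes are all hypothetical choices made for this example.

```python
from collections import Counter


def mixed_distance(x_num, x_cat, proto_num, proto_cat, gamma):
    """Squared Euclidean distance on the numeric part plus
    gamma times the number of categorical mismatches."""
    num_part = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    cat_part = sum(a != b for a, b in zip(x_cat, proto_cat))
    return num_part + gamma * cat_part


def k_prototypes(num_rows, cat_rows, init_idx, gamma=1.0, max_iter=20):
    """Batch variant (assumption): assign all points, then update all prototypes."""
    # Initial prototypes are copies of the observations listed in init_idx.
    protos = [(list(num_rows[i]), list(cat_rows[i])) for i in init_idx]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each observation goes to its nearest prototype.
        new_labels = [
            min(
                range(len(protos)),
                key=lambda k: mixed_distance(xn, xc, protos[k][0], protos[k][1], gamma),
            )
            for xn, xc in zip(num_rows, cat_rows)
        ]
        if new_labels == labels:
            break  # assignments are stable, so stop
        labels = new_labels
        # Update step: means for numeric variables, modes for categorical ones.
        for k in range(len(protos)):
            members = [i for i, lab in enumerate(labels) if lab == k]
            if not members:
                continue  # keep the old prototype for an empty cluster
            means = [
                sum(num_rows[i][j] for i in members) / len(members)
                for j in range(len(num_rows[0]))
            ]
            modes = [
                Counter(cat_rows[i][j] for i in members).most_common(1)[0][0]
                for j in range(len(cat_rows[0]))
            ]
            protos[k] = (means, modes)
    return labels, protos


# Toy data (made up): two numeric variables and one categorical variable.
num_rows = [[1.0, 2.0], [1.2, 1.8], [8.0, 9.0], [7.5, 9.5]]
cat_rows = [["a"], ["a"], ["b"], ["b"]]
labels, protos = k_prototypes(num_rows, cat_rows, init_idx=[0, 2], gamma=1.0)
print(labels)
```

The result depends on both the weight gamma and the choice of starting observations, which is one reason why initialisation and validation of the number of clusters are natural topics for further study.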

Highlights and impact

In addition to its extensive external consulting activities, the group has co-organised a number of conference events dedicated to community building in the field of data science and visualization. These include the European Conference on Data Analysis (ECDA) 2019 (University of Bayreuth), DAGStat 2019 (LMU Munich), ECDA & DSSV 2021 (Erasmus University Rotterdam), DAGStat 2022 (University of Hamburg), and ECDA 2022 (University of Naples).

Group composition & projects/funding

The research group comprises five PhD students funded by different projects and sources, several master's and bachelor's students, and visiting PhD students (one per year) from the University of Cagliari, Italy. The group has received funding from the Erasmus+ Programme, the Volkswagen Foundation, and the Bundesministerium für Ernährung und Landwirtschaft (BMEL).

Selected publications

Müller, L. Lausser, A. Wilhelm, T. Ropinski, M. Platzer, H. Neumann, H.A. Kestler. A perceptually optimized bivariate visualization scheme for high-dimensional fold-change data, Advances in Data Analysis and Classification 15(1), 463–480.
S. Friedrich, G. Antes, S. Behr, H. Binder, W. Brannath, F. Dumpert, K. Ickstadt, H. A. Kestler, J. Lederer, H. Leitgöb, M. Pauly, A. Steland, A. Wilhelm, T. Friede. Is there a role for statistics in artificial intelligence? Advances in Data Analysis and Classification 1-24.
Aschenbruck R, Szepannek G (2020) Cluster validation for mixed-type data. Archives of Data Science, Series A 6(1):1–12, DOI 10.5445/KSP/1000098011/02
C Koopman, A Wilhelm. The Effect of Preprocessing on Short Document Clustering, Archives of Data Science 6 (1): 1-16.
Rizkallah MR, Frickenhaus S, Trimborn S, Harms L, Moustafa A, Benes V, Gäbler Schwarz S, Beszteri S. (2020) Deciphering patterns of adaptation and acclimation in the Transcriptome of Phaeocystis antarctica to Changing Iron Conditions