Dr. David B. Blumenthal

Chair of Experimental Bioinformatics
TUM School of Life Sciences Weihenstephan (WZW)
Technical University of Munich (TUM)

ORCID iD iconorcid.org/0000-0001-8651-750X
Office OLG-L 10
Maximus-von-Imhof-Forum 3
85354 Freising-Weihenstephan

About Me

I am a postdoctoral fellow at the Chair of Experimental Bioinformatics of the Technical University of Munich, where I am developing new algorithmic techniques for network biology and systems medicine. In July 2019, I defended my Ph.D. thesis entitled “New Techniques for Graph Edit Distance Computation” at the Free University of Bozen-Bolzano. Before moving to Bolzano, I studied mathematics and philosophy at the Freie Universität Berlin and at the Technische Universität Berlin. For more information, please download my CV or visit my profile on LinkedIn.


This is a complete list of my research publications. Also see my profiles on ResearchGate, Scopus, ORCiD, dblp, and Google Scholar. If you have problems accessing a paper you are interested in, feel free to contact me.

Abstract We discuss the trend towards using quantitative metrics for evaluating research. We claim that, rather than promoting meaningful research, purely metric-based research evaluation schemes potentially lead to a dystopian academic reality, leaving no space for creativity and intellectual initiative. After sketching what the future could look like if quantitative metrics are allowed to proliferate, we provide a more detailed discussion on why research is so difficult to evaluate and outline approaches for avoiding such a situation. In particular, we characterize meaningful research as an essentially contested concept and argue that quantitative metrics should always be accompanied by operationalized instructions for their proper use and continuously evaluated via feedback loops. Additionally, we analyze a dataset containing information about computer science publications and their citation history and indicate how quantitative metrics could potentially be calibrated via alternative evaluation methods such as test of time awards. Finally, we argue that, instead of over-relying on indicators, research environments should primarily be based on trust and personal responsibility.

doi | paper | bibtex

Abstract Coronavirus Disease-2019 (COVID-19) is an infectious disease caused by the SARS-CoV-2 virus. Various studies exist about the molecular mechanisms of viral infection. However, such information is spread across many publications and it is very time-consuming to integrate, and exploit. We develop CoVex, an interactive online platform for SARS-CoV-2 host interactome exploration and drug (target) identification. CoVex integrates virus-human protein interactions, human protein-protein interactions, and drug-target interactions. It allows visual exploration of the virus-host interactome and implements systems medicine algorithms for network-based prediction of drug candidates. Thus, CoVex is a resource to understand molecular mechanisms of pathogenicity and to prioritize candidate therapeutics. We investigate recent hypotheses on a systems biology level to explore mechanistic virus life cycle drivers, and to extract drug repurposing candidates. CoVex renders COVID-19 drug research systems-medicine-ready by giving the scientific community direct access to network medicine algorithms. It is available at https://exbio.wzw.tum.de/covex/.

doi | paper | bibtex


Summary: Simulated data is crucial for evaluating epistasis detection tools in genome-wide association studies. Existing simulators are limited, as they do not account for linkage disequilibrium (LD), support limited interaction models of single nucleotide polymorphisms (SNPs) and only dichotomous phenotypes, or depend on proprietary software. In contrast, EpiGEN supports SNP interactions of arbitrary order, produces realistic LD patterns, and can generate both categorical and quantitative phenotypes.

Availability and Implementation: EpiGEN is implemented in Python 3 and is freely available at https://github.com/baumbachlab/epigen.

Supplementary Information: A detailed user guide for EpiGEN is available at Bioinformatics online.

doi | paper | bibtex

Abstract In this paper, we investigate the computation of alternative paths between two locations in a road network. More specifically, we study the k-shortest paths with limited overlap (kSPwLO) problem that aims at finding a set of k paths such that all paths are sufficiently dissimilar to each other and as short as possible. To compute kSPwLO queries, we propose two exact algorithms, termed OnePass and MultiPass, and we formally prove that MultiPass is optimal in terms of complexity. We also study two classes of heuristic algorithms: (a) performance-oriented heuristic algorithms that trade shortness for performance, i.e., they reduce query processing time, but do not guarantee that the length of each subsequent result is minimum; and (b) completeness-oriented heuristic algorithms that trade dissimilarity for completeness, i.e., they relax the similarity constraint to return a result that contains exactly k paths. An extensive experimental analysis on real road networks demonstrates the efficiency of our proposed solutions in terms of runtime and quality of the result.

doi | bibtex

Abstract Because of its flexibility, intuitiveness, and expressivity, the graph edit distance (GED) is one of the most widely used distance measures for labeled graphs. Since exactly computing GED is NP-hard, over the past years, various heuristics have been proposed. They use techniques such as transformations to the linear sum assignment problem with error-correction, local search, and linear programming to approximate GED via upper or lower bounds. In this paper, we provide a systematic overview of the most important heuristics. Moreover, we empirically evaluate all compared heuristics within an integrated implementation.

doi | poster | bibtex

Abstract The graph edit distance (GED) measures the dissimilarity between two graphs as the minimal cost of a sequence of elementary operations transforming one graph into another. This measure is fundamental in many areas such as structural pattern recognition or classification. However, exactly computing GED is NP-hard. Among different classes of heuristic algorithms that were proposed to compute approximate solutions, local search based algorithms provide the tightest upper bounds for GED. In this paper, we present K-REFINE and RANDPOST. K-REFINE generalizes and improves an existing local search algorithm and performs particularly well on small graphs. RANDPOST is a general warm start framework that stochastically generates promising initial solutions to be used by any local search based GED algorithm. It is particularly efficient on large graphs. An extensive empirical evaluation demonstrates that both K-REFINE and RANDPOST perform excellently in practice.

doi | bibtex

Abstract The graph edit distance is a widely used distance measure for labelled graph. However, A*-GED, the standard approach for its exact computation, suffers from huge runtime and memory requirements. Recently, three better performing algorithms have been proposed: The general algorithms DF-GED and BIP-GED, and the algorithm CSI-GED, which only works for uniform edit costs. All newly proposed algorithms outperform the standard approach A*-GED. However, cross-comparisons are lacking. This paper consolidates and extends these recent advances. To this purpose, we present all existing algorithms in a unified way and show that the slightly different definitions of the graph edit distance underlying A*-GED and DF-GED, on the one side, and CSI-GED, on the other side, can be harmonised. This harmonisation allows us to develop a generalisation of CSI-GED to non-uniform edit cost. Moreover, we present a speed-up of A*-GED and DF-GED for uniform edit costs, which build upon the fact that, in the uniform case, a continuously used subroutine can be implemented to run in linear rather than cubic time. We also suggest an algorithm MIP-GED which builds upon a very compact new mixed integer linear programming formulation. Finally, we carry out a thorough empirical evaluation, which, for the first time, compares all existing exact algorithms.

doi | bibtex

Abstract We propose an algorithm that efficiently solves the linear sum assignment problem with error-correction and no cost constraints. This problem is encountered for instance in the approximation of the graph edit distance. The fastest currently available solvers for the linear sum assignment problem require the pairwise costs to respect the triangle inequality. Our algorithm is as fast as these algorithms, but manages to drop the cost constraint. The main technical ingredient of our algorithm is a cost-dependent factorization of the node substitutions.

doi | bibtex

Abstract Due to their capacity to encode rich structural information, labeled graphs are often used for modeling various kinds of objects such as images, molecules, and chemical compounds. If pattern recognition problems such as clustering and classification are to be solved on these domains, a (dis-)similarity measure for labeled graphs has to be defined. A widely used measure is the graph edit distance (GED), which, intuitively, is defined as the minimum amount of distortion that has to be applied to a source graph in order to transform it into a target graph. The main advantage of GED is its flexibility and sensitivity to small differences between the input graphs. Its main drawback is that it is hard to compute.

In this thesis, new results and techniques for several aspects of computing GED are presented. Firstly, theoretical aspects are discussed: competing definitions of GED are harmonized, the problem of computing GED is characterized in terms of complexity, and several reductions from GED to the quadratic assignment problem (QAP) are presented. Secondly, solvers for the linear sum assignment problem with error-correction (LSAPE) are discussed. LSAPE is a generalization of the well-known linear sum assignment problem (LSAP), and has to be solved as a subproblem by many GED algorithms. In particular, a new solver is presented that efficiently reduces LSAPE to LSAP. Thirdly, exact algorithms for computing GED are presented in a systematic way, and improvements of existing algorithms as well as a new mixed integer programming (MIP) based approach are introduced. Fourthly, a detailed overview of heuristic algorithms that approximate GED via upper and lower bounds is provided, and eight new heuristics are described. Finally, a new easily extensible C++ library for exactly or approximately computing GED is presented.

arXiv | bibtex

Abstract The graph edit distance (GED) is a flexible graph dissimilar- ity measure widely used within the structural pattern recognition field. In this paper, we present GEDLIB, a C++ library for exactly or approx- imately computing GED. Many existing algorithms for GED are already implemented in GEDLIB. Moreover, GEDLIB is designed to be easily ex- tensible: for implementing new edit cost functions and GED algorithms, it suffices to implement abstract classes contained in the library. For implementing these extensions, the user has access to a wide range of utilities, such as deep neural networks, support vector machines, mixed integer linear programming solvers, a blackbox optimizer, and solvers for the linear sum assignment problem with and without error-correction.

doi | slides | poster | bibtex

Abstract Shortest path computation is a fundamental problem in road networks. However, in many real-world scenarios, determining solely the shortest path is not enough. In this paper, we study the problem of finding k-Dissimilar Paths with Minimum Collective Length (kDPwML), which aims at computing a set of paths from a source s to a target t such that all paths are pairwise dissimilar by at least \theta and the sum of the path lengths is minimal. We introduce an exact algorithm for the kDPwML problem, which iterates over all possible s-t paths while employing two pruning techniques to reduce the prohibitively expensive computational cost. To achieve scalability, we also define the much smaller set of the simple single-via paths, and we adapt two algorithms for kDPwML queries to iterate over this set. Our experimental analysis on real road networks shows that iterating over all paths is impractical, while iterating over the set of simple single-via paths can lead to scalable solutions with only a small trade-off in the quality of the results.

doi | bibtex | extended version on arXiv

Abstract The graph edit distance (GED) is a widely used distance measure for attributed graphs. It has recently been shown that the problem of computing GED, which is a NP-hard optimization problem, can be formulated as a quadratic assignment problem (QAP). This formulation is useful, since it allows to derive well performing approximative heuristics for GED from existing techniques for QAP. In this paper, we focus on the case where the edit costs that underlie GED are quasimetric. This is the case in many applications of GED. We show that, for quasimetric edit costs, it is possible to reduce the size of the corresponding QAP formulation. An empirical evaluation shows that this reduction significantly speeds up the QAP-based approximative heuristics for GED.

doi | poster | bibtex

Abstract The graph edit distance (GED) is a flexible graph dissimilarity measure widely used within the structural pattern recognition field. A widely used paradigm for approximating GED is to define local structures rooted at the nodes of the input graphs and use these structures to transform the problem of computing GED into a linear sum assignment problem with error correction (LSAPE). In the literature, different local structures such as incident edges, walks of fixed length, and induced subgraphs of fixed radius have been proposed. In this paper, we propose to use rings as local structure, which are defined as collections of nodes and edges at fixed distances from the root node. We empirically show that this allows us to quickly compute a tight approximation of GED.

doi | slides | bibtex

Abstract The problem of deriving lower and upper bounds for the edit distance between undirected, labelled graphs has recently received increasing attention. However, only one algorithm has been proposed that allegedly computes not only an upper but also a lower bound for non-uniform edit costs and incorporates information about both node and edge labels. In this paper, we demonstrate that this algorithm is incorrect. We present a corrected version Branch that runs in O(n2Δ3+n3) time, where Δ is the maximum of the maximum degrees of input graphs G and H. We also develop a speed-up BranchFast that runs in O(n2Δ2+n3) time and computes an only slightly less accurate lower bound. The lower bounds produced by Branch and BranchFast are shown to be pseudo-metrics on a collection of graphs. Finally, we suggest an anytime algorithm BranchTight that iteratively improves Branch's lower bound. BranchTight runs in O(n3Δ2+I(n2Δ3+n3)) time, where the number of iterations I is controlled by the user. A detailed experimental evaluation shows that all suggested algorithms are Pareto optimal, that they are very effective when used as filters for edit distance range queries, and that they perform excellently when used within classification frameworks.

doi | bibtex

Abstract The graph edit distance is a well-established and widely used distance measure for labelled, undirected graphs. However, since its exact computation is NP-hard, research has mainly focused on devising approximative heuristics and only few exact algorithms have been proposed. The standard approach A*-GED, a node-based best-first search that works for both uniform and non-uniform metric edit costs, suffers from huge runtime and memory requirements. Recently, two better performing algorithms have been proposed: DF-GED, a node-based depth-first search that works for uniform and non-uniform metric edit costs, and CSI_GED, an edge-based depth-first search that works only for uniform edit costs. Our paper contains two contributions: First, we propose a speed-up DF-GEDu of DF-GED for uniform edit costs. Second, we develop a generalisation CSI_GEDnu of CSI_GED that also covers non-uniform metric edit cost. We empirically evaluate the proposed algorithms. The experiments show, i.a., that our speed-up DF-GEDu clearly outperforms DF-GED and that our generalisation CSI_GEDnu is the most versatile algorithm.

doi | slides | bibtex

Abstract The problem of deriving lower and upper bounds for the edit distance between labelled undirected graphs has recently received increasing attention. However, only one algorithm has been proposed that allegedly computes not only an upper but also a lower bound for non-uniform metric edit costs and incorporates information about both node and edge labels. In this paper, we show that this algorithm is incorrect in the sense that, in general, it does not compute a lower bound. We present Branch, a corrected version of the algorithm that runs in O(n5) time. We also develop a speed-up BranchFast that runs in O(n4) time and computes a lower bound, which is only slightly less accurate than the one computed by Branch. An experimental evaluation shows that Branch and BranchFast yield excellent runtime/accuracy-tradeoffs, as they outperform all existing competitors in terms of runtime or in terms of accuracy.

doi | poster | bibtex

Abstract Der Aufsatz zielt darauf ab, ein angemessenes Verständnis der Semantik von Äußerungen über fiktive Kontexte (kurz: AFKs) zu entwickeln. Der systematische Ausgangspunkt der Arbeit besteht dabei in einem pragmatistischen Inferentialismus à la Robert Brandom. Für diese theoretische Weichenstellung wird nicht gesondert argumentiert, sondern vielmehr geltend gemacht, dass aus ihr zwei Forderungen an eine angemessene Theorie der Semantik von AFKs ableitbar sind. Erstens muss eine solche Theorie den inferentiellen Beziehungen Rechnung tragen, in denen von AFKs gemachte Aussagen de facto stehen. Zweitens darf sie nur auf solche Gegenstände zurückgreifen, die insofern ontologisch unschuldig sind, als sie vollständig und individuierbar sind. Aus diesen Forderungen ergibt sich, dass klassische Theorien der Semantik von AFKs unbefriedigend sind: Weder können AFKs mit Bertrand Russell als Äußerungen über das inferentielle Potenzial bestimmter Mengen fiktionaler Medien betrachtet, noch mit Searle, van Inwagen oder Parsons als Äußerungen über fiktive Gegenstände verstanden werden. Im Anschluss an diese kritische Auseinandersetzung wird ein eigener Vorschlag entwickelt, dessen Kerngedanken darin bestehen, AFKs als Äußerungen über bestimmte Werke zu betrachten, die wesentlich auf Interpretation beruhen. Dabei werden Werke als Äquivalenzklassen fiktionaler Medien verstanden, die logische Feinstruktur von AFKs gemachter Aussagen nach dem Modell von De-dicto-Zuschreibungen erläutert und deren Funktionsweise wiederum inferentialistisch gefasst.

pdf | bibtex

2020, Technical University of Munich
Einführung in die Bioinformatik II, labs, B.Sc. in Bioinformatics.
2017/2018, Free University of Bozen-Bolzano
Mathematical Methods for Experimental Science, labs, B.Sc. in Computer Science.
2016/2017, Free University of Bozen-Bolzano
Data structures and Algorithms, labs, B.Sc. in Computer Science.
2015/2016, Free University of Bozen-Bolzano
Calculus, tutorials, B.Sc. in Computer Science.

For more detailed project descriptions, please visit my profile on GitHub.

CoVeX (Coronavirus Explorer) is a unique online network and systems medicine platform for data analysis that integrates virus-human interactions for SARS-CoV-2 and SARS-CoV-1. It implements different network-based approaches for the identification of new drug targets and new repurposable drugs. For detailed information please visit our blog.

GEDLIB is an easily extensible C++ library for (suboptimally) computing the graph edit distance (GED) between two labeled graphs. GED is defined as the minimum cost of a sequence of elementary edit operations that transforms one graph into another. There are six elementary edit operations: changing the label of an existing node, inserting an isolated node, deleting an isolated node, changing the label of an existing edge, inserting an an edge between two existing nodes, and deleting an edge. GEDLIB implements several state of the art methods for computing GED and comes with predefined edit costs for some benchmark datasets. Further methods and edit costs can easily be implemented by the user.

EpiGEN is an easy-to-use epistasis simulation pipeline written in Python. It supports epistasis models of arbitrary size, which can be specified either extensionally or via parametrized risk models. Moreover, the user can specify the minor allele frequencies (MAFs) of both noise and disease SNPs, and provide a bias target distribution for the generated phenotypes to simulate observation bias.

Awards & Funding
2015 - 2019
Ph.D. grant from the Faculty of Computer Science of the Free University of Bozen-Bolzano.
Best paper award for “Exact computation of graph edit distance for uniform and non-uniform metric edit costs” at GbRPR 2017.
M.Sc. award from the Institue of Mathematics of the Technische Universität Berlin.
2007 - 2015
Scholarship from the Studienstiftung des deutschen Volkes.
PC Member
HCC 2019.
Information Sciences (2020), European Journal of Operational Research (2020), Pattern Recognition Letters (2018, 2019), Information Systems Frontiers (2019), ICPR 2018.
External Reviewer
VLDB 2017, EDBT 2017, ICDE 2016.