We define two new computational problems in the domain of perfect genome rearrangements, and propose three algorithms to solve them. The rearrangement scenarios modeled by the problems consider Reversal and Block Interchange operations, and a PQ-tree is utilized to guide the allowed operations and to compute their weights. In the first problem, [Formula: see text] ([Formula: see text]), we define the basic structure-informed rearrangement measure. Here, we assume that the gene order members of the gene cluster from which the PQ-tree is constructed are permutations. The PQ-tree representing the gene cluster is ordered such that the series of gene IDs spelled by its leaves is equivalent to that of the reference gene order. Then, a structure-informed genome rearrangement distance is computed between the ordered PQ-tree and the target gene order. The second problem, [Formula: see text] ([Formula: see text]), generalizes [Formula: see text], where the gene order members are not necessarily permutations and the structure informed rearrangement measure is extended to also consider up to [Formula: see text] and [Formula: see text] gene insertion and deletion operations, respectively, when modelling the PQ-tree informed divergence process from the reference gene order to the target gene order. The first algorithm solves [Formula: see text] in [Formula: see text] time and [Formula: see text] space, where [Formula: see text] is the maximum number of children of a node, n is the length of the string and the number of leaves in the tree, and [Formula: see text] and [Formula: see text] are the number of P-nodes and Q-nodes in the tree, respectively. If one of the penalties of [Formula: see text] is 0, then the algorithm runs in [Formula: see text] time and [Formula: see text] space. The second algorithm solves [Formula: see text] in [Formula: see text] time and [Formula: see text] space, where [Formula: see text] is the maximum number of children of a node, n is the length of the string, m is the number of leaves in the tree, [Formula: see text] and [Formula: see text] are the number of P-nodes and Q-nodes in the tree, respectively, and allowing up to [Formula: see text] deletions from the tree and up to [Formula: see text] deletions from the string. The third algorithm is intended to reduce the space complexity of the second algorithm. It solves a variant of the problem (where one of the penalties of [Formula: see text] is 0) in [Formula: see text] time and [Formula: see text] space. The algorithm is implemented as a software tool, denoted MEM-Rearrange, and applied to the comparative and evolutionary analysis of 59 chromosomal gene clusters extracted from a dataset of 1487 prokaryotic genomes.
Two genomes [Formula: see text] and [Formula: see text] over the same set of gene families form a canonical pair when each of them has exactly one gene from each family. Denote by [Formula: see text] the number of common families of [Formula: see text] and [Formula: see text]. Different distances of canonical genomes can be derived from a structure called breakpoint graph, which represents the relation between the two given genomes as a collection of cycles of even length and paths. Let [Formula: see text] and [Formula: see text] be respectively the numbers of cycles of length i and of paths of length j in the breakpoint graph of genomes [Formula: see text] and [Formula: see text]. Then, the breakpoint distance of [Formula: see text] and [Formula: see text] is equal to [Formula: see text]. Similarly, when the considered rearrangements are those modeled by the double-cut-and-join (DCJ) operation, the rearrangement distance of [Formula: see text] and [Formula: see text] is [Formula: see text], where c is the total number of cycles and [Formula: see text] is the total number of paths of even length. The distance formulation is a basic unit for several other combinatorial problems related to genome evolution and ancestral reconstruction, such as median or double distance. Interestingly, both median and double distance problems can be solved in polynomial time for the breakpoint distance, while they are NP-hard for the rearrangement distance. One way of exploring the complexity space between these two extremes is to consider a [Formula: see text] distance, defined to be [Formula: see text], and increasingly investigate the complexities of median and double distance for the [Formula: see text] distance, then the [Formula: see text] distance, and so on. While for the median much effort was done in our and in other research groups but no progress was obtained even for the [Formula: see text] distance, for solving the double distance under [Formula: see text] and [Formula: see text] distances we could devise linear time algorithms, which we present here.
Although RNA secondary structure prediction is a textbook application of dynamic programming (DP) and routine task in RNA structure analysis, it remains challenging whenever pseudoknots come into play. Since the prediction of pseudoknotted structures by minimizing (realistically modelled) energy is NP-hard, specialized algorithms have been proposed for restricted conformation classes that capture the most frequently observed configurations. To achieve good performance, these methods rely on specific and carefully hand-crafted DP schemes. In contrast, we generalize and fully automatize the design of DP pseudoknot prediction algorithms. For this purpose, we formalize the problem of designing DP algorithms for an (infinite) class of conformations, modeled by (a finite number of) fatgraphs, and automatically build DP schemes minimizing their algorithmic complexity. We propose an algorithm for the problem, based on the tree-decomposition of a well-chosen representative structure, which we simplify and reinterpret as a DP scheme. The algorithm is fixed-parameter tractable for the treewidth tw of the fatgraph, and its output represents a [Formula: see text] algorithm (and even possibly [Formula: see text] in simple energy models) for predicting the MFE folding of an RNA of length n. We demonstrate, for the most common pseudoknot classes, that our automatically generated algorithms achieve the same complexities as reported in the literature for hand-crafted schemes. Our framework supports general energy models, partition function computations, recursive substructures and partial folding, and could pave the way for algebraic dynamic programming beyond the context-free case.
RNAs composed of Triplet Repeats (TR) have recently attracted much attention in the field of synthetic biology. We study the mimimum free energy (MFE) secondary structures of such RNAs and give improved algorithms to compute the MFE and the partition function. Furthermore, we study the interaction of multiple RNAs and design a new algorithm for computing MFE and partition function for RNA-RNA interactions, improving the previously known factorial running time to exponential. In the case of TR, we show computational hardness but still obtain a parameterized algorithm. Finally, we propose a polynomial-time algorithm for computing interactions from a base set of RNA strands and conduct experiments on the interaction of TR based on this algorithm. For instance, we study the probability that a base pair is formed between two strands with the same triplet pattern, allowing an assessment of a notion of orthogonality between TR.
Atom-to-atom maps play an important role in many applications. However, they are often difficult to obtain. The KEGG reaction database does not provide atom-to-atom maps for its reactions and instead offers a description of local changes for pairs of reactant and product molecules in terms of so-called RCLASSes. Developed for classification purposes, RCLASS data are difficult to use for purposes such as the construction of atom-to-atom maps or reaction rules. DPO graph transformation rules, on the other hand, work as a convenient and efficient representation, particularly for these applications. The RCLASS data can be understood as collections of local graph patterns in the reactants and products of a reaction, together with partial correspondences of atoms. The problem of converting RCLASS data into DPO rules, therefore, is a special case of the graph reconstruction problem, which consists of inferring a graph from a collection of subgraphs. We developed laveau, a tool that computes explicit DPO rules from KEGG reactions and RCLASS data. The algorithm proceeds stepwise, starting with a translation of individual RDM codes, specifically developed by the KEGG database, into equivalent RDM pattern graphs. Multiple RDM pattern graphs for the same RCLASS are then combined based on their embeddings into the reactant and product molecules, observing certain consistency conditions. In the final step, these combined pairwise patterns are merged into a pair of subgraphs of reactants and products, respectively. If RCLASSes connecting all pairs of reactant and product molecules are available, the complete reaction center(s) is/are contained in the union of these subgraphs. The atom-to-atom map inherited from the RDM codes then defines a DPO transformation rule. Application of these rules to the reactants then yields complete atom-to-atom maps (AAMs). Starting from 3195 RCLASSes, laveau generates a total of 1232 DPO rules and 1594 AAMs. The laveau software makes it possible to extract local atom-to-atom maps from the RCLASSes of the KEGG database, covering a large set of enzyme-catalyzed reactions. The results are made available in the form of DPO rules for use in atom-level models of metabolic networks, filling a crucial gap in the available data.
Many bioinformatics problems can be approached as optimization or controlled sampling tasks, and solved exactly and efficiently using Dynamic Programming (DP). However, such exact methods are typically tailored towards specific settings, complex to develop, and hard to implement and adapt to problem variations. We introduce the Infrared framework to overcome such hindrances for a large class of problems. Its underlying paradigm is tailored toward problems that can be declaratively formalized as sparse feature networks, a generalization of constraint networks. Classic Boolean constraints specify a search space, consisting of putative solutions whose evaluation is performed through a combination of features. Problems are then solved using generic cluster tree elimination algorithms over a tree decomposition of the feature network. Their overall complexities are linear on the number of variables, and only exponential in the treewidth of the feature network. For sparse feature networks, associated with low to moderate treewidths, these algorithms allow to find optimal solutions, or generate controlled samples, with practical empirical efficiency. Implementing these methods, the Infrared software allows Python programmers to rapidly develop exact optimization and sampling applications based on a tree decomposition-based efficient processing. Instead of directly coding specialized algorithms, problems are declaratively modeled as sets of variables over finite domains, whose dependencies are captured by constraints and functions. Such models are then automatically solved by generic DP algorithms. To illustrate the applicability of Infrared in bioinformatics and guide new users, we model and discuss variants of bioinformatics applications. We provide reimplementations and extensions of methods for RNA design, RNA sequence-structure alignment, parsimony-driven inference of ancestral traits in phylogenetic trees/networks, and design of coding sequences. Moreover, we demonstrate multidimensional Boltzmann sampling. These applications of the framework-together with our novel results-underline the practical relevance of Infrared. Remarkably, the achieved complexities are typically equivalent to the ones of specialized algorithms and implementations. Infrared is available at https://amibio.gitlabpages.inria.fr/Infrared with extensive documentation, including various usage examples and API reference; it can be installed using Conda or from source.
The graph traversal edit distance (GTED), introduced by Ebrahimpour Boroojeny et al. (2018), is an elegant distance measure defined as the minimum edit distance between strings reconstructed from Eulerian trails in two edge-labeled graphs. GTED can be used to infer evolutionary relationships between species by comparing de Bruijn graphs directly without the computationally costly and error-prone process of genome assembly. Ebrahimpour Boroojeny et al. (2018) propose two ILP formulations for GTED and claim that GTED is polynomially solvable because the linear programming relaxation of one of the ILPs always yields optimal integer solutions. The claim that GTED is polynomially solvable is contradictory to the complexity results of existing string-to-graph matching problems. We resolve this conflict in complexity results by proving that GTED is NP-complete and showing that the ILPs proposed by Ebrahimpour Boroojeny et al. do not solve GTED but instead solve for a lower bound of GTED and are not solvable in polynomial time. In addition, we provide the first two, correct ILP formulations of GTED and evaluate their empirical efficiency. These results provide solid algorithmic foundations for comparing genome graphs and point to the direction of heuristics. The source code to reproduce experimental results is available at https://github.com/Kingsford-Group/gtednewilp/ .
The C -ORIENTATION problem asks whether it is possible to orient an undirected graph to a directed phylogenetic network of a desired network class C . This problem arises, for example, when visualising evolutionary data, as popular methods such as Neighbor-Net are distance-based and inevitably produce undirected graphs. The complexity of C -ORIENTATION remains open for many classes C , including binary tree-child networks, and practical methods are still lacking. In this paper, we propose (1) an exact FPT algorithm for C -ORIENTATION, applicable to any class C admitting a tractable membership test, and parameterised by the reticulation number and the maximum size of minimal basic cycles, and (2) a very fast heuristic for TREE-CHILD ORIENTATION. While the state-of-the-art for C -ORIENTATION is a simple exponential time algorithm whose computational bottleneck lies in searching for appropriate reticulation vertex placements, our methods significantly reduce this search space. Experiments show that, although our FPT algorithm is still exponential, it significantly outperforms the existing method. The heuristic runs even faster but with increasing false negatives as the reticulation number grows. Given this trade-off, we also discuss theoretical directions for improvement and biological applicability of the heuristic approach.
In the Nearest-Neighbor Thermodynamic Model, a standard approach for RNA secondary structure prediction, the energy of the multiloops is modeled using a linear entropic penalty governed by three branching parameters. Although these parameters are typically fixed, recent work has shown that reparametrizing the multiloop score and considering alternative branching conformations can lead to significantly better structure predictions. However, prior approaches for exploring the alternative branching structures were computationally inefficient for long sequences. We present a novel algorithm that partitions the parameter space, identifying all distinct branching structures (optimal under different branching parameters) for a given RNA sequence using the fewest possible minimum free energy computations. Our method efficiently computes the full parameter-space partition and the associated optimal structures, enabling a comprehensive evaluation of the structural landscape across parameter choices. We apply this algorithm to the Archive II benchmarking dataset, assessing the maximum attainable prediction accuracy for each sequence under the reparameterized multiloop model. We find that the potential for improvement over default predictions is substantial in many cases, and that the optimal prediction accuracy is highly sensitive to auxiliary modeling decisions, such as the treatment of lonely base pairs and dangling ends. Our results support the hypothesis that the conventional choice of multiloop parameters may limit prediction accuracy and that exploring alternative parameterizations is both tractable and worthwhile. The efficient partitioning algorithm we introduce makes this exploration feasible for longer sequences and larger datasets. Furthermore, we identify several open challenges in identifying the optimal structure.
The last decade of phylogenetics has seen the development of many methods that leverage constraints plus dynamic programming. The goal of this algorithmic technique is to produce a phylogeny that is optimal with respect to some objective function and that lies within a constrained version of tree space. The popular species tree estimation method ASTRAL, for example, returns a tree that (1) maximizes the quartet score computed with respect to the input gene trees and that (2) draws its branches (bipartitions) from the input constraint set. This technique has yet to be used for parsimony problems where the input are binary characters, sometimes with missing values. Here, we introduce the clade-constrained character parsimony problem and present an algorithm that solves this problem for the Dollo criterion score in [Formula: see text] time, where n is the number of leaves, k is the number of characters, and [Formula: see text] is the set of clades used as constraints. Dollo parsimony, which requires traits/mutations to be gained at most once but allows them to be lost any number of times, is widely used for tumor phylogenetics as well as species phylogenetics, for example analyses of low-homoplasy retroelement insertions across the vertebrate tree of life. This motivated us to implement our algorithm in a software package, called Dollo-CDP, and evaluate its utility for analyzing retroelement insertion presence / absence patterns for bats, birds, toothed whales as well as simulated data. Our results show that Dollo-CDP can improve upon heuristic search from a single starting tree, often recovering a better scoring tree. Moreover, Dollo-CDP scales to data sets with much larger numbers of taxa than branch-and-bound while still having an optimality guarantee, albeit a more restricted one. Lastly, we show that our algorithm for Dollo parsimony can easily be adapted to Camin-Sokal parsimony but not Fitch parsimony.
Evolutionary scenarios describing the evolution of a family of genes within a collection of species comprise the mapping of the vertices of a gene tree T to vertices and edges of a species tree S. The relative timing of the last common ancestors of two extant genes (leaves of T) and the last common ancestors of the two species (leaves of S) in which they reside is indicative of horizontal gene transfers (HGT) and ancient duplications. Orthologous gene pairs, on the other hand, require that their last common ancestors coincides with a corresponding speciation event. The relative timing information of gene and species divergences is captured by three colored graphs that have the extant genes as vertices and the species in which the genes are found as vertex colors: the equal-divergence-time (EDT) graph, the later-divergence-time (LDT) graph and the prior-divergence-time (PDT) graph, which together form an edge partition of the complete graph. Here we give a complete characterization in terms of informative and forbidden triples that can be read off the three graphs and provide a polynomial time algorithm for constructing an evolutionary scenario that explains the graphs, provided such a scenario exists. While both LDT and PDT graphs are cographs, this is not true for the EDT graph in general. We show that every EDT graph is perfect. While the information about LDT and PDT graphs is necessary to recognize EDT graphs in polynomial-time for general scenarios, this extra information can be dropped in the HGT-free case. However, recognition of EDT graphs without knowledge of putative LDT and PDT graphs is NP-complete for general scenarios. In contrast, PDT graphs can be recognized in polynomial-time. We finally connect the EDT graph to the alternative definitions of orthology that have been proposed for scenarios with horizontal gene transfer. With one exception, the corresponding graphs are shown to be colored cographs.
Inverse folding is a classic instance of negative RNA design which consists in finding a sequence that uniquely folds into a target secondary structure with respect to energy minimization. A breakthrough result of Bonnet et al. shows that, even in simple base pairs-based (BP) models, the decision version of a mildly constrained version of inverse folding is NP-hard. In this work, we show that inverse folding can be solved in linear time for a large collection of targets, including every structure that contains no isolated BP and no isolated stack (or, equivalently, when all helices consist of 3 + base pairs). For structures featuring shorter helices, our linear algorithm is no longer guaranteed to produce a solution, but still does so for a large proportion of instances. Our approach introduces a notion of modulo m-separability, generalizing a property pioneered by Hales et al. Separability is a sufficient condition for the existence of a solution to the inverse folding problem. We show that, for any input secondary structure of length n, a modulo m-separated sequence can be produced in time O ( n m 2 m ) anytime such a sequence exists. Meanwhile, we show that any structure consisting of 3 + base pairs is either trivially non-designable, or always admits a modulo-2 separated solution. Solution sequences can thus be produced in linear time, and even be uniformly generated within the set of modulo-2 separable sequences.
Scaffolding is an intermediate stage of fragment assembly. It consists in orienting and ordering the contigs obtained by the assembly of the sequencing reads. In the general case, the problem has been largely studied with the use of distances data between the contigs. Here we focus on a dedicated scaffolding for the chloroplast genomes. As these genomes are small, circular and with few specific repeats, numerous approaches have been proposed to assemble them. However, their specificities have not been sufficiently exploited. We give a new formulation for the scaffolding in the case of chloroplast genomes as a discrete optimisation problem, that we prove the decision version to be [Formula: see text]-Complete. We take advantage of the knowledge of chloroplast genomes and succeed in expressing the relationships between a few specific genomic repeats in mathematical constraints. Our approach is independent of the distances and adopts a genomic regions view, with the priority on scaffolding the repeats first. In this way, we encode the structural haplotype issue in order to retrieve several genome forms that coexist in the same chloroplast cell. To solve exactly the optimisation problem, we develop an integer linear program that we implement in Python3 package khloraascaf. We test it on synthetic data to investigate its performance behaviour and its robustness against several chosen difficulties. We succeed to model biological knowledge on genomic structures to scaffold chloroplast genomes. Our results suggest that modelling genomic regions is sufficient for scaffolding repeats and is suitable for finding several solutions corresponding to several genome forms.
Orthology inference lies at the foundation of comparative genomics research. The correct identification of loci which descended from a common ancestral sequence is not only complicated by sequence divergence but also duplication and other genome rearrangements. The conservation of gene order, i.e. synteny, is used in conjunction with sequence similarity as an additional factor for orthology determination. Current approaches, however, rely on genome annotations and are therefore limited. Here we present an annotation-free approach and compare it to synteny analysis with annotations. We find that our approach works better in closely related genomes whereas there is a better performance with annotations for more distantly related genomes. Overall, the presented algorithm offers a useful alternative to annotation-based methods and can outperform them in many cases.
The problem of sequence identification or matching-determining the subset of reference sequences from a given collection that are likely to contain a short, queried nucleotide sequence-is relevant for many important tasks in Computational Biology, such as metagenomics and pangenome analysis. Due to the complex nature of such analyses and the large scale of the reference collections a resource-efficient solution to this problem is of utmost importance. This poses the threefold challenge of representing the reference collection with a data structure that is efficient to query, has light memory usage, and scales well to large collections. To solve this problem, we describe an efficient colored de Bruijn graph index, arising as the combination of a k-mer dictionary with a compressed inverted index. The proposed index takes full advantage of the fact that unitigs in the colored compacted de Bruijn graph are monochromatic (i.e., all k-mers in a unitig have the same set of references of origin, or color). Specifically, the unitigs are kept in the dictionary in color order, thereby allowing for the encoding of the map from k-mers to their colors in as little as 1 + o(1) bits per unitig. Hence, one color per unitig is stored in the index with almost no space/time overhead. By combining this property with simple but effective compression methods for integer lists, the index achieves very small space. We implement these methods in a tool called Fulgor, and conduct an extensive experimental analysis to demonstrate the improvement of our tool over previous solutions. For example, compared to Themisto-the strongest competitor in terms of index space vs. query time trade-off-Fulgor requires significantly less space (up to 43% less space for a collection of 150,000 Salmonella enterica genomes), is at least twice as fast for color queries, and is 2-6[Formula: see text] faster to construct.
In this study, we investigate the problem of comparing gene trees reconciled with the same species tree using a novel semi-metric, called the Path-Label Reconciliation (PLR) dissimilarity measure. This approach not only quantifies differences in the topology of reconciled gene trees, but also considers discrepancies in predicted ancestral gene-species maps and speciation/duplication events, offering a refinement of existing metrics such as Robinson-Foulds (RF) and their labeled extensions LRF and ELRF. A tunable parameter α also allows users to adjust the balance between its species map and event labeling components. We show that PLR can be computed in linear time and that it is a semi-metric. We also discuss the diameters of reconciled gene tree measures, which are important in practice for normalization, and provide initial bounds on PLR, LRF, and ELRF. To validate PLR, we simulate reconciliations and perform comparisons with LRF and ELRF. The results show that PLR provides a more evenly distributed range of distances, making it less susceptible to overestimating differences in the presence of small topological changes, while at the same time being computationally efficient. We also apply our measure to evaluate the set of possible rootings of gene trees against a gold standard, and demonstrate that our measure is better at distinguishing one best gene tree among multiple candidates. Furthermore, our findings suggest that the theoretical diameter is rarely reached in practice. The PLR measure advances phylogenetic reconciliation by combining theoretical rigor with practical applicability. Future research will refine its mathematical properties, explore its performance on different types of trees, and integrate it with existing bioinformatics tools for large-scale evolutionary analyses. The implementation of the PLR distance is available in the open-source PyPI package parle: https://pypi.org/project/parle/ .
The increasing number and volume of genomic and metagenomic data necessitates scalable and robust computational models for precise analysis. Sketching techniques utilizing k -mers from a biological sample have proven to be useful for large-scale analyses. In recent years, FracMinHash has emerged as a popular sketching technique and has been used in several useful applications. Recent studies on FracMinHash proved unbiased estimators for the containment and Jaccard indices. However, theoretical investigations for other metrics are still lacking. In this paper, we present a theoretical framework for estimating similarity/distance metrics by using FracMinHash sketches, when the metric is expressible in a certain form. We establish conditions under which such an estimation is sound and recommend a minimum scale factor s for accurate results. Experimental evidence supports our theoretical findings. We also present frac-kmc, a fast and efficient FracMinHash sketch generator program. frac-kmc is the fastest known FracMinHash sketch generator, delivering accurate and precise results for cosine similarity estimation on real data. frac-kmc is also the first parallel tool for this task, allowing for speeding up sketch generation using multiple CPU cores - an option lacking in existing serialized tools. We show that by computing FracMinHash sketches using frac-kmc, we can estimate pairwise similarity speedily and accurately on real data. frac-kmc is freely available here: https://github.com/KoslickiLab/frac-kmc/.
Pangenomes are growing in number and size, thanks to the prevalence of high-quality long-read assemblies. However, current methods for studying sequence composition and conservation within pangenomes have limitations. Methods based on graph pangenomes require a computationally expensive multiple-alignment step, which can leave out some variation. Indexes based on k-mers and de Bruijn graphs are limited to answering questions at a specific substring length k. We present Maximal Exact Match Ordered (MEMO), a pangenome indexing method based on maximal exact matches (MEMs) between sequences. A single MEMO index can handle arbitrary-length queries over pangenomic windows. MEMO enables both queries that test k-mer presence/absence (membership queries) and that count the number of genomes containing k-mers in a window (conservation queries). MEMO's index for a pangenome of 89 human autosomal haplotypes fits in 2.04 GB, 8.8 × smaller than a comparable KMC3 index and 11.4 × smaller than a PanKmer index. MEMO indexes can be made smaller by sacrificing some counting resolution, with our decile-resolution HPRC index reaching 0.67 GB. MEMO can conduct a conservation query for 31-mers over the human leukocyte antigen locus in 13.89 s, 2.5 × faster than other approaches. MEMO's small index size, lack of k-mer length dependence, and efficient queries make it a flexible tool for studying and visualizing substring conservation in pangenomes.
Metabolic networks are complex systems that describe the biochemical reactions within an organism through pairwise interactions between chemical compounds. While this representation is widely used to study biological function, it fails to capture the full structure of metabolic reactions, many of which involve more than two compounds. Hypergraphs offer a more natural representation, where nodes represent metabolites and hyperedges represent reactions involving multiple participants. Clustering such metabolic hypergraphs can reveal systematic differences among evolutionarily distinct organisms, providing insight into ecological constraints and evolutionary pressures. In this study, we investigate how different graphs and hypergraphs embedding methods influence their unsupervised clustering, with the goal of capturing taxonomy-based classes. We apply 14 distinct embedding strategies to a large-scale dataset of 8467 metabolic hypergraphs. Each embedding was followed by hierarchical clustering using a fixed linkage method. To assess performance, we compared the resulting clusters against known taxonomic groupings. Our findings show that the choice of hypergraph embedding has a significant effect on clustering outcomes. Among the tested methods, Bag of Hyperedges with Jaccard distance, Histogram Cosine Kernel, and a Hypergraph Auto-Encoder consistently performed best. We also advocate that the embedding method should be chosen based on the goal of the downstream task.
Genome rearrangements are events where large blocks of DNA exchange places during evolution. The analysis of these events is a promising tool for understanding evolutionary genomics, providing data for phylogenetic reconstruction based on genome rearrangement measures. Many pairwise rearrangement distances have been proposed, based on finding the minimum number of rearrangement events to transform one genome into the other, using some predefined operation. When more than two genomes are considered, we have the more challenging problem of rearrangement-based phylogeny reconstruction. Given a set of genomes and a distance notion, there are at least two natural ways to define the "target" genome. On the one hand, finding a genome that minimizes the sum of the distances from this to any other, called the median genome. On the other hand, finding a genome that minimizes the maximum distance to any other, called the closest genome. Considering genomes as permutations of distinct integers, some distance metrics have been extensively studied. We investigate the median and closest problems on permutations over the following metrics: breakpoint distance, swap distance, block-interchange distance, short-block-move distance, and transposition distance. In biological applications some values are usually very small, such as the solution value d or the number k of input permutations. For each of these metrics and parameters d or k, we analyze the closest and the median problems from the viewpoint of parameterized complexity. We obtain the following results: NP-hardness for finding the median/closest permutation regarding some metrics of distance, even for only k = 3 permutations; Polynomial kernels for the problems of finding the median permutation of all studied metrics, considering the target distance d as parameter; NP-hardness result for finding the closest permutation by short-block-moves; FPT algorithms and infeasibility of polynomial kernels for finding the closest permutation for some metrics when parameterized by the target distance d.