The Apache Software Foundation (ASF) ecosystem underpins a vast portion of modern software infrastructure, powering widely used components such as Log4j, Tomcat, and Struts. However, the ubiquity of these libraries has made them prime targets for high-impact security vulnerabilities, as illustrated by incidents like Log4Shell. Despite their widespread adoption, Apache projects are not immune to recurring and severe security weaknesses. We conduct a historical analysis of the Apache ecosystem to follow the "breadcrumb trail of vulnerabilities" by compiling a comprehensive dataset of Common Vulnerabilities and Exposures (CVEs) and Common Weakness Enumerations (CWEs). We examine trends in exploit recurrence, disclosure timelines, and remediation practices. Our analysis is guided by four key research questions: (1) What are the most persistent and repeated CWEs in Apache libraries? (2) How long do CVEs persist before being addressed? (3) What is the delay between CVE introduction and official disclosure? and (4) How long after disclosure are CVEs remediated? We present a detailed timeline of vulnerability lifecycle stages across Apache libraries and offer insights to improve secure cod
During the study, the results of a comparative analysis of the process of handling large datasets using the Apache Spark platform in Java, Python, and Scala programming languages were obtained. Although prior works have focused on individual stages, comprehensive comparisons of full ETL workflows across programming languages using Apache Iceberg remain limited. The analysis was performed by executing several operations, including downloading data from CSV files, transforming and loading it into an Apache Iceberg analytical table. It was found that the performance of the Spark algorithm varies significantly depending on the amount of data and the programming language used. When processing a 5-megabyte CSV file, the best result was achieved in Python: 6.71 seconds, which is superior to Scala's score of 9.13 seconds and Java's time of 9.62 seconds. For processing a large CSV file of 1.6 gigabytes, all programming languages demonstrated similar results: the fastest performance was showed in Python: 46.34 seconds, while Scala and Java showed results of 47.72 and 50.56 seconds, respectively. When performing a more complex operation that involved combining two CSV files into a single data
This paper describes a distributed implementation of Apache Arrow that can leverage cluster-shared load-store addressable memory that is hardware-coherent only within each node. The implementation is built on the ThymesisFlow prototype that leverages the OpenCAPI interface to create a shared address space across a cluster. While Apache Arrow structures are immutable, simplifying their use in a cluster shared memory, this paper creates distributed Apache Arrow tables and makes them accessible in each node.
On July 1st 2025 the third interstellar object, 3I/ATLAS or C/2025 N1 (ATLAS), was discovered, with an eccentricity of $e=6.15 \pm 0.01$ and perihelion of $q=1.357\pm0.001$ au. We report our initial visible to near-infrared (420-1000 nm) spectrophotometry of 3I/ATLAS using both the Palomar 200 inch telescope and Apache Point Observatory. We measure 3I/ATLAS to have a red spectral slope of 19 %/100 nm in the 420-700 nm range, and a more neutral 6 %/100 nm slope over 700-1000 nm. We detect no notable emission features such as from C$_2$.
In issue tracking systems, each bug is assigned a priority level (e.g., Blocker, Critical, Major, Minor, or Trivial in JIRA from highest to lowest), which indicates the urgency level of the bug. In this sense, understanding bug priority changes helps to arrange the work schedule of participants reasonably, and facilitates a better analysis and resolution of bugs. According to the data extracted from JIRA deployed by Apache, a proportion of bugs in each project underwent priority changes after such bugs were reported, which brings uncertainty to the bug fixing process. However, there is a lack of indepth investigation on the phenomenon of bug priority changes, which may negatively impact the bug fixing process. Thus, we conducted a quantitative empirical study on bugs with priority changes through analyzing 32 non-trivial Apache open source software projects. The results show that: (1) 8.3% of the bugs in the selected projects underwent priority changes; (2) the median priority change time interval is merely a few days for most (28 out of 32) projects, and half (50. 7%) of bug priority changes occurred before bugs were handled; (3) for all selected projects, 87.9% of the bugs with p
With the escalating frequency of floods posing persistent threats to human life and property, satellite remote sensing has emerged as an indispensable tool for monitoring flood hazards. SpaceNet8 offers a unique opportunity to leverage cutting-edge artificial intelligence technologies to assess these hazards. A significant contribution of this research is its application of Apache Sedona, an advanced platform specifically designed for the efficient and distributed processing of large-scale geospatial data. This platform aims to enhance the efficiency of error analysis, a critical aspect of improving flood damage detection accuracy. Based on Apache Sedona, we introduce a novel approach that addresses the challenges associated with inaccuracies in flood damage detection. This approach involves the retrieval of cases from historical flood events, the adaptation of these cases to current scenarios, and the revision of the model based on clustering algorithms to refine its performance. Through the replication of both the SpaceNet8 baseline and its top-performing models, we embark on a comprehensive error analysis. This analysis reveals several main sources of inaccuracies. To address th
Complex workflows play a critical role in accelerating scientific discovery. In many scientific domains, efficient workflow management can lead to faster scientific output and broader user groups. Workflows that can leverage resources across the boundary between cloud and HPC are a strong driver for the convergence of HPC and cloud. This study investigates the transition and deployment of a GPU-accelerated molecular docking workflow that was designed for HPC systems onto a cloud-native environment with Kubernetes and Apache Airflow. The case study focuses on state-of-of-the-art molecular docking software for drug discovery. We provide a DAG-based implementation in Apache Airflow and technical details for GPU-accelerated deployment. We evaluated the workflow using the SWEETLEAD bioinformatics dataset and executed it in a Cloud environment with heterogeneous computing resources. Our workflow can effectively overlap different stages when mapped onto different computing resources.
Healthcare data is a valuable resource for research, analysis, and decision-making in the medical field. However, healthcare data is often fragmented and distributed across various sources, making it challenging to combine and analyze effectively. Record linkage, also known as data matching, is a crucial step in integrating and cleaning healthcare data to ensure data quality and accuracy. Apache Spark, a powerful open-source distributed big data processing framework, provides a robust platform for performing record linkage tasks with the aid of its machine learning library. In this study, we developed a new distributed data-matching model based on the Apache Spark Machine Learning library. To ensure the correct functioning of our model, the validation phase has been performed on the training data. The main challenge is data imbalance because a large amount of data is labeled false, and a small number of records are labeled true. By utilizing SVM and Regression algorithms, our results demonstrate that research data was neither over-fitted nor under-fitted, and this shows that our distributed model works well on the data.
Apache SAMOA (Scalable Advanced Massive Online Analysis) is an open-source platform for mining big data streams. Big data is defined as datasets whose size is beyond the ability of typical software tools to capture, store, manage, and analyze, due to the time and memory complexity. Apache SAMOA provides a collection of distributed streaming algorithms for the most common data mining and machine learning tasks such as classification, clustering, and regression, as well as programming abstractions to develop new algorithms. It features a pluggable architecture that allows it to run on several distributed stream processing engines such as Apache Flink, Apache Storm, and Apache Samza. Apache SAMOA is written in Java and is available at https://samoa.incubator.apache.org under the Apache Software License version 2.0.
We explore the utilization of the Apache TVM open source framework to automatically generate a family of algorithms that follow the approach taken by popular linear algebra libraries, such as GotoBLAS2, BLIS and OpenBLAS, in order to obtain high-performance blocked formulations of the general matrix multiplication (GEMM). % In addition, we fully automatize the generation process, by also leveraging the Apache TVM framework to derive a complete variety of the processor-specific micro-kernels for GEMM. This is in contrast with the convention in high performance libraries, which hand-encode a single micro-kernel per architecture using Assembly code. % In global, the combination of our TVM-generated blocked algorithms and micro-kernels for GEMM 1)~improves portability, maintainability and, globally, streamlines the software life cycle; 2)~provides high flexibility to easily tailor and optimize the solution to different data types, processor architectures, and matrix operand shapes, yielding performance on a par (or even superior for specific matrix shapes) with that of hand-tuned libraries; and 3)~features a small memory footprint.
Access plan recommendation is a query optimization approach that executes new queries using prior created query execution plans (QEPs). The query optimizer divides the query space into clusters in the mentioned method. However, traditional clustering algorithms take a significant amount of execution time for clustering such large datasets. The MapReduce distributed computing model provides efficient solutions for storing and processing vast quantities of data. Apache Spark and Apache Hadoop frameworks are used in the present investigation to cluster different sizes of query datasets in the MapReduce-based access plan recommendation method. The performance evaluation is performed based on execution time. The results of the experiments demonstrated the effectiveness of parallel query clustering in achieving high scalability. Furthermore, Apache Spark achieved better performance than Apache Hadoop, reaching an average speedup of 2x.
Distributed data processing platforms for cloud computing are important tools for large-scale data analytics. Apache Hadoop MapReduce has become the de facto standard in this space, though its programming interface is relatively low-level, requiring many implementation steps even for simple analysis tasks. This has led to the development of advanced dataflow oriented platforms, most prominently Apache Spark and Apache Flink. Those platforms not only aim to improve performance through improved in-memory processing, but in particular provide built-in high-level data processing functionality, such as filtering and join operators, which should make data analysis tasks easier to develop than with plain Hadoop MapReduce. But is this indeed the case? This paper compares three prominent distributed data processing platforms: Apache Hadoop MapReduce; Apache Spark; and Apache Flink, from a usability perspective. We report on the design, execution and results of a usability study with a cohort of masters students, who were learning and working with all three platforms in order to solve different use cases set in a data science context. Our findings show that Spark and Flink are preferred plat
Apache Kafka addresses the general problem of delivering extreme high volume event data to diverse consumers via a publish-subscribe messaging system. It uses partitions to scale a topic across many brokers for producers to write data in parallel, and also to facilitate parallel reading of consumers. Even though Apache Kafka provides some out of the box optimizations, it does not strictly define how each topic shall be efficiently distributed into partitions. The well-formulated fine-tuning that is needed in order to improve an Apache Kafka cluster performance is still an open research problem. In this paper, we first model the Apache Kafka topic partitioning process for a given topic. Then, given the set of brokers, constraints and application requirements on throughput, OS load, replication latency and unavailability, we formulate the optimization problem of finding how many partitions are needed and show that it is computationally intractable, being an integer program. Furthermore, we propose two simple, yet efficient heuristics to solve the problem: the first tries to minimize and the second to maximize the number of brokers used in the cluster. Finally, we evaluate its perform
The classical two-sample test of Kolmogorov-Smirnov (KS) is widely used to test whether empirical samples come from the same distribution. Even though most statistical packages provide an implementation, carrying out the test in big data settings can be challenging because it requires a full sort of the data. The popular Apache Spark system for big data processing provides a 1-sample KS test, but not the 2-sample version. Moreover, recent Spark versions provide the approxQuantile method for querying $ε$-approximate quantiles. We build on approxQuantile to propose a variation of the classical Kolmogorov-Smirnov two-sample test that constructs approximate cumulative distribution functions (CDF) from $ε$-approximate quantiles. We derive error bounds of the approximate CDF and show how to use this information to carry out KS tests. Psuedocode for the approach requires 15 executable lines. A Python implementation appears in the appendix.
We present rotation period measurements for 107 M dwarfs in the mass range $0.15-0.70 M_\odot$ observed within the context of the APACHE photometric survey. We measure rotation periods in the range 0.5-190 days, with the distribution peaking at $\sim$ 30 days. We revise the stellar masses and radii for our sample of rotators by exploiting the Gaia DR2 data. For $\sim 20\%$ of the sample, we compare the photometric rotation periods with those derived from different spectroscopic indicators, finding good correspondence in most cases. We compare our rotation periods distribution to the one obtained by the Kepler survey in the same mass range, and to that derived by the MEarth survey for stars in the mass range $0.07-0.25 M_\odot$. The APACHE and Kepler periods distributions are in good agreement, confirming the reliability of our results, while the APACHE distribution is consistent with the MEarth result only for the older/slow rotators, and in the overlapping mass range of the two surveys. Combining the APACHE/Kepler distribution with the MEarth distribution, we highlight that the rotation period increases with decreasing stellar mass, in agreement with previous work. Our findings al
The internet has caused tremendous changes since its appearance in the 1980s, and now, the Internet of Things (IoT) seems to be doing the same. The potential of IoT has made it the center of attention for many people, but, where some see an opportunity to contribute, others may see IoT networks as a target to be exploited. The high number of IoT devices makes them the perfect setup for staging denial-of-service attacks (DoS) that can have devastating consequences. This renders the need for cybersecurity measures such as intrusion detection systems (IDSs) evident. The aim of this paper is to build an IDS using the big data platform, Apache Spark. Apache Spark was used along with its ML library (MLlib) and the BoT-IoT dataset. The IDS was then tested and evaluated based on F-Measure (f1), as was the standard when evaluating imbalanced data. Two rounds of tests were performed, a partial dataset for minimizing bias, and the full BoT-IoT dataset for exploring big data and ML capabilities in a security setting. For the partial dataset, the Random Forest algorithm had the highest performance for binary classification at an average f1 measure of 99.7%, as well as 99.6% for main category cl
The wide use of XML for document management and data exchange has created the need to query large repositories of XML data. To efficiently query such large data collections and take advantage of parallelism, we have implemented Apache VXQuery, an open-source scalable XQuery processor. The system builds upon two other open-source frameworks -- Hyracks, a parallel execution engine, and Algebricks, a language agnostic compiler toolbox. Apache VXQuery extends these two frameworks and provides an implementation of the XQuery specifics (data model, data-model dependent functions and optimizations, and a parser). We describe the architecture of Apache VXQuery, its integration with Hyracks and Algebricks, and the XQuery optimization rules applied to the query plan to improve path expression efficiency and to enable query parallelism. An experimental evaluation using a real 500GB dataset with various selection, aggregation and join XML queries shows that Apache VXQuery performs well both in terms of scale-up and speed-up. Our experiments show that it is about 3x faster than Saxon (an open-source and commercial XQuery processor) on a 4-core, single node implementation, and around 2.5x faster
This paper proposes nowcasting of high-frequency financial datasets in real-time with a 5-minute interval using the streaming analytics feature of Apache Spark. The proposed 2 stage method consists of modelling chaos in the first stage and then using a sliding window approach for training with machine learning algorithms namely Lasso Regression, Ridge Regression, Generalised Linear Model, Gradient Boosting Tree and Random Forest available in the MLLib of Apache Spark in the second stage. For testing the effectiveness of the proposed methodology, 3 different datasets, of which two are stock markets namely National Stock Exchange & Bombay Stock Exchange, and finally One Bitcoin-INR conversion dataset. For evaluating the proposed methodology, we used metrics such as Symmetric Mean Absolute Percentage Error, Directional Symmetry, and Theil U Coefficient. We tested the significance of each pair of models using the Diebold Mariano (DM) test.
While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to be at the forefront of big data analytics for being a unified framework for both, batch and stream data processing. There is also a renewed interest is Near Data Computing (NDC) due to technological advancement in the last decade. However, it is not known if NDC architectures can improve the performance of big data processing frameworks such as Apache Spark. In this position paper, we hypothesize in favour of NDC architecture comprising programmable logic based hybrid 2D integrated processing-in-memory and in-storage processing for Apache Spark, by extensive profiling of Apache Spark based workloads on Ivy Bridge Server.
We discuss initial results and our planned approach for incorporating Apache Mesos based resource management that will enable design and development of scheduling strategies for Apache Airavata jobs so that they can be launched on multiple clouds, wherein several VMs do not have Public IP addresses. We present initial work and next steps on the design of a meta-scheduler using Apache Mesos. Apache Mesos presents a unified view of resources available across several clouds and clusters. Our meta-scheduler can potentially examine and identify the cases where multiple small jobs have been submitted by the same scientists and then redirect job from the same community account or user to different clusters. Our approach uses a NAT firewall to make nodes/VMs, without a Public IP, visible to Mesos for the unified view.