Multimodal conversational generative AI has shown impressive capabilities in various vision and language understanding through learning massive text-image data. However, current conversational models still lack knowledge about visual insects since they are often trained on the general knowledge of vision-language data. Meanwhile, understanding insects is a fundamental problem in precision agriculture, helping to promote sustainable development in agriculture. Therefore, this paper proposes a novel multimodal conversational model, Insect-LLaVA, to promote visual understanding in insect-domain knowledge. In particular, we first introduce a new large-scale Multimodal Insect Dataset with Visual Insect Instruction Data that enables the capability of learning the multimodal foundation models. Our proposed dataset enables conversational models to comprehend the visual and semantic features of the insects. Second, we propose a new Insect-LLaVA model, a new general Large Language and Vision Assistant in Visual Insect Understanding. Then, to enhance the capability of learning insect features, we develop an Insect Foundation Model by introducing a new micro-feature self-supervised learning wi
Insect-pests significantly impact global agricultural productivity and quality. Effective management involves identifying the full insect community, including beneficial insects and harmful pests, to develop and implement integrated pest management strategies. Automated identification of insects under real-world conditions presents several challenges, including differentiating similar-looking species, intra-species dissimilarity and inter-species similarity, several life cycle stages, camouflage, diverse imaging conditions, and variability in insect orientation. A deep-learning model, InsectNet, is proposed to address these challenges. InsectNet is endowed with five key features: (a) utilization of a large dataset of insect images collected through citizen science; (b) label-free self-supervised learning for large models; (c) improving prediction accuracy for species with a small sample size; (d) enhancing model trustworthiness; and (e) democratizing access through streamlined MLOps. This approach allows accurate identification (>96% accuracy) of over 2500 insect species, including pollinator (e.g., butterflies, bees), parasitoid (e.g., some wasps and flies), predator species (e
In precision agriculture, the detection and recognition of insects play an essential role in the ability of crops to grow healthy and produce a high-quality yield. The current machine vision model requires a large volume of data to achieve high performance. However, there are approximately 5.5 million different insect species in the world. None of the existing insect datasets can cover even a fraction of them due to varying geographic locations and acquisition costs. In this paper, we introduce a novel "Insect-1M" dataset, a game-changing resource poised to revolutionize insect-related foundation model training. Covering a vast spectrum of insect species, our dataset, including 1 million images with dense identification labels of taxonomy hierarchy and insect descriptions, offers a panoramic view of entomology, enabling foundation models to comprehend visual and semantic information about insects like never before. Then, to efficiently establish an Insect Foundation Model, we develop a micro-feature self-supervised learning method with a Patch-wise Relevant Attention mechanism capable of discerning the subtle differences among insect images. In addition, we introduce Description Co
The purpose of the Insect Detection System for Crop and Plant Health is to keep an eye out for and identify insect infestations in farming areas. By utilizing cutting-edge technology like computer vision and machine learning, the system seeks to identify hazardous insects early and accurately. This would enable prompt response to save crops and maintain optimal plant health. The Method of this study includes Data Acquisition, Preprocessing, Data splitting, Model Implementation and Model evaluation. Different models like MobileNetV2, ResNet152V2, Xecption, Custom CNN was used in this study. In order to categorize insect photos, a Convolutional Neural Network (CNN) based on the ResNet152V2 architecture is constructed and evaluated in this work. Achieving 99% training accuracy and 97% testing accuracy, ResNet152V2 demonstrates superior performance among four implemented models. The results highlight its potential for real-world applications in insect classification and entomology studies, emphasizing efficiency and accuracy. To ensure food security and sustain agricultural output globally, finding insects is crucial. Cutting-edge technology, such as ResNet152V2 models, greatly influen
Bees are among the master navigators of the insect world. Despite impressive advances in robot navigation research, the performance of these insects is still unrivaled by any artificial system in terms of training efficiency and generalization capabilities, particularly considering the limited computational capacity. On the other hand, computational principles underlying these extraordinary feats are still only partially understood. The theoretical framework of reinforcement learning (RL) provides an ideal focal point to bring the two fields together for mutual benefit. In particular, we analyze and compare representations of space in robot and insect navigation models through the lens of RL, as the efficiency of insect navigation is likely rooted in an efficient and robust internal representation, linking retinotopic (egocentric) visual input with the geometry of the environment. While RL has long been at the core of robot navigation research, current computational theories of insect navigation are not commonly formulated within this framework, but largely as an associative learning process implemented in the insect brain, especially in the mushroom body (MB). Here we propose spec
Global biodiversity is declining at an unprecedented rate, yet little information is known about most species and how their populations are changing. Indeed, some 90% of Earth's species are estimated to be completely unknown. Machine learning has recently emerged as a promising tool to facilitate long-term, large-scale biodiversity monitoring, including algorithms for fine-grained classification of species from images. However, such algorithms typically are not designed to detect examples from categories unseen during training -- the problem of open-set recognition (OSR) -- limiting their applicability for highly diverse, poorly studied taxa such as insects. To address this gap, we introduce Open-Insect, a large-scale, fine-grained dataset to evaluate unknown species detection across different geographic regions with varying difficulty. We benchmark 38 OSR algorithms across three categories: post-hoc, training-time regularization, and training with auxiliary data, finding that simple post-hoc approaches remain a strong baseline. We also demonstrate how to leverage auxiliary data to improve species discovery in regions with limited data. Our results provide insights to guide the dev
Science has long been viewed as a key driver of economic growth and rising standards of living. Knowledge about how scientific advances support marketplace inventions is therefore essential for understanding the role of science in propelling real-world applications and technological progress. The increasing availability of large-scale datasets tracing scientific publications and patented inventions and the complex interactions among them offers us new opportunities to explore the evolving dual frontiers of science and technology at an unprecedented level of scale and detail. However, we lack suitable visual analytics approaches to analyze such complex interactions effectively. Here we introduce InnovationInsights, an interactive visual analysis system for researchers, research institutions, and policymakers to explore the complex linkages between science and technology, and to identify critical innovations, inventors, and potential partners. The system first identifies important associations between scientific papers and patented inventions through a set of statistical measures introduced by our experts from the field of the Science of Science. A series of visualization views are t
Insects represent half of all global biodiversity, yet many of the world's insects are disappearing, with severe implications for ecosystems and agriculture. Despite this crisis, data on insect diversity and abundance remain woefully inadequate, due to the scarcity of human experts and the lack of scalable tools for monitoring. Ecologists have started to adopt camera traps to record and study insects, and have proposed computer vision algorithms as an answer for scalable data processing. However, insect monitoring in the wild poses unique challenges that have not yet been addressed within computer vision, including the combination of long-tailed data, extremely similar classes, and significant distribution shifts. We provide the first large-scale machine learning benchmarks for fine-grained insect recognition, designed to match real-world tasks faced by ecologists. Our contributions include a curated dataset of images from citizen science platforms and museums, and an expert-annotated dataset drawn from automated camera traps across multiple continents, designed to test out-of-distribution generalization under field conditions. We train and evaluate a variety of baseline algorithms
Approximately half of the existing winged-insect species are of very small size (wing length about 0.3-4 mm); they are referred to as miniature insects. Yet until recently, much of what we know about the mechanics of insect flight was derived from studies on relatively large insects, such as hoverflies, honey bees and hawkmoths. Because of their very small size, many miniature insects fly at a Reynolds number (Re) on the order of 10 or less. At such a low Re, the viscous effect of the air is very large: A miniature insect moves through the air as would a bumble bee move through mineral oil. Miniature insects must use new flapping mode and new aerodynamic mechanisms to fly. Over the past decade, much work has been done in the study of the mechanics of flight in miniature insects: novel flapping modes have been discovered and new mechanisms of aerodynamic force generation have been revealed; progress has also been made on the fluid-mechanics related flight problems, such as flight power requirements and flight dynamic stability. This article reviews these developments and discusses potential future directions.
Ensuring fairness is essential for every education system. Machine learning is increasingly supporting the education system and educational data science (EDS) domain, from decision support to educational activities and learning analytics. However, the machine learning-based decisions can be biased because the algorithms may generate the results based on students' protected attributes such as race or gender. Clustering is an important machine learning technique to explore student data in order to support the decision-maker, as well as support educational activities, such as group assignments. Therefore, ensuring high-quality clustering models along with satisfying fairness constraints are important requirements. This chapter comprehensively surveys clustering models and their fairness in EDS. We especially focus on investigating the fair clustering models applied in educational activities. These models are believed to be practical tools for analyzing students' data and ensuring fairness in EDS.
Precision health, increasingly supported by digital technologies, is a domain of research that broadens the paradigm of precision medicine, advancing everyday healthcare. This vision goes hand in hand with the groundbreaking advent of artificial intelligence (AI), which is reshaping the way we diagnose, treat, and monitor both clinical subjects and the general population. AI tools powered by machine learning have shown considerable improvements in a variety of healthcare domains. In particular, reinforcement learning (RL) holds great promise for sequential and dynamic problems such as dynamic treatment regimes and just-in-time adaptive interventions in digital health. In this work, we discuss the opportunity offered by AI, more specifically RL, to current trends in healthcare, providing a methodological survey of RL methods in the context of precision and digital health. Focusing on the area of adaptive interventions, we expand the methodological survey with illustrative case studies that used RL in real practice. This invited article has undergone anonymous review and is intended as a book chapter for the volume "Frontiers of Statistics and Data Science" edited by Subhashis Ghosha
Throughout history, everyday people have contributed to science through a myriad of volunteer activities. This early participation required training and often involved mentorship from scientists or senior citizen scientists (or, as they were often called, gentleman scientists). During this learning process, participants learned how they and their data would be used both to advance science, and in some cases, advance the careers of professional collaborators. Modern, online citizen science, allows participation with just a few clicks, and people may participate without understanding what they are contributing to. Too often, they happily see what they are doing as the privilege of painting Tom Sawyer's fence without realizing they are actually being used as merely a means to a scientific end. This paper discusses the ethical dilemmas that plague modern citizen science, including: the issues of informed consent, such as not requiring logins; the issues of coercion inherent in mandatory classroom assignments requiring data submission; and the issues of using people merely as a means to an end that are inherent in technonationalism, and projects that do not provide utility to the users
The gradual crowding out of singleton and small team science by large team endeavors is challenging key features of research culture. It is therefore important for the future of scientific practice to reflect upon the individual scientist's ethical responsibilities within teams. To facilitate this reflection we show labor force trends in the US revealing a skewed growth in academic ranks and increased levels of competition for promotion within the system; we analyze teaming trends across disciplines and national borders demonstrating why it is becoming difficult to distribute credit and to avoid conflicts of interest; and we use more than a century of Nobel prize data to show how science is outgrowing its old institutions of singleton awards. Of particular concern within the large team environment is the weakening of the mentor-mentee relation, which undermines the cultivation of virtue ethics across scientific generations. These trends and emerging organizational complexities call for a universal set of behavioral norms that transcend team heterogeneity and hierarchy. To this end, our expository analysis provides a survey of ethical issues in team settings to inform science ethics
Mauve is a low-cost small satellite developed and operated by Blue Skies Space Ltd. The payload features a 13 cm telescope connected with a fibre that feeds into a UV-Vis spectrometer. The detector covers the 200-700 nm range in a single shot, obtaining low resolution spectra at R~20-65. Mauve has launched on 28th November 2025, reaching a 510 km Low-Earth Sun-synchronous orbit. The satellite will enable UV and visible observations of a variety of stellar objects in our Galaxy, filling the gaps in the ultraviolet space-based data. The researchers that have already joined the mission have defined the science themes, observational strategy and targets that Mauve will observe in the first year of operations. To date 10 science themes have been developed by the Mauve science collaboration for year 1, with observational strategies that include both long duration monitoring and short cadence snapshots. Here, we describe these themes and the science that Mauve will undertake in its first year of operations.
Data science and technology offer transformative tools and methods to science. This review article highlights latest development and progress in the interdisciplinary field of data-driven plasma science (DDPS). A large amount of data and machine learning algorithms go hand in hand. Most plasma data, whether experimental, observational or computational, are generated or collected by machines today. It is now becoming impractical for humans to analyze all the data manually. Therefore, it is imperative to train machines to analyze and interpret (eventually) such data as intelligently as humans but far more efficiently in quantity. Despite the recent impressive progress in applications of data science to plasma science and technology, the emerging field of DDPS is still in its infancy. Fueled by some of the most challenging problems such as fusion energy, plasma processing of materials, and fundamental understanding of the universe through observable plasma phenomena, it is expected that DDPS continues to benefit significantly from the interdisciplinary marriage between plasma science and data science into the foreseeable future.
Benchmarking the performance of complex systems such as rail networks, renewable generation assets and national economies is central to transport planning, regulation and macroeconomic analysis. Classical frontier methods, notably Data Envelopment Analysis (DEA) and Stochastic Frontier Analysis (SFA), estimate an efficient frontier in the observed input-output space and define efficiency as distance to this frontier, but rely on restrictive assumptions on the production set and only indirectly address heterogeneity and scale effects. We propose Geometric Manifold Analysis (GeMA), a latent manifold frontier framework implemented via a productivity-manifold variational autoencoder (ProMan-VAE). Instead of specifying a frontier function in the observed space, GeMA represents the production set as the boundary of a low-dimensional manifold embedded in the joint input-output space. A split-head encoder learns latent variables that capture technological structure and operational inefficiency. Efficiency is evaluated with respect to the learned manifold, endogenous peer groups arise as clusters in latent technology space, a quotient construction supports scale-invariant benchmarking, and
The number of published articles in the field of materials science is growing rapidly every year. This comparatively unstructured data source, which contains a large amount of information, has a restriction on its re-usability, as the information needed to carry out further calculations using the data in it must be extracted manually. It is very important to obtain valid and contextually correct information from the online (offline) data, as it can be useful not only to generate inputs for further calculations, but also to incorporate them into a querying framework. Retaining this context as a priority, we have developed an automated tool, MatScIE (Material Scince Information Extractor) that can extract relevant information from material science literature and make a structured database that is much easier to use for material simulations. Specifically, we extract the material details, methods, code, parameters, and structure from the various research articles. Finally, we created a web application where users can upload published articles and view/download the information obtained from this tool and can create their own databases for their personal uses.
Researchers may be tempted to attract attention through poetic titles for their publications, but would this be mistaken in some fields? Whilst poetic titles are known to be common in medicine, it is not clear whether the practice is widespread elsewhere. This article investigates the prevalence of poetic expressions in journal article titles 1996-2019 in 3.3 million articles from all 27 Scopus broad fields. Expressions were identified by manually checking all phrases with at least 5 words that occurred at least 25 times, finding 149 stock phrases, idioms, sayings, literary allusions, film names and song titles or lyrics. The expressions found are most common in the social sciences and the humanities. They are also relatively common in medicine, but almost absent from engineering and the natural and formal sciences. The differences may reflect the less hierarchical and more varied nature of the social sciences and humanities, where interesting titles may attract an audience. In engineering, natural science and formal science fields, authors should take extra care with poetic expressions, in case their choice is judged inappropriate. This includes interdisciplinary research overlapp
The rapid proliferation of online content producing and sharing technologies resulted in an explosion of user-generated content (UGC), which now extends to scientific data. Citizen science, in which ordinary people contribute information for scientific research, epitomizes UGC. Citizen science projects are typically open to everyone, engage diverse audiences, and challenge ordinary people to produce data of highest quality to be usable in science. This also makes citizen science a very exciting area to study both traditional and innovative approaches to information quality management. With this paper we position citizen science as a leading information quality research frontier. We also show how citizen science opens a unique opportunity for the information systems community to contribute to a broad range of disciplines in natural and social sciences and humanities.
We investigate the development of scientific content knowledge of volunteers participating in online citizen science projects in the Zooniverse (www.zooniverse.org), including the astronomy projects Galaxy Zoo (www.galaxyzoo.org) and Planet Hunters (www.planethunters.org). We use econometric methods to test how measures of project participation relate to success in a science quiz, controlling for factors known to correlate with scientific knowledge. Citizen scientists believe they are learning about both the content and processes of science through their participation. Won't don't directly test the latter, but we find evidence to support the former - that more actively engaged participants perform better in a project-specific science knowledge quiz, even after controlling for their general science knowledge. We interpret this as evidence of learning of science content inspired by participation in online citizen science.