Found 20 results
Data hiding is the art of concealing messages with limited perceptual changes. Recently, deep learning has enriched it from various perspectives with significant progress. In this work, we conduct a brief yet comprehensive review of existing literature for deep learning based data hiding (deep hiding) by first classifying it according to three essential properties (i.e., capacity, security and robustness), and then outlining three commonly used architectures. Based on this, we summarize specific strategies for different applications of data hiding, including basic hiding, steganography, watermarking and light field messaging. Finally, further insight into deep hiding is provided by incorporating the perspective of adversarial attack.
While Large Language Models (LLMs) acquire vast knowledge during pre-training, they often lack domain-specific, new, or niche information. Continual pre-training (CPT) attempts to address this gap but suffers from catastrophic forgetting and inefficiencies in low-data regimes. We introduce Knowledge-Instruct, a novel approach to efficiently inject knowledge from limited corpora through pure instruction-tuning. By generating information-dense synthetic instruction data, it effectively integrates new knowledge while preserving general reasoning and instruction-following abilities. Knowledge-Instruct demonstrates superior factual memorization, minimizes catastrophic forgetting, and remains scalable by leveraging synthetic data from relatively small language models. Additionally, it enhances contextual understanding, including complex multi-hop reasoning, facilitating integration with retrieval systems. We validate its effectiveness across diverse benchmarks, including Companies, a new dataset that we release to measure knowledge injection capabilities.
The growing trends in automation, Internet of Things, big data and cloud computing technologies have led to the fourth industrial revolution (Industry 4.0), where it is possible to visualize and identify patterns and insights, which results in a better understanding of the data and can improve the manufacturing process. However, data exploration is often difficult for manufacturing experts, because they may also want to analyze data that does not appear in pre-designed visualizations and must then be assisted by Information Technology experts. In this paper, we present a semantic-based visual query system, developed for a real Industry 4.0 scenario, that allows domain experts to explore and visualize data in a user-friendly way. The main novelty of the system is its combined use of captured data that are first semantically annotated and a customized 2D digital representation of a machine that is also linked with semantic descriptions. Those descriptions are expressed using terms of an ontology, where, among others, the sensors that are used to capture indicators about the performance of a machine that b
Distributed dataflow systems like Apache Flink and Apache Spark simplify processing large amounts of data on clusters in a data-parallel manner. However, choosing suitable cluster resources for distributed dataflow jobs in both type and number is difficult, especially for users who do not have access to previous performance metrics. One approach to overcoming this issue is to have users share runtime metrics to train context-aware performance models that help find a suitable configuration for the job at hand. A problem when sharing runtime data instead of trained models or model parameters is that the data size can grow substantially over time. This paper examines several clustering techniques to minimize training data size while keeping the associated performance models accurate. Our results indicate that efficiency gains in data transfer, storage, and model training can be achieved through training data reduction. In the evaluation of our solution on a dataset of runtime data from 930 unique distributed dataflow jobs, we observed that, on average, a 75% data reduction only increases prediction errors by one percentage point.
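The data-reduction idea in the abstract above can be illustrated with a minimal sketch. The abstract does not say which clustering techniques were examined, so plain k-means is used here purely as a stand-in, and the record layout (a feature tuple plus a runtime), the function names, and the sample values are all hypothetical. Each cluster of runtime records is collapsed into a single averaged record, so `keep_fraction=0.25` corresponds to roughly a 75% reduction.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_assign(points, k, iterations=20, seed=0):
    """Plain k-means (a stand-in technique); returns one cluster index per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each non-empty cluster's centroid to its mean.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def reduce_training_data(records, keep_fraction=0.25, seed=0):
    """Collapse a non-empty list of (features, runtime) records to one averaged record per cluster."""
    k = max(1, min(len(records), round(len(records) * keep_fraction)))
    features = [list(f) for f, _ in records]
    assignment = kmeans_assign(features, k, seed=seed)
    reduced = []
    for c in range(k):
        members = [rec for rec, a in zip(records, assignment) if a == c]
        if not members:
            continue  # k-means can leave a cluster empty; skip it
        mean_features = [sum(col) / len(members) for col in zip(*(f for f, _ in members))]
        mean_runtime = sum(r for _, r in members) / len(members)
        reduced.append((mean_features, mean_runtime))
    return reduced


# Hypothetical records: ((node count, input size in GB), runtime in seconds).
records = [
    ((2, 10), 120.0), ((2, 12), 131.0), ((4, 10), 70.0), ((4, 12), 76.0),
    ((8, 100), 400.0), ((8, 110), 430.0), ((16, 100), 220.0), ((16, 110), 236.0),
]
reduced = reduce_training_data(records, keep_fraction=0.25)
print(len(records), "records reduced to", len(reduced))
```

In practice the features would likely need scaling before clustering so that no single dimension dominates the distance, and the accuracy of a performance model trained on the reduced set would have to be checked against one trained on the full set.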
Retrieval-augmented generation (RAG) can supplement large language models (LLMs) by integrating external knowledge. However, as the number of retrieved documents increases, the input length to LLMs grows linearly, causing a dramatic increase in latency and a degradation in long-context understanding. This is particularly serious for multi-hop questions that require a chain of reasoning across documents. To accelerate inference, reduce costs, and minimize distractions, this paper presents BRIEF (Bridging Retrieval and Inference through Evidence Fusion), a lightweight approach that performs query-aware multi-hop reasoning by compressing retrieved documents into highly dense textual summaries to integrate into in-context RAG. To enable learning compression for multi-hop reasoning, we curate synthetic data by extracting atomic propositions that encapsulate distinct factoids from the source documents to compose synthetic summaries. Based on our synthetic data built entirely by open-source models, BRIEF generates more concise summaries and enables a range of LLMs to achieve exceptional open-domain question answering (QA) performance. For example, on HotpotQA, BRIEF improves the compressi
With the rapid increase in published open datasets, it is crucial to support open data progress in smart cities while considering open data quality. In the Czech Republic and its National Open Data Catalogue (NODC), open datasets are usually evaluated based on their metadata only, leaving the content and the adherence to the recommended data structure to the sole responsibility of the data providers. The interoperability of open datasets remains unknown. This paper therefore proposes a novel content-aware quality evaluation framework that assesses the quality of open datasets based on five data quality dimensions. With the proposed framework, we provide a fundamental view on the interoperability-oriented data quality of Czech open datasets, which are published in NODC. Our evaluations find that domain-specific open data quality assessments can detect data quality issues beyond the traditional heuristics used for determining Czech open data quality, increase the datasets' interoperability, and thus increase their potential to bring value to society. The findings of this research are beneficial not only for the case of the Czech Republic, but also can be ap
Sharing diverse genomic and other biomedical datasets is critical to advance scientific discoveries and their equitable translation to improve human health. However, data sharing remains challenging in the context of legacy datasets, evolving policies, multi-institutional consortium science, and international stakeholders. The NIH-funded Polygenic Risk Methods in Diverse Populations (PRIMED) Consortium was established to improve the performance of polygenic risk estimates for a broad range of health and disease outcomes with global impacts. Improving polygenic risk score performance across genetically diverse populations requires access to large, diverse cohorts. We report on the design and implementation of data sharing policies and procedures developed in PRIMED to aggregate and analyze data from multiple, heterogeneous sources while adhering to existing data sharing policies for each integrated dataset. We describe two primary data sharing mechanisms: coordinated dbGaP applications and a Consortium Data Sharing Agreement, as well as provide alternatives when individual-level data cannot be shared within the Consortium (e.g., federated analyses). We also describe technical implem
The COVID-19 pandemic highlighted the urgent need for robust systems to enable rapid data collection, integration, and analysis for public health responses. Existing approaches often relied on disparate, non-interoperable systems, creating bottlenecks in comprehensive analyses and timely decision-making. To address these challenges, the U.S. National Institutes of Health (NIH) launched the Rapid Acceleration of Diagnostics (RADx) initiative in 2020, with the RADx Data Hub, a centralized repository for de-identified and curated COVID-19 data, as its cornerstone. The RADx Data Hub hosts diverse study data, including clinical data, testing results, smart sensor outputs, self-reported symptoms, and information on social determinants of health. Built on cloud infrastructure, the RADx Data Hub integrates metadata standards, interoperable formats, and ontology-based tools to adhere to the FAIR (Findable, Accessible, Interoperable, Reusable) principles for data sharing. Initially developed for COVID-19 research, its architecture and processes are adaptable to other scientific disciplines. This paper provides an overview of the data hosted by the RADx Data Hub and describes the platform's c
Current AI models often fail to account for local context and language, given the predominance of English and Western internet content in their training data. This hinders the global relevance, usefulness, and safety of these models as they gain more users around the globe. Amplify Initiative, a data platform and methodology, leverages expert communities to collect diverse, high-quality data to address the limitations of these models. The platform is designed to enable co-creation of datasets, provide access to high-quality multilingual datasets, and offer recognition to data authors. This paper presents the approach to co-creating datasets with domain experts (e.g., health workers, teachers) through a pilot conducted in Sub-Saharan Africa (Ghana, Kenya, Malawi, Nigeria, and Uganda). In partnership with local researchers situated in these countries, the pilot demonstrated an end-to-end approach to co-creating data with 155 experts in sensitive domains (e.g., physicians, bankers, anthropologists, human and civil rights advocates). This approach, implemented with an Android app, resulted in an annotated dataset of 8,091 adversarial queries in seven languages (e.g., Luganda, Swahili,
The emergence of breakthrough artificial intelligence (AI) techniques has led to a renewed focus on how small data settings, i.e., settings with limited information, can benefit from such developments. This includes societal issues such as how best to include under-represented groups in data-driven policy and decision making, or the health benefits of assistive technologies such as wearables. We provide a conceptual overview, in particular contrasting small data with big data, and identify common themes from exemplary case studies and application areas. Potential solutions are described in a more detailed technical overview of current data analysis and modelling techniques, highlighting contributions from different disciplines, such as knowledge-driven modelling from statistics and data-driven modelling from computer science. By linking application settings, conceptual contributions and specific techniques, we highlight what is already feasible and suggest what an agenda for fully leveraging small data might look like.
As retrieval-augmented generation (RAG) tackles complex tasks, increasingly expanded contexts offer richer information, but at the cost of higher latency and increased cognitive load on the model. To mitigate this bottleneck, especially for intricate multi-hop questions, we introduce BRIEF-Pro. It is a universal, lightweight compressor that distills relevant evidence for a given query from retrieved documents into a concise summary for seamless integration into in-context RAG. Using seed data consisting of relatively short contexts (fewer than 1k words), BRIEF-Pro is trained to perform abstractive compression of extended contexts exceeding 10k words across a wide range of scenarios. Furthermore, BRIEF-Pro offers flexible user control over summary length by allowing users to specify the desired number of sentences. Experiments on four open-domain multi-hop question-answering datasets show that BRIEF-Pro generates more concise and relevant summaries, enhancing performance across small, large, and proprietary language models. With the 70B reader model, 32x compression by BRIEF-Pro improves QA performance by 4.67% on average over LongLLMLingua's 9x, while requiring only 23% of its comp
Sharing scientific data, with the objective of making it fully discoverable, accessible, assessable, intelligible, usable, and interoperable, requires work at the disciplinary level to define in particular how the data should be formatted and described. Each discipline has its own organization and history as a starting point, and this paper explores the way a range of disciplines, namely materials science, crystallography, astronomy, earth sciences, humanities and linguistics get organized at the international level to tackle this question. In each case, the disciplinary culture with respect to data sharing, science drivers, organization and lessons learnt are briefly described, as well as the elements of the specific data infrastructure which are or could be shared with others. Commonalities and differences are assessed. Common key elements for success are identified: data sharing should be science driven; defining the disciplinary part of the interdisciplinary standards is mandatory but challenging; sharing of applications should accompany data sharing. Incentives such as journal and funding agency requirements are also similar. For all, it also appears that social aspects are mo
Data similarity assumptions have traditionally been relied upon to understand the convergence behaviors of federated learning methods. Unfortunately, this approach often demands fine-tuning step sizes based on the level of data similarity. When data similarity is low, these small step sizes result in an unacceptably slow convergence speed for federated methods. In this paper, we present a novel and unified framework for analyzing the convergence of federated learning algorithms without the need for data similarity conditions. Our analysis centers on an inequality that captures the influence of step sizes on algorithmic convergence performance. By applying our theorems to well-known federated algorithms, we derive precise expressions for three widely used step size schedules: fixed, diminishing, and step-decay step sizes, which are independent of data similarity conditions. Finally, we conduct comprehensive evaluations of the performance of these federated learning algorithms, employing the proposed step size strategies to train deep neural network models on benchmark datasets under varying data similarity conditions. Our findings demonstrate significant improvements in convergence
Large Language Models (LLMs) have demonstrated advanced capabilities in both text generation and comprehension, and their application to data archives might facilitate the privatization of sensitive information about the data subjects. In fact, the information contained in data often includes sensitive and personally identifiable details. This data, if not safeguarded, may bring privacy risks in terms of both disclosure and identification. Furthermore, the application of anonymisation techniques, such as k-anonymity, can lead to a significant reduction in the amount of data within data sources, which may reduce the efficacy of predictive processes. In our study, we investigate the capabilities offered by LLMs to enrich anonymized data sources without affecting their anonymity. To this end, we designed new ad-hoc prompt template engineering strategies to perform anonymized Data Augmentation and assess the effectiveness of LLM-based approaches in providing anonymized data. To validate the anonymization guarantees provided by LLMs, we exploited the pyCanon library, designed to assess the values of the parameters associated with the most common privacy-preserving techniques via anonymi
This is a review addressing soliton-like states in systems with nonlocal nonlinearity. Work on this topic has a long history. Some findings, such as optical solitons supported by thermal nonlinearity, and by the orientational nonlinearity in liquid crystals, have been reviewed in the literature and are therefore outlined only briefly in the present review. Some other studies, such as those addressing models with fractional diffraction, which is represented by a linear nonlocal operator, have started only recently, so it will be relevant to review them in detail once more results have accumulated; the present article provides a short outline of the latter topic. The main part of the article is a summary of results obtained for two-dimensional (2D) solitons in specific models originating in studies of Bose-Einstein condensates (BECs), which are sufficiently mature but have not yet been reviewed. These are, in particular, anisotropic quasi-2D solitons supported by long-range dipole-dipole interactions in a condensate of magnetic atoms, and giant vortex solitons, which are stable for high values of the winding number, as well as 2D vortex solitons of the latter type moving with
Modern large-scale astroparticle setups measure high-energy particles, gamma rays, neutrinos, radio waves, and the recently discovered gravitational waves. Ongoing and future experiments are located worldwide. The data acquired have different formats, storage concepts, and publication policies. Such differences are a crucial point in the era of Big Data and of multi-messenger analysis in astroparticle physics. We propose an open science web platform called ASTROPARTICLE.ONLINE which enables us to publish, store, search, select, and analyze astroparticle data. In the first stage of the project, the following components of a full data life cycle concept are under development: describing, storing, and reusing astroparticle data; software to perform multi-messenger analysis using deep learning; and outreach for students, post-graduate students, and others who are interested in astroparticle physics. Here we describe the concepts of the web platform and the first obtained results, including the metadata structure for astroparticle data, data analysis using convolutional neural networks, description of the binary data, and the outreach platform for those interested in astroparticle phy
Revealing hidden geometry and topology in noisy data sets is a challenging task. An elastic principal graph is a computationally efficient and flexible data approximator based on embedding a graph into the data space and minimizing an energy functional that penalizes the deviation of graph nodes both from data points and from a pluri-harmonic configuration (a generalization of linearity). The structure of the principal graph is learned from data by applying a topological grammar, which in the simplest case leads to the construction of principal curves or trees. To cope with noise and outliers more efficiently, we suggest using a trimmed data approximation term to increase the robustness of the method. The suggested modification affects neither the computational efficiency nor the general convergence properties of the original elastic graph method. The trimmed elastic energy functional remains a Lyapunov function for the optimization algorithm. On several examples of complex data distributions we demonstrate how the robust principal graphs learn the global data structure and show the advantage of using the trimmed data approximation term for the construction o
Remote sensing vision tasks require extensive labeled data across multiple, interconnected domains. However, current generative data augmentation frameworks are task-isolated, i.e., each vision task requires training an independent generative model, and they ignore the modeling of geographical information and spatial constraints. To address these issues, we propose TerraGen, a unified layout-to-image generation framework that enables flexible, spatially controllable synthesis of remote sensing imagery for various high-level vision tasks, e.g., detection, segmentation, and extraction. Specifically, TerraGen introduces a geographic-spatial layout encoder that unifies bounding box and segmentation mask inputs, combined with a multi-scale injection scheme and mask-weighted loss to explicitly encode spatial constraints, from global structures to fine details. Also, we construct the first large-scale multi-task remote sensing layout generation dataset containing 45k images and establish a standardized evaluation protocol for this task. Experimental results show that our TerraGen can achieve the best generation image quality across diverse tasks. Additionally, TerraGen can be used as
How do groups of individuals achieve consensus in movement decisions? Do individuals follow their friends, the one predetermined leader, or whomever just happens to be nearby? To address these questions computationally, we formalize the "Coordination Strategy Inference Problem". In this setting, a group of multiple individuals moves in a coordinated manner towards a target path. Each individual uses a specific strategy to follow others (e.g. nearest neighbors, pre-defined leaders, preferred friends). Given a set of time series that includes coordinated movement and a set of candidate strategies as inputs, we provide the first methodology (to the best of our knowledge) to infer whether each individual uses a local-agreement-system or a dictatorship-like strategy to achieve movement coordination at the group level. We evaluate and demonstrate the performance of the proposed framework by predicting the direction of movement of an individual in a group in both simulated datasets and two real-world datasets: a school of fish and a troop of baboons. Moreover, since there is no prior methodology for inferring individual-level strategies, we compare our framework with the state-of-the-art a
Data is the foundation of any scientific, industrial or commercial process. Its journey typically flows from collection to transport, storage, management and processing. While best practices and regulations guide data management and protection, recent events have underscored its vulnerability. Academic research and commercial data handling have been marred by scandals, revealing the brittleness of data management. Data, despite its importance, is susceptible to undue disclosures, leaks, losses, manipulation, or fabrication. These incidents often occur without visibility or accountability, necessitating a systematic structure for safe, honest, and auditable data management. In this paper, we introduce the concept of Honest Computing as the practice and approach that emphasizes transparency, integrity, and ethical behaviour within the realm of computing and technology. It ensures that computer systems and software operate honestly and reliably without hidden agendas, biases, or unethical practices. It enables privacy and confidentiality of data and code by design and by default. We also introduce a reference framework to achieve demonstrable data lineage and provenance, contrasting i