20 results found
The digitization of displaced archives is of great historical and cultural significance. Through the construction of digital humanities platforms represented by the MISS platform, and the comprehensive application of IIIF technology, knowledge graph technology, ontology technology, and other popular information technologies, we find that the digital framework of displaced archives built through the MISS platform can promote the establishment of a standardized cooperation and dialogue mechanism between the archives authorities and other government departments. At the same time, it can embed the work of archives in the construction of digital government and the economy, promote the exploration of the integration of archives management, data management, and information resource management, and ultimately promote the construction of a digital society. By fostering a new partnership between archives departments and enterprises, think tanks, research institutes, and industry associations, the role of multiple social subjects in the modernization process of the archives governance system and governance capacity will be brought into play. The National Archives Administration has launched a special oper
The creation of open archives, i.e. archives where access is regulated by open licensing models (content, source, data), should be seen as part of a broader socio-economic phenomenon that finds legal expression in specific organizational and technical formats. This paper examines the origins and main characteristics of the open archives phenomenon. We investigate the extent to which different models of production of economic or social value can be expressed in different forms of licensing in the context of open archives. Through this process, we assess the extent to which the digital archive is moving towards providing access that is deeper (meaning that it offers more access rights) and wider (in the sense that most of the information given is under open content licensing) or faces a gradual stratification and polarization of the content. Such stratification entails the emergence of two types of content: content to which access is extremely limited and content to which access remains completely open. This differentiation between classes of content is the result of multiple factors: from purely legislative, administrative and contractual restrictions (e.g. data protection and confidentiali
Digitization of historical records has produced a significant amount of data for analysis and interpretation. A critical challenge is the ability to relate historical information across different archives to allow for the data to be framed in the appropriate historical context. This paper presents a real-world case study on historical information integration and record matching with the goal to improve the historical value of archives containing data in the period 1800 to 1920. The archives contain unique information about Métis and Indigenous people in Canada and interactions with European settlers. The archives contain thousands of records that have increased relevance when relationships and interconnections are discovered. The contribution is a record linking approach suitable for historical archives and an evaluation of its effectiveness. Experimental results demonstrate potential for discovering historical linkage with high precision enabling new historical discoveries.
The article examines the theoretical, methodological, and technical foundations of research on audiovisual corpora within the field of digital humanities. It outlines the main transversal issues underlying the processes of constructing, exploiting, and interpreting such corpora, which are conceived as specific forms of textual data in the broad sense - that is, as sets of semiotic traces (written, visual, sound, or multimodal) that make it possible to document, analyze, and transmit domains of knowledge. The analysis is organized around five complementary themes. The first concerns the status and structure of textual data lato sensu: any data, regardless of its medium, participates in a meaningful representation of a domain and therefore requires a unified theoretical and methodological framework based on a transdisciplinary semiotic approach. The second theme addresses the documentary value of data and corpora, understood as the relevance of materials for documenting a research object in relation to the goals and perspectives of the projects in which they are used. This value depends both on provenance and reasoned selection, and on the pragmatic context of their use. The third th
The IANEC project (Investigation of Digital Archives of Contemporary Writers), led by the GREYC Research Lab and funded by the French Ministry of Culture, aims to develop dedicated digital forensic investigation tools to automate the analysis of archival corpora from the Institut Mémoires de l'Édition Contemporaine (IMEC). The project is based on the observation that born-digital archival materials are increasingly prevalent in contemporary archival institutions, and that digital forensics technologies have become essential for the extraction, identification, processing, and description of natively digital archival corpora.
The digital transformation is turning archives, both old and new, into data. As a consequence, automation in the form of artificial intelligence techniques is increasingly applied both to scale traditional recordkeeping activities, and to experiment with novel ways to capture, organise and access records. We survey recent developments at the intersection of Artificial Intelligence and archival thinking and practice. Our overview of this growing body of literature is organised through the lenses of the Records Continuum model. We find four broad themes in the literature on archives and artificial intelligence: theoretical and professional considerations, the automation of recordkeeping processes, organising and accessing archives, and novel forms of digital archives. We conclude by underlining emerging trends and directions for future work, which include the application of recordkeeping principles to the very data and processes which power modern artificial intelligence, and a more structural, yet critically-aware, integration of artificial intelligence into archival systems and practice.
Although the Internet Archive's Wayback Machine is the largest and most well-known web archive, a number of public web archives have emerged in the last several years. With varying resources, audiences and collection development policies, these archives have varying levels of overlap with each other. While individual archives can be measured in terms of number of URIs, number of copies per URI, and intersection with other archives, to date there has been no answer to the question "How much of the Web is archived?" We study the question by approximating the Web using sample URIs from DMOZ, Delicious, Bitly, and search engine indexes, and counting the number of copies of the sample URIs that exist in various public web archives. Each sample set provides its own bias. The results from our sample sets indicate that 35%-90% of the Web has at least one archived copy, 17%-49% has between 2-5 copies, 1%-8% has 6-10 copies, and 8%-63% has more than 10 copies in public web archives. The number of URI copies varies as a function of time, but no more than 31.3% of URIs are archived more than once per month.
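The tallying step described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `bucket` and `coverage` and the input format (one archived-copy count per sampled URI) are assumptions made for illustration, and only the bucket boundaries follow the ranges reported in the abstract.

```python
from collections import Counter

# Bucket labels mirror the copy-count ranges quoted in the abstract.
LABELS = ("0", "1", "2-5", "6-10", ">10")


def bucket(copies: int) -> str:
    """Map a number of archived copies to its reporting bucket."""
    if copies <= 0:
        return "0"
    if copies == 1:
        return "1"
    if copies <= 5:
        return "2-5"
    if copies <= 10:
        return "6-10"
    return ">10"


def coverage(copy_counts):
    """Return the percentage of sampled URIs that falls in each bucket.

    copy_counts holds one integer per sampled URI (hypothetical input format).
    """
    total = len(copy_counts)
    if total == 0:
        raise ValueError("need at least one sampled URI")
    tally = Counter(bucket(c) for c in copy_counts)
    shares = {label: 100.0 * tally.get(label, 0) / total for label in LABELS}
    # "At least one copy" is everything that is not in the zero bucket.
    shares["at least 1"] = 100.0 - shares["0"]
    return shares
```

Running `coverage` once per sample set (DMOZ, Delicious, Bitly, search engine indexes) would give one figure per set and bucket, which is how a range such as 35%-90% can arise.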
Colonial archives are at the center of increased interest from a variety of perspectives, as they contain traces of historically marginalized people. Unfortunately, like most archives, they remain difficult to access due to significant persisting barriers. We focus here on one of them: the biases to be found in historical finding aids, such as indexes of person names, which remain in use to this day. In colonial archives, indexes can perpetuate silences by omitting mentions of historically marginalized persons. In order to overcome such limitations and pluralize the scope of existing finding aids, we propose using automated entity recognition. To this end, we contribute a fit-for-purpose annotation typology and apply it to the colonial archive of the Dutch East India Company (VOC). We release a corpus of nearly 70,000 annotations as a shared task, for which we provide baselines using state-of-the-art neural network models. Our work intends to stimulate further contributions in the direction of broadening access to (colonial) archives, integrating automation as a possible means to this end.
Screenshots of social media posts are a common approach for information sharing. Unfortunately, before sharing a screenshot, users rarely verify whether the attribution of the post is fake or real. There are numerous legitimate reasons to share screenshots. However, sharing screenshots of social media posts is also a vector for mis-/disinformation spread on social media. We are exploring methods to verify the attribution of a social media post shown in a screenshot, using resources found on the live web and in web archives. We focus on the use of web archives, since the attribution of non-deleted posts can be relatively easily verified using the live web. We show how information from a Twitter screenshot (Twitter handle, timestamp, and tweet text) can be extracted and used for locating potential archived tweets in the Internet Archive's Wayback Machine. We evaluate our method on a dataset of 1,571 single tweet screenshots.
As the volume and complexity of nonclinical toxicology studies continue to increase, toxicologic pathology reporting faces persistent challenges, including fragmented sources of data (e.g., histopathology images, clinical pathology and other study data, adverse effects database, mechanistic literature), variable reporting timelines and heightened regulatory expectations. This white paper examines the emerging role of agentic artificial intelligence (AI) in addressing these issues through coordinated workflow orchestration, data integration, and pathologist-in-the-loop report generation. Based on a closed-door roundtable held during the 2025 Society of Toxicologic Pathology (STP) Annual Meeting and follow-on discussions, this paper synthesizes the perspectives of leading toxicologic pathologists, toxicologists, and AI developers. It outlines the key pain points in current reporting workflows, identifies realistic near-term use cases for agentic AI, and describes major adoption barriers including requirements for transparency, validation, and organizational readiness. A phased adoption roadmap and pilot design considerations are proposed to help support responsible evaluation and dep
This paper presents a quasi-sequential optimal design framework for toxicology experiments, specifically applied to sea urchin embryos. The authors propose a novel approach combining robust optimal design with adaptive, stage-based testing to improve efficiency in toxicological studies, particularly where traditional uniform designs fall short. The methodology uses statistical models to refine dose levels across experimental phases, aiming for increased precision while reducing costs and complexity. Key components include selecting an initial design, iterative dose optimization based on preliminary results, and assessing various model fits to ensure robust, data-driven adjustments. Through case studies, we demonstrate improved statistical efficiency and adaptability in toxicology, with potential applications in other experimental domains.
A number of serious reasons will convince an increasing number of researchers to store their relevant material in centers which we will call "language resource archives". They combine the duty of taking care of long-term preservation with the task of giving different user groups access to their material. Access here is meant in the sense that an active interaction with the data will be made possible to support the integration of new data, new versions or commentaries of all sorts. Modern Language Resource Archives will have to adhere to a number of basic principles to fulfill all requirements, and they will have to be involved in federations to create joint language resource domains, making it even simpler for researchers to access the data. This paper makes an attempt to formulate the essential pillars language resource archives have to adhere to.
In recent years, journalists and other researchers have used web archives as an important resource for their study of disinformation. This paper provides several examples of this use and also brings together some of the work that the Old Dominion University Web Science and Digital Libraries (WS-DL) research group has done in this area. We will show how web archives have been used to investigate changes to webpages, study archived social media including deleted content, and study known disinformation that has been archived.
We document the creation of a data set of 16,627 archived web pages, or mementos, of 3,698 unique live web URIs (Uniform Resource Identifiers) from 17 public web archives. We used four different methods to collect the dataset. First, we used the Los Alamos National Laboratory (LANL) Memento Aggregator to collect mementos of an initial set of URIs obtained from four sources: (a) the Moz Top 500, (b) the dataset used in our previous study, (c) the HTTP Archive, and (d) the Web Archives for Historical Research group. Second, we extracted URIs from the HTML of already collected mementos. These URIs were then used to look up mementos in LANL's aggregator. Third, we downloaded web archives' published lists of URIs of both original pages and their associated mementos. Fourth, we collected more mementos from archives that support the Memento protocol by requesting TimeMaps directly from archives, not through the Memento aggregator. Finally, we downsampled the collected mementos to 16,627 due to our constraints of a maximum of 1,600 mementos per archive and being able to download all mementos from each archive in less than 40 hours.
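The final downsampling step, with its cap on mementos per archive, can be illustrated with a short Python sketch. The abstract does not say how mementos were chosen within an archive, so uniform random sampling is an assumption here, as are the function name `downsample` and the `(archive, memento_uri)` tuple format. The 40-hour download constraint is not modelled.

```python
import random
from collections import defaultdict


def downsample(mementos, cap, seed=0):
    """Keep at most `cap` mementos per archive.

    mementos: iterable of (archive_name, memento_uri) pairs (assumed format).
    Selection within an over-full archive is uniform random, which is an
    assumption and not something the abstract specifies.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    by_archive = defaultdict(list)
    for archive, uri in mementos:
        by_archive[archive].append(uri)

    kept = []
    for archive in sorted(by_archive):
        uris = by_archive[archive]
        chosen = uris if len(uris) <= cap else rng.sample(uris, cap)
        kept.extend((archive, uri) for uri in chosen)
    return kept
```

With `cap=1600`, archives holding fewer mementos than the cap are kept whole, and larger ones are trimmed to exactly 1,600.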
The geometry of metabolic trajectories is characteristic of the biological response (Keun, Ebbels et al. 2004). Yet, due to unavoidable inter-individual variations, the exact trajectories characterising the biological responses differ. We examined whether the differences seen between metabolic trajectories of a specific treatment correspond to the variations seen in the other biological manifestations of the same treatment. Differences in trajectories were measured via alignment procedures which were introduced and implemented in this study. Our study revealed a strong correlation between the scales of the aligned trajectories of metabolic responses and the severity of the hepatocellular lesions induced after administration of hydrazine. Thus the results confirm that aligned trajectories are characteristic of a specific treatment. They can then be used for comparison with other treatment-specific or unknown metabolic trajectories and can have many metabonomic applications such as preclinical toxicological screening
Phosphorus (P) is considered to be one of the key elements for life, making it an important element to look for in the abundance analysis of spectra of stellar systems. Yet, there exist only a handful of spectroscopic studies that estimate P abundances and investigate their trend across a range of metallicities. We have observed full HK band spectra at a spectral resolving power of R=45,000 with the IGRINS instrument. Abundances are determined using SME in combination with 1D MARCS stellar atmosphere models. The investigated sample of stars has reliable stellar parameters estimated using optical FIES spectra (GILD; Jönsson et al. in prep.). In order to determine the P abundances from the 16482.92 Angstrom P line, we take special care of the CO($ν=7-4$) blend. We determine the C, N, O abundances from atomic carbon and a range of non-blended molecular lines (CO, CN, OH) which are aplenty in the H band region of K giant stars, assuring an appropriate modelling of the blending CO($ν=7-4$) line. We present the [P/Fe] vs [Fe/H] trend for 38 K giant stars in the metallicity range of -1.2 dex $<$ [Fe/H] $<$ 0.4 dex. We find that our trend matches well with the compiled literature sample of
Acute poly-substance intoxication requires rapid, life-saving decisions under substantial uncertainty, as clinicians must rely on incomplete ingestion details and nonspecific symptoms. Effective diagnostic reasoning in this chaotic environment requires fusing unstructured, non-medical narratives (e.g. paramedic scene descriptions and unreliable patient self-reports or known histories), with structured medical data like vital signs. While Large Language Models (LLMs) show potential for processing such heterogeneous inputs, they struggle in this setting, often underperforming simple baselines that rely solely on patient histories. To address this, we present DeToxR (Decision-support for Toxicology with Reasoning), the first adaptation of Reinforcement Learning (RL) to emergency toxicology. We design a robust data-fusion engine for multi-label prediction across 14 substance classes based on an LLM finetuned with Group Relative Policy Optimization (GRPO). We optimize the model's reasoning directly using a clinical performance reward. By formulating a multi-label agreement metric as the reward signal, the model is explicitly penalized for missing co-ingested substances and hallucinating
We present ARCHANGEL, a decentralised platform for ensuring the long-term integrity of digital documents stored within public archives. Document integrity is fundamental to public trust in archives. Yet currently that trust is built upon institutional reputation: trust at face value in a centralised authority, like a national government archive or university. ARCHANGEL proposes a shift to a technological underscoring of that trust, using distributed ledger technology (DLT) to cryptographically guarantee the provenance, immutability and so the integrity of archived documents. We describe the ARCHANGEL architecture, and report on a prototype of that architecture built on the Ethereum infrastructure. We report early evaluation and feedback on ARCHANGEL from stakeholders in the research data archives space.
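The general idea of anchoring document integrity in a ledger can be shown with a toy Python sketch. This is not the ARCHANGEL architecture or its Ethereum prototype: the class `ToyLedger` is hypothetical, an in-memory dictionary stands in for a distributed ledger, and SHA-256 is an assumed choice of digest.

```python
import hashlib


class ToyLedger:
    """In-memory stand-in for a distributed ledger (illustration only)."""

    def __init__(self):
        self._records = {}  # document id -> hex digest recorded at deposit

    def register(self, doc_id: str, content: bytes) -> str:
        """Record the digest of a document once; re-registration is refused."""
        if doc_id in self._records:
            raise KeyError(f"{doc_id} is already registered")
        digest = hashlib.sha256(content).hexdigest()
        self._records[doc_id] = digest
        return digest

    def verify(self, doc_id: str, content: bytes) -> bool:
        """Return True only if the content matches the recorded digest."""
        recorded = self._records.get(doc_id)
        return recorded == hashlib.sha256(content).hexdigest()
```

An archive would call `register` at deposit time and `verify` whenever a document is served. In a real deployment, the write-once property would come from the ledger's consensus mechanism, not from a local check as in this sketch.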
Web archives are a historically valuable source of information. In some respects, web archives are the only record of the evolution of human society in the last two decades. They preserve a mix of personal and collective memories, the importance of which tends to grow as they age. However, the value of web archives depends on their users being able to search and access the information they require in efficient and effective ways. Without the possibility of exploring and exploiting the archived contents, web archives are useless. Web archive access functionalities range from basic browsing to advanced search and analytical services, accessed through user-friendly interfaces. Full-text and URL search have become the predominant and preferred forms of information discovery in web archives, fulfilling user needs and supporting search APIs that feed complex applications. Both full-text and URL search are based on the technology developed for modern web search engines, since the Web is the main resource targeted by both systems. However, while web search engines enable searching over the most recent web snapshot, web archives enable searching over multiple snapshots from the past. This m
The field of web archiving provides a unique mix of human and automated agents collaborating to achieve the preservation of the web. Centuries-old theories of archival appraisal are being transplanted into the sociotechnical environment of the World Wide Web with varying degrees of success. The work of archivists and bots in contact with the material of the web presents a distinctive and understudied CSCW-shaped problem. To investigate this space we conducted semi-structured interviews with archivists and technologists who were directly involved in the selection of content from the web for archives. These semi-structured interviews identified thematic areas that inform the appraisal process in web archives, some of which are encoded in heuristics and algorithms. Making the infrastructure of web archives legible to the archivist, the automated agents and the future researcher is presented as a challenge to the CSCW and archival community.