Search - ResearchTracker

Viral Images: Identifying Reprintings within 1.5 Million Photographs in Chronicling America

Within the millions of digitized historic American newspapers in the Chronicling America initiative are tens of millions of photographs, illustrations, cartoons, and advertisements. Much of this visual culture is shared across newspaper titles and issues. Just as reprinted texts within these newspapers speak to the virality of textual content, so too does this reprinted visual culture speak to newspapers as sites of constant information circulation and exchange. In this paper, we introduce Viral Images, a project to identify reprintings within 1.5 million photographs in Chronicling America. For our analysis, we adopt the Newspaper Navigator dataset of extracted photographs from over 16 million pages in Chronicling America. We introduce an unsupervised method of identifying reprintings by leveraging contrastive language-image pretraining (CLIP) to embed these 1.5 million photographs and applying clustering to identify reprinted content. We detail our public interface, https://viral-images.org, which we designed to enable humanists to interactively browse and study these identified clusters. In addition, we analyze the identified clusters, uncovering a diversity of photogra

ks-lit-3m: A 3.1 million word kashmiri text dataset for large language model pretraining

arXiv, 2026-01-03. Author: Haq Nawaz Malik

Large Language Models (LLMs) demonstrate remarkable fluency across high-resource languages yet consistently fail to generate coherent text in Kashmiri, a language spoken by approximately seven million people. This performance disparity stems not from inherent model limitations but from a critical scarcity of high-quality training data. Decades of Kashmiri literature remain inaccessible to modern NLP pipelines due to their encoding in the proprietary InPage desktop publishing format. This paper introduces KS-LIT-3M, a curated corpus of 3.1 million words (16.4 million characters) specifically designed for pretraining language models on Kashmiri. The dataset is structured as a single continuous linear text stream, optimized for causal language model training where models learn to predict subsequent tokens from preceding context. The corpus was constructed through the development of a specialized InPage-to-Unicode converter, followed by rigorous preprocessing including English contamination removal, character normalization, and quality validation. Encompassing 131,607 unique words drawn from diverse genres including literary works, journalistic writing, academic texts, and religious sc

Search results: million

Viral Images: Identifying Reprintings within 1.5 Million Photographs in Chronicling America

ks-lit-3m: A 3.1 million word kashmiri text dataset for large language model pretraining

Wonderboom -- Efficient, and Censorship-Resilient Signature Aggregation for Million Scale Consensus

Large-scale artificial intelligence with 41 million nanophotonic neurons on a metasurface

Stellar Parameters for over Fifty Million stars from SMSS DR4 and Gaia DR3

Non-Termination Proving: 100 Million LoC and Beyond

Millions of Main-Sequence Binary Stars from Gaia BP/RP Spectra

ks-pret-5m: a 5 million word, 12 million token kashmiri pretraining dataset

MILLION: Mastering Long-Context LLM Inference Via Outlier-Immunized KV Product Quantization

Identifying galaxies, quasars, and stars with machine learning: A new catalogue of classifications for 111 million SDSS sources without spectra

Efficient parallel algorithms for free-energy calculation of millions of water molecules in the fluid phases

A robot-assisted pipeline to rapidly scan 1.7 million historical aerial photographs

GovScape: A Public Multimodal Search System for 70 Million Pages of Government PDFs

Solving FDR-Controlled Sparse Regression Problems with Five Million Variables on a Laptop

MolTextNet: A Two-Million Molecule-Text Dataset for Multimodal Molecular Learning

Mixture of A Million Experts

Magnetized compressible turbulence with a fluctuation dynamo and Reynolds numbers over a million

MegaWika: Millions of reports and their sources across 50 diverse languages

PENTACET data -- 23 Million Contextual Code Comments and 250,000 SATD comments

The Million Quasars (Milliquas) Catalogue, v8