Search - ResearchTracker

Web archives preserve portions of the web, but quantifying their completeness remains challenging. Prior approaches have estimated the coverage of a crawl either by comparing the outcomes of multiple crawlers or by comparing the results of a single crawl to external ground-truth datasets. We propose a method to estimate the absolute coverage of a crawl using only the archive's own longitudinal data, i.e., the data collected by multiple subsequent crawls. Our key insight is that coverage can be estimated from the empirical URL overlaps between subsequent crawls, which are in turn well described by a simple urn process. The parameters of the urn model can then be inferred from longitudinal crawl data using linear regression. Applied to our focused crawl configuration of the German Academic Web, with 15 semi-annual crawls between 2013 and 2021, we find a coverage of approximately 46 percent of the crawlable URL space for the stable crawl configuration regime. Our method is extremely simple, requires no external ground truth, and generalizes to any longitudinal focused crawl.
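The abstract does not spell out the urn model or the regression, so the following is only a minimal sketch of the general idea under a loudly stated assumption: each crawl is treated as an independent uniform draw from a fixed pool of N crawlable URLs, so that the expected overlap of crawls i and j is n_i * n_j / N. This is the classic capture-recapture setting and is not claimed to be the paper's estimator. The names `estimate_population`, `N_TRUE`, and the simulated crawls are hypothetical.

```python
import random
from itertools import combinations


def estimate_population(sizes, overlaps):
    """Estimate the pool size N from pairwise crawl overlaps.

    Assumption (not from the paper): E[overlap_ij] = n_i * n_j / N.
    A least-squares line through the origin, overlap ~ slope * (n_i * n_j),
    then gives slope as an estimate of 1 / N.
    """
    num = 0.0
    den = 0.0
    for (i, j), overlap in overlaps.items():
        x = sizes[i] * sizes[j]
        num += x * overlap
        den += x * x
    slope = num / den
    return 1.0 / slope


# Simulated data: four crawls, each drawing 4,600 URLs out of 10,000.
rng = random.Random(0)
N_TRUE = 10_000
crawls = [set(rng.sample(range(N_TRUE), 4_600)) for _ in range(4)]

sizes = [len(c) for c in crawls]
overlaps = {
    (i, j): len(crawls[i] & crawls[j])
    for i, j in combinations(range(len(crawls)), 2)
}

n_hat = estimate_population(sizes, overlaps)
coverages = [n / n_hat for n in sizes]  # estimated coverage per crawl
print(round(n_hat), [round(c, 3) for c in coverages])
```

In this toy setup the estimated coverage should land close to the simulated 46 percent. Real crawls violate the independence assumption (URLs appear and disappear, and crawlers revisit known pages), which is presumably why a fitted model over many crawls is needed.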

Craw4LLM: Efficient Web Crawling for LLM Pretraining

arXiv, 2025-02-19. Authors: Shi Yu, Zhiyuan Liu, Chenyan Xiong

Web crawls are a main source of pretraining data for large language models (LLMs), but the majority of crawled web pages are discarded in pretraining due to low data quality. This paper presents Craw4LLM, an efficient web crawling method that explores the web graph based on the preference of LLM pretraining. Specifically, it leverages the influence of a webpage in LLM pretraining as the priority score of the web crawler's scheduler, replacing the standard graph-connectivity-based priority. Our experiments on a web graph containing 900 million webpages from a commercial search engine's index demonstrate the efficiency of Craw4LLM in obtaining high-quality pretraining data. With just 21% of URLs crawled, LLMs pretrained on Craw4LLM data reach the same downstream performance as previous crawls, significantly reducing crawling waste and alleviating the burden on websites. Our code is publicly available at https://github.com/cxcscmu/Craw4LLM.
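The scheduling idea in the abstract, ordering the crawl frontier by a page-level score instead of by graph connectivity, can be illustrated with a small priority-queue sketch. It is not the Craw4LLM implementation: the toy `graph`, the `quality` scores, and the `crawl` function are hypothetical, and the sketch assumes a score is already available for every discovered URL.

```python
import heapq


def crawl(graph, seeds, score, budget):
    """Visit up to `budget` pages, always expanding the best-scoring known page."""
    # heapq is a min-heap, so scores are negated to pop the highest score first.
    frontier = [(-score(url), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score(link), link))
    return visited


# Hypothetical link graph and per-page scores (stand-ins for a learned scorer).
graph = {
    "seed": ["a", "b", "c"],
    "a": ["d"],
    "b": ["e"],
    "c": [],
    "d": [],
    "e": [],
}
quality = {"seed": 0.5, "a": 0.2, "b": 0.9, "c": 0.1, "d": 0.3, "e": 0.8}

order = crawl(graph, ["seed"], quality.get, budget=4)
print(order)  # ['seed', 'b', 'e', 'a']
```

Swapping `quality.get` for a connectivity measure such as in-degree would recover a conventional scheduler, which makes the two priorities easy to compare on the same graph.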

Search results: Crawl

Estimating Absolute Web Crawl Coverage From Longitudinal Set Intersections

Craw4LLM: Efficient Web Crawling for LLM Pretraining

Colour Contrast on the Web: A WCAG 2.1 Level AA Compliance Audit of Common Crawl's Top 500 Domains

A Scalable Crawling Algorithm Utilizing Noisy Change-Indicating Signals

Learning to crawl: Benefits and limits of centralized vs distributed control

TerraSkipper: A Centimeter-Scale Robot for Multi-Terrain Skipping and Crawling

Neural Prioritisation for Web Crawling

Excitable crawling

Beyond the Crawl: Unmasking Browser Fingerprinting in Real User Interactions

Document Quality Scoring for Web Crawling

PPL: Point Cloud Supervised Proprioceptive Locomotion Reinforcement Learning for Legged Robots in Crawl Spaces

Smart Bilingual Focused Crawling of Parallel Documents

Graph Neural Network for Crawling Target Nodes in Social Networks

A novel multi-threaded web crawling model

UnifiedCrawl: Aggregated Common Crawl for Affordable Adaptation of LLMs on Low-Resource Languages

Cybersecurity Data Extraction from Common Crawl

Quantifying Geospatial in the Common Crawl Corpus

Hopping and crawling DNA-coated colloids

The iCrawl Wizard -- Supporting Interactive Focused Crawl Specification

CC-GPX: Extracting High-Quality Annotated Geospatial Data from Common Crawl