A crowd density forecasting task aims to predict how the crowd density map will change in the future from observed past crowd density maps. However, the past crowd density maps are often incomplete due to the miss-detection of pedestrians, and it is crucial to develop a robust crowd density forecasting model against the miss-detection. This paper presents a MAsked crowd density Completion framework for crowd density forecasting (CrowdMAC), which is simultaneously trained to forecast future crowd density maps from partially masked past crowd density maps (i.e., forecasting maps from past maps with miss-detection) while reconstructing the masked observation maps (i.e., imputing past maps with miss-detection). Additionally, we propose Temporal-Density-aware Masking (TDM), which non-uniformly masks tokens in the observed crowd density map, considering the sparsity of the crowd density maps and the informativeness of the subsequent frames for the forecasting task. Moreover, we introduce multi-task masking to enhance training efficiency. In the experiments, CrowdMAC achieves state-of-the-art performance on seven large-scale datasets, including SDD, ETH-UCY, inD, JRDB, VSCrowd, FDST, and
Modeling and reproducing crowd behaviors are important in various domains including psychology, robotics, transport engineering and virtual environments. Conventional methods have focused on synthesizing momentary scenes, which have difficulty in replicating the continuous nature of real-world crowds. In this paper, we introduce a novel method for automatically generating continuous, realistic crowd trajectories with heterogeneous behaviors and interactions among individuals. We first design a crowd emitter model. To do this, we obtain spatial layouts from single input images, including a segmentation map, appearance map, population density map and population probability, prior to crowd generation. The emitter then continually places individuals on the timeline by assigning independent behavior characteristics such as agents' type, pace, and start/end positions using diffusion models. Next, our crowd simulator produces their long-term locomotions. To simulate diverse actions, it can augment their behaviors based on a Markov chain. As a result, our overall framework populates the scenes with heterogeneous crowd behaviors by alternating between the proposed emitter and simulator. Not
We show how the quality of decisions based on the aggregated opinions of the crowd can be conveniently studied using a sample of individual responses to a standard IQ questionnaire. We aggregated the responses to the IQ questionnaire using simple majority voting and a machine learning approach based on a probabilistic graphical model. The score for the aggregated questionnaire, Crowd IQ, serves as a quality measure of decisions based on aggregating opinions, which also allows quantifying individual and crowd performance on the same scale. We show that Crowd IQ grows quickly with the size of the crowd but saturates, and that for small homogeneous crowds the Crowd IQ significantly exceeds the IQ of even their most intelligent member. We investigate alternative ways of aggregating the responses and the impact of the aggregation method on the resulting Crowd IQ. We also discuss Contextual IQ, a method of quantifying the individual participant's contribution to the Crowd IQ based on the Shapley value from cooperative game theory.
Occlusion is one of the fundamental challenges in crowd counting. In the community, various data-driven approaches have been developed to address this issue, yet their effectiveness is limited. This is mainly because most existing crowd counting datasets on which the methods are trained are based on passive cameras, restricting their ability to fully sense the environment. Recently, embodied navigation methods have shown significant potential in precise object detection in interactive scenes. These methods incorporate active camera settings, holding promise in addressing the fundamental issues in crowd counting. However, most existing methods are designed for indoor navigation, showing unknown performance in analyzing complex object distribution in large scale scenes, such as crowds. Besides, most existing embodied navigation datasets are indoor scenes with limited scale and object quantity, preventing them from being introduced into dense crowd analysis. Based on this, a novel task, Embodied Crowd Counting (ECC), is proposed. We first build up an interactive simulator, Embodied Crowd Counting Dataset (ECCD), which enables large scale scenes and large object quantity. A prior proba
Traditionally, the term crowd was used almost exclusively in the context of people who self-organized around a common purpose, emotion or experience. Today, however, firms often refer to crowds in discussions of how collections of individuals can be engaged for organizational purposes. Crowdsourcing, the use of information technologies to outsource business responsibilities to crowds, can now significantly influence a firms ability to leverage previously unattainable resources to build competitive advantage. Nonetheless, many managers are hesitant to consider crowdsourcing because they do not understand how its various types can add value to the firm. In response, we explain what crowdsourcing is, the advantages it offers and how firms can pursue crowdsourcing. We begin by formulating a crowdsourcing typology and show how its four categories (crowd-voting, micro-task, idea and solution crowdsourcing) can help firms develop crowd capital, an organizational-level resource harnessed from the crowd. We then present a three-step process model for generating crowd capital. Step one includes important considerations that shape how a crowd is to be constructed. Step two outlines the capabi
This study enhances a crowd density estimation algorithm originally designed for image-based analysis by adapting it for video-based scenarios. The proposed method integrates a denoising probabilistic model that utilizes diffusion processes to generate high-quality crowd density maps. To improve accuracy, narrow Gaussian kernels are employed, and multiple density map outputs are generated. A regression branch is incorporated into the model for precise feature extraction, while a consolidation mechanism combines these maps based on similarity scores to produce a robust final result. An event-driven sampling technique, utilizing the Farneback optical flow algorithm, is introduced to selectively capture frames showing significant crowd movements, reducing computational load and storage by focusing on critical crowd dynamics. Through qualitative and quantitative evaluations, including overlay plots and Mean Absolute Error (MAE), the model demonstrates its ability to effectively capture crowd dynamics in both dense and sparse settings. The efficiency of the sampling method is further assessed, showcasing its capability to decrease frame counts while maintaining essential crowd events. B
Crowd simulation is a research area widely used in diverse fields, including gaming and security, assessing virtual agent movements through metrics like time to reach their goals, speed, trajectories, and densities. This is relevant for security applications, for instance, as different crowd configurations can determine the time people spend in environments trying to evacuate them. In this work, we extend WebCrowds, an authoring tool for crowd simulation, to allow users to build scenarios and evaluate them through a set of metrics. The aim is to provide a quantitative metric that can, based on simulation data, select the best crowd configuration in a certain environment. We conduct experiments to validate our proposed metric in multiple crowd simulation scenarios and perform a comparison with another metric found in the literature. The results show that experts in the domain of crowd scenarios agree with our proposed quantitative metric.
Detection-based methods have been viewed unfavorably in crowd analysis due to their poor performance in dense crowds. However, we argue that the potential of these methods has been underestimated, as they offer crucial information for crowd analysis that is often ignored. Specifically, the area size and confidence score of output proposals and bounding boxes provide insight into the scale and density of the crowd. To leverage these underutilized features, we propose Crowd Hat, a plug-and-play module that can be easily integrated with existing detection models. This module uses a mixed 2D-1D compression technique to refine the output features and obtain the spatial and numerical distribution of crowd-specific information. Based on these features, we further propose region-adaptive NMS thresholds and a decouple-then-align paradigm that address the major limitations of detection-based methods. Our extensive evaluations on various crowd analysis tasks, including crowd counting, localization, and detection, demonstrate the effectiveness of utilizing output features and the potential of detection-based methods in crowd analysis.
In crowd behavior understanding, a model of crowd behavior need to be trained using the information extracted from video sequences. Since there is no ground-truth available in crowd datasets except the crowd behavior labels, most of the methods proposed so far are just based on low-level visual features. However, there is a huge semantic gap between low-level motion/appearance features and high-level concept of crowd behaviors. In this paper we propose an attribute-based strategy to alleviate this problem. While similar strategies have been recently adopted for object and action recognition, as far as we know, we are the first showing that the crowd emotions can be used as attributes for crowd behavior understanding. The main idea is to train a set of emotion-based classifiers, which can subsequently be used to represent the crowd motion. For this purpose, we collect a big dataset of video clips and provide them with both annotations of "crowd behaviors" and "crowd emotions". We show the results of the proposed method on our dataset, which demonstrate that the crowd emotions enable the construction of more descriptive models for crowd behaviors. We aim at publishing the dataset wit
Most state-of-the-art crowd counting methods use color (RGB) images to learn the density map of the crowd. However, these methods often struggle to achieve higher accuracy in densely crowded scenes with poor illumination. Recently, some studies have reported improvement in the accuracy of crowd counting models using a combination of RGB and thermal images. Although multimodal data can lead to better predictions, multimodal data might not be always available beforehand. In this paper, we propose the use of generative adversarial networks (GANs) to automatically generate thermal infrared (TIR) images from color (RGB) images and use both to train crowd counting models to achieve higher accuracy. We use a Pix2Pix GAN network first to translate RGB images to TIR images. Our experiments on several state-of-the-art crowd counting models and benchmark crowd datasets report significant improvement in accuracy.
Numerous studies and anecdotes demonstrate the "wisdom of the crowd," the surprising accuracy of a group's aggregated judgments. Less is known, however, about the generality of crowd wisdom. For example, are crowds wise even if their members have systematic judgmental biases, or can influence each other before members render their judgments? If so, are there situations in which we can expect a crowd to be less accurate than skilled individuals? We provide a precise but general definition of crowd wisdom: A crowd is wise if a linear aggregate, for example a mean, of its members' judgments is closer to the target value than a randomly, but not necessarily uniformly, sampled member of the crowd. Building on this definition, we develop a theoretical framework for examining, a priori, when and to what degree a crowd will be wise. We systematically investigate the boundary conditions for crowd wisdom within this framework and determine conditions under which the accuracy advantage for crowds is maximized. Our results demonstrate that crowd wisdom is highly robust: Even if judgments are biased and correlated, one would need to nearly deterministically select only a highly skilled judge be
Crowd movement simulation is crucial for pedestrian safety management and facility design. Data-driven models offer the potential to improve realism and predictive accuracy, but most are developed for a single scenario, limiting their flexibility. We propose a data-driven crowd simulation model that incorporates refined visual-information extraction and explicit exit cues, aiming to improve flexibility across multiple scenarios by more effectively capturing core navigational features. The model is tested on four fundamental modules (bottleneck, corridor, corner, and T-junction) and further evaluated in a composite scenario using a modular approach. Results show that our model performs well across these scenarios, aligning with pedestrian movement in real-world experiments, and outperforms the classical knowledge-driven model in these scenarios. The research outcomes can provide inspiration for the development of data-driven crowd simulation models and advance the application of data-driven approaches.
Dense pedestrian crowds may pose significant safety risks, yet their underlying dynamics remain insufficiently understood to reliably prevent accidents. In these environments, physical interactions and contact forces fundamentally shape the dynamics of the crowd. However, accurately describing these interindividual interactions requires specific modeling and analytical approaches. This chapter reviews paradigms and models used to represent pedestrian dynamics in various contexts, highlighting the transition from classical approaches to models tailored for dense crowd conditions. We argue that further investigation is needed, featuring new experimental studies and new modeling paradigms, to better capture the complex dynamics that emerge in high-density situations.
Multi-modal crowd counting involves estimating crowd density from both visual and thermal/depth images. This task is challenging due to the significant gap between these distinct modalities. In this paper, we propose a novel approach by introducing an auxiliary broker modality and on this basis frame the task as a triple-modal learning problem. We devise a fusion-based method to generate this broker modality, leveraging a non-diffusion, lightweight counterpart of modern denoising diffusion-based fusion models. Additionally, we identify and address the ghosting effect caused by direct cross-modal image fusion in multi-modal crowd counting. Through extensive experimental evaluations on popular multi-modal crowd-counting datasets, we demonstrate the effectiveness of our method, which introduces only 4 million additional parameters, yet achieves promising results. The code is available at https://github.com/HenryCilence/Broker-Modality-Crowd-Counting.
In recent years, vision-based crowd analysis has been studied extensively due to its practical applications in real world. In this paper, we formulate a novel crowd analysis problem, in which we aim to predict the crowd distribution in the near future given sequential frames of a crowd video without any identity annotations. Studying this research problem will benefit applications concerned with forecasting crowd dynamics. To solve this problem, we propose a global-residual two-stream recurrent network, which leverages the consecutive crowd video frames as inputs and their corresponding density maps as auxiliary information to predict the future crowd distribution. Moreover, to strengthen the capability of our network, we synthesize scene-specific crowd density maps using simulated data for pretraining. Finally, we demonstrate that our framework is able to predict the crowd distribution for different crowd scenarios and we delve into applications including predicting future crowd count, forecasting high-density region, etc.
Understanding crowd behavior in video is challenging for computer vision. There have been increasing attempts on modeling crowded scenes by introducing ever larger property ontologies (attributes) and annotating ever larger training datasets. However, in contrast to still images, manually annotating video attributes needs to consider spatiotemporal evolution which is inherently much harder and more costly. Critically, the most interesting crowd behaviors captured in surveillance videos (e.g., street fighting, flash mobs) are either rare, thus have few examples for model training, or unseen previously. Existing crowd analysis techniques are not readily scalable to recognize novel (unseen) crowd behaviors. To address this problem, we investigate and develop methods for recognizing visual crowd behavioral attributes without any training samples, i.e., zero-shot learning crowd behavior recognition. To that end, we relax the common assumption that each individual crowd video instance is only associated with a single crowd attribute. Instead, our model learns to jointly recognize multiple crowd behavioral attributes in each video instance by exploring multiattribute cooccurrence as conte
Multi-modal crowd counting is a crucial task that uses multi-modal cues to estimate the number of people in crowded scenes. To overcome the gap between different modalities, we propose a modal emulation-based two-pass multi-modal crowd-counting framework that enables efficient modal emulation, alignment, and fusion. The framework consists of two key components: a \emph{multi-modal inference} pass and a \emph{cross-modal emulation} pass. The former utilizes a hybrid cross-modal attention module to extract global and local information and achieve efficient multi-modal fusion. The latter uses attention prompting to coordinate different modalities and enhance multi-modal alignment. We also introduce a modality alignment module that uses an efficient modal consistency loss to align the outputs of the two passes and bridge the semantic gap between modalities. Extensive experiments on both RGB-Thermal and RGB-Depth counting datasets demonstrate its superior performance compared to previous methods. Code available at https://github.com/Mr-Monday/Multi-modal-Crowd-Counting-via-Modal-Emulation.
We propose a novel crowd counting model that maps a given crowd scene to its density. Crowd analysis is compounded by myriad of factors like inter-occlusion between people due to extreme crowding, high similarity of appearance between people and background elements, and large variability of camera view-points. Current state-of-the art approaches tackle these factors by using multi-scale CNN architectures, recurrent networks and late fusion of features from multi-column CNN with different receptive fields. We propose switching convolutional neural network that leverages variation of crowd density within an image to improve the accuracy and localization of the predicted crowd count. Patches from a grid within a crowd scene are relayed to independent CNN regressors based on crowd count prediction quality of the CNN established during training. The independent CNN regressors are designed to have different receptive fields and a switch classifier is trained to relay the crowd scene patch to the best CNN regressor. We perform extensive experiments on all major crowd counting datasets and evidence better performance compared to current state-of-the-art methods. We provide interpretable re
It is important to monitor and analyze crowd events for the sake of city safety. In an EDOF (extended depth of field) image with a crowded scene, the distribution of people is highly imbalanced. People far away from the camera look much smaller and often occlude each other heavily, while people close to the camera look larger. In such a case, it is difficult to accurately estimate the number of people by using one technique. In this paper, we propose a Depth Information Guided Crowd Counting (DigCrowd) method to deal with crowded EDOF scenes. DigCrowd first uses the depth information of an image to segment the scene into a far-view region and a near-view region. Then Digcrowd maps the far-view region to its crowd density map and uses a detection method to count the people in the near-view region. In addition, we introduce a new crowd dataset that contains 1000 images. Experimental results demonstrate the effectiveness of our DigCrowd method
Indoor venues accommodate many people who collectively form crowds. Such crowds in turn influence people's routing choices, e.g., people may prefer to avoid crowded rooms when walking from A to B. This paper studies two types of crowd-aware indoor path planning queries. The Indoor Crowd-Aware Fastest Path Query (FPQ) finds a path with the shortest travel time in the presence of crowds, whereas the Indoor Least Crowded Path Query (LCPQ) finds a path encountering the least objects en route. To process the queries, we design a unified framework with three major components. First, an indoor crowd model organizes indoor topology and captures object flows between rooms. Second, a time-evolving population estimator derives room populations for a future timestamp to support crowd-aware routing cost computations in query processing. Third, two exact and two approximate query processing algorithms process each type of query. All algorithms are based on graph traversal over the indoor crowd model and use the same search framework with different strategies of updating the populations during the search process. All proposals are evaluated experimentally on synthetic and real data. The experimen