Search — ResearchTracker

The increasing prevalence of video clips has sparked growing interest in text-video retrieval. Recent advances focus on establishing a joint embedding space for text and video, relying on consistent embedding representations to compute similarity. However, the text content in existing datasets is generally short and concise, making it hard to fully describe the redundant semantics of a video. Correspondingly, a single text embedding may be less expressive to capture the video embedding and empower the retrieval. In this study, we propose a new stochastic text modeling method T-MASS, i.e., text is modeled as a stochastic embedding, to enrich text embedding with a flexible and resilient semantic range, yielding a text mass. To be specific, we introduce a similarity-aware radius module to adapt the scale of the text mass upon the given text-video pairs. Plus, we design and develop a support text regularization to further control the text mass during the training. The inference pipeline is also tailored to fully exploit the text mass for accurate retrieval. Empirical evidence suggests that T-MASS not only effectively attracts relevant text-video pairs while distancing irrelevant ones,

Text-Animator: Controllable Visual Text Video Generation

arXiv, 2024-06-25. Authors: Lin Liu, Quande Liu, Shengju Qian

Video generation is a challenging yet pivotal task in various industries, such as gaming, e-commerce, and advertising. One significant unresolved aspect within T2V is the effective visualization of text within generated videos. Despite the progress achieved in Text-to-Video (T2V) generation, current methods still cannot effectively visualize texts in videos directly, as they mainly focus on summarizing semantic scene information, understanding, and depicting actions. While recent advances in image-level visual text generation show promise, transitioning these techniques into the video domain faces problems, notably in preserving textual fidelity and motion coherence. In this paper, we propose an innovative approach termed Text-Animator for visual text video generation. Text-Animator contains a text embedding injection module to precisely depict the structures of visual text in generated videos. Besides, we develop a camera control module and a text refinement module to improve the stability of generated visual text by controlling the camera movement as well as the motion of visualized text. Quantitative and qualitative experimental results demonstrate the superiority of our approac

Search results: Text

Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval

Text-Animator: Controllable Visual Text Video Generation

$\text{C}^2\text{P}$: Featuring Large Language Models with Causal Reasoning

CORE-Text: Improving Scene Text Detection with Contrastive Relational Reasoning

DSText V2: A Comprehensive Video Text Spotting Dataset for Dense and Small Text

Scale-free Characteristics of Multilingual Legal Texts and the Limitations of LLMs

Total-Text: A Comprehensive Dataset for Scene Text Detection and Recognition

Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model

Text or Pixels? It Takes Half: On the Token Efficiency of Visual Text Inputs in Multimodal LLMs

Expressive Text-to-Image Generation with Rich Text

SwinTextSpotter v2: Towards Better Synergy for Scene Text Spotting

TextSleuth: Towards Explainable Tampered Text Detection

Leveraging machine learning for less developed languages: Progress on Urdu text detection

Text-to-Audio Generation Synchronized with Videos

A framework of text-dependent speaker verification for Chinese numerical string corpus

Text Guide: Improving the quality of long text classification by a text selection method based on feature importance

Turning a CLIP Model into a Scene Text Spotter

Handwritten and Printed Text Segmentation: A Signature Case Study

Unsupervised deep learning for text line segmentation

TextCohesion: Detecting Text for Arbitrary Shapes