Search - ResearchTracker

We study the Logistic Contextual Slate Bandit problem, where, at each round, an agent selects a slate of $N$ items from an exponentially large set (of size $2^{\Omega(N)}$) of candidate slates provided by the environment. A single binary reward, determined by a logistic model, is observed for the chosen slate. Our objective is to develop algorithms that maximize cumulative reward over $T$ rounds while maintaining low per-round computational costs. We propose two algorithms, Slate-GLM-OFU and Slate-GLM-TS, that accomplish this goal. These algorithms achieve $N^{O(1)}$ per-round time complexity via local planning (independent slot selections), and low regret through global learning (joint parameter estimation). We provide theoretical and empirical evidence supporting these claims. Under a well-studied diversity assumption, we prove that Slate-GLM-OFU incurs only $\tilde{O}(\sqrt{T})$ regret. Extensive experiments across a wide range of synthetic settings demonstrate that our algorithms consistently outperform state-of-the-art baselines, achieving both the lowest regret and the fastest runtime. Furthermore, we apply our algorithm to select in-context examples in prompts of Language Models
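To make the "independent slot selections" idea in the abstract above concrete, here is a minimal sketch. It rests on assumptions the abstract does not state: the slate's feature vector is the concatenation of per-slot item features, candidate slates are the product of per-slot candidate lists, and the parameter `theta_slots` is already known. It is a greedy illustration only, not Slate-GLM-OFU or Slate-GLM-TS (there is no exploration or parameter estimation), and all function names are hypothetical.

```python
import math
from itertools import product
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def slate_reward_prob(theta_slots: List[Vector], slate_features: List[Vector]) -> float:
    """Logistic reward probability; the linear score is a sum of per-slot terms (assumed)."""
    return sigmoid(sum(dot(t, x) for t, x in zip(theta_slots, slate_features)))


def select_per_slot(theta_slots: List[Vector], candidates: List[List[Vector]]) -> List[int]:
    """Pick the best item in each slot independently: cost is linear in the number of slots."""
    return [
        max(range(len(items)), key=lambda j: dot(theta, items[j]))
        for theta, items in zip(theta_slots, candidates)
    ]


def select_exhaustive(theta_slots: List[Vector], candidates: List[List[Vector]]) -> List[int]:
    """Score every slate in the product set: cost is exponential in the number of slots."""
    def prob(choice):
        return slate_reward_prob(theta_slots, [items[j] for items, j in zip(candidates, choice)])
    return list(max(product(*(range(len(items)) for items in candidates)), key=prob))


# Toy instance (made-up numbers): 2 slots, 3 candidate items per slot, 2-d features.
theta_slots = [[1.0, -0.5], [0.2, 0.9]]
candidates = [
    [[0.1, 0.4], [0.7, 0.2], [0.3, 0.9]],
    [[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]],
]
print(select_per_slot(theta_slots, candidates), select_exhaustive(theta_slots, candidates))
```

Because the sigmoid is strictly increasing and the assumed score is additive across slots, the per-slot choice matches the exhaustive one on this toy instance while touching far fewer candidates.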

Quotient DAGs for Off-Policy Evaluation: Forward-Flow Importance Sampling and Exact Slate Propensities

arXiv, 2026-05-28. Authors: Ziwen Xie, Shaowen Xiang, Hongyu He

Off-policy evaluation estimates how a target policy would perform using data collected by a different behavior policy, which is crucial when online testing is costly or risky, such as in recommendation or healthcare. Standard importance sampling reweights each logged trajectory, but it can treat details of the generation process as meaningful even when the evaluation target ignores them: for example, an autoregressive slate recommender may generate an ordered sequence of items while the reward and downstream estimator depend only on the unordered slate. This creates nuisance variance and a computational gap, since exact unordered slate propensities require summing over all generation orders. We introduce a quotient-DAG view that merges histories equivalent for evaluation and assigns weights using target-to-behavior forward-flow ratios on the merged graph. For slate recommendation under a set-sufficient next-item interface, this yields Forward-DP, a subset-DAG dynamic program that computes exact unordered propensities without factorial enumeration. The resulting propensity primitive enables practical propensity-based evaluation and model selection for context-dependent autoregressiv
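The idea of computing an unordered slate propensity without enumerating every generation order can be sketched with a dynamic program over subsets. This is an illustration under stated assumptions, not the paper's Forward-DP: the next-item probability is assumed to depend only on the unordered set of items already chosen, and the `next_item_prob` interface, the `plackett_luce` toy policy, and the weights below are all hypothetical.

```python
from itertools import combinations, permutations
from typing import Callable, Dict, FrozenSet

# Hypothetical interface: probability of emitting `item` next, given the
# unordered set `chosen` of items emitted so far.
NextItemProb = Callable[[int, FrozenSet[int]], float]


def unordered_slate_propensity(slate: FrozenSet[int], next_item_prob: NextItemProb) -> float:
    """Probability of producing `slate` in any order, via a DP over its subsets."""
    items = sorted(slate)
    k = len(items)
    # flow[mask] = total probability of having emitted exactly the subset `mask`.
    flow = [0.0] * (1 << k)
    flow[0] = 1.0
    for mask in range(1 << k):  # every superset has a larger mask, so this order is valid
        if flow[mask] == 0.0:
            continue
        chosen = frozenset(items[j] for j in range(k) if (mask >> j) & 1)
        for j in range(k):
            if not (mask >> j) & 1:
                flow[mask | (1 << j)] += flow[mask] * next_item_prob(items[j], chosen)
    return flow[(1 << k) - 1]


def brute_force_propensity(slate: FrozenSet[int], next_item_prob: NextItemProb) -> float:
    """Reference implementation: sum over all k! generation orders."""
    total = 0.0
    for order in permutations(sorted(slate)):
        p, chosen = 1.0, frozenset()
        for item in order:
            p *= next_item_prob(item, chosen)
            chosen = chosen | {item}
        total += p
    return total


def plackett_luce(weights: Dict[int, float]) -> NextItemProb:
    """Toy set-sufficient policy: sample without replacement, proportional to weight."""
    def prob(item: int, chosen: FrozenSet[int]) -> float:
        remaining = sum(w for i, w in weights.items() if i not in chosen)
        return weights[item] / remaining
    return prob


behavior = plackett_luce({0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0})
target = plackett_luce({0: 4.0, 1: 1.0, 2: 1.0, 3: 2.0})
slate = frozenset({0, 2, 3})

# Importance weight of the unordered slate: target propensity over behavior propensity.
weight = unordered_slate_propensity(slate, target) / unordered_slate_propensity(slate, behavior)
print(f"importance weight for {sorted(slate)}: {weight:.4f}")

# Sanity check: propensities of all 2-item slates sum to 1 up to rounding.
total = sum(unordered_slate_propensity(frozenset(s), behavior) for s in combinations(range(4), 2))
print(f"total mass over 2-item slates: {total:.6f}")
```

The subset DP visits at most 2^k states with k transitions each, versus k! orderings for the brute-force sum, which is the gap the abstract refers to as factorial enumeration.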

Search results: Slate

Efficient Algorithms for Logistic Contextual Slate Bandits with Bandit Feedback

Quotient DAGs for Off-Policy Evaluation: Forward-Flow Importance Sampling and Exact Slate Propensities

Prompt-to-Slate: Diffusion Models for Prompt-Conditioned Slate Generation

HiGR: Industrial-Scale Hierarchical Generative Slate Recommendation Framework in Tencent

Off-Policy Evaluation of Slate Bandit Policies via Optimizing Abstraction

LLM-as-a-Judge: Toward World Models for Slate Recommendation Systems

Distributional Off-Policy Evaluation for Slate Recommendations

Conditional Sequential Slate Optimization

Generative Slate Recommendation with Reinforcement Learning

A Clean Slate for Offline Reinforcement Learning

Slate-Aware Ranking for Recommendation

Probabilistic Rank and Reward: A Scalable Model for Slate Recommendation

Generator and Critic: A Deep Reinforcement Learning Approach for Slate Re-ranking in E-commerce

Variation Control and Evaluation for Generative Slate Recommendations

Deep Reinforcement Learning with Attention for Slate Markov Decision Processes with High-Dimensional States and Actions

Combining Reward and Rank Signals for Slate Recommendation

Learning Multinomial Logits in $O(n \log n)$ time

PCN-Rec: Agentic Proof-Carrying Negotiation for Reliable Governance-Constrained Recommendation

Long-Horizon Plan Execution in Large Tool Spaces through Entropy-Guided Branching

UniRank: Unified List-wise Reranking via Confidence-Ordered Denoising