We enhance auto-regressive language models by conditioning on document chunks retrieved from a large corpus, based on local similarity with preceding tokens. With a $2$ trillion token database, our Retrieval-Enhanced Transformer (RETRO) obtains comparable performance to GPT-3 and Jurassic-1 on the Pile, despite using 25$\times$ fewer parameters. After fine-tuning, RETRO performance translates to downstream knowledge-intensive tasks such as question answering. RETRO combines a frozen Bert retriever, a differentiable encoder and a chunked cross-attention mechanism to predict tokens based on an order of magnitude more data than what is typically consumed during training. We typically train RETRO from scratch, yet can also rapidly RETROfit pre-trained transformers with retrieval and still achieve good performance. Our work opens up new avenues for improving language models through explicit memory at unprecedented scale.
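The retrieval step described above (chunk the input, embed each chunk, look up nearest neighbours in a pre-embedded database) can be sketched in a few lines. This is a minimal NumPy illustration, not RETRO's BERT-based retriever; the embedding table, sizes, and function names are invented for the example.

```python
import numpy as np

def chunk_embed(tokens, chunk_size, emb):
    # Split token ids into fixed-size chunks and mean-pool their embeddings.
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    return chunks, np.stack([emb[c].mean(axis=0) for c in chunks])

def retrieve(query_vecs, db_vecs, k=2):
    # Cosine similarity between each query chunk and every database chunk.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = db_vecs / np.linalg.norm(db_vecs, axis=1, keepdims=True)
    sims = q @ d.T
    # Indices of the k most similar database chunks per query chunk.
    return np.argsort(-sims, axis=1)[:, :k]

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 16))          # toy frozen embedding table
tokens = rng.integers(0, 100, size=12)    # a 12-token "document"
chunks, qv = chunk_embed(tokens, 4, emb)  # three chunks of four tokens
db = rng.normal(size=(50, 16))            # 50 pre-embedded database chunks
neighbours = retrieve(qv, db, k=2)
print(neighbours.shape)  # (3, 2): two neighbours per query chunk
```

In the real model, the retrieved neighbour chunks would then feed the encoder and be attended to via chunked cross-attention.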
In this work, we propose "global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system. The embeddings are trained with no explicit labels, yet learn to model a large range of acoustic expressiveness. GSTs lead to a rich set of significant results. The soft interpretable "labels" they generate can be used to control synthesis in novel ways, such as varying speed and speaking style - independently of the text content. They can also be used for style transfer, replicating the speaking style of a single audio clip across an entire long-form text corpus. When trained on noisy, unlabeled found data, GSTs learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.
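The core mechanism behind GSTs, attention of a reference encoding over a small bank of learned embeddings whose weighted sum becomes the style embedding, can be sketched as follows. This is a hedged NumPy toy, not the Tacotron implementation; the bank size, dimensions, and names are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def style_embedding(ref_vec, token_bank):
    # Attention weights of the reference encoding over the token bank.
    scores = token_bank @ ref_vec          # (num_tokens,)
    weights = softmax(scores)
    # Style embedding = weighted sum of the (learned) style tokens.
    return weights, weights @ token_bank

rng = np.random.default_rng(1)
bank = rng.normal(size=(10, 8))   # 10 global style tokens of dimension 8
ref = rng.normal(size=8)          # reference-audio encoding
w, style = style_embedding(ref, bank)
print(round(float(w.sum()), 6), style.shape)  # 1.0 (8,)
```

At synthesis time the weights can also be set by hand, which is what enables style control without a reference clip.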
Transformers, which are popular for language modeling, have recently been explored for vision tasks, e.g., the Vision Transformer (ViT) for image classification. The ViT model splits each image into a sequence of fixed-length tokens and then applies multiple Transformer layers to model their global relations for classification. However, ViT achieves inferior performance to CNNs when trained from scratch on a midsize dataset like ImageNet. We find this is because: 1) the simple tokenization of input images fails to model important local structure such as edges and lines among neighboring pixels, leading to low training sample efficiency; 2) the redundant attention backbone design of ViT leads to limited feature richness under fixed computation budgets and limited training samples. To overcome these limitations, we propose a new Tokens-To-Token Vision Transformer (T2T-ViT), which incorporates 1) a layer-wise Tokens-to-Token (T2T) transformation that progressively structurizes the image into tokens by recursively aggregating neighboring tokens into one token, so that local structure represented by surrounding tokens can be modeled and the token length can be reduced; 2) an efficient backbone with a deep-narrow structure for vision transformers, motivated by CNN architecture design after an empirical study. Notably, T2T-ViT reduces the parameter count and MACs of vanilla ViT by half, while achieving more than 3.0% improvement when trained from scratch on ImageNet. It also outperforms ResNets and achieves comparable performance with MobileNets when trained directly on ImageNet. For example, a T2T-ViT comparable in size to ResNet50 (21.5M parameters) achieves 83.3% top-1 accuracy at image resolution 384x384 on ImageNet.
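A minimal sketch of one Tokens-to-Token step: aggregate each k x k neighbourhood of tokens into a single token by channel-wise concatenation, so the token count shrinks while per-token features grow. The paper additionally interleaves transformer layers and uses overlapping unfolding; this toy uses non-overlapping windows for clarity, and all names and sizes are illustrative.

```python
import numpy as np

def tokens_to_token(x, k=2):
    # x: (H, W, C) token map; merge each k x k neighbourhood into one token.
    H, W, C = x.shape
    x = x.reshape(H // k, k, W // k, k, C)
    x = x.transpose(0, 2, 1, 3, 4)               # (H/k, W/k, k, k, C)
    return x.reshape(H // k, W // k, k * k * C)  # concat neighbours channel-wise

tok = np.arange(8 * 8 * 3, dtype=float).reshape(8, 8, 3)
out = tokens_to_token(tok, k=2)
print(out.shape)  # (4, 4, 12): 4x fewer tokens, richer per-token features
```

Applying this recursively shortens the token sequence layer by layer while letting each token summarize a growing local neighbourhood.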
This paper explores how entrepreneurs can use initial coin offerings, whereby they issue crypto tokens and commit to accept only those tokens as payment for their products, to fund venture start-up costs. We show that the ICO mechanism allows entrepreneurs to generate buyer competition for the token, giving it value. We also find that venture returns are independent of any committed growth in the supply of tokens over time, but that initial funds raised are maximized by setting that growth to zero to encourage saving by early participants. Nonetheless, since the value of the tokens depends on a single period of demand, the ability to raise funds is more limited than in traditional equity finance. Furthermore, a lack of commitment in monetary policy undermines saving behavior, hence the cost of using tokens to fund start-up costs is inflexibility in future capital raises. Crypto tokens can also facilitate coordination among stakeholders within digital ecosystems when network effects are present.
The growing usage of tokens in real-world blockchain projects, most visible in ICOs, has unveiled the need to understand what blockchain tokens actually represent and how they relate to their underlying business model. Previous research has addressed this gap but often lacks a comprehensive understanding of tokens and their design, as well as of the growing and rapidly changing complexity of the token landscape. This has crucial implications for assessing tokens' value and utility. Applying a structured, scientific approach to blockchain tokens, we provide a comprehensive token classification and a decision aid on token design, based on a literature review and an empirical study that cover this research gap. Our work offers a novel contribution to an emerging field within the Blockchain research domain and proposes structured analytical tools that can be used by both practitioners and researchers.
We introduce A-ViT, a method that adaptively adjusts the inference cost of the vision transformer (ViT) for images of different complexity. A-ViT achieves this by automatically reducing the number of tokens processed in the network as inference proceeds. We reformulate Adaptive Computation Time (ACT [17]) for this task, extending halting to discard redundant spatial tokens. The appealing architectural properties of vision transformers enable our adaptive token reduction mechanism to speed up inference without modifying the network architecture or inference hardware. We demonstrate that A-ViT requires no extra parameters or sub-network for halting, as we base the learning of adaptive halting on the original network parameters. We further introduce a distributional prior regularization that stabilizes training compared to prior ACT approaches. On the image classification task (ImageNet1K), we show that our proposed A-ViT yields high efficacy in filtering informative spatial features and cutting down the overall compute. The proposed method improves the throughput of DeiT-Tiny by 62% and DeiT-Small by 38% with only a 0.3% accuracy drop, outperforming prior art by a large margin.
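The halting idea, accumulating a per-token score across layers and retiring a token once the cumulative score crosses an ACT-style threshold, can be sketched as below. This is an illustrative NumPy simulation with random scores, not the learned mechanism; the names and the 12-layer/6-token sizes are made up.

```python
import numpy as np

def run_with_halting(scores_per_layer, eps=0.01):
    # scores_per_layer: (L, N) halting scores for N tokens over L layers.
    L, N = scores_per_layer.shape
    cum = np.zeros(N)
    active = np.ones(N, dtype=bool)
    halted_at = np.full(N, L)                 # layer at which each token halts
    for layer in range(L):
        cum[active] += scores_per_layer[layer, active]
        newly = active & (cum >= 1 - eps)     # ACT-style halting threshold
        halted_at[newly] = layer
        active &= ~newly                      # halted tokens stop being updated
    return halted_at

rng = np.random.default_rng(2)
scores = rng.uniform(0.1, 0.5, size=(12, 6))  # 12 layers, 6 tokens
depth = run_with_halting(scores)
print(depth.shape, bool((depth < 12).all()))  # (6,) True
```

Tokens that halt early are simply dropped from subsequent layers, which is where the inference savings come from.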
James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2022.
We propose a novel approach to both learning and detecting local contour-based representations for mid-level features. Our features, called sketch tokens, are learned using supervised mid-level information in the form of hand drawn contours in images. Patches of human generated contours are clustered to form sketch token classes and a random forest classifier is used for efficient detection in novel images. We demonstrate our approach on both top-down and bottom-up tasks. We show state-of-the-art results on the top-down task of contour detection while being over 200x faster than competing methods. We also achieve large improvements in detection accuracy for the bottom-up tasks of pedestrian and object detection as measured on INRIA and PASCAL, respectively. These gains are due to the complementary information provided by sketch tokens to low-level features such as gradient histograms.
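The first stage, clustering human-drawn contour patches into sketch token classes, amounts to running k-means on flattened patches; the paper then trains a random forest to detect those classes in novel images. A minimal NumPy k-means sketch under that assumption (patch size, k, and all names are illustrative):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    # Assign flattened contour patches to k "sketch token" classes.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Squared distance from every patch to every center.
        d = ((X[:, None] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels, centers

rng = np.random.default_rng(5)
patches = rng.normal(size=(200, 49))   # 200 flattened 7x7 contour patches
labels, centers = kmeans(patches, k=8)
print(labels.shape, centers.shape)  # (200,) (8, 49)
```

Each cluster center plays the role of one sketch token class; a per-patch classifier then predicts class membership in new images.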
For decades, the password has been the standard means for user authentication on computers. However, as users are required to remember more, longer, and changing passwords, it is evident that a more convenient and secure solution to user authentication is necessary. This paper examines passwords, security tokens, and biometrics, which we collectively call authenticators, and compares these authenticators and their combinations. We examine their effectiveness against several attacks and suitability for particular security specifications such as compromise detection and nonrepudiation. Examples of authenticator combinations and protocols are described to show tradeoffs and solutions that meet chosen, practical requirements. The paper endeavors to offer a comprehensive picture of user authentication solutions for the purposes of evaluating options for use and identifying deficiencies requiring further research.
The expression of surprise—at something unexpected—is a key form of emotional display. Focusing on displays of surprise performed by means of reaction tokens (akin to Goffman's “response cries”), such as wow, gosh, oh my god, ooh!, phew, we use an ethnomethodological, conversation-analytic approach to analyze surprise in talk-in-interaction. Our key contribution is to detach the psychology of surprise from its social expression by showing how co-conversationalists collaborate to bring off an interactionally achieved performance of surprise. Far from being a visceral eruption of emotion, the production of a surprise token is often prepared for several turns in advance. We also show how surprise can be recycled on an occasion subsequent to its initial production, and how surprise displays may be delayed by silence, ritualized disbelief, and other repair initiations. Finally, we consider some of the uses of surprise as an interactional resource, including its role in the reflection and reproduction of culture.
Because many studies of small talk (and talk in general) focus on the input of main speakers, the verbal behavior of listeners is often underrepresented in descriptions of interaction. The notion of small talk as talk superfluous to transactional exigencies enables us to encompass a variety of phenomena, including phatic exchanges, relational language, and various types of insertion sequence. This article adds to this range of phenomena by examining a set of high-frequency short listener response tokens that fulfill the criteria of being superfluous to transactional needs, of being focused on the interpersonal plane of discourse, and of having social functions that seem to overlap with those of phatic and relational episodes in different types of talk. Probably because the items involved are themselves "small" (in that their position is often difficult to locate on the cline from back-channels to full turns), their relational importance is easily overlooked.
The problem of translation has become increasingly central to critical reflections on modernity and its universalizing processes. Approaching translation as a symbolic and material exchange among peoples and civilizations, and not as a purely linguistic or literary matter, the essays in Tokens of Exchange focus on China and its interactions with the West to historicize an economy of translation. Rejecting the familiar regional approach to non-Western societies, contributors contend that "national histories" and "world history" must be read with absolute attention to the types of epistemological translatability that have been constructed among the various languages and cultures in modern times. By studying the production and circulation of meaning as value in areas including history, religion, language, law, visual art, music, and pedagogy, essays consider exchanges between Jesuit and Protestant missionaries and the Chinese between the seventeenth and nineteenth centuries and focus on the interchanges occasioned by the spread of capitalism and imperialism. Concentrating on ideological reciprocity and nonreciprocity in science, medicine, and cultural pathologies, contributors also posit that such exchanges often lead to racialized and essentialized ideas about culture, sexuality, and nation. The collection turns to the role of language itself as a site of the universalization of knowledge in its contemplation of such processes as the invention of Basic English and the global teaching of the English language. By focusing on the moments wherein meaning-value is exchanged in the translation from one language to another, the essays highlight the circulation of the global in the local as they address the role played by historical translation in the universalizing processes of modernity and globalization.
The collection will engage students and scholars of global cultural processes, Chinese studies, world history, literary studies, history of science, and anthropology, as well as cultural and postcolonial studies. Contributors: Jianhua Chen, Nancy Chen, Alexis Dudden Eastwood, Roger Hart, Larissa Heinrich, James Hevia, Andrew F. Jones, Wan Shun Eva Lam, Lydia H. Liu, Deborah T. L. Sang, Haun Saussy, Q. S. Tong, Qiong Zhang
A male fruit fly influences the behavior and physiology of his mate via molecules that he transmits to her in his semen. The mated female fly has an elevated rate of egg laying, a decreased receptivity to mating and a shorter life span; she also stores sperm from the mating. Molecular genetic analyses possible in this insect model system permit the dissection of seminal fluid components that cause these mating responses in the female. Studies with transgenic and mutant flies have shown that products of the male's accessory gland cause short-term changes in the female's behavior and physiology; persistence of these changes requires the storage of sperm. Further dissection of accessory gland function has defined several molecules that cause these effects. A "sex peptide" and a prohormone-like molecule (Accessory gland protein 26Aa) stimulate the female's egg-laying rate; the sex peptide also depresses her receptivity to mating. A large glycoprotein (Acp36DE) appears to function in "corralling" sperm for storage. Studies of accessory gland products and the regulation of the genes that encode them will be important in understanding insect reproduction, behavior, and speciation and ultimately in designing ways to control the impressive fertility of unwanted insects. These studies also provide excellent models to address basic questions in cell biology such as the control of genes in response to sex-specific, mating-regulated and cell type-specific cues and the function and targeting of peptide hormones.
In one of his lectures, Harvey Sacks proposes that the social sciences have tended to view a society as having "relatively few orderly products, where then much of what else takes place is more or less random." He offers "an image of a machine with a couple of holes in the front. It spews out some nice stuff from those holes, and out of the back it spews out garbage." Where, then, "the concern to find that data generated by the machine which is orderly" tends to focus on "what are in the first instance known to be 'big issues', and not that which is terribly mundane, occasional, local, and the like." Sacks offers as an alternative approach that "it is perfectly possible...to suppose...that wherever one happens to attack the phenomenon one is going to find detailed order. That is, one may alternatively take it that there is order at all points." As a student of Sacks', I use 'order at all points' as a research presupposition, a working hypothesis, a base. But every now and then it appears that I do not fully accept it. There will be some occurrence at which I will balk: But surely not here. This cannot be orderly. This has got to be "garbage". The phenomenon I will be reporting on is one of those. As it began to emerge, I kept thinking: No. Not here. This, surely, is garbage. And it is not that the phenomenon is too small. I have worked with much finer-grained materials. But the fine-grained phenomena have a certain elegance. This thing is not elegant. It seems just too "terribly mundane", too trivial to be one of society's "orderly products". And yet, on examination, it seems to be capable of orderliness.
Attention is sparse in vision transformers. We observe that the final prediction in vision transformers is based on only a subset of the most informative tokens, which is sufficient for accurate image recognition. Based on this observation, we propose a dynamic token sparsification framework to prune redundant tokens progressively and dynamically based on the input. Specifically, we devise a lightweight prediction module to estimate the importance score of each token given the current features. The module is added at different layers to prune redundant tokens hierarchically. To optimize the prediction module in an end-to-end manner, we propose an attention masking strategy to differentiably prune a token by blocking its interactions with other tokens. Benefiting from the nature of self-attention, the unstructured sparse tokens remain hardware friendly, which makes our framework easy to accelerate in practice. By hierarchically pruning 66% of the input tokens, our method reduces FLOPs by 31%-37% and improves throughput by over 40%, while the accuracy drop is within 0.5% for various vision transformers. Equipped with the dynamic token sparsification framework, DynamicViT models achieve very competitive complexity/accuracy trade-offs compared to state-of-the-art CNNs and vision transformers on ImageNet. Code is available at https://github.com/raoyongming/DynamicViT
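The pruning step, scoring each token and keeping only the most informative fraction, can be sketched as follows. This toy uses hard top-k selection as at inference time; the differentiable attention-masking trick used for training is omitted, and all sizes and names here are illustrative.

```python
import numpy as np

def prune_tokens(x, scores, keep_ratio=0.34):
    # x: (N, C) tokens; scores: (N,) predicted importance; keep the top fraction.
    N = x.shape[0]
    k = max(1, int(round(N * keep_ratio)))
    keep = np.argsort(-scores)[:k]            # indices of most informative tokens
    return x[np.sort(keep)]                   # preserve original token order

rng = np.random.default_rng(3)
tokens = rng.normal(size=(196, 64))           # 14x14 patch tokens, 64 channels
scores = rng.uniform(size=196)                # stand-in for the learned scorer
kept = prune_tokens(tokens, scores, keep_ratio=0.34)  # prune ~66% of tokens
print(kept.shape)  # (67, 64)
```

Because entire rows are dropped, downstream self-attention runs on a genuinely shorter sequence, which is what yields the real speed-up.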
There is a failure mode in large language models that we do not have a good name for, and that we therefore tend not to treat seriously enough. It is not hallucination — the model is not asserting something false. It is not refusal — the model answers at length. It is the production of responses that carry the complete outward form of careful reasoning while the cognitive work that reasoning is supposed to represent has not, in any meaningful sense, occurred. We call this theatrical compliance, and we argue that it is, in practical terms, more dangerous than either of the failure modes that currently dominate alignment research. This paper identifies the phenomenon, characterizes its five principal forms, explains the asymmetry that makes it particularly costly in high-stakes settings, and outlines the design requirements for systems intended to resist it. We do not describe such a system in detail here. Our goal is to establish theatrical compliance as a research problem in its own right and to argue that addressing it requires instruments operating at a fundamentally different level of abstraction than task-level prompting frameworks. Keywords: theatrical compliance, large language models, AI reasoning quality, cognitive process evaluation, prompt engineering, metacognitive systems.
Computer vision has achieved remarkable success by (a) representing images as uniformly-arranged pixel arrays and (b) convolving highly-localized features. However, convolutions treat all image pixels equally regardless of importance; explicitly model all concepts across all images, regardless of content; and struggle to relate spatially-distant concepts. In this work, we challenge this paradigm by (a) representing images as semantic visual tokens and (b) running transformers to densely model token relationships. Critically, our Visual Transformer operates in a semantic token space, judiciously attending to different image parts based on context. This is in sharp contrast to pixel-space transformers that require orders-of-magnitude more compute. Using an advanced training recipe, our VTs significantly outperform their convolutional counterparts, raising ResNet accuracy on ImageNet top-1 by 4.6 to 7 points while using fewer FLOPs and parameters. For semantic segmentation on LIP and COCO-stuff, VT-based feature pyramid networks (FPN) achieve 0.35 points higher mIoU while reducing the FPN module's FLOPs by 6.5x.
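The tokenizer idea, pooling a pixel (feature-map) grid into a handful of semantic tokens via spatial attention, can be sketched like this. A hedged NumPy toy, not the paper's implementation; the projection, sizes, and names are assumptions.

```python
import numpy as np

def tokenize(feature_map, W_a):
    # feature_map: (HW, C) flattened pixels; W_a: (C, T) projection to T tokens.
    logits = feature_map @ W_a                # (HW, T)
    # Softmax over the spatial axis: each token attends over all pixels.
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    attn = e / e.sum(axis=0, keepdims=True)   # each column sums to 1
    return attn.T @ feature_map               # (T, C) semantic visual tokens

rng = np.random.default_rng(4)
fm = rng.normal(size=(14 * 14, 32))           # a 14x14 feature map, 32 channels
W = rng.normal(size=(32, 8))                  # pool 196 pixels into 8 tokens
visual_tokens = tokenize(fm, W)
print(visual_tokens.shape)  # (8, 32)
```

A transformer over these 8 tokens is far cheaper than one over 196 pixel positions, which is the source of the compute savings the abstract describes.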
The Non-Fungible Token (NFT) market has been mushrooming in recent years. The concept of the NFT originally comes from an Ethereum token standard aiming to distinguish each token with distinguishable signs. This type of token can be bound to virtual/digital properties as their unique identifications. With NFTs, all marked properties can be freely traded with customized values according to their age, rarity, liquidity, etc. This has greatly stimulated the prosperity of the decentralized application (DApp) market. At the time of writing (May 2021), the total money spent on completed NFT sales has reached $34,530,649.86 USD. The thousandfold return on its growing market draws huge attention worldwide. However, the development of the NFT ecosystem is still in its early stage, and NFT technologies are immature. Newcomers may get lost in their frenetic evolution due to the lack of systematic summaries. In this technical report, we explore the NFT ecosystem in several aspects. We start with an overview of state-of-the-art NFT solutions, then describe their technical components, protocols, standards, and desired properties. Afterward, we give a security evaluation, with discussions from the perspectives of their design models, opportunities, and challenges. To the best of our knowledge, this is the first systematic study of current NFT ecosystems.
Proportions, that is, relative numbers of socially and culturally different people in a group, are seen as critical in shaping interaction dynamics, and four group types are identified on the basis of varying proportional compositions. "Skewed" groups contain a large preponderance of one type (the numerical "dominants") over another (the rare "tokens"). A framework is developed for conceptualizing the processes that occur between dominants and tokens. Three perceptual phenomena are associated with tokens: visibility (tokens capture a disproportionate awareness share), polarization (differences between tokens and dominants are exaggerated), and assimilation (tokens' attributes are distorted to fit preexisting generalizations about their social type). Visibility generates performance pressures; polarization leads dominants to heighten their group boundaries; and assimilation leads to the tokens' role entrapment. Illustrations are drawn from a field study in a large industrial corporation. Concepts are extended to tokens of all kinds, and research issues are identified.