共找到 20 条结果
NLTK, the Natural Language Toolkit, is a suite of open source program modules, tutorials and problem sets, providing ready-to-use computational linguistics courseware. NLTK covers symbolic and statistical natural language processing, and is interfaced to annotated corpora. Students augment and replace existing components, learn structured programming by example, and manipulate sophisticated models from the outset.
Abstract—We describe the design of Kaldi, a free, open-source toolkit for speech recognition research. Kaldi provides a speech recognition system based on finite-state transducers (using the freely available OpenFst), together with detailed documentation and scripts for building complete recognition systems. Kaldi is written is C++, and the core library supports modeling of arbitrary phonetic-context sizes, acoustic modeling with subspace Gaussian mixture models (SGMM) as well as standard Gaussian mixture models, together with all commonly used linear and affine transforms. Kaldi is released under the Apache License v2.0, which is highly nonrestrictive, making it suitable for a wide community of users. I.
BACKGROUND: To accelerate the application of the CRISPR/Cas9 (clustered regularly interspaced short palindromic repeats/ CRISPR-associated protein 9) system to a variety of plant species, a toolkit with additional plant selectable markers, more gRNA modules, and easier methods for the assembly of one or more gRNA expression cassettes is required. RESULTS: We developed a CRISPR/Cas9 binary vector set based on the pGreen or pCAMBIA backbone, as well as a gRNA (guide RNA) module vector set, as a toolkit for multiplex genome editing in plants. This toolkit requires no restriction enzymes besides BsaI to generate final constructs harboring maize-codon optimized Cas9 and one or more gRNAs with high efficiency in as little as one cloning step. The toolkit was validated using maize protoplasts, transgenic maize lines, and transgenic Arabidopsis lines and was shown to exhibit high efficiency and specificity. More importantly, using this toolkit, targeted mutations of three Arabidopsis genes were detected in transgenic seedlings of the T1 generation. Moreover, the multiple-gene mutations could be inherited by the next generation. CONCLUSIONS: We developed a toolkit that facilitates transient or stable expression of the CRISPR/Cas9 system in a variety of plant species, which will facilitate plant research, as it enables high efficiency generation of mutants bearing multiple gene mutations.
Next generation sequencing (NGS) technologies provide a high-throughput means to generate large amount of sequence data. However, quality control (QC) of sequence data generated from these technologies is extremely important for meaningful downstream analysis. Further, highly efficient and fast processing tools are required to handle the large volume of datasets. Here, we have developed an application, NGS QC Toolkit, for quality check and filtering of high-quality data. This toolkit is a standalone and open source application freely available at http://www.nipgr.res.in/ngsqctoolkit.html. All the tools in the application have been implemented in Perl programming language. The toolkit is comprised of user-friendly tools for QC of sequencing data generated using Roche 454 and Illumina platforms, and additional tools to aid QC (sequence format converter and trimming tools) and analysis (statistics tools). A variety of options have been provided to facilitate the QC at user-defined parameters. The toolkit is expected to be very useful for the QC of NGS data to facilitate better downstream analysis.
Abstract Clusters, Grids, and peer‐to‐peer (P2P) networks have emerged as popular paradigms for next generation parallel and distributed computing. They enable aggregation of distributed resources for solving large‐scale problems in science, engineering, and commerce. In Grid and P2P computing environments, the resources are usually geographically distributed in multiple administrative domains , managed and owned by different organizations with different policies, and interconnected by wide‐area networks or the Internet. This introduces a number of resource management and application scheduling challenges in the domain of security, resource and policy heterogeneity, fault tolerance, continuously changing resource conditions, and politics. The resource management and scheduling systems for Grid computing need to manage resources and application execution depending on either resource consumers' or owners' requirements, and continuously adapt to changes in resource availability. The management of resources and scheduling of applications in such large‐scale distributed systems is a complex undertaking. In order to prove the effectiveness of resource brokers and associated scheduling algorithms, their performance needs to be evaluated under different scenarios such as varying number of resources and users with different requirements. In a Grid environment, it is hard and even impossible to perform scheduler performance evaluation in a repeatable and controllable manner as resources and users are distributed across multiple organizations with their own policies. To overcome this limitation, we have developed a Java‐based discrete‐event Grid simulation toolkit called GridSim . The toolkit supports modeling and simulation of heterogeneous Grid resources (both time‐ and space‐shared), users and application models. It provides primitives for creation of application tasks, mapping of tasks to resources, and their management. To demonstrate suitability of the GridSim toolkit, we have simulated a Nimrod‐G like Grid resource broker and evaluated the performance of deadline and budget constrained cost‐ and time‐minimization scheduling algorithms. Copyright © 2002 John Wiley & Sons, Ltd.
This paper introduces a new open source platform for end-to-end speech processing named ESPnet. ESPnet mainly focuses on end-to-end automatic speech recognition (ASR), and adopts widely-used dynamic neural network toolkits, Chainer and PyTorch, as a main deep learning engine. ESPnet also follows the Kaldi ASR toolkit style for data processing, feature extraction/format, and recipes to provide a complete setup for speech recognition and other speech processing experiments. This paper explains a major architecture of this software platform, several important functionalities, which differentiate ESPnet from other open source ASR toolkits, and experimental results with major ASR benchmarks.
From the Publisher: The latest edition of the single most authoritative guide on dimensional modeling for data warehousing! Dimensional modeling has become the most widely accepted approach for data warehouse design. Here is a complete library of dimensional modeling techniques the most comprehensive collection ever written. Greatly expanded to cover both basic and advanced techniques for optimizing data warehouse design, this second edition to Ralph Kimballs classic guide is more than sixty percent updated. The authors begin with fundamental design recommendations and gradually progress step-by-step through increasingly complex scenarios. Clear-cut guidelines for designing dimensional models are illustrated using real-world data warehouse case studies drawn from a variety of business application areas and industries, including: Retail sales and e-commerce Inventory management Procurement Order management Customer relationship management (CRM) Human resources management Accounting Financial services Telecommunications and utilities Education Transportation Health care and insurance By the end of the book, you will have mastered the full range of powerful techniques for designing dimensional databases that are easy to understand and provide fast query response. You will also learn how to create an architected framework that integrates the distributed data warehouse using standardized dimensions and facts. Author Biography: Ralph KimbalL, PhD, has been a leading visionary in the data warehouse industry since 1982 and is one of todays most well- known speakers, consultants, and teachers. He writes the Warehouse Designer column for Intelligent Enterprise magazine and is also the author of the bestselling books The Data Warehouse Lifecycle Toolkit and The Data Webhouse Toolkit (both from Wiley). Margy Ross is President of DecisionWorks Consulting and a Ralph Kimball Associate. She has focused exclusively on decision support and data warehousing since 1982. Margy coauthored the highly acclaimed The Data Warehouse Lifecycle Toolkit, consults on data warehouse projects worldwide, and teaches data warehouse design for Kimball University.
In the traditional new product development process, manufacturers first explore user needs and then develop responsive products. Developing an accurate understanding of user needs is not simple or fast or cheap however, and the traditional approach is coming under increasing strain as user needs change more rapidly, and as firms increasingly seek to serve “markets of one.” Toolkits for user innovation is an emerging alternative approach in which manufacturers actually abandon the attempt to understand user needs in detail in favor of transferring need-related aspects of product and service development to users. Experience in fields where the toolkit approach has been pioneered show custom products being developed much more quickly and at a lower cost. In this paper we explore toolkits for user innovation and explain why and how they work.
Next-generation DNA sequencing (NGS) projects, such as the 1000 Genomes Project, are already revolutionizing our understanding of genetic variation among individuals. However, the massive data sets generated by NGS--the 1000 Genome pilot alone includes nearly five terabases--make writing feature-rich, efficient, and robust analysis tools difficult for even computationally sophisticated individuals. Indeed, many professionals are limited in the scope and the ease with which they can answer scientific questions by the complexity of accessing and manipulating the data produced by these machines. Here, we discuss our Genome Analysis Toolkit (GATK), a structured programming framework designed to ease the development of efficient and robust analysis tools for next-generation DNA sequencers using the functional programming philosophy of MapReduce. The GATK provides a small but rich set of data access patterns that encompass the majority of analysis tool needs. Separating specific analysis calculations from common data management infrastructure enables us to optimize the GATK framework for correctness, stability, and CPU and memory efficiency and to enable distributed and shared memory parallelization. We highlight the capabilities of the GATK by describing the implementation and application of robust, scale-tolerant tools like coverage calculators and single nucleotide polymorphism (SNP) calling. We conclude that the GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.
Abstract Cloud computing is a recent advancement wherein IT infrastructure and applications are provided as ‘services’ to end‐users under a usage‐based payment model. It can leverage virtualized services even on the fly based on requirements (workload patterns and QoS) varying with time. The application services hosted under Cloud computing model have complex provisioning, composition, configuration, and deployment requirements. Evaluating the performance of Cloud provisioning policies, application workload models, and resources performance models in a repeatable manner under varying system and user configurations and requirements is difficult to achieve. To overcome this challenge, we propose CloudSim: an extensible simulation toolkit that enables modeling and simulation of Cloud computing systems and application provisioning environments. The CloudSim toolkit supports both system and behavior modeling of Cloud system components such as data centers, virtual machines (VMs) and resource provisioning policies. It implements generic application provisioning techniques that can be extended with ease and limited effort. Currently, it supports modeling and simulation of Cloud computing environments consisting of both single and inter‐networked clouds (federation of clouds). Moreover, it exposes custom interfaces for implementing policies and provisioning techniques for allocation of VMs under inter‐networked Cloud computing scenarios. Several researchers from organizations, such as HP Labs in U.S.A., are using CloudSim in their investigation on Cloud resource provisioning and energy‐efficient management of data center resources. The usefulness of CloudSim is demonstrated by a case study involving dynamic provisioning of application services in the hybrid federated clouds environment. The result of this case study proves that the federated Cloud computing model significantly improves the application QoS requirements under fluctuating resource and service demand patterns. Copyright © 2010 John Wiley & Sons, Ltd.
SRILM is a collection of C++ libraries, executable programs, and helper scripts designed to allow both production of and experimen-tation with statistical language models for speech recognition and other applications. SRILM is freely available for noncommercial purposes. The toolkit supports creation and evaluation of a vari-ety of language model types based on N-gram statistics, as well as several related tasks, such as statistical tagging and manipu-lation of N-best lists and word lattices. This paper summarizes the functionality of the toolkit and discusses its design and imple-mentation, highlighting ease of rapid prototyping, reusability, and combinability of tools. 1.
UNLABELLED: Unipro UGENE is a multiplatform open-source software with the main goal of assisting molecular biologists without much expertise in bioinformatics to manage, analyze and visualize their data. UGENE integrates widely used bioinformatics tools within a common user interface. The toolkit supports multiple biological data formats and allows the retrieval of data from remote data sources. It provides visualization modules for biological objects such as annotated genome sequences, Next Generation Sequencing (NGS) assembly data, multiple sequence alignments, phylogenetic trees and 3D structures. Most of the integrated algorithms are tuned for maximum performance by the usage of multithreading and special processor instructions. UGENE includes a visual environment for creating reusable workflows that can be launched on local resources or in a High Performance Computing (HPC) environment. UGENE is written in C++ using the Qt framework. The built-in plugin system and structured UGENE API make it possible to extend the toolkit with new functionality. AVAILABILITY AND IMPLEMENTATION: UGENE binaries are freely available for MS Windows, Linux and Mac OS X at http://ugene.unipro.ru/download.html. UGENE code is licensed under the GPLv2; the information about the code licensing and copyright of integrated tools can be found in the LICENSE.3rd_party file provided with the source bundle.
Computing devices and applications are now used beyond the desktop, in diverse environments, and this trend toward ubiquitous computing is accelerating. One challenge that remains in this emerging research field is the ability to enhance the behavior of any application by informing it of the context of its use. By context, we refer to any information that characterizes a situation related to the interaction between humans, applications and the surrounding environment. Context-aware applications promise richer and easier interaction, but the current state of research in this field is still far removed from that vision. This is due to three main problems: (1) the notion of context is still ill defined; (2) there is a lack of conceptual models and methods to help drive the design of context-aware applications; and (3) no tools are available to jump-start the development of context-aware applications. In this paper, we address these three problems in turn. We first define context, identify categories of contextual information, and characterize context-aware application behavior. Though the full impact of context-aware computing requires understanding very subtle and high-level notions of context, we are focusing our efforts on the pieces of context that can be inferred automatically from sensors in a physical environment. We then present a conceptual framework that separates the acquisition and representation of context from the delivery and reaction to context by a contextaware application. We have built a toolkit, the Context Toolkit, that instantiates this conceptual framework and supports the rapid development of a rich space of context-aware applications. We illustrate the usefulness of the conceptual framework by describing a number of contextaware applications that h...
There are many excellent toolkits which provide support for developing machine learning software in Python, R, Matlab, and similar environments. Dlib-ml is an open source library, targeted at both engineers and research scientists, which aims to provide a similarly rich environment for developing machine learning software in the C++ language. Towards this end, dlib-ml contains an extensible linear algebra toolkit with built in BLAS support. It also houses implementations of algorithms for performing inference in Bayesian networks and kernel-based methods for classification, regression, clustering, anomaly detection, and feature ranking. To enable easy use of these tools, the entire library has been developed with contract programming, which provides complete and precise documentation as well as powerful debugging tools. Keywords: kernel-methods, svm, rvm, kernel clustering, C++, Bayesian networks 1.
Gene Ontology (GO), the de facto standard in gene functionality description, is used widely in functional annotation and enrichment analysis. Here, we introduce agriGO, an integrated web-based GO analysis toolkit for the agricultural community, using the advantages of our previous GO enrichment tool (EasyGO), to meet analysis demands from new technologies and research objectives. EasyGO is valuable for its proficiency, and has proved useful in uncovering biological knowledge in massive data sets from high-throughput experiments. For agriGO, the system architecture and website interface were redesigned to improve performance and accessibility. The supported organisms and gene identifiers were substantially expanded (including 38 agricultural species composed of 274 data types). The requirement on user input is more flexible, in that user-defined reference and annotation are accepted. Moreover, a new analysis approach using Gene Set Enrichment Analysis strategy and customizable features is provided. Four tools, SEA (Singular enrichment analysis), PAGE (Parametric Analysis of Gene set Enrichment), BLAST4ID (Transfer IDs by BLAST) and SEACOMPARE (Cross comparison of SEA), are integrated as a toolkit to meet different demands. We also provide a cross-comparison service so that different data sets can be compared and explored in a visualized way. Lastly, agriGO functions as a GO data repository with search and download functions; agriGO is publicly accessible at http://bioinfo.cau.edu.cn/agriGO/.
When attempting to better understand the strengths and weaknesses of an algorithm, it is important to have a strong understanding of the problem at hand. This is true for the field of multiobjective evolutionary algorithms (EAs) as it is for any other field. Many of the multiobjective test problems employed in the EA literature have not been rigorously analyzed, which makes it difficult to draw accurate conclusions about the strengths and weaknesses of the algorithms tested on them. In this paper, we systematically review and analyze many problems from the EA literature, each belonging to the important class of real-valued, unconstrained, multiobjective test problems. To support this, we first introduce a set of test problem criteria, which are in turn supported by a set of definitions. Our analysis of test problems highlights a number of areas requiring attention. Not only are many test problems poorly constructed but also the important class of nonseparable problems, particularly nonseparable multimodal problems, is poorly represented. Motivated by these findings, we present a flexible toolkit for constructing well-designed test problems. We also present empirical results demonstrating how the toolkit can be used to test an optimizer in ways that existing test suites do not
Resting-state fMRI (RS-fMRI) has been drawing more and more attention in recent years. However, a publicly available, systematically integrated and easy-to-use tool for RS-fMRI data processing is still lacking. We developed a toolkit for the analysis of RS-fMRI data, namely the RESting-state fMRI data analysis Toolkit (REST). REST was developed in MATLAB with graphical user interface (GUI). After data preprocessing with SPM or AFNI, a few analytic methods can be performed in REST, including functional connectivity analysis based on linear correlation, regional homogeneity, amplitude of low frequency fluctuation (ALFF), and fractional ALFF. A few additional functions were implemented in REST, including a DICOM sorter, linear trend removal, bandpass filtering, time course extraction, regression of covariates, image calculator, statistical analysis, and slice viewer (for result visualization, multiple comparison correction, etc.). REST is an open-source package and is freely available at http://www.restfmri.net.
Over the past few years, there has been an increased interest in automatic facial behavior analysis and understanding. We present OpenFace 2.0 - a tool intended for computer vision and machine learning researchers, affective computing community and people interested in building interactive applications based on facial behavior analysis. OpenFace 2.0 is an extension of OpenFace toolkit and is capable of more accurate facial landmark detection, head pose estimation, facial action unit recognition, and eye-gaze estimation. The computer vision algorithms which represent the core of OpenFace 2.0 demonstrate state-of-the-art results in all of the above mentioned tasks. Furthermore, our tool is capable of real-time performance and is able to run from a simple webcam without any specialist hardware. Finally, unlike a lot of modern approaches or toolkits, OpenFace 2.0 source code for training models and running them is freely available for research purposes.
The Bioperl project is an international open-source collaboration of biologists, bioinformaticians, and computer scientists that has evolved over the past 7 yr into the most comprehensive library of Perl modules available for managing and manipulating life-science information. Bioperl provides an easy-to-use, stable, and consistent programming interface for bioinformatics application programmers. The Bioperl modules have been successfully and repeatedly used to reduce otherwise complex tasks to only a few lines of code. The Bioperl object model has been proven to be flexible enough to support enterprise-level applications such as EnsEMBL, while maintaining an easy learning curve for novice Perl programmers. Bioperl is capable of executing analyses and processing results from programs such as BLAST, ClustalW, or the EMBOSS suite. Interoperation with modules written in Python and Java is supported through the evolving BioCORBA bridge. Bioperl provides access to data stores such as GenBank and SwissProt via a flexible series of sequence input/output modules, and to the emerging common sequence data storage format of the Open Bioinformatics Database Access project. This study describes the overall architecture of the toolkit, the problem domains that it addresses, and gives specific examples of how the toolkit can be used to solve common life-sciences problems. We conclude with a discussion of how the open-source nature of the project has contributed to the development effort. [Supplemental material is available online at www.genome.org . Bioperl is available as open-source software free of charge and is licensed under the Perl Artistic License ( http://www.perl.com/pub/a/language/misc/Artistic.html ). It is available for download at http://www.bioperl.org . Support inquiries should be addressed to bioperl-l@bioperl.org .]
MCScan is an algorithm able to scan multiple genomes or subgenomes in order to identify putative homologous chromosomal regions, and align these regions using genes as anchors. The MCScanX toolkit implements an adjusted MCScan algorithm for detection of synteny and collinearity that extends the original software by incorporating 14 utility programs for visualization of results and additional downstream analyses. Applications of MCScanX to several sequenced plant genomes and gene families are shown as examples. MCScanX can be used to effectively analyze chromosome structural changes, and reveal the history of gene family expansions that might contribute to the adaptation of lineages and taxa. An integrated view of various modes of gene duplication can supplement the traditional gene tree analysis in specific families. The source code and documentation of MCScanX are freely available at http://chibba.pgml.uga.edu/mcscan2/.