968 results for Audio Data set
Abstract:
Molecular and morphological data have important roles in illuminating evolutionary history. DNA data often yield well resolved phylogenies for living taxa, but are generally unattainable for fossils. A distinct advantage of morphology is that some types of morphological data may be collected for extinct and extant taxa. Fossils provide a unique window on evolutionary history and may preserve combinations of primitive and derived characters that are not found in extant taxa. Given their unique character complexes, fossils are critical in documenting sequences of character transformation over geologic time and may elucidate otherwise ambiguous patterns of evolution that are not revealed by molecular data alone. Here, we employ a methodological approach that allows for the integration of molecular and paleontological data in deciphering one of the most innovative features in the evolutionary history of mammals—laryngeal echolocation in bats. Molecular data alone, including an expanded data set that includes new sequences for the A2AB gene, suggest that microbats are paraphyletic but do not resolve whether laryngeal echolocation evolved independently in different microbat lineages or evolved in the common ancestor of bats and was subsequently lost in megabats. When scaffolds from molecular phylogenies are incorporated into parsimony analyses of morphological characters, including morphological characters for the Eocene taxa Icaronycteris, Archaeonycteris, Hassianycteris, and Palaeochiropteryx, the resulting trees suggest that laryngeal echolocation evolved in the common ancestor of fossil and extant bats and was subsequently lost in megabats. Molecular dating suggests that crown-group bats last shared a common ancestor 52 to 54 million years ago.
Abstract:
Precise classification of tumors is critically important for cancer diagnosis and treatment. It is also a scientifically challenging task. Recently, efforts have been made to use gene expression profiles to improve the precision of classification, with limited success. Using a published data set for purposes of comparison, we introduce a methodology based on classification trees and demonstrate that it is significantly more accurate for discriminating among distinct colon cancer tissues than other statistical approaches used heretofore. In addition, competing classification trees are displayed, which suggest that different genes may coregulate colon cancers.
Abstract:
Thermodynamics Conference 2013 (Statistical Mechanics and Thermodynamics Group of the Royal Society of Chemistry), The University of Manchester, 3-6 September 2013.
Abstract:
LIDAR (LIght Detection And Ranging) first return elevation data of the Boston, Massachusetts region from MassGIS at 1-meter resolution. This LIDAR data was captured in Spring 2002. LIDAR first return data (which shows the highest ground features, e.g. tree canopy, buildings etc.) can be used to produce a digital terrain model of the Earth's surface. This dataset consists of 74 First Return DEM tiles. The tiles are 4km by 4km areas corresponding with the MassGIS orthoimage index. This data set was collected using 3Di's Digital Airborne Topographic Imaging System II (DATIS II). The area of coverage corresponds to the following MassGIS orthophoto quads covering the Boston region (MassGIS orthophoto quad ID: 229890, 229894, 229898, 229902, 233886, 233890, 233894, 233898, 233902, 233906, 233910, 237890, 237894, 237898, 237902, 237906, 237910, 241890, 241894, 241898, 241902, 245898, 245902). The geographic extent of this dataset is the same as that of the MassGIS dataset: Boston, Massachusetts Region 1:5,000 Color Ortho Imagery (1/2-meter Resolution), 2001 and was used to produce the MassGIS dataset: Boston, Massachusetts, 2-Dimensional Building Footprints with Roof Height Data (from LIDAR data), 2002 [see cross references].
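The tile figures quoted above imply a rough data volume, which the following minimal sketch works out. It assumes every tile is a full 4 km by 4 km square sampled on a 1 m grid with no overlap between tiles; the function and constant names are illustrative only and do not come from MassGIS.

```python
# Assumption: each of the 74 tiles is a complete 4 km x 4 km square on a
# 1 m grid, and tiles do not overlap. Names here are hypothetical.
TILE_SIDE_M = 4_000   # tile edge length in metres (4 km)
CELL_SIZE_M = 1       # grid resolution in metres
TILE_COUNT = 74       # number of First Return DEM tiles


def cells_per_tile(side_m=TILE_SIDE_M, cell_m=CELL_SIZE_M):
    """Number of elevation cells in one square tile."""
    per_side = side_m // cell_m
    return per_side * per_side


def total_cells(tiles=TILE_COUNT):
    """Number of elevation cells across all tiles."""
    return tiles * cells_per_tile()


def total_area_km2(tiles=TILE_COUNT, side_m=TILE_SIDE_M):
    """Total ground area covered by all tiles, in square kilometres."""
    return tiles * (side_m / 1000) ** 2


if __name__ == "__main__":
    print(cells_per_tile())   # cells in one tile
    print(total_cells())      # cells in the whole dataset
    print(total_area_km2())   # covered area in km^2
```

Under these assumptions each tile holds 16 million cells, so any processing that loads whole tiles into memory should budget accordingly.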
Abstract:
This dataset consists of 2D footprints of the buildings in the metropolitan Boston area, based on tiles in the orthoimage index (orthophoto quad ID: 229890, 229894, 229898, 229902, 233886, 233890, 233894, 233898, 233902, 237890, 237894, 237898, 237902, 241890, 241894, 241898, 241902, 245898, 245902). This data set was collected using 3Di's Digital Airborne Topographic Imaging System II (DATIS II). Roof height and footprint elevation attributes (derived from 1-meter resolution LIDAR (LIght Detection And Ranging) data) are included as part of each building feature. This data can be combined with other datasets to create 3D representations of buildings and the surrounding environment.
Abstract:
The present data set includes 268,127 vertical in situ fluorescence profiles obtained from several available online databases and from published and unpublished individual sources. Metadata for each profile are given in further detail in the file provided here. The majority of profiles come from the National Oceanographic Data Center (NODC) and from the fluorescence profiles acquired by Bio-Argo floats available on the Oceanographic Autonomous Observations (OAO) platform (63.7% and 12.5%, respectively).
Different modes of acquisition were used to collect the data presented in this study: (1) CTD profiles are acquired using a fluorometer mounted on a CTD-rosette; (2) OSD (Ocean Station Data) profiles are derived from water samples and are defined as low resolution profiles; (3) the UOR (Undulating Oceanographic Recorder) profiles are acquired by a
Abstract:
Acoustic and pelagic trawl data were collected during pelagic surveys carried out by IFREMER each May between 2000 and 2012 (except 2001) on the eastern continental shelf of the Bay of Biscay (Pelgas series). The acoustic data were collected with a Simrad EK60 echosounder operating at 38 kHz (beam angle at -3 dB: 7°, pulse length set to 1.024 ms). The echosounder transducer was mounted on the vessel keel, 6 m below the sea surface. The sampling design consisted of parallel transects spaced 12 nautical miles apart and oriented perpendicular to the coastline, from 20 m to about 200 m bottom depth. The nominal sailing speed was 10 knots, and 3 knots on average during fishing operations. The scrutinising (species identification) of acoustic data was done by first characterising acoustic schools by type and then linking these types with the species composition of specific trawl hauls. The data set contains nautical area backscattering values and biomass and abundance estimates for blue whiting for transect lines one nautical mile long. Further information on the survey design, scrutinising and biomass estimation can be found in Doray et al. 2012.
Abstract:
Normal mixture models are often used to cluster continuous data. However, conventional approaches for fitting these models will have problems in producing nonsingular estimates of the component-covariance matrices when the dimension of the observations is large relative to the number of observations. In this case, methods such as principal components analysis (PCA) and the mixture of factor analyzers model can be adopted to avoid these estimation problems. We examine these approaches applied to the Cabernet wine data set of Ashenfelter (1999), considering the clustering of both the wines and the judges, and comparing our results with another analysis. The mixture of factor analyzers model proves particularly effective in clustering the wines, accurately classifying many of the wines by location.
Abstract:
To account for the preponderance of zero counts and simultaneous correlation of observations, a class of zero-inflated Poisson mixed regression models is applicable for accommodating the within-cluster dependence. In this paper, a score test for zero-inflation is developed for assessing correlated count data with excess zeros. The sampling distribution and the power of the test statistic are evaluated by simulation studies. The results show that the test statistic performs satisfactorily under a wide range of conditions. The test procedure is further illustrated using a data set on recurrent urinary tract infections. Copyright (c) 2005 John Wiley & Sons, Ltd.
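As a hedged illustration of what "zero-inflation" means in the abstract above, the sketch below evaluates the probability mass function of a plain zero-inflated Poisson distribution. It deliberately omits the covariates, random effects and score test of the paper's mixed regression model; the function name `zip_pmf` and its parameters `lam` (Poisson rate) and `pi` (extra-zero probability) are assumptions made for this example.

```python
import math


def zip_pmf(k, lam, pi):
    """P(Y = k) for a zero-inflated Poisson distribution.

    With probability pi the count is a structural zero; otherwise it is
    drawn from a Poisson distribution with rate lam.
    """
    if k < 0:
        return 0.0
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    if k == 0:
        # Zeros arise from both the structural-zero part and the Poisson part.
        return pi + (1.0 - pi) * poisson
    return (1.0 - pi) * poisson


if __name__ == "__main__":
    # Compare the zero probability with and without inflation.
    print(zip_pmf(0, lam=2.0, pi=0.0))
    print(zip_pmf(0, lam=2.0, pi=0.3))
```

Setting `pi` to zero recovers the ordinary Poisson distribution, which is the null hypothesis a test for zero-inflation would examine.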
Abstract:
The paper investigates a Bayesian hierarchical model for the analysis of categorical longitudinal data from a large social survey of immigrants to Australia. Data for each subject are observed on three separate occasions, or waves, of the survey. One of the features of the data set is that observations for some variables are missing for at least one wave. A model for the employment status of immigrants is developed by introducing, at the first stage of a hierarchical model, a multinomial model for the response and then subsequent terms are introduced to explain wave and subject effects. To estimate the model, we use the Gibbs sampler, which allows missing data for both the response and the explanatory variables to be imputed at each iteration of the algorithm, given some appropriate prior distributions. After accounting for significant covariate effects in the model, results show that the relative probability of remaining unemployed diminished with time following arrival in Australia.
Abstract:
Traditional vegetation mapping methods use high-cost, labour-intensive aerial photography interpretation. This approach can be subjective and is limited by factors such as the extent of remnant vegetation, and the differing scale and quality of aerial photography over time. An alternative approach is proposed which integrates a data model, a statistical model and an ecological model using sophisticated Geographic Information Systems (GIS) techniques and rule-based systems to support fine-scale vegetation community modelling. This approach is based on a more realistic representation of vegetation patterns with transitional gradients from one vegetation community to another. Arbitrary, though often unrealistic, sharp boundaries can be imposed on the model by the application of statistical methods. This GIS-integrated multivariate approach is applied to the problem of vegetation mapping in the complex vegetation communities of the Innisfail Lowlands in the Wet Tropics bioregion of Northeastern Australia. The paper presents the full cycle of this vegetation modelling approach, including sampling sites, variable selection, model selection, model implementation, internal model assessment, model prediction assessments, integration of discrete vegetation community models to generate a composite pre-clearing vegetation map, model validation against an independent data set, and scale assessment of model predictions. An accurate pre-clearing vegetation map of the Innisfail Lowlands was generated (r² = 0.83) through GIS integration of 28 separate statistical models. This modelling approach has good potential for wider application, including provision of vital information for conservation planning and management; a scientific basis for rehabilitation of disturbed and cleared areas; and a viable method for the production of adequate vegetation maps for conservation and forestry planning of poorly studied areas. (c) 2006 Elsevier B.V. All rights reserved.
Abstract:
Objective: An estimation of cut-off points for the diagnosis of diabetes mellitus (DM) based on individual risk factors. Methods: A subset of the 1991 Oman National Diabetes Survey is used, including all patients with a 2h post glucose load >= 200 mg/dl (278 subjects) and a control group of 286 subjects. All subjects previously diagnosed as diabetic and all subjects with missing data values were excluded. The data set was analyzed by use of the SPSS Clementine data mining system. Decision Tree Learners (C5 and CART) and a method for mining association rules (the GRI algorithm) are used. The fasting plasma glucose (FPG), age, sex, family history of diabetes and body mass index (BMI) are input risk factors (independent variables), while diabetes onset (the 2h post glucose load >= 200 mg/dl) is the output (dependent variable). All three techniques used were tested by use of crossvalidation (89.8%). Results: Rules produced for diabetes diagnosis are: A- GRI algorithm (1) FPG>=108.9 mg/dl, (2) FPG>=107.1 and age>39.5 years. B- CART decision trees: FPG >=110.7 mg/dl. C- The C5 decision tree learner: (1) FPG>=95.5 and 54, (2) FPG>=106 and 25.2 kg/m2. (3) FPG>=106 and =133 mg/dl. The three techniques produced rules which cover a significant number of cases (82%), with confidence between 74 and 100%. Conclusion: Our approach supports the suggestion that the present cut-off value of fasting plasma glucose (126 mg/dl) for the diagnosis of diabetes mellitus needs revision, and the individual risk factors such as age and BMI should be considered in defining the new cut-off value.
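The GRI and CART cut-offs reported in the abstract above can be written as simple predicates, as in the minimal sketch below. It covers only the rules that are fully legible in the abstract and leaves out the C5 rules. Treating the two GRI rules as alternatives (logical OR) is an assumption, and the function names are hypothetical. This illustrates the rule structure only and is not a diagnostic tool.

```python
def gri_rule(fpg_mg_dl, age_years):
    """GRI association rules as quoted in the abstract.

    Assumption: the two rules are alternatives, so either one firing
    flags the subject.
    """
    rule_1 = fpg_mg_dl >= 108.9
    rule_2 = fpg_mg_dl >= 107.1 and age_years > 39.5
    return rule_1 or rule_2


def cart_rule(fpg_mg_dl):
    """CART decision-tree cut-off as quoted in the abstract."""
    return fpg_mg_dl >= 110.7


if __name__ == "__main__":
    # A subject with FPG 108.0 mg/dl is flagged by GRI only if older than 39.5.
    print(gri_rule(108.0, 45))
    print(gri_rule(108.0, 30))
    print(cart_rule(108.0))
```

Both rule sets fire well below the 126 mg/dl threshold that the abstract argues needs revision.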
Abstract:
Visualization has proven to be a powerful and widely applicable tool for the analysis and interpretation of data. Most visualization algorithms aim to find a projection from the data space down to a two-dimensional visualization space. However, for complex data sets living in a high-dimensional space it is unlikely that a single two-dimensional projection can reveal all of the interesting structure. We therefore introduce a hierarchical visualization algorithm which allows the complete data set to be visualized at the top level, with clusters and sub-clusters of data points visualized at deeper levels. The algorithm is based on a hierarchical mixture of latent variable models, whose parameters are estimated using the expectation-maximization algorithm. We demonstrate the principle of the approach first on a toy data set, and then apply the algorithm to the visualization of a synthetic data set in 12 dimensions obtained from a simulation of multi-phase flows in oil pipelines and to data in 36 dimensions derived from satellite images.
Abstract:
Multidimensional compound optimization is a new paradigm in the drug discovery process, yielding efficiencies during early stages and reducing attrition in the later stages of drug development. The success of this strategy relies heavily on understanding this multidimensional data and extracting useful information from it. This paper demonstrates how principled visualization algorithms can be used to understand and explore a large data set created in the early stages of drug discovery. The experiments presented are performed on a real-world data set comprising biological activity data and some whole-molecular physicochemical properties. Data visualization is a popular way of presenting complex data in a simpler form. We have applied powerful principled visualization methods, such as generative topographic mapping (GTM) and hierarchical GTM (HGTM), to help the domain experts (screening scientists, chemists, biologists, etc.) understand the data and make meaningful decisions. We also benchmark these principled methods against better-known visualization approaches, namely principal component analysis (PCA), Sammon's mapping, and self-organizing maps (SOMs), to demonstrate their enhanced power to help the user visualize the large multidimensional data sets one has to deal with during the early stages of the drug discovery process. The results reported clearly show that the GTM and HGTM algorithms allow the user to cluster active compounds for different targets and understand them better than the benchmarks. An interactive software tool supporting these visualization algorithms was provided to the domain experts. The tool helps the domain experts explore the projections obtained from the visualization algorithms, providing facilities such as parallel coordinate plots, magnification factors, directional curvatures, and integration with industry-standard software. © 2006 American Chemical Society.
Abstract:
We analyse how the Generative Topographic Mapping (GTM) can be modified to cope with missing values in the training data. Our approach is based on an Expectation-Maximisation (EM) method which estimates the parameters of the mixture components and at the same time deals with the missing values. We incorporate this algorithm into a hierarchical GTM. We verify the method on a toy data set (using a single GTM) and a realistic data set (using a hierarchical GTM). The results show that our algorithm can help to construct informative visualisation plots, even when some of the training points are corrupted with missing values.