Abstract:
There is strong evidence from twin and family studies indicating that a substantial proportion of the heritability of susceptibility to ankylosing spondylitis (AS) and its clinical manifestations is encoded by non-major-histocompatibility-complex genes. Efforts to identify these genes have included genomewide linkage studies and candidate gene association studies. One region, the interleukin (IL)-1 gene complex on chromosome 2, has been repeatedly associated with AS in both Caucasians and Asians. It is likely that more than one gene in this complex is involved in AS, with the strongest evidence to date implicating IL-1A. Identifying the genes underlying other linkage regions has been difficult because of the lack of obvious candidates and the low power of most studies to date to detect genes of the small to moderate effect sizes that are likely to be involved. The field is moving towards genomewide association analysis, involving much larger datasets of unrelated cases and controls. Early successes using this approach in other diseases indicate that it is likely to identify genes in common diseases like AS, but there remains the risk that the common-variant, common-disease hypothesis will not hold true in AS. Nonetheless, it is appropriate for the field to be cautiously optimistic that the next few years will bring great advances in our understanding of the genetics of this condition.
Abstract:
Grass pollen is a major trigger for allergic rhinitis and asthma, yet little is known about the timing and levels of human exposure to airborne grass pollen across Australasian urban environments. The relationships between environmental aeroallergen exposure and allergic respiratory disease bridge the fields of ecology, aerobiology, geospatial science and public health. The Australian Aerobiology Working Group, comprising experts in botany, palynology, biogeography, climate change science, plant genetics, biostatistics, ecology, pollen allergy, public and environmental health, and medicine, was established to systematically source, collate and analyse atmospheric pollen concentration data from 11 Australian and six New Zealand sites. Following two week-long workshops, post-workshop evaluations were conducted to reflect upon the utility of this analysis-and-synthesis approach for addressing complex multidisciplinary questions. This Working Group described i) biogeographically dependent variation in airborne pollen diversity, ii) a latitudinal gradient in the timing, duration and number of peaks of the grass pollen season, and iii) the emergence of new methodologies based on trans-disciplinary synthesis of aerobiology and remote sensing data. Challenges included resolving methodological variations between pollen monitoring sites and temporal variations in pollen datasets. Other challenges included “marrying” ecosystem and health sciences and reconciling divergent expert opinion. The Australian Aerobiology Working Group facilitated knowledge transfer between diverse scientific disciplines, mentored students and early career scientists, and provided an uninterrupted collaborative opportunity to focus on a globally unifying problem. The Working Group provided a platform to optimise the value of large existing ecological datasets that have importance for human respiratory health and ecosystems research.
Compilation of current knowledge of Australasian pollen aerobiology is a critical first step towards the management of exposure to pollen in patients with allergic disease and provides a basis from which the future impacts of climate change on pollen distribution can be assessed and monitored.
Abstract:
Big Datasets are endemic, but they are often notoriously difficult to analyse because of their size, heterogeneity, history and quality. The purpose of this paper is to open a discourse on the use of modern experimental design methods to analyse Big Data in order to answer particular questions of interest. By appealing to a range of examples, it is suggested that this perspective on Big Data modelling and analysis has wide generality and advantageous inferential and computational properties. In particular, the principled experimental design approach is shown to provide a flexible framework for analysis that, for certain classes of objectives and utility functions, delivers near-equivalent answers compared with analyses of the full dataset under a controlled error rate. It can also provide a formalised method for iterative parameter estimation, model checking, identification of data gaps and evaluation of data quality. Finally, it has the potential to add value to other Big Data sampling algorithms, in particular divide-and-conquer strategies, by determining efficient sub-samples.
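The claim that a well-chosen sub-sample can deliver answers close to a full-data analysis can be illustrated with a toy sketch. Everything below (the linear model, the noise level, and the use of a simple random sub-sample rather than a formal optimal design) is an assumption of this example, not the paper's method:

```python
import random

random.seed(0)

# Synthetic "big" dataset: y = 2x + 1 + noise (hypothetical example).
N = 100_000
xs = [random.uniform(0, 10) for _ in range(N)]
ys = [2.0 * x + 1.0 + random.gauss(0, 1) for x in xs]

def ols_slope(x, y):
    """Ordinary least-squares slope for a simple linear model."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

# Full-data estimate vs. estimate from a 1% sub-sample: with a
# well-specified model the two slopes come out very close.
full = ols_slope(xs, ys)
idx = random.sample(range(N), 1000)
sub = ols_slope([xs[i] for i in idx], [ys[i] for i in idx])
```

A principled design would pick the sub-sample to optimise a utility function rather than uniformly at random, but the comparison of `full` against `sub` is the quantity of interest either way.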
Abstract:
Membrane proteins play important roles in many biochemical processes and are also attractive targets of drug discovery for various diseases. The elucidation of membrane protein types provides clues for understanding the structure and function of proteins. Recently, we developed a novel system for predicting protein subnuclear localizations. In this paper, we propose a simplified version of our system for predicting membrane protein types directly from primary protein structures, which incorporates amino acid classifications and physicochemical properties into a general form of pseudo-amino acid composition. In this simplified system, we design a two-stage multi-class support vector machine combined with a two-step optimal feature selection process, which proves very effective in our experiments. The performance of the present method is evaluated on two benchmark datasets consisting of five types of membrane proteins. The overall accuracies of prediction for the five types are 93.25% and 96.61% via the jackknife test and the independent dataset test, respectively. These results indicate that our method is effective and valuable for predicting membrane protein types. A web server for the proposed method is available at http://www.juemengt.com/jcc/memty_page.php
Abstract:
We propose a novel multiview fusion scheme for recognizing human identity based on gait biometric data. The gait biometric data are acquired from video surveillance datasets captured by multiple cameras. Experiments on the publicly available CASIA dataset show the potential of the proposed fusion-based scheme for the development and implementation of automatic identity recognition systems.
Abstract:
Whole genome sequences are generally accepted as excellent tools for studying evolutionary relationships. Due to the problems caused by uncertainty in alignment, existing tools for phylogenetic analysis based on multiple alignments cannot be directly applied to whole-genome comparison and phylogenomic studies. There has been a growing interest in alignment-free methods for phylogenetic analysis using complete genome data. The “distances” used in these alignment-free methods are not proper distance metrics in the strict mathematical sense. In this study, we first review them in a more general framework: dissimilarity. Then we propose some new dissimilarities for phylogenetic analysis. Finally, three genome datasets are employed to evaluate these dissimilarities from a biological point of view.
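One simple member of the alignment-free dissimilarity family discussed above is the Euclidean distance between k-mer frequency vectors. This sketch is illustrative only; the toy sequences and the choice k = 3 are assumptions, not the dissimilarities proposed in the paper:

```python
from collections import Counter
from itertools import product
from math import sqrt

def kmer_freqs(seq, k=3):
    """Relative k-mer frequency vector over the DNA alphabet ACGT."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {''.join(km): counts[''.join(km)] / total
            for km in product('ACGT', repeat=k)}

def euclidean_dissimilarity(s1, s2, k=3):
    """Alignment-free dissimilarity: Euclidean distance between the
    k-mer frequency vectors of two sequences. No alignment is ever
    computed, which is the point of this family of methods."""
    f1, f2 = kmer_freqs(s1, k), kmer_freqs(s2, k)
    return sqrt(sum((f1[w] - f2[w]) ** 2 for w in f1))

# Toy sequences: a and b differ by one base, c is very different.
a = "ACGTACGTACGTACGT"
b = "ACGTACGTACGTACGA"
c = "TTTTGGGGCCCCAAAA"
```

Note this quantity is a dissimilarity, not necessarily a metric satisfying the triangle inequality on biological sequence space, which matches the paper's framing.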
Abstract:
In the field of face recognition, sparse representation (SR) has received considerable attention during the past few years, with a focus on holistic descriptors in closed-set identification applications. The underlying assumption in such SR-based methods is that each class in the gallery has sufficient samples and the query lies on the subspace spanned by the gallery of the same class. Unfortunately, such an assumption is easily violated in the face verification scenario, where the task is to determine if two faces (where one or both have not been seen before) belong to the same person. In this study, the authors propose an alternative approach to SR-based face verification, where SR encoding is performed on local image patches rather than the entire face. The obtained sparse signals are pooled via averaging to form multiple region descriptors, which then form an overall face descriptor. Owing to the deliberate loss of spatial relations within each region (caused by averaging), the resulting descriptor is robust to misalignment and various image deformations. Within the proposed framework, they evaluate several SR encoding techniques: l1-minimisation, Sparse Autoencoder Neural Network (SANN) and an implicit probabilistic technique based on Gaussian mixture models. Thorough experiments on AR, FERET, exYaleB, BANCA and ChokePoint datasets show that the local SR approach obtains considerably better and more robust performance than several previous state-of-the-art holistic SR methods, on both the traditional closed-set identification task and the more applicable face verification task. The experiments also show that l1-minimisation-based encoding has a considerably higher computational cost when compared with SANN-based and probabilistic encoding, but leads to higher recognition rates.
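The patch-encode-and-average-pool pipeline described above can be sketched generically. The correlation-based encoder below stands in for the paper's l1, SANN and GMM encoders, and the image and dictionary are random stand-ins; all of these are assumptions of this sketch:

```python
import random

random.seed(3)

def extract_patches(img, size=4, step=4):
    """Non-overlapping patches from a 2-D grayscale image,
    each flattened to a vector."""
    h, w = len(img), len(img[0])
    return [[img[r + i][c + j] for i in range(size) for j in range(size)]
            for r in range(0, h - size + 1, step)
            for c in range(0, w - size + 1, step)]

def encode(patch, dictionary):
    """Toy encoder: correlation of the patch with each dictionary
    atom (a stand-in for a proper sparse encoder)."""
    return [sum(p * a for p, a in zip(patch, atom)) for atom in dictionary]

def region_descriptor(patches, dictionary):
    """Average-pool the patch codes. Pooling deliberately discards the
    spatial layout within the region, which is what gives the
    descriptor its robustness to misalignment."""
    codes = [encode(p, dictionary) for p in patches]
    n = len(codes)
    return [sum(c[k] for c in codes) / n for k in range(len(dictionary))]

img = [[random.random() for _ in range(16)] for _ in range(16)]
dictionary = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
desc = region_descriptor(extract_patches(img), dictionary)
```

In the real system each face yields several such region descriptors, which are concatenated into the overall face descriptor before verification.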
Abstract:
In this paper, we aim at predicting protein structural classes for low-homology data sets based on predicted secondary structures. We propose a new and simple kernel method, named SSEAKSVM, to predict protein structural classes. The secondary structures of all protein sequences are obtained using the tool PSIPRED, and then a linear kernel based on secondary structure element alignment scores is constructed for training a support vector machine classifier without parameter adjustment. Our method SSEAKSVM was evaluated on two low-homology datasets, 25PDB and 1189, with sequence homology of 25% and 40%, respectively. The jackknife test is used to evaluate and compare our method with other existing methods. The overall accuracies on these two data sets are 86.3% and 84.5%, respectively, which are higher than those obtained by other existing methods. In particular, our method achieves higher accuracies (88.1% and 88.5%) in differentiating the α + β class and the α/β class compared to other methods. This suggests that our method is valuable for predicting protein structural classes, particularly for low-homology protein sequences. The source code of the method in this paper can be downloaded at http://math.xtu.edu.cn/myphp/math/research/source/SSEAK_source_code.rar.
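The jackknife (leave-one-out) test used above is a generic evaluation protocol and can be sketched independently of the classifier. Here a toy 1-nearest-neighbour classifier and a two-class dataset stand in for the paper's SSEAKSVM; both are assumptions of this sketch:

```python
def jackknife_accuracy(samples, labels, classify):
    """Jackknife test: each sample is withheld in turn, the classifier
    is applied using only the remaining samples, and overall accuracy
    is the fraction of withheld samples predicted correctly."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if classify(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)

def nearest_neighbour(train_x, train_y, query):
    """Toy 1-NN classifier on numeric feature vectors (a stand-in
    for an SVM, used only to illustrate the protocol)."""
    dists = [sum((a - b) ** 2 for a, b in zip(x, query)) for x in train_x]
    return train_y[dists.index(min(dists))]

# Two well-separated toy classes.
X = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1), (1.0, 1.1), (1.1, 1.0), (0.9, 1.0)]
y = ['alpha', 'alpha', 'alpha', 'beta', 'beta', 'beta']
acc = jackknife_accuracy(X, y, nearest_neighbour)
```

The jackknife test is deterministic and uses every sample exactly once as a test case, which is why it is the standard benchmark protocol in this literature.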
Abstract:
Background This paper examines changing patterns in the utilisation of and geographic access to health services in Great Britain using National Travel Survey data (1985-2012). The National Travel Survey (NTS) is a series of household surveys designed to provide data on personal travel and to monitor changes in travel behaviour over time. The utilisation rate was derived from the proportion of journeys made to access health services. Geographic access was analysed by separating the concept into its accessibility and mobility dimensions. Methods Variables from the PSU, households, and individuals datasets were used as explanatory variables, whereas variables extracted from the journeys dataset were used as dependent variables to identify patterns of utilisation (i.e. the proportion of journeys made by different groups to access health facilities within a particular journey distance or time band, or by mode of transport) and geographic access to health services. A binary logistic regression analysis was conducted to identify the utilisation rate over the different time periods between different groups. This analysis shows the odds ratios (ORs) for different groups making a trip to utilise health services compared to their respective counterparts. Linear multiple regression analyses were then conducted to identify patterns of change in accessibility and mobility levels. Results Analysis of the data has shown that journey distances to health facilities were significantly shorter, and gradually reduced over the period in question, for Londoners, females, those without a car or on low incomes, and older people, although their rates of utilisation of health services were significantly lower because of longer journey times.
These findings indicate that the rate of utilisation of health services largely depends on mobility level, although previous research studies have traditionally overlooked the mobility dimension. Conclusions These findings therefore suggest the need to improve geographic access to services, together with enhanced mobility options for disadvantaged groups, in order for them to have improved levels of access to health facilities. This research also found that the volume of car trips to health services increased steadily over the period 1985-2012, while all other modes accounted for a smaller number of trips. However, it is difficult to conclude from this research whether this increase in the volume of car trips was due to a lack of alternative transport or to an increase in the level of car ownership.
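The odds ratios reported by such a binary logistic regression reduce, for a single binary predictor, to the cross-product ratio of a 2x2 table, with a Woolf-style confidence interval. The counts below are hypothetical illustrations, not figures from the study:

```python
from math import exp, log, sqrt

def odds_ratio(a, b, c, d):
    """Odds ratio from a 2x2 table with a 95% CI (Woolf method):
        a = exposed & event,     b = exposed & no event,
        c = unexposed & event,   d = unexposed & no event.
    For a logistic regression with one binary predictor, exp(coef)
    equals exactly this cross-product ratio."""
    or_ = (a * d) / (b * c)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(OR)
    lo = exp(log(or_) - 1.96 * se)
    hi = exp(log(or_) + 1.96 * se)
    return or_, lo, hi

# Hypothetical counts: health-service trips made / not made by
# respondents with and without a car (not the NTS figures).
or_, lo, hi = odds_ratio(120, 880, 60, 940)
```

An OR above 1 here would mean car-owning respondents had higher odds of making a health-service trip than their counterparts, which is the kind of group comparison the abstract describes.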
Abstract:
Ankylosing spondylitis is a common form of inflammatory arthritis predominantly affecting the spine and pelvis that occurs in approximately 5 out of 1,000 adults of European descent. Here we report the identification of three variants in the RUNX3, LTBR-TNFRSF1A and IL12B regions convincingly associated with ankylosing spondylitis (P < 5 × 10⁻⁸ in the combined discovery and replication datasets) and a further four loci at PTGER4, TBKBP1, ANTXR2 and CARD9 that show strong association across all our datasets (P < 5 × 10⁻⁶ overall, with support in each of the three datasets studied). We also show that polymorphisms of ERAP1, which encodes an endoplasmic reticulum aminopeptidase involved in peptide trimming before HLA class I presentation, only affect ankylosing spondylitis risk in HLA-B27-positive individuals. These findings provide strong evidence that HLA-B27 operates in ankylosing spondylitis through a mechanism involving aberrant processing of antigenic peptides.
Abstract:
Gene expression is arguably the most important indicator of biological function. Thus, identifying differentially expressed genes is one of the main aims of high-throughput studies that use microarray and RNAseq platforms to study deregulated cellular pathways. There are many tools for analysing differential gene expression from transcriptomic datasets. The major challenge of this topic is to estimate gene expression variance, due to the high amount of ‘background noise’ generated by biological equipment and the lack of biological replicates. Bayesian inference has been widely used in the bioinformatics field. In this work, we show that the prior knowledge employed in the Bayesian framework also helps to improve the accuracy of differential gene expression analysis when using a small number of replicates. We have developed a differential analysis tool that uses Bayesian estimation of the variance of gene expression for use with small numbers of biological replicates. Our method is more consistent when compared with the widely used Cyber-T tool, which successfully introduced the Bayesian framework to differential analysis. We also provide a user-friendly web-based graphical user interface for biologists to use with microarray and RNAseq data. Bayesian inference can compensate for the instability of variance estimates caused by using a small number of biological replicates by using pseudo-replicates as prior knowledge. We also show that our new strategy for selecting pseudo-replicates improves the performance of the analysis.
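The shrinkage idea behind Bayesian variance estimation with pseudo-replicates can be sketched with a moderated-variance formula in the spirit of Cyber-T. The exact estimator, the prior values, and the example data below are all assumptions of this sketch, not the paper's implementation:

```python
from statistics import variance

def moderated_variance(values, prior_var, prior_df):
    """Shrink a gene's sample variance toward a prior variance acting
    as pseudo-replicates. prior_var would typically be estimated from
    genes with similar expression levels (an assumption here), and
    prior_df controls how strongly the prior pulls."""
    n = len(values)
    s2 = variance(values)  # sample variance from the real replicates
    return (prior_df * prior_var + (n - 1) * s2) / (prior_df + n - 1)

# With only two replicates the raw variance estimate is unstable;
# the prior pulls it toward a more typical value for similar genes.
gene = [5.1, 5.9]                    # hypothetical log-expression values
raw = variance(gene)                 # 0.32 from just two values
mod = moderated_variance(gene, prior_var=0.10, prior_df=4)
```

The moderated value lands between the noisy two-replicate estimate and the prior, which is exactly the stabilising effect the abstract attributes to pseudo-replicates.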
Abstract:
Document clustering is one of the prominent methods for mining important information from the vast amount of data available on the web. However, document clustering generally suffers from the curse of dimensionality. Providentially, in high-dimensional space, data points tend to be more concentrated in some areas of clusters. We take advantage of this phenomenon by introducing a novel concept of dynamic cluster representation, named loci. Clusters’ loci are efficiently calculated using documents’ ranking scores generated by a search engine. We propose a fast loci-based semi-supervised document clustering algorithm that uses clusters’ loci instead of conventional centroids for assigning documents to clusters. Empirical analysis on real-world datasets shows that the proposed method produces cluster solutions of promising quality and is substantially faster than several benchmarked centroid-based semi-supervised document clustering methods.
Abstract:
Neural data are inevitably contaminated by noise. When such noisy data are subjected to statistical analysis, misleading conclusions can be reached. Here we attempt to address this problem by applying a state-space smoothing method, based on the combined use of Kalman filter theory and the Expectation–Maximization algorithm, to denoise two datasets of local field potentials recorded from monkeys performing a visuomotor task. For the first dataset, it was found that the analysis of high gamma band (60–90 Hz) neural activity in the prefrontal cortex is highly susceptible to the effect of noise, and denoising leads to markedly improved results that were physiologically interpretable. For the second dataset, Granger causality between primary motor and primary somatosensory cortices was not consistent across two monkeys and the effect of noise was suspected. After denoising, the discrepancy between the two subjects was significantly reduced.
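A minimal scalar version of such state-space smoothing (a Kalman filter followed by a Rauch-Tung-Striebel backward pass) can be sketched as follows. The random-walk state model, the fixed noise variances (the paper estimates these with EM), and the constant toy signal are all assumptions of this sketch:

```python
import random

random.seed(1)

def kalman_smooth(y, q=0.01, r=1.0):
    """Scalar Kalman filter + RTS smoother for a random-walk state
    x_t = x_{t-1} + w_t (var q) observed as y_t = x_t + v_t (var r).
    Both variances are assumed known here."""
    n = len(y)
    xf, pf, xp, pp = [0.0] * n, [0.0] * n, [0.0] * n, [0.0] * n
    x, p = y[0], 1.0
    for t in range(n):
        xp[t], pp[t] = x, p + q              # predict
        k = pp[t] / (pp[t] + r)              # Kalman gain
        x = xp[t] + k * (y[t] - xp[t])       # update with observation
        p = (1 - k) * pp[t]
        xf[t], pf[t] = x, p
    xs = xf[:]                               # RTS backward pass
    for t in range(n - 2, -1, -1):
        g = pf[t] / pp[t + 1]
        xs[t] = xf[t] + g * (xs[t + 1] - xp[t + 1])
    return xs

# Noisy observations of a constant signal (a toy "field potential").
truth = 2.0
y = [truth + random.gauss(0, 1.0) for _ in range(200)]
xs = kalman_smooth(y)
raw_err = sum((v - truth) ** 2 for v in y) / len(y)
smooth_err = sum((v - truth) ** 2 for v in xs) / len(xs)
```

The smoothed trace sits much closer to the underlying signal than the raw observations, which is the sense in which denoising can rescue downstream analyses such as Granger causality.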
Abstract:
The development of techniques for scaling up classifiers so that they can be applied to problems with large datasets of training examples is one of the objectives of data mining. Recently, AdaBoost has become popular in the machine learning community thanks to its promising results across a variety of applications. However, training AdaBoost on large datasets is a major problem, especially when the dimensionality of the data is very high. This paper discusses the effect of high dimensionality on the training process of AdaBoost. Two preprocessing options for reducing dimensionality, namely principal component analysis and random projection, are briefly examined. Random projection subject to a probabilistic length-preserving transformation is explored further as a computationally light preprocessing step. The experimental results obtained demonstrate the effectiveness of the proposed training process for handling high-dimensional large datasets.
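The length-preserving random projection can be sketched directly from the Johnson-Lindenstrauss construction: project with a random Gaussian matrix scaled by 1/sqrt(k). The dimensions and data below are arbitrary illustrative choices, and no AdaBoost training is shown:

```python
import random
from math import sqrt

random.seed(2)

def random_projection(X, k):
    """Project d-dimensional rows of X to k dimensions using a random
    Gaussian matrix with entries N(0, 1/k). By the Johnson-Lindenstrauss
    property this approximately preserves vector lengths, which is what
    makes it a cheap preprocessing step before training a classifier."""
    d = len(X[0])
    R = [[random.gauss(0, 1) / sqrt(k) for _ in range(k)] for _ in range(d)]
    return [[sum(x[i] * R[i][j] for i in range(d)) for j in range(k)]
            for x in X]

def norm(v):
    return sqrt(sum(c * c for c in v))

# Five random 1000-dimensional points projected down to 200 dimensions.
X = [[random.gauss(0, 1) for _ in range(1000)] for _ in range(5)]
Z = random_projection(X, 200)
ratios = [norm(z) / norm(x) for x, z in zip(X, Z)]
```

Each ratio stays near 1, confirming lengths survive the 5x dimensionality reduction; the projected data would then feed the boosting stage.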
Abstract:
- Objective: To explore the potential for using a basic text search of routine emergency department (ED) data to identify product-related injury in infants, and to compare the patterns from routine ED data and specialised injury surveillance data.
- Methods: Data were sourced from the Emergency Department Information System (EDIS) and the Queensland Injury Surveillance Unit (QISU) for all injured infants between 2009 and 2011. A basic text search was developed to identify the top five infant products in QISU. Sensitivity, specificity, and positive predictive value were calculated, and a refined search was used with EDIS. Results were manually reviewed to assess validity. Descriptive analysis was conducted to examine patterns between datasets.
- Results: The basic text search for all products showed high sensitivity and specificity, and most searches showed high positive predictive value. EDIS patterns were similar to QISU patterns, with strikingly similar month-of-age injury peaks, admission proportions and types of injuries.
- Conclusions: This study demonstrated a capacity to identify a sample of valid cases of product-related injuries for specified products using simple text searching of routine ED data.
- Implications: As the capacity for large datasets grows and the capability to reliably mine text improves, opportunities for expanded sources of injury surveillance data increase. This will ultimately assist stakeholders such as consumer product safety regulators and child safety advocates to appropriately target prevention initiatives.
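The validation metrics used above are straightforward to compute from a confusion table once the text-search hits are compared against manually reviewed records. The counts in this sketch are hypothetical, not the study's results:

```python
def screening_metrics(tp, fp, fn, tn):
    """Validation metrics for a text-search case definition against a
    gold standard (here, manual review of records):
      sensitivity = TP / (TP + FN)   -- true cases the search finds
      specificity = TN / (TN + FP)   -- non-cases correctly excluded
      PPV         = TP / (TP + FP)   -- search hits that are true cases
    """
    return {
        'sensitivity': tp / (tp + fn),
        'specificity': tn / (tn + fp),
        'ppv': tp / (tp + fp),
    }

# Hypothetical counts for one product search term.
m = screening_metrics(tp=90, fp=10, fn=5, tn=895)
```

A refined search trades these quantities off: tightening the terms raises PPV at the cost of sensitivity, which is why the abstract reports all three.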