970 results for Datasets
Abstract:
Recommender systems assist users in finding what they want. The challenging issue is how to efficiently acquire user preferences or user information needs for building personalized recommender systems. This research explores the acquisition of user preferences using data taxonomy information to enhance personalized recommendations and alleviate the cold-start problem. A concept hierarchy model is proposed, which provides a two-dimensional hierarchy for acquiring user preferences. The language model is also extended for the proposed hierarchy in order to generate an effective recommender algorithm. Both Amazon.com book and music datasets are used to evaluate the proposed approach, and the experimental results show that the proposed approach is promising.
Abstract:
In this thesis, the genetic variation of human populations from the Baltic Sea region was studied in order to elucidate population history as well as evolutionary adaptation in this region. The study provided a novel understanding of how the complex population-level processes of migration, genetic drift, and natural selection have shaped genetic variation in North European populations. Results from genome-wide, mitochondrial DNA and Y-chromosomal analyses suggested that the genetic background of the populations of the Baltic Sea region lies predominantly in Continental Europe, which is consistent with earlier studies and archaeological evidence. The late settlement of Fennoscandia after the Ice Age and the subsequent small population size have led to pronounced genetic drift, especially in Finland and Karelia but also in Sweden, evident especially in genome-wide and Y-chromosomal analyses. Consequently, these populations show striking genetic differentiation, as opposed to the much more homogeneous pattern of variation in Central European populations. Additionally, the eastern side of the Baltic Sea was observed to have experienced eastern influence in the genome-wide data as well as in mitochondrial DNA and Y-chromosomal variation, consistent with linguistic connections. However, Slavic influence in the Baltic Sea populations appears minor at the genetic level. While the genetic diversity of the Finnish population overall was low, genome-wide and Y-chromosomal results showed pronounced regional differences. The genetic distance between Western and Eastern Finland was larger than for many geographically distant population pairs, and provinces also showed genetic differences. This is probably mainly due to the late settlement of Eastern Finland and local isolation, although differences in ancestral migration waves may contribute to this, too.
In contrast, mitochondrial DNA and Y-chromosomal analyses of the contemporary Swedish population revealed a much less pronounced population structure and a fusion of the traces of ancient admixture, genetic drift, and recent immigration. Genome-wide datasets also provide a resource for studying the adaptive evolution of human populations. This study revealed tens of loci with strong signs of recent positive selection in Northern Europe. These results provide interesting targets for future research on evolutionary adaptation, and may be important for understanding the background of disease-causing variants in human populations.
Abstract:
Quantifying fluxes of nitrous oxide (N2O), a potent greenhouse gas, from soils is necessary to improve our knowledge of terrestrial N2O losses. Developing universal sampling frequencies for calculating annual N2O fluxes is difficult, as fluxes are renowned for their high temporal variability. We demonstrate that daily sampling was largely required to achieve annual N2O fluxes within 10% of the best estimate for 28 annual datasets collected from three continents: Australia, Europe and Asia. Decreasing the regularity of measurements either under- or overestimated annual N2O fluxes, with a maximum overestimation of 935%. Measurement frequency was lowered using a sampling strategy based on environmental factors known to affect temporal variability, but this still required sampling more than once a week. Consequently, uncertainty in current global terrestrial N2O budgets associated with the upscaling of field-based datasets can be decreased significantly using adequate sampling frequencies.
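The effect of sampling frequency on an annual flux estimate can be illustrated with a small sketch. The abstract does not state how the annual totals were integrated, so the trapezoidal interpolation between sampled days, the function name annual_flux, and the synthetic series with a single emission pulse are all assumptions made for illustration only; no real measurements are used.

```python
def annual_flux(daily, step):
    """Estimate the annual total from every `step`-th daily value.

    Gaps between sampled days are filled by linear interpolation
    (trapezoidal rule). This integration scheme is an assumption,
    not necessarily the one used in the study.
    """
    days = list(range(0, len(daily), step))
    if days[-1] != len(daily) - 1:
        days.append(len(daily) - 1)  # always include the final day
    total = 0.0
    for a, b in zip(days, days[1:]):
        total += 0.5 * (daily[a] + daily[b]) * (b - a)
    return total


# Synthetic series (hypothetical units): low background, one 5-day pulse.
daily = [1.0] * 365
for d in range(100, 105):
    daily[d] = 50.0

best = annual_flux(daily, 1)  # daily sampling as the reference estimate
for step in (1, 3, 7, 14):
    deviation = 100.0 * (annual_flux(daily, step) - best) / best
    print(f"every {step:2d} days: {deviation:+.1f}% vs daily sampling")
```

In this toy series the sparser schedules can either catch the pulse and spread it over a long interval, or miss it entirely, which mirrors the under- and overestimation described in the abstract.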
Abstract:
Lead contamination in the environment is of particular concern, as it is a known toxin. Until recently, however, much less attention has been given to the local contamination caused by activities at shooting ranges compared to large-scale industrial contamination. In Finland, more than 500 tons of Pb is produced each year for shotgun ammunition. The contaminant threatens various organisms, ground water and the health of human populations. However, the forest at shooting ranges usually shows no visible sign of stress compared to nearby clean environments. The aboveground biota normally reflects the belowground ecosystem. Thus, the soil microbial communities appear to bear strong resistance to contamination, despite the influence of lead. The studies forming this thesis investigated a shooting range site at Hälvälä in Southern Finland, which is heavily contaminated by lead pellets. Previously it was experimentally shown that the growth of grasses and degradation of litter are retarded. Measurements of acute toxicity of the contaminated soil or soil extracts gave conflicting results, as enchytraeid worms used as toxicity reporters were strongly affected, while reporter bacteria showed no or very minor decreases in viability. Measurements using sensitive inducible luminescent reporter bacteria suggested that the bioavailability of lead in the soil is indeed low, and this notion was supported by the very low water extractability of the lead. Nevertheless, the frequency of lead-resistant cultivable bacteria was elevated based on the isolation of cultivable strains. The bacterial and fungal diversity in heavily lead contaminated shooting sectors were compared with those of pristine sections of the shooting range area. The bacterial 16S rRNA gene and fungal ITS rRNA gene were amplified, cloned and sequenced using total DNA extracted from the soil humus layer as the template. 
Altogether, 917 sequenced bacterial clones and 649 sequenced fungal clones revealed a high soil microbial diversity. No effect of lead contamination was found on bacterial richness or diversity, while fungal richness and diversity significantly differed between lead contaminated and clean control areas. However, even in the case of fungi, genera that were deemed sensitive were not totally absent from the contaminated area: only their relative frequency was significantly reduced. Some operational taxonomic units (OTUs) assigned to Basidiomycota were clearly affected, and were much rarer in the lead contaminated areas. The studies of this thesis surveyed EcM sporocarps, analyzed morphotyped EcM root tips by direct sequencing, and 454-pyrosequenced fungal communities in in-growth bags. A total of 32 EcM fungi that formed conspicuous sporocarps, 27 EcM fungal OTUs from 294 root tips, and 116 EcM fungal OTUs from a total of 8 194 ITS2 454 sequences were recorded. The ordination analyses by non-parametric multidimensional scaling (NMS) indicated that Pb enrichment induced a shift in the EcM community composition. This was visible as indicative trends in the sporocarp and root tip datasets, but explicitly clear in the communities observed in the in-growth bags. The compositional shift in the EcM community was mainly attributable to an increase in the frequencies of OTUs assigned to the genus Thelephora, and to a decrease in the OTUs assigned to Pseudotomentella, Suillus and Tylospora in Pb-contaminated areas when compared to the control. The enrichment of Thelephora in contaminated areas was also observed when examining the total fungal communities in soil using DNA cloning and sequencing technology. While the compositional shifts are clear, their functional consequences for the dominant trees or soil ecosystem remain undetermined. The results indicate that at the Hälvälä shooting range, lead influences the fungal communities but not the bacterial communities. 
The forest ecosystem shows apparent functional redundancy, since no significant effects were seen on forest trees. With 454 pyrosequencing, a single analysis run can now yield up to one million sequences. The technique has been applied in microbial ecology studies to characterize microbial communities. The handling of sequence data with traditional programs is becoming difficult and exceedingly time consuming, and novel tools are needed to handle the vast amounts of data being generated. The field of microbial ecology has recently benefited from the availability of a number of tools for describing and comparing microbial communities using robust statistical methods. However, although these programs provide methods for rapid calculation, it has become necessary to make them more amenable to larger datasets and numbers of samples from pyrosequencing. As part of this thesis, a new program, MuSSA (Multi-Sample Sequence Analyser), was developed to handle sequence data from novel high-throughput sequencing approaches in microbial community analyses. The greatest advantage of the program is that large volumes of sequence data can be manipulated, and general OTU series with a frequency value can be calculated among a large number of samples.
Abstract:
One major reason for the global decline of biodiversity is habitat loss and fragmentation. Conservation areas can be designed to reduce biodiversity loss, but as resources are limited, conservation efforts need to be prioritized in order to achieve best possible outcomes. The field of systematic conservation planning developed as a response to opportunistic approaches to conservation that often resulted in biased representation of biological diversity. The last two decades have seen the development of increasingly sophisticated methods that account for information about biodiversity conservation goals (benefits), economical considerations (costs) and socio-political constraints. In this thesis I focus on two general topics related to systematic conservation planning. First, I address two aspects of the question about how biodiversity features should be valued. (i) I investigate the extremely important but often neglected issue of differential prioritization of species for conservation. Species prioritization can be based on various criteria, and is always goal-dependent, but can also be implemented in a scientifically more rigorous way than what is the usual practice. (ii) I introduce a novel framework for conservation prioritization, which is based on continuous benefit functions that convert increasing levels of biodiversity feature representation to increasing conservation value using the principle that more is better. Traditional target-based systematic conservation planning is a special case of this approach, in which a step function is used for the benefit function. We have further expanded the benefit function framework for area prioritization to address issues such as protected area size and habitat vulnerability. In the second part of the thesis I address the application of community level modelling strategies to conservation prioritization. 
One of the most serious issues in systematic conservation planning currently is not the deficiency of methodology for selection and design, but simply the lack of data. Community-level modelling offers a surrogate strategy that makes conservation planning more feasible in data-poor regions. We have reviewed the available community-level approaches to conservation planning. These range from simplistic classification techniques to sophisticated modelling and selection strategies. We have also developed a general and novel community-level approach to conservation prioritization that significantly improves on methods that were available before. This thesis introduces further degrees of realism into conservation planning methodology. The benefit-function-based conservation prioritization framework largely circumvents the problematic phase of target setting, and allowing for trade-offs between species representation provides a more flexible and hopefully more attractive approach to conservation practitioners. The community-level approach seems highly promising and should prove valuable for conservation planning, especially in data-poor regions. Future work should focus on integrating prioritization methods to deal with multiple aspects in combination influencing the prioritization process, and on further testing and refining the community-level strategies using real, large datasets.
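The contrast this abstract draws between target-based planning (a step benefit function) and continuous benefit functions can be sketched as follows. The abstract gives no functional forms, so the power-law shape, the exponent, the species weights, and all function names here are hypothetical choices made only to illustrate the "more is better" principle.

```python
def step_benefit(representation, target):
    """Target-based planning: full value once the target is met, none before."""
    return 1.0 if representation >= target else 0.0


def concave_benefit(representation, exponent=0.5):
    """Continuous benefit: value keeps increasing with representation.

    The power form with diminishing returns is an assumed example shape.
    """
    return representation ** exponent


def total_value(representations, benefit, weights=None):
    """Sum weighted benefits over biodiversity features (weights are hypothetical)."""
    if weights is None:
        weights = [1.0] * len(representations)
    return sum(w * benefit(r) for w, r in zip(weights, representations))


# Fractions of each feature's distribution covered by a candidate reserve network.
coverage = [0.25, 1.0, 0.09]
print(total_value(coverage, lambda r: step_benefit(r, 0.3)))
print(total_value(coverage, concave_benefit))
```

Under the step function, a feature just below its target contributes nothing, whereas the continuous function still credits partial representation, which is what allows trade-offs between features.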
Abstract:
The problem of unsupervised anomaly detection arises in a wide variety of practical applications. While one-class support vector machines have demonstrated their effectiveness as an anomaly detection technique, their ability to model large datasets is limited due to their memory and time complexity for training. To address this issue for supervised learning of kernel machines, there has been growing interest in random projection methods as an alternative to the computationally expensive problems of kernel matrix construction and support vector optimisation. In this paper we leverage the theory of nonlinear random projections and propose the Randomised One-class SVM (R1SVM), which is an efficient and scalable anomaly detection technique that can be trained on large-scale datasets. Our empirical analysis on several real-life and synthetic datasets shows that our randomised 1SVM algorithm achieves accuracy comparable to or better than deep autoencoder and traditional kernelised approaches for anomaly detection, while being approximately 100 times faster in training and testing.
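One well-known family of nonlinear random projections is random Fourier features, which approximate an RBF kernel with an explicit finite-dimensional map so that a linear model can replace a kernel machine. The abstract does not say which projection R1SVM uses, so the sketch below is only an illustration of the general idea; make_rff, rbf, dot, and the parameter values are hypothetical, and no one-class SVM is trained here.

```python
import math
import random


def rbf(x, y, gamma):
    """Exact RBF kernel exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def make_rff(dim, n_features, gamma, seed=0):
    """Return a map z with dot(z(x), z(y)) approximately equal to rbf(x, y, gamma)."""
    rng = random.Random(seed)
    scale = math.sqrt(2.0 * gamma)  # std of Gaussian frequencies for this kernel
    W = [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(n_features)]
    b = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    norm = math.sqrt(2.0 / n_features)

    def z(x):
        return [norm * math.cos(dot(w, x) + bi) for w, bi in zip(W, b)]

    return z


z = make_rff(dim=2, n_features=2000, gamma=0.5)
x, y = [0.1, 0.2], [0.4, -0.3]
print("exact kernel:", rbf(x, y, 0.5))
print("random-feature approximation:", dot(z(x), z(y)))
```

Because the features are explicit, no kernel matrix has to be built: a linear one-class model could be fitted directly on z(x), which is the kind of saving the abstract attributes to random projection methods.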
Abstract:
Identifying unusual or anomalous patterns in an underlying dataset is an important but challenging task in many applications. The focus of the unsupervised anomaly detection literature has mostly been on vectorised data. However, many applications are more naturally described using higher-order tensor representations. Approaches that vectorise tensorial data can destroy the structural information encoded in the high-dimensional space, and lead to the problem of the curse of dimensionality. In this paper we present the first unsupervised tensorial anomaly detection method, along with a randomised version of our method. Our anomaly detection method, the One-class Support Tensor Machine (1STM), is a generalisation of conventional one-class Support Vector Machines to higher-order spaces. 1STM preserves the multiway structure of tensor data, while achieving significant improvement in accuracy and efficiency over conventional vectorised methods. We then leverage the theory of nonlinear random projections to propose the Randomised 1STM (R1STM). Our empirical analysis on several real and synthetic datasets shows that our R1STM algorithm delivers comparable or better accuracy to a state-of-the-art deep learning method and traditional kernelised approaches for anomaly detection, while being approximately 100 times faster in training and testing.
Abstract:
Many conventional statistical machine learning algorithms generalise poorly if distribution bias exists in the datasets. For example, distribution bias arises in the context of domain generalisation, where knowledge acquired from multiple source domains needs to be used in previously unseen target domains. We propose Elliptical Summary Randomisation (ESRand), an efficient domain generalisation approach that comprises a randomised kernel and elliptical data summarisation. ESRand learns a domain interdependent projection to a latent subspace that minimises the existing biases in the data while maintaining the functional relationship between domains. In the latent subspace, ellipsoidal summaries replace the samples to enhance the generalisation by further removing bias and noise in the data. Moreover, the summarisation enables large-scale data processing by significantly reducing the size of the data. Through comprehensive analysis, we show that our subspace-based approach outperforms state-of-the-art results on several activity recognition benchmark datasets, while keeping the computational complexity significantly low.
Abstract:
Generating discriminative input features is a key requirement for achieving highly accurate classifiers. The process of generating features from raw data is known as feature engineering and it can take significant manual effort. In this paper we propose automated feature engineering to derive a suite of additional features from a given set of basic features, with the aim of both improving classifier accuracy through discriminative features and assisting data scientists through automation. Our implementation is specific to HTTP computer network traffic. To measure the effectiveness of our proposal, we compare the performance of a supervised machine learning classifier built with automated feature engineering versus one using human-guided features. The classifier addresses a problem in computer network security, namely the detection of HTTP tunnels. We use Bro to process network traffic into base features and then apply automated feature engineering to calculate a larger set of derived features. The derived features are calculated without favour to any base feature and include entropy, length and N-grams for all string features, and counts and averages over time for all numeric features. Feature selection is then used to find the most relevant subset of these features. Testing showed that both classifiers achieved a detection rate above 99.93% at a false positive rate below 0.01%. For our datasets, we conclude that automated feature engineering can provide the advantages of increasing classifier development speed and reducing development technical difficulties through the removal of manual feature engineering. These are achieved while also maintaining classification accuracy.
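The derived string features named in this abstract (entropy, length and N-grams) can be sketched in a few lines. This is only an illustration under assumptions: the record layout, the field names, the choice of character bigrams, and the function names are hypothetical and are not taken from the paper or from Bro's log format.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())


def char_ngrams(text, n=2):
    """Count overlapping character n-grams (empty if the text is shorter than n)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def derive_string_features(record):
    """Derive the same features for every string field, without favouring any."""
    features = {}
    for name, value in record.items():
        if not isinstance(value, str):
            continue  # numeric fields would get counts/averages over time instead
        features[f"{name}_length"] = len(value)
        features[f"{name}_entropy"] = shannon_entropy(value)
        for gram, count in char_ngrams(value).items():
            features[f"{name}_2gram_{gram}"] = count
    return features


# Hypothetical base-feature record for one HTTP request.
print(derive_string_features({"uri": "/a/a", "status": 200}))
```

Applying one uniform derivation to every field is what lets a later feature-selection step, rather than a human, decide which derived features matter.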
Abstract:
The application of computer-aided inspection integrated with the coordinate measuring machine and laser scanners to inspect manufactured aircraft parts using robust registration of two-point datasets is a subject of active research in computational metrology. This paper presents a novel approach to automated inspection by matching shapes based on the modified iterative closest point (ICP) method to define a criterion for the acceptance or rejection of a part. This procedure improves upon existing methods by doing away with the following, viz., the need for constructing either a tessellated or smooth representation of the inspected part and requirements for an a priori knowledge of approximate registration and correspondence between the points representing the computer-aided design datasets and the part to be inspected. In addition, this procedure establishes a better measure for error between the two matched datasets. The use of localized region-based triangulation is proposed for tracking the error. The approach described improves the convergence of the ICP technique with a dramatic decrease in computational effort. Experimental results obtained by implementing this proposed approach using both synthetic and practical data show that the present method is efficient and robust. This method thereby validates the algorithm, and the examples demonstrate its potential to be used in engineering applications.
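For orientation, the basic iterative closest point loop that this abstract builds on can be sketched in two dimensions. This is plain textbook ICP, not the modified method proposed in the paper: it assumes a small initial misalignment, uses brute-force nearest neighbours, and omits the region-based triangulation. All function names, the grid of points, and the idea of comparing an RMS error against a tolerance for acceptance are hypothetical.

```python
import math


def nearest(p, points):
    """Closest reference point to p by brute-force search."""
    return min(points, key=lambda q: (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)


def transform(points, theta, tx, ty):
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]


def best_rigid_2d(src, dst):
    """Least-squares rotation and translation mapping paired src points to dst."""
    n = len(src)
    sx, sy = sum(p[0] for p in src) / n, sum(p[1] for p in src) / n
    dx, dy = sum(q[0] for q in dst) / n, sum(q[1] for q in dst) / n
    num = den = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        ax, ay, bx, by = px - sx, py - sy, qx - dx, qy - dy
        num += ax * by - ay * bx
        den += ax * bx + ay * by
    theta = math.atan2(num, den)
    c, s = math.cos(theta), math.sin(theta)
    return theta, dx - (c * sx - s * sy), dy - (s * sx + c * sy)


def icp_2d(measured, reference, iterations=20):
    """Alternate nearest-neighbour matching and rigid re-alignment."""
    current = list(measured)
    for _ in range(iterations):
        matches = [nearest(p, reference) for p in current]
        theta, tx, ty = best_rigid_2d(current, matches)
        current = transform(current, theta, tx, ty)
    return current


def rms_error(points, reference):
    """Root-mean-square distance from each point to its nearest reference point."""
    total = 0.0
    for p in points:
        q = nearest(p, reference)
        total += (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return math.sqrt(total / len(points))


# Synthetic example: a design grid and a slightly rotated, shifted "scan" of it.
reference = [(float(i), float(j)) for i in range(4) for j in range(3)]
measured = transform(reference, 0.05, 0.1, -0.05)
aligned = icp_2d(measured, reference)
print("RMS before:", rms_error(measured, reference))
print("RMS after: ", rms_error(aligned, reference))
```

An inspection rule in this spirit would accept a part when the residual error after registration falls below a chosen tolerance; how the paper defines its actual criterion is not stated in the abstract.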
Abstract:
Support Vector Machines (SVMs) are hyperplane classifiers defined in a kernel-induced feature space. The training time complexity of SVMs, which depends on the data size, usually prohibits their use in applications involving more than a few thousand data points. In this paper we propose a novel kernel-based incremental data clustering approach and its use for scaling non-linear Support Vector Machines to handle large datasets. The clustering method introduced can find cluster abstractions of the training data in a kernel-induced feature space. These cluster abstractions are then used for selective-sampling-based training of Support Vector Machines to reduce the training time without compromising the generalization performance. Experiments with real-world datasets show that this approach gives good generalization performance at reasonable computational expense.
Abstract:
Twitter’s hashtag functionality is now used for a very wide variety of purposes, from covering crises and other breaking news events through gathering an instant community around shared media texts (such as sporting events and TV broadcasts) to signalling emotive states from amusement to despair. These divergent uses of the hashtag are increasingly recognised in the literature, with attention paid especially to the ability of hashtags to facilitate the creation of ad hoc or hashtag publics. A more comprehensive understanding of these different uses of hashtags has yet to be developed, however. Previous research has explored the potential for a systematic analysis of the quantitative metrics that could be generated from processing a series of hashtag datasets. Such research found, for example, that crisis-related hashtags exhibited a significantly larger incidence of retweets and tweets containing URLs than hashtags relating to televised events, and on this basis hypothesised that the information-seeking and -sharing behaviours of Twitter users in such different contexts were substantially divergent. This article updates such studies and their methodology by examining the communicative metrics of a considerably larger and more diverse number of hashtag datasets, compiled over the past five years. This provides an opportunity both to confirm earlier findings and to explore whether hashtag use practices may have shifted subsequently as Twitter’s userbase has developed further; it also enables the identification of further hashtag types beyond the “crisis” and “mainstream media event” types outlined to date. The article also explores the presence of such patterns beyond recognised hashtags, by incorporating an analysis of a number of keyword-based datasets.
This large-scale, comparative approach contributes towards the establishment of a more comprehensive typology of hashtags and their publics, and the metrics it describes will also be able to be used to classify new hashtags emerging in the future. In turn, this may enable researchers to develop systems for automatically distinguishing newly trending topics into a number of event types, which may be useful for example for the automatic detection of acute crises and other breaking news events.
Abstract:
The first quarter of the 20th century witnessed a rebirth of cosmology, the study of our Universe, as a field of scientific research with testable theoretical predictions. The amount of available cosmological data grew slowly from a few galaxy redshift measurements, rotation curves and local light element abundances into the first detection of the cosmic microwave background (CMB) in 1965. By the turn of the century the amount of data exploded, incorporating fields of new, exciting cosmological observables such as lensing, Lyman alpha forests, type Ia supernovae, baryon acoustic oscillations and Sunyaev-Zeldovich regions, to name a few.
The CMB, the ubiquitous afterglow of the Big Bang, carries with it a wealth of cosmological information. Unfortunately, that information, in the form of delicate intensity variations, turned out to be hard to extract from the overall temperature. After the first detection, it took nearly 30 years before the first evidence of fluctuations in the microwave background was presented. At present, high precision cosmology is solidly based on precise measurements of the CMB anisotropy, making it possible to pinpoint cosmological parameters to one-in-a-hundred level precision. The progress has made it possible to build and test models of the Universe that differ in the way the cosmos evolved during some fraction of the first second after the Big Bang.
This thesis is concerned with high precision CMB observations. It presents three selected topics along a CMB experiment analysis pipeline. Map-making and residual noise estimation are studied using an approach called destriping. The studied approximate methods are invaluable for the large datasets of any modern CMB experiment and will undoubtedly become even more so when the next generation of experiments reaches the operational stage.
We begin with a brief overview of cosmological observations and describe the general relativistic perturbation theory.
Next we discuss the map-making problem of a CMB experiment and the characterization of residual noise present in the maps. In the end, the use of modern cosmological data is presented in the study of an extended cosmological model, the correlated isocurvature fluctuations. Currently available data is shown to indicate that future experiments are certainly needed to provide more information on these extra degrees of freedom. Any solid evidence of the isocurvature modes would have a considerable impact due to their power in model selection.
Abstract:
The increased availability of image capturing devices has enabled collections of digital images to rapidly expand in both size and diversity. This has created a constantly growing need for efficient and effective image browsing, searching, and retrieval tools. Pseudo-relevance feedback (PRF) has proven to be an effective mechanism for improving retrieval accuracy. An original, simple yet effective rank-based PRF mechanism (RB-PRF) that takes into account the initial rank order of each image to improve retrieval accuracy is proposed. This RB-PRF mechanism innovates by making use of binary image signatures to improve retrieval precision by promoting images similar to highly ranked images and demoting images similar to lower ranked images. Empirical evaluations based on standard benchmarks, namely Wang, Oliva & Torralba, and Corel datasets demonstrate the effectiveness of the proposed RB-PRF mechanism in image retrieval.
Abstract:
Muscoidea is a significant dipteran clade that includes house flies (Family Muscidae), latrine flies (F. Fanniidae), dung flies (F. Scathophagidae) and root maggot flies (F. Anthomyiidae). It comprises approximately 7000 described species. The monophyly of the Muscoidea and the precise relationships of muscoids to the closest superfamily, the Oestroidea (blow flies, flesh flies etc.), are both unresolved. Until now mitochondrial (mt) genomes were available for only two of the four muscoid families, precluding a thorough test of phylogenetic relationships using this data source. Here we present the first two mt genomes for the families Fanniidae (Euryomma sp.) and Anthomyiidae (Delia platura (Meigen, 1826)). We also conducted phylogenetic analyses of these newly sequenced mt genomes plus 15 other species representative of dipteran diversity to address the internal relationships of Muscoidea and its systematic position. Both maximum-likelihood and Bayesian analyses suggested that Muscoidea was not a monophyletic group, with the relationship (Fanniidae + Muscidae) + ((Anthomyiidae + Scathophagidae) + (Calliphoridae + Sarcophagidae)) supported by the majority of analysed datasets. This also implies that Oestroidea was paraphyletic in the majority of analyses. Divergence time estimation suggested that the earliest split within the Calyptratae, separating (Tachinidae + Oestridae) from the remaining families, occurred in the Early Eocene. The main divergence within the paraphyletic Muscoidea grade was between Fanniidae + Muscidae and the lineage ((Anthomyiidae + Scathophagidae) + (Calliphoridae + Sarcophagidae)), which occurred in the Late Eocene.