46 resultados para data-mining application
Resumo:
OBJECTIVES: The prediction of protein structure and the precise understanding of protein folding and unfolding processes remains one of the greatest challenges in structural biology and bioinformatics. Computer simulations based on molecular dynamics (MD) are at the forefront of the effort to gain a deeper understanding of these complex processes. Currently, these MD simulations are usually on the order of tens of nanoseconds, generate a large amount of conformational data and are computationally expensive. More and more groups run such simulations and generate a myriad of data, which raises new challenges in managing and analyzing these data. Because the vast range of proteins researchers want to study and simulate, the computational effort needed to generate data, the large data volumes involved, and the different types of analyses scientists need to perform, it is desirable to provide a public repository allowing researchers to pool and share protein unfolding data. METHODS: To adequately organize, manage, and analyze the data generated by unfolding simulation studies, we designed a data warehouse system that is embedded in a grid environment to facilitate the seamless sharing of available computer resources and thus enable many groups to share complex molecular dynamics simulations on a more regular basis. RESULTS: To gain insight into the conformational fluctuations and stability of the monomeric forms of the amyloidogenic protein transthyretin (TTR), molecular dynamics unfolding simulations of the monomer of human TTR have been conducted. Trajectory data and meta-data of the wild-type (WT) protein and the highly amyloidogenic variant L55P-TTR represent the test case for the data warehouse. CONCLUSIONS: Web and grid services, especially pre-defined data mining services that can run on or 'near' the data repository of the data warehouse, are likely to play a pivotal role in the analysis of molecular dynamics unfolding data.
Resumo:
In this article, we review the state-of-the-art techniques in mining data streams for mobile and ubiquitous environments. We start the review with a concise background of data stream processing, presenting the building blocks for mining data streams. In a wide range of applications, data streams are required to be processed on small ubiquitous devices like smartphones and sensor devices. Mobile and ubiquitous data mining target these applications with tailored techniques and approaches addressing scarcity of resources and mobility issues. Two categories can be identified for mobile and ubiquitous mining of streaming data: single-node and distributed. This survey will cover both categories. Mining mobile and ubiquitous data require algorithms with the ability to monitor and adapt the working conditions to the available computational resources. We identify the key characteristics of these algorithms and present illustrative applications. Distributed data stream mining in the mobile environment is then discussed, presenting the Pocket Data Mining framework. Mobility of users stimulates the adoption of context-awareness in this area of research. Context-awareness and collaboration are discussed in the Collaborative Data Stream Mining, where agents share knowledge to learn adaptive accurate models.
Resumo:
An important application of Big Data Analytics is the real-time analysis of streaming data. Streaming data imposes unique challenges to data mining algorithms, such as concept drifts, the need to analyse the data on the fly due to unbounded data streams and scalable algorithms due to potentially high throughput of data. Real-time classification algorithms that are adaptive to concept drifts and fast exist, however, most approaches are not naturally parallel and are thus limited in their scalability. This paper presents work on the Micro-Cluster Nearest Neighbour (MC-NN) classifier. MC-NN is based on an adaptive statistical data summary based on Micro-Clusters. MC-NN is very fast and adaptive to concept drift whilst maintaining the parallel properties of the base KNN classifier. Also MC-NN is competitive compared with existing data stream classifiers in terms of accuracy and speed.
Resumo:
In molecular biology, it is often desirable to find common properties in large numbers of drug candidates. One family of methods stems from the data mining community, where algorithms to find frequent graphs have received increasing attention over the past years. However, the computational complexity of the underlying problem and the large amount of data to be explored essentially render sequential algorithms useless. In this paper, we present a distributed approach to the frequent subgraph mining problem to discover interesting patterns in molecular compounds. This problem is characterized by a highly irregular search tree, whereby no reliable workload prediction is available. We describe the three main aspects of the proposed distributed algorithm, namely, a dynamic partitioning of the search space, a distribution process based on a peer-to-peer communication framework, and a novel receiverinitiated load balancing algorithm. The effectiveness of the distributed method has been evaluated on the well-known National Cancer Institute’s HIV-screening data set, where we were able to show close-to linear speedup in a network of workstations. The proposed approach also allows for dynamic resource aggregation in a non dedicated computational environment. These features make it suitable for large-scale, multi-domain, heterogeneous environments, such as computational grids.
Resumo:
Facilitating the visual exploration of scientific data has received increasing attention in the past decade or so. Especially in life science related application areas the amount of available data has grown at a breath taking pace. In this paper we describe an approach that allows for visual inspection of large collections of molecular compounds. In contrast to classical visualizations of such spaces we incorporate a specific focus of analysis, for example the outcome of a biological experiment such as high throughout screening results. The presented method uses this experimental data to select molecular fragments of the underlying molecules that have interesting properties and uses the resulting space to generate a two dimensional map based on a singular value decomposition algorithm and a self organizing map. Experiments on real datasets show that the resulting visual landscape groups molecules of similar chemical properties in densely connected regions.
Resumo:
Recently, two approaches have been introduced that distribute the molecular fragment mining problem. The first approach applies a master/worker topology, the second approach, a completely distributed peer-to-peer system, solves the scalability problem due to the bottleneck at the master node. However, in many real world scenarios the participating computing nodes cannot communicate directly due to administrative policies such as security restrictions. Thus, potential computing power is not accessible to accelerate the mining run. To solve this shortcoming, this work introduces a hierarchical topology of computing resources, which distributes the management over several levels and adapts to the natural structure of those multi-domain architectures. The most important aspect is the load balancing scheme, which has been designed and optimized for the hierarchical structure. The approach allows dynamic aggregation of heterogenous computing resources and is applied to wide area network scenarios.
Resumo:
In real world applications sequential algorithms of data mining and data exploration are often unsuitable for datasets with enormous size, high-dimensionality and complex data structure. Grid computing promises unprecedented opportunities for unlimited computing and storage resources. In this context there is the necessity to develop high performance distributed data mining algorithms. However, the computational complexity of the problem and the large amount of data to be explored often make the design of large scale applications particularly challenging. In this paper we present the first distributed formulation of a frequent subgraph mining algorithm for discriminative fragments of molecular compounds. Two distributed approaches have been developed and compared on the well known National Cancer Institute’s HIV-screening dataset. We present experimental results on a small-scale computing environment.
Resumo:
In a world of almost permanent and rapidly increasing electronic data availability, techniques of filtering, compressing, and interpreting this data to transform it into valuable and easily comprehensible information is of utmost importance. One key topic in this area is the capability to deduce future system behavior from a given data input. This book brings together for the first time the complete theory of data-based neurofuzzy modelling and the linguistic attributes of fuzzy logic in a single cohesive mathematical framework. After introducing the basic theory of data-based modelling, new concepts including extended additive and multiplicative submodels are developed and their extensions to state estimation and data fusion are derived. All these algorithms are illustrated with benchmark and real-life examples to demonstrate their efficiency. Chris Harris and his group have carried out pioneering work which has tied together the fields of neural networks and linguistic rule-based algortihms. This book is aimed at researchers and scientists in time series modeling, empirical data modeling, knowledge discovery, data mining, and data fusion.
Resumo:
Aircraft Maintenance, Repair and Overhaul (MRO) feedback commonly includes an engineer’s complex text-based inspection report. Capturing and normalizing the content of these textual descriptions is vital to cost and quality benchmarking, and provides information to facilitate continuous improvement of MRO process and analytics. As data analysis and mining tools requires highly normalized data, raw textual data is inadequate. This paper offers a textual-mining solution to efficiently analyse bulk textual feedback data. Despite replacement of the same parts and/or sub-parts, the actual service cost for the same repair is often distinctly different from similar previously jobs. Regular expression algorithms were incorporated with an aircraft MRO glossary dictionary in order to help provide additional information concerning the reason for cost variation. Professional terms and conventions were included within the dictionary to avoid ambiguity and improve the outcome of the result. Testing results show that most descriptive inspection reports can be appropriately interpreted, allowing extraction of highly normalized data. This additional normalized data strongly supports data analysis and data mining, whilst also increasing the accuracy of future quotation costing. This solution has been effectively used by a large aircraft MRO agency with positive results.
Resumo:
Advances in hardware and software in the past decade allow to capture, record and process fast data streams at a large scale. The research area of data stream mining has emerged as a consequence from these advances in order to cope with the real time analysis of potentially large and changing data streams. Examples of data streams include Google searches, credit card transactions, telemetric data and data of continuous chemical production processes. In some cases the data can be processed in batches by traditional data mining approaches. However, in some applications it is required to analyse the data in real time as soon as it is being captured. Such cases are for example if the data stream is infinite, fast changing, or simply too large in size to be stored. One of the most important data mining techniques on data streams is classification. This involves training the classifier on the data stream in real time and adapting it to concept drifts. Most data stream classifiers are based on decision trees. However, it is well known in the data mining community that there is no single optimal algorithm. An algorithm may work well on one or several datasets but badly on others. This paper introduces eRules, a new rule based adaptive classifier for data streams, based on an evolving set of Rules. eRules induces a set of rules that is constantly evaluated and adapted to changes in the data stream by adding new and removing old rules. It is different from the more popular decision tree based classifiers as it tends to leave data instances rather unclassified than forcing a classification that could be wrong. The ongoing development of eRules aims to improve its accuracy further through dynamic parameter setting which will also address the problem of changing feature domain values.
Resumo:
Purpose: This paper aims to design an evaluation method that enables an organization to assess its current IT landscape and provide readiness assessment prior to Software as a Service (SaaS) adoption. Design/methodology/approach: The research employs a mixed of quantitative and qualitative approaches for conducting an IT application assessment. Quantitative data such as end user’s feedback on the IT applications contribute to the technical impact on efficiency and productivity. Qualitative data such as business domain, business services and IT application cost drivers are used to determine the business value of the IT applications in an organization. Findings: The assessment of IT applications leads to decisions on suitability of each IT application that can be migrated to cloud environment. Research limitations/implications: The evaluation of how a particular IT application impacts on a business service is done based on the logical interpretation. Data mining method is suggested in order to derive the patterns of the IT application capabilities. Practical implications: This method has been applied in a local council in UK. This helps the council to decide the future status of the IT applications for cost saving purpose.
Resumo:
Global communicationrequirements andloadimbalanceof someparalleldataminingalgorithms arethe major obstacles to exploitthe computational power of large-scale systems. This work investigates how non-uniform data distributions can be exploited to remove the global communication requirement and to reduce the communication costin parallel data mining algorithms and, in particular, in the k-means algorithm for cluster analysis. In the straightforward parallel formulation of the k-means algorithm, data and computation loads are uniformly distributed over the processing nodes. This approach has excellent load balancing characteristics that may suggest it could scale up to large and extreme-scale parallel computing systems. However, at each iteration step the algorithm requires a global reduction operationwhichhinders thescalabilityoftheapproach.Thisworkstudiesadifferentparallelformulation of the algorithm where the requirement of global communication is removed, while maintaining the same deterministic nature ofthe centralised algorithm. The proposed approach exploits a non-uniform data distribution which can be either found in real-world distributed applications or can be induced by means ofmulti-dimensional binary searchtrees. The approachcanalso be extended to accommodate an approximation error which allows a further reduction ofthe communication costs. The effectiveness of the exact and approximate methods has been tested in a parallel computing system with 64 processors and in simulations with 1024 processing element
Resumo:
Advances in hardware technologies allow to capture and process data in real-time and the resulting high throughput data streams require novel data mining approaches. The research area of Data Stream Mining (DSM) is developing data mining algorithms that allow us to analyse these continuous streams of data in real-time. The creation and real-time adaption of classification models from data streams is one of the most challenging DSM tasks. Current classifiers for streaming data address this problem by using incremental learning algorithms. However, even so these algorithms are fast, they are challenged by high velocity data streams, where data instances are incoming at a fast rate. This is problematic if the applications desire that there is no or only a very little delay between changes in the patterns of the stream and absorption of these patterns by the classifier. Problems of scalability to Big Data of traditional data mining algorithms for static (non streaming) datasets have been addressed through the development of parallel classifiers. However, there is very little work on the parallelisation of data stream classification techniques. In this paper we investigate K-Nearest Neighbours (KNN) as the basis for a real-time adaptive and parallel methodology for scalable data stream classification tasks.