44 results for Data sets storage


Relevance: 90.00%

Publisher:

Abstract:

In this paper, a hybrid model consisting of the fuzzy ARTMAP (FAM) neural network and the classification and regression tree (CART) is formulated. FAM is useful for tackling the stability–plasticity dilemma pertaining to data-based learning systems, while CART is useful for depicting its learned knowledge explicitly in a tree structure. By combining the benefits of both models, FAM–CART is capable of learning data samples stably and, at the same time, explaining its predictions with a set of decision rules. In other words, FAM–CART possesses two important properties of an intelligent system, i.e., learning in a stable manner (by overcoming the stability–plasticity dilemma) and extracting useful explanatory rules (by overcoming the opaqueness issue). To evaluate the usefulness of FAM–CART, six benchmark medical data sets from the UCI Machine Learning Repository and a real-world medical data classification problem are used. For performance comparison, several metrics are computed, including accuracy, specificity, sensitivity, and the area under the receiver operating characteristic (ROC) curve. The results are quantified with statistical indicators and compared with those reported in the literature. The outcomes positively indicate that FAM–CART is effective for undertaking data classification tasks. In addition to producing good results, it provides justifications of its predictions in the form of a decision tree, so that domain users can easily understand the predictions, making it a useful decision support tool.
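
As a minimal sketch of how the four evaluation metrics named above (accuracy, sensitivity, specificity, AUC) can be computed for a binary classifier, the snippet below uses scikit-learn; the labels, predictions, and scores are placeholder values, not the paper's data or model.

```python
# Minimal sketch: computing accuracy, sensitivity, specificity and ROC AUC
# for a binary classifier. All arrays below are placeholder values.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0])                  # ground-truth labels
y_pred  = np.array([1, 0, 1, 0, 0, 1, 1, 0])                  # hard predictions
y_score = np.array([0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1])  # class-1 scores

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy    = accuracy_score(y_true, y_pred)
sensitivity = tp / (tp + fn)          # true-positive rate
specificity = tn / (tn + fp)          # true-negative rate
auc         = roc_auc_score(y_true, y_score)
print(accuracy, sensitivity, specificity, auc)
```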

Relevance: 90.00%

Publisher:

Abstract:

When no prior knowledge is available, clustering is a useful technique for categorizing data into meaningful groups or clusters. In this paper, a modified fuzzy min-max (MFMM) clustering neural network is proposed, and its efficacy in tackling power quality monitoring tasks is demonstrated. A literature review on various clustering techniques is first presented. To evaluate the proposed MFMM model, a performance comparison study using benchmark data sets pertaining to clustering problems is conducted. The results obtained are comparable with those reported in the literature. Then, a real-world case study on power quality monitoring tasks is performed. The results are compared with those from the fuzzy c-means and k-means clustering methods. The experimental outcome positively indicates the potential of MFMM in undertaking data clustering tasks and its applicability to the power systems domain.
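
For readers unfamiliar with fuzzy min-max networks, the sketch below shows the classic Simpson-style hyperbox membership function that such models are built on; the modified MFMM model in the paper may use a different membership or expansion rule, so this is only the textbook form with an assumed sensitivity parameter gamma.

```python
# Classic fuzzy min-max hyperbox membership (Simpson-style), for illustration only.
# The paper's modified MFMM model may differ from this textbook formulation.
import numpy as np

def hyperbox_membership(x, v, w, gamma=4.0):
    """Membership of pattern x in the hyperbox with min point v and max point w."""
    x, v, w = (np.asarray(a, dtype=float) for a in (x, v, w))
    above = np.maximum(0, 1 - np.maximum(0, gamma * np.minimum(1, x - w)))
    below = np.maximum(0, 1 - np.maximum(0, gamma * np.minimum(1, v - x)))
    return float(np.mean((above + below) / 2.0))

print(hyperbox_membership([0.3, 0.5], v=[0.2, 0.4], w=[0.4, 0.6]))  # inside -> 1.0
print(hyperbox_membership([0.9, 0.9], v=[0.2, 0.4], w=[0.4, 0.6]))  # outside -> < 1.0
```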

Relevance: 90.00%

Publisher:

Abstract:

Recommendation based on offline data processing has attracted increasing attention from both research communities and IT industries. Recommendation techniques can be used to explore huge volumes of data, identify the items that users are likely to prefer, and translate research results into real-world applications, among other uses. This paper surveys recent progress in research on recommendations based on offline data processing, with emphasis on new techniques (such as temporal recommendation, graph-based recommendation and trust-based recommendation), new features (such as serendipitous recommendation) and new research issues (such as tag recommendation and group recommendation). We also provide an extensive review of evaluation measurements, benchmark data sets and available open source tools. Finally, we outline some existing challenges for future research.
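
As a toy illustration of one evaluation measurement commonly covered in such reviews, the sketch below computes precision@k and recall@k for a single ranked list; the item identifiers and the relevant set are hypothetical.

```python
# Toy sketch of a common top-N recommendation evaluation measurement:
# precision@k and recall@k. The ranked list and relevant set are invented.
def precision_recall_at_k(ranked_items, relevant_items, k):
    top_k = ranked_items[:k]
    hits = len(set(top_k) & set(relevant_items))
    return hits / k, hits / len(relevant_items)

ranked   = ["i3", "i7", "i1", "i9", "i4"]   # system's ranked recommendations
relevant = {"i1", "i9", "i5"}               # items the user actually liked
print(precision_recall_at_k(ranked, relevant, k=3))  # (precision@3, recall@3)
```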

Relevance: 90.00%

Publisher:

Abstract:

The need to estimate a particular quantile of a distribution is an important problem which frequently arises in many computer vision and signal processing applications. For example, our work was motivated by the requirements of many semi-automatic surveillance analytics systems which detect abnormalities in closed-circuit television (CCTV) footage using statistical models of low-level motion features. In this paper we specifically address the problem of estimating the running quantile of a data stream with non-stationary stochasticity when the memory for storing observations is limited. We make several major contributions: (i) we derive an important theoretical result which shows that the change in the quantile of a stream is constrained regardless of the stochastic properties of the data, (ii) we describe a set of high-level design goals for an effective estimation algorithm that emerge as a consequence of our theoretical findings, (iii) we introduce a novel algorithm which implements the aforementioned design goals by retaining a sample of data values in a manner adaptive to changes in the distribution of the data and progressively narrowing its focus during periods of quasi-stationary stochasticity, and (iv) we present a comprehensive evaluation of the proposed algorithm and compare it with existing methods in the literature on both synthetic data sets and three large 'real-world' streams acquired in the course of operation of an existing commercial surveillance system. Our findings convincingly demonstrate that the proposed method is highly successful and vastly outperforms the existing alternatives, especially when the target quantile is high-valued and the available buffer capacity is severely limited.
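
To make the problem setting concrete, the sketch below shows a naive memory-limited baseline: a sliding-window estimator that keeps at most a fixed number of recent observations. This is explicitly not the paper's adaptive sample-retention algorithm; the quantile level, buffer capacity, and synthetic stream are assumptions for illustration.

```python
# Naive baseline for the memory-limited running-quantile setting only:
# a sliding window of the most recent `capacity` observations. This is NOT
# the paper's algorithm, which retains samples adaptively to distribution changes.
from collections import deque
import random

def running_quantile(stream, q=0.95, capacity=100):
    buf = deque(maxlen=capacity)          # bounded memory
    for x in stream:
        buf.append(x)
        s = sorted(buf)
        yield s[int(q * (len(s) - 1))]    # empirical quantile of the buffer

stream = (random.gauss(0, 1) for _ in range(1000))
for i, est in enumerate(running_quantile(stream)):
    if i % 250 == 0:
        print(i, round(est, 3))
```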

Relevance: 90.00%

Publisher:

Abstract:

BACKGROUND: Atheoretical large-scale data mining techniques using machine learning algorithms have promise in the analysis of large epidemiological datasets. This study illustrates the use of a hybrid methodology for variable selection that took account of missing data and complex survey design to identify key biomarkers associated with depression from a large epidemiological study.

METHODS: The study used a three-step methodology, amalgamating multiple imputation, a machine-learning boosted regression algorithm and logistic regression, to identify key biomarkers associated with depression in the National Health and Nutrition Examination Survey (2009-2010). Depression was measured using the Patient Health Questionnaire-9, and 67 biomarkers were analysed. Covariates in this study included gender, age, race, smoking, food security, Poverty Income Ratio, Body Mass Index, physical activity, alcohol use, medical conditions and medications. The final imputed, weighted multiple logistic regression model included possible confounders and moderators.
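
The snippet below is a schematic sketch of the three-step idea described above (imputation, boosted-tree variable screening, then logistic regression) using scikit-learn on simulated data. It does not reproduce the study's survey-weighted, multiply-imputed workflow or its covariate adjustment; all data, parameters, and the number of retained variables are assumptions.

```python
# Schematic sketch of the three-step pipeline: impute -> boosted-tree screening
# -> logistic regression. Simulated data; not the study's NHANES analysis.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_full = rng.normal(size=(500, 20))
y = (X_full[:, 0] + 0.5 * X_full[:, 1] + rng.normal(size=500) > 0).astype(int)
X = X_full.copy()
X[rng.random(X.shape) < 0.1] = np.nan                          # inject missingness

X_imp = IterativeImputer(random_state=0).fit_transform(X)      # step 1: impute
gbm = GradientBoostingClassifier(random_state=0).fit(X_imp, y) # step 2: screen
top = np.argsort(gbm.feature_importances_)[-3:]                # keep top variables
logit = LogisticRegression().fit(X_imp[:, top], y)             # step 3: final model
print("selected columns:", top, "coefficients:", logit.coef_.round(2))
```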

RESULTS: After the creation of 20 imputed data sets from multiple chained regression sequences, machine-learning boosted regression initially identified 21 biomarkers associated with depression. Using traditional logistic regression methods, including controlling for possible confounders and moderators, a final set of three biomarkers was selected. The final three biomarkers from the novel hybrid variable selection methodology were red cell distribution width (OR 1.15; 95% CI 1.01, 1.30), serum glucose (OR 1.01; 95% CI 1.00, 1.01) and total bilirubin (OR 0.12; 95% CI 0.05, 0.28). Significant interactions were found between total bilirubin and the Mexican American/Hispanic group (p = 0.016), and between total bilirubin and current smokers (p < 0.001).

CONCLUSION: The systematic use of a hybrid methodology for variable selection, fusing data mining techniques based on a machine learning algorithm with traditional statistical modelling, accounted for missing data and the complex survey sampling methodology, and was demonstrated to be a useful tool for detecting three biomarkers associated with depression for future hypothesis generation: red cell distribution width, serum glucose and total bilirubin.

Relevance: 90.00%

Publisher:

Abstract:

With the advance of computing and electronic technology, quantitative data, for example continuous data (i.e., sequences of floating-point numbers), have become vital and have wide applications, such as the analysis of sensor data streams and financial data streams. However, existing association rule mining methods generally discover association rules from discrete variables, such as Boolean data ('0' and '1') and categorical data ('sunny', 'cloudy', 'rainy', etc.); very few deal with quantitative data. In this paper, a novel optimized fuzzy association rule mining (OFARM) method is proposed to mine association rules from quantitative data. The advantages of the proposed algorithm are threefold: 1) it proposes a novel method to improve the smoothness and flexibility of the membership functions of fuzzy sets; 2) it optimizes the fuzzy sets and their partition points with multiple objective functions after categorizing the quantitative data; and 3) it designs a two-level iteration to filter frequent itemsets and fuzzy association rules. The new method is verified on three different data sets, and the results demonstrate the effectiveness and potential of the developed scheme.
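
To illustrate the general idea of fuzzy partitioning that such methods build on, the sketch below maps a quantitative attribute onto "low/medium/high" fuzzy sets with triangular membership functions. OFARM optimizes the partition points and membership shape; here they are fixed, and the readings and breakpoints are invented.

```python
# Illustrative fuzzy partitioning of a quantitative attribute with triangular
# membership functions. OFARM optimizes these partitions; here they are fixed.
import numpy as np

def triangular(x, a, b, c):
    """Triangular membership with feet at a and c and peak at b."""
    x = np.asarray(x, dtype=float)
    left  = np.clip((x - a) / (b - a), 0, 1)
    right = np.clip((c - x) / (c - b), 0, 1)
    return np.minimum(left, right)

temps = np.array([12.0, 21.0, 33.0])                  # hypothetical readings
low    = triangular(temps, a=-5, b=5,  c=20)
medium = triangular(temps, a=10, b=20, c=30)
high   = triangular(temps, a=25, b=35, c=50)
print(np.round(np.column_stack([low, medium, high]), 2))  # one row per reading
```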

Relevance: 90.00%

Publisher:

Abstract:

A cyber-physical-social system (CPSS) allows individuals to share personal information collected not only from cyberspace but also from physical space. This results in large volumes of data accumulating in a user's local storage. However, it is very expensive for users to store large data sets, and doing so also causes problems in data management. It is therefore of critical importance to outsource the data to cloud servers, which provides users with an easy, cost-effective, and flexible way to manage data. However, users lose control over their data once they are outsourced to cloud servers, which poses challenges to the integrity of the outsourced data. Many schemes have been proposed to allow a third-party auditor to verify data integrity using the public keys of users. Most of these schemes rest on a strong assumption: that the auditors are honest and reliable; they are therefore vulnerable when auditors are malicious. Moreover, in most of these schemes, an auditor needs to manage users' certificates to choose the correct public keys for verification. In this paper, we propose a secure certificateless public integrity verification scheme (SCLPV). The SCLPV is the first work that simultaneously supports certificateless public verification and resistance against malicious auditors to verify the integrity of outsourced data in CPSS. A formal security proof establishes the correctness and security of our scheme. In addition, an elaborate performance analysis demonstrates that the SCLPV is efficient and practical. Compared with the only existing certificateless public verification scheme (CLPV), the SCLPV provides stronger security guarantees by remedying the security vulnerability of the CLPV and resisting malicious auditors. In comparison with the best existing integrity verification scheme that achieves resistance against malicious auditors, the communication cost between the auditor and the cloud server in the SCLPV is independent of the size of the processed data; moreover, the auditor in the SCLPV does not need to manage certificates.
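
To make the auditing goal concrete for readers, the sketch below shows a deliberately oversimplified spot-check of outsourced blocks against precomputed digests. The actual SCLPV scheme uses certificateless public-key cryptography and supports public verification without a trusted digest table; everything in this snippet (block contents, sample size) is an illustrative assumption.

```python
# Grossly simplified illustration of the integrity-auditing goal only:
# an auditor spot-checks outsourced blocks against previously stored digests.
# This is not the SCLPV scheme, which is certificateless and publicly verifiable.
import hashlib, random

blocks = [f"block-{i}".encode() for i in range(100)]            # outsourced data
digests = {i: hashlib.sha256(b).hexdigest() for i, b in enumerate(blocks)}

def audit(storage, digests, n_samples=5):
    for i in random.sample(list(digests), n_samples):
        if hashlib.sha256(storage[i]).hexdigest() != digests[i]:
            return False
    return True

print(audit(blocks, digests))     # True: data intact
blocks[42] = b"corrupted"
print(audit(blocks, digests))     # may detect the corruption, depending on the sample
```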

Relevance: 90.00%

Publisher:

Abstract:

Extracellular data analysis has become a quintessential method for understanding neurophysiological responses to stimuli. This demands stringent techniques owing to the complicated nature of the recording environment. In this paper, we highlight the challenges in extracellular multi-electrode recording and data analysis, as well as the limitations of some currently employed methodologies. To address some of these challenges, we present a unified algorithm in the form of selective sorting. Selective sorting is modelled around a hypothesized generative model, which addresses the natural phenomenon of spikes triggered by an intricate neuronal population. The algorithm incorporates the Cepstrum of Bispectrum, ad hoc clustering algorithms, wavelet transforms, and least-squares and correlation concepts, which strategically tailor a processing sequence to characterize spikes and form distinctive clusters. Additionally, we demonstrate the influence of noise-modelled wavelets in sorting overlapping spikes. The algorithm is evaluated using both raw and synthesized data sets with different levels of complexity, and the performances are tabulated for comparison using widely accepted qualitative and quantitative indicators.
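
The sketch below illustrates just one generic ingredient named above, wavelet-based denoising of an extracellular trace before spike detection, using the PyWavelets library on a synthetic signal. It is not the paper's full Cepstrum-of-Bispectrum / selective-sorting pipeline, and the wavelet, threshold rule, and detection cutoff are assumptions.

```python
# Generic wavelet-thresholding denoising of a synthetic extracellular trace,
# followed by a crude amplitude-based spike check. Not the paper's pipeline.
import numpy as np
import pywt

rng = np.random.default_rng(1)
signal = rng.normal(0, 1, 1024)
signal[300:305] += 8.0                          # crude synthetic "spike"

coeffs = pywt.wavedec(signal, "db4", level=4)   # multi-level decomposition
sigma = np.median(np.abs(coeffs[-1])) / 0.6745  # noise estimate from finest details
threshold = sigma * np.sqrt(2 * np.log(len(signal)))
coeffs = [coeffs[0]] + [pywt.threshold(c, threshold, mode="soft") for c in coeffs[1:]]
denoised = pywt.waverec(coeffs, "db4")
print("candidate spike indices:", np.where(denoised > 5 * denoised.std())[0][:10])
```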

Relevance: 90.00%

Publisher:

Abstract:

Ecological data sets rarely extend back more than a few decades, limiting our understanding of environmental change and its drivers. Marine historical ecology has played a critical role in filling these data gaps by illuminating the magnitude and rate of ongoing changes in marine ecosystems. Yet despite a growing body of knowledge, historical insights are rarely explicitly incorporated in mainstream conservation and management efforts. Failing to consider historical change can have major implications for conservation, such as the ratcheting down of expectations of ecosystem quality over time, leading to less ambitious targets for recovery or restoration. We discuss several unconventional sources used by historical ecologists to fill data gaps - including menus, newspaper articles, cookbooks, museum collections, artwork, and benthic sediment cores - and novel techniques for their analysis. We specify opportunities for the integration of historical data into conservation and management, and highlight the important role that these data can play in filling conservation data gaps and motivating conservation actions. As historical marine ecology research continues to grow as a multidisciplinary enterprise, great opportunities remain to foster direct linkages to conservation and improve the outlook for marine ecosystems.

Relevance: 90.00%

Publisher:

Abstract:

Recommendation systems adopt various techniques to recommend ranked lists of items that help users identify the items that best fit their personal tastes. Among various recommendation algorithms, user- and item-based collaborative filtering methods have been very successful in both industry and academia. More recently, the rapid growth of the Internet and e-commerce applications has created great challenges for recommendation systems, as the number of users and the amount of available online information continue to grow rapidly. These challenges include producing high-quality recommendations per second for millions of users and items, achieving high coverage in the face of data sparsity, and increasing the scalability of recommendation systems. To obtain higher-quality recommendations under data sparsity, in this paper we propose a novel method to compute the similarity of different users based on side information that goes beyond the user-item rating information from various online recommendation and review sites. Furthermore, we take the special interests of users into consideration and combine three types of information (users, items, user-items) to predict the ratings of items. Then FUIR, a novel recommendation algorithm which fuses user and item information, is proposed to generate recommendation results for target users. We evaluate our proposed FUIR algorithm on three data sets, and the experimental results demonstrate that FUIR is effective against sparse rating data and can produce higher-quality recommendations.
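
As a toy sketch of the general idea of augmenting rating-based user similarity with side information, the snippet below blends a cosine similarity over (sparse) rating vectors with one over side-information features. The feature choice and the blend weight are hypothetical; this is not the FUIR algorithm's actual formulation.

```python
# Toy blend of rating-based and side-information-based user similarity.
# Vectors and the blend weight are hypothetical, not the FUIR formulation.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

ratings_u = np.array([5, 0, 3, 0, 1], dtype=float)   # 0 = unrated (sparse)
ratings_v = np.array([4, 0, 0, 0, 1], dtype=float)
side_u    = np.array([1, 0, 1], dtype=float)         # e.g. tags / demographics
side_v    = np.array([1, 1, 1], dtype=float)

alpha = 0.7                                           # blend weight (assumed)
similarity = alpha * cosine(ratings_u, ratings_v) + (1 - alpha) * cosine(side_u, side_v)
print(round(similarity, 3))
```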

Relevance: 90.00%

Publisher:

Abstract:

Big data is one of the hottest research topics in science and technology communities, and it holds great potential for every sector of our society, such as climate, economy, health, and social science. Big data is currently treated as data sets with sizes beyond the ability of commonly used software tools to capture, curate, and manage. We have tasted the power of big data in various applications, such as finance, business, and health. However, big data is still in its infancy, as evidenced by its vague definition, limited application, and the unsolved security and privacy barriers to pervasive implementation. It is certain that we will face many unprecedented problems and challenges along the way of this unfolding revolutionary chapter of human history.

Relevance: 90.00%

Publisher:

Abstract:

The development of novel therapies is essential to lower the burden of complex diseases. The purpose of this study is to identify novel therapeutics for complex diseases using bioinformatic methods. Bioinformatic tools such as candidate gene prediction tools allow identification of disease genes by identifying potential candidate genes linked to genetic markers of the disease. Candidate gene prediction tools can only identify candidates for further research; they do not identify disease genes directly. Integration of drug-target data sets with candidate gene data sets can identify novel potential therapeutics suitable for repositioning in clinical trials. Drug repositioning can save valuable time and money spent on the therapeutic development of complex diseases.
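
The snippet below is a minimal sketch of the integration step described above: intersecting a candidate gene list with a drug-target table to surface potential repositioning candidates. The gene and drug names are invented placeholders, not real data sets.

```python
# Sketch of integrating a candidate gene list with a drug-target table to
# surface repositioning candidates. All names are invented placeholders.
import pandas as pd

candidate_genes = pd.DataFrame({"gene": ["GENE_A", "GENE_B", "GENE_C"]})
drug_targets = pd.DataFrame({
    "drug": ["drug_1", "drug_2", "drug_3"],
    "gene": ["GENE_B", "GENE_C", "GENE_X"],
})

repositioning_candidates = drug_targets.merge(candidate_genes, on="gene")
print(repositioning_candidates)   # drugs whose known targets are disease candidates
```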

Relevance: 90.00%

Publisher:

Abstract:

To generate realistic predictions, species distribution models require the accurate coregistration of occurrence data with environmental variables. There is a common assumption that species occurrence data are accurately georeferenced; however, this is often not the case. This study investigates whether locational uncertainty and sample size affect the performance and interpretation of fine-scale species distribution models. The effects of locational uncertainty across multiple sample sizes were evaluated by subsampling and spatially degrading occurrence data. Distribution models were constructed for kelp (Ecklonia radiata) across a large study site (680 km²) off the coast of southeastern Australia. Generalized additive models were used to predict distributions based on fine-resolution (2.5 m cell size) seafloor variables, generated from multibeam echosounder data sets, and occurrence data from underwater towed video. The effects of different levels of locational uncertainty, in combination with sample size, were evaluated by comparing model performance and predicted distributions. While locational uncertainty was observed to influence some measures of model performance, in general the effect was small and varied with the accuracy metric used. However, simulated locational uncertainty caused changes in variable importance and predicted distributions at fine scales, potentially influencing model interpretation. This was most evident with small sample sizes. The results suggest that seemingly high-performing, fine-scale models can be generated from data containing locational uncertainty, although interpreting their predictions can be misleading if the predictions are interpreted at scales similar to the spatial errors. This study demonstrates the need to consider predictions across geographic space rather than performance alone. The findings are important for conservation managers, as they highlight the inherent variation in predictions between equally performing distribution models and the subsequent restrictions on ecological interpretation.
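
The sketch below illustrates the evaluation design described above: subsampling occurrence records and spatially degrading them by adding random positional error before model fitting. The coordinates, sample sizes, and error levels are illustrative assumptions, not the study's data.

```python
# Sketch of subsampling occurrence records and degrading them with simulated
# locational error. Coordinates and error levels are illustrative only.
import numpy as np

rng = np.random.default_rng(7)
occurrences = rng.uniform(0, 1000, size=(500, 2))     # x, y positions in metres

def degrade(points, n_samples, error_sd):
    idx = rng.choice(len(points), size=n_samples, replace=False)   # subsample
    jitter = rng.normal(0, error_sd, size=(n_samples, 2))          # positional error
    return points[idx] + jitter

for n in (50, 200):                                    # sample sizes
    for sd in (0, 10, 50):                             # metres of positional error
        degraded = degrade(occurrences, n, sd)
        print(n, sd, degraded.shape)                   # degraded data for model fitting
```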

Relevance: 90.00%

Publisher:

Abstract:

Big data is one of the hottest research topics in science and technology communities, and it holds great application potential in every sector of our society, such as climate, economy, health, and social science. Big data usually comprises data sets with sizes beyond the ability of commonly used software tools to capture, curate, and manage. We can conclude that big data is still in its infancy, and that we will face many unprecedented problems and challenges along the way of this unfolding chapter of human history.