989 results for Structure mining


Relevance: 30.00%

Abstract:

Visual data mining (VDM) tools employ information visualization techniques in order to represent large amounts of high-dimensional data graphically and to involve the user in exploring data at different levels of detail. The users are looking for outliers, patterns and models (in the form of clusters, classes, trends, and relationships) in different categories of data, such as financial data and business information. The focus of this thesis is the evaluation of multidimensional visualization techniques, especially from the business user's perspective. We address three research problems. The first problem is the evaluation of projection-based visualizations with respect to their effectiveness in preserving the original distances between data points and the clustering structure of the data. In this respect, we propose the use of existing clustering validity measures. We illustrate their usefulness in evaluating five visualization techniques: Principal Components Analysis (PCA), Sammon's Mapping, Self-Organizing Map (SOM), Radial Coordinate Visualization and Star Coordinates. The second problem is concerned with evaluating different visualization techniques as to their effectiveness in visual data mining of business data. For this purpose, we propose an inquiry evaluation technique and conduct the evaluation of nine visualization techniques. The visualizations under evaluation are Multiple Line Graphs, Permutation Matrix, Survey Plot, Scatter Plot Matrix, Parallel Coordinates, Treemap, PCA, Sammon's Mapping and the SOM. The third problem is the evaluation of quality of use of VDM tools. We provide a conceptual framework for evaluating the quality of use of VDM tools and apply it to the evaluation of the SOM. In the evaluation, we use an inquiry technique for which we developed a questionnaire based on the proposed framework. The contributions of the thesis consist of three new evaluation techniques and the results obtained by applying these evaluation techniques.
The thesis provides a systematic approach to evaluation of various visualization techniques. In this respect, first, we performed and described the evaluations in a systematic way, highlighting the evaluation activities, and their inputs and outputs. Secondly, we integrated the evaluation studies in the broad framework of usability evaluation. The results of the evaluations are intended to help developers and researchers of visualization systems to select appropriate visualization techniques in specific situations. The results of the evaluations also contribute to the understanding of the strengths and limitations of the visualization techniques evaluated and further to the improvement of these techniques.
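As an illustration of the first research problem, the sketch below scores how well a low-dimensional projection preserves the original pairwise distances. It uses Sammon's stress as the score, which is an assumption made for this example: the thesis proposes clustering validity measures, which are not reproduced here. The helper names (`pairwise_distances`, `sammon_stress`, `project`), the toy data, and the axis-dropping "projection" (a stand-in for PCA, Sammon's Mapping or the SOM) are all hypothetical.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """Euclidean distances between all pairs of points, in a fixed pair order."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def sammon_stress(high, low):
    """Sammon's stress between original points and their projection.

    0.0 means every pairwise distance is preserved exactly;
    larger values mean more distortion.
    """
    d_high = pairwise_distances(high)
    d_low = pairwise_distances(low)
    total = sum(d_high)
    error = sum((dh - dl) ** 2 / dh for dh, dl in zip(d_high, d_low) if dh > 0)
    return error / total


def project(points, axes):
    """Toy projection: keep only the listed coordinate axes (assumption)."""
    return [tuple(p[a] for a in axes) for p in points]


# Hypothetical 3-D data: most of the spread lies along the third axis.
points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 5)]

stress_identity = sammon_stress(points, points)
stress_keep_xy = sammon_stress(points, project(points, (0, 1)))
stress_keep_xz = sammon_stress(points, project(points, (0, 2)))
```

Under this score, the projection that keeps the high-variance axis (`stress_keep_xz`) distorts distances less than the one that drops it (`stress_keep_xy`), which is the kind of comparison such an evaluation would make between visualization techniques.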

Relevance: 30.00%

Abstract:

The curse of dimensionality is a major problem in the fields of machine learning, data mining and knowledge discovery. Exhaustive search for the optimal subset of relevant features in a high-dimensional dataset is NP-hard. Suboptimal population-based stochastic algorithms such as genetic programming (GP) and genetic algorithms (GA) are good choices for searching through large search spaces, and are usually more feasible than exhaustive and deterministic search algorithms. On the other hand, population-based stochastic algorithms often suffer from premature convergence on mediocre suboptimal solutions. The Age Layered Population Structure (ALPS) is a novel metaheuristic for overcoming the problem of premature convergence in evolutionary algorithms, and for improving search in the fitness landscape. The ALPS paradigm uses an age measure to control breeding and competition between individuals in the population. This thesis uses a modification of the ALPS GP strategy called Feature Selection ALPS (FSALPS) for feature subset selection and classification of varied supervised learning tasks. FSALPS uses a novel frequency count system to rank features in the GP population based on evolved feature frequencies. The ranked features are translated into probabilities, which are used to control evolutionary processes such as terminal-symbol selection for the construction of GP trees and sub-trees. The FSALPS metaheuristic continuously refines the feature subset selection process while simultaneously evolving efficient classifiers through a non-converging evolutionary process that favors selection of features with high discrimination of class labels. We investigated and compared the performance of canonical GP, ALPS and FSALPS on high-dimensional benchmark classification datasets, including a hyperspectral image. Using ANOVA with Tukey's HSD test at the 95% confidence level, ALPS and FSALPS dominated canonical GP in evolving smaller but efficient trees with less bloat. FSALPS significantly outperformed canonical GP, ALPS and some feature selection strategies reported in the related literature on dimensionality reduction.
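The frequency-count idea described above can be sketched roughly as follows. This is a minimal illustration, not the FSALPS implementation: each GP individual is represented simply as the list of feature indices appearing at its terminals, and the additive `smoothing` term (which keeps unused features selectable) is an assumption of this example. All function names are hypothetical.

```python
import random
from collections import Counter


def feature_frequencies(population, n_features):
    """Count how often each feature index occurs across all individuals."""
    counts = Counter()
    for terminals in population:
        counts.update(terminals)
    return [counts.get(i, 0) for i in range(n_features)]


def selection_probabilities(freqs, smoothing=1.0):
    """Turn frequency counts into a probability distribution over features.

    The smoothing constant is an assumption: it stops features that are
    currently absent from the population from getting probability zero.
    """
    weights = [f + smoothing for f in freqs]
    total = sum(weights)
    return [w / total for w in weights]


def pick_terminal(probs, rng):
    """Sample one feature index to use as a terminal in a new (sub-)tree."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical population: three individuals over four features.
population = [[0, 2, 2], [2, 3], [0, 2]]
freqs = feature_frequencies(population, n_features=4)
probs = selection_probabilities(freqs)
terminal = pick_terminal(probs, random.Random(0))
```

Features that appear more often in the evolved population receive a proportionally higher chance of being chosen when new trees are built, so the selection bias is refined as evolution proceeds.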

Relevance: 30.00%

Abstract:

In the current study, an epidemiological study is carried out by means of a literature survey, both in groups identified as having a higher potential for drug-drug interactions (DDIs) and in other cases, to explore patterns of DDIs and the factors affecting them. The structure of the FDA Adverse Event Reporting System (FAERS) database is studied and analyzed in detail to identify issues and challenges in mining it for DDIs. The necessary pre-processing algorithms are developed based on this analysis, and the Apriori algorithm is modified to suit the process. Finally, the modules are integrated into a tool to identify DDIs. The results are validated by comparison against a standard drug interaction database. Of the associations obtained, 69% matched existing interactions and 31% were identified as new. This match supports the validity of the methodology and its applicability to similar databases. Formulating the results using generic names extends their relevance to a global scale. This global applicability helps health care professionals worldwide to observe caution during the various stages of drug administration, thus considerably enhancing pharmacovigilance.
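For readers unfamiliar with Apriori, the sketch below shows the textbook level-wise algorithm applied to a handful of made-up reports. It is not the modified version the study describes, and the item names (`drug_a`, `event_x`, and so on) and the support threshold are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset.
        return sum(itemset <= t for t in transactions) / n

    items = {item for t in transactions for item in t}
    current = {frozenset([item]) for item in items}
    frequent = {}
    k = 1
    while current:
        # Keep only the candidates of size k that are frequent.
        level = {c: s for c in current if (s := support(c)) >= min_support}
        frequent.update(level)
        k += 1
        # Join step: merge frequent itemsets into candidates of size k.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        current = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
    return frequent


# Hypothetical adverse event reports: drugs taken plus a reported event.
reports = [
    {"drug_a", "drug_b", "event_x"},
    {"drug_a", "drug_b", "event_x"},
    {"drug_a", "drug_c"},
    {"drug_b", "event_x"},
]
frequent = apriori(reports, min_support=0.5)
triple = frozenset({"drug_a", "drug_b", "event_x"})
```

An itemset such as `triple`, in which two drugs co-occur with an event in at least half of the reports, would be a candidate association to check against a reference interaction database.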

Relevance: 30.00%

Abstract:

Social bookmarking tools are rapidly emerging on the Web. In such systems, users set up lightweight conceptual structures called folksonomies. These systems currently provide relatively little structure. In this paper we discuss how association rule mining can be adopted to analyze and structure folksonomies, and how the results can be used for ontology learning and for supporting emergent semantics. We demonstrate our approach on a large-scale dataset stemming from an online system.

Relevance: 30.00%

Abstract:

Recently, two approaches have been introduced that distribute the molecular fragment mining problem. The first approach applies a master/worker topology. The second approach, a completely distributed peer-to-peer system, solves the scalability problem caused by the bottleneck at the master node. However, in many real-world scenarios the participating computing nodes cannot communicate directly due to administrative policies such as security restrictions. Thus, potential computing power is not accessible to accelerate the mining run. To overcome this shortcoming, this work introduces a hierarchical topology of computing resources, which distributes the management over several levels and adapts to the natural structure of such multi-domain architectures. The most important aspect is the load balancing scheme, which has been designed and optimized for the hierarchical structure. The approach allows dynamic aggregation of heterogeneous computing resources and is applied to wide area network scenarios.

Relevance: 30.00%

Abstract:

In real-world applications, sequential algorithms for data mining and data exploration are often unsuitable for datasets of enormous size, high dimensionality and complex data structure. Grid computing promises unprecedented opportunities for unlimited computing and storage resources. In this context, it is necessary to develop high-performance distributed data mining algorithms. However, the computational complexity of the problem and the large amount of data to be explored often make the design of large-scale applications particularly challenging. In this paper we present the first distributed formulation of a frequent subgraph mining algorithm for discriminative fragments of molecular compounds. Two distributed approaches have been developed and compared on the well-known National Cancer Institute's HIV-screening dataset. We present experimental results on a small-scale computing environment.

Relevance: 30.00%

Abstract:

OBJECTIVES: The prediction of protein structure and the precise understanding of protein folding and unfolding processes remain among the greatest challenges in structural biology and bioinformatics. Computer simulations based on molecular dynamics (MD) are at the forefront of the effort to gain a deeper understanding of these complex processes. Currently, these MD simulations are usually on the order of tens of nanoseconds, generate a large amount of conformational data and are computationally expensive. More and more groups run such simulations and generate a myriad of data, which raises new challenges in managing and analyzing these data. Because of the vast range of proteins researchers want to study and simulate, the computational effort needed to generate data, the large data volumes involved, and the different types of analyses scientists need to perform, it is desirable to provide a public repository allowing researchers to pool and share protein unfolding data. METHODS: To adequately organize, manage, and analyze the data generated by unfolding simulation studies, we designed a data warehouse system that is embedded in a grid environment to facilitate the seamless sharing of available computer resources and thus enable many groups to share complex molecular dynamics simulations on a more regular basis. RESULTS: To gain insight into the conformational fluctuations and stability of the monomeric forms of the amyloidogenic protein transthyretin (TTR), molecular dynamics unfolding simulations of the monomer of human TTR have been conducted. Trajectory data and meta-data of the wild-type (WT) protein and the highly amyloidogenic variant L55P-TTR represent the test case for the data warehouse. CONCLUSIONS: Web and grid services, especially pre-defined data mining services that can run on or 'near' the data repository of the data warehouse, are likely to play a pivotal role in the analysis of molecular dynamics unfolding data.

Relevance: 30.00%

Abstract:

Twitter is both a micro-blogging service and a platform for public conversation. Direct conversation is facilitated in Twitter through the use of @’s (mentions) and replies. While the conversational element of Twitter is of particular interest to the marketing sector, relatively few data-mining studies have focused on this area. We analyse conversations associated with reciprocated mentions that take place in a data-set consisting of approximately 4 million tweets collected over a period of 28 days that contain at least one mention. We ignore tweet content and instead use the mention network structure and its dynamical properties to identify and characterise Twitter conversations between pairs of users and within larger groups. We consider conversational balance, meaning the fraction of content contributed by each party. The goal of this work is to draw out some of the mechanisms driving conversation in Twitter, with the potential aim of developing conversational models.
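A minimal sketch of how reciprocated mentions and conversational balance might be computed from a mention network is shown below. It assumes the data are available as `(sender, mentioned_user)` pairs and approximates each party's contribution by its number of mentions; both are assumptions of this example, and the user names are made up.

```python
from collections import Counter


def reciprocated_pairs(mentions):
    """Map each reciprocated user pair (a, b), a < b, to (a->b count, b->a count)."""
    counts = Counter(mentions)
    pairs = {}
    for (a, b), n_ab in counts.items():
        n_ba = counts.get((b, a), 0)
        # Keep one canonical ordering per pair and require mentions both ways.
        if a < b and n_ba > 0:
            pairs[(a, b)] = (n_ab, n_ba)
    return pairs


def balance(n_ab, n_ba):
    """Fraction of the exchange contributed by the first user (0.5 = even)."""
    return n_ab / (n_ab + n_ba)


# Hypothetical mention edges: (sender, mentioned user).
mentions = [
    ("alice", "bob"),
    ("bob", "alice"),
    ("alice", "bob"),
    ("alice", "carol"),
    ("alice", "bob"),
]
pairs = reciprocated_pairs(mentions)
alice_share = balance(*pairs[("alice", "bob")])
```

Here only the alice/bob pair counts as a conversation, because carol never mentions alice back, and alice contributes three quarters of the exchange.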

Relevance: 30.00%

Abstract:

Menezesite, ideally Ba2MgZr4(BaNb12O42)·12H2O, occurs as a vug mineral in the contact zone between dolomite carbonatite and "jacupirangite" (a pyroxenite) at the Jacupiranga mine, in Cajati county, São Paulo state, Brazil, associated with dolomite, calcite, magnetite, clinohumite, phlogopite, ancylite-(Ce), strontianite, pyrite, and tochilinite. This is also the type locality for quintinite-2H. The mineral forms rhombododecahedra up to 1 mm, isolated or in aggregates. Menezesite is transparent and displays a vitreous luster; it is reddish brown with a white streak. It is non-fluorescent. Mohs hardness is about 4. Calculated density derived from the empirical formula is 4.181 g/cm3. It is isotropic, n = 1.93(1) (white light); n(calc) = 2.034. Menezesite exhibits weak anomalous birefringence. The empirical formula is (Ba1.47K0.53Ca0.31Ce0.17Nd0.10Na0.06La0.02)Σ2.66(Mg0.94Mn0.23Fe0.23Al0.03)Σ1.43(Zr2.75Ti0.96Th0.29)Σ4.00[(Ba0.72Th0.26U0.02)Σ1.00(Nb9.23Ti2.29Ta0.36Si0.12)Σ12.00O42]·12H2O. The mineral is cubic, space group Im-3 (204), a = 13.017(1) Å, V = 2206(1) Å3, Z = 2. Menezesite is isostructural with the synthetic compound Mg7[MgW12O42](OH)4·8H2O. The mineral was named in honor of Luiz Alberto Dias Menezes Filho (born 1950), mining engineer, mineral collector and merchant. Both the description and the name were approved by the CNMMN-IMA (Nomenclature Proposal 2005-023). Menezesite is the first natural heteropolyniobate. Heteropolyanions have been employed in a range of applications that include virus-binding inorganic drugs (including for the AIDS virus), homogeneous and heterogeneous catalysts, electro-optic and electrochromic materials, metal and protein binding, and as building blocks for nanostructuring of materials.

Relevance: 30.00%

Abstract:

Includes bibliography

Relevance: 30.00%

Abstract:

Identification and classification of overlapping nodes in networks are important topics in data mining. In this paper, a network-based (graph-based) semi-supervised learning method is proposed. It is based on competition and cooperation among walking particles in a network to uncover overlapping nodes by generating continuous-valued outputs (soft labels), corresponding to the levels of membership from the nodes to each of the communities. Moreover, the proposed method can be applied to detect overlapping data items in a data set of general form, such as a vector-based data set, once it is transformed to a network. Usually, label propagation involves risks of error amplification. In order to avoid this problem, the proposed method offers a mechanism to identify outliers among the labeled data items, and consequently prevents error propagation from such outliers. Computer simulations carried out for synthetic and real-world data sets provide a numeric quantification of the performance of the method. © 2012 Springer-Verlag.

Relevance: 30.00%

Abstract:

The utilization of borate mineral wastes with glass-ceramic technology was studied for the first time, and previously uninvestigated combinations of wastes were incorporated into the research. These wastes consist of soda lime silica glass, meat and bone meal ash, and fly ash. In order to investigate possible and relevant application areas in ceramics, kaolin clay, an essential raw material for the ceramic industry, was also employed in some of the studied compositions. As a result, three different glass-ceramic articles were obtained by the powder sintering method via individual sintering processes. A lightweight microporous glass-ceramic was developed from borate mining waste, meat and bone meal ash and kaolin clay. In some compositions in the related study, soda lime silica glass waste was used as an additive, providing a lightweight structure with a density below 0.45 g/cm3 and a crushing strength of 1.8 ± 0.1 MPa. In another study within the research, compositions in the B2O3-P2O5-SiO2 glass-ceramic ternary system were prepared from borate wastes, meat and bone meal ash and soda lime silica glass waste and sintered at up to 950 °C. Low-porosity, highly crystallized glass-ceramic structures with densities ranging from 1.8 ± 0.7 to 2.0 ± 0.3 g/cm3 and tensile strengths ranging from 8.0 ± 2 to 15.0 ± 0.5 MPa were achieved. Lastly, diopside-wollastonite (SiO2-Al2O3-CaO) glass-ceramics were successfully obtained from borate wastes, fly ash and soda lime silica glass waste with controlled rapid sintering between 950 and 1050 °C. The wollastonite and diopside crystal sizes were improved by adopting varied combinations of formulations and heating rates. The properties of the obtained materials show that the articles with a uniform pore structure could be useful for thermal and acoustic insulation and can be embedded in lightweight concrete, whereas the low-porosity glass-ceramics can be employed as building blocks or as additives in the cement and ceramic industries.

Relevance: 30.00%

Abstract:

The effects of abandoned mine drainage (AMD) on streams and responses to remediation efforts were studied using three streams (AMD-impacted, remediated, reference) in both the anthracite and the bituminous coal mining regions of Pennsylvania (USA). Response variables included ecosystem function as well as water chemistry and macroinvertebrate community composition. The bituminous AMD stream was extremely acidic with high dissolved metals concentrations, a prolific mid-summer growth of the filamentous alga Mougeotia, and >10-fold more chlorophyll than the reference stream. The anthracite AMD stream had a higher pH, substrata coated with iron hydroxide(s), and negligible chlorophyll. Macroinvertebrate communities in the AMD streams were different from the reference streams, the remediated streams, and each other. Relative to the reference stream, the AMD stream(s) had (1) greater gross primary productivity (GPP) in the bituminous region and undetectable GPP in the anthracite region, (2) greater ecosystem respiration in both regions, (3) greatly reduced ammonium uptake and nitrification in both regions, (4) lower nitrate uptake in the bituminous (but not the anthracite) region, (5) more rapid phosphorus removal from the water column in both regions, (6) activities of phosphorus-acquiring, nitrogen-acquiring, and hydrolytic-carbon-acquiring enzymes that indicated extreme phosphorus limitation in both regions, and (7) slower oak and maple leaf decomposition in the bituminous region and slower oak decomposition in the anthracite region.
Remediation brought chlorophyll concentrations and GPP nearer to values for respective reference streams, depressed ecosystem respiration, restored ammonium uptake, and partially restored nitrification in the bituminous (but not the anthracite) region, reduced nitrate uptake to an undetectable level, restored phosphorus uptake to near normal rates, and brought enzyme activities more in line with the reference stream in the bituminous (but not the anthracite) region. Denitrification was not detected in any stream. Water chemistry and macroinvertebrate community structure analyses capture the impact of AMD at the local reach scale, but functional measures revealed that AMD has ramifications that can cascade to downstream reaches and perhaps to receiving estuaries.