832 results for databases and data mining


Relevance: 100.00%

Publisher:

Abstract:

In a world of almost permanent and rapidly increasing electronic data availability, techniques for filtering, compressing, and interpreting this data to transform it into valuable and easily comprehensible information are of utmost importance. One key topic in this area is the capability to deduce future system behavior from a given data input. This book brings together for the first time the complete theory of data-based neurofuzzy modelling and the linguistic attributes of fuzzy logic in a single cohesive mathematical framework. After introducing the basic theory of data-based modelling, new concepts including extended additive and multiplicative submodels are developed, and their extensions to state estimation and data fusion are derived. All these algorithms are illustrated with benchmark and real-life examples to demonstrate their efficiency. Chris Harris and his group have carried out pioneering work which has tied together the fields of neural networks and linguistic rule-based algorithms. This book is aimed at researchers and scientists in time series modelling, empirical data modelling, knowledge discovery, data mining, and data fusion.

Relevance: 100.00%

Publisher:

Abstract:

This article explores the contribution that artisanal and small-scale mining (ASM) makes to poverty reduction in Tanzania, based on data on gold and diamond mining in Mwanza Region. The evidence suggests that people working in mining or related services are less likely to be in poverty than those with other occupations. However, the picture is complex; while mining income can help reduce poverty and provide a buffer against livelihood shocks, people's inability to obtain a formal mineral claim, or to effectively exploit their claims, contributes to insecurity. This is reinforced by a context in which ASM is peripheral to large-scale mining interests, is only gradually being addressed within national poverty reduction policies, and is segregated from district-level planning.

Relevance: 100.00%

Publisher:

Abstract:

This paper reviews the literature on the practice of using Online Analytical Processing (OLAP) systems to recall information stored by Online Transactional Processing (OLTP) systems. The review provides a basis for discussing the need for information recalled through OLAP systems to maintain the contexts of the transactions whose data were captured by the respective OLTP system. The paper observes an industry trend in which OLTP systems process information into data that are then stored in databases without the business rules that were used to produce them. This necessitates a practice whereby sets of business rules are used to extract, cleanse, transform and load data from disparate OLTP systems into OLAP databases to support the requirements for complex reporting and analytics. These sets of business rules are usually not the same as the business rules used to capture the data in the particular OLTP systems. The paper argues that differences between the business rules used to interpret the same data sets risk gaps in semantics between the information captured by OLTP systems and the information recalled through OLAP systems. Literature on modelling business transaction information as facts with context, as part of the modelling of information systems, was reviewed to identify design trends that contribute to the design quality of OLTP and OLAP systems. The paper then argues that the quality of OLTP and OLAP systems design depends critically on the capture of facts with associated context; the encoding of facts with context into data with business rules; the storage and sourcing of data with business rules; the decoding of data with business rules back into facts with context; and the recall of facts with associated context. The paper proposes UBIRQ, a design model to aid the co-design of data and business-rules storage for OLTP and OLAP purposes. The proposed design model provides the opportunity to implement and use multi-purpose databases and business-rules stores for OLTP and OLAP systems. Such implementations would enable OLTP systems to record and store data together with the executions of business rules, allowing both OLTP and OLAP systems to query data with the business rules used to capture them, thereby ensuring that information recalled via OLAP systems preserves the contexts of transactions as per the data captured by the respective OLTP system.
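
The abstract stops short of a concrete schema. As a rough illustration of the core idea (each fact stored alongside the business rule used to capture it, so that OLAP-style recall can restore the transaction's context), here is a minimal sketch in Python with SQLite; all table, column and rule names are invented rather than taken from UBIRQ:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # A shared business-rules store: each rule records how a fact was
    # encoded at capture time, so OLAP recall can restore that context.
    cur.execute("""
    CREATE TABLE business_rule (
        rule_id     INTEGER PRIMARY KEY,
        description TEXT,   -- human-readable statement of the rule
        expression  TEXT    -- machine-usable form, e.g. a formula
    )""")

    # Each transactional fact keeps a reference to the rule that produced it.
    cur.execute("""
    CREATE TABLE sale_fact (
        fact_id     INTEGER PRIMARY KEY,
        amount      REAL,
        captured_at TEXT,
        rule_id     INTEGER REFERENCES business_rule(rule_id)
    )""")

    cur.execute("INSERT INTO business_rule VALUES (1, "
                "'Amounts are net of tax, rounded to 2 decimals', "
                "'round(gross / (1 + tax_rate), 2)')")
    cur.execute("INSERT INTO sale_fact VALUES (1, 99.90, '2011-06-01', 1)")

    # OLAP-style recall that preserves context: facts joined to their rules.
    query = """SELECT f.amount, f.captured_at, r.description
               FROM sale_fact f JOIN business_rule r USING (rule_id)"""
    for row in cur.execute(query):
        print(row)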

Relevance: 100.00%

Publisher:

Abstract:

Spectroscopic catalogues, such as GEISA and HITRAN, do not yet include information on the water vapour continuum that pervades the visible, infrared and microwave spectral regions. This is partly because, in some spectral regions, there are rather few laboratory measurements in conditions close to those in the Earth's atmosphere; hence understanding of the characteristics of the continuum absorption is still emerging. This is particularly so in the near-infrared and visible, where there has been renewed interest and activity in recent years. In this paper we present a critical review focusing on recent laboratory measurements in two near-infrared window regions (centred on 4700 and 6300 cm−1) and include reference to the window centred on 2600 cm−1, where more measurements have been reported. The rather few available measurements have used Fourier transform spectroscopy (FTS), cavity ring-down spectroscopy, optical-feedback cavity-enhanced laser spectroscopy and, in very narrow regions, calorimetric interferometry. These systems have different advantages and disadvantages: FTS can measure the continuum across both these and neighbouring windows, whereas the cavity laser techniques are limited to fewer wavenumbers but have a much higher inherent sensitivity. The available results present a diverse view of the characteristics of continuum absorption, with differences in continuum strength exceeding a factor of 10 in the cores of these windows. In individual windows, the temperature dependence of the water vapour self-continuum differs significantly in the few sets of measurements that allow such an analysis. The available data also indicate that the temperature dependence differs significantly between different near-infrared windows. These pioneering measurements provide an impetus for further work. Improvements and/or extensions of existing techniques would aid progress towards a full characterisation of the continuum; as an example, we report pilot measurements of the water vapour self-continuum using a supercontinuum laser source coupled to an FTS. Such improvements, as well as additional measurements and analyses in other laboratories, would enable the inclusion of the water vapour continuum in future spectroscopic databases, and therefore allow more reliable forward modelling of the radiative properties of the atmosphere. It would also allow a more confident assessment of different theoretical descriptions of the underlying cause or causes of continuum absorption.

Relevance: 100.00%

Publisher:

Abstract:

The main purpose of this thesis project is the prediction of symptom severity and cause in data from a test battery for Parkinson's disease patients, based on data mining. The data were collected with a computer-based test battery performed by hand. We use the chi-square method to check which variables are important and which are not, then apply different data mining techniques to the normalised data and check which technique or method gives good results. The implementation of this thesis is in WEKA. The methods we used are Naïve Bayes, CART and KNN. We draw Bland-Altman plots and compute Spearman's correlation to check the final results and predictions: the Bland-Altman plot shows whether the predictions fall within our confidence limits, and Spearman's correlation shows how strong the relationship between actual and predicted values is. On the basis of the results and analysis, all three methods give nearly the same results, but CART (the J48 decision tree) gives good results for under-predicted and over-predicted values, which lie between -2 and +2. The correlation between the actual and predicted values is 0.794 for CART. Cause gives a better percentage classification result than disability because it uses two classes.
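
The thesis work itself was done in WEKA; a rough Python analogue of the pipeline it describes (normalisation, chi-square variable filtering, then Naïve Bayes, CART and KNN compared via Spearman's correlation) is sketched below. The feature matrix and labels are random placeholders, and scikit-learn's DecisionTreeClassifier stands in for CART/J48:

    import numpy as np
    from scipy.stats import spearmanr
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.model_selection import cross_val_predict
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.tree import DecisionTreeClassifier

    # X: rows = test-battery sessions, columns = motor-test features;
    # y: severity class. Both are random placeholders, not thesis data.
    rng = np.random.default_rng(0)
    X = rng.random((200, 10))
    y = rng.integers(0, 3, 200)

    X_norm = MinMaxScaler().fit_transform(X)            # normalise to [0, 1]
    X_sel = SelectKBest(chi2, k=5).fit_transform(X_norm, y)  # chi-square filter

    # The three classifiers compared in the thesis (CART approximated by
    # DecisionTreeClassifier; WEKA's J48 is the closely related C4.5).
    for name, clf in [("Naive Bayes", GaussianNB()),
                      ("CART", DecisionTreeClassifier(random_state=0)),
                      ("KNN", KNeighborsClassifier(n_neighbors=5))]:
        pred = cross_val_predict(clf, X_sel, y, cv=10)
        rho, _ = spearmanr(y, pred)   # actual vs predicted, as in the thesis
        print(f"{name}: Spearman rho = {rho:.3f}")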

Relevance: 100.00%

Publisher:

Abstract:

Data mining of the Eucalyptus EST genome identified four clusters (EGCEST2257E11.g, EGBGRT3213F11.g, and EGCCFB1223H11.g) from the highly conserved 14-3-3 protein family, which modulates a wide variety of cellular processes. Multiple alignments were built from twenty-four 14-3-3 protein sequences retrieved from the GenBank databases and from the four pools of the Eucalyptus genome programs. The alignment shows two highly conserved regions corresponding to the protein phosphorylation motifs, and nine highly conserved regions corresponding to the linkage regions of the alpha-helix structure, based on the three-dimensional structure of the functional dimer. The amino acid differences within the structural and functional domains of the 14-3-3 plant proteins were identified and can explain the functional diversity of the different isoforms. Phylogenetic protein trees were built by the maximum parsimony and neighbor-joining procedures, using Clustal X alignments and the PAUP software for phylogenetic analysis.
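
The study used Clustal X and PAUP; purely as an illustration of the tree-building step, a minimal neighbor-joining sketch with Biopython is shown below, assuming the aligned 14-3-3 sequences are available in a FASTA file (the file name is hypothetical):

    from Bio import AlignIO, Phylo
    from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

    # Load a multiple alignment of the 14-3-3 protein sequences
    # (hypothetical file, produced beforehand with e.g. Clustal).
    alignment = AlignIO.read("14-3-3_aligned.fasta", "fasta")

    # Pairwise distances from percent identity, then neighbor-joining,
    # mirroring one of the two tree-building procedures used in the study.
    dm = DistanceCalculator("identity").get_distance(alignment)
    tree = DistanceTreeConstructor().nj(dm)

    Phylo.draw_ascii(tree)  # quick text rendering of the phylogenetic tree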

Relevance: 100.00%

Publisher:

Abstract:

In a peer-to-peer network, the nodes interact with each other by sharing resources, services and information. Many applications have been developed using such networks; one class of such applications is peer-to-peer databases. Peer-to-peer database systems allow the sharing of unstructured data and are able to integrate data from several sources without the need for large investments, because existing repositories are used. However, the high flexibility and dynamicity of the network, as well as the absence of centralized management of information, make the process of locating information among the various participants in the network complex. In this context, this paper presents original contributions through a proposed architecture for a routing system that uses the Ant Colony algorithm to optimize the search for desired information, supported by ontologies that add semantics to the shared data, enabling integration among heterogeneous databases while seeking to reduce the message traffic on the network without losses in the number of responses; this is confirmed by an improvement of 22.5% in that number. © 2011 IEEE.
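
The paper's own routing algorithm is not reproduced in the abstract; the sketch below shows only the generic ant-colony mechanic it builds on, namely neighbour selection biased by pheromone trails that evaporate over time and are reinforced along successful query paths (all class names and parameter values are illustrative):

    import random
    from collections import defaultdict

    class AntRouter:
        """Toy ant-colony query routing for a P2P overlay (illustrative only)."""

        def __init__(self, neighbours, evaporation=0.1, deposit=1.0):
            self.neighbours = neighbours               # node -> list of peer ids
            self.evaporation = evaporation
            self.deposit = deposit
            self.pheromone = defaultdict(lambda: 1.0)  # (node, peer) -> trail level

        def next_hop(self, node):
            # Choose a neighbour with probability proportional to its pheromone.
            peers = self.neighbours[node]
            weights = [self.pheromone[(node, p)] for p in peers]
            return random.choices(peers, weights=weights, k=1)[0]

        def reinforce(self, path):
            # Evaporate all trails, then reward the edges of a successful path,
            # so later queries for similar data favour the same route.
            for edge in list(self.pheromone):
                self.pheromone[edge] *= (1.0 - self.evaporation)
            for a, b in zip(path, path[1:]):
                self.pheromone[(a, b)] += self.deposit

    # Tiny overlay: node 0 can reach 1 or 2; suppose 0 -> 2 answered a query.
    router = AntRouter({0: [1, 2], 1: [0], 2: [0]})
    router.reinforce([0, 2])
    print(router.next_hop(0))  # now biased towards node 2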

Relevance: 100.00%

Publisher:

Abstract:

The multi-relational data mining approach has emerged as an alternative for the analysis of structured data, such as relational databases. Unlike traditional algorithms, multi-relational proposals allow mining multiple tables directly, avoiding costly join operations. This paper presents a comparative study involving the traditional PatriciaMine algorithm and its corresponding multi-relational proposal, MR-Radix, in order to evaluate the performance of the two approaches to mining association rules in relational databases. The study presents two original contributions: the proposal of the multi-relational algorithm MR-Radix, which is efficient for use in relational databases both in execution time and in memory usage; and the empirical demonstration of the multi-relational approach's performance advantage over several tables, which avoids the costly join operations across multiple tables. © 2011 IEEE.
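
MR-Radix itself is not described in enough detail here to reproduce; the toy sketch below instead illustrates the join-based baseline the paper argues against, in which the relations are first materialised into one transaction table and itemset supports are then counted over the joined rows (data and names invented):

    from collections import Counter
    from itertools import combinations

    # Two toy relations: customer attributes and purchases (customer_id, item).
    customers = {1: "segment=gold", 2: "segment=silver"}
    purchases = [(1, "bread"), (1, "milk"), (2, "bread")]

    # Join-based baseline: build one transaction per customer that mixes
    # attributes from both tables -- the step MR-Radix avoids on large data.
    transactions = {}
    for cid, item in purchases:
        transactions.setdefault(cid, {customers[cid]}).add(item)

    # Count support of all 2-itemsets across the joined transactions.
    support = Counter()
    for items in transactions.values():
        support.update(combinations(sorted(items), 2))

    for itemset, count in support.most_common():
        print(itemset, count)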

Relevance: 100.00%

Publisher:

Abstract:

In soil surveys, several sampling systems can be used to define the most representative sites for sample collection and description of soil profiles. In recent years, the conditioned Latin hypercube sampling system has gained prominence for soil surveys. In Brazil, most soil maps are at small scales and in paper format, which hinders their refinement. The objectives of this work were: (i) to compare two sampling systems based on the conditioned Latin hypercube for mapping soil classes and soil properties; (ii) to retrieve information from a detailed-scale soil map of a pilot watershed for its refinement, comparing two data mining tools, and to validate the new soil map; and (iii) to create and validate a soil map of a much larger, similar area by extrapolating the information extracted from the existing soil map. Two sampling schemes were created, one by the conditioned Latin hypercube and one by the cost-constrained conditioned Latin hypercube. At each prospection site, soil classification and measurement of the A-horizon thickness were performed. Maps were generated and validated for each sampling system, comparing the efficiency of the methods. The conditioned Latin hypercube captured greater variability of soils and properties than the cost-constrained variant, although the former entailed greater difficulty in fieldwork; the cost-constrained conditioned Latin hypercube nevertheless presents great potential for use in soil surveys, especially in areas of difficult access. From the existing detailed-scale soil map of the pilot watershed, topographical information for each soil class was extracted from a digital elevation model and its derivatives by two data mining tools, and maps were generated with each tool. The more accurate of the tools was then used to extrapolate the soil information to a much larger, similar area, and the resulting map was validated. It was possible to retrieve the existing soil map information and apply it to a larger area containing similar soil-forming factors at much lower financial cost. The KnowledgeMiner data mining tool, together with ArcSIE, used to create the soil map, gave the better results and enabled the use of the existing soil map to extract soil information and apply it to similar, larger areas at reduced cost, which is especially important in developing countries with limited financial resources for such activities, such as Brazil.
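
Conditioned Latin hypercube sampling is normally run with dedicated packages (for example the R package clhs); the toy sketch below conveys only the underlying idea, choosing sites so that each covariate ends up with roughly one sample per quantile stratum and improving the selection by random swaps. A real survey would add the cost constraint and a proper simulated-annealing schedule:

    import numpy as np

    rng = np.random.default_rng(42)
    covariates = rng.random((500, 3))   # toy terrain covariates at 500 candidate sites
    n = 10                              # number of field samples to place

    # Quantile strata per covariate: an ideal cLHS sample hits each stratum once.
    edges = np.quantile(covariates, np.linspace(0, 1, n + 1), axis=0)

    def objective(idx):
        # Deviation from "one sample per stratum", summed over covariates.
        cost = 0
        for j in range(covariates.shape[1]):
            counts = np.histogram(covariates[idx, j], bins=edges[:, j])[0]
            cost += int(np.abs(counts - 1).sum())
        return cost

    # Greedy random swaps: keep any replacement that lowers the cost.
    idx = list(rng.choice(len(covariates), size=n, replace=False))
    for _ in range(2000):
        new_site = int(rng.integers(len(covariates)))
        if new_site in idx:
            continue
        cand = idx.copy()
        cand[int(rng.integers(n))] = new_site
        if objective(cand) < objective(idx):
            idx = cand

    print("chosen sites:", sorted(idx), "final cost:", objective(idx))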

Relevance: 100.00%

Publisher:

Abstract:

The unavailability of data to inform policy planning and formulation has repeatedly been cited as the main challenge to economic and social progress in the Caribbean. Furthermore, even when data is produced, broader gaps exist between its production and its eventual use for evidence-based policy formulation. Owing to those challenges, this report explores the use of databases of social and gender statistics in the development of policies and programmes in the Caribbean subregion. The report offers a general appraisal of databases against two main considerations: (i) maximizing the use of existing databases in relevant policies and programmes; and (ii) bridging the gaps in data availability of relevant statistical databases and their analyses. The assessment entailed an inventory of social and gender databases maintained by data producers in the region and an analysis of the extent to which the databases are used for policy formulation. To that end, a literature search as well as consultations with a number of knowledgeable persons active in the field of statistics and data provision were conducted. Based on the review, a set of recommendations was produced to improve current practices within the region with respect to evidence-based policy formulation.

Relevance: 100.00%

Publisher:

Abstract:

Presentations sponsored by the Patent and Trademark Depository Library Association (PTDLA) at the American Library Association Annual Conference, New Orleans, June 25, 2006.

Speaker #1: Nan Myers, Associate Professor; Government Documents, Patents and Trademarks Librarian, Wichita State University, Wichita, KS. Title: Intellectual Property Roundup: Copyright, Trademarks, Trade Secrets, and Patents. Abstract: This presentation provides a capsule overview of the distinctive coverage of the four types of intellectual property: what they are, why they are important, how to get them, what they cost, and how long they last. Emphasis will be on the questions patrons ask most, along with the answers. Includes coverage of the mission of Patent & Trademark Depository Libraries (PTDLs) and other sources of business information outside of libraries, such as Small Business Development Centers.

Speaker #2: Jan Comfort, Government Information Reference Librarian, Clemson University, Clemson, SC. Title: Patents as a Source of Competitive Intelligence Information. Abstract: Large corporations often have R&D departments, or large numbers of staff whose jobs are to monitor the activities of their competitors. This presentation will review strategies that small business owners can employ to do their own competitive intelligence analysis. The focus will be on features of the patent database that is available free of charge on the USPTO website, as well as commercial databases available at many public and academic libraries across the country.

Speaker #3: Virginia Baldwin, Professor; Engineering Librarian, University of Nebraska-Lincoln, Lincoln, NE. Title: Mining Online Patent Data for Business Information. Abstract: The United States Patent and Trademark Office (USPTO) website and the websites of international databases contain information about granted patents, patent applications and the technologies they represent. Statistical information about patents, their technologies, geographical information, and patenting entities is compiled and available as reports on the USPTO website. Other valuable information from these websites can be obtained using data mining techniques. This presentation will provide the keys to opening these resources and obtaining valuable data.

Speaker #4: Donna Hopkins, Engineering Librarian, Rensselaer Polytechnic Institute, Troy, NY. Title: Searching the USPTO Trademark Database for Wordmarks and Logos. Abstract: This presentation provides an overview of wordmark searching in www.uspto.gov, followed by a review of the techniques for searching for non-word US trademarks using codes from the Design Search Code Manual. These codes are used in an electronic search, either on the USPTO website or on CASSIS DVDs. The search is sometimes supplemented by consulting the Official Gazette. A specific example of using a section of the codes for searching is included. Similar searches on the Madrid Express database of WIPO, using the Vienna Classification, will also be briefly described.

Relevance: 100.00%

Publisher:

Abstract:

Background: The use of the knowledge produced by the sciences to promote human health is the main goal of translational medicine. To make it feasible we need computational methods to handle the large amount of information that arises from bench to bedside, and to deal with its heterogeneity. A computational challenge that must be faced is to promote the integration of clinical, socio-demographic and biological data. In this effort, ontologies play an essential role as a powerful artifact for knowledge representation. Chado is a modular, ontology-oriented database model that gained popularity due to its robustness and flexibility as a generic platform to store biological data; however, it lacks support for representing clinical and socio-demographic information.

Results: We have implemented an extension of Chado, the Clinical Module, to allow the representation of this kind of information. Our approach consists of a framework for data integration through the use of a common reference ontology. The design of this framework has four levels: the data level, to store the data; the semantic level, to integrate and standardize the data by the use of ontologies; the application level, to manage clinical databases, ontologies and the data integration process; and the web interface level, to allow interaction between the user and the system. The Clinical Module was built based on the Entity-Attribute-Value (EAV) model. We also propose a methodology to migrate data from legacy clinical databases to the integrative framework. A Chado instance was initialized using a relational database management system. The Clinical Module was implemented and the framework was loaded using data from a factual clinical research database. Clinical and demographic data, as well as biomaterial data, were obtained from patients with tumors of the head and neck. We implemented the IPTrans tool, a complete environment for data migration, which comprises: the construction of a model to describe the legacy clinical data, based on an ontology; the Extraction, Transformation and Load (ETL) process to extract the data from the source clinical database and load it into the Clinical Module of Chado; and the development of a web tool and a Bridge Layer to adapt the web tool to Chado, as well as to other applications.

Conclusions: Open-source computational solutions currently available for translational science do not have a model to represent biomolecular information and are not integrated with the existing bioinformatics tools. On the other hand, existing genomic data models do not represent clinical patient data. A framework was developed to support translational research by integrating biomolecular information coming from different “omics” technologies with patients' clinical and socio-demographic data. Such a framework should present some features: flexibility, compression and robustness. The experiments accomplished on a use case demonstrated that the proposed system meets the requirements of flexibility and robustness, leading to the desired integration. The Clinical Module can be accessed at http://dcm.ffclrp.usp.br/caib/pg=iptrans.
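
The Clinical Module's actual schema is not given in the abstract; the sketch below illustrates only the Entity-Attribute-Value pattern it is built on, with invented table names and made-up ontology identifiers:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # EAV core: one narrow table holds any clinical or socio-demographic
    # attribute without schema changes; meaning comes from ontology terms.
    cur.execute("""
    CREATE TABLE attribute (
        attribute_id  INTEGER PRIMARY KEY,
        ontology_term TEXT   -- reference into the common ontology (IDs below are made up)
    )""")
    cur.execute("""
    CREATE TABLE patient_value (
        patient_id   INTEGER,
        attribute_id INTEGER REFERENCES attribute(attribute_id),
        value        TEXT
    )""")

    cur.executemany("INSERT INTO attribute VALUES (?, ?)",
                    [(1, "ONT:0001 sex"), (2, "ONT:0002 tumour site")])
    cur.executemany("INSERT INTO patient_value VALUES (?, ?, ?)",
                    [(101, 1, "female"), (101, 2, "larynx")])

    # Recall one patient's record by joining values back to their terms.
    rows = cur.execute("""
        SELECT a.ontology_term, v.value
        FROM patient_value v JOIN attribute a USING (attribute_id)
        WHERE v.patient_id = 101""")
    for term, value in rows:
        print(term, "=", value)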

Relevance: 100.00%

Publisher:

Abstract:

OBJECTIVE: To describe the electronic medical databases used in antiretroviral therapy (ART) programmes in lower-income countries and assess the measures such programmes employ to maintain and improve data quality and reduce the loss of patients to follow-up.

METHODS: In 15 countries of Africa, South America and Asia, a survey was conducted from December 2006 to February 2007 on the use of electronic medical record systems in ART programmes. Patients enrolled in the sites at the time of the survey but not seen during the previous 12 months were considered lost to follow-up. The quality of the data was assessed by computing the percentage of missing key variables (age, sex, clinical stage of HIV infection, CD4+ lymphocyte count and year of ART initiation). Associations between site characteristics (such as number of staff members dedicated to data management), measures to reduce loss to follow-up (such as the presence of staff dedicated to tracing patients) and data quality and loss to follow-up were analysed using multivariate logit models.

FINDINGS: Twenty-one sites that together provided ART to 50 060 patients were included (median number of patients per site: 1000; interquartile range, IQR: 72-19 320). Eighteen sites (86%) used an electronic database for medical record-keeping; 15 (83%) such sites relied on software intended for personal or small business use. The median percentage of missing data for key variables per site was 10.9% (IQR: 2.0-18.9%) and declined with training in data management (odds ratio, OR: 0.58; 95% confidence interval, CI: 0.37-0.90) and weekly hours spent by a clerk on the database per 100 patients on ART (OR: 0.95; 95% CI: 0.90-0.99). About 10 weekly hours per 100 patients on ART were required to reduce missing data for key variables to below 10%. The median percentage of patients lost to follow-up 1 year after starting ART was 8.5% (IQR: 4.2-19.7%). Strategies to reduce loss to follow-up included outreach teams, community-based organizations and checking death registry data. Implementation of all three strategies substantially reduced losses to follow-up (OR: 0.17; 95% CI: 0.15-0.20).

CONCLUSION: The quality of the data collected and the retention of patients in ART treatment programmes are unsatisfactory for many sites involved in the scale-up of ART in resource-limited settings, mainly because of insufficient staff trained to manage data and trace patients lost to follow-up.
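
The survey's data-quality metric, the percentage of missing key variables, is simple to reproduce; a small pandas sketch follows, with column names assumed rather than taken from the study's databases:

    import pandas as pd

    # Toy patient table; column names are assumptions, not the survey's schema.
    df = pd.DataFrame({
        "age":       [34, None, 51, 29],
        "sex":       ["f", "m", None, "f"],
        "who_stage": [2, 3, None, 1],        # clinical stage of HIV infection
        "cd4_count": [180, None, 420, 95],
        "art_start": [2004, 2005, 2006, None],
    })

    key_vars = ["age", "sex", "who_stage", "cd4_count", "art_start"]

    # Per-variable missingness, plus an overall site-level figure in the
    # spirit of the paper's "percentage of missing key variables".
    per_var = df[key_vars].isna().mean() * 100
    overall = df[key_vars].isna().to_numpy().mean() * 100
    print(per_var.round(1).to_dict())
    print(f"missing key data: {overall:.1f}%")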

Relevance: 100.00%

Publisher:

Abstract:

Pteropods are a group of holoplanktonic gastropods for which global biomass distribution patterns remain poorly resolved. The aim of this study was to collect and synthesize existing pteropod (Gymnosomata, Thecosomata and Pseudothecosomata) abundance and biomass data, in order to evaluate the global distribution of pteropod carbon biomass, with a particular emphasis on its seasonal, temporal and vertical patterns. We collected 25 902 data points from several online databases and a number of scientific articles. The biomass data have been gridded onto a 360 x 180 (1° x 1°) global grid, with a vertical resolution of 33 WOA depth levels, and converted to NetCDF format. Data were collected between 1951-2010, with sampling depths ranging from 0-1000 m. Pteropod biomass data were either extracted directly or derived by converting abundance to biomass with pteropod-specific length-to-weight conversions. In the Northern Hemisphere (NH) the data were distributed evenly throughout the year, whereas sampling in the Southern Hemisphere (SH) was biased towards the austral summer months. 86% of all biomass values were located in the NH, most (42%) within the latitudinal band of 30-50° N. The range of global biomass values spanned over three orders of magnitude, with a mean and median biomass concentration of 8.2 mg C l-1 (SD = 61.4) and 0.25 mg C l-1, respectively, for all data points, and a mean of 9.1 mg C l-1 (SD = 64.8) and a median of 0.25 mg C l-1 for non-zero biomass values. The highest mean and median biomass concentrations in the NH were located between 40-50° N (mean biomass: 68.8 mg C l-1 (SD = 213.4); median biomass: 2.5 mg C l-1), while in the SH they were within the 70-80° S latitudinal band (mean: 10.5 mg C l-1 (SD = 38.8); median: 0.2 mg C l-1). Biomass values were lowest in the equatorial regions. A broad range of biomass concentrations was observed at all depths, with the biomass peak located in the surface layer (0-25 m) and values generally decreasing with depth. However, biomass peaks were located at different depths in different ocean basins: 0-25 m in the North Atlantic, 50-100 m in the Pacific, 100-200 m in the Arctic, 200-500 m in the Brazilian region and >500 m in the Indo-Pacific region. Biomass in the NH was relatively invariant over the seasonal cycle, but more seasonally variable in the SH. The collected database provides a valuable tool for modellers for the study of ecosystem processes and global biogeochemical cycles.
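
Two of the processing steps described, converting abundance to biomass via length-to-weight relations and binning onto the one-degree grid, can be sketched as follows; the regression coefficients and observations are placeholders, not the study's pteropod-specific conversions:

    import numpy as np

    # Placeholder length->weight conversion, W = a * L**b (coefficients
    # invented; the study used pteropod-specific relations).
    def biomass_mg_c(abundance_per_l, shell_length_mm, a=0.14, b=2.4):
        return abundance_per_l * a * shell_length_mm ** b

    # Toy observations: (longitude, latitude, abundance, mean shell length).
    obs = np.array([[-20.3, 45.2, 3.0, 1.1],
                    [-20.8, 45.7, 1.5, 0.9],
                    [ 10.1, -60.4, 0.4, 1.3]])
    biomass = biomass_mg_c(obs[:, 2], obs[:, 3])

    # Bin onto a 360 x 180 one-degree grid, averaging the points per cell.
    lon_idx = np.floor(obs[:, 0] + 180).astype(int)
    lat_idx = np.floor(obs[:, 1] + 90).astype(int)
    grid_sum = np.zeros((360, 180))
    grid_cnt = np.zeros((360, 180))
    np.add.at(grid_sum, (lon_idx, lat_idx), biomass)
    np.add.at(grid_cnt, (lon_idx, lat_idx), 1)
    grid_mean = np.where(grid_cnt > 0, grid_sum / np.maximum(grid_cnt, 1), np.nan)

    print(np.nanmean(grid_mean))  # grand mean over occupied cells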

Relevance: 100.00%

Publisher:

Abstract:

This paper presents the results of a Secchi depth data mining study for the North Sea - Baltic Sea region. 40,829 measurements of Secchi depth were compiled from the area as a result of this study. 4.3% of the observations were found in the international data centers [ICES Oceanographic Data Center in Denmark and the World Ocean Data Center A (WDC-A) in the USA], while 95.7% of the data was provided by individuals and ocean research institutions from the surrounding North Sea and Baltic Sea countries. Inquiries made at the World Ocean Data Center B (WDC-B) in Russia suggested that there could be significant additional holdings in that archive but, unfortunately, no data could be made available. The earliest Secchi depth measurement retrieved in this study dates back to 1902 for the Baltic Sea, while the bulk of the measurements were gathered after 1970. The spatial distribution of Secchi depth measurements in the North Sea is very uneven with surprisingly large sampling gaps in the Western North Sea. Quarterly and annual Secchi depth maps with a 0.5° x 0.5° spatial resolution are provided for the transition area between the North Sea and the Baltic Sea (4°E-16°E, 53°N-60°N).