838 resultados para text and data mining
Resumo:
In a world of almost permanent and rapidly increasing electronic data availability, techniques of filtering, compressing, and interpreting this data to transform it into valuable and easily comprehensible information is of utmost importance. One key topic in this area is the capability to deduce future system behavior from a given data input. This book brings together for the first time the complete theory of data-based neurofuzzy modelling and the linguistic attributes of fuzzy logic in a single cohesive mathematical framework. After introducing the basic theory of data-based modelling, new concepts including extended additive and multiplicative submodels are developed and their extensions to state estimation and data fusion are derived. All these algorithms are illustrated with benchmark and real-life examples to demonstrate their efficiency. Chris Harris and his group have carried out pioneering work which has tied together the fields of neural networks and linguistic rule-based algortihms. This book is aimed at researchers and scientists in time series modeling, empirical data modeling, knowledge discovery, data mining, and data fusion.
Resumo:
In order to gain knowledge from large databases, scalable data mining technologies are needed. Data are captured on a large scale and thus databases are increasing at a fast pace. This leads to the utilisation of parallel computing technologies in order to cope with large amounts of data. In the area of classification rule induction, parallelisation of classification rules has focused on the divide and conquer approach, also known as the Top Down Induction of Decision Trees (TDIDT). An alternative approach to classification rule induction is separate and conquer which has only recently been in the focus of parallelisation. This work introduces and evaluates empirically a framework for the parallel induction of classification rules, generated by members of the Prism family of algorithms. All members of the Prism family of algorithms follow the separate and conquer approach.
Resumo:
This article explores the contribution that artisanal and small-scale mining (ASM) makes to poverty reduction in Tanzania, based on data on gold and diamond mining in Mwanza Region. The evidence suggests that people working in mining or related services are less likely to be in poverty than those with other occupations. However, the picture is complex; while mining income can help reduce poverty and provide a buffer from livelihood shocks, peoples inability to obtain a formal mineral claim, or to effectively exploit their claims, contributes to insecurity. This is reinforced by a context in which ASM is peripheral to large-scale mining interests, is only gradually being addressed within national poverty reduction policies, and is segregated from district-level planning.
Resumo:
This article examines the marginal position of artisanal miners in sub-Saharan Africa, and considers how they are incorporated into mineral sector change in the context of institutional and legal integration. Taking the case of diamond and gold mining in Tanzania, the concept of social exclusion is used to explore the consequences of marginalization on people's access to mineral resources and ability to make a living from artisanal mining. Because existing inequalities and forms of discrimination are ignored by the Tanzanian state, the institutionalization of mineral titles conceals social and power relations that perpetuate highly unequal access to resources. The article highlights the complexity of these processes, and shows that while legal integration can benefit certain wealthier categories of people, who fit into the model of an 'entrepreneurial small-scale miner', for others adverse incorporation contributes to socio-economic dependence, exploitation and insecurity. For the issue of marginality to be addressed within integration processes, the existence of local forms of organization, institutions and relationships, which underpin inequalities and discrimination, need to be recognized.
Resumo:
Advances in hardware technologies allow to capture and process data in real-time and the resulting high throughput data streams require novel data mining approaches. The research area of Data Stream Mining (DSM) is developing data mining algorithms that allow us to analyse these continuous streams of data in real-time. The creation and real-time adaption of classification models from data streams is one of the most challenging DSM tasks. Current classifiers for streaming data address this problem by using incremental learning algorithms. However, even so these algorithms are fast, they are challenged by high velocity data streams, where data instances are incoming at a fast rate. This is problematic if the applications desire that there is no or only a very little delay between changes in the patterns of the stream and absorption of these patterns by the classifier. Problems of scalability to Big Data of traditional data mining algorithms for static (non streaming) datasets have been addressed through the development of parallel classifiers. However, there is very little work on the parallelisation of data stream classification techniques. In this paper we investigate K-Nearest Neighbours (KNN) as the basis for a real-time adaptive and parallel methodology for scalable data stream classification tasks.
Resumo:
This paper reviews the literature concerning the practice of using Online Analytical Processing (OLAP) systems to recall information stored by Online Transactional Processing (OLTP) systems. Such a review provides a basis for discussion on the need for the information that are recalled through OLAP systems to maintain the contexts of transactions with the data captured by the respective OLTP system. The paper observes an industry trend involving the use of OLTP systems to process information into data, which are then stored in databases without the business rules that were used to process information and data stored in OLTP databases without associated business rules. This includes the necessitation of a practice, whereby, sets of business rules are used to extract, cleanse, transform and load data from disparate OLTP systems into OLAP databases to support the requirements for complex reporting and analytics. These sets of business rules are usually not the same as business rules used to capture data in particular OLTP systems. The paper argues that, differences between the business rules used to interpret these same data sets, risk gaps in semantics between information captured by OLTP systems and information recalled through OLAP systems. Literature concerning the modeling of business transaction information as facts with context as part of the modelling of information systems were reviewed to identify design trends that are contributing to the design quality of OLTP and OLAP systems. The paper then argues that; the quality of OLTP and OLAP systems design has a critical dependency on the capture of facts with associated context, encoding facts with contexts into data with business rules, storage and sourcing of data with business rules, decoding data with business rules into the facts with the context and recall of facts with associated contexts. The paper proposes UBIRQ, a design model to aid the co-design of data with business rules storage for OLTP and OLAP purposes. The proposed design model provides the opportunity for the implementation and use of multi-purpose databases, and business rules stores for OLTP and OLAP systems. Such implementations would enable the use of OLTP systems to record and store data with executions of business rules, which will allow for the use of OLTP and OLAP systems to query data with business rules used to capture the data. Thereby ensuring information recalled via OLAP systems preserves the contexts of transactions as per the data captured by the respective OLTP system.
Resumo:
Purpose: To investigate the relationship between research data management (RDM) and data sharing in the formulation of RDM policies and development of practices in higher education institutions (HEIs). Design/methodology/approach: Two strands of work were undertaken sequentially: firstly, content analysis of 37 RDM policies from UK HEIs; secondly, two detailed case studies of institutions with different approaches to RDM based on semi-structured interviews with staff involved in the development of RDM policy and services. The data are interpreted using insights from Actor Network Theory. Findings: RDM policy formation and service development has created a complex set of networks within and beyond institutions involving different professional groups with widely varying priorities shaping activities. Data sharing is considered an important activity in the policies and services of HEIs studied, but its prominence can in most cases be attributed to the positions adopted by large research funders. Research limitations/implications: The case studies, as research based on qualitative data, cannot be assumed to be universally applicable but do illustrate a variety of issues and challenges experienced more generally, particularly in the UK. Practical implications: The research may help to inform development of policy and practice in RDM in HEIs and funder organisations. Originality/value: This paper makes an early contribution to the RDM literature on the specific topic of the relationship between RDM policy and services, and openness – a topic which to date has received limited attention.
Resumo:
This paper provides an interdisciplinary perspective on mine reclamation in forested areas of Ghana, a country characterised by conflicts between mining and forest conservation. A comparison was made between above ground biomass (AGB) and soil organic carbon (SOC) content from two reclaimed mine sites and adjacent undisturbed forest. Findings suggest that on decadal timescales, reclaimed mine sites contain approximately 40% of the total carbon and 10% the AGB carbon of undisturbed forest. This raises questions regarding the potential for decommissioning mine sites to provide forestry-based legacies. Such a move could deliver a host of benefits, including improving the longevity and success of reclamation, mitigating climate change and delivering corollary enumeration for local communities under carbon trading schemes. A discussion of the antecedents and challenges associated with establishing forest-legacies highlights the risk of neglecting the participation and heterogeneity of legitimate local representatives, which threatens the equity of potential benefits and sustainability of projects. Despite these risks, implementing pilot projects could help to address the lack of transparency and data which currently characterises mine reclamation.
Resumo:
An important application of Big Data Analytics is the real-time analysis of streaming data. Streaming data imposes unique challenges to data mining algorithms, such as concept drifts, the need to analyse the data on the fly due to unbounded data streams and scalable algorithms due to potentially high throughput of data. Real-time classification algorithms that are adaptive to concept drifts and fast exist, however, most approaches are not naturally parallel and are thus limited in their scalability. This paper presents work on the Micro-Cluster Nearest Neighbour (MC-NN) classifier. MC-NN is based on an adaptive statistical data summary based on Micro-Clusters. MC-NN is very fast and adaptive to concept drift whilst maintaining the parallel properties of the base KNN classifier. Also MC-NN is competitive compared with existing data stream classifiers in terms of accuracy and speed.
Resumo:
Data mining is a relatively new field of research that its objective is to acquire knowledge from large amounts of data. In medical and health care areas, due to regulations and due to the availability of computers, a large amount of data is becoming available [27]. On the one hand, practitioners are expected to use all this data in their work but, at the same time, such a large amount of data cannot be processed by humans in a short time to make diagnosis, prognosis and treatment schedules. A major objective of this thesis is to evaluate data mining tools in medical and health care applications to develop a tool that can help make rather accurate decisions. In this thesis, the goal is finding a pattern among patients who got pneumonia by clustering of lab data values which have been recorded every day. By this pattern we can generalize it to the patients who did not have been diagnosed by this disease whose lab values shows the same trend as pneumonia patients does. There are 10 tables which have been extracted from a big data base of a hospital in Jena for my work .In ICU (intensive care unit), COPRA system which is a patient management system has been used. All the tables and data stored in German Language database.
Predictive models for chronic renal disease using decision trees, naïve bayes and case-based methods
Resumo:
Data mining can be used in healthcare industry to “mine” clinical data to discover hidden information for intelligent and affective decision making. Discovery of hidden patterns and relationships often goes intact, yet advanced data mining techniques can be helpful as remedy to this scenario. This thesis mainly deals with Intelligent Prediction of Chronic Renal Disease (IPCRD). Data covers blood, urine test, and external symptoms applied to predict chronic renal disease. Data from the database is initially transformed to Weka (3.6) and Chi-Square method is used for features section. After normalizing data, three classifiers were applied and efficiency of output is evaluated. Mainly, three classifiers are analyzed: Decision Tree, Naïve Bayes, K-Nearest Neighbour algorithm. Results show that each technique has its unique strength in realizing the objectives of the defined mining goals. Efficiency of Decision Tree and KNN was almost same but Naïve Bayes proved a comparative edge over others. Further sensitivity and specificity tests are used as statistical measures to examine the performance of a binary classification. Sensitivity (also called recall rate in some fields) measures the proportion of actual positives which are correctly identified while Specificity measures the proportion of negatives which are correctly identified. CRISP-DM methodology is applied to build the mining models. It consists of six major phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.
Resumo:
In soil surveys, several sampling systems can be used to define the most representative sites for sample collection and description of soil profiles. In recent years, the conditioned Latin hypercube sampling system has gained prominence for soil surveys. In Brazil, most of the soil maps are at small scales and in paper format, which hinders their refinement. The objectives of this work include: (i) to compare two sampling systems by conditioned Latin hypercube to map soil classes and soil properties; (II) to retrieve information from a detailed scale soil map of a pilot watershed for its refinement, comparing two data mining tools, and validation of the new soil map; and (III) to create and validate a soil map of a much larger and similar area from the extrapolation of information extracted from the existing soil map. Two sampling systems were created by conditioned Latin hypercube and by the cost-constrained conditioned Latin hypercube. At each prospection place, soil classification and measurement of the A horizon thickness were performed. Maps were generated and validated for each sampling system, comparing the efficiency of these methods. The conditioned Latin hypercube captured greater variability of soils and properties than the cost-constrained conditioned Latin hypercube, despite the former provided greater difficulty in field work. The conditioned Latin hypercube can capture greater soil variability and the cost-constrained conditioned Latin hypercube presents great potential for use in soil surveys, especially in areas of difficult access. From an existing detailed scale soil map of a pilot watershed, topographical information for each soil class was extracted from a Digital Elevation Model and its derivatives, by two data mining tools. Maps were generated using each tool. The more accurate of these tools was used for extrapolation of soil information for a much larger and similar area and the generated map was validated. It was possible to retrieve the existing soil map information and apply it on a larger area containing similar soil forming factors, at much low financial cost. The KnowledgeMiner tool for data mining, and ArcSIE, used to create the soil map, presented better results and enabled the use of existing soil map to extract soil information and its application in similar larger areas at reduced costs, which is especially important in development countries with limited financial resources for such activities, such as Brazil.
Resumo:
Presentations sponsored by the Patent and Trademark Depository Library Association (PTDLA) at the American Library Association Annual Conference, New Orleans, June 25, 2006 Speaker #1: Nan Myers Associate Professor; Government Documents, Patents and Trademarks Librarian Wichita State University, Wichita, KS Title: Intellectual Property Roundup: Copyright, Trademarks, Trade Secrets, and Patents Abstract: This presentation provides a capsule overview of the distinctive coverage of the four types of intellectual property – What they are, why they are important, how to get them, what they cost, how long they last. Emphasis will be on what questions patrons ask most, along with the answers! Includes coverage of the mission of Patent & Trademark Depository Libraries (PTDLs) and other sources of business information outside of libraries, such as Small Business Development Centers. Speaker #2: Jan Comfort Government Information Reference Librarian Clemson University, Clemson, SC Title: Patents as a Source of Competitive Intelligence Information Abstract: Large corporations often have R&D departments, or large numbers of staff whose jobs are to monitor the activities of their competitors. This presentation will review strategies that small business owners can employ to do their own competitive intelligence analysis. The focus will be on features of the patent database that is available free of charge on the USPTO website, as well as commercial databases available at many public and academic libraries across the country. Speaker #3: Virginia Baldwin Professor; Engineering Librarian University of Nebraska-Lincoln, Lincoln, NE Title: Mining Online Patent Data for Business Information Abstract: The United States Patent and Trademark Office (USPTO) website and websites of international databases contains information about granted patents and patent applications and the technologies they represent. Statistical information about patents, their technologies, geographical information, and patenting entities are compiled and available as reports on the USPTO website. Other valuable information from these websites can be obtained using data mining techniques. This presentation will provide the keys to opening these resources and obtaining valuable data. Speaker #4: Donna Hopkins Engineering Librarian Renssalaer Polytechnic Institute, Troy, NY Title: Searching the USPTO Trademark Database for Wordmarks and Logos Abstract: This presentation provides an overview of wordmark searching in www.uspto.gov, followed by a review of the techniques of searching for non-word US trademarks using codes from the Design Search Code Manual. These codes are used in an electronic search, either on the uspto website or on CASSIS DVDs. The search is sometimes supplemented by consulting the Official Gazette. A specific example of using a section of the codes for searching is included. Similar searches on the Madrid Express database of WIPO, using the Vienna Classification, will also be briefly described.
Resumo:
In [1], the authors proposed a framework for automated clustering and visualization of biological data sets named AUTO-HDS. This letter is intended to complement that framework by showing that it is possible to get rid of a user-defined parameter in a way that the clustering stage can be implemented more accurately while having reduced computational complexity
Resumo:
We review recent visualization techniques aimed at supporting tasks that require the analysis of text documents, from approaches targeted at visually summarizing the relevant content of a single document to those aimed at assisting exploratory investigation of whole collections of documents.Techniques are organized considering their target input materialeither single texts or collections of textsand their focus, which may be at displaying content, emphasizing relevant relationships, highlighting the temporal evolution of a document or collection, or helping users to handle results from a query posed to a search engine.We describe the approaches adopted by distinct techniques and briefly review the strategies they employ to obtain meaningful text models, discuss how they extract the information required to produce representative visualizations, the tasks they intend to support and the interaction issues involved, and strengths and limitations. Finally, we show a summary of techniques, highlighting their goals and distinguishing characteristics. We also briefly discuss some open problems and research directions in the fields of visual text mining and text analytics.