49 results for Cluster Analysis. Information Theory. Entropy. Cross Information Potential. Complex Data
Abstract:
One of the most influential and popular data mining methods is the k-Means algorithm for cluster analysis. Techniques for improving the efficiency of k-Means have largely been explored in two main directions. The amount of computation can be significantly reduced by adopting geometrical constraints and an efficient data structure, notably a multidimensional binary search tree (KD-Tree). These techniques reduce the number of distance computations the algorithm performs at each iteration. A second direction is parallel processing, where data and computation loads are distributed over many processing nodes. However, little work has been done to provide a parallel formulation of the efficient sequential techniques based on KD-Trees. Such approaches are expected to have an irregular distribution of computation load and can suffer from load imbalance. This issue has so far limited the adoption of these efficient k-Means variants in parallel computing environments. In this work, we provide a parallel formulation of the KD-Tree based k-Means algorithm for distributed memory systems and address its load balancing issue. Three solutions have been developed and tested: two are based on a static partitioning of the data set and a third incorporates a dynamic load balancing policy.
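The KD-Tree acceleration described above can be sketched in a few lines: the assignment step of Lloyd's k-Means queries a KD-Tree built over the current centroids instead of computing every point-to-centroid distance. This is a minimal sequential sketch using SciPy's cKDTree, not the paper's parallel, distributed-memory formulation; the data and parameters are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

def kdtree_kmeans(X, k, iters=20, seed=0):
    """Lloyd's k-Means where the assignment step queries a KD-Tree
    built over the current centroids, instead of computing all k
    point-to-centroid distances per point."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Nearest-centroid assignment via a KD-Tree query.
        _, labels = cKDTree(centroids).query(X)
        # Update step: each centroid moves to the mean of its cluster.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Two well-separated blobs: k-Means should recover them exactly.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
centroids, labels = kdtree_kmeans(X, k=2)
```

Rebuilding the tree over k centroids each iteration is cheap relative to the n queries it accelerates, which is why the per-iteration savings grow with the number of points.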
Abstract:
Measurements of the ionospheric E-region during total solar eclipses have been used to provide information about the evolution of the solar magnetic field and EUV and X-ray emissions from the solar corona and chromosphere. By measuring levels of ionisation during an eclipse and comparing these measurements with an estimate of the unperturbed ionisation levels (such as those made during a control day, where available) it is possible to estimate the percentage of ionising radiation being emitted by the solar corona and chromosphere. Previously unpublished data from the two eclipses presented here are particularly valuable as they provide information that supplements the data published to date. The eclipse of 23 October 1976 over Australia provides information in a data gap that would otherwise have spanned the years 1966 to 1991. The eclipse of 4 December 2002 over Southern Africa is important as it extends the published sequence of measurements. Comparing measurements from eclipses between 1932 and 2002 with the solar magnetic source flux reveals that changes in the solar EUV and X-ray flux lag the open source flux measurements by approximately 1.5 years. We suggest that this unexpected result comes about from changes to the relative size of the limb corona between eclipses, with the lag representing the time taken to populate the coronal field with plasma hot enough to emit the EUV and X-rays ionising our atmosphere.
Abstract:
Context: During development, managers, analysts and designers often need to know whether enough requirements analysis work has been done and whether it is safe to proceed to the design stage. Objective: This paper describes a new, simple and practical method for assessing our confidence in a set of requirements. Method: We identified four confidence factors and used a goal-oriented framework with a simple ordinal scale to develop a method for assessing confidence. We illustrate the method and show how it has been applied to a real systems development project. Results: We show how assessing confidence in the requirements could have revealed problems in this project earlier, saving both time and money. Conclusion: Our meta-level assessment of requirements provides a practical and pragmatic method that can prove useful to managers, analysts and designers who need to know when sufficient requirements analysis has been performed.
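A minimal sketch of how such an ordinal confidence assessment might be aggregated. The four factor names, the 0-3 scale labels and the weakest-link (minimum) aggregation rule are all illustrative assumptions; the abstract does not name the factors or the aggregation rule.

```python
# Hypothetical ordinal confidence assessment. The factor names, the
# 0-3 scale and the weakest-link rule are illustrative assumptions,
# not the method defined in the paper.
SCALE = {0: "none", 1: "low", 2: "medium", 3: "high"}

def overall_confidence(scores):
    """Aggregate per-factor ordinal scores pessimistically:
    overall confidence is only as strong as the weakest factor."""
    return min(scores.values())

scores = {
    "stakeholder coverage": 3,
    "requirement stability": 2,
    "validation evidence": 1,   # weakest factor dominates the result
    "analyst experience": 3,
}
level = overall_confidence(scores)
```

The pessimistic minimum is one natural choice for ordinal scales, where averaging has no clear meaning.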
Abstract:
Investments in direct real estate are inherently difficult to segment compared to other asset classes due to the complex and heterogeneous nature of the asset. The most common segmentation in real estate investment analysis relies on property sector and geographical region. In this paper, we compare the predictive power of existing industry classifications with a new type of segmentation using cluster analysis on a number of relevant property attributes including the equivalent yield and size of the property as well as information on lease terms, number of tenants and tenant concentration. The new segments are shown to be distinct and relatively stable over time. In a second stage of the analysis, we test whether the newly generated segments are able to better predict the resulting financial performance of the assets than the old dichotomous segments. Applying both discriminant and neural network analysis we find mixed evidence for this hypothesis. Overall, we conclude from our analysis that each of the two approaches to segmenting the market has its strengths and weaknesses so that both might be applied gainfully in real estate investment analysis and fund management.
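A hedged sketch of the attribute-based segmentation described above: heterogeneous property attributes are standardised so that no single attribute dominates the distance metric, then grouped by hierarchical cluster analysis. The attributes and values are invented for illustration, and average-linkage clustering is an assumption rather than necessarily the paper's procedure.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical property records: equivalent yield (%), size (sq ft),
# lease length (years), number of tenants -- illustrative values only.
X = np.array([
    [5.2, 12000.0, 10.0, 1.0],
    [7.8,  3500.0,  5.0, 4.0],
    [5.0, 15000.0, 12.0, 1.0],
    [8.1,  2800.0,  4.0, 6.0],
])

# z-score standardisation so no attribute dominates the distance metric.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Average-linkage hierarchical clustering, cut into two segments.
labels = fcluster(linkage(pdist(Z), method="average"), t=2, criterion="maxclust")
```

Without standardisation, property size (in the thousands) would swamp yield and lease terms in any Euclidean distance.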
Abstract:
Tourism is the world's largest employer, accounting for 10% of jobs worldwide (WTO, 1999). There are over 30,000 protected areas around the world, covering about 10% of the land surface (IUCN, 2002). Protected area management is moving towards a more integrated form of management, which recognises the social and economic needs of the world's finest areas and seeks to provide long-term income streams and support social cohesion through active but sustainable use of resources. Ecotourism - 'responsible travel to natural areas that conserves the environment and improves the well-being of local people' (The Ecotourism Society, 1991) - is often cited as a panacea for incorporating the principles of sustainable development in protected area management. However, few examples exist worldwide to substantiate this claim. In reality, ecotourism struggles to provide social and economic empowerment locally and fails to secure proper protection of the local and global environment. Current analysis of ecotourism provides a useful checklist of interconnected principles for more successful initiatives, but no overall framework of analysis or theory. This paper argues that applying common property theory to the application of ecotourism can help to establish a more rigorous, multi-layered analysis that identifies the institutional demands of community-based ecotourism (CBE). The paper draws on existing literature on ecotourism and several new case studies from developed and developing countries around the world. It focuses on the governance of CBE initiatives, particularly the interaction between local stakeholders and government and the role that third-party non-governmental organisations can play in brokering appropriate institutional arrangements. The paper concludes by offering future research directions.
Abstract:
In order to identify the factors influencing adoption of technologies promoted by government to small-scale dairy farmers in the highlands of central Mexico, a field survey was conducted. A total of 115 farmers were grouped through cluster analysis (CA) and divided into three wealth status categories (high, medium and low) using wealth ranking. Chi-square analysis was used to examine the association of wealth status with technology adoption. Four groups of farms were differentiated in terms of farms' dimensions, farmers' education, sources of income, wealth status, management of herd, monetary support by government and technological availability. Statistical differences (p < 0.05) were observed in the milk yield per herd per year among groups. Government organizations (GO) participated little in the promotion of the 17 technologies identified, six of which focused on crop or forage production and 11 of which were related to animal husbandry. Relatives and other farmers played an important role in knowledge diffusion and technology adoption. Although wealth status had a significant association (p < 0.05) with adoption, other factors were also important, including the importance of the technology to farmers, the usefulness and productive benefits of innovations, and farmers' knowledge of them. It is concluded that analysing the information by group and wealth status was useful for identifying suitable crop or forage related and animal husbandry technologies for each group and wealth status of farmers. Therefore, the characterization of farmers could provide a useful starting point for the design and delivery of more appropriate and effective extension.
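The chi-square test of association between wealth status and adoption can be sketched as follows; the contingency-table counts are invented for illustration and are not the survey's data.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table of wealth status vs technology
# adoption (counts are illustrative, not from the survey).
#                 adopted  not adopted
table = np.array([[30,  5],    # high wealth
                  [20, 15],    # medium wealth
                  [10, 35]])   # low wealth

# Pearson chi-square test of independence between the two variables.
chi2, p, dof, expected = chi2_contingency(table)
significant = p < 0.05
```

With a 3x2 table the test has (3-1)*(2-1) = 2 degrees of freedom; a small p-value indicates that adoption rates differ across wealth categories.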
Abstract:
The past decade has witnessed a sharp increase in published research on energy and buildings. This paper takes stock of work in this area, with a particular focus on construction research and the analysis of non-technical dimensions. While there is widespread recognition as to the importance of non-technical dimensions, research tends to be limited to individualistic studies of occupants and occupant behavior. In contrast, publications in the mainstream social science literature display a broader range of interests, including policy developments, structural constraints on the diffusion and use of new technologies and the construction process itself. The growing interest of more generalist scholars in energy and buildings provides an opportunity for construction research to engage a wider audience. This would enrich the current research agenda, helping to address unanswered problems concerning the relatively weak impact of policy mechanisms and new technologies and the seeming recalcitrance of occupants. It would also help to promote the academic status of construction research as a field. This, in turn, depends on greater engagement with interpretivist types of analysis and theory building, thereby challenging deeply ingrained views on the nature and role of academic research in construction.
Abstract:
In recent years, the area of data mining has experienced considerable demand for technologies that extract knowledge from large and complex data sources. There has been substantial commercial interest as well as active research aiming to develop new and improved approaches for extracting information, relationships, and patterns from large datasets. Artificial neural networks (NNs) are popular biologically-inspired intelligent methodologies, whose classification, prediction, and pattern recognition capabilities have been utilized successfully in many areas, including science, engineering, medicine, business, banking, telecommunication, and many other fields. This paper highlights from a data mining perspective the implementation of NNs, using supervised and unsupervised learning, for pattern recognition, classification, prediction, and cluster analysis, and focuses the discussion on their usage in bioinformatics and financial data analysis tasks. © 2012 Wiley Periodicals, Inc.
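A minimal sketch of the supervised-learning setting the survey discusses: a single logistic unit (the simplest neural network) trained by gradient descent on a toy two-class problem. The data and hyperparameters are illustrative, not taken from any system in the paper.

```python
import numpy as np

# Toy binary classification: two Gaussian clouds in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (40, 2)), rng.normal(2, 0.5, (40, 2))])
y = np.array([0] * 40 + [1] * 40)

# A single logistic unit trained with full-batch gradient descent.
w, b = np.zeros(2), 0.0
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid activation
    grad_w = X.T @ (p - y) / len(y)          # cross-entropy gradient
    grad_b = (p - y).mean()
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b

pred = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(int)
accuracy = (pred == y).mean()
```

Stacking such units into hidden layers, and replacing the supervised labels with a similarity objective, yields the classification and clustering uses surveyed above.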
Abstract:
The aim of this study was to determine whether geographical differences impact the composition of bacterial communities present in the airways of cystic fibrosis (CF) patients attending CF centers in the United States or United Kingdom. Thirty-eight patients were matched on the basis of clinical parameters into 19 pairs comprised of one U.S. and one United Kingdom patient. Analysis was performed to determine what, if any, bacterial correlates could be identified. Two culture-independent strategies were used: terminal restriction fragment length polymorphism (T-RFLP) profiling and 16S rRNA clone sequencing. Overall, 73 different terminal restriction fragment lengths were detected, ranging from 2 to 10 for U.S. patients and 2 to 15 for United Kingdom patients. The statistical analysis of T-RFLP data indicated that patient pairing was successful and revealed substantial transatlantic similarities in the bacterial communities. A small number of bands was present in the vast majority of patients in both locations, indicating that these are species common to the CF lung. Clone sequence analysis also revealed that a number of species not traditionally associated with the CF lung were present in both sample groups. The species number per sample was similar, but differences in species presence were observed between sample groups. Cluster analysis revealed geographical differences in bacterial presence and relative species abundance. Overall, the U.S. samples showed tighter clustering with each other compared to that of United Kingdom samples, which may reflect the lower diversity detected in the U.S. sample group. The impact of cross-infection and biogeography is considered, and the implications for treating CF lung infections are also discussed.
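Presence/absence T-RFLP profiles like those above can be compared with a simple set-based similarity. The fragment lengths below are invented for illustration, and the Jaccard index is one plausible choice of statistic rather than the one the study actually used.

```python
# Hypothetical T-RFLP profiles: each patient's airway community is
# represented as the set of terminal fragment lengths detected.
# Fragment values are illustrative, not from the study.
us_patient = {93, 207, 370, 486, 512}
uk_patient = {93, 207, 370, 490, 512, 611}

# Fragments common to both communities (candidate "core" CF species).
shared = us_patient & uk_patient

# Jaccard similarity: shared fragments over all fragments observed.
jaccard = len(shared) / len(us_patient | uk_patient)
```

Pairwise similarities computed this way form the distance matrix that a cluster analysis of patients would operate on.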
Abstract:
Human brain imaging techniques, such as Magnetic Resonance Imaging (MRI) or Diffusion Tensor Imaging (DTI), have been established as scientific and diagnostic tools and their adoption continues to grow. Statistical methods, machine learning and data mining algorithms have successfully been adopted to extract predictive and descriptive models from neuroimage data. However, the knowledge discovery process typically also requires the adoption of pre-processing, post-processing and visualisation techniques in complex data workflows. Currently, a main problem for the integrated preprocessing and mining of MRI data is the lack of comprehensive platforms able to avoid the manual invocation of preprocessing and mining tools, which results in an error-prone and inefficient process. In this work we present K-Surfer, a novel plug-in for the Konstanz Information Miner (KNIME) workbench, which automates the preprocessing of brain images and leverages the mining capabilities of KNIME in an integrated way. K-Surfer supports the importing, filtering, merging and pre-processing of neuroimage data from FreeSurfer, a tool for human brain MRI feature extraction and interpretation. K-Surfer automates the steps for importing FreeSurfer data, reducing time costs, eliminating human errors and enabling the design of complex analytics workflows for neuroimage data by leveraging the rich functionalities available in the KNIME workbench.
Abstract:
Studies on learning management systems (LMS) have largely been technical in nature, with an emphasis on evaluating the human-computer interaction (HCI) processes involved in using the LMS. This paper reports a study that evaluates the information interaction processes of an eLearning course used in teaching an applied Statistics course. The eLearning course is used as a synonym for an information system. The study explores issues of missing context in information stored in information systems. Using the semiotic framework as a guide, the researchers evaluated an existing eLearning course with the view to proposing a model for designing improved eLearning courses for future eLearning programmes. In this exploratory study, a survey questionnaire was used to collect data from 160 participants on an eLearning course in Statistics in Applied Climatology. The views of the participants were analysed with a focus only on the human information interaction issues. Using the semiotic framework as a guide, syntactic, semantic, pragmatic and social context gaps or problems were identified. The information interaction problems identified include ambiguous instructions, inadequate information, lack of sound and interface design problems, among others. These problems affected the quality of new knowledge created by the participants. The researchers thus highlighted the challenges of missing information context when data is stored in an information system. The study concludes by proposing a human information interaction model for improving information interaction quality in the design of eLearning courses on learning management platforms and other information systems.
Abstract:
Accurate monitoring of degradation levels in soils is essential in order to understand and achieve complete degradation of petroleum hydrocarbons in contaminated soils. We aimed to develop the use of multivariate methods for the monitoring of biodegradation of diesel in soils and to determine if diesel contaminated soils could be remediated to a chemical composition similar to that of an uncontaminated soil. An incubation experiment was set up with three contrasting soil types. Each soil was exposed to diesel at varying stages of degradation and then analysed for key hydrocarbons throughout 161 days of incubation. Hydrocarbon distributions were analysed by Principal Coordinate Analysis and similar samples grouped by cluster analysis. Variation and differences between samples were determined using permutational multivariate analysis of variance. It was found that all soils followed trajectories approaching the chemical composition of the unpolluted soil. Some contaminated soils were no longer significantly different to that of uncontaminated soil after 161 days of incubation. The use of cluster analysis allows the assignment of a percentage chemical similarity of a diesel contaminated soil to an uncontaminated soil sample. This will aid in the monitoring of hydrocarbon contaminated sites and the establishment of potential endpoints for successful remediation.
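The idea of assigning a percentage chemical similarity between a contaminated soil and an uncontaminated reference can be sketched with a Bray-Curtis-style similarity over relative hydrocarbon abundances. The profiles below are invented, and the study's actual similarity measure underlying its cluster analysis may differ.

```python
import numpy as np

def bray_curtis_similarity(a, b):
    """Percentage similarity between two abundance profiles
    (100% = identical composition, 0% = no overlap)."""
    return 100.0 * (1.0 - np.abs(a - b).sum() / (a + b).sum())

# Hypothetical relative abundances of four key hydrocarbon classes.
clean  = np.array([0.40, 0.30, 0.20, 0.10])   # uncontaminated reference
day0   = np.array([0.05, 0.15, 0.35, 0.45])   # fresh diesel signature
day161 = np.array([0.35, 0.28, 0.22, 0.15])   # after 161 days incubation

early = bray_curtis_similarity(clean, day0)
late  = bray_curtis_similarity(clean, day161)
```

A similarity that rises toward 100% over incubation time is the kind of trajectory toward the unpolluted composition that the study reports, and a threshold on it could serve as a remediation endpoint.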
Abstract:
We present an account of semantic representation that focuses on distinct types of information from which word meanings can be learned. In particular, we argue that there are at least two major types of information from which we learn word meanings. The first is what we call experiential information. This is data derived both from our sensory-motor interactions with the outside world and from our experience of our own inner states, particularly our emotions. The second type of information is language-based. In particular, it is derived from the general linguistic context in which words appear. The paper spells out this proposal, summarizes research supporting this view and presents new predictions emerging from this framework.
Abstract:
Tensor clustering is an important tool that exploits the intrinsically rich structure of real-world multiarray or tensor datasets. In dealing with such datasets, standard practice is to use subspace clustering based on vectorizing the multiarray data. However, vectorization of tensorial data does not exploit the complete structural information. In this paper, we propose a subspace clustering algorithm without adopting any vectorization process. Our approach is based on a novel heterogeneous Tucker decomposition model that takes cluster membership information into account. We propose a new clustering algorithm that alternates between the different modes of the proposed heterogeneous tensor model. All but the last mode have closed-form updates. Updating the last mode reduces to optimizing over the multinomial manifold, for which we investigate second-order Riemannian geometry and propose a trust-region algorithm. Numerical experiments show that our proposed algorithm competes effectively with state-of-the-art clustering algorithms based on tensor factorization.
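The structure-preserving alternative to vectorization can be illustrated with a plain Tucker decomposition computed by HOSVD over mode-n unfoldings. This is a sketch of the underlying factorization only, not the paper's heterogeneous Tucker model with cluster membership, nor its Riemannian trust-region update.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: the given mode becomes the rows, all
    remaining modes are flattened into the columns."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hosvd(T, ranks):
    """Higher-order SVD: one orthonormal factor per mode (leading
    left singular vectors of each unfolding), plus a core tensor."""
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        factors.append(U[:, :r])
    core = T
    for mode, U in enumerate(factors):
        # Mode-n product of the core with U^T along this mode.
        core = np.moveaxis(
            np.tensordot(U.T, np.moveaxis(core, mode, 0), axes=1), 0, mode)
    return core, factors

T = np.random.default_rng(0).normal(size=(4, 5, 6))
core, factors = hosvd(T, ranks=(4, 5, 6))   # full ranks: lossless

# Reconstruct by multiplying the core with each factor along its mode.
R = core
for mode, U in enumerate(factors):
    R = np.moveaxis(np.tensordot(U, np.moveaxis(R, mode, 0), axes=1), 0, mode)
```

With reduced ranks the same code gives a compressed core whose per-mode factors retain the structural information that flattening to a single long vector discards.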
Abstract:
Satellite-based rainfall monitoring is widely used for climatological studies because of its full global coverage, but it is also of great importance for operational purposes, especially in areas such as Africa where ground-based rainfall data are lacking. Satellite rainfall estimates have enormous potential benefits as input to hydrological and agricultural models because of their real-time availability, low cost and full spatial coverage. One issue that needs to be addressed is the uncertainty on these estimates. This is particularly important in assessing the likely errors on the output from non-linear models (rainfall-runoff or crop yield) which use the rainfall estimates, aggregated over an area, as input. Correct assessment of the uncertainty on the rainfall is non-trivial as it must take account of:
• the difference in spatial support between the satellite information and the independent data used for calibration
• uncertainties on the independent calibration data
• the non-Gaussian distribution of rainfall amount
• the spatial intermittency of rainfall
• the spatial correlation of the rainfall field
This paper describes a method for estimating the uncertainty on satellite-based rainfall values that takes account of these factors. The method involves, firstly, a stochastic calibration which completely describes the probability of rainfall occurrence and the pdf of rainfall amount for a given satellite value, and, secondly, the generation of an ensemble of rainfall fields based on the stochastic calibration but with the correct spatial correlation structure within each ensemble member. This is achieved by the use of geostatistical sequential simulation. The ensemble generated in this way may be used to estimate uncertainty at larger spatial scales. A case study of daily rainfall monitoring in the Gambia, west Africa, for the purpose of crop yield forecasting is presented to illustrate the method.
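The final step, generating an ensemble with the correct spatial correlation structure, can be illustrated for a Gaussian field on a small grid by Cholesky factorisation of an exponential covariance model. The paper uses geostatistical sequential simulation; this is a simpler stand-in with the same goal, and the grid size, correlation range and ensemble size are all illustrative.

```python
import numpy as np

# Build an exponential covariance model over a small 10 x 10 grid.
n = 10
xs, ys = np.meshgrid(np.arange(n), np.arange(n))
pts = np.column_stack([xs.ravel(), ys.ravel()])
d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
C = np.exp(-d / 3.0)                      # correlation range of 3 cells

# Cholesky factor turns iid normals into spatially correlated fields.
L = np.linalg.cholesky(C + 1e-8 * np.eye(n * n))

rng = np.random.default_rng(0)
ensemble = (L @ rng.standard_normal((n * n, 50))).T   # 50 members

# Aggregating each member over the area propagates the spatially
# correlated uncertainty to the larger scale, as in the paper.
areal_mean = ensemble.mean(axis=1)
spread = areal_mean.std()
```

Because nearby cells are correlated, errors do not average out over the area as fast as independent noise would, which is exactly why the spatial correlation structure matters for catchment-scale uncertainty.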