66 results for applicazione, business analysis, data mining, Facebook, PRIN, relazioni sociali, social network
Abstract:
One of the most influential and popular data mining methods is the k-Means algorithm for cluster analysis. Techniques for improving the efficiency of k-Means have been largely explored in two main directions. The amount of computation can be significantly reduced by adopting geometrical constraints and an efficient data structure, notably a multidimensional binary search tree (KD-Tree). These techniques reduce the number of distance computations the algorithm performs at each iteration. A second direction is parallel processing, where data and computation loads are distributed over many processing nodes. However, little work has been done to provide a parallel formulation of the efficient sequential techniques based on KD-Trees. Such approaches are expected to have an irregular distribution of computation load and can suffer from load imbalance. This issue has so far limited the adoption of these efficient k-Means variants in parallel computing environments. In this work, we provide a parallel formulation of the KD-Tree based k-Means algorithm for distributed memory systems and address its load balancing issue. Three solutions have been developed and tested. Two approaches are based on a static partitioning of the data set and a third solution incorporates a dynamic load balancing policy.
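As a point of reference for the parallel direction described in this abstract, the following is a minimal sketch of one k-Means iteration over a statically partitioned data set. It shows the plain formulation without a KD-Tree and without load balancing, so it is not the method of this work; the function names and the toy data layout are illustrative assumptions.

```python
from math import dist


def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: dist(point, centroids[c]))


def partial_stats(block, centroids):
    """Per-cluster coordinate sums and counts for one data partition."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for point in block:
        c = nearest(point, centroids)
        counts[c] += 1
        for d in range(dim):
            sums[c][d] += point[d]
    return sums, counts


def kmeans_step(blocks, centroids):
    """One iteration: each block is processed independently, then reduced."""
    dim = len(centroids[0])
    total_sums = [[0.0] * dim for _ in centroids]
    total_counts = [0] * len(centroids)
    for block in blocks:  # hypothetically, one block per processing node
        sums, counts = partial_stats(block, centroids)
        for c in range(len(centroids)):
            total_counts[c] += counts[c]
            for d in range(dim):
                total_sums[c][d] += sums[c][d]
    # A cluster that received no points keeps its previous centroid.
    return [
        tuple(s / total_counts[c] for s in total_sums[c])
        if total_counts[c] else tuple(centroids[c])
        for c in range(len(centroids))
    ]
```

Because only per-cluster sums and counts are exchanged, the result does not depend on how the points are split into blocks; the cost of each block, however, does, which is where load imbalance can arise.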
Abstract:
Recently, two approaches have been introduced that distribute the molecular fragment mining problem. The first applies a master/worker topology; the second, a completely distributed peer-to-peer system, solves the scalability problem caused by the bottleneck at the master node. However, in many real-world scenarios the participating computing nodes cannot communicate directly due to administrative policies such as security restrictions. Thus, potential computing power is not accessible to accelerate the mining run. To address this shortcoming, this work introduces a hierarchical topology of computing resources, which distributes the management over several levels and adapts to the natural structure of those multi-domain architectures. The most important aspect is the load balancing scheme, which has been designed and optimized for the hierarchical structure. The approach allows dynamic aggregation of heterogeneous computing resources and is applied to wide area network scenarios.
Abstract:
In real world applications sequential algorithms of data mining and data exploration are often unsuitable for datasets with enormous size, high-dimensionality and complex data structure. Grid computing promises unprecedented opportunities for unlimited computing and storage resources. In this context there is the necessity to develop high performance distributed data mining algorithms. However, the computational complexity of the problem and the large amount of data to be explored often make the design of large scale applications particularly challenging. In this paper we present the first distributed formulation of a frequent subgraph mining algorithm for discriminative fragments of molecular compounds. Two distributed approaches have been developed and compared on the well known National Cancer Institute’s HIV-screening dataset. We present experimental results on a small-scale computing environment.
Abstract:
Accurately and reliably identifying the actual number of clusters present in a dataset of gene expression profiles, when no additional information on cluster structure is available, is a problem addressed by few algorithms. GeneMCL transforms microarray analysis data into a graph consisting of nodes connected by edges, where the nodes represent genes, and the edges represent the similarity in expression of those genes, as given by a proximity measurement. This measurement is taken to be the Pearson correlation coefficient combined with a local non-linear rescaling step. The resulting graph is input to the Markov Cluster (MCL) algorithm, which is an elegant, deterministic, non-specific and scalable method, which models stochastic flow through the graph. The algorithm is inherently affected by any cluster structure present, and rapidly decomposes a graph into cohesive clusters. The potential of the GeneMCL algorithm is demonstrated with a 5730 gene subset (IGS) of the Van't Veer breast cancer database, for which the clusterings are shown to reflect underlying biological mechanisms. (c) 2005 Elsevier Ltd. All rights reserved.
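To make the "stochastic flow" idea concrete, here is a minimal sketch of the generic MCL loop (expansion followed by inflation) on a small dense similarity matrix. It assumes a symmetric, non-negative similarity matrix as input and does not reproduce GeneMCL's Pearson-based measurement or its rescaling step; the parameter defaults (`inflation`, `iterations`, `eps`) are illustrative assumptions.

```python
def normalize_columns(m):
    """Scale every column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]


def expand(m):
    """Expansion: matrix squaring, i.e. two steps of the random walk."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, r):
    """Inflation: element-wise power r, then column re-normalisation."""
    return normalize_columns([[x ** r for x in row] for row in m])


def mcl(similarity, inflation=2.0, iterations=50, eps=1e-6):
    """Return a set of clusters (frozensets of node indices)."""
    n = len(similarity)
    # Self-loops are added so that every column has a non-zero sum.
    m = [[similarity[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = inflate(expand(m), inflation)
    # Rows that keep mass on their own diagonal act as attractors; the
    # non-zero entries of such a row are the members of its cluster.
    return {
        frozenset(j for j in range(n) if m[i][j] > eps)
        for i in range(n) if m[i][i] > eps
    }
```

A higher `inflation` value generally yields finer-grained clusters, which is why the number of clusters does not have to be supplied in advance.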
Abstract:
Peak picking is an early key step in MS data analysis. We compare three commonly used approaches to peak picking and discuss their merits by means of statistical analysis. Methods investigated encompass signal-to-noise ratio, continuous wavelet transform, and a correlation-based approach using a Gaussian template. Functionality of the three methods is illustrated and discussed in a practical context using a mass spectral data set created with MALDI-TOF technology. Sensitivity and specificity are investigated using a manually defined reference set of peaks. As an additional criterion, the robustness of the three methods is assessed by a perturbation analysis and illustrated using ROC curves.
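Of the three approaches named in this abstract, the signal-to-noise criterion is the simplest to illustrate. The sketch below is a generic version, not the implementation evaluated in the paper: it assumes the median as baseline, the median absolute deviation as noise estimate, and a hypothetical threshold `min_snr`.

```python
from statistics import median


def pick_peaks_snr(intensities, min_snr=3.0):
    """Return indices of local maxima whose signal-to-noise ratio is high enough."""
    baseline = median(intensities)
    # Median absolute deviation as a robust noise estimate (assumption);
    # the tiny fallback avoids division by zero on a perfectly flat signal.
    noise = median([abs(x - baseline) for x in intensities]) or 1e-12
    peaks = []
    for i in range(1, len(intensities) - 1):
        is_local_max = intensities[i - 1] < intensities[i] >= intensities[i + 1]
        if is_local_max and (intensities[i] - baseline) / noise >= min_snr:
            peaks.append(i)
    return peaks
```

Sweeping `min_snr` over a range of values and comparing the picked peaks against a reference set is one way to obtain the points of a ROC curve.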
Abstract:
In a world of almost permanent and rapidly increasing electronic data availability, techniques of filtering, compressing, and interpreting this data to transform it into valuable and easily comprehensible information are of utmost importance. One key topic in this area is the capability to deduce future system behavior from a given data input. This book brings together for the first time the complete theory of data-based neurofuzzy modelling and the linguistic attributes of fuzzy logic in a single cohesive mathematical framework. After introducing the basic theory of data-based modelling, new concepts including extended additive and multiplicative submodels are developed and their extensions to state estimation and data fusion are derived. All these algorithms are illustrated with benchmark and real-life examples to demonstrate their efficiency. Chris Harris and his group have carried out pioneering work which has tied together the fields of neural networks and linguistic rule-based algorithms. This book is aimed at researchers and scientists in time series modeling, empirical data modeling, knowledge discovery, data mining, and data fusion.
Abstract:
Social Networking Sites have recently become a mainstream communications technology for many people around the world. Major IT vendors are releasing social software designed for use in a business/commercial context. These Enterprise 2.0 technologies have impressive collaboration and information sharing functionality, but so far they do not have any organizational network analysis (ONA) features that reveal any patterns of connectivity within business units. This paper shows the impact of organizational network analysis techniques and social networks on organizational performance. We also give an overview of current enterprise social software and, most importantly, highlight how Enterprise 2.0 can help automate an organizational network analysis.
Abstract:
The K-Means algorithm for cluster analysis is one of the most influential and popular data mining methods. Its straightforward parallel formulation is well suited for distributed memory systems with reliable interconnection networks. However, in large-scale geographically distributed systems the straightforward parallel algorithm can be rendered useless by a single communication failure or high latency in communication paths. This work proposes a fully decentralised algorithm (Epidemic K-Means) which does not require global communication and is intrinsically fault tolerant. The proposed distributed K-Means algorithm provides a clustering solution which can approximate the solution of an ideal centralised algorithm over the aggregated data as closely as desired. A comparative performance analysis is carried out against the state of the art distributed K-Means algorithms based on sampling methods. The experimental analysis confirms that the proposed algorithm is a practical and accurate distributed K-Means implementation for networked systems of very large and extreme scale.
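The abstract does not spell out the protocol, but decentralised aggregation of this kind is commonly built on gossip-style averaging. The sketch below is a generic pairwise-averaging scheme under that assumption, not the Epidemic K-Means protocol itself: each node holds a local sum and count for a single 1-D cluster, and repeated exchanges with random peers drive every node towards the global centroid without any global communication. Message loss and node failures are not modelled, and all names are hypothetical.

```python
import random


def gossip_average(states, rounds=50, seed=0):
    """In each round, every node averages its state vector with a random peer."""
    rng = random.Random(seed)
    states = [list(s) for s in states]
    n = len(states)
    if n < 2:
        return states
    for _ in range(rounds):
        for i in range(n):
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1  # pick a peer other than node i
            mean = [(a + b) / 2 for a, b in zip(states[i], states[j])]
            states[i], states[j] = mean, list(mean)
    return states


def local_centroid_estimates(local_points):
    """Each node starts from (local sum, local count) and gossips both values."""
    states = [(sum(points), float(len(points))) for points in local_points]
    # The ratio of the averaged sum to the averaged count approximates the
    # centroid that a centralised algorithm would compute over all points.
    return [s / c for s, c in gossip_average(states)]
```

Pairwise averaging preserves the network-wide totals, so the estimates approach the centralised result as the number of rounds grows, which matches the "as closely as desired" property stated in the abstract.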
Abstract:
The K-Means algorithm for cluster analysis is one of the most influential and popular data mining methods. Its straightforward parallel formulation is well suited for distributed memory systems with reliable interconnection networks, such as massively parallel processors and clusters of workstations. However, in large-scale geographically distributed systems the straightforward parallel algorithm can be rendered useless by a single communication failure or high latency in communication paths. The lack of scalable and fault tolerant global communication and synchronisation methods in large-scale systems has hindered the adoption of the K-Means algorithm for applications in large networked systems such as wireless sensor networks, peer-to-peer systems and mobile ad hoc networks. This work proposes a fully distributed K-Means algorithm (Epidemic K-Means) which does not require global communication and is intrinsically fault tolerant. The proposed distributed K-Means algorithm provides a clustering solution which can approximate the solution of an ideal centralised algorithm over the aggregated data as closely as desired. A comparative performance analysis is carried out against the state of the art sampling methods and shows that the proposed method overcomes the limitations of the sampling-based approaches for skewed cluster distributions. The experimental analysis confirms that the proposed algorithm is very accurate and fault tolerant under unreliable network conditions (message loss and node failures) and is suitable for asynchronous networks of very large and extreme scale.
Abstract:
Platelets in the circulation are triggered by vascular damage to activate, aggregate and form a thrombus that prevents excessive blood loss. Platelet activation is stringently regulated by intracellular signalling cascades, which when activated inappropriately lead to myocardial infarction and stroke. Strategies to address platelet dysfunction have included proteomics approaches which have led to the discovery of a number of novel regulatory proteins of potential therapeutic value. Global analysis of platelet proteomes may enhance the outcome of these studies by arranging this information in a contextual manner that recapitulates established signalling complexes and predicts novel regulatory processes. Platelet signalling networks have already begun to be exploited with interrogation of protein datasets using in silico methodologies that locate functionally feasible protein clusters for subsequent biochemical validation. Characterization of these biological systems through analysis of spatial and temporal organization of component proteins is developing alongside advances in the proteomics field. This focused review highlights advances in platelet proteomics data mining approaches that complement the emerging systems biology field. We have also highlighted nucleated cell types as key examples that can inform platelet research. Therapeutic translation of these modern approaches to understanding platelet regulatory mechanisms will enable the development of novel anti-thrombotic strategies.
Abstract:
Purpose: This paper aims to design an evaluation method that enables an organization to assess its current IT landscape and provide readiness assessment prior to Software as a Service (SaaS) adoption. Design/methodology/approach: The research employs a mix of quantitative and qualitative approaches for conducting an IT application assessment. Quantitative data such as end users’ feedback on the IT applications contribute to the technical impact on efficiency and productivity. Qualitative data such as business domain, business services and IT application cost drivers are used to determine the business value of the IT applications in an organization. Findings: The assessment of IT applications leads to decisions on the suitability of each IT application for migration to a cloud environment. Research limitations/implications: The evaluation of how a particular IT application impacts on a business service is based on logical interpretation. A data mining method is suggested in order to derive the patterns of the IT application capabilities. Practical implications: This method has been applied in a local council in the UK. This helps the council to decide the future status of the IT applications for cost-saving purposes.
Abstract:
Stakeholder analysis plays a critical role in business analysis. However, the majority of the stakeholder identification and analysis methods focus on the activities and processes and ignore the artefacts being processed by human beings. By focusing on the outputs of the organisation, an artefact-centric view helps create a network of artefacts, and a component-based structure of the organisation and its supply chain participants. Since the relationship is based on the components, i.e. after the stakeholders are identified, the interdependency between stakeholders and the focal organisation can be measured. Each stakeholder is associated with two types of dependency, namely the stakeholder’s dependency on the focal organisation and the focal organisation’s dependency on the stakeholder. We identify three factors for each type of dependency and propose the equations that calculate the dependency indexes. Once both types of the dependency indexes are calculated, each stakeholder can be placed and categorised into one of the four groups, namely critical stakeholder, mutual benefits stakeholder, replaceable stakeholder, and easy care stakeholder. The mutual dependency grid and the dependency gap analysis, which further investigates the priority of each stakeholder by calculating the weighted dependency gap between the focal organisation and the stakeholder, subsequently help the focal organisation to better understand its stakeholders and manage its stakeholder relationships.
Abstract:
Background: Since their inception, Twitter and related microblogging systems have provided a rich source of information for researchers and have attracted interest in their affordances and use. Since 2009 PubMed has included 123 journal articles on medicine and Twitter, but no overview exists as to how the field uses Twitter in research. // Objective: This paper aims to identify published work relating to Twitter indexed by PubMed, and then to classify it. This classification will provide a framework in which future researchers will be able to position their work, and to provide an understanding of the current reach of research using Twitter in medical disciplines. Limiting the study to papers indexed by PubMed ensures the work provides a reproducible benchmark. // Methods: Papers, indexed by PubMed, on Twitter and related topics were identified and reviewed. The papers were then qualitatively classified based on the paper’s title and abstract to determine their focus. The work that was Twitter focused was studied in detail to determine what data, if any, it was based on, and from this a categorization of the data set size used in the studies was developed. Using open coded content analysis additional important categories were also identified, relating to the primary methodology, domain and aspect. // Results: As of 2012, PubMed comprises more than 21 million citations from biomedical literature, and from these a corpus of 134 potentially Twitter related papers were identified, eleven of which were subsequently found not to be relevant. There were no papers prior to 2009 relating to microblogging, a term first used in 2006. Of the remaining 123 papers which mentioned Twitter, thirty were focussed on Twitter (the others referring to it tangentially). The early Twitter focussed papers introduced the topic and highlighted the potential, not carrying out any form of data analysis. 
The majority of published papers used analytic techniques to sort through thousands, if not millions, of individual tweets, often depending on automated tools to do so. Our analysis demonstrates that researchers are starting to use knowledge discovery methods and data mining techniques to understand vast quantities of tweets: the study of Twitter is becoming quantitative research. // Conclusions: This work is to the best of our knowledge the first overview study of medical related research based on Twitter and related microblogging. We have used five dimensions to categorise published medical related research on Twitter. This classification provides a framework within which researchers studying development and use of Twitter within medical related research, and those undertaking comparative studies of research relating to Twitter in the area of medicine and beyond, can position and ground their work.
Abstract:
The two-way relationship between Rossby Wave-Breaking (RWB) and intensification of extra tropical cyclones is analysed over the Euro-Atlantic sector. In particular, the timing, intensity and location of cyclone development are related to RWB occurrences. For this purpose, two potential-temperature based indices are used to detect and classify anticyclonic and cyclonic RWB episodes from ERA-40 Re-Analysis data. Results show that explosive cyclogenesis over the North Atlantic (NA) is fostered by enhanced occurrence of RWB on days prior to the cyclone’s maximum intensification. Under such conditions, the eddy-driven jet stream is accelerated over the NA, thus enhancing conditions for cyclogenesis. For explosive cyclogenesis over the eastern NA, enhanced cyclonic RWB over eastern Greenland and anticyclonic RWB over the sub-tropical NA are observed. Typically only one of these is present in any given case, with the RWB over eastern Greenland being more frequent than its southern counterpart. This leads to an intensification of the jet over the eastern NA and enhanced probability of windstorms reaching Western Europe. Explosive cyclones evolving under simultaneous RWB on both sides of the jet feature a higher mean intensity and deepening rates than cyclones preceded by a single RWB event. Explosive developments over the western NA are typically linked to a single area of enhanced cyclonic RWB over western Greenland. Here, the eddy-driven jet is accelerated over the western NA. Enhanced occurrence of cyclonic RWB over southern Greenland and anticyclonic RWB over Europe is also observed after explosive cyclogenesis, potentially leading to the onset of Scandinavian Blocking. However, only very intense developments have a considerable influence on the large-scale atmospheric flow. Non-explosive cyclones depict no sign of enhanced RWB over the whole NA area. 
We conclude that the links between RWB and cyclogenesis over the Euro-Atlantic sector are sensitive to the cyclone’s maximum intensity, deepening rate and location.