899 results for Information Filtering, Pattern Mining, Relevance Feature Discovery, Text Mining
                                
                                
                                
Abstract:
This work aims to understand the processes underlying the coexistence patterns of spring-dwelling invertebrate species, distinguishing between stochastic and deterministic dynamics. Springs are complex ecosystems, and some of their characteristics (for example their insularity, thermal stability, mosaic-like ecotonal structure, frequent presence of rare and endemic species, and high taxonomic diversity) make them natural laboratories well suited to the study of ecological processes, including assembly processes. Studying these dynamics requires a multi-scale approach, so three spatial scales were considered. At the local scale, seasonal sampling was carried out on seven springs (four temporary and three permanent) of Monte Prinzera, an ophiolitic outcrop near the city of Parma. In this area, the effectiveness and environmental impact of different sampling methods were evaluated, and the ecological drivers influencing the communities were analysed. At a broader scale, 15 springs of the Emilia-Romagna region were sampled twice, in order to identify the role of dispersal and the possible presence of a niche-filtering effect. At the continental scale, literature data on springs of the western Palaearctic were collected, and the biogeographic patterns and the influence of climatic factors on the communities were studied. Different invertebrate taxa were considered (macroinvertebrates, ostracods, water mites, and copepods), choosing those best suited to the study of the different processes on the basis of their biological characteristics and the attainable taxonomic resolution. Biological sampling in springs is affected by several methodological problems and can have an impact on these environments. In this work two different methods were compared: net sampling with a proportional multi-habitat approach, and the combined use of traps and the washing of vegetation samples. The net provides more accurate and complete data, but also causes significant disturbance to the biotic and abiotic components of springs. This method is therefore recommended only when the aim of sampling is a thorough analysis of biodiversity. On the other hand, traps and vegetation washing are reliable methods with a lower impact on the ecosystem, and are thus suitable for ecological studies aimed at analysing community structure. This work confirmed that niche-based processes are decisive in structuring the communities of spring habitats, and that environmental drivers explain a substantial proportion of community variability. Indeed, the invertebrate communities of Monte Prinzera are influenced by factors related to water chemistry, habitat composition and heterogeneity, hydroperiod, and discharge fluctuations. Permanent springs show seasonal variation in the concentrations of the main ions, whereas conductivity, pH, and water temperature are more stable. The thermal stability of these environments is likely what explains the absence of seasonal variation in the structure of macroinvertebrate communities. The niche-filtering action of springs was analysed by studying the functional diversity of the ostracod communities of Emilia-Romagna.
Springs host more than 50% of the regional species pool, and several species were found exclusively in these habitats. This is the first study to analyse the functional diversity of ostracods, so a list of functional traits had to be compiled. When compared with the regional species pool, the functional diversity in springs is not significantly different from that measured in randomly assembled communities. Springs therefore do not constrain the functional diversity of coexisting species; rather, it can be concluded that, once the ecological requirements of the different species are satisfied, assembly processes in springs may be influenced by stochastic factors such as dispersal, speciation, and local extinctions. In addition, all the communities studied show recognisable spatial patterns, revealing dispersal limitation among springs, at least for some taxa. The characteristic isolation of springs could be the cause of this limitation, affecting passively dispersing taxa more than actively dispersing ones. In any case, in the Emilia-Romagna communities spatial factors explain only a small proportion of the total biological variability, while all the communities are influenced mainly by environmental variables. Environmental control therefore prevails over that exerted by spatial factors. This result shows that, although stochastic dynamics are important in all the communities studied, at this spatial scale deterministic factors play the dominant role. Stochastic processes become more influential in arid climates, where the disturbance associated with frequent drying events in springs produces source-sink dynamics among the different communities. Indeed, the variability explained by environmental factors was observed to decrease with increasing aridity of the climate. Frequent disturbance may cause local extinctions followed by recolonisation from neighbouring sites, reducing the match between organisms and their environmental requirements and thus decreasing the amount of variability explained by environmental factors. It can therefore be concluded that deterministic and stochastic processes are not mutually exclusive, but contribute simultaneously to structuring spring invertebrate communities. Finally, at the continental scale, spring ostracod communities show clear biogeographic patterns and are organised along environmental gradients mainly related to altitude, latitude, water temperature, and conductivity. Spring type (helocrene, rheocrene, or limnocrene) also influences community composition. Moreover, the presence of rare and endemic species characterises specific geographic regions.
                                
Abstract:
In this thesis work we develop a new generative model of social networks belonging to the family of Time-Varying Networks. Correctly modelling the mechanisms that shape the growth of a network and the dynamics of edge activation and inactivation is of central importance in network science. Indeed, by means of generative models that mimic the real-world dynamics of contacts in social networks it is possible to forecast the outcome of an epidemic process, optimize an immunization campaign, or optimally spread information among individuals. This task can now be tackled by taking advantage of the recent availability of large-scale, high-quality, time-resolved datasets. This wealth of digital data has allowed us to deepen our understanding of the structure and properties of many real-world networks. Moreover, the empirical evidence of a temporal dimension in networks prompted a shift of paradigm from a static representation of graphs to a time-varying one. In this work we exploit the Activity-Driven paradigm (a modelling tool belonging to the family of Time-Varying Networks) to develop a general dynamical model that encodes two fundamental mechanisms shaping the topology and temporal structure of social networks: social capital allocation and burstiness. The former accounts for the fact that individuals do not invest their time and social interactions at random, but rather allocate them toward already known nodes of the network. The latter accounts for the heavy-tailed distribution of inter-event times in social networks. We then empirically measure the properties of these two mechanisms from seven real-world datasets, develop a data-driven model, and solve it analytically. We check the results against numerical simulations and test our predictions on real-world datasets, finding good agreement between the two. Moreover, we find and characterize a non-trivial interplay between burstiness and social capital allocation in the parameter phase space. Finally, we present a novel approach to the development of a complete generative model of Time-Varying Networks. This model is inspired by Kauffman's adjacent possible theory and is based on a generalized version of Pólya's urn. Remarkably, most of the complex and heterogeneous features of real-world social networks are naturally reproduced by this dynamical model, together with many higher-order topological properties (clustering coefficient, community structure, etc.).
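The two mechanisms named above can be illustrated with a minimal toy simulation: nodes activate at heavy-tailed (bursty) inter-event times and, when active, either explore a new contact or reinforce an old one. This is only a sketch inspired by the abstract, not the thesis' actual model; all parameter values and the reinforcement rule P(new tie) = c/(c + k) are illustrative assumptions.

```python
# Toy activity-driven network with memory (social capital allocation) and burstiness.
# Not the thesis' model; parameters N, c and the Pareto exponents are assumptions.
import numpy as np

rng = np.random.default_rng(0)

N = 500                                   # number of nodes
steps = 2000                              # simulation steps
activity = rng.pareto(2.5, N) + 1e-3      # heterogeneous activity rates
activity /= activity.max()

contacts = [set() for _ in range(N)]      # previously contacted nodes per node
c = 1.0                                   # reinforcement constant: P(new tie) = c / (c + k)

# bursty activation: next firing time drawn from a heavy-tailed distribution
next_fire = rng.pareto(1.5, N) / activity

events = []                               # (time, i, j) temporal edge list
for t in range(steps):
    for i in np.where(next_fire <= t)[0]:
        k = len(contacts[i])
        if k == 0 or rng.random() < c / (c + k):
            j = int(rng.integers(N))              # explore: attach to a random node
        else:
            j = int(rng.choice(list(contacts[i])))  # reinforce an existing tie
        if j != i:
            contacts[i].add(j)
            contacts[j].add(int(i))
            events.append((t, int(i), j))
        # schedule the next activation with a heavy-tailed inter-event time
        next_fire[i] = t + rng.pareto(1.5) / activity[i]

print(f"{len(events)} temporal edges, mean degree "
      f"{2 * np.mean([len(s) for s in contacts]):.2f}")
```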
                                
Abstract:
A practical Bayesian approach for inference in neural network models has been available for ten years, and yet it is not used frequently in medical applications. In this chapter we show how both regularisation and feature selection can bring significant benefits in diagnostic tasks through two case studies: heart arrhythmia classification based on ECG data and the prognosis of lupus. In the first of these, the number of variables was reduced by two thirds without significantly affecting performance, while in the second, only the Bayesian models had an acceptable accuracy. In both tasks, neural networks outperformed other pattern recognition approaches.
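The gain from combining regularisation with feature selection can be sketched as follows. The chapter uses Bayesian regularisation and ARD-style relevance determination; the snippet below only approximates the idea with an L2-penalised network and univariate feature selection on a synthetic dataset, so all data, parameter values, and the "keep one third of the variables" choice are assumptions for illustration.

```python
# Illustrative approximation only: L2-regularised MLP with and without
# feature selection on synthetic data (not the chapter's Bayesian models or ECG data).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=600, n_features=60, n_informative=20,
                           random_state=0)

full = make_pipeline(StandardScaler(),
                     MLPClassifier(hidden_layer_sizes=(20,), alpha=1.0,
                                   max_iter=2000, random_state=0))
reduced = make_pipeline(StandardScaler(),
                        SelectKBest(f_classif, k=20),   # keep ~1/3 of the variables
                        MLPClassifier(hidden_layer_sizes=(20,), alpha=1.0,
                                      max_iter=2000, random_state=0))

print("all features :", cross_val_score(full, X, y, cv=5).mean())
print("1/3 features :", cross_val_score(reduced, X, y, cv=5).mean())
```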
                                
Abstract:
Today, the data available to tackle many scientific challenges is vast in quantity and diverse in nature. The exploration of heterogeneous information spaces requires suitable mining algorithms as well as effective visual interfaces. Most existing systems concentrate either on mining algorithms or on visualization techniques. Although the visual methods developed in information visualization have been helpful, an improved understanding of a large, complex, high-dimensional dataset requires an effective projection of the dataset onto a lower-dimensional (2D or 3D) manifold. This paper introduces a flexible visual data mining framework which combines advanced projection algorithms developed in the machine learning domain with visual techniques developed in the information visualization domain. The framework follows Shneiderman’s mantra to provide an effective user interface. The advantage of such an interface is that the user is directly involved in the data mining process. We integrate principled projection methods, such as Generative Topographic Mapping (GTM) and Hierarchical GTM (HGTM), with powerful visual techniques, such as magnification factors, directional curvatures, parallel coordinates, billboarding, and user interaction facilities, to provide an integrated visual data mining framework. Results on a real-life high-dimensional dataset from the chemoinformatics domain are also reported and discussed. The projection results of GTM are compared analytically with those of other traditional projection methods, and it is also shown that the HGTM algorithm provides additional value for large datasets. The computational complexity of these algorithms is discussed to demonstrate their suitability for the visual data mining framework.
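The "project, then inspect visually" loop described above can be sketched in a few lines. GTM/HGTM are not available in the standard Python scientific stack, so PCA stands in for the projection step here, and a parallel-coordinates plot stands in for the data-space view; the dataset and the choice of six columns are illustrative assumptions, not the chemoinformatics data used in the paper.

```python
# Sketch of the projection + visual-inspection pattern (PCA as a stand-in for GTM).
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_wine()
X = StandardScaler().fit_transform(data.data)

# 2D projection of the high-dimensional data (GTM would be swapped in here)
emb = PCA(n_components=2).fit_transform(X)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(emb[:, 0], emb[:, 1], c=data.target, cmap="viridis", s=15)
ax1.set_title("2D projection (latent space)")

# parallel coordinates of a few original dimensions for the same points
df = pd.DataFrame(X[:, :6], columns=data.feature_names[:6])
df["class"] = data.target
pd.plotting.parallel_coordinates(df, "class", ax=ax2, alpha=0.3)
ax2.set_title("Parallel coordinates (data space)")
plt.tight_layout()
plt.show()
```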
                                
Abstract:
A study of the information available on the settlement characteristics of backfill in restored opencast coal mining sites and other similar earthworks projects has been undertaken. In addition, the methods of opencast mining, compaction controls, monitoring and test methods have been reviewed. To consider and develop the methods of predicting the settlement of fill, three sites in the West Midlands have been examined; at each, the backfill had been placed in a controlled manner. In addition, use has been made of a finite element computer program to compare a simple two-dimensional linear elastic analysis with field observations of surface settlements in the vicinity of buried highwalls. On controlled backfill sites, settlement predictions have been made accurately, based on a linear relationship between settlement (expressed as a percentage of fill height) and the logarithm of time. This `creep' settlement was found to be effectively complete within 18 months of restoration. A decrease in this percentage settlement was observed with increasing fill thickness; this is believed to be related to the speed with which the backfill is placed. A rising water table within the backfill is indicated to cause additional gradual settlement. A prediction method, based on settlement monitoring, has been developed and used to determine the pattern of settlement across highwalls and buried highwalls. The zone of appreciable differential settlement was found to be mainly limited to the highwall area, and its magnitude was dictated by the highwall inclination. With a backfill cover of about 15 metres over a buried highwall the magnitude of differential settlement was negligible. The proposed settlement prediction method and monitoring have been used to control the re-development of restored opencast sites, aided by the specifications, tests and monitoring techniques developed in recent years. Such techniques have been valuable in restoring land previously derelict due to past underground mining.
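The prediction principle described above, a linear relationship between settlement (as a percentage of fill height) and the logarithm of time, can be written as s = a + b·log10(t) and fitted to monitoring data. The sketch below shows a minimal least-squares fit and extrapolation; the monitoring values are invented for illustration only.

```python
# Minimal sketch of creep-settlement prediction: fit s(%) = a + b*log10(t),
# then extrapolate. The monitoring data below are invented, not from the study.
import numpy as np

t_months = np.array([1, 2, 4, 6, 9, 12])                     # time since restoration
s_percent = np.array([0.10, 0.21, 0.33, 0.40, 0.47, 0.52])   # settlement / fill height * 100

# least-squares fit: s = a + b * log10(t)
b, a = np.polyfit(np.log10(t_months), s_percent, 1)

def predicted_settlement(t):
    """Creep settlement (% of fill height) at time t (months)."""
    return a + b * np.log10(t)

print(f"fitted: s(%) = {a:.3f} + {b:.3f} * log10(t)")
print(f"predicted settlement at 18 months: {predicted_settlement(18):.2f}% of fill height")
```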
                                
Abstract:
Sentiment analysis or opinion mining aims to use automated tools to detect subjective information such as opinions, attitudes, and feelings expressed in text. This paper proposes a novel probabilistic modeling framework called the joint sentiment-topic (JST) model based on latent Dirichlet allocation (LDA), which detects sentiment and topic simultaneously from text. A reparameterized version of the JST model called Reverse-JST, obtained by reversing the sequence of sentiment and topic generation in the modeling process, is also studied. Although JST is equivalent to Reverse-JST without a hierarchical prior, extensive experiments show that when sentiment priors are added, JST performs consistently better than Reverse-JST. Furthermore, unlike supervised approaches to sentiment classification, which often fail to produce satisfactory performance when shifted to other domains, the weakly supervised nature of JST makes it highly portable. This is verified by experimental results on data sets from five different domains, where the JST model even outperforms existing semi-supervised approaches on some of the data sets despite using no labeled documents. Moreover, the topics and topic sentiment detected by JST are indeed coherent and informative. We hypothesize that the JST model can readily meet the demand of large-scale sentiment analysis from the web in an open-ended fashion.
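The generative story behind JST (draw a sentiment label for each word, then a topic conditioned on that sentiment, then the word itself) can be sketched as a toy sampler. This is not the authors' implementation or inference procedure; the vocabulary size, numbers of sentiments and topics, and hyperparameter values below are illustrative assumptions.

```python
# Toy sketch of the JST generative process: sentiment -> topic -> word.
import numpy as np

rng = np.random.default_rng(1)

V, S, T = 1000, 2, 5            # vocabulary size, sentiment labels, topics (assumed)
gamma, alpha, beta = 1.0, 0.5, 0.01

# corpus-level word distributions phi[s, t] ~ Dirichlet(beta)
phi = rng.dirichlet(np.full(V, beta), size=(S, T))

def generate_document(n_words=100):
    pi = rng.dirichlet(np.full(S, gamma))                  # per-document sentiment dist.
    theta = np.stack([rng.dirichlet(np.full(T, alpha))     # per-document, per-sentiment
                      for _ in range(S)])                  # topic distributions
    words = []
    for _ in range(n_words):
        s = rng.choice(S, p=pi)          # draw sentiment label
        t = rng.choice(T, p=theta[s])    # draw topic given sentiment
        w = rng.choice(V, p=phi[s, t])   # draw word given sentiment and topic
        words.append((s, t, w))
    return words

doc = generate_document()
print("first 5 (sentiment, topic, word) triples:", doc[:5])
```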
                                
Abstract:
During the last decade, biomedicine has witnessed tremendous development. Large amounts of experimental and computational biomedical data have been generated along with new discoveries, accompanied by an exponential increase in the number of biomedical publications describing these discoveries. In the meantime, there has been great interest among scientific communities in text mining tools to find knowledge, such as protein-protein interactions, that is most relevant and useful for specific analysis tasks. This paper provides an outline of the various information extraction methods in the biomedical domain, especially for the discovery of protein-protein interactions. It surveys methodologies for analysing and processing plain text, categorizes current work in biomedical information extraction, and provides examples of these methods. Challenges in the field are also presented and possible solutions are discussed.
                                
Abstract:
Web APIs have gained increasing popularity in recent Web service technology development owing to their simple technology stack and the proliferation of mashups. However, efficiently discovering Web APIs and their documentation on the Web is still a challenging task, even with the best resources available. In this paper we cast the problem of detecting Web API documentation as a text classification problem: classifying a given Web page as Web API associated or not. We propose a supervised generative topic model called feature latent Dirichlet allocation (feaLDA) which offers a generic probabilistic framework for automatic detection of Web APIs. feaLDA not only captures the correspondence between data and the associated class labels, but also provides a mechanism for incorporating side information, such as labelled features automatically learned from data, that can effectively help improve classification performance. Extensive experiments on our Web API documentation dataset show that the feaLDA model outperforms three strong supervised baselines (naive Bayes, support vector machines, and the maximum entropy model) by over 3% in classification accuracy. In addition, feaLDA also gives superior performance when compared against other existing supervised topic models.
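For context on the task framing, the simplest of the baselines named above (a naive Bayes text classifier deciding "API documentation vs. not") can be sketched in a few lines. This is only a baseline sketch, not feaLDA; the example pages and the bag-of-words/TF-IDF setup are illustrative assumptions.

```python
# Sketch of a naive Bayes baseline for the Web API documentation detection task.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

pages = [
    "GET /v1/users returns a JSON array; authentication via API key header",
    "endpoint reference: POST /orders creates an order, response codes 200 and 400",
    "welcome to our company blog, read about our summer picnic and new offices",
    "photo gallery of the annual conference, click to enlarge the images",
]
labels = [1, 1, 0, 0]   # 1 = Web API documentation, 0 = other page

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
clf.fit(pages, labels)

print(clf.predict(["the REST API exposes a /v2/payments endpoint with OAuth"]))
```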
                                
Abstract:
In this poster we present our preliminary work on spammer detection and analysis using 50 active honeypot profiles implemented on the Weibo.com and QQ.com microblogging networks. We separated spammers from legitimate users by manually checking the content of every captured user's microblogs, and built a spammer dataset and a legitimate-user dataset for each social network community. We analysed several features of the two user classes and compared them; these features were found to be useful for distinguishing spammers from legitimate users. The following are initial observations from our analysis of the spammers captured on Weibo.com and QQ.com.
• The following/follower ratio of spammers is usually higher than that of legitimate users. They tend to follow a large number of users in order to gain popularity, but usually have relatively few followers.
• There is a large gap between the average number of microblogs posted per day by the two classes. On Weibo.com, spammers post many more microblogs per day than legitimate users do, while on QQ.com spammers post far fewer microblogs than legitimate users. This is mainly due to the different strategies adopted by spammers on the two platforms.
• More spammers choose a cautious spam-posting pattern. They mix spam microblogs with ordinary ones so that they can evade the anti-spam mechanisms deployed by the service providers.
• Aggressive spammers are more likely to be detected, so they tend to have a shorter life, while cautious spammers can live much longer and have a deeper influence on the network. The latter kind of spammer may become the prevailing type of social network spammer. © 2012 IEEE.
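The two simplest features listed above (following/follower ratio and posts per day) can be computed directly from a captured account record, as in the sketch below. The field names and the example values are invented for illustration; thresholds and any further features are not specified here.

```python
# Illustrative computation of two spammer-discriminating features.
from dataclasses import dataclass

@dataclass
class Account:
    following: int
    followers: int
    posts: int
    days_observed: int

def ff_ratio(a: Account) -> float:
    """Following/follower ratio; spammers tend to have high values."""
    return a.following / max(a.followers, 1)

def posts_per_day(a: Account) -> float:
    """Average microblogs posted per day over the observation window."""
    return a.posts / max(a.days_observed, 1)

suspect = Account(following=2300, followers=35, posts=900, days_observed=30)
print(ff_ratio(suspect), posts_per_day(suspect))
```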
                                
Abstract:
In current organizations, valuable enterprise knowledge is often buried under a rapidly expanding amount of unstructured information in the form of web pages, blogs, and other forms of human text communication. We present a novel unsupervised machine learning method called CORDER (COmmunity Relation Discovery by named Entity Recognition) to turn these unstructured data into structured information for knowledge management in these organizations. CORDER exploits named entity recognition and co-occurrence data to associate individuals in an organization with their expertise and associates. We discuss the problems associated with evaluating unsupervised learners and report our initial evaluation experiments: an expert evaluation, a quantitative benchmarking, and an application of CORDER in a social networking tool called BuddyFinder.
                                
Abstract:
Discovering who works with whom, on which projects and with which customers is a key task in knowledge management. Although most organizations keep models of organizational structures, these models do not necessarily accurately reflect the reality on the ground. In this paper we present a text mining method called CORDER which first recognizes named entities (NEs) of various types from Web pages, and then discovers relations from a target NE to other NEs which co-occur with it. We evaluated the method on our departmental Website. We used the CORDER method to first find related NEs of four types (organizations, people, projects, and research areas) from Web pages on the Website and then rank them according to their co-occurrence with each of the people in our department. 20 representative people were selected and each of them was presented with ranked lists of each type of NE. Each person specified whether these NEs were related to him/her and changed or confirmed their rankings. Our results indicate that the method can find the NEs with which these people are closely related and provide accurate rankings.
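The core ranking idea described above (rank candidate NEs by how strongly they co-occur with a target person) can be sketched with plain co-occurrence counts. CORDER itself uses richer evidence than raw counts, so this is only a toy approximation; the pages and entity names are invented examples.

```python
# Toy sketch of co-occurrence ranking of named entities for a target person.
from collections import Counter

pages = [
    {"Alice Smith", "Project Aqua", "Knowledge Media Institute"},
    {"Alice Smith", "Project Aqua", "Bob Jones"},
    {"Bob Jones", "Project Terra", "Knowledge Media Institute"},
    {"Alice Smith", "Knowledge Media Institute", "Project Terra"},
]

def rank_related(target: str, pages) -> list[tuple[str, int]]:
    """Rank named entities by how often they co-occur with `target` on a page."""
    counts = Counter()
    for entities in pages:
        if target in entities:
            counts.update(e for e in entities if e != target)
    return counts.most_common()

print(rank_related("Alice Smith", pages))
```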
                                
Abstract:
The management and sharing of complex data, information, and knowledge is a fundamental and growing concern in the Water and other industries, for a variety of reasons. For example, the risks and uncertainties associated with climate and other changes require knowledge to prepare for a range of future scenarios and potential extreme events. Formal ways of establishing and managing knowledge can deliver efficiencies in acquisition, structuring, and filtering, so that only the essential aspects of the knowledge are provided. Ontologies are a key technology for this knowledge management. The construction of ontologies is a considerable overhead on any knowledge management programme; hence current computer science research is investigating the automatic generation of ontologies from documents using text mining and natural language techniques. As an example, results from the application of the Text2Onto tool to stakeholder documents for a project on sustainable water cycle management in new developments are presented. It is concluded that by adopting ontological representations sooner rather than later in an analytical process, decision makers will be able to make better use of highly knowledgeable systems containing automated services to ensure that sustainability considerations are included.
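Ontology learning from text typically starts with extracting candidate concept terms from the document collection. Text2Onto itself is a Java tool with a much richer pipeline (concepts, instances, relations, confidence values); the sketch below only illustrates the very first step with simple term-frequency counting, and the stakeholder-style documents are invented stand-ins.

```python
# Rough sketch of concept-candidate extraction by term frequency (first step only).
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "rainwater harvesting and greywater reuse reduce potable water demand",
    "sustainable drainage systems manage surface water in new developments",
    "greywater reuse requires treatment before irrigation of green spaces",
]

vec = CountVectorizer(ngram_range=(1, 2), stop_words="english")
counts = vec.fit_transform(documents).sum(axis=0).A1
terms = sorted(zip(vec.get_feature_names_out(), counts), key=lambda t: -t[1])

print(terms[:10])   # most frequent candidate concepts / relations
```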
                                
Abstract:
We present an information-theoretic analysis of the tradeoff between bit-error-rate improvement and the data-rate loss when skewed channel coding is used to suppress pattern-dependent errors in digital communications. Without loss of generality, we apply the developed general theory to the particular example of a high-speed fiber communication system with a strong patterning effect. © 2007 IEEE.
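The nature of the tradeoff can be illustrated with a toy calculation: skewing the input distribution away from the error-prone symbol lowers the error rate but also lowers the source entropy, i.e. the usable data rate per transmitted symbol. This is not the paper's analysis or its channel model; the asymmetric error probabilities below are invented for illustration.

```python
# Toy illustration of the skewed-coding tradeoff: lower BER vs. lower rate.
import numpy as np

def h2(p):
    """Binary entropy in bits."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return float(-p * np.log2(p) - (1 - p) * np.log2(1 - p))

eps_bad, eps_good = 0.05, 0.001      # assumed error prob. of the '1' vs '0' symbol

for p1 in (0.5, 0.4, 0.3, 0.2):      # probability of transmitting a '1'
    ber = p1 * eps_bad + (1 - p1) * eps_good
    rate = h2(p1)                     # information per transmitted symbol (bits)
    print(f"P(1)={p1:.1f}  rate={rate:.3f} bit/symbol  BER={ber:.4f}")
```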
 
                    