899 resultados para Information Filtering, Pattern Mining, Relevance Feature Discovery, Text Mining


Relevância:

40.00% 40.00%

Publicador:

Resumo:

This thesis analyses problems related to the applicability, in business environments, of Process Mining tools and techniques. The first contribution is a presentation of the state of the art of Process Mining and a characterization of companies, in terms of their "process awareness". The work continues identifying circumstance where problems can emerge: data preparation; actual mining; and results interpretation. Other problems are the configuration of parameters by not-expert users and computational complexity. We concentrate on two possible scenarios: "batch" and "on-line" Process Mining. Concerning the batch Process Mining, we first investigated the data preparation problem and we proposed a solution for the identification of the "case-ids" whenever this field is not explicitly indicated. After that, we concentrated on problems at mining time and we propose the generalization of a well-known control-flow discovery algorithm in order to exploit non instantaneous events. The usage of interval-based recording leads to an important improvement of performance. Later on, we report our work on the parameters configuration for not-expert users. We present two approaches to select the "best" parameters configuration: one is completely autonomous; the other requires human interaction to navigate a hierarchy of candidate models. Concerning the data interpretation and results evaluation, we propose two metrics: a model-to-model and a model-to-log. Finally, we present an automatic approach for the extension of a control-flow model with social information, in order to simplify the analysis of these perspectives. The second part of this thesis deals with control-flow discovery algorithms in on-line settings. We propose a formal definition of the problem, and two baseline approaches. The actual mining algorithms proposed are two: the first is the adaptation, to the control-flow discovery problem, of a frequency counting algorithm; the second constitutes a framework of models which can be used for different kinds of streams (stationary versus evolving).

Relevância:

40.00% 40.00%

Publicador:

Resumo:

L'elaborato ha come scopo l'analisi delle tecniche di Text Mining e la loro applicazione all'interno di processi per l'auto-organizzazione della conoscenza. La prima parte della tesi si concentra sul concetto del Text Mining. Viene fornita la sua definizione, i possibili campi di utilizzo, il processo di sviluppo che lo riguarda e vengono esposte le diverse tecniche di Text Mining. Si analizzano poi alcuni tools per il Text Mining e infine vengono presentati alcuni esempi pratici di utilizzo. Il macro-argomento che viene esposto successivamente riguarda TuCSoN, una infrastruttura per la coordinazione di processi: autonomi, distribuiti e intelligenti, come ad esempio gli agenti. Si descrivono innanzi tutto le entità sulle quali il modello si basa, vengono introdotte le metodologie di interazione fra di essi e successivamente, gli strumenti di programmazione che l'infrastruttura mette a disposizione. La tesi, in un secondo momento, presenta MoK, un modello di coordinazione basato sulla biochimica studiato per l'auto-organizzazione della conoscenza. Anche per MoK, come per TuCSoN, vengono introdotte le entità alla base del modello. Avvalendosi MoK dell'infrastruttura TuCSoN, viene mostrato come le entità del primo vengano mappate su quelle del secondo. A conclusione dell'argomento viene mostrata un'applicazione per l'auto-organizzazione di news che si avvale del modello. Il capitolo successivo si occupa di analizzare i possibili utilizzi delle tecniche di Text Mining all'interno di infrastrutture per l'auto-organizzazione, come MoK. Nell'elaborato vengono poi presentati gli esperimenti effettuati sfruttando tecniche di Text Mining. Tutti gli esperimenti svolti hanno come scopo la clusterizzazione di articoli scientifici in base al loro contenuto, vengono quindi analizzati i risultati ottenuti. L'elaborato di tesi si conclude mettendo in evidenza alcune considerazioni finali su quanto svolto.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Information is nowadays a key resource: machine learning and data mining techniques have been developed to extract high-level information from great amounts of data. As most data comes in form of unstructured text in natural languages, research on text mining is currently very active and dealing with practical problems. Among these, text categorization deals with the automatic organization of large quantities of documents in priorly defined taxonomies of topic categories, possibly arranged in large hierarchies. In commonly proposed machine learning approaches, classifiers are automatically trained from pre-labeled documents: they can perform very accurate classification, but often require a consistent training set and notable computational effort. Methods for cross-domain text categorization have been proposed, allowing to leverage a set of labeled documents of one domain to classify those of another one. Most methods use advanced statistical techniques, usually involving tuning of parameters. A first contribution presented here is a method based on nearest centroid classification, where profiles of categories are generated from the known domain and then iteratively adapted to the unknown one. Despite being conceptually simple and having easily tuned parameters, this method achieves state-of-the-art accuracy in most benchmark datasets with fast running times. A second, deeper contribution involves the design of a domain-independent model to distinguish the degree and type of relatedness between arbitrary documents and topics, inferred from the different types of semantic relationships between respective representative words, identified by specific search algorithms. The application of this model is tested on both flat and hierarchical text categorization, where it potentially allows the efficient addition of new categories during classification. Results show that classification accuracy still requires improvements, but models generated from one domain are shown to be effectively able to be reused in a different one.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Nowadays communication is switching from a centralized scenario, where communication media like newspapers, radio, TV programs produce information and people are just consumers, to a completely different decentralized scenario, where everyone is potentially an information producer through the use of social networks, blogs, forums that allow a real-time worldwide information exchange. These new instruments, as a result of their widespread diffusion, have started playing an important socio-economic role. They are the most used communication media and, as a consequence, they constitute the main source of information enterprises, political parties and other organizations can rely on. Analyzing data stored in servers all over the world is feasible by means of Text Mining techniques like Sentiment Analysis, which aims to extract opinions from huge amount of unstructured texts. This could lead to determine, for instance, the user satisfaction degree about products, services, politicians and so on. In this context, this dissertation presents new Document Sentiment Classification methods based on the mathematical theory of Markov Chains. All these approaches bank on a Markov Chain based model, which is language independent and whose killing features are simplicity and generality, which make it interesting with respect to previous sophisticated techniques. Every discussed technique has been tested in both Single-Domain and Cross-Domain Sentiment Classification areas, comparing performance with those of other two previous works. The performed analysis shows that some of the examined algorithms produce results comparable with the best methods in literature, with reference to both single-domain and cross-domain tasks, in $2$-classes (i.e. positive and negative) Document Sentiment Classification. However, there is still room for improvement, because this work also shows the way to walk in order to enhance performance, that is, a good novel feature selection process would be enough to outperform the state of the art. Furthermore, since some of the proposed approaches show promising results in $2$-classes Single-Domain Sentiment Classification, another future work will regard validating these results also in tasks with more than $2$ classes.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

La tesi riguarda lo sviluppo di recommender system che hanno lo scopo di supportare chi è alla ricerca di un lavoro e le aziende che devono selezionare la giusta figura. A partire da un insieme di skill il sistema suggerisce alla persona la posizione lavorativa più affine al suo profilo, oppure a partire da una specifica posizione lavorativa suggerisce all'azienda la persona che più si avvicina alle sue esigenze.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

This paper describes informatics for cross-sample analysis with comprehensive two-dimensional gas chromatography (GCxGC) and high-resolution mass spectrometry (HRMS). GCxGC-HRMS analysis produces large data sets that are rich with information, but highly complex. The size of the data and volume of information requires automated processing for comprehensive cross-sample analysis, but the complexity poses a challenge for developing robust methods. The approach developed here analyzes GCxGC-HRMS data from multiple samples to extract a feature template that comprehensively captures the pattern of peaks detected in the retention-times plane. Then, for each sample chromatogram, the template is geometrically transformed to align with the detected peak pattern and generate a set of feature measurements for cross-sample analyses such as sample classification and biomarker discovery. The approach avoids the intractable problem of comprehensive peak matching by using a few reliable peaks for alignment and peak-based retention-plane windows to define comprehensive features that can be reliably matched for cross-sample analysis. The informatics are demonstrated with a set of 18 samples from breast-cancer tumors, each from different individuals, six each for Grades 1-3. The features allow classification that matches grading by a cancer pathologist with 78% success in leave-one-out cross-validation experiments. The HRMS signatures of the features of interest can be examined for determining elemental compositions and identifying compounds.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

In 2011, researchers at Bucknell University and Illinois Wesleyan University compared the search efficacy of Serial Solutions Summon, EBSCO Discovery Service, Google Scholar and conventional library databases. Using a mixed-methods approach, qualitative and quantitative data was gathered on students’ usage of these tools. Regardless of the search system, students exhibited a marked inability to effectively evaluate sources and a heavy reliance on default search settings. On the quantitative benchmarks measured by this study, the EBSCO Discovery Service tool outperformed the other search systems in almost every category. This article describes these results and makes recommendations for libraries considering these tools.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

The curriculum of the Bucknell University Chemical Engineering Department includes a required senior year capstone course titled Process Engineering, with an emphasis on process design. For the past ten years library research has been a significant component of the coursework, and students working in teams meet with the librarian throughout the semester to explore a wide variety of information resources required for their project. The assignment has been the same from 1989 to 1999. Teams of students are responsible for designing a safe, efficient, and profitable process for the dehydrogenation of ethylbenzene to styrene monomer. A series of written reports on their chosen process design is a significant course outcome. While the assignment and the specific chemical technology have not changed radically in the past decade, the process of research and discovery has evolved considerably. This paper describes the solutions offered in 1989 to meet the information needs of the chemical engineering students at Bucknell University, and the evolution in research brought about by online databases, electronic journals, and the Internet, making the process of discovery a completely different experience in 1999.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Despite the fact that photographic stimuli are used across experimental contexts with both human and nonhuman subjects, the nature of individuals' perceptions of these stimuli is still not well understood. In the present experiments, we tested whether three orangutans and 36 human children could use photographic information presented on a computer screen to solve a perceptually corresponding problem in the physical domain. Furthermore, we tested the cues that aided in this process by pitting featural information against spatial position in a series of probe trials. We found that many of the children and one orangutan were successfully able to use the information cross-dimensionally; however, the other two orangutans and almost a quarter of the children failed to acquire the task. Species differences emerged with respect to ease of task acquisition. More striking, however, were the differences in cues that participants used to solve the task: Whereas the orangutan used a spatial strategy, the majority of children used a feature one. Possible reasons for these differences are discussed from both evolutionary and developmental perspectives. The novel results found here underscore the need for further testing in this area to design appropriate experimental paradigms in future comparative research settings.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

11beta-Hydroxysteroid dehydrogenase (11beta-HSD) enzymes catalyze the conversion of biologically inactive 11-ketosteroids into their active 11beta-hydroxy derivatives and vice versa. Inhibition of 11beta-HSD1 has considerable therapeutic potential for glucocorticoid-associated diseases including obesity, diabetes, wound healing, and muscle atrophy. Because inhibition of related enzymes such as 11beta-HSD2 and 17beta-HSDs causes sodium retention and hypertension or interferes with sex steroid hormone metabolism, respectively, highly selective 11beta-HSD1 inhibitors are required for successful therapy. Here, we employed the software package Catalyst to develop ligand-based multifeature pharmacophore models for 11beta-HSD1 inhibitors. Virtual screening experiments and subsequent in vitro evaluation of promising hits revealed several selective inhibitors. Efficient inhibition of recombinant human 11beta-HSD1 in intact transfected cells as well as endogenous enzyme in mouse 3T3-L1 adipocytes and C2C12 myotubes was demonstrated for compound 27, which was able to block subsequent cortisol-dependent activation of glucocorticoid receptors with only minor direct effects on the receptor itself. Our results suggest that inhibitor-based pharmacophore models for 11beta-HSD1 in combination with suitable cell-based activity assays, including such for related enzymes, can be used for the identification of selective and potent inhibitors.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Mobile Mesh Network based In-Transit Visibility (MMN-ITV) system facilitates global real-time tracking capability for the logistics system. In-transit containers form a multi-hop mesh network to forward the tracking information to the nearby sinks, which further deliver the information to the remote control center via satellite. The fundamental challenge to the MMN-ITV system is the energy constraint of the battery-operated containers. Coupled with the unique mobility pattern, cross-MMN behavior, and the large-spanned area, it is necessary to investigate the energy-efficient communication of the MMN-ITV system thoroughly. First of all, this dissertation models the energy-efficient routing under the unique pattern of the cross-MMN behavior. A new modeling approach, pseudo-dynamic modeling approach, is proposed to measure the energy-efficiency of the routing methods in the presence of the cross-MMN behavior. With this approach, it could be identified that the shortest-path routing and the load-balanced routing is energy-efficient in mobile networks and static networks respectively. For the MMN-ITV system with both mobile and static MMNs, an energy-efficient routing method, energy-threshold routing, is proposed to achieve the best tradeoff between them. Secondly, due to the cross-MMN behavior, neighbor discovery is executed frequently to help the new containers join the MMN, hence, consumes similar amount of energy as that of the data communication. By exploiting the unique pattern of the cross-MMN behavior, this dissertation proposes energy-efficient neighbor discovery wakeup schedules to save up to 60% of the energy for neighbor discovery. Vehicular Ad Hoc Networks (VANETs)-based inter-vehicle communications is by now growingly believed to enhance traffic safety and transportation management with low cost. The end-to-end delay is critical for the time-sensitive safety applications in VANETs, and can be a decisive performance metric for VANETs. This dissertation presents a complete analytical model to evaluate the end-to-end delay against the transmission range and the packet arrival rate. This model illustrates a significant end-to-end delay increase from non-saturated networks to saturated networks. It hence suggests that the distributed power control and admission control protocols for VANETs should aim at improving the real-time capacity (the maximum packet generation rate without causing saturation), instead of the delay itself. Based on the above model, it could be determined that adopting uniform transmission range for every vehicle may hinder the delay performance improvement, since it does not allow the coexistence of the short path length and the low interference. Clusters are proposed to configure non-uniform transmission range for the vehicles. Analysis and simulation confirm that such configuration can enhance the real-time capacity. In addition, it provides an improved trade off between the end-to-end delay and the network capacity. A distributed clustering protocol with minimum message overhead is proposed, which achieves low convergence time.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

The Cardwell Mining District is part of the greater Whitehall Mining District. The district is situated about four miles to the east and northeast of Whitehall in the southern end of the Bull Mountains which are near the Continental Divide. The first reported production was in 1896 after the dis­covery of the Mayflower Mine. Mining has been carried on in­termittently and on a small scale since that time.