899 resultados para Information Filtering, Pattern Mining, Relevance Feature Discovery, Text Mining
Resumo:
Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq)
Resumo:
Discovery Driven Analysis (DDA) is a common feature of OLAP technology to analyze structured data. In essence, DDA helps analysts to discover anomalous data by highlighting 'unexpected' values in the OLAP cube. By giving indications to the analyst on what dimensions to explore, DDA speeds up the process of discovering anomalies and their causes. However, Discovery Driven Analysis (and OLAP in general) is only applicable on structured data, such as records in databases. We propose a system to extend DDA technology to semi-structured text documents, that is, text documents with a few structured data. Our system pipeline consists of two stages: first, the text part of each document is structured around user specified dimensions, using semi-PLSA algorithm; then, we adapt DDA to these fully structured documents, thus enabling DDA on text documents. We present some applications of this system in OLAP analysis and show how scalability issues are solved. Results show that our system can handle reasonable datasets of documents, in real time, without any need for pre-computation.
Resumo:
El treball desenvolupat en aquesta tesi presenta un profund estudi i proveïx solucions innovadores en el camp dels sistemes recomanadors. Els mètodes que usen aquests sistemes per a realitzar les recomanacions, mètodes com el Filtrat Basat en Continguts (FBC), el Filtrat Col·laboratiu (FC) i el Filtrat Basat en Coneixement (FBC), requereixen informació dels usuaris per a predir les preferències per certs productes. Aquesta informació pot ser demogràfica (Gènere, edat, adreça, etc), o avaluacions donades sobre algun producte que van comprar en el passat o informació sobre els seus interessos. Existeixen dues formes d'obtenir aquesta informació: els usuaris ofereixen explícitament aquesta informació o el sistema pot adquirir la informació implícita disponible en les transaccions o historial de recerca dels usuaris. Per exemple, el sistema recomanador de pel·lícules MovieLens (http://movielens.umn.edu/login) demana als usuaris que avaluïn almenys 15 pel·lícules dintre d'una escala de * a * * * * * (horrible, ...., ha de ser vista). El sistema genera recomanacions sobre la base d'aquestes avaluacions. Quan els usuaris no estan registrat en el sistema i aquest no té informació d'ells, alguns sistemes realitzen les recomanacions tenint en compte l'historial de navegació. Amazon.com (http://www.amazon.com) realitza les recomanacions tenint en compte les recerques que un usuari a fet o recomana el producte més venut. No obstant això, aquests sistemes pateixen de certa falta d'informació. Aquest problema és generalment resolt amb l'adquisició d'informació addicional, se li pregunta als usuaris sobre els seus interessos o es cerca aquesta informació en fonts addicionals. La solució proposada en aquesta tesi és buscar aquesta informació en diverses fonts, específicament aquelles que contenen informació implícita sobre les preferències dels usuaris. Aquestes fonts poden ser estructurades com les bases de dades amb informació de compres o poden ser no estructurades com les pàgines web on els usuaris deixen la seva opinió sobre algun producte que van comprar o posseïxen. Nosaltres trobem tres problemes fonamentals per a aconseguir aquest objectiu: 1 . La identificació de fonts amb informació idònia per als sistemes recomanadors. 2 . La definició de criteris que permetin la comparança i selecció de les fonts més idònies. 3 . La recuperació d'informació de fonts no estructurades. En aquest sentit, en la tesi proposada s'ha desenvolupat: 1 . Una metodologia que permet la identificació i selecció de les fonts més idònies. Criteris basats en les característiques de les fonts i una mesura de confiança han estat utilitzats per a resoldre el problema de la identificació i selecció de les fonts. 2 . Un mecanisme per a recuperar la informació no estructurada dels usuaris disponible en la web. Tècniques de Text Mining i ontologies s'han utilitzat per a extreure informació i estructurar-la apropiadament perquè la utilitzin els recomanadors. Les contribucions del treball desenvolupat en aquesta tesi doctoral són: 1. Definició d'un conjunt de característiques per a classificar fonts rellevants per als sistemes recomanadors 2. Desenvolupament d'una mesura de rellevància de les fonts calculada sobre la base de les característiques definides 3. Aplicació d'una mesura de confiança per a obtenir les fonts més fiables. La confiança es definida des de la perspectiva de millora de la recomanació, una font fiable és aquella que permet millorar les recomanacions. 4. Desenvolupament d'un algorisme per a seleccionar, des d'un conjunt de fonts possibles, les més rellevants i fiable utilitzant les mitjanes esmentades en els punts previs. 5. Definició d'una ontologia per a estructurar la informació sobre les preferències dels usuaris que estan disponibles en Internet. 6. Creació d'un procés de mapatge que extreu automàticament informació de les preferències dels usuaris disponibles en la web i posa aquesta informació dintre de l'ontologia. Aquestes contribucions permeten aconseguir dos objectius importants: 1 . Millorament de les recomanacions usant fonts d'informació alternatives que sigui rellevants i fiables. 2 . Obtenir informació implícita dels usuaris disponible en Internet.
Resumo:
Dissertation submitted in partial fulfilment of the requirements for the Degree of Master of Science in Geospatial Technologies
Resumo:
Dissertação apresentada como requisito parcial para obtenção do grau de Mestre em Estatística e Gestão de Informação
Resumo:
This paper describes a data mining environment for knowledge discovery in bioinformatics applications. The system has a generic kernel that implements the mining functions to be applied to input primary databases, with a warehouse architecture, of biomedical information. Both supervised and unsupervised classification can be implemented within the kernel and applied to data extracted from the primary database, with the results being suitably stored in a complex object database for knowledge discovery. The kernel also includes a specific high-performance library that allows designing and applying the mining functions in parallel machines. The experimental results obtained by the application of the kernel functions are reported. © 2003 Elsevier Ltd. All rights reserved.
Resumo:
Mode of access: Internet.
Resumo:
Mode of access: Internet.
Resumo:
In the recent past, hardly anyone could predict this course of GIS development. GIS is moving from desktop to cloud. Web 2.0 enabled people to input data into web. These data are becoming increasingly geolocated. Big amounts of data formed something that is called "Big Data". Scientists still don't know how to deal with it completely. Different Data Mining tools are used for trying to extract some useful information from this Big Data. In our study, we also deal with one part of these data - User Generated Geographic Content (UGGC). The Panoramio initiative allows people to upload photos and describe them with tags. These photos are geolocated, which means that they have exact location on the Earth's surface according to a certain spatial reference system. By using Data Mining tools, we are trying to answer if it is possible to extract land use information from Panoramio photo tags. Also, we tried to answer to what extent this information could be accurate. At the end, we compared different Data Mining methods in order to distinguish which one has the most suited performances for this kind of data, which is text. Our answers are quite encouraging. With more than 70% of accuracy, we proved that extracting land use information is possible to some extent. Also, we found Memory Based Reasoning (MBR) method the most suitable method for this kind of data in all cases.
Resumo:
ABSTRACT Geographic Information System (GIS) is an indispensable software tool in forest planning. In forestry transportation, GIS can manage the data on the road network and solve some problems in transportation, such as route planning. Therefore, the aim of this study was to determine the pattern of the road network and define transport routes using GIS technology. The present research was conducted in a forestry company in the state of Minas Gerais, Brazil. The criteria used to classify the pattern of forest roads were horizontal and vertical geometry, and pavement type. In order to determine transport routes, a data Analysis Model Network was created in ArcGIS using an Extension Network Analyst, allowing finding a route shorter in distance and faster. The results showed a predominance of horizontal geometry classes average (3) and bad (4), indicating presence of winding roads. In the case of vertical geometry criterion, the class of highly mountainous relief (4) possessed the greatest extent of roads. Regarding the type of pavement, the occurrence of secondary coating was higher (75%), followed by primary coating (20%) and asphalt pavement (5%). The best route was the one that allowed the transport vehicle travel in a higher specific speed as a function of road pattern found in the study.
Resumo:
Information systems for business are frequently heavily reliant on software. Two important feedback-related effects of embedding software in a business process are identified. First, the system dynamics of the software maintenance process can become complex, particularly in the number and scope of the feedback loops. Secondly, responsiveness to feedback can have a big effect on the evolvability of the information system. Ways have been explored to provide an effective mechanism for improving the quality of feedback between stakeholders during software maintenance. Understanding can be improved by using representations of information systems that are both service-based and architectural in scope. The conflicting forces that encourage change or stability can be resolved using patterns and pattern languages. A morphology of information systems pattern languages has been described to facilitate the identification and reuse of patterns and pattern languages. The kind of planning process needed to achieve consensus on a system's evolution is also considered.
Resumo:
We show how multivariate GARCH models can be used to generate a time-varying “information share” (Hasbrouck, 1995) to represent the changing patterns of price discovery in closely related securities. We find that time-varying information shares can improve credit spread predictions.
Resumo:
Various factors are believed to govern the selection of references in citation networks, but a precise, quantitative determination of their importance has remained elusive. In this paper, we show that three factors can account for the referencing pattern of citation networks for two topics, namely "graphenes" and "complex networks", thus allowing one to reproduce the topological features of the networks built with papers being the nodes and the edges established by citations. The most relevant factor was content similarity, while the other two - in-degree (i.e. citation counts) and age of publication - had varying importance depending on the topic studied. This dependence indicates that additional factors could play a role. Indeed, by intuition one should expect the reputation (or visibility) of authors and/or institutions to affect the referencing pattern, and this is only indirectly considered via the in-degree that should correlate with such reputation. Because information on reputation is not readily available, we simulated its effect on artificial citation networks considering two communities with distinct fitness (visibility) parameters. One community was assumed to have twice the fitness value of the other, which amounts to a double probability for a paper being cited. While the h-index for authors in the community with larger fitness evolved with time with slightly higher values than for the control network (no fitness considered), a drastic effect was noted for the community with smaller fitness. (C) 2012 Elsevier Ltd. All rights reserved.
Resumo:
Development of homology modeling methods will remain an area of active research. These methods aim to develop and model increasingly accurate three-dimensional structures of yet uncrystallized therapeutically relevant proteins e.g. Class A G-Protein Coupled Receptors. Incorporating protein flexibility is one way to achieve this goal. Here, I will discuss the enhancement and validation of the ligand-steered modeling, originally developed by Dr. Claudio Cavasotto, via cross modeling of the newly crystallized GPCR structures. This method uses known ligands and known experimental information to optimize relevant protein binding sites by incorporating protein flexibility. The ligand-steered models were able to model, reasonably reproduce binding sites and the co-crystallized native ligand poses of the β2 adrenergic and Adenosine 2A receptors using a single template structure. They also performed better than the choice of template, and crude models in a small scale high-throughput docking experiments and compound selectivity studies. Next, the application of this method to develop high-quality homology models of Cannabinoid Receptor 2, an emerging non-psychotic pain management target, is discussed. These models were validated by their ability to rationalize structure activity relationship data of two, inverse agonist and agonist, series of compounds. The method was also applied to improve the virtual screening performance of the β2 adrenergic crystal structure by optimizing the binding site using β2 specific compounds. These results show the feasibility of optimizing only the pharmacologically relevant protein binding sites and applicability to structure-based drug design projects.
Resumo:
As the number of protein folds is quite limited, a mode of analysis that will be increasingly common in the future, especially with the advent of structural genomics, is to survey and re-survey the finite parts list of folds from an expanding number of perspectives. We have developed a new resource, called PartsList, that lets one dynamically perform these comparative fold surveys. It is available on the web at http://bioinfo.mbb.yale.edu/partslist and http://www.partslist.org. The system is based on the existing fold classifications and functions as a form of companion annotation for them, providing ‘global views’ of many already completed fold surveys. The central idea in the system is that of comparison through ranking; PartsList will rank the approximately 420 folds based on more than 180 attributes. These include: (i) occurrence in a number of completely sequenced genomes (e.g. it will show the most common folds in the worm versus yeast); (ii) occurrence in the structure databank (e.g. most common folds in the PDB); (iii) both absolute and relative gene expression information (e.g. most changing folds in expression over the cell cycle); (iv) protein–protein interactions, based on experimental data in yeast and comprehensive PDB surveys (e.g. most interacting fold); (v) sensitivity to inserted transposons; (vi) the number of functions associated with the fold (e.g. most multi-functional folds); (vii) amino acid composition (e.g. most Cys-rich folds); (viii) protein motions (e.g. most mobile folds); and (ix) the level of similarity based on a comprehensive set of structural alignments (e.g. most structurally variable folds). The integration of whole-genome expression and protein–protein interaction data with structural information is a particularly novel feature of our system. We provide three ways of visualizing the rankings: a profiler emphasizing the progression of high and low ranks across many pre-selected attributes, a dynamic comparer for custom comparisons and a numerical rankings correlator. These allow one to directly compare very different attributes of a fold (e.g. expression level, genome occurrence and maximum motion) in the uniform numerical format of ranks. This uniform framework, in turn, highlights the way that the frequency of many of the attributes falls off with approximate power-law behavior (i.e. according to V–b, for attribute value V and constant exponent b), with a few folds having large values and most having small values.