7 results for Knowledge Discovery Tools
in the Biblioteca Digital da Produção Intelectual da Universidade de São Paulo
Abstract:
Several variants of the widely used Fuzzy C-Means (FCM) algorithm support clustering of data distributed across different sites. These methods have been studied under different names, such as collaborative and parallel fuzzy clustering. In this study, we augment two FCM-based algorithms for clustering distributed data by providing constructive ways of determining their essential parameters (including the number of clusters) and by forming a set of systematically structured guidelines, such as how to select the specific algorithm depending on the nature of the data environment and the assumptions made about the number of clusters. A thorough complexity analysis covering space, time, and communication aspects is reported. A series of detailed numeric experiments illustrates the main ideas discussed in the study.
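The baseline single-site FCM iteration that such distributed variants build on can be sketched as follows. This is a generic illustration of the standard FCM membership and prototype updates, not the collaborative or parallel formulations studied in the paper, and all parameter names are our own:

```python
import numpy as np

def fuzzy_c_means(X, n_clusters, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Plain (single-site) FCM; X has shape (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Random initial membership matrix U, each row summing to 1
    U = rng.random((n, n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        Um = U ** m
        # Prototype update: fuzzily weighted means of the data
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Distances from every point to every prototype
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)  # guard against division by zero
        # Membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))
        inv = d ** (-2.0 / (m - 1))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        if np.linalg.norm(U_new - U) < tol:
            U = U_new
            break
        U = U_new
    return centers, U
```

The fuzzifier `m` and the number of clusters are exactly the kind of essential parameters for which the study derives constructive selection guidelines.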
Abstract:
We review recent visualization techniques aimed at supporting tasks that require the analysis of text documents, from approaches targeted at visually summarizing the relevant content of a single document to those aimed at assisting exploratory investigation of whole collections of documents. Techniques are organized considering their target input material (either single texts or collections of texts) and their focus, which may be on displaying content, emphasizing relevant relationships, highlighting the temporal evolution of a document or collection, or helping users handle results from a query posed to a search engine. We describe the approaches adopted by distinct techniques, briefly review the strategies they employ to obtain meaningful text models, and discuss how they extract the information required to produce representative visualizations, the tasks they intend to support, the interaction issues involved, and their strengths and limitations. Finally, we present a summary of techniques, highlighting their goals and distinguishing characteristics. We also briefly discuss some open problems and research directions in the fields of visual text mining and text analytics.
Abstract:
The ubiquity of time series data across almost all human endeavors has produced great interest in time series data mining in the last decade. While dozens of classification algorithms have been applied to time series, recent empirical evidence strongly suggests that simple nearest neighbor classification is exceptionally difficult to beat. The choice of distance measure used by the nearest neighbor algorithm is important and depends on the invariances required by the domain. For example, motion capture data typically requires invariance to warping, and cardiology data requires invariance to the baseline (the mean value). Similarly, recent work suggests that for time series clustering, the choice of clustering algorithm is much less important than the choice of distance measure used.
In this work we make a somewhat surprising claim: there is an invariance that the community seems to have missed, complexity invariance. Intuitively, the problem is that in many domains the different classes may have different complexities, and pairs of complex objects, even those which subjectively seem very similar to the human eye, tend to be further apart under current distance measures than pairs of simple objects. This introduces errors in nearest neighbor classification, where some complex objects may be incorrectly assigned to a simpler class. Similarly, in clustering this effect can introduce errors by “suggesting” to the clustering algorithm that subjectively similar but complex objects belong in a sparser, larger-diameter cluster than is truly warranted.
We introduce the first complexity-invariant distance measure for time series and show that it generally produces significant improvements in classification and clustering accuracy. We further show that this improvement does not compromise efficiency, since we can lower bound the measure and use a modification of the triangular inequality, thus making use of most existing indexing and data mining algorithms.
We evaluate our ideas with the largest and most comprehensive set of time series mining experiments ever attempted in a single work, and show that complexity-invariant distance measures can produce improvements in classification and clustering in the vast majority of cases.
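A distance measure along these lines can be sketched as follows: the Euclidean distance is scaled by a correction factor built from a simple complexity estimate of each series (the length of the series "stretched" into a line). This is an illustrative sketch of the idea, with an epsilon guard for perfectly flat series added by us:

```python
import numpy as np

def complexity_estimate(t):
    """Complexity of a series: the length of its 'stretched-out' line,
    i.e. the root of the sum of squared consecutive differences."""
    return np.sqrt(np.sum(np.diff(np.asarray(t, dtype=float)) ** 2))

def cid(q, c):
    """Euclidean distance scaled by a complexity correction factor,
    so pairs of simple series are not systematically favored."""
    eps = 1e-12  # our own guard for flat (zero-complexity) series
    ed = np.linalg.norm(np.asarray(q, dtype=float) - np.asarray(c, dtype=float))
    ce_q, ce_c = complexity_estimate(q), complexity_estimate(c)
    cf = (max(ce_q, ce_c) + eps) / (min(ce_q, ce_c) + eps)
    return ed * cf
```

Because the correction factor is always at least 1, the plain Euclidean distance lower-bounds this measure, which is what allows existing indexing schemes to be reused.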
Abstract:
This descriptive, cross-sectional, quantitative study presents an analysis of the knowledge mastectomized women acquired about breast cancer after reading an educational handbook. The sample was composed of 125 women. Data were collected in a specialized cancer facility in three phases: preparatory, operational I, and operational II. Regarding the knowledge acquired, the posttest showed an 11% increase in the number of correct answers compared to the pretest. The most frequently correct answer regarded a question asking the name of the surgery (97.60%), while the question concerning breast reconstruction obtained the lowest number of correct answers (58.40%). Answers to all the questions improved significantly in the posttest, with the exception of the question addressing breast reconstruction (p=0.754). The assessment of knowledge showed positive results after reading, suggesting that cognition is essential to understanding and adhering to guidance; the handbook is thus a favorable resource for the rehabilitation of mastectomized women.
Abstract:
Background: The study and analysis of gene expression measurements is the primary focus of functional genomics. Once expression data are available, biologists are faced with the task of extracting (new) knowledge associated with the underlying biological phenomenon. Most often, to perform this task, biologists execute a number of analysis activities on the available gene expression dataset rather than a single one. The integration of heterogeneous tools and data sources to create an integrated analysis environment is a challenging and error-prone task. Semantic integration enables the assignment of unambiguous meanings to data shared among different applications in an integrated environment, allowing the exchange of data in a semantically consistent and meaningful way. This work aims at developing an ontology-based methodology for the semantic integration of gene expression analysis tools and data sources. The proposed methodology relies on software connectors to support not only access to heterogeneous data sources but also the definition of transformation rules on exchanged data. Results: We have studied the different challenges involved in the integration of computer systems and the role software connectors play in this task. We have also studied a number of gene expression technologies, analysis tools, and related ontologies in order to devise basic integration scenarios and propose a reference ontology for the gene expression domain. We have then defined a number of activities and associated guidelines prescribing how the development of connectors should be carried out. Finally, we have applied the proposed methodology in the construction of three different integration scenarios involving the use of different tools for the analysis of different types of gene expression data.
Conclusions: The proposed methodology facilitates the development of connectors capable of semantically integrating different gene expression analysis tools and data sources. The methodology can be used in the development of connectors supporting both simple and nontrivial processing requirements, thus assuring accurate data exchange and information interpretation from exchanged data.
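The core idea of a connector that maps tool-specific data onto a shared reference ontology and applies transformation rules can be sketched as follows. This is a hypothetical minimal illustration; the field names, ontology terms, and rules are our own examples, not the paper's reference ontology or connector framework:

```python
# A software connector mediates between two tools by (1) mapping
# tool-specific field names onto shared reference-ontology terms and
# (2) applying transformation rules to the exchanged values.

def make_connector(term_mapping, rules):
    """term_mapping: tool field -> reference-ontology term.
    rules: ontology term -> function transforming the exchanged value."""
    def connect(record):
        out = {}
        for field, value in record.items():
            term = term_mapping.get(field, field)   # map to shared term
            rule = rules.get(term, lambda v: v)     # default: pass through
            out[term] = rule(value)
        return out
    return connect

# Illustrative scenario: one tool reports expression as log2 ratios under
# 'expr_log2'; the integrated environment expects linear ratios under
# 'expression_ratio'.
connector = make_connector(
    term_mapping={"expr_log2": "expression_ratio", "gene": "gene_symbol"},
    rules={"expression_ratio": lambda v: 2.0 ** v},
)
```

The transformation rules are where nontrivial processing requirements (unit conversions, normalization, identifier resolution) would live in a real connector.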
Abstract:
The University of São Paulo celebrates the 30th anniversary of its Integrated Library System with an exhibition discussing the problems of retrieval, preservation, and access to knowledge resulting from the exceptional changes ICTs have produced in contemporary society. It opens up discussion of the main function of the ancient institution of the library, reinforces its relevance, and reflects on the technical tools and social practices that make information, the basic raw material, accessible, generating new forms of knowledge. Regarding the future library, it is a call to reflect on how the brilliant minds of the past projected into the future what for us are the achievements of the present. The future has already started and expects each of us to exercise inventiveness and determination to build it in a human and collaborative way.
Abstract:
The discovery and development of a new drug are time-consuming, difficult, and expensive. This complex process has evolved from classical methods into an integration of modern technologies and innovative strategies aimed at the design of new chemical entities to treat a variety of diseases. The development of new drug candidates is often limited by initial compounds lacking reasonable chemical and biological properties for further lead optimization. Huge libraries of compounds are frequently selected for biological screening using a variety of techniques and standard models to assess potency, affinity, and selectivity. In this context, it is very important to study the pharmacokinetic profile of the compounds under investigation. Recent advances have been made in the collection of data and the development of models to assess and predict the pharmacokinetic properties (ADME: absorption, distribution, metabolism, and excretion) of bioactive compounds in the early stages of drug discovery projects. This paper provides a brief perspective on the evolution of in silico ADME tools, addressing challenges, limitations, and opportunities in medicinal chemistry.
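A simple example of the kind of rule-based filter used in early-stage in silico screening is Lipinski's "rule of five", a widely used heuristic for oral bioavailability. The thresholds below are Lipinski's published ones; the function itself is our own illustration, not a tool described in the paper:

```python
def lipinski_violations(mol_weight, logp, h_bond_donors, h_bond_acceptors):
    """Count rule-of-five violations; more than one commonly flags a
    compound as likely to have poor absorption or permeation."""
    violations = 0
    if mol_weight > 500:        # molecular weight over 500 Da
        violations += 1
    if logp > 5:                # octanol-water partition coefficient over 5
        violations += 1
    if h_bond_donors > 5:       # more than 5 hydrogen-bond donors
        violations += 1
    if h_bond_acceptors > 10:   # more than 10 hydrogen-bond acceptors
        violations += 1
    return violations
```

Modern in silico ADME tools go far beyond such rules, using statistical and machine-learned models of absorption, metabolism, and excretion, but the filter illustrates how computed properties gate compounds before expensive biological screening.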