838 results for text and data mining
Abstract:
Background: The mycelium-to-yeast transition in the human host is essential for the pathogenicity of the fungus Paracoccidioides brasiliensis, and both cell types are therefore critical to the establishment of paracoccidioidomycosis (PCM), a systemic mycosis endemic to Latin America. The infected population is about 10 million individuals, 2% of whom will eventually develop the disease. Previously, transcriptome analysis of mycelium and yeast cells resulted in the assembly of 6,022 sequence groups. Gene expression analysis, using both in silico EST subtraction and cDNA microarray, revealed genes that were differentially expressed in yeast or mycelium, and we discussed those involved in sugar metabolism. To advance our understanding of the molecular mechanisms of the dimorphic transition, we performed an extended analysis of gene expression profiles using the methods mentioned above. Results: In this work, continuous data mining revealed 66 new differentially expressed sequences that were categorised with MIPS (Munich Information Center for Protein Sequences) according to the cellular process in which they are presumably involved. Two well-represented classes were chosen for further analysis: (i) control of cell organisation (cell wall, membrane and cytoskeleton), whose representatives were hex (encoding a hexagonal peroxisome protein) and bgl (encoding a 1,3-β-glucosidase) in mycelium cells, and ags (an α-1,3-glucan synthase), cda (a chitin deacetylase) and vrp (a verprolin) in yeast cells; (ii) ion metabolism and transport: two genes putatively implicated in ion transport were confirmed to be highly expressed in mycelium cells, namely isc and ktp, respectively an iron-sulphur cluster-like protein and a cation transporter, along with a putative P-type cation pump (pct) in yeast. In addition, several enzymes from the cysteine de novo biosynthesis pathway were shown to be upregulated in the yeast form, including ATP sulphurylase, APS kinase and PAPS reductase.
Conclusion: Taken together, these data show that several genes involved in cell organisation and ion metabolism/transport are expressed differentially along the dimorphic transition. Hyperexpression in yeast of the enzymes of sulphur metabolism reinforces the idea that this metabolic pathway could be important for this process. Functional analysis of such genes may lead to a better understanding of the infective process, thus providing new targets and strategies to control PCM.
Abstract:
Given a large image set in which very few images have labels, how can labels be guessed for the remaining majority? How can images that need brand new labels, different from the predefined ones, be spotted? How can these data be summarized to route the user’s attention to what really matters? Here we answer all these questions. Specifically, we propose QuMinS, a fast, scalable solution to two problems: (i) low-labor labeling (LLL): given an image set in which very few images have labels, find the most appropriate labels for the rest; and (ii) mining and attention routing: in the same setting, find clusters, the top-N_O outlier images, and the N_R images that best represent the data. Experiments on satellite images spanning up to 2.25 GB show that, in contrast to state-of-the-art labeling techniques, QuMinS scales linearly with the data size and is up to 40 times faster than top competitors (GCap) while achieving better or equal accuracy; it spots images that potentially require unpredicted labels; and it works even with tiny initial label sets, i.e., nearly five examples. We also report a case study of our method’s practical usage to show that QuMinS is a viable tool for automatic coffee crop detection from remote sensing images.
Abstract:
The problem of prediction, that is, the search for predictive patterns within data, has been studied extensively. Many robust and efficient methodologies have been developed, all of which rely on the analysis of structured numerical information. Textual information, on the other hand, is highly unstructured. An immediate conclusion would therefore be that predictive analysis of textual data requires methods completely different from the well-known data mining techniques. A prediction problem can instead be solved with the same methods: textual data and documents can be transformed into numerical values, for example by considering the absence or presence of terms, which makes it possible to use the techniques already developed efficiently. Text mining enables the combination of concepts from extremely heterogeneous fields of application. With the immense quantity of textual data available, for instance on the World Wide Web, and its continuous growth due to the pervasive use of smartphones and computers, the fields of application of textual analysis become innumerable. The advent and spread of social networks and of microblogging enable people to share opinions and moods, creating a textual corpus of incalculable size that is updated daily. The new techniques of Sentiment Analysis, or Opinion Mining, analyse the emotional state or the type of opinion expressed in a text document. Through these disciplines it is possible, for example, to extract indicators of the mood of an individual or of a group of individuals, creating a representation of the social emotional state. Can the trend of the social emotional state macroscopically influence the evolution of global events?
Studies in Behavioural Economics and Finance assert a link between emotional state, decision-making ability and economic indicators. Thanks to the available techniques and to the mass of continuously updated textual data on the mood of millions of individuals, it becomes possible to analyse such correlations. In this study, a system is built for predicting changes in stock market indices from textual data extracted from the microblogging platform Twitter in the form of public tweets; the system includes techniques for improving the prediction that are based on the study of text similarity and that categorise each text's actual contribution to the prediction.
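The transformation of documents into numerical values by the presence or absence of terms, mentioned in the abstract above, can be illustrated with a minimal sketch. It assumes plain whitespace tokenization and lowercase matching; the function names and the two sample documents are hypothetical and are not taken from the study.

```python
def build_vocabulary(documents):
    """Return a sorted list of the distinct lowercase terms in all documents."""
    terms = set()
    for doc in documents:
        terms.update(doc.lower().split())
    return sorted(terms)


def to_binary_vector(document, vocabulary):
    """Encode one document as 1/0 flags: 1 if the term occurs, 0 otherwise."""
    present = set(document.lower().split())
    return [1 if term in present else 0 for term in vocabulary]


# Hypothetical sample documents, for illustration only.
docs = ["markets rise today", "markets fall"]
vocab = build_vocabulary(docs)
vectors = [to_binary_vector(d, vocab) for d in docs]
```

Each row of `vectors` is a fixed-length numeric record, so it can be passed to any standard classifier or regressor that expects structured numerical input.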
Abstract:
People often use tools to search for information. To improve the quality of an information search, it is important to understand how internal information, which is stored in the user’s mind, and external information, which is represented by the interface of tools, interact with each other. How information is distributed between internal and external representations significantly affects information search performance. However, few studies have examined the relationship between types of interface and types of search task in the context of information search. For a distributed information search task, how data are distributed, represented, and formatted significantly affects the user’s search performance in terms of response time and accuracy. Guided by UFuRT (User, Function, Representation, Task), a human-centered process, I propose a search model and a task taxonomy. The model defines its relationship with other existing information models. The taxonomy clarifies the legitimate operations for each type of search task on relational data. Based on the model and taxonomy, I have also developed interface prototypes for search tasks on relational data. These prototypes were used for experiments. The experiments described in this study are of a within-subject design with a sample of 24 participants recruited from the graduate schools located in the Texas Medical Center. Participants performed one-dimensional nominal search tasks over nominal, ordinal, and ratio displays, and performed one-dimensional nominal, ordinal, interval, and ratio search tasks over table and graph displays. Participants also performed the same task and display combinations for two-dimensional searches. Distributed cognition theory was adopted as a theoretical framework for analyzing and predicting the search performance on relational data. The results show that the representation dimensions and data scales, as well as the search task types, are the main factors in determining search efficiency and effectiveness.
In particular, the more external representations are used, the better the search task performance, and the results suggest that the ideal search performance occurs when the question type and the corresponding data scale representation match. The implications of the study lie in contributing to the effective design of search interfaces for relational data, especially laboratory results, which are often used in healthcare activities.
Abstract:
Background: The US has higher rates of teen births and sexually transmitted infections (STI) than other developed countries. Texas youth are disproportionately impacted. Purpose: To review local, state, and national data on teens’ engagement in sexual risk behaviors to inform policy and practice related to teen sexual health. Methods: 2009 middle school and high school Youth Risk Behavior Survey (YRBS) data, and data from All About Youth, a middle school study conducted in a large urban school district in Texas, were analyzed to assess the prevalence of sexual initiation, including the initiation of non-coital sex, and the prevalence of sexual risk behaviors among Texas and US youth. Results: A substantial proportion of middle and high school students are having sex. Sexual initiation begins as early as 6th grade and increases steadily through 12th grade with almost two-thirds of high school seniors being sexually experienced. Many teens are not protecting themselves from unintended pregnancy or STIs – nationally, 80% and 39% of high school students did not use birth control pills or a condom respectively the last time they had sex. Many middle and high school students are engaging in oral and anal sex, two behaviors which increase the risk of contracting an STI and HIV. In Texas, an estimated 689,512 out of 1,327,815 public high school students are sexually experienced – over half (52%) of the total high school population. Texas students surpass their US peers in several sexual risk behaviors including number of lifetime sexual partners, being currently sexually active, and not using effective methods of birth control or dual protection when having sex. They are also less likely to receive HIV/AIDS education in school. 
Conclusion: Changes in policy and practice, including implementation of evidence-based sex education programs in middle and high schools and increased access to integrated, teen-friendly sexual and reproductive health services, are urgently needed at the state and national levels to address these issues effectively.
Abstract:
Intensive family preservation services (IFPS), designed to stabilize at-risk families and avert out-of-home care, have been the focus of many randomized, experimental studies. Employing a retrospective “clinical data-mining” (CDM) methodology (Epstein, 2001), this study makes use of available information extracted from client records in one IFPS agency over the course of two years. The primary goal of this descriptive and associational study was to gain a clearer understanding of IFPS service delivery and effectiveness. Interventions provided to families are delineated and assessed for their impact on improved family functioning, their impact on the reduction of family violence, as well as placement prevention. Findings confirm the use of a wide range of services consistent with IFPS program theory. Because the study employs a quasi-experimental, retrospective use of available information, clinical outcomes described cannot be causally attributed to interventions employed as with randomized controlled trials. With regard to service outcomes, findings suggest that family education, empowerment services and advocacy are most influential in placement prevention and in ameliorating unmanageable behaviors in children as well as the incidence of family violence.
Abstract:
This paper presents the results of a Secchi depth data mining study for the North Sea - Baltic Sea region. 40,829 measurements of Secchi depth were compiled from the area as a result of this study. 4.3% of the observations were found in the international data centers [ICES Oceanographic Data Center in Denmark and the World Ocean Data Center A (WDC-A) in the USA], while 95.7% of the data was provided by individuals and ocean research institutions from the surrounding North Sea and Baltic Sea countries. Inquiries made at the World Ocean Data Center B (WDC-B) in Russia suggested that there could be significant additional holdings in that archive but, unfortunately, no data could be made available. The earliest Secchi depth measurement retrieved in this study dates back to 1902 for the Baltic Sea, while the bulk of the measurements were gathered after 1970. The spatial distribution of Secchi depth measurements in the North Sea is very uneven with surprisingly large sampling gaps in the Western North Sea. Quarterly and annual Secchi depth maps with a 0.5° x 0.5° spatial resolution are provided for the transition area between the North Sea and the Baltic Sea (4°E-16°E, 53°N-60°N).
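To make the 0.5° x 0.5° map resolution concrete, the sketch below bins point observations into grid cells over the stated transition area (4°E-16°E, 53°N-60°N) and averages the Secchi depth per cell. The plain per-cell mean, the function names, and the sample observations are assumptions for illustration; the abstract does not state how the maps were actually computed.

```python
import math
from collections import defaultdict

CELL = 0.5                      # cell size in degrees, as stated in the abstract
LON_MIN, LON_MAX = 4.0, 16.0    # transition area, degrees east
LAT_MIN, LAT_MAX = 53.0, 60.0   # transition area, degrees north


def cell_index(lon, lat):
    """Return (column, row) of the 0.5-degree cell, or None if outside the area."""
    if not (LON_MIN <= lon < LON_MAX and LAT_MIN <= lat < LAT_MAX):
        return None
    return (math.floor((lon - LON_MIN) / CELL),
            math.floor((lat - LAT_MIN) / CELL))


def mean_depth_per_cell(observations):
    """Average depth per cell (assumed aggregation: plain arithmetic mean)."""
    sums = defaultdict(lambda: [0.0, 0])
    for lon, lat, depth in observations:
        idx = cell_index(lon, lat)
        if idx is None:
            continue  # skip observations outside the mapped area
        sums[idx][0] += depth
        sums[idx][1] += 1
    return {idx: total / count for idx, (total, count) in sums.items()}


# Hypothetical observations: (longitude, latitude, Secchi depth in metres).
obs = [(4.1, 53.2, 6.0), (4.4, 53.4, 8.0), (10.6, 57.0, 5.0), (20.0, 55.0, 9.0)]
grid = mean_depth_per_cell(obs)
```

With these bounds the area divides into 24 x 14 cells; cells with no observations are simply absent from the result, which mirrors the sampling gaps the abstract mentions.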
Abstract:
The environmental, cultural and socio-economic causes and consequences of farmland abandonment are issues of increasing concern for researchers and policy makers. In previous studies, we proposed a new methodology for selecting the driving factors in farmland abandonment processes. Using Data Mining and GIS, it is possible to select the variables that are most significantly related to abandonment. The aim of this study is to investigate the application of this methodology to finding relationships between relief and farmland abandonment in a Mediterranean region (SE Spain). We have taken into account up to 28 different variables in a single analysis. Some of them are commonly considered in land use change studies (slope, altitude, TWI, etc.), while others are novel (sky view factor, terrain view factor, etc.). The variable selection process provides results in line with previous knowledge of the study area, describing some processes that are region specific (e.g. abandonment versus intensification of agricultural activities). The European INSPIRE Directive (2007/2/EC) establishes that digital elevation models for land surfaces should be available in all member countries. This means that the research described in this work can be extrapolated to any European country to determine whether these variables (slope, altitude, etc.) are important in the process of abandonment.
Abstract:
Thesis (Ph.D.)--University of Washington, 2016-06
Abstract:
Today, the data available to tackle many scientific challenges are vast in quantity and diverse in nature. The exploration of heterogeneous information spaces requires suitable mining algorithms as well as effective visual interfaces. miniDVMS v1.8 provides a flexible visual data mining framework which combines advanced projection algorithms developed in the machine learning domain and visual techniques developed in the information visualisation domain. The advantage of this interface is that the user is directly involved in the data mining process. Principled projection methods, such as generative topographic mapping (GTM) and hierarchical GTM (HGTM), are integrated with powerful visual techniques, such as magnification factors, directional curvatures, parallel coordinates, and user interaction facilities, to provide this integrated visual data mining framework. The software also supports conventional visualisation techniques such as principal component analysis (PCA), Neuroscale, and PhiVis. This user manual gives an overview of the purpose of the software tool, highlights some of the issues to take care of when creating a new model, and provides information about how to install and use the tool. The user manual does not require readers to be familiar with the algorithms the tool implements. Basic computing skills are enough to operate the software.
Abstract:
When applying multivariate analysis techniques in information systems and social science disciplines, such as management information systems (MIS) and marketing, the assumption that the empirical data originate from a single homogeneous population is often unrealistic. When applying a causal modeling approach, such as partial least squares (PLS) path modeling, segmentation is a key issue in coping with the problem of heterogeneity in estimated cause-and-effect relationships. This chapter presents a new PLS path modeling approach which classifies units on the basis of the heterogeneity of the estimates in the inner model. If unobserved heterogeneity significantly affects the estimated path model relationships at the aggregate data level, the methodology allows homogeneous groups of observations to be created that exhibit distinctive path model estimates. The approach thus provides differentiated analytical outcomes that permit more precise interpretations of each segment formed. An application to a large data set from the American customer satisfaction index (ACSI) substantiates the methodology’s effectiveness in evaluating PLS path modeling results.
Abstract:
We address the important bioinformatics problem of predicting protein function from a protein's primary sequence. We consider the functional classification of G-Protein-Coupled Receptors (GPCRs), whose functions are specified in a class hierarchy. We tackle this task using a novel top-down hierarchical classification system where, for each node in the class hierarchy, the predictor attributes to be used in that node and the classifier to be applied to the selected attributes are chosen in a data-driven manner. Compared with a previous hierarchical classification system selecting classifiers only, our new system significantly reduced processing time without significantly sacrificing predictive accuracy.
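The idea of choosing, at each node of the class hierarchy, what is used to classify in a data-driven manner and then predicting top-down can be sketched as follows. This is an illustration under strong assumptions, not the authors' system: the candidates are hand-written toy rules, each reading one hypothetical attribute (`length` or `charge`), so picking a candidate stands in for picking both the attributes and the classifier; selection is by plain validation accuracy; and the class names and data are invented.

```python
def accuracy(classifier, examples):
    """Fraction of (features, label) pairs the classifier predicts correctly."""
    correct = sum(1 for features, label in examples if classifier(features) == label)
    return correct / len(examples)


def select_classifier(candidates, validation):
    """Pick the candidate with the highest validation accuracy (first wins ties)."""
    return max(candidates, key=lambda c: accuracy(c, validation))


def predict_top_down(node_classifiers, features, root="root"):
    """Apply the chosen classifier at each node until a leaf class is reached."""
    path = []
    node = root
    while node in node_classifiers:
        node = node_classifiers[node](features)
        path.append(node)
    return path


# Hypothetical candidate rules; each one uses a different attribute.
def root_by_length(f):
    return "A" if f["length"] > 350 else "B"

def root_by_charge(f):
    return "A" if f["charge"] > 0 else "B"

def a_by_length(f):
    return "A1" if f["length"] > 410 else "A2"

def a_by_charge(f):
    return "A1" if f["charge"] > 0 else "A2"


# Hypothetical validation data per node of a two-level hierarchy.
root_val = [({"length": 400, "charge": 1}, "A"),
            ({"length": 300, "charge": -1}, "B"),
            ({"length": 420, "charge": -1}, "A")]
a_val = [({"length": 400, "charge": 1}, "A1"),
         ({"length": 420, "charge": -1}, "A2")]

chosen = {
    "root": select_classifier([root_by_length, root_by_charge], root_val),
    "A": select_classifier([a_by_length, a_by_charge], a_val),
}
path = predict_top_down(chosen, {"length": 410, "charge": -1})
```

Here the root ends up using the length-based rule and node A the charge-based rule, which shows how different nodes can settle on different attributes and classifiers.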