18 results for Data Mining, Yield Improvement, Self Organising Map, Clustering Quality
in Biblioteca Digital da Produção Intelectual da Universidade de São Paulo (BDPI/USP)
Abstract:
Melanoma is a highly aggressive and therapy-resistant tumor for which the identification of specific markers and therapeutic targets is highly desirable. We describe here the development and use of a bioinformatic pipeline tool, made publicly available under the name of EST2TSE, for the in silico detection of candidate genes with tissue-specific expression. Using this tool we mined the human EST (Expressed Sequence Tag) database for sequences derived exclusively from melanoma. We found 29 UniGene clusters of multiple ESTs with the potential to predict novel genes with melanoma-specific expression. Using a diverse panel of human tissues and cell lines, we validated the expression of a subset of three previously uncharacterized genes (clusters Hs.295012, Hs.518391, and Hs.559350) to be highly restricted to melanoma/melanocytes and named them RMEL1, 2 and 3, respectively. Expression analysis in nevi, primary melanomas, and metastatic melanomas revealed RMEL1 as a novel melanocytic lineage-specific gene up-regulated during melanoma development. RMEL2 expression was restricted to melanoma tissues and glioblastoma. RMEL3 showed strong up-regulation in nevi and was lost in metastatic tumors. Interestingly, we found correlations of RMEL2 and RMEL3 expression with improved patient outcome, suggesting tumor and/or metastasis suppressor functions for these genes. The three genes are composed of multiple exons and map to 2q12.2, 1q25.3, and 5q11.2, respectively. They are well conserved throughout primates, but not in other genomes, and were predicted to have no coding potential, although primate-conserved and human-specific short ORFs could be found. Hairpin RNA secondary structures were also predicted. In conclusion, this work offers new melanoma-specific genes for future validation as prognostic markers or as targets for the development of therapeutic strategies to treat melanoma.
Abstract:
This study aimed to use association rules to identify the market forces governing the sale of bulls genetically evaluated by the Nelore Brasil program. These rules reveal implicit patterns in the transactions of large databases, indicating the causes and effects that determine the supply and sale of bulls. The analysis considered 19,736 records of bulls sold, 17 farms, and 15 attributes covering the expected progeny differences of the sires and the place and season of sale. A user-driven system with a graphical interface was used, allowing interactive generation and selection of association rules. Pareto analysis was applied to the three objective measures (support, confidence, and lift) that accompany each association rule in order to validate the rules. In total, 2,667 association rules were generated, of which 164 were considered useful by the user and 107 were valid for lift ≥ 1.0505. The farms participating in the Nelore Brasil program specialize in the bulls they offer, according to traits for maternal ability, weight gain, fertility, sexual precocity, longevity, and carcass yield and finishing. The genetic profiles of the bulls differ between the standard and polled varieties. Some Brazilian regions are market niches for bulls without pedigree registration. The analysis of market evolution suggests that total genetic merit, the official index of the Nelore Brasil program, has become an important index for the sale of bulls. Using association rules, it was possible to uncover market forces and identify combinations of genetic, geographic, and temporal attributes that determine the sale of bulls in the Nelore Brasil program.
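As an illustration of the three objective measures named in this abstract (support, confidence, and lift), the following minimal sketch computes them for a single rule over a toy transaction list. The attribute labels (`farm=A`, `trait=weight_gain`, and so on) and the function name `rule_metrics` are hypothetical and are not taken from the study's data or software.

```python
from typing import FrozenSet, List, Tuple


def rule_metrics(
    transactions: List[FrozenSet[str]],
    antecedent: FrozenSet[str],
    consequent: FrozenSet[str],
) -> Tuple[float, float, float]:
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    # Count transactions containing the antecedent, the consequent, and both.
    n_ante = sum(1 for t in transactions if antecedent <= t)
    n_cons = sum(1 for t in transactions if consequent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)

    support = n_both / n
    confidence = n_both / n_ante if n_ante else 0.0
    # Lift compares the rule's confidence with the consequent's base rate;
    # values above 1 indicate a positive association.
    lift = confidence / (n_cons / n) if n_cons else 0.0
    return support, confidence, lift


# Hypothetical sales records (made-up attribute labels).
sales = [
    frozenset({"farm=A", "trait=weight_gain"}),
    frozenset({"farm=A", "trait=weight_gain"}),
    frozenset({"farm=A", "trait=fertility"}),
    frozenset({"farm=B", "trait=fertility"}),
]
support, confidence, lift = rule_metrics(
    sales, frozenset({"farm=A"}), frozenset({"trait=weight_gain"})
)
print(f"support={support:.3f} confidence={confidence:.3f} lift={lift:.3f}")
```

A lift threshold such as the one quoted in the abstract would then be applied as a simple filter over the rules scored this way.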
Abstract:
This work proposes a method based on both preprocessing and data mining with the objective of identifying harmonic current sources in residential consumers. This methodology can also be applied to identify linear and nonlinear loads. It should be emphasized that the entire database was obtained through laboratory tests, i.e., real data were acquired from residential loads. The residential system created in the laboratory was fed by a configurable power source, with the loads and the power quality analyzers placed at its output (all measurements were stored on a microcomputer). The data were then submitted to pre-processing based on attribute selection techniques in order to reduce the complexity of identifying the loads. A new database was generated that retained only the selected attributes, and Artificial Neural Networks were trained to identify the loads. In order to validate the proposed methodology, the loads were fed both under ideal conditions (without harmonics) and with harmonic voltages within pre-established limits. These limits are in accordance with IEEE Std. 519-1992 and PRODIST (the electricity distribution procedures employed by Brazilian utilities). The results obtained seek to validate the proposed methodology and provide a method that can serve as an alternative to conventional methods.
Abstract:
Background: High-throughput molecular approaches for gene expression profiling, such as Serial Analysis of Gene Expression (SAGE), Massively Parallel Signature Sequencing (MPSS) or Sequencing-by-Synthesis (SBS), represent powerful techniques that provide global transcription profiles of different cell types through sequencing of short fragments of transcripts, termed sequence tags. These techniques have improved our understanding of the relationships between these expression profiles and cellular phenotypes. Despite this, more reliable datasets are still necessary. In this work, we present a web-based tool named S3T: Score System for Sequence Tags, to index sequenced tags in accordance with their reliability. This is done through a series of evaluations based on a defined rule set. S3T allows the identification/selection of tags considered more reliable for further gene expression analysis. Results: This methodology was applied to a public SAGE dataset. In order to compare data before and after filtering, a hierarchical clustering analysis was performed on samples from the same type of tissue, in distinct biological conditions, using these two datasets. Our results provide evidence suggesting that it is possible to find more congruous clusters after using the S3T scoring system. Conclusion: These results substantiate the proposed application to generate more reliable data. This is a significant contribution to the determination of global gene expression profiles. The library analysis with S3T is freely available at http://gdm.fmrp.usp.br/s3t/. S3T source code and datasets can also be downloaded from the aforementioned website.
Abstract:
Background: The inherent complexity of statistical methods and clinical phenomena compels researchers with diverse domains of expertise to work in interdisciplinary teams, where none of them has complete knowledge of their counterpart's field. As a result, knowledge exchange may often be characterized by miscommunication leading to misinterpretation, ultimately resulting in errors in research and even clinical practice. Although communication has a central role in interdisciplinary collaboration and miscommunication can have a negative impact on research processes, to the best of our knowledge no study has yet explored how data analysis specialists and clinical researchers communicate over time. Methods/Principal Findings: We conducted a qualitative analysis of encounters between clinical researchers and data analysis specialists (epidemiologist, clinical epidemiologist, and data mining specialist). These encounters were recorded and systematically analyzed using a grounded theory methodology for extraction of emerging themes, followed by data triangulation and analysis of negative cases for validation. A policy analysis was then performed using a system dynamics methodology, looking for potential interventions to improve this process. Four major emerging themes were found. Definitions using lay language were frequently employed as a way to bridge the language gap between the specialties. Thought experiments presented a series of "what if" situations that helped clarify how the method or information from the other field would behave if exposed to alternative situations, ultimately aiding in explaining their main objective. Metaphors and analogies were used to translate concepts across fields, from the unfamiliar to the familiar. Prolepsis was used to anticipate study outcomes, thus helping specialists understand the current context based on an understanding of their final goal.
Conclusion/Significance: The communication between clinical researchers and data analysis specialists presents multiple challenges that can lead to errors.
Abstract:
The productivity of the disassembly methods commonly available today seldom makes disassembly the preferred end-of-life solution for massive take-back product streams. Systematic reuse of parts or components, or recycling of pure material fractions, is often not achievable in an economically sustainable way. In this paper a case-based review of current disassembly practices is used to analyse the factors influencing disassembly feasibility. Data mining techniques were used to identify major factors influencing the profitability of disassembly operations. Case characteristics such as involvement of the product manufacturer in the end-of-life treatment and continuous ownership are some of the important dimensions. Economic models demonstrate that the efficiency of disassembly operations should be increased by an order of magnitude to assure the competitiveness of ecologically preferred, disassembly-oriented end-of-life scenarios for large waste of electric and electronic equipment (WEEE) streams. Technological means available to increase the productivity of the disassembly operations are summarized. Automated disassembly techniques can contribute to the robustness of the process, but cannot close the efficiency gap unless combined with appropriate product design measures. Innovative, reversible joints, collectively activated by external trigger signals, form a promising approach to low-cost mass disassembly in this context. A short overview of the state of the art in the development of such self-disassembling joints is included. (c) 2008 CIRP.
Abstract:
One of the top ten most influential data mining algorithms, k-means, is known for being simple and scalable. However, it is sensitive to initialization of prototypes and requires that the number of clusters be specified in advance. This paper shows that evolutionary techniques conceived to guide the application of k-means can be more computationally efficient than systematic (i.e., repetitive) approaches that try to get around the above-mentioned drawbacks by repeatedly running the algorithm from different configurations for the number of clusters and initial positions of prototypes. To do so, a modified version of a (k-means based) fast evolutionary algorithm for clustering is employed. Theoretical complexity analyses for the systematic and evolutionary algorithms under interest are provided. Computational experiments and statistical analyses of the results are presented for artificial and text mining data sets. (C) 2010 Elsevier B.V. All rights reserved.
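The "systematic" baseline this abstract refers to, namely rerunning k-means for several cluster counts and several random initializations, can be sketched roughly as follows. This is an illustrative pure-Python version under simple assumptions (Lloyd-style iterations, initial prototypes sampled from the data, sum of squared errors as the per-k score); the function names are hypothetical, and the sketch does not reproduce the evolutionary algorithm studied in the paper.

```python
import random
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points: Sequence[Point], k: int, rng: random.Random,
           max_iter: int = 100) -> Tuple[List[int], float]:
    """One k-means run from a random initialization; returns (labels, SSE)."""
    centers = rng.sample(list(points), k)  # prototypes drawn from the data
    labels: List[int] = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest prototype.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the run has converged
        labels = new_labels
        # Update step: move each prototype to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous prototype
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    sse = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, sse


def systematic_search(points: Sequence[Point], k_values: Sequence[int],
                      restarts: int, seed: int = 0) -> Dict[int, float]:
    """For every k, keep the lowest SSE found over `restarts` random runs."""
    rng = random.Random(seed)
    return {k: min(kmeans(points, k, rng)[1] for _ in range(restarts))
            for k in k_values}


toy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
print(systematic_search(toy, k_values=[1, 2, 3], restarts=5))
```

Because SSE always tends to decrease as k grows, comparing partitions with different numbers of clusters would in practice require a relative validity index rather than SSE alone; the total cost of this loop is what an evolutionary search aims to reduce.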
Abstract:
Clustering quality or validation indices allow the evaluation of the quality of a clustering in order to support the selection of a specific partition or clustering structure in its natural unsupervised environment, where the real solution is unknown or not available. In this paper, we investigate the use of quality indices, mostly based on the concepts of cluster compactness and separation, for the evaluation of clustering results (partitions in particular). This work intends to offer a general perspective regarding the appropriate use of quality indices for the purpose of clustering evaluation. After presenting some commonly used indices, as well as indices recently proposed in the literature, key issues regarding the practical use of quality indices are addressed. A general methodological approach is presented which considers the identification of appropriate index thresholds. This general approach is compared with the simple use of quality indices for evaluating a clustering solution.
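To make the compactness/separation idea concrete, here is a small sketch of one widely used index of this kind, the mean silhouette width. It is chosen here only as an example and is not necessarily one of the indices examined in the paper; the function name and the toy points are assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]


def mean_silhouette(points: Sequence[Point], labels: Sequence[int]) -> float:
    """Mean silhouette width of a partition (higher means better separated)."""
    clusters: Dict[int, List[Point]] = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # by convention a singleton contributes s = 0
        # Compactness: mean distance to the other members of the same cluster
        # (the zero distance from p to itself does not affect the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # Separation: mean distance to the nearest other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(points)


toy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
print(mean_silhouette(toy, [0, 0, 0, 1, 1, 1]))  # compact, well separated
print(mean_silhouette(toy, [0, 1, 0, 1, 0, 1]))  # clusters mixed together
```

Comparing the two printed values shows how such an index ranks candidate partitions; deciding which absolute value counts as "good" is the threshold question the abstract raises.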
Abstract:
Most multidimensional projection techniques rely on distance (dissimilarity) information between data instances to embed high-dimensional data into a visual space. When data are endowed with Cartesian coordinates, an extra computational effort is necessary to compute the needed distances, making multidimensional projection prohibitive in applications dealing with interactivity and massive data. The novel multidimensional projection technique proposed in this work, called Part-Linear Multidimensional Projection (PLMP), has been tailored to handle multivariate data represented in Cartesian high-dimensional spaces, requiring only distance information between pairs of representative samples. This characteristic renders PLMP faster than previous methods when processing large data sets while still being competitive in terms of precision. Moreover, knowing the range of variation for data instances in the high-dimensional space, we can make PLMP a truly streaming data projection technique, a trait absent in previous methods.
Abstract:
Some sesquiterpene lactones (SLs) are the active compounds of a great number of traditional medicinal plants from the Asteraceae family and possess considerable cytotoxic activity. Several in vitro studies have shown inhibitory activity against cells derived from human carcinoma of the nasopharynx (KB). Chemical studies showed that the cytotoxic activity is due to the reaction of α,β-unsaturated carbonyl structures of the SLs with thiols, such as cysteine. These studies support the view that SLs inhibit tumour growth by selective alkylation of growth-regulatory biological macromolecules, such as key enzymes that control cell division, thereby inhibiting a variety of cellular functions and directing the cells into apoptosis. In this study we investigated a set of 55 different sesquiterpene lactones, represented by 5 skeletons (22 germacranolides, 6 elemanolides, 2 eudesmanolides, 16 guaianolides and nor-derivatives, and 9 pseudoguaianolides), with respect to their cytotoxic properties. The experimental results and 3D molecular descriptors were submitted to a Kohonen self-organizing map (SOM) to classify (training set) and predict (test set) the cytotoxic activity. From the obtained results, it was concluded that only the geometrical descriptors showed satisfactory values. The Kohonen map obtained for the training set using 25 geometrical descriptors shows a very significant match, mainly among the inactive compounds (~84%). Analyzing both groups, the percentage seen is high (83%). The test set shows the highest match, where 89% of the substances had their cytotoxic activity correctly predicted. From these results, important properties for the inhibition potency are discussed for the whole dataset and for subsets of the different structural skeletons. (C) 2008 Elsevier Masson SAS. All rights reserved.
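For readers unfamiliar with Kohonen maps, the following minimal sketch shows the basic training loop of a self-organizing map: find the best-matching unit for a sample, then pull that unit and its grid neighbours toward the sample with a decaying learning rate and neighbourhood radius. The grid size, decay schedules, and two-dimensional toy vectors are all assumptions made for illustration; the study itself used 25 geometrical descriptors per compound and its exact SOM configuration is not given in the abstract.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple

Node = Tuple[int, int]


def best_matching_unit(weights: Dict[Node, List[float]],
                       x: Sequence[float]) -> Node:
    """Grid node whose weight vector is closest to x."""
    return min(weights, key=lambda node: math.dist(weights[node], x))


def train_som(data: List[List[float]], rows: int, cols: int,
              epochs: int = 50, lr0: float = 0.5,
              seed: int = 0) -> Dict[Node, List[float]]:
    """Train a rows x cols SOM with linearly decaying rate and radius."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    sigma0 = max(rows, cols) / 2
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)                    # learning rate decays to 0
        sigma = max(sigma0 * (1 - frac), 0.5)    # radius shrinks, floor 0.5
        for x in rng.sample(data, len(data)):    # shuffled pass over the data
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2 * sigma ** 2))  # Gaussian kernel
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
    return weights


# Two made-up groups of descriptor vectors standing in for two classes.
samples = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
           [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]]
som = train_som(samples, rows=3, cols=3)
print(best_matching_unit(som, [0.0, 0.0]), best_matching_unit(som, [1.0, 1.0]))
```

Classification with such a map typically labels each node by the class of the training samples that land on it and predicts a new sample's class from its best-matching unit.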
Abstract:
The synthesis and self-assembly of tetragonal phase-containing L1₀-Fe₅₅Pt₄₅ nanorods with high coercive field is described. The experimental procedure resulted in a tetragonal/cubic phase ratio close to 1:1 for the as-synthesized nanoparticles. Using different surfactant/solvent proportions in the process allowed control of particle morphology from nanospheres to nanowires. Monodisperse nanorods with lengths of 60 ± 5 nm and diameters of 2-3 nm were self-assembled in a perpendicularly oriented array onto a substrate surface using hexadecylamine as organic spacer. Magnetic alignment and properties assigned, respectively, to the shape anisotropy and the tetragonal phase suggest that the self-assembled materials are a strong candidate to solve the problem of random magnetic alignment observed in FePt nanospheres, leading to applications in ultrahigh magnetic recording (UHMR) systems capable of achieving a performance of the order of terabits/in².
Abstract:
In recent decades there has been an increase in stress at work and in its effects on workers' health. These issues are still little studied in the electric utility sector. This study aims to evaluate factors associated with stress at work and to verify its associations with health status among workers of an electric company in São Paulo State, Brazil. A cross-sectional study was conducted with 474 subjects (87.5% of the eligible workers). Data were collected using self-reported questionnaires. A descriptive analysis, a multiple linear hierarchical regression analysis and a correlation analysis were performed. The majority of participants were male (91.1%) and the mean age was 37.5 yr. The mean stress level score was 2.3 points (scale ranging from 1.0 to 5.0). Hierarchical multiple analyses showed that regular practice of physical activities (p=0.025) and individual monthly income (p=0.002) were inversely associated with stress level; BMI was marginally associated with stress level (p=0.074). The demographic characteristics were not associated with stress. Stress at work was significantly associated with physical and mental health status (p<0.001). To improve the health of electric utility workers, actions are suggested to decrease stress through remuneration and an appropriate practice of physical activity aimed at reducing BMI.
Abstract:
Age-related changes in running kinematics have been reported in the literature using classical inferential statistics. However, this approach has been hampered by the increased number of biomechanical gait variables reported and, subsequently, the lack of differences presented in these studies. Data mining techniques have been applied in recent biomedical studies to solve this problem using a more general approach. In the present work, we re-analyzed lower extremity running kinematic data of 17 young and 17 elderly male runners using the Support Vector Machine (SVM) classification approach. In total, 31 kinematic variables were extracted to train the classification algorithm and test the generalized performance. The results revealed different accuracy rates across the three kernel methods adopted in the classifier, with the linear kernel performing the best. A subsequent forward feature selection algorithm demonstrated that with only six features, the linear kernel SVM achieved a 100% classification performance rate, showing that these features provided powerful combined information to distinguish age groups. The results of the present work demonstrate the potential of this approach to improve knowledge about age-related differences in running gait biomechanics and encourage the use of the SVM in other clinical contexts. (C) 2010 Elsevier Ltd. All rights reserved.
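The forward feature selection step mentioned in this abstract can be sketched as a greedy loop that repeatedly adds whichever feature most improves cross-validated accuracy. To keep the example dependency-free, a leave-one-out nearest-centroid classifier is swapped in for the linear-kernel SVM used in the study, so this shows the selection procedure only, not the paper's classifier. All names and the toy data are hypothetical, and each class is assumed to have at least two samples.

```python
from typing import List, Sequence, Tuple


def loo_accuracy(X: Sequence[Sequence[float]], y: Sequence[int],
                 features: List[int]) -> float:
    """Leave-one-out accuracy of a nearest-centroid classifier on `features`."""
    correct = 0
    for i in range(len(X)):
        # Class centroids are computed without the held-out sample i.
        centroids = {}
        for label in set(y):
            rows = [X[j] for j in range(len(X)) if j != i and y[j] == label]
            centroids[label] = [sum(r[f] for r in rows) / len(rows)
                                for f in features]
        probe = [X[i][f] for f in features]
        pred = min(centroids, key=lambda lab: sum(
            (a - b) ** 2 for a, b in zip(probe, centroids[lab])))
        correct += pred == y[i]
    return correct / len(X)


def forward_selection(X: Sequence[Sequence[float]], y: Sequence[int],
                      max_features: int) -> Tuple[List[int], float]:
    """Greedily add the feature giving the best score; stop when none helps."""
    selected: List[int] = []
    remaining = list(range(len(X[0])))
    best_score = 0.0
    while remaining and len(selected) < max_features:
        score, feat = max((loo_accuracy(X, y, selected + [f]), f)
                          for f in remaining)
        if score <= best_score:
            break  # no candidate improves on the current subset
        selected.append(feat)
        remaining.remove(feat)
        best_score = score
    return selected, best_score


# Made-up data: column 0 is noise, column 1 separates the two groups.
X_toy = [[5.0, 0.1], [1.0, 0.2], [4.0, 0.0],
         [2.0, 1.0], [5.0, 1.1], [1.0, 0.9]]
y_toy = [0, 0, 0, 1, 1, 1]
print(forward_selection(X_toy, y_toy, max_features=2))
```

With an SVM in place of the centroid rule, the same loop would report the smallest feature subset that reaches a target accuracy, which is how a six-feature result like the one in the abstract could be obtained.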
Abstract:
Data on medication orders and medication use were collected from the medical records of 94 elderly residents of two nursing homes in Aracaju (SE). The mean age was 83.2 (SD = 10.1), and most residents were female (63.8%). The prevalence of drug use was 87.2% and the mean number of medicines consumed was 2.7 (SD = 1.8), mainly drugs acting on the cardiovascular and nervous systems. In this study, the elderly population presented polypharmacy (18.1%), inappropriate use of drugs (28.7%) and double therapy (11.7%). The data showed the need for improvement and evaluation of the quality of pharmacotherapy to promote rational drug use in the elderly population.