101 results for mining data streams
in the Biblioteca Digital da Produção Intelectual da Universidade de São Paulo (BDPI/USP)
Abstract:
One of the top ten most influential data mining algorithms, k-means, is known for being simple and scalable. However, it is sensitive to the initialization of prototypes and requires that the number of clusters be specified in advance. This paper shows that evolutionary techniques conceived to guide the application of k-means can be more computationally efficient than systematic (i.e., repetitive) approaches that try to get around these drawbacks by repeatedly running the algorithm from different configurations for the number of clusters and initial positions of prototypes. To do so, a modified version of a (k-means based) fast evolutionary algorithm for clustering is employed. Theoretical complexity analyses for the systematic and evolutionary algorithms of interest are provided. Computational experiments and statistical analyses of the results are presented for artificial and text mining data sets. (C) 2010 Elsevier B.V. All rights reserved.
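The systematic (repetitive) baseline mentioned in this abstract can be sketched as follows. This is only an illustration, not the paper's code: the function names, the use of the simplified silhouette as the selection criterion, and the restart count are all assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, rng, n_iter=100):
    """Lloyd's k-means from one random initialization; returns (labels, centroids)."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Keep the old prototype if a cluster ends up empty.
        updated = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dist.argmin(axis=1), centroids

def simplified_silhouette(X, labels, centroids):
    """Mean of (b - a) / max(a, b), with a and b measured to centroids (assumed criterion)."""
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    idx = np.arange(len(X))
    a = dist[idx, labels]              # distance to own centroid
    dist[idx, labels] = np.inf
    b = dist.min(axis=1)               # distance to nearest other centroid
    denom = np.maximum(a, b)
    return float(np.mean(np.where(denom > 0, (b - a) / np.where(denom > 0, denom, 1.0), 0.0)))

def systematic_search(X, k_values, n_restarts=10, seed=0):
    """Run k-means for every k and several restarts; keep the best-scoring partition."""
    rng = np.random.default_rng(seed)
    best_score, best_k, best_labels = -np.inf, None, None
    for k in k_values:
        for _ in range(n_restarts):
            labels, centroids = kmeans(X, k, rng)
            score = simplified_silhouette(X, labels, centroids)
            if score > best_score:
                best_score, best_k, best_labels = score, k, labels
    return best_k, best_labels, best_score
```

The cost of this baseline grows with the number of candidate values of k times the number of restarts, which is the overhead the evolutionary approach in the abstract aims to reduce.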
Abstract:
Melanoma is a highly aggressive and therapy-resistant tumor for which the identification of specific markers and therapeutic targets is highly desirable. We describe here the development and use of a bioinformatic pipeline tool, made publicly available under the name of EST2TSE, for the in silico detection of candidate genes with tissue-specific expression. Using this tool we mined the human EST (Expressed Sequence Tag) database for sequences derived exclusively from melanoma. We found 29 UniGene clusters of multiple ESTs with the potential to predict novel genes with melanoma-specific expression. Using a diverse panel of human tissues and cell lines, we validated the expression of a subset of three previously uncharacterized genes (clusters Hs.295012, Hs.518391, and Hs.559350) to be highly restricted to melanoma/melanocytes and named them RMEL1, 2 and 3, respectively. Expression analysis in nevi, primary melanomas, and metastatic melanomas revealed RMEL1 as a novel melanocytic lineage-specific gene up-regulated during melanoma development. RMEL2 expression was restricted to melanoma tissues and glioblastoma. RMEL3 showed strong up-regulation in nevi and was lost in metastatic tumors. Interestingly, we found correlations of RMEL2 and RMEL3 expression with improved patient outcome, suggesting tumor and/or metastasis suppressor functions for these genes. The three genes are composed of multiple exons and map to 2q12.2, 1q25.3, and 5q11.2, respectively. They are well conserved throughout primates, but not other genomes, and were predicted as having no coding potential, although primate-conserved and human-specific short ORFs could be found. Hairpin RNA secondary structures were also predicted. In conclusion, this work offers new melanoma-specific genes for future validation as prognostic markers or as targets for the development of therapeutic strategies to treat melanoma.
Abstract:
This work proposes a method based on both preprocessing and data mining with the objective of identifying harmonic current sources in residential consumers. In addition, this methodology can also be applied to identify linear and nonlinear loads. It should be emphasized that the entire database was obtained through laboratory tests, i.e., real data were acquired from residential loads. The residential system created in the laboratory was fed by a configurable power source, and the loads and the power quality analyzers were placed at its output (all measurements were stored in a microcomputer). The data were then submitted to preprocessing, which was based on attribute selection techniques in order to minimize the complexity of identifying the loads. A new database was generated retaining only the selected attributes, and Artificial Neural Networks were then trained to identify the loads. In order to validate the proposed methodology, the loads were fed both under ideal conditions (without harmonics) and with harmonic voltages within pre-established limits. These limits are in accordance with IEEE Std. 519-1992 and PRODIST (the energy distribution procedures employed by Brazilian utilities). The results obtained seek to validate the proposed methodology and furnish a method that can serve as an alternative to conventional methods.
Abstract:
Background: The inherent complexity of statistical methods and clinical phenomena compels researchers with diverse domains of expertise to work in interdisciplinary teams, where none of them has complete knowledge of their counterpart's field. As a result, knowledge exchange may often be characterized by miscommunication leading to misinterpretation, ultimately resulting in errors in research and even clinical practice. Although communication has a central role in interdisciplinary collaboration and miscommunication can have a negative impact on research processes, to the best of our knowledge, no study has yet explored how data analysis specialists and clinical researchers communicate over time. Methods/Principal Findings: We conducted qualitative analysis of encounters between clinical researchers and data analysis specialists (epidemiologist, clinical epidemiologist, and data mining specialist). These encounters were recorded and systematically analyzed using a grounded theory methodology for extraction of emerging themes, followed by data triangulation and analysis of negative cases for validation. A policy analysis was then performed using a system dynamics methodology looking for potential interventions to improve this process. Four major emerging themes were found. Definitions using lay language were frequently employed as a way to bridge the language gap between the specialties. Thought experiments presented a series of "what if" situations that helped clarify how the method or information from the other field would behave if exposed to alternative situations, ultimately aiding in explaining their main objective. Metaphors and analogies were used to translate concepts across fields, from the unfamiliar to the familiar. Prolepsis was used to anticipate study outcomes, thus helping specialists understand the current context based on an understanding of their final goal.
Conclusion/Significance: The communication between clinical researchers and data analysis specialists presents multiple challenges that can lead to errors.
Abstract:
Stream discharge-concentration relationships are indicators of terrestrial ecosystem function. Throughout the Amazon and Cerrado regions of Brazil rapid changes in land use and land cover may be altering these hydrochemical relationships. The current analysis focuses on factors controlling the discharge-calcium (Ca) concentration relationship since previous research in these regions has demonstrated both positive and negative slopes in linear $\log_{10}$ discharge versus $\log_{10}$ Ca concentration regressions. The objective of the current study was to evaluate factors controlling stream discharge-Ca concentration relationships including year, season, stream order, vegetation cover, land use, and soil classification. It was hypothesized that land use and soil class are the most critical attributes controlling discharge-Ca concentration relationships. A multilevel, linear regression approach was utilized with data from 28 streams throughout Brazil. These streams come from three distinct regions and varied broadly in watershed size (< 1 to > $10^{6}$ ha) and discharge ($10^{-5.7}$ to $10^{3.2}$ m$^{3}$ s$^{-1}$). Linear regressions of $\log_{10}$ Ca versus $\log_{10}$ discharge in 13 streams have a preponderance of negative slopes with only two streams having significant positive slopes. An ANOVA decomposition suggests the effect of discharge on Ca concentration is large but variable. Vegetation cover, which incorporates aspects of land use, explains the largest proportion of the variance in the effect of discharge on Ca followed by season and year. In contrast, stream order, land use, and soil class explain most of the variation in stream Ca concentration. In the current data set, soil class, which is related to lithology, has an important effect on Ca concentration but land use, likely through its effect on runoff concentration and hydrology, has a greater effect on discharge-concentration relationships.
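As a rough illustration of the per-stream log-log regressions described in this abstract, the slope can be estimated with an ordinary least-squares fit on base-10 logarithms. The function name and the synthetic numbers are made up for the example; the multilevel model used in the study is not reproduced here.

```python
import numpy as np

def loglog_slope(discharge, ca_concentration):
    """Fit log10(Ca) = slope * log10(discharge) + intercept for one stream."""
    x = np.log10(np.asarray(discharge, dtype=float))
    y = np.log10(np.asarray(ca_concentration, dtype=float))
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept

# Synthetic stream (hypothetical values) in which Ca is diluted as discharge rises.
discharge = np.array([0.01, 0.1, 1.0, 10.0, 100.0])
ca = 5.0 * discharge ** -0.3
slope, intercept = loglog_slope(discharge, ca)
```

A negative slope, as in this synthetic case, corresponds to dilution at high flow; a positive slope would indicate that Ca concentration increases with discharge.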
Abstract:
Hot tensile and creep tests were carried out on Kanthal A1 alloy in the temperature range from 600 to 800 °C. Each of these sets of data was analyzed separately according to its own methodology, and an attempt was made to find a correlation between them. A new criterion proposed for converting hot tensile data to creep data makes it possible to analyze the two kinds of results according to the usual creep relations, such as Norton, Monkman-Grant, Larson-Miller and others. The remarkable compatibility verified between both sets of data by this procedure strongly suggests that hot tensile data can be converted to creep data and vice versa for Kanthal A1 alloy, as verified previously for other metallic materials.
Abstract:
The productivity associated with commonly available disassembly methods today seldom makes disassembly the preferred end-of-life solution for massive take-back product streams. Systematic reuse of parts or components, or recycling of pure material fractions, is often not achievable in an economically sustainable way. In this paper a case-based review of current disassembly practices is used to analyse the factors influencing disassembly feasibility. Data mining techniques were used to identify major factors influencing the profitability of disassembly operations. Case characteristics such as involvement of the product manufacturer in the end-of-life treatment and continuous ownership are some of the important dimensions. Economic models demonstrate that the efficiency of disassembly operations should be increased by an order of magnitude to assure the competitiveness of ecologically preferred, disassembly-oriented end-of-life scenarios for large waste electrical and electronic equipment (WEEE) streams. Technological means available to increase the productivity of the disassembly operations are summarized. Automated disassembly techniques can contribute to the robustness of the process, but cannot close the efficiency gap unless combined with appropriate product design measures. Innovative, reversible joints, collectively activated by external trigger signals, form a promising approach to low-cost, mass disassembly in this context. A short overview of the state of the art in the development of such self-disassembling joints is included. (c) 2008 CIRP.
Abstract:
The central issue for pillar design in underground coal mining is the in situ uniaxial compressive strength ($\sigma_{cm}$). The paper proposes a new method for estimating in situ uniaxial compressive strength in coal seams based on laboratory strength and P-wave propagation velocity. It describes the collection of samples in the Bonito coal seam, Fontanella Mine, southern Brazil, the techniques used for the structural mapping of the coal seam and determination of seismic wave propagation velocity as well as the laboratory procedures used to determine the strength and ultrasonic wave velocity. The results obtained using the new methodology are compared with those from seven other techniques for estimating in situ rock mass uniaxial compressive strength.
Abstract:
Determining reference concentrations in rivers and streams is an important tool for environmental management. Reference conditions for eutrophication-related water variables are unavailable for Brazilian freshwaters. We aimed to establish reference baselines for São Paulo State tropical rivers and streams for total phosphorus (TP) and nitrogen (TN), nitrogen-ammonia (NH$_4^+$) and Biochemical Oxygen Demand (BOD) through the best professional judgment and the trisection methods. Data from 319 sites monitored by the São Paulo State Environmental Company (2005 to 2009) and from the 22 Water Resources Management Units in São Paulo State were assessed (N = 27,131). We verified that data from different management units dominated by similar land cover could be analyzed together (Analysis of Variance, P = 0.504). Cumulative frequency diagrams showed that industrialized management units were characterized by the worst water quality (e.g. average TP of 0.51 mg/L), followed by agricultural watersheds. TN and NH$_4^+$ were associated with urban percentages and population density (Spearman Rank Correlation Test, P < 0.05). Best professional judgment and trisection (median of lower third of all sites) methods for determining reference concentrations showed agreement: 0.03 and 0.04 mg/L (TP), 0.31 and 0.34 mg/L (TN), 0.06 and 0.10 mg-N/L (NH$_4^+$) and 2 and 2 mg/L (BOD), respectively. Our reference concentrations were similar to TP and TN reference values proposed for temperate water bodies. These baselines can help with water management in São Paulo State, as well as providing some of the first such information for tropical ecosystems.
Abstract:
Since the 1990s several large companies have been publishing nonfinancial performance reports. Focusing initially on the physical environment, these reports evolved to consider social relations, as well as data on the firm's economic performance. A few mining companies pioneered this trend, and in recent years some of them incorporated the three dimensions of sustainable development, publishing so-called sustainability reports. This article reviews 31 reports published between 2001 and 2006 by four major mining companies. A set of 62 assessment items organized in six categories (namely context and commitment, management, environmental, social and economic performance, and accessibility and assurance) was selected to guide the review. The items were derived from international literature and recommended best practices, including the Global Reporting Initiative G3 framework. A content analysis was performed using the report as a sampling unit, and using phrases, graphics, or tables containing certain information as data collection units. A basic rating scale (0 or 1) was used for noting the presence or absence of information and a final percentage score was obtained for each report. Results show that there is a clear evolution in the reports' comprehensiveness and depth. The categories "accessibility and assurance" and "economic performance" featured the lowest scores and do not present a clear evolution trend in the period, whereas the categories "context and commitment" and "social performance" presented the best results and regular improvement; the category "environmental performance," despite not reaching the highest scores, also featured constant evolution. The description of data measurement techniques and more comprehensive third-party verification are the items most in need of improvement.
Abstract:
The purpose of this study was to describe the reproductive profile and frequency of genital infections among women living in Serra Pelada, a former mining village in Pará state, Brazil. A descriptive study of women living in the mining area of Serra Pelada was performed in 2004 through interviews that gathered demographic and clinical data, and assessed risk behaviors of 209 randomly selected women. Blood samples were collected for rapid assay for HIV; specimens were taken for Pap smears and Gram stains. Standard descriptive statistical analyses were performed and prevalence was calculated to reflect the relative frequency of each disease. Of the 209 participants, the median age was 38 years, with almost 70% having less than four years of education and 77% having no income or under 1.9 times the minimum wage of Brazil. About 30% did not have access to health care services during the preceding year. Risk behaviors included: alcohol abuse, 24.4%; illicit drug abuse, 4.3%; being a sex worker, 15.8%; and domestic violence, 17.7%. Abnormal Pap smear was found in 8.6%. Prevalence rates of infection were: HIV, 1.9%; trichomoniasis, 2.9%; bacterial vaginosis, 18.7%; candidiasis, 5.7%; Chlamydial-related cytological changes, 3.3%; and HPV-related cytological changes, 3.8%. Women living in this mining area in Brazil are economically and socially vulnerable to health problems. Concomitant broader strategies that include reducing poverty and empowering women to improve their health are important.
Abstract:
Electromagnetic induction (EMI) method results are shown for the vertical magnetic dipole (VMD) configuration using the EM38 equipment. Performance in locating metallic pipes and electrical cables is compared as a function of instrumental drift correction by linear and quadratic fitting under controlled conditions. Metallic pipes and electrical cables are buried at the IAG/USP shallow geophysical test site in São Paulo City, Brazil. Results show that apparent electrical conductivity and magnetic susceptibility data were affected by ambient temperature variation. In order to obtain better contrast between background and metallic targets it was necessary to correct the drift. This correction was accomplished by using linear and quadratic relations between conductivity/susceptibility and temperature, for comparative purposes. The correction of temperature drift using a quadratic relation was effective: all metallic targets were located, and the detection of deeper targets also improved. (C) 2010 Elsevier B.V. All rights reserved.
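One way to picture the linear versus quadratic drift correction compared in this abstract is to fit a polynomial of the reading against temperature and subtract the fitted trend. The sketch below is only an assumption about how such a correction might be coded; the function name and the synthetic drift coefficients are hypothetical and do not come from the paper.

```python
import numpy as np

def correct_drift(readings, temperature, degree):
    """Remove a polynomial temperature trend (degree 1 or 2) from instrument readings."""
    readings = np.asarray(readings, dtype=float)
    temperature = np.asarray(temperature, dtype=float)
    coeffs = np.polyfit(temperature, readings, degree)
    trend = np.polyval(coeffs, temperature)
    # Subtract the trend but keep the overall mean level of the survey.
    return readings - trend + readings.mean()

# Synthetic background readings (hypothetical) with a purely quadratic temperature drift.
temperature = np.linspace(20.0, 35.0, 50)
readings = 12.0 + 0.3 * temperature + 0.02 * temperature ** 2
linear_corrected = correct_drift(readings, temperature, degree=1)
quadratic_corrected = correct_drift(readings, temperature, degree=2)
```

In this synthetic case the quadratic fit flattens the background completely while the linear fit leaves a residual curve. In practice the fit would probably be restricted to background stations, since readings over targets would otherwise be partly absorbed into the trend.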
Abstract:
1. Analyses of species association have major implications for selecting indicators for freshwater biomonitoring and conservation, because they allow for the elimination of redundant information and focus on taxa that can be easily handled and identified. These analyses are particularly relevant in the debate about using speciose groups (such as the Chironomidae) as indicators in the tropics, because they require difficult and time-consuming analysis, and their responses to environmental gradients, including anthropogenic stressors, are poorly known. 2. Our objective was to show whether chironomid assemblages in Neotropical streams include clear associations of taxa and, if so, how well these associations could be explained by a set of models containing information from different spatial scales. For this, we formulated a priori models that allowed for the influence of local, landscape and spatial factors on chironomid taxon associations (CTA). These models represented biological hypotheses capable of explaining associations between chironomid taxa. For instance, CTA could be best explained by local variables (e.g. pH, conductivity and water temperature) or by processes acting at wider landscape scales (e.g. percentage of forest cover). 3. Biological data were taken from 61 streams in Southeastern Brazil, 47 of which were in well-preserved regions, and 14 of which drained areas severely affected by anthropogenic activities. We adopted a model selection procedure using Akaike's information criterion to determine the most parsimonious models for explaining CTA. 4. Applying Kendall's coefficient of concordance, seven genera (Tanytarsus/Caladomyia, Ablabesmyia, Parametriocnemus, Pentaneura, Nanocladius, Polypedilum and Rheotanytarsus) were identified as associated taxa. The best-supported model explained 42.6% of the total variance in the abundance of associated taxa. This model combined local and landscape environmental filters and spatial variables (which were derived from eigenfunction analysis). However, the model with local filters and spatial variables also had a good chance of being selected as the best model. 5. Standardised partial regression coefficients of local and landscape filters, including spatial variables, derived from model averaging allowed an estimation of which variables were best correlated with the abundance of associated taxa. In general, the abundance of the associated genera tended to be lower in streams characterised by a high percentage of forest cover (landscape scale), lower proportion of muddy substrata and high values of pH and conductivity (local scale). 6. Overall, our main result adds to the increasing number of studies that have indicated the importance of local and landscape variables, as well as the spatial relationships among sampling sites, for explaining aquatic insect community patterns in streams. Furthermore, our findings open new possibilities for the elimination of redundant data in the assessment of anthropogenic impacts on tropical streams.
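The statement that a second model "also had a good chance of being selected as the best model" is the kind of comparison usually expressed with Akaike weights. A minimal sketch of that calculation follows; the AIC values are invented for illustration and are not taken from the study.

```python
import numpy as np

def akaike_weights(aic_values):
    """Convert AIC scores into relative model weights that sum to 1."""
    aic = np.asarray(aic_values, dtype=float)
    delta = aic - aic.min()            # difference from the best (lowest) AIC
    relative = np.exp(-0.5 * delta)    # relative likelihood of each model
    return relative / relative.sum()

# Hypothetical AIC scores for three candidate models.
weights = akaike_weights([100.0, 102.0, 110.0])
```

With these made-up scores the second model, only two AIC units behind the first, still keeps a substantial share of the weight, which is the situation the abstract describes.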
Abstract:
Most multidimensional projection techniques rely on distance (dissimilarity) information between data instances to embed high-dimensional data into a visual space. When data are endowed with Cartesian coordinates, an extra computational effort is necessary to compute the needed distances, making multidimensional projection prohibitive in applications dealing with interactivity and massive data. The novel multidimensional projection technique proposed in this work, called Part-Linear Multidimensional Projection (PLMP), has been tailored to handle multivariate data represented in Cartesian high-dimensional spaces, requiring only distance information between pairs of representative samples. This characteristic renders PLMP faster than previous methods when processing large data sets while still being competitive in terms of precision. Moreover, knowing the range of variation for data instances in the high-dimensional space, we can make PLMP a truly streaming data projection technique, a trait absent in previous methods.
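The idea of needing distances only between representative samples can be illustrated with a generic linear-map projection. The sketch below is not the authors' PLMP implementation: picking representatives at random, placing them with classical MDS, and fitting the map by least squares are all assumptions chosen to keep the example short.

```python
import numpy as np

def pairwise_dist(A):
    """Euclidean distance matrix between the rows of A."""
    return np.linalg.norm(A[:, None, :] - A[None, :, :], axis=2)

def classical_mds(D, dim=2):
    """Classical MDS: embed points in `dim` dimensions from a distance matrix D."""
    n = len(D)
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J                 # double-centered squared distances
    vals, vecs = np.linalg.eigh(B)              # eigenvalues in ascending order
    top_vals = np.clip(vals[-dim:], 0.0, None)
    return vecs[:, -dim:] * np.sqrt(top_vals)

def linear_map_projection(X, n_representatives, seed=0):
    """Project all rows of X to 2D via a linear map fitted on a few representatives."""
    rng = np.random.default_rng(seed)
    reps = rng.choice(len(X), size=n_representatives, replace=False)
    center = X[reps].mean(axis=0)
    Xr = X[reps] - center
    # Distances are computed only among the representatives.
    Yr = classical_mds(pairwise_dist(Xr), dim=2)
    # Least-squares linear map taking representative coordinates to their 2D positions.
    Phi, *_ = np.linalg.lstsq(Xr, Yr, rcond=None)
    return (X - center) @ Phi
```

Once the map is fitted, each further instance costs a single matrix-vector product, which gives some intuition for why an approach of this kind can be applied to data arriving as a stream.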
Abstract:
Advances in diagnostic research are moving towards methods whereby the periodontal risk can be identified and quantified by objective measures using biomarkers. Patients with periodontitis may have elevated circulating levels of specific inflammatory markers that can be correlated to the severity of the disease. The purpose of this study was to evaluate whether the serum levels of inflammatory biomarkers differ between healthy and periodontitis patients. Twenty-five patients (8 healthy patients and 17 chronic periodontitis patients) were enrolled in the study. A 15 mL blood sample was used for identification of the inflammatory markers, with a human inflammatory flow cytometry multiplex assay. Among 24 assessed cytokines, only 3 (RANTES, MIG and Eotaxin) were statistically different between groups (p<0.05). In conclusion, some of the selected markers of inflammation are differentially expressed in healthy and periodontitis patients. Cytokine profile analysis may be further explored to distinguish periodontitis patients from those free of disease and also to be used as a measure of risk. The present data, however, are limited, and studies with larger sample sizes are required to validate the findings for the specific biomarkers.