13 results for Data mining and knowledge discovery
in Biblioteca Digital da Produção Intelectual da Universidade de São Paulo
Abstract:
We review recent visualization techniques aimed at supporting tasks that require the analysis of text documents, from approaches targeted at visually summarizing the relevant content of a single document to those aimed at assisting exploratory investigation of whole collections of documents. Techniques are organized considering their target input material (either single texts or collections of texts) and their focus, which may be displaying content, emphasizing relevant relationships, highlighting the temporal evolution of a document or collection, or helping users handle the results of a query posed to a search engine. We describe the approaches adopted by distinct techniques and briefly review the strategies they employ to obtain meaningful text models, how they extract the information required to produce representative visualizations, the tasks they intend to support, the interaction issues involved, and their strengths and limitations. Finally, we present a summary of the techniques, highlighting their goals and distinguishing characteristics. We also briefly discuss some open problems and research directions in the fields of visual text mining and text analytics.
Abstract:
The ubiquity of time series data across almost all human endeavors has produced a great interest in time series data mining in the last decade. While dozens of classification algorithms have been applied to time series, recent empirical evidence strongly suggests that simple nearest neighbor classification is exceptionally difficult to beat. The choice of distance measure used by the nearest neighbor algorithm is important, and depends on the invariances required by the domain. For example, motion capture data typically requires invariance to warping, and cardiology data requires invariance to the baseline (the mean value). Similarly, recent work suggests that for time series clustering, the choice of clustering algorithm is much less important than the choice of distance measure used. In this work we make a somewhat surprising claim. There is an invariance that the community seems to have missed: complexity invariance. Intuitively, the problem is that in many domains the different classes may have different complexities, and pairs of complex objects, even those which subjectively may seem very similar to the human eye, tend to be further apart under current distance measures than pairs of simple objects. This fact introduces errors in nearest neighbor classification, where some complex objects may be incorrectly assigned to a simpler class. Similarly, for clustering this effect can introduce errors by “suggesting” to the clustering algorithm that subjectively similar but complex objects belong in a sparser and larger-diameter cluster than is truly warranted. We introduce the first complexity-invariant distance measure for time series, and show that it generally produces significant improvements in classification and clustering accuracy. We further show that this improvement does not compromise efficiency, since we can lower bound the measure and use a modification of the triangle inequality, thus making use of most existing indexing and data mining algorithms.
We evaluate our ideas with the largest and most comprehensive set of time series mining experiments ever attempted in a single work, and show that complexity-invariant distance measures can produce improvements in classification and clustering in the vast majority of cases.
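The abstract does not state the formula of the measure. The sketch below assumes it takes the form of a Euclidean distance multiplied by a correction factor, namely the ratio of the larger to the smaller "complexity estimate" of the two series, where complexity is approximated by the root of the summed squared successive differences. The function names and the `eps` guard for constant series are assumptions made for this illustration, not details taken from the paper.

```python
import math


def euclidean(q, c):
    """Plain Euclidean distance between two equal-length series."""
    if len(q) != len(c):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, c)))


def complexity_estimate(series):
    """Root of the summed squared differences between successive points."""
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(series, series[1:])))


def cid(q, c, eps=1e-12):
    """Euclidean distance scaled by the ratio of larger to smaller complexity."""
    ce_q, ce_c = complexity_estimate(q), complexity_estimate(c)
    # eps is an assumption of this sketch: it avoids dividing by zero
    # when a series is constant (complexity estimate of 0).
    correction = max(ce_q, ce_c, eps) / max(min(ce_q, ce_c), eps)
    return euclidean(q, c) * correction


def nearest_neighbor_label(query, labeled_series, distance=cid):
    """1-NN: return the label of the training series closest to the query."""
    _, label = min(labeled_series, key=lambda pair: distance(query, pair[0]))
    return label
```

Because the correction factor is never below 1, this distance is never smaller than the Euclidean distance, and it grows as the two series differ more in complexity. That growth is what pushes complex objects away from simpler classes in nearest neighbor classification.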
Abstract:
The reproductive performance of cattle may be influenced by several factors, but mineral imbalances are crucial in terms of direct effects on reproduction. Several studies have shown that elements such as calcium, copper, iron, magnesium, selenium, and zinc are essential for reproduction and can prevent oxidative stress. However, toxic elements such as lead, nickel, and arsenic can have adverse effects on reproduction. In this paper, we applied a simple and fast method of multi-element analysis to bovine semen samples from Zebu and European classes used in reproduction programs and artificial insemination. Samples were analyzed by inductively coupled plasma mass spectrometry (ICP-MS) using aqueous medium calibration, and the samples were diluted in a proportion of 1:50 in a solution containing 0.01% (vol/vol) Triton X-100 and 0.5% (vol/vol) nitric acid. Rhodium, iridium, and yttrium were used as the internal standards for ICP-MS analysis. To develop a reliable method of tracing the class of bovine semen, we used data mining techniques that make it possible to classify unknown samples after checking the differentiation of known-class samples. Based on the determination of 15 elements in 41 samples of bovine semen, 3 machine-learning tools for classification were applied to determine cattle class. Our results demonstrate the potential of support vector machine (SVM), multilayer perceptron (MLP), and random forest (RF) chemometric tools to identify cattle class. Moreover, the selection tools made it possible to reduce the number of chemical elements needed from 15 to just 8.
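The abstract does not name the attribute-selection tools that were used. To illustrate the general idea of reducing a set of measured elements to the most discriminative ones, the sketch below swaps in a simple Fisher-score filter: each element is ranked by the squared gap between the class means divided by the summed within-class variances. The function names, the class labels, and the concentrations are invented toy values for illustration only, not data from the study.

```python
from statistics import mean, pvariance


def fisher_score(values_a, values_b):
    """Squared gap between class means over the summed within-class variances."""
    gap = (mean(values_a) - mean(values_b)) ** 2
    spread = pvariance(values_a) + pvariance(values_b)
    if spread > 0:
        return gap / spread
    # No within-class spread: perfectly separating if the means differ.
    return float("inf") if gap > 0 else 0.0


def rank_elements(samples, labels, class_a, class_b):
    """Return element names ordered from most to least discriminative.

    samples: list of dicts mapping element name -> concentration.
    labels:  list of class labels, one per sample.
    """
    scores = {}
    for element in samples[0]:
        a = [s[element] for s, y in zip(samples, labels) if y == class_a]
        b = [s[element] for s, y in zip(samples, labels) if y == class_b]
        scores[element] = fisher_score(a, b)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy measurements (arbitrary units), two samples per class.
toy_samples = [
    {"Zn": 10.0, "Se": 1.0, "Cu": 5.0},
    {"Zn": 11.0, "Se": 3.0, "Cu": 5.2},
    {"Zn": 20.0, "Se": 2.0, "Cu": 5.1},
    {"Zn": 21.0, "Se": 2.5, "Cu": 4.9},
]
toy_labels = ["class_1", "class_1", "class_2", "class_2"]

ranking = rank_elements(toy_samples, toy_labels, "class_1", "class_2")
top_two = ranking[:2]  # keep only the most discriminative elements
```

A classifier such as SVM, MLP, or RF would then be trained on the retained elements only. A filter of this kind scores each element independently, so it can miss elements that are only informative in combination.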
Abstract:
Multi-element analysis of honey samples was carried out with the aim of developing a reliable method of tracing the origin of honey. Forty-two chemical elements were determined (Al, Cu, Pb, Zn, Mn, Cd, Tl, Co, Ni, Rb, Ba, Be, Bi, U, V, Fe, Pt, Pd, Te, Hf, Mo, Sn, Sb, P, La, Mg, I, Sm, Tb, Dy, Sd, Th, Pr, Nd, Tm, Yb, Lu, Gd, Ho, Er, Ce, Cr) by inductively coupled plasma mass spectrometry (ICP-MS). Then, three machine learning tools for classification and two for attribute selection were applied in order to prove that it is possible to use data mining tools to find the region where honey originated. Our results clearly demonstrate the potential of Support Vector Machine (SVM), Multilayer Perceptron (MLP) and Random Forest (RF) chemometric tools for honey origin identification. Moreover, the selection tools allowed a reduction from 42 trace element concentrations to only 5. (C) 2012 Elsevier Ltd. All rights reserved.
Abstract:
Background: The CUPID (Cultural and Psychosocial Influences on Disability) study was established to explore the hypothesis that common musculoskeletal disorders (MSDs) and associated disability are importantly influenced by culturally determined health beliefs and expectations. This paper describes the methods of data collection and various characteristics of the study sample. Methods/Principal Findings: A standardised questionnaire covering musculoskeletal symptoms, disability and potential risk factors was used to collect information from 47 samples of nurses, office workers, and other (mostly manual) workers in 18 countries from six continents. In addition, local investigators provided data on economic aspects of employment for each occupational group. Participation exceeded 80% in 33 of the 47 occupational groups, and after pre-specified exclusions, analysis was based on 12,426 subjects (92 to 1018 per occupational group). As expected, there was high usage of computer keyboards by office workers, while nurses had the highest prevalence of heavy manual lifting in all but one country. There was substantial heterogeneity between occupational groups in economic and psychosocial aspects of work; three- to fivefold variation in awareness of someone outside work with musculoskeletal pain; and more than ten-fold variation in the prevalence of adverse health beliefs about back and arm pain, and in awareness of terms such as "repetitive strain injury" (RSI). Conclusions/Significance: The large differences in psychosocial risk factors (including knowledge and beliefs about MSDs) between occupational groups should allow the study hypothesis to be addressed effectively.
Abstract:
Oxygen abundances of 67 dwarf stars in the metallicity range -1.6 < [Fe/H] < -0.4 are derived from a non-LTE analysis of the 777 nm O I triplet lines. These stars have precise atmospheric parameters measured by Nissen and Schuster, who find that they separate into three groups based on their kinematics and alpha-element (Mg, Si, Ca, Ti) abundances: thick disk, high-alpha halo, and low-alpha halo. We find the oxygen abundance trends of thick-disk and high-alpha halo stars very similar. The low-alpha stars show a larger star-to-star scatter in [O/Fe] at a given [Fe/H] and have systematically lower oxygen abundances compared to the other two groups. Thus, we find the behavior of oxygen abundances in these groups of stars similar to that of the alpha elements. We use previously published oxygen abundance data of disk and very metal-poor halo stars to present an overall view (-2.3 < [Fe/H] < +0.3) of oxygen abundance trends of stars in the solar neighborhood. Two field halo dwarf stars stand out in their O and Na abundances. Both G53-41 and G150-40 have very low oxygen and very high sodium abundances, which are key signatures of the abundance anomalies observed in globular cluster (GC) stars. Therefore, they are likely field halo stars born in GCs. If true, we estimate that at least 3% +/- 2% of the local field metal-poor star population was born in GCs.
Abstract:
Background: The multi-relational approach has emerged as an alternative for analyzing structured data such as relational databases, since it allows data mining to be applied to multiple tables directly, thus avoiding expensive join operations and semantic losses. This work proposes an algorithm based on the multi-relational approach. Methods: To compare the performance of the traditional and multi-relational approaches for mining association rules, this paper presents an empirical study comparing PatriciaMine, a traditional algorithm, with its proposed multi-relational counterpart, MR-Radix. Results: This work showed the performance advantages of the multi-relational approach over several tables, which avoids the high cost of join operations across multiple tables and semantic losses. MR-Radix proved faster than PatriciaMine, despite handling complex multi-relational patterns. Memory usage shows a more conservative growth curve for MR-Radix than for PatriciaMine, which indicates that an increase in the demand for frequent items does not cause a significant growth in memory usage in MR-Radix as it does in PatriciaMine. Conclusion: The comparative study between PatriciaMine and MR-Radix confirmed the efficacy of the multi-relational approach in the data mining process, in terms of both execution time and memory usage. In addition, the proposed multi-relational algorithm, unlike other algorithms of this approach, is efficient for use in large relational databases.
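The abstract does not describe how MR-Radix works internally, so the sketch below does not attempt to reproduce it. It only illustrates one commonly cited form of the "semantic loss" that joining tables can cause when mining association rules: after a join, attributes of one entity are repeated once per related row, which distorts support counts. The tables, attribute names, and function names are hypothetical.

```python
# Hypothetical two-table database: one row per customer, many orders per customer.
customers = {
    1: {"city=SP"},
    2: {"city=RJ"},
}
orders = [
    (1, "item=book"),
    (1, "item=pen"),
    (1, "item=ink"),
    (2, "item=book"),
]


def support_after_join(itemset):
    """Support measured over the rows of the joined (customer x order) table."""
    joined_rows = [customers[cust_id] | {item} for cust_id, item in orders]
    hits = sum(itemset <= row for row in joined_rows)
    return hits / len(joined_rows)


def support_per_entity(itemset):
    """Support measured per customer, reading both tables without joining them."""
    per_customer = {cust_id: set(attrs) for cust_id, attrs in customers.items()}
    for cust_id, item in orders:
        per_customer[cust_id].add(item)
    hits = sum(itemset <= items for items in per_customer.values())
    return hits / len(per_customer)


joined = support_after_join({"city=SP"})      # 3 of 4 joined rows
per_entity = support_per_entity({"city=SP"})  # 1 of 2 customers
```

In this toy case the customer in "SP" placed three of the four orders, so the joined table reports a support of 0.75 for `city=SP` even though only half of the customers live there. Counting per entity across the original tables keeps the support at 0.5 and avoids building the joined table.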
Abstract:
Given a large image set in which very few images have labels, how can we guess labels for the remaining majority? How can we spot images that need brand-new labels different from the predefined ones? How can we summarize these data to route the user’s attention to what really matters? Here we answer all these questions. Specifically, we propose QuMinS, a fast, scalable solution to two problems: (i) low-labor labeling (LLL): given an image set in which very few images have labels, find the most appropriate labels for the rest; and (ii) mining and attention routing: in the same setting, find clusters, the top-N_O outlier images, and the N_R images that best represent the data. Experiments on satellite images spanning up to 2.25 GB show that, in contrast to state-of-the-art labeling techniques, QuMinS scales linearly with the data size and is up to 40 times faster than top competitors (GCap) while achieving equal or better accuracy; it also spots images that potentially require unpredicted labels, and it works even with tiny initial label sets, i.e., nearly five examples. We also report a case study of our method’s practical usage to show that QuMinS is a viable tool for automatic coffee crop detection from remote sensing images.
Abstract:
What is knowledge construction for? Mesopotamian rituals were practiced in order to grasp the future and guide war strategies. Nowadays, scientific rules are developed to avoid mysticism, constructing more accurate laws to explain reality. Both rituals and science were, and usually are, grounded in the conception that to know is to decipher the correct meaning behind the expressive relief of the world. Contemporary studies in anthropology have shown that the opposition between nature and culture is the basis of a number of problems in the human sciences aiming to comprehend the intricate relation between body and violence and to overcome ethical dilemmas.
Abstract:
Serra da Canastra National Park (SCNP) is one of the most important protected areas in the Cerrado biome. Despite its importance to the conservation of rare and endangered species such as the Brazilian Merganser, two bills were approved in 2010 by Brazil's Chamber of Deputies aiming to reduce SCNP's official boundaries and to transform parts of it into an Environmental Protection Area (EPA). We evaluated whether such changes would make it easier for mining areas to be legally exploited within the park's area, and whether those mining areas would represent a threat to Brazilian Merganser populations at SCNP. Results showed that 55% of the mining areas currently within the National Park will be located within the new EPA, and that six hydrographic micro-basins inhabited by the Brazilian Merganser could be affected by environmental impacts caused by mineral exploitation in those areas. For these reasons, we recommend that the two bills be rejected by the Federal Senate.
Abstract:
Each plasma physics laboratory has its own proprietary control and data acquisition system, which usually differs from one laboratory to another. This means that each laboratory has its own way of controlling the experiment and retrieving data from the database. Fusion research relies to a great extent on international collaboration, and these private systems make it difficult to follow the work remotely. The TCABR data analysis and acquisition system has been upgraded to support a joint research programme using remote participation technologies. The choice of MDSplus (Model Driven System plus) is justified by the fact that it is widely used, so scientists from different institutions may use the same system in different experiments on different tokamaks without needing to know how each one handles data acquisition and data analysis. Another important point is that MDSplus has a library system that allows communication between different languages (Java, Fortran, C, C++, Python) and programs such as MATLAB, IDL, and Octave. In the case of the TCABR tokamak (the object of this paper), interfaces between the system already in use and MDSplus were developed, instead of using MDSplus at all stages, from control and data acquisition to data analysis. This was done to preserve a complex system already in operation, which would otherwise take a long time to migrate. This implementation also allows new components to be added that use MDSplus fully at all stages. (c) 2012 Elsevier B.V. All rights reserved.
Abstract:
The adoption of the principles of equality and universality stipulated in legislation for the sanitation sector requires discussion of innovation. The existing model was able to meet sanitary demands but was unable to serve all areas, causing disparities in vulnerable areas. The universal implementation of sanitation requires identifying the know-how that promotes it and analyzing the model adopted today in order to establish a new method. Analysis of the different viewpoints on the restructuring process is necessary for the definition of public policy, especially in health, and for understanding its complexities and its importance in confirming social practices and organizational designs. These are discussed, by means of a review of the literature and of practices in the industry, to contribute to the universal implementation of sanitation in urban areas. By way of conclusion, it is considered that accepting a particular concept or idea in sanitation means choosing some effective interventions in the network and in the lives of individual users, and implies a redefinition of the space in which control and management of sewerage networks is exercised, such that connected users are perceived as groups with different interests.
Abstract:
Background: The success of HPV vaccination programs will require awareness regarding HPV-associated diseases and the benefits of HPV vaccination for the general population. The aim of this study was to assess the level of awareness and knowledge of human papillomavirus (HPV) infection, cervical cancer prevention, vaccines, and factors associated with HPV awareness among young women after the birth of their first child. Methods: This analysis is part of a cross-sectional study carried out at Hospital Maternidade Leonor Mendes de Barros, a large public maternity hospital in São Paulo. Primiparous women (15-24 years) who gave birth in that maternity hospital were included. A questionnaire that included questions concerning knowledge of HPV, cervical cancer, and vaccines was applied. To estimate the association of HPV awareness with selected factors, prevalence ratios (PR) were estimated using a generalized linear model (GLM). Results: Three hundred and one primiparous women were included; 37% of them reported that they "had ever heard about HPV", but only 19% and 7%, respectively, knew that HPV is a sexually transmitted infection (STI) and that it can cause cervical cancer. Seventy-four percent of interviewees mentioned the preventive character of vaccines, and all participants affirmed that they would accept HPV vaccination after delivery. In the multivariate analysis, only increasing age (P for trend = 0.021) and previous STI (P < 0.001) were factors independently associated with HPV awareness ("had ever heard about HPV"). Conclusions: This survey indicated that knowledge about the association between HPV and cervical cancer among primiparous young women is low. Therefore, these young low-income primiparous women could benefit greatly from educational interventions to encourage primary and secondary cervical cancer prevention programs.
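The study estimated prevalence ratios with a generalized linear model, which can adjust for several factors at once. As a simpler illustration of what a prevalence ratio is, the sketch below computes a crude (unadjusted) ratio from a 2x2 table, with an approximate 95% confidence interval based on the usual standard error of the log ratio. The function name and the counts in the example are hypothetical and are not taken from the study.

```python
import math


def prevalence_ratio(exposed_cases, exposed_total,
                     unexposed_cases, unexposed_total, z=1.96):
    """Crude prevalence ratio with an approximate confidence interval.

    Returns (ratio, lower, upper). Requires non-zero case counts in both groups.
    """
    p_exposed = exposed_cases / exposed_total
    p_unexposed = unexposed_cases / unexposed_total
    ratio = p_exposed / p_unexposed
    # Standard error of log(ratio) for two independent proportions.
    se_log = math.sqrt(
        1 / exposed_cases - 1 / exposed_total
        + 1 / unexposed_cases - 1 / unexposed_total
    )
    lower = math.exp(math.log(ratio) - z * se_log)
    upper = math.exp(math.log(ratio) + z * se_log)
    return ratio, lower, upper


# Hypothetical counts: 30 of 100 aware in one group, 15 of 100 in the other.
pr, ci_low, ci_high = prevalence_ratio(30, 100, 15, 100)
```

A ratio above 1 whose interval excludes 1 suggests higher awareness in the first group. Unlike the multivariate GLM used in the study, this crude estimate does not account for other factors such as age.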