982 resultados para Data profiling
Resumo:
Networked information and communication technologies are rapidly advancing the capacities of governments to target and separately manage specific sub-populations, groups and individuals. Targeting uses data profiling to calculate the differential probabilities of outcomes associated with various personal characteristics. This knowledge is used to classify and sort people for differentiated levels of treatment. Targeting is often used to efficiently and effectively target government resources to the most disadvantaged. Although having many benefits, targeting raises several policy and ethical issues. This paper discusses these issues and the policy responses governments may take to maximise the benefits of targeting while ameliorating the negative aspects.
Resumo:
Companies require information in order to gain an improved understanding of their customers. Data concerning customers, their interests and behavior are collected through different loyalty programs. The amount of data stored in company data bases has increased exponentially over the years and become difficult to handle. This research area is the subject of much current interest, not only in academia but also in practice, as is shown by several magazines and blogs that are covering topics on how to get to know your customers, Big Data, information visualization, and data warehousing. In this Ph.D. thesis, the Self-Organizing Map and two extensions of it – the Weighted Self-Organizing Map (WSOM) and the Self-Organizing Time Map (SOTM) – are used as data mining methods for extracting information from large amounts of customer data. The thesis focuses on how data mining methods can be used to model and analyze customer data in order to gain an overview of the customer base, as well as, for analyzing niche-markets. The thesis uses real world customer data to create models for customer profiling. Evaluation of the built models is performed by CRM experts from the retailing industry. The experts considered the information gained with help of the models to be valuable and useful for decision making and for making strategic planning for the future.
Resumo:
This paper consists in the characterization of medium voltage (MV) electric power consumers based on a data clustering approach. It is intended to identify typical load profiles by selecting the best partition of a power consumption database among a pool of data partitions produced by several clustering algorithms. The best partition is selected using several cluster validity indices. These methods are intended to be used in a smart grid environment to extract useful knowledge about customers’ behavior. The data-mining-based methodology presented throughout the paper consists in several steps, namely the pre-processing data phase, clustering algorithms application and the evaluation of the quality of the partitions. To validate our approach, a case study with a real database of 1.022 MV consumers was used.
Resumo:
With recent advances in mass spectrometry techniques, it is now possible to investigate proteins over a wide range of molecular weights in small biological specimens. This advance has generated data-analytic challenges in proteomics, similar to those created by microarray technologies in genetics, namely, discovery of "signature" protein profiles specific to each pathologic state (e.g., normal vs. cancer) or differential profiles between experimental conditions (e.g., treated by a drug of interest vs. untreated) from high-dimensional data. We propose a data analytic strategy for discovering protein biomarkers based on such high-dimensional mass-spectrometry data. A real biomarker-discovery project on prostate cancer is taken as a concrete example throughout the paper: the project aims to identify proteins in serum that distinguish cancer, benign hyperplasia, and normal states of prostate using the Surface Enhanced Laser Desorption/Ionization (SELDI) technology, a recently developed mass spectrometry technique. Our data analytic strategy takes properties of the SELDI mass-spectrometer into account: the SELDI output of a specimen contains about 48,000 (x, y) points where x is the protein mass divided by the number of charges introduced by ionization and y is the protein intensity of the corresponding mass per charge value, x, in that specimen. Given high coefficients of variation and other characteristics of protein intensity measures (y values), we reduce the measures of protein intensities to a set of binary variables that indicate peaks in the y-axis direction in the nearest neighborhoods of each mass per charge point in the x-axis direction. We then account for a shifting (measurement error) problem of the x-axis in SELDI output. After these pre-analysis processing of data, we combine the binary predictors to generate classification rules for cancer, benign hyperplasia, and normal states of prostate. Our approach is to apply the boosting algorithm to select binary predictors and construct a summary classifier. We empirically evaluate sensitivity and specificity of the resulting summary classifiers with a test dataset that is independent from the training dataset used to construct the summary classifiers. The proposed method performed nearly perfectly in distinguishing cancer and benign hyperplasia from normal. In the classification of cancer vs. benign hyperplasia, however, an appreciable proportion of the benign specimens were classified incorrectly as cancer. We discuss practical issues associated with our proposed approach to the analysis of SELDI output and its application in cancer biomarker discovery.
Resumo:
Hypercapnia and elevated temperatures resulting from climate change may have adverse consequences for many marine organisms. While diverse physiological and ecological effects have been identified, changes in those molecular mechanisms, which shape the physiological phenotype of a species and limit its capacity to compensate, remain poorly understood. Here, we use global gene expression profiling through RNA-Sequencing to study the transcriptional responses to ocean acidification and warming in gills of the boreal spider crab Hyas araneus exposed medium-term (10 weeks) to intermediate (1,120 µatm) and high (1,960 µatm) PCO2 at different temperatures (5°C and 10°C). The analyses reveal shifts in steady state gene expression from control to intermediate and from intermediate to high CO2 exposures. At 5°C acid-base, energy metabolism and stress response related genes were upregulated at intermediate PCO2, whereas high PCO2 induced a relative reduction in expression to levels closer to controls. A similar pattern was found at elevated temperature (10°C). There was a strong coordination between acid-base, metabolic and stress-related processes. Hemolymph parameters at intermediate PCO2 indicate enhanced capacity in acid-base compensation potentially supported by upregulation of a V-ATPase. The likely enhanced energy demand might be met by the upregulation of the electron transport system (ETS), but may lead to increased oxidative stress reflected in upregulated antioxidant defense transcripts. These mechanisms were attenuated by high PCO2, possibly as a result of limited acid-base compensation and metabolic down-regulation. Our findings indicate a PCO2 dependent threshold beyond which compensation by acclimation fails progressively. They also indicate a limited ability of this stenoecious crustacean to compensate for the effects of ocean acidification with and without concomitant warming.