877 results for data-mining application


Relevance:

90.00% 90.00%

Publisher:

Abstract:

Principal Component Analysis (PCA) is a popular method for dimension reduction that is used in many fields, including data compression, image processing, and exploratory data analysis. However, the traditional PCA method has several drawbacks: it is inefficient for high-dimensional data, and it cannot compute sufficiently accurate principal components when a relatively large portion of the data is missing. In this report, we propose using the EM-PCA method for dimension reduction of power system measurements with missing data, and we provide a comparative study of the traditional PCA and EM-PCA methods. Our extensive experimental results show that, when a large portion of the data is missing, the EM-PCA method is more effective and more accurate than the traditional PCA method for dimension reduction of power system measurement data.
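One common way to realise an EM-style PCA with missing values is to alternate between fitting a low-rank model and re-estimating the missing entries from it. The sketch below illustrates that idea only; it is not taken from the report. The function name `em_pca`, its parameters, and the convention of marking missing measurements with NaN are all assumptions made for the example, and it assumes every column has at least one observed value.

```python
import numpy as np

def em_pca(X, n_components, n_iter=500, tol=1e-8):
    """EM-style PCA sketch for a (samples x features) matrix with NaN gaps.

    Returns (components, scores, filled), where `filled` is X with the
    missing entries replaced by their low-rank estimates.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Start from column means of the observed values (hypothetical choice).
    col_means = np.nanmean(X, axis=0)
    filled = np.where(missing, col_means, X)

    for _ in range(n_iter):
        # M-step: fit a rank-k PCA model to the currently completed data.
        mean = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mean, full_matrices=False)
        recon = (U[:, :n_components] * s[:n_components]) @ Vt[:n_components] + mean

        # E-step: re-estimate only the missing entries from the model.
        new_filled = np.where(missing, recon, X)
        change = np.max(np.abs(new_filled - filled))
        filled = new_filled
        if change < tol:
            break

    # Final decomposition of the completed matrix.
    mean = filled.mean(axis=0)
    _, _, Vt = np.linalg.svd(filled - mean, full_matrices=False)
    components = Vt[:n_components]
    scores = (filled - mean) @ components.T
    return components, scores, filled
```

Observed entries are never altered; only the gaps are updated on each pass. A traditional PCA baseline for comparison would typically fill the gaps once (for example with column means) and run a single decomposition.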

Relevance:

90.00% 90.00%

Publisher:

Abstract:

Brain tumors are among the most aggressive types of cancer in humans, with an estimated median survival time of 12 months and only 4% of patients surviving more than 5 years after diagnosis. Until recently, brain tumor prognosis has been based only on clinical information such as tumor grade and patient age, but there are reports indicating that molecular profiling of gliomas can reveal subgroups of patients with distinct survival rates. We hypothesize that coupling molecular profiling of brain tumors with clinical information might improve predictions of patient survival time and, consequently, better guide future treatment decisions. To evaluate this hypothesis, the general goal of this research is to build models for survival prediction of glioma patients using DNA molecular profiles (U133 Affymetrix gene expression microarrays) along with clinical information. First, a predictive Random Forest model is built for binary outcomes (i.e. short- vs. long-term survival), and a small subset of genes whose expression values can be used to predict survival time is selected. Next, a new statistical methodology is developed for predicting time-to-death outcomes using Bayesian ensemble trees. Because of the large heterogeneity observed within the prognostic classes obtained by the Random Forest model, prediction can be improved by relating time-to-death to the gene expression profile directly. We propose a Bayesian ensemble model for survival prediction that is appropriate for high-dimensional data such as gene expression data. Our approach is based on the ensemble "sum-of-trees" model, which is flexible enough to incorporate additive and interaction effects between genes. We specify a fully Bayesian hierarchical approach and illustrate our methodology for the CPH, Weibull, and AFT survival models. We overcome the lack of conjugacy by using a latent variable formulation to model the covariate effects, which decreases computation time for model fitting.
Also, our proposed models provide a model-free way to select important predictive prognostic markers based on controlling false discovery rates. We compare the performance of our methods with baseline reference survival methods and apply our methodology to an unpublished data set of brain tumor survival times and gene expression data, selecting genes potentially related to the development of the disease under study. A closing discussion compares the results obtained by the Random Forest and Bayesian ensemble methods from the biological and clinical perspectives and highlights the statistical advantages and disadvantages of the new methodology in the context of DNA microarray data analysis.

Relevance:

90.00% 90.00%

Publisher:

Abstract:

Biodiversity, a multidimensional property of natural systems, is difficult to quantify partly because of the multitude of indices proposed for this purpose. Indices aim to describe general properties of communities that allow us to compare different regions, taxa, and trophic levels. Therefore, they are of fundamental importance for environmental monitoring and conservation, although there is no consensus about which indices are more appropriate and informative. We tested several common diversity indices in a range of simple to complex statistical analyses in order to determine whether some were better suited for certain analyses than others. We used data collected around the focal plant Plantago lanceolata on 60 temperate grassland plots embedded in an agricultural landscape to explore relationships between the common diversity indices of species richness (S), Shannon's diversity (H'), Simpson's diversity (D-1), Simpson's dominance (D-2), Simpson's evenness (E), and Berger-Parker dominance (BP). We calculated each of these indices for herbaceous plants, arbuscular mycorrhizal fungi, aboveground arthropods, belowground insect larvae, and P. lanceolata molecular and chemical diversity. Including these trait-based measures of diversity allowed us to test whether they behaved similarly to the better-studied species diversity. We used path analysis to determine whether compound indices detected more relationships between diversities of different organisms and traits than more basic indices. In the path models, more paths were significant when using H', even though all models except that with E were equally reliable. This demonstrates that while common diversity indices may appear interchangeable in simple analyses, the choice of index can profoundly alter the interpretation of results when complex interactions are considered.
Data mining in order to identify the index producing the most significant results should be avoided, but simultaneously considering analyses using multiple indices can provide greater insight into the interactions in a system.
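To make the indices named in this abstract concrete, the sketch below computes them from a list of abundance counts using their standard textbook definitions. The abstract does not define its D-1, D-2, and E variants, so the Simpson-type quantities here (concentration, Gini-Simpson, inverse Simpson, and an evenness derived from the inverse form) are assumptions and may not match the authors' formulas. The function name `diversity_indices` and the dictionary keys are likewise hypothetical.

```python
import math

def diversity_indices(counts):
    """Return common diversity indices for a list of species abundance counts."""
    counts = [c for c in counts if c > 0]  # species with zero count are absent
    if not counts:
        raise ValueError("at least one species must have a positive count")

    total = sum(counts)
    p = [c / total for c in counts]        # relative abundances

    richness = len(p)                                  # S
    shannon = -sum(pi * math.log(pi) for pi in p)      # H' (natural log)
    concentration = sum(pi * pi for pi in p)           # Simpson's sum of p_i^2
    inverse_simpson = 1.0 / concentration

    return {
        "richness": richness,
        "shannon": shannon,
        "simpson_concentration": concentration,
        "gini_simpson": 1.0 - concentration,
        "inverse_simpson": inverse_simpson,
        # Evenness assumed here as inverse Simpson divided by richness.
        "simpson_evenness": inverse_simpson / richness,
        "berger_parker": max(p),                       # share of the commonest species
    }
```

For a perfectly even community of four species, such as `diversity_indices([10, 10, 10, 10])`, richness is 4, Shannon's diversity is ln 4, and the evenness is 1, which is why the indices can look interchangeable on simple data and diverge only when abundances are skewed.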

Relevance:

90.00% 90.00%

Publisher:

Abstract:

NH···π hydrogen bonds occur frequently between the amino acid side groups in proteins and peptides. Data-mining studies of protein crystals find that ~80% of the T-shaped histidine···aromatic contacts are CH···π, and only ~20% are NH···π interactions. We investigated the infrared (IR) and ultraviolet (UV) spectra of the supersonic-jet-cooled imidazole·benzene (Im·Bz) complex as a model for the NH···π interaction between histidine and phenylalanine. Ground- and excited-state dispersion-corrected density functional calculations and correlated methods (SCS-MP2 and SCS-CC2) predict that Im·Bz has a Cs-symmetric T-shaped minimum-energy structure with an NH···π hydrogen bond to the Bz ring; the NH bond is tilted 12° away from the Bz C₆ axis. IR depletion spectra support the T-shaped geometry: the NH stretch vibrational fundamental is red shifted by −73 cm⁻¹ relative to that of bare imidazole at 3518 cm⁻¹, indicating a moderately strong NH···π interaction. While the S₀(A₁g) → S₁(B₂u) origin of benzene at 38 086 cm⁻¹ is forbidden in the gas phase, Im·Bz exhibits a moderately intense S₀ → S₁ origin, which appears via the D₆h → Cs symmetry lowering of Bz by its interaction with imidazole. The NH···π ground-state hydrogen bond is strong, De = 22.7 kJ/mol (1899 cm⁻¹). The combination of gas-phase UV and IR spectra confirms the theoretical predictions that the optimum Im·Bz geometry is T shaped and NH···π hydrogen bonded. We find no experimental evidence for a CH···π hydrogen-bonded ground-state isomer of Im·Bz. The optimum NH···π geometry of the Im·Bz complex is very different from the majority of the histidine·aromatic contact geometries found in protein database analyses, implying that the CH···π contacts observed in these searches do not arise from favorable binding interactions but merely from protein side-chain folding and crystal-packing constraints. The UV and IR spectra of the imidazole·(benzene)₂ cluster are observed via fragmentation into the Im·Bz⁺ mass channel.
The spectra of Im·Bz and Im·Bz₂ are cleanly separable by IR hole burning. The UV spectrum of Im·Bz₂ exhibits two 0⁰₀ bands corresponding to the S₀ → S₁ excitations of the two inequivalent benzenes, which are symmetrically shifted by −86/+88 cm⁻¹ relative to the 0⁰₀ band of benzene.
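The abstract quotes the binding energy in two units and a band position as a shift from a reference, so a small arithmetic check may help readers follow the numbers. The sketch below is only an illustration: the helper names are hypothetical, the constants are the exact SI-defined values of the Planck constant, the speed of light, and the Avogadro constant, and reading 3518 cm⁻¹ as the bare-imidazole NH stretch is an interpretation of the sentence above.

```python
# Exact SI-defined constants (2019 redefinition).
PLANCK_J_S = 6.62607015e-34          # Planck constant, J s
LIGHT_SPEED_CM_S = 2.99792458e10     # speed of light, cm/s
AVOGADRO_PER_MOL = 6.02214076e23     # Avogadro constant, 1/mol

def kj_per_mol_to_wavenumber(energy_kj_mol):
    """Convert a molar energy in kJ/mol to a wavenumber in cm^-1."""
    joule_per_molecule = energy_kj_mol * 1e3 / AVOGADRO_PER_MOL
    return joule_per_molecule / (PLANCK_J_S * LIGHT_SPEED_CM_S)

def shifted_band(reference_cm1, shift_cm1):
    """Band position after applying a signed shift to a reference band."""
    return reference_cm1 + shift_cm1

if __name__ == "__main__":
    print(round(kj_per_mol_to_wavenumber(22.7), 1))  # binding energy in cm^-1
    print(shifted_band(3518, -73))                   # NH stretch of the complex
```

Converting 22.7 kJ/mol this way gives roughly 1898 cm⁻¹, close to the quoted 1899 cm⁻¹; the small gap is presumably due to the kJ/mol value being rounded to three figures. Applying the −73 cm⁻¹ shift to 3518 cm⁻¹ places the NH stretch of the complex at 3445 cm⁻¹.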

Relevance:

90.00% 90.00%

Publisher:

Abstract:

A wide variety of spatial data collection efforts are ongoing throughout local, state and federal agencies, private firms and non-profit organizations. Each effort is established for a different purpose, but organizations and individuals often collect and maintain the same or similar information. The United States federal government has undertaken many initiatives, such as the National Spatial Data Infrastructure, the National Map and Geospatial One-Stop, to reduce duplicative spatial data collection and promote the coordinated use, sharing, and dissemination of spatial data nationwide. A key premise in most of these initiatives is that no national government will be able to gather and maintain more than a small percentage of the geographic data that users want. Thus, national initiatives typically depend on the cooperation of those already gathering spatial data and those using GIS to meet specific needs to help construct and maintain these spatial data infrastructures and geo-libraries for their nations (Onsrud 2001). Some of the impediments to widespread spatial data sharing are well known from directly asking GIS data producers why they are not currently involved in creating datasets in common or compatible formats, documenting their datasets in a standardized metadata format, or making their datasets more readily available to others through data clearinghouses or geo-libraries. The research described in this thesis addresses the impediments to wide-scale spatial data sharing faced by GIS data producers and explores a new conceptual data-sharing approach, the Public Commons for Geospatial Data, that supports user-friendly metadata creation, open access licenses, archival services, and documentation of the parent lineage of the contributors and value-adders of digital spatial data sets.

Relevance:

90.00% 90.00%

Publisher:

Abstract:

Intensive family preservation services (IFPS), designed to stabilize at-risk families and avert out-of-home care, have been the focus of many randomized, experimental studies. Employing a retrospective “clinical data-mining” (CDM) methodology (Epstein, 2001), this study makes use of available information extracted from client records in one IFPS agency over the course of two years. The primary goal of this descriptive and associational study was to gain a clearer understanding of IFPS service delivery and effectiveness. Interventions provided to families are delineated and assessed for their impact on family functioning, on the reduction of family violence, and on placement prevention. Findings confirm the use of a wide range of services consistent with IFPS program theory. Because the study is a quasi-experimental, retrospective use of available information, the clinical outcomes described cannot be causally attributed to the interventions employed, as they could be in a randomized controlled trial. With regard to service outcomes, findings suggest that family education, empowerment services and advocacy are most influential in placement prevention and in ameliorating unmanageable behaviors in children as well as the incidence of family violence.

Relevance:

90.00% 90.00%

Publisher:

Abstract:

Intensive family preservation services (IFPS), designed to stabilize at-risk families and avert out-of-home care, have been the focus of many randomized, experimental studies. The emphasis on "gold-standard" evaluation of IFPS has resulted in fewer "black box" studies that describe actual IFPS service patterns and the fidelity with which they adhere to IFPS program theory. Intervention research is important to the advancement of programs designed to protect the safety of children, improve family functioning, and prevent out-of-home placement. Employing a retrospective “clinical data-mining” (CDM) methodology, this exploratory study of Families First, an IFPS program, makes use of available information extracted from client records to describe interventions and service patterns provided over a two-year period. This study uncovers actual IFPS service patterns, demonstrates IFPS program fidelity, and reveals the usefulness of CDM as a social work research methodology. These findings are particularly valuable for program planning and treatment, policy development and evidence-based practice research.

Relevance:

90.00% 90.00%

Publisher:

Abstract:

Sediments of Lake Donggi Cona on the northeastern Tibetan Plateau were studied to infer changes in the lacustrine depositional environment related to climatic and non-climatic changes during the last 19 kyr. The lake today fills a 30 × 8 km wide and 95 m deep tectonic basin associated with the Kunlun Fault. The study was conducted on a sediment-core transect through the lake basin in order to gain a complete picture of spatiotemporal environmental change. The recovered sediments are partly finely laminated and are composed of calcareous muds with variable amounts of carbonate micrite, organic matter, detrital silt and clay. On the basis of sedimentological, geochemical, and mineralogical data, up to five lithological units (LU) can be distinguished that document distinct stages in the development of the lake system. The onset of the lowermost LU, with lacustrine muds above basal sands, indicates that the lake level was at least 39 m below the present level and started to rise after 19 ka, possibly in response to regional deglaciation. At this time, the lacustrine environment was characterized by detrital sediment influx and the deposition of siliciclastic sediment. In two sediment cores, upward grain-size coarsening documents a lake-level fall after 13 cal ka BP, possibly associated with the late-glacial Younger Dryas stadial. From 11.5 to 4.3 cal ka BP, grain-size fining in sediment cores from the profundal coring sites and the onset of lacustrine deposition at a littoral core site (2 m water depth) in a recent marginal bay of Donggi Cona document a lake-level rise during the early to mid-Holocene to at least the modern level. In addition, high biological productivity and pronounced precipitation of carbonate micrites are consistent with warm and moist climate conditions related to an enhanced influence of the summer monsoon. At 4.3 cal ka BP the lake system shifted from an aragonite- to a calcite-dominated system, indicating a change towards a fully open hydrological lake system.
The younger clay-rich sediments are, moreover, non-laminated and lack any diagenetic sulphides, pointing to fully ventilated conditions and the prevailing absence of lake stratification. This turning point in lake history could imply either a threshold response to insolation-forced climate cooling or a response to a non-climatic trigger, such as an erosional event or a tectonic pulse that induced a strong earthquake; our data do not allow us to decide between these.

Relevance:

90.00% 90.00%

Publisher:

Abstract:

The development of the ecosystem approach and of models for the management of ocean marine resources requires easy access to standard, validated datasets of historical catch data for the main exploited species, together with the model estimates obtained from these data, to allow model inter-comparison and evaluation of model skill. North Atlantic albacore tuna is exploited all year round by longline and in summer and autumn by surface fisheries, and fishery statistics are compiled by the International Commission for the Conservation of Atlantic Tunas (ICCAT). Catch and effort data with geographical coordinates, at monthly resolution on 1° or 5° squares, were extracted for this species with a careful definition of fisheries and data screening. Length frequencies of catch were also extracted according to the definition of fisheries for the period 1956-2010. Using these data, an application of the spatial ecosystem and population dynamics model (SEAPODYM) was developed for the North Atlantic albacore population and fisheries and provided the first spatially explicit estimate of albacore density in the North Atlantic by life stage. These densities by life stage (larval recruits, young immature fish, adult mature fish, and total biomass) are provided in a gridded file (NetCDF) at a resolution of 2° x 2° x month.

Relevance:

90.00% 90.00%

Publisher:

Abstract:

The recent development of in-situ monitoring devices, such as UV spectrometers, makes the study of short-term stream chemistry variation relevant, especially the study of diurnal cycles, which are not yet fully understood. Our study is based on high-frequency data from an agricultural catchment (Studienlandschaft Schwingbachtal, Germany). We propose a novel approach, the combination of cluster analysis and Linear Discriminant Analysis, to mine nitrate behavior patterns from these data. As a result, we observe a seasonality of nitrate diurnal cycles that differs from the most common cycle seasonality described in the literature, i.e. pre-dawn peaks in spring. Our cycles appear in summer, and the maximum and minimum shift to a later time in late summer and autumn. This is observed in both water- and energy-limited years, thus potentially stressing the role of evapotranspiration. This concluding hypothesis on the role of evapotranspiration in nitrate stream concentration, which was obtained through data mining, broadens the perspective on the diurnal cycling of stream nitrate concentrations.

Relevance:

90.00% 90.00%

Publisher:

Abstract:

This poster presents research oriented to the storage, retrieval, representation and analysis of dynamic GI, taking into account the semantic, the temporal and the spatiotemporal components. The ultimate objective is the modelling and representation of the dynamic nature of geographic features, establishing mechanisms to store geometries enriched with a temporal structure (regardless of space) and a set of semantic descriptors detailing and clarifying the nature of the represented features and their temporality. We intend to define a set of methods, rules and restrictions for the adequate integration of these components into the primary elements of the GI: theme, location, time [1]. We intend to establish and incorporate three new structures (layers) into the core of data storage by using mark-up languages: a semantic-temporal structure, a geosemantic structure, and an incremental spatiotemporal structure. Thus, data would be provided with the capability of pinpointing and expressing their own basic and temporal characteristics, enabling them to interact with each other according to their context and to the time and meaning relationships that could eventually be established.

Relevance:

90.00% 90.00%

Publisher:

Abstract:

Presenting relevant information via web-based user-friendly interfaces makes the information more accessible to the general public. This is especially useful for sensor networks that monitor natural environments. Adequately communicating this type of information helps increase awareness about the limited availability of natural resources and promotes their better use with sustainable practices. In this paper, I suggest an approach to communicating this information to wide audiences based on simulating data journalism using artificial intelligence techniques. I analyze this approach by describing a pioneer knowledge-based system called VSAIH, which looks for news in hydrological data from a national sensor network in Spain and creates news stories that general users can understand. VSAIH integrates artificial intelligence techniques, including a model-based data analyzer and a presentation planner. In the paper, I also describe characteristics of the hydrological national sensor network and the technical solutions applied by VSAIH to simulate data journalism.