9 results for random forest data analysis
in Digital Commons - Michigan Tech
Abstract:
Analyzing large-scale gene expression data is a labor-intensive and time-consuming process. To make data analysis easier, we developed a set of pipelines for rapid processing and analysis of poplar gene expression data for knowledge discovery. Of the pipelines developed, the differentially expressed genes (DEGs) pipeline is designed to identify biologically important genes that are differentially expressed at one of multiple time points or conditions. The pathway analysis pipeline was designed to identify differentially expressed metabolic pathways. The protein domain enrichment pipeline identifies the enriched protein domains present in the DEGs. Finally, the Gene Ontology (GO) enrichment analysis pipeline was developed to identify the enriched GO terms in the DEGs. Our pipeline tools can analyze both microarray gene data and high-throughput gene data. These two types of data are obtained by two different technologies. Microarray technology measures gene expression levels via microarray chips, a collection of microscopic DNA spots attached to a solid (glass) surface, whereas high-throughput sequencing, also called next-generation sequencing, is a newer technology that measures gene expression levels by directly sequencing mRNAs and obtaining each mRNA's copy number in cells or tissues. We also developed a web portal (http://sys.bio.mtu.edu/) that makes all pipelines publicly available so that users can analyze their own gene expression data. In addition to the analyses mentioned above, it can also perform GO hierarchy analysis, i.e., construct GO trees using a list of GO terms as input.
Abstract:
The Zagros oak forests in Western Iran are critically important to the sustainability of the region. These forests have undergone dramatic declines in recent decades. We evaluated the utility of the non-parametric Random Forest classification algorithm for land cover classification of Zagros landscapes, and selected the best spatial and spectral predictive variables. The algorithm achieved high overall classification accuracies (>85%) and equivalent classification accuracies for the datasets from the three different sensors. We evaluated the associations between trends in forest area and structure and trends in socioeconomic and climatic conditions, to identify the most likely driving forces of deforestation and landscape structure change. We used available socioeconomic (urban and rural population, and rural income) and climatic (mean annual rainfall and mean annual temperature) data for two provinces in northern Zagros. The driving force most strongly correlated with forest area loss was urban population, followed to a lesser extent by the climatic variables. Landscape structure changes were more closely associated with rural population. We examined the effects of scale changes on the results of spatial pattern analysis. We assessed the impacts of eight years of protection in a protected area in northern Zagros at two different scales (both grain and extent). The effects of protection on the amount and structure of forests were scale dependent. We evaluated the nature and magnitude of changes in forest area and structure over the entire Zagros region from 1972 to 2009. We divided the Zagros region into 167 Landscape Units and developed two measures, Deforestation Sensitivity (DS) and Connectivity Sensitivity (CS), defined for each landscape unit as the percentage of time steps in which forest area (for DS) or ECA (for CS) decreased by more than 10%.
A considerable loss in forest area and connectivity was detected, but no sudden (nonlinear) changes were detected at the spatial and temporal scale of the study. Connectivity loss occurred more rapidly than forest loss due to the loss of connecting patches. More connectivity was lost in southern Zagros due to climatic differences and different forms of traditional land use.
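The sensitivity measures described in this abstract can be illustrated with a minimal Python sketch. It assumes that each decrease is measured relative to the previous time step, which the abstract does not specify; the function name `sensitivity` and the forest-area values are hypothetical and are not taken from the study.

```python
def sensitivity(series, threshold=0.10):
    """Percent of time steps in which the value fell by more than
    `threshold` relative to the previous time step (an assumed baseline)."""
    steps = list(zip(series, series[1:]))
    if not steps:
        raise ValueError("need at least two observations")
    drops = sum(
        1 for prev, curr in steps
        if prev > 0 and (prev - curr) / prev > threshold
    )
    return 100.0 * drops / len(steps)


# Hypothetical forest area (ha) for one landscape unit at five dates.
forest_area = [1000.0, 850.0, 840.0, 700.0, 690.0]
ds = sensitivity(forest_area)  # two of the four steps drop by more than 10%
print(f"Deforestation Sensitivity: {ds:.1f}%")
```

Applying the same function to a series of connectivity values would give the corresponding Connectivity Sensitivity under the same assumption.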
Abstract:
Nitrogen and water are essential for plant growth and development. In this study, we designed experiments to produce gene expression data of poplar roots under nitrogen starvation and water deprivation conditions. We found that a low concentration of nitrogen led first to increased root elongation, followed by lateral root proliferation and eventually increased root biomass. To identify genes regulating root growth and development under nitrogen starvation and water deprivation, we designed a series of data analysis procedures, through which we successfully identified biologically important genes. Differentially Expressed Genes (DEGs) analysis identified the genes that are differentially expressed under nitrogen starvation or drought. Protein domain enrichment analysis identified enriched themes (in the same domains) that are highly interactive during the treatment. Gene Ontology (GO) enrichment analysis allowed us to identify biological processes that changed during nitrogen starvation. Based on the above analyses, we examined the local Gene Regulatory Network (GRN) and identified a number of transcription factors. Testing showed that one of them is a hierarchically high-ranking transcription factor that affects root growth under nitrogen starvation. Analyzing gene expression data is tedious and time-consuming. To avoid manual analysis, we automated a computational pipeline that can now be used for identification of DEGs and protein domain analysis in a single run. It is implemented as Perl and R scripts.
Abstract:
Tropospheric ozone (O3) and carbon monoxide (CO) pollution in the Northern Hemisphere is commonly thought to be of anthropogenic origin. While this is true in most cases, copious quantities of pollutants are emitted by fires in boreal regions, and the impact of these fires on CO has been shown to significantly exceed the impact of urban and industrial sources during large fire years. The impact of boreal fires on ozone is still poorly quantified, and large uncertainties exist in the estimates of the fire-released nitrogen oxides (NOx), a critical factor in ozone production. As boreal fire activity is predicted to increase in the future due to its strong dependence on weather conditions, it is necessary to understand how these fires affect atmospheric composition. To determine the scale of boreal fire impacts on ozone and its precursors, this work combined statistical analysis of ground-based measurements downwind of fires, satellite data analysis, transport modeling and the results of chemical model simulations. The first part of this work focused on determining boreal fire impact on ozone levels downwind of fires, using analysis of observations in several-days-old fire plumes intercepted at the Pico Mountain station (Azores). The results of this study revealed that fires significantly increase midlatitude summertime ozone background during high fire years, implying that predicted future increases in boreal wildfires may affect ozone levels over large regions in the Northern Hemisphere. To improve current estimates of NOx emissions from boreal fires, we further analyzed ΔNOy/ΔCO enhancement ratios in the observed fire plumes together with transport modeling of fire emission estimates. The results of this analysis revealed the presence of a considerable seasonal trend in the fire NOx/CO emission ratio due to the late-summer changes in burning properties.
This finding implies that the constant NOx/CO emission ratio currently used in atmospheric modeling is unrealistic, and is likely to introduce a significant bias in the estimated ozone production. Finally, satellite observations were used to determine the impact of fires on atmospheric burdens of nitrogen dioxide (NO2) and formaldehyde (HCHO) in the North American boreal region. This analysis demonstrated that fires dominated the HCHO burden over the fires and in plumes up to two days old. This finding provides insights into the magnitude of secondary HCHO production and further enhances scientific understanding of the atmospheric impacts of boreal fires.
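An enhancement ratio such as ΔNOy/ΔCO is commonly computed as the excess of a species above its background level divided by the excess of CO above its background. The sketch below shows that arithmetic only; the function name and all mixing ratios are hypothetical, and the abstract does not say whether the study used this point-wise form or a regression slope across each plume.

```python
def enhancement_ratio(x_plume, x_background, co_plume, co_background):
    """Excess of species X over background per unit excess of CO."""
    delta_co = co_plume - co_background
    if delta_co <= 0:
        raise ValueError("CO in the plume must exceed the background level")
    return (x_plume - x_background) / delta_co


# Hypothetical values: NOy in pptv, CO in ppbv (not measurements from the study).
ratio = enhancement_ratio(
    x_plume=900.0, x_background=400.0,
    co_plume=180.0, co_background=80.0,
)
print(f"dNOy/dCO = {ratio:.1f} pptv/ppbv")
```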
Abstract:
Credible spatial information characterizing the structure and site quality of forests is critical to sustainable forest management and planning, especially given the increasing demands on and threats to forest products and services. Forest managers and planners are required to evaluate forest conditions over a broad range of scales, contingent on operational or reporting requirements. Traditionally, forest inventory estimates are generated via a design-based approach that involves generalizing sample plot measurements to characterize an unknown population across a larger area of interest. However, field plot measurements are costly, and as a consequence spatial coverage is limited. Remote sensing technologies have shown remarkable success in augmenting limited sample plot data to generate stand- and landscape-level spatial predictions of forest inventory attributes. Further enhancement of forest inventory approaches that couple field measurements with cutting-edge remotely sensed and geospatial datasets is essential to sustainable forest management. We evaluated a novel Random Forest based k Nearest Neighbors (RF-kNN) imputation approach to couple remote sensing and geospatial data with field inventory data collected by different sampling methods to generate forest inventory information across large spatial extents. The forest inventory data collected by the FIA program of the US Forest Service were integrated with optical remote sensing and other geospatial datasets to produce biomass distribution maps for a part of the Lake States and species-specific site index maps for the entire Lake State. Targeting small-area application of state-of-the-art remote sensing, LiDAR (light detection and ranging) data were integrated with field data collected by an inexpensive method, called variable plot sampling, in the Ford Forest of Michigan Tech to derive a standing volume map in a cost-effective way.
The outputs of the RF-kNN imputation were compared with independent validation datasets and extant map products based on different sampling and modeling strategies. The RF-kNN modeling approach was found to be very effective, especially for large-area estimation, and produced results statistically equivalent to the field observations or the estimates derived from secondary data sources. The models are useful to resource managers for operational and strategic purposes.
Abstract:
Turrialba is one of the largest and most active stratovolcanoes in the Central Cordillera of Costa Rica and an excellent target for validation of satellite data using ground-based measurements due to its high elevation, relative ease of access, and persistent elevated SO2 degassing. The Ozone Monitoring Instrument (OMI) aboard the Aura satellite makes daily global observations of atmospheric trace gases and is used in this investigation to obtain volcanic SO2 retrievals in the Turrialba volcanic plume. We present and evaluate the relative accuracy of two OMI SO2 data analysis procedures, the automatic Band Residual Index (BRI) technique and the manual Normalized Cloud-mass (NCM) method. We find a linear correlation and good quantitative agreement between SO2 burdens derived from the BRI and NCM techniques, with an improved correlation when wet season data are excluded. We also present the first comparisons between volcanic SO2 emission rates obtained from ground-based mini-DOAS measurements at Turrialba and three new OMI SO2 data analysis techniques: the MODIS smoke estimation, OMI SO2 lifetime, and OMI SO2 transect techniques. A robust validation of OMI SO2 retrievals was made, with both qualitative and quantitative agreement under specific atmospheric conditions, demonstrating the utility of satellite measurements for estimating accurate SO2 emission rates and monitoring passively degassing volcanoes.
Abstract:
Landscape structure and heterogeneity play a potentially important but little understood role in predator-prey interactions and behaviourally-mediated habitat selection. For example, habitat complexity may either reduce or enhance the efficiency of a predator's efforts to search, track, capture, kill and consume prey. For prey, structural heterogeneity may affect predator detection, avoidance and defense, escape tactics, and the ability to exploit refuges. This study investigates whether and how vegetation and topographic structure influence the spatial patterns and distribution of moose (Alces alces) mortality due to predation and malnutrition at the local and landscape levels in Isle Royale National Park. A total of 230 locations where wolves (Canis lupus) killed moose during the winters between 2002 and 2010, and 182 moose starvation death sites for the period 1996-2010, were selected from the extensive Isle Royale Wolf-Moose Project carcass database. A variety of LiDAR-derived metrics were generated and used in a Random Forest model to identify, characterize, and classify three-dimensional variables significant to each of the mortality classes. Furthermore, spatial models were developed to predict and assess the likelihood of moose mortality at the landscape scale. This research found that the patterns of moose mortality by predation and malnutrition across the landscape are non-random, have a high degree of spatial variability, and that both mechanisms operate in contexts of comparable physiographic and vegetation structure. Wolf winter hunting locations on Isle Royale are more likely to be a result of their prey's habitat selection, although wolves seem to prioritize the overall areas with higher moose density in the winter. Furthermore, the findings suggest that the distribution of moose mortality by predation is habitat-specific to moose, and not to wolves.
In addition, moose sex, age, and health condition also affect mortality site selection, as revealed by subtle differences between sites in vegetation heights, vegetation density, and topography. Vegetation density in particular appears to differentiate mortality locations for distinct classes of moose. The results also emphasize the significance of fine-scale landscape and habitat features when addressing predator-prey interactions. These finer scale findings would be easily missed if analyses were limited to the broader landscape scale alone.
Abstract:
This thesis attempts to understand why people adopt or reject individual-use renewable energy technologies (IURET). I used factors from Everett Rogers' Diffusion of Innovation Theory to understand how people's perceptions towards the characteristics of a given IURET (such as price, compatibility, complexity, etc.), the characteristics of the individual adopter (such as innovativeness and environmental awareness), and the communication network (inter-personal communications and mass media) can influence adoption. An online questionnaire was sent to 101 randomly selected Michigan households (using random digit dialing) to ask people whether or not they had adopted at least one IURET and to assess the above-mentioned factors from Rogers' theory. Data analysis was then conducted in SPSS using Chi-squared and binary logistic regression to determine the relationship between adoption behaviors (the dependent variable) and the factors from Rogers' theory (the independent variables) while controlling for education. The results show that Rogers' factors of price and observability and the control variable of education were all significant in explaining adoption but the other factors of Rogers' theory were not. For example, if individuals perceive the price of IURET to be reasonable or if they observe their neighbors using these technologies, then they are more likely to adopt. These results indicate that, if we want to promote greater adoption of IURET, we should focus our efforts on making the price of IURET more affordable through incentives and other mechanisms. Adopters should also be given some form of reward if they provide free demonstrations of their IURET in use to their neighbors to take advantage of the observability effects.
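The chi-squared step mentioned in this abstract can be illustrated with a small Python sketch of the Pearson statistic for a contingency table, without continuity correction. The thesis ran its analysis in SPSS; the function name and the counts below are hypothetical and are not the survey data.

```python
def chi_squared_statistic(table):
    """Pearson chi-squared statistic for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical counts. Rows: price perceived as reasonable / not reasonable.
# Columns: adopted an IURET / did not adopt.
observed = [[30, 20],
            [10, 40]]
stat = chi_squared_statistic(observed)
print(f"chi-squared = {stat:.3f}")
```

For a 2 x 2 table there is one degree of freedom, so a statistic above 3.841 would indicate an association at the 0.05 level.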
DIMENSION REDUCTION FOR POWER SYSTEM MODELING USING PCA METHODS CONSIDERING INCOMPLETE DATA READINGS
Abstract:
Principal Component Analysis (PCA) is a popular method for dimension reduction that can be used in many fields, including data compression, image processing, and exploratory data analysis. However, the traditional PCA method has several drawbacks: it is not efficient for high-dimensional data, and it cannot compute sufficiently accurate principal components when a relatively large portion of the data is missing. In this report, we propose using the EM-PCA method for dimension reduction of power system measurements with missing data, and provide a comparative study of the traditional PCA and EM-PCA methods. Our extensive experimental results show that the EM-PCA method is more effective and more accurate than the traditional PCA method for dimension reduction of power system measurement data when a large portion of the data set is missing.
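The idea of fitting principal components while skipping missing readings can be sketched in Python as a rank-one, EM-style alternating least-squares fit. This is an illustration under stated assumptions, not the report's algorithm: it extracts a single component, assumes the columns are already centered, assumes every row and column has at least one observed value, and uses hypothetical names and data.

```python
def em_pca_rank1(data, iterations=200):
    """Fit one component to `data` (list of rows, missing readings as None).

    Returns (scores, loadings) such that scores[i] * loadings[j]
    approximates data[i][j] on the observed entries.
    """
    n_cols = len(data[0])
    loadings = [1.0] * n_cols
    scores = [0.0] * len(data)
    for _ in range(iterations):
        # E-step: best score per row given the loadings, observed entries only.
        for i, row in enumerate(data):
            num = sum(x * loadings[j] for j, x in enumerate(row) if x is not None)
            den = sum(loadings[j] ** 2 for j, x in enumerate(row) if x is not None)
            scores[i] = num / den
        # M-step: best loading per column given the scores, observed entries only.
        for j in range(n_cols):
            num = sum(row[j] * scores[i] for i, row in enumerate(data) if row[j] is not None)
            den = sum(scores[i] ** 2 for i, row in enumerate(data) if row[j] is not None)
            loadings[j] = num / den
    return scores, loadings


# Hypothetical readings built as an exact rank-one matrix; the removed
# entry in the second row was 2 * 3 = 6 before it was set to None.
readings = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, None],
    [3.0, 6.0, 9.0],
    [4.0, 8.0, 12.0],
]
scores, loadings = em_pca_rank1(readings)
estimate = scores[1] * loadings[2]  # reconstructed value for the missing reading
print(f"estimated missing reading: {estimate:.4f}")
```

Because the fit never touches the missing entries, no initial imputation is needed; a mean-fill followed by ordinary PCA would instead bias the components toward the fill values.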