Abstract:
Selection criteria and misspecification tests for the intra-cluster correlation structure (ICS) in longitudinal data analysis are considered. In particular, the asymptotic distribution of the correlation information criterion (CIC) is derived and a new method for selecting a working ICS is proposed by standardizing the selection criterion as a p-value. The CIC test is found to be powerful in detecting misspecification of the working ICS structures, while for working ICS selection the standardized CIC test is also shown to perform satisfactorily. Simulation studies and applications to two real longitudinal datasets illustrate how these criteria and tests can be useful.
Abstract:
We consider the development of statistical models for prediction of constituent concentration of riverine pollutants, which is a key step in load estimation from frequent flow rate data and less frequently collected concentration data. We consider how to capture the impacts of past flow patterns via the average discounted flow (ADF), which discounts the past flux based on the time elapsed - more recent fluxes are given more weight. However, the effectiveness of ADF depends critically on the choice of the discount factor, which reflects the unknown environmental cumulating process of the concentration compounds. We propose to choose the discount factor by maximizing the adjusted R² values or the Nash-Sutcliffe model efficiency coefficient. The R² values are also adjusted to take account of the number of parameters in the model fit. The resulting optimal discount factor can be interpreted as a measure of constituent exhaustion rate during flood events. To evaluate the performance of the proposed regression estimators, we examine two different sampling scenarios by resampling fortnightly and opportunistically from two real daily datasets, which come from two United States Geological Survey (USGS) gaging stations located in the Des Plaines River and Illinois River basins. The generalized rating-curve approach produces biased estimates of the total sediment loads by -30% to 83%, whereas the new approaches produce much lower biases, ranging from -24% to 35%. This substantial improvement in the estimates of the total load is due to the fact that the predictability of concentration is greatly improved by the additional predictors.
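The discounting idea above can be sketched in a few lines of Python. The recursive exponential form, the grid of candidate factors, and the function names are illustrative assumptions, not the authors' exact definition of ADF:

```python
def adf(flows, delta):
    """Average discounted flow: an exponentially discounted average of
    past flows, so more recent fluxes get more weight (illustrative form)."""
    if not 0.0 <= delta < 1.0:
        raise ValueError("discount factor must lie in [0, 1)")
    a = flows[0]
    series = [a]
    for q in flows[1:]:
        a = delta * a + (1.0 - delta) * q  # older flows decay by delta each step
        series.append(a)
    return series

def r_squared(y, x):
    """R^2 of a simple linear regression of y on x."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return 0.0 if sxx == 0.0 or syy == 0.0 else sxy * sxy / (sxx * syy)

def best_discount(flows, conc, grid=(0.0, 0.25, 0.5, 0.75, 0.9)):
    """Choose the discount factor whose ADF series best predicts the
    concentrations, mirroring the maximisation described above."""
    return max(grid, key=lambda d: r_squared(conc, adf(flows, d)))
```

The paper's criterion is the adjusted R² or the Nash-Sutcliffe coefficient; plain R² is used here only to keep the sketch short.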
Abstract:
The 2008 US election has been heralded as the first presidential election of the social media era, but took place at a time when social media were still in a state of comparative infancy; so much so that the most important platform was not Facebook or Twitter, but the purpose-built campaign site my.barackobama.com, which became the central vehicle for the most successful electoral fundraising campaign in American history. By 2012, the social media landscape had changed: Facebook and, to a somewhat lesser extent, Twitter are now well-established as the leading social media platforms in the United States, and were used extensively by the campaign organisations of both candidates. As third-party spaces controlled by independent commercial entities, however, their use necessarily differs from that of home-grown, party-controlled sites: from the point of view of the platform itself, a @BarackObama or @MittRomney is technically no different from any other account, except for the very high follower count and an exceptional volume of @mentions. In spite of the significant social media experience which Democrat and Republican campaign strategists had already accumulated during the 2008 campaign, therefore, the translation of such experience to the use of Facebook and Twitter in their 2012 incarnations still required a substantial amount of new work, experimentation, and evaluation. This chapter examines the Twitter strategies of the leading accounts operated by both campaign headquarters: the ‘personal’ candidate accounts @BarackObama and @MittRomney as well as @JoeBiden and @PaulRyanVP, and the campaign accounts @Obama2012 and @TeamRomney. 
Drawing on datasets which capture all tweets from and at these accounts during the final months of the campaign (from early September 2012 to the immediate aftermath of the election night), we reconstruct the campaigns’ approaches to using Twitter for electioneering from the quantitative and qualitative patterns of their activities, and explore the resonance which these accounts have found with the wider Twitter userbase. A particular focus of our investigation in this context will be on the tweeting styles of these accounts: the mixture of original messages, @replies, and retweets, and the level and nature of engagement with everyday Twitter followers. We will examine whether the accounts chose to respond (by @replying) to the messages of support or criticism which were directed at them, whether they retweeted any such messages (and whether there was any preferential retweeting of influential or – alternatively – demonstratively ordinary users), and/or whether they were used mainly to broadcast and disseminate prepared campaign messages. Our analysis will highlight any significant differences between the accounts we examine, trace changes in style over the course of the final campaign months, and correlate such stylistic differences with the respective electoral positioning of the candidates. Further, we examine the use of these accounts during moments of heightened attention (such as the presidential and vice-presidential debates, or in the context of controversies such as that caused by the publication of the Romney “47%” video; additional case studies may emerge over the remainder of the campaign) to explore how they were used to present or defend key talking points, and exploit or avert damage from campaign gaffes. 
A complementary analysis of the messages directed at the campaign accounts (in the form of @replies or retweets) will also provide further evidence for the extent to which these talking points were picked up and disseminated by the wider Twitter population. Finally, we also explore the use of external materials (links to articles, images, videos, and other content on the campaign sites themselves, in the mainstream media, or on other platforms) by the campaign accounts, and the resonance which these materials had with the wider follower base of these accounts. This provides an indication of the integration of Twitter into the overall campaigning process, by highlighting how the platform was used as a means of encouraging the viral spread of campaign propaganda (such as advertising materials) or of directing user attention towards favourable media coverage. By building on comprehensive, large datasets of Twitter activity (as of early October, our combined datasets comprise some 3.8 million tweets) which we process and analyse using custom-designed social media analytics tools, and by using our initial quantitative analysis to guide further qualitative evaluation of Twitter activity around these campaign accounts, we are able to provide an in-depth picture of the use of Twitter in political campaigning during the 2012 US election, which yields detailed new insights into social media use in contemporary elections. This analysis will then also be able to serve as a touchstone for the analysis of social media use in subsequent elections, in the USA as well as in other developed nations where Twitter and other social media platforms are utilised in electioneering.
Abstract:
Robust estimation often relies on a dispersion function that is more slowly varying at large values than the square function. However, the choice of tuning constant in dispersion functions may impact the estimation efficiency to a great extent. For a given family of dispersion functions such as the Huber family, we suggest obtaining the "best" tuning constant from the data so that the asymptotic efficiency is maximized. This data-driven approach can automatically adjust the value of the tuning constant to provide the necessary resistance against outliers. Simulation studies show that substantial efficiency can be gained by this data-dependent approach compared with the traditional approach in which the tuning constant is fixed. We briefly illustrate the proposed method using two datasets.
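One way to make the data-driven choice concrete is to minimise a sandwich estimate of the asymptotic variance over a grid of candidate tuning constants. This sketch assumes a simple location model with Huber's psi function; the grid, the variable names, and the plug-in variance formula are illustrative, not the paper's exact construction:

```python
def huber_psi(e, c):
    """Huber's psi function: the identity in the middle, clipped at +/- c."""
    return max(-c, min(c, e))

def asymptotic_variance(resid, c):
    """Sandwich (plug-in) estimate of the asymptotic variance of a Huber
    M-estimator of location: n * sum(psi^2) / (sum(psi'))^2."""
    n = len(resid)
    num = sum(huber_psi(e, c) ** 2 for e in resid)
    den = sum(1.0 for e in resid if abs(e) <= c)  # psi'(e) = 1 iff |e| <= c
    return float("inf") if den == 0 else n * num / den ** 2

def best_tuning_constant(resid, grid=(0.5, 1.0, 1.345, 2.0, 3.0)):
    """Data-driven choice: the constant minimising the estimated variance,
    i.e. maximising the estimated asymptotic efficiency."""
    return min(grid, key=lambda c: asymptotic_variance(resid, c))
```

With a gross outlier in the residuals, a moderate constant yields a smaller estimated variance than a very large one, which is the resistance the abstract describes.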
Abstract:
The Macroscopic Fundamental Diagram (MFD) relates space-mean density and flow. Since the MFD represents the area-wide network traffic performance, studies on perimeter control strategies and network-wide traffic state estimation utilising the MFD concept have been reported. Most previous works have utilised data from fixed sensors, such as inductive loops, to estimate the MFD, which can cause biased estimation in urban networks due to queue spillovers at intersections. To overcome this limitation, recent literature reports the use of trajectory data obtained from probe vehicles. However, these studies have been conducted using simulated datasets; few works have discussed the limitations of real datasets and their impact on the variable estimation. This study compares two methods for estimating traffic state variables of signalised arterial sections: a method based on cumulative vehicle counts (CUPRITE), and one based on vehicle trajectories from taxi Global Positioning System (GPS) logs. The comparisons reveal some characteristics of the taxi trajectory data available in Brisbane, Australia. The current trajectory data have limitations in quantity (i.e., the penetration rate), due to which the traffic state variables tend to be underestimated. Nevertheless, the trajectory-based method successfully captures the features of traffic states, which suggests that trajectories from taxis can be a good estimator for the network-wide traffic states.
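Trajectory-based traffic state variables are conventionally computed with Edie's generalized definitions over a space-time region; whether this study uses exactly this formulation is not stated in the abstract, so the sketch below is a standard illustration with assumed names:

```python
def edie_estimates(trajectories, road_length, period):
    """Edie's generalized definitions over a space-time region of
    'area' road_length * period:
      flow    = total distance travelled / area
      density = total time spent        / area
    Each trajectory is a list of (time, position) samples. With only
    probe vehicles observed, a penetration rate p < 1 scales both
    estimates down by roughly p, hence the underestimation noted above."""
    area = road_length * period
    dist = 0.0
    time = 0.0
    for traj in trajectories:
        for (t0, x0), (t1, x1) in zip(traj, traj[1:]):
            dist += abs(x1 - x0)
            time += t1 - t0
    return dist / area, time / area
```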
Abstract:
The oncogene MDM4, also known as MDMX or HDMX, contributes to cancer susceptibility and progression through its capacity to negatively regulate a range of genes with tumour-suppressive functions. As part of a recent genome-wide association study it was determined that the A-allele of the rs4245739 SNP (A>C), located in the 3'-UTR of MDM4, is associated with an increased risk of prostate cancer. Computational predictions revealed that the rs4245739 SNP is located within a predicted binding site for three microRNAs (miRNAs): miR-191-5p, miR-887 and miR-3669. Herein, we show using reporter gene assays and endogenous MDM4 expression analyses that miR-191-5p and miR-887 have a specific affinity for the rs4245739 SNP C-allele in prostate cancer. These miRNAs do not affect MDM4 mRNA levels, rather they inhibit its translation in C-allele-containing PC3 cells but not in LNCaP cells homozygous for the A-allele. By analysing gene expression datasets from patient cohorts, we found that MDM4 is associated with metastasis and prostate cancer progression and that targeting this gene with miR-191-5p or miR-887 decreases PC3 cell viability. This study is the first, to our knowledge, to demonstrate regulation of the MDM4 rs4245739 SNP C-allele by two miRNAs in prostate cancer, and thereby to identify a mechanism by which the MDM4 rs4245739 SNP A-allele may be associated with an increased risk for prostate cancer.
Abstract:
A global climate model experiment is performed to evaluate the effect of irrigation on temperatures in several major irrigated regions of the world. The Community Atmosphere Model, version 3.3, was modified to represent irrigation for the fraction of each grid cell equipped for irrigation according to datasets from the Food and Agriculture Organization. Results indicate substantial regional differences in the magnitude of irrigation-induced cooling, which are attributed to three primary factors: differences in extent of the irrigated area, differences in the simulated soil moisture for the control simulation (without irrigation), and the nature of cloud response to irrigation. The last factor appeared especially important for the dry season in India, although further analysis with other models and observations is needed to verify this feedback. Comparison with observed temperatures revealed substantially lower biases in several regions for the simulation with irrigation than for the control, suggesting that the lack of irrigation may be an important component of temperature bias in this model or that irrigation compensates for other biases. The results of this study should help to translate the results from past regional efforts, which have largely focused on the United States, to regions in the developing world that in many cases continue to experience significant expansion of irrigated land.
Abstract:
Many statistical forecast systems are available to interested users. In order to be useful for decision-making, these systems must be based on evidence of underlying mechanisms. Once causal connections between the mechanism and their statistical manifestation have been firmly established, the forecasts must also provide some quantitative evidence of ‘quality’. However, the quality of statistical climate forecast systems (forecast quality) is an ill-defined and frequently misunderstood property. Often, providers and users of such forecast systems are unclear about what ‘quality’ entails and how to measure it, leading to confusion and misinformation. Here we present a generic framework to quantify aspects of forecast quality using an inferential approach to calculate nominal significance levels (p-values) that can be obtained either by directly applying non-parametric statistical tests such as Kruskal-Wallis (KW) or Kolmogorov-Smirnov (KS) or by using Monte-Carlo methods (in the case of forecast skill scores). Once converted to p-values, these forecast quality measures provide a means to objectively evaluate and compare temporal and spatial patterns of forecast quality across datasets and forecast systems. Our analysis demonstrates the importance of providing p-values rather than adopting some arbitrarily chosen significance levels such as p < 0.05 or p < 0.01, which is still common practice. This is illustrated by applying non-parametric tests (such as KW and KS) and skill scoring methods (LEPS and RPSS) to the 5-phase Southern Oscillation Index classification system using historical rainfall data from Australia, The Republic of South Africa and India. The selection of quality measures is solely based on their common use and does not constitute endorsement. We found that non-parametric statistical tests can be adequate proxies for skill measures such as LEPS or RPSS. 
The framework can be implemented anywhere, regardless of dataset, forecast system or quality measure. Eventually such inferential evidence should be complemented by descriptive statistical methods in order to fully assist in operational risk management.
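Converting a skill score to a p-value by Monte-Carlo methods, as described above, can be sketched as a permutation test. The paper's scores are LEPS and RPSS; the toy hit-rate score, the function names, and the add-one correction here are illustrative assumptions:

```python
import random

def hit_rate(forecasts, observations):
    """Toy skill score: the fraction of categorical forecasts that verify."""
    hits = sum(f == o for f, o in zip(forecasts, observations))
    return hits / len(forecasts)

def monte_carlo_p_value(score_fn, forecasts, observations, n_perm=999, seed=1):
    """One-sided Monte-Carlo p-value for a forecast skill score: compare
    the observed score against its distribution under the no-skill null,
    generated by shuffling the observations."""
    rng = random.Random(seed)
    observed = score_fn(forecasts, observations)
    shuffled = list(observations)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if score_fn(forecasts, shuffled) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)  # add-one correction keeps p > 0
```

Reporting the resulting p-value directly, rather than thresholding it at an arbitrary level, is exactly the practice the abstract argues for.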
Abstract:
Background Fusion transcripts are found in many tissues and have the potential to create novel functional products. Here, we investigate the genomic sequences around fusion junctions to better understand the transcriptional mechanisms mediating fusion transcription/splicing. We analyzed data from prostate (cancer) cells as previous studies have shown extensively that these cells readily undergo fusion transcription. Results We used the FusionMap program to identify high-confidence fusion transcripts from RNAseq data. The RNAseq datasets were from our (N = 8) and other (N = 14) clinical prostate tumors with adjacent non-cancer cells, and from the LNCaP prostate cancer cell line that were mock-, androgen- (DHT), and anti-androgen- (bicalutamide, enzalutamide) treated. In total, 185 fusion transcripts were identified from all RNAseq datasets. The majority (76 %) of these fusion transcripts were ‘read-through chimeras’ derived from adjacent genes in the genome. Characterization of sequences at fusion loci was carried out using a combination of the FusionMap program, custom Perl scripts, and the RNAfold program. Our computational analysis indicated that most fusion junctions (76 %) use the consensus GT-AG intron donor-acceptor splice site, and most fusion transcripts (85 %) maintained the open reading frame. We assessed whether parental genes of fusion transcripts have the potential to form complementary base pairs that might bring them into physical proximity. Our computational analysis of sequences flanking fusion junctions at parental loci indicates that these loci have a similar propensity as non-fusion loci to hybridize. The abundance of repetitive sequences at fusion and non-fusion loci was also investigated given that SINE repeats are involved in aberrant gene transcription. We found few instances of repetitive sequences at both fusion and non-fusion junctions. 
Finally, RT-qPCR was performed on RNA from both clinical prostate tumors and adjacent non-cancer cells (N = 7), and LNCaP cells treated as above to validate the expression of seven fusion transcripts and their respective parental genes. We reveal that fusion transcript expression is similar to the expression of parental genes. Conclusions Fusion transcripts maintain the open reading frame, and likely use the same transcriptional machinery as non-fusion transcripts as they share many genomic features at splice/fusion junctions.
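The two sequence properties tallied above can be sketched as simple checks. The GT-AG test follows the standard consensus motif; the reading-frame rule below is an illustrative encoding with assumed parameter names, not the paper's actual pipeline:

```python
def uses_canonical_splice_site(intron_seq):
    """True if the intron starts with the GT donor motif and ends with
    the AG acceptor motif (consensus GT-AG, sense strand)."""
    s = intron_seq.upper()
    return len(s) >= 4 and s.startswith("GT") and s.endswith("AG")

def junction_in_frame(upstream_cds_len, downstream_offset):
    """Illustrative reading-frame check: the fusion keeps the open
    reading frame when the upstream coding length and the downstream
    start offset agree modulo 3."""
    return (upstream_cds_len - downstream_offset) % 3 == 0
```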
Abstract:
A nucleosome forms a basic unit of the chromosome structure. A biologically relevant question is how much of the nucleosomal conformational space is accessible to protein-free DNA, and what proportion of the nucleosomal conformations are induced by bound histones. To investigate this, we have analysed high-resolution X-ray crystal structure datasets of DNA in protein-free as well as protein-bound forms, and compared the dinucleotide step parameters for the two datasets with those for high-resolution nucleosome structures. Our analysis shows that most of the dinucleotide step parameter values for the nucleosome structures lie within the range accessible to protein-free DNA, indirectly indicating that the histone core plays more of a stabilizing role. The nucleosome structures are observed to assume smooth and nearly planar curvature, implying that ‘normal’ B-DNA like parameters can give rise to a curved geometry at the gross structural level. Different nucleosome
Abstract:
This paper describes a vision-only system for place recognition in environments that are traversed at different times of day, when changing conditions drastically affect visual appearance, and at different speeds, where places are not visited at a consistent linear rate. The major contribution is the removal of wheel-based odometry from the previously presented algorithm (SMART), allowing the technique to operate on any camera-based device; in our case a mobile phone. While we show that the direct application of visual odometry to our night-time datasets does not achieve a level of performance typically needed, the VO requirements of SMART are orthogonal to typical usage: firstly, only the magnitude of the velocity is required, and secondly, the calculated velocity signal only needs to be repeatable in any one part of the environment over day and night cycles, but not necessarily globally consistent. Our results show that the smoothing effect of motion constraints is highly beneficial for achieving a locally consistent, lighting-independent velocity estimate. We also show that the advantage of our patch-based technique used previously for frame recognition, surprisingly, does not transfer to VO, where SIFT demonstrates equally good performance. Nevertheless, we present the SMART system using only vision, which performs sequence-based place recognition in extreme low-light conditions where standard 6-DOF VO fails, and improves place recognition performance over odometry-less benchmarks, approaching that of wheel odometry.
Abstract:
Background: Molecular marker technologies are undergoing a transition from largely serial assays measuring DNA fragment sizes to hybridization-based technologies with high multiplexing levels. Diversity Arrays Technology (DArT) is a hybridization-based technology that is increasingly being adopted by barley researchers. There is a need to integrate the information generated by DArT with previous data produced with gel-based marker technologies. The goal of this study was to build a high-density consensus linkage map from the combined datasets of ten populations, most of which were simultaneously typed with DArT and Simple Sequence Repeat (SSR), Restriction Fragment Length Polymorphism (RFLP) and/or Sequence Tagged Site (STS) markers. Results: The consensus map, built using a combination of JoinMap 3.0 software and several purpose-built Perl scripts, comprised 2,935 loci (2,085 DArT, 850 other loci) and spanned 1,161 cM. It contained a total of 1,629 'bins' (unique loci), with an average inter-bin distance of 0.7 ± 1.0 cM (median = 0.3 cM). More than 98% of the map could be covered with a single DArT assay. The arrangement of loci was very similar to, and almost as optimal as, the arrangement of loci in component maps built for individual populations. The locus order of a synthetic map derived from merging the component maps without considering the segregation data was only slightly inferior. The distribution of loci along chromosomes indicated centromeric suppression of recombination in all chromosomes except 5H. DArT markers appeared to have a moderate tendency toward hypomethylated, gene-rich regions in distal chromosome areas. On average, 14 ± 9 DArT loci were identified within 5 cM on either side of SSR, RFLP or STS loci previously identified as linked to agricultural traits. 
Conclusion: Our barley consensus map provides a framework for transferring genetic information between different marker systems and for deploying DArT markers in molecular breeding schemes. The study also highlights the need for improved software for building consensus maps from high-density segregation data of multiple populations.
Abstract:
This study examined the nature and lifetime prevalence of two types of victimization among Finnish university students: stalking and violence victimization (i.e. general violence). This was a cross-sectional study using two different datasets of Finnish university students. The stalking data was collected via an electronic questionnaire and the violence victimization data was collected via a postal questionnaire. There were 615 participants in the stalking study (Studies I-III) and 905 participants in the violence victimization study. The thesis consists of four studies. The aims of the stalking substudies (Studies I-III) were to examine the lifetime prevalence of stalking among university students and to analyze how stalking is related to victim and stalker characteristics and certain central variables of stalking (victim-stalker relationship, stalking episodes, stalking duration). Specifically, the aim was to identify factors associated with stalking violence and factors contributing to stalking duration. Furthermore, the aim was also to investigate how university students cope with stalking and whether coping is related to victim and stalker background characteristics and to certain other core variables (victim-stalker relationship, stalking episodes, stalking duration, prior victimization, and stalking violence). The aims of the violence victimization substudy (Study IV) were to examine the prevalence of violence victimization, i.e. general violence (minor and serious physical violence and threats), and how violence victimization is associated with victim/abuser characteristics, symptomology, and the use of student health care services. The present study shows that both stalking and violence victimization (i.e. general violence) are markedly prevalent among Finnish university students. The lifetime prevalence rate was 48.5% for stalking and 46.5% for violence victimization. 
When the lifetime prevalence rate was restricted to violent stalking and physical violence only, the prevalence decreased to 22% and 42% respectively. The students reported exposure to multiple forms of stalking and violence victimization, demonstrating the diversity of victimization among university students. Stalking victimization was found to be more prevalent among female students, while violence victimization was found to be more prevalent among male students. Most of the victims of stalking knew their stalkers, while the offender in general violence was typically a stranger. Stalking victimization often included violence and continued for a lengthy period. The victim-stalker relationship and stalking behaviors were found to be associated with stalking violence and stalking duration. Based on three identified stalking dimensions (violence, surveillance, contact seeking), the present study found five distinct victim subgroups (classes). Along with the victim-stalker relationship, the victim subgroups emerged as important factors contributing to the stalking duration. Victims of violent stalking did not differ greatly from victims of non-violent stalking in their use of behavioral coping tactics, while exposure to violent stalking had an effect on the use of coping strategies. The victim-offender relationship was also associated with a set of symptoms regarding violence victimization. Furthermore, violence victimization had a significant main effect on specific symptoms (mental health symptoms, alcohol consumption, symptom index), while gender had a significant main effect on most symptoms, yet no interaction effect was found. The present results also show that victims of violence are overrepresented among frequent health care users. The present findings add to the literature on the prevalence and nature of stalking and violence victimization among Finnish university students. 
Moreover, the present findings stress the importance of violence prevention and intervention in student health care, and may be used as a guideline for policy makers, as well as health care and law enforcement professionals dealing with youth violence prevention.
Abstract:
This paper addresses the following predictive business process monitoring problem: Given the execution trace of an ongoing case, and given a set of traces of historical (completed) cases, predict the most likely outcome of the ongoing case. In this context, a trace refers to a sequence of events with corresponding payloads, where a payload consists of a set of attribute-value pairs. Meanwhile, an outcome refers to a label associated to completed cases, such as a label indicating that a given case completed “on time” (with respect to a given desired duration) or “late”, or a label indicating that a given case led to a customer complaint or not. The paper tackles this problem via a two-phase approach. In the first phase, prefixes of historical cases are encoded using complex symbolic sequences and clustered. In the second phase, a classifier is built for each of the clusters. To predict the outcome of an ongoing case at runtime given its (uncompleted) trace, we select the closest cluster(s) to the trace in question and apply the respective classifier(s), taking into account the Euclidean distance of the trace from the center of the clusters. We consider two families of clustering algorithms – hierarchical clustering and k-medoids – and use random forests for classification. The approach was evaluated on four real-life datasets.
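The two-phase approach can be sketched as follows. This is a deliberately stripped-down stand-in: nearest-center assignment replaces the hierarchical/k-medoids clustering, and a per-cluster majority vote replaces the random forests; all names are illustrative:

```python
def euclidean(a, b):
    """Euclidean distance between two encoded prefix vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def train(prefix_vectors, outcomes, centers):
    """Phase 1: assign each encoded prefix to its nearest cluster center.
    Phase 2: build a trivial majority-vote 'classifier' per cluster
    (a stand-in for training one random forest per cluster)."""
    clusters = {i: [] for i in range(len(centers))}
    for vec, label in zip(prefix_vectors, outcomes):
        i = min(range(len(centers)), key=lambda j: euclidean(vec, centers[j]))
        clusters[i].append(label)
    return {i: max(set(labels), key=labels.count) if labels else None
            for i, labels in clusters.items()}

def predict(vec, centers, per_cluster_model):
    """Runtime: pick the closest cluster and apply its classifier."""
    i = min(range(len(centers)), key=lambda j: euclidean(vec, centers[j]))
    return per_cluster_model[i]
```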
Abstract:
Whilst the topic of soil salinity has received a substantive research effort over the years, the accurate measurement and interpretation of salinity tolerance data remain problematic. The tolerance of four perennial grass species (non-halophytes) to sodium chloride (NaCl) dominated salinity was determined in a free-flowing sand culture system. Although the salinity tolerance of non-halophytes is often represented by the threshold salinity model (bent-stick model), none of the species in the current study displayed any observable salinity threshold. Further, the observed yield decrease was not linear as suggested by the model. On re-examination of earlier datasets, we conclude that the threshold salinity model does not adequately describe the physiological processes limiting growth of non-halophytes in saline soils. Therefore, the use of the threshold salinity model is not recommended for non-halophytes; rather, a model that more accurately reflects the physiological response observed in these saline soils, such as an exponential regression curve, should be used.
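The two competing response models can be written down in a few lines. The bent-stick form follows the widely used threshold-slope (Maas-Hoffman-style) parameterisation; the exponential alternative and the parameter names are illustrative assumptions, not the authors' fitted model:

```python
import math

def bent_stick_yield(ec, threshold, slope):
    """'Bent-stick' threshold salinity model: full relative yield up to
    a salinity threshold, then a linear decline (floored at zero)."""
    return 1.0 if ec <= threshold else max(0.0, 1.0 - slope * (ec - threshold))

def exponential_yield(ec, k):
    """Exponential alternative with no threshold: relative yield declines
    from the first increment of salinity, as observed for these grasses."""
    return math.exp(-k * ec)
```

The qualitative difference is visible immediately: the bent-stick model predicts no yield loss below the threshold, whereas the exponential curve declines everywhere.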