340 results for datasets


Relevance:

10.00%

Publisher:

Abstract:

Document clustering is one of the prominent methods for mining important information from the vast amount of data available on the web. However, document clustering generally suffers from the curse of dimensionality. Fortunately, in high-dimensional space data points tend to be more concentrated in some areas of clusters. We take advantage of this phenomenon by introducing a novel concept of dynamic cluster representation termed loci. Clusters’ loci are efficiently calculated using documents’ ranking scores generated by a search engine. We propose a fast loci-based semi-supervised document clustering algorithm that uses clusters’ loci instead of conventional centroids for assigning documents to clusters. Empirical analysis on real-world datasets shows that the proposed method produces cluster solutions of promising quality and is substantially faster than several benchmark centroid-based semi-supervised document clustering methods.
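
The abstract leaves the exact loci construction to the paper; as a minimal, hypothetical sketch of the general idea (cluster representatives built from search-engine ranking scores rather than from centroids), assuming dense TF-IDF document vectors, a per-cluster matrix of ranking scores, and a top-k ranking-weighted average standing in for the locus:

    # Hypothetical sketch only; the published algorithm defines loci in more detail.
    import numpy as np

    def compute_loci(doc_vectors, ranking_scores, top_k=20):
        """For each cluster, form a 'locus' as the ranking-weighted mean of the
        top_k documents with the highest ranking scores for that cluster.
        doc_vectors    : (n_docs, n_features) dense array, e.g. TF-IDF vectors
        ranking_scores : (n_clusters, n_docs) relevance scores returned by a
                         search engine queried with cluster seed terms (assumed input)
        """
        loci = []
        for scores in ranking_scores:
            top = np.argsort(scores)[::-1][:top_k]            # best-ranked documents
            w = scores[top] / scores[top].sum()                # normalise ranking weights
            loci.append((w[:, None] * doc_vectors[top]).sum(axis=0))
        return np.vstack(loci)

    def assign_to_loci(doc_vectors, loci):
        # Assign each document to the nearest locus by cosine similarity.
        norm_docs = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
        norm_loci = loci / np.linalg.norm(loci, axis=1, keepdims=True)
        return (norm_docs @ norm_loci.T).argmax(axis=1)

Assignment by nearest locus simply replaces the centroid-distance step of conventional semi-supervised clustering; everything else about the sketch is an assumption.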

Relevance:

10.00%

Publisher:

Abstract:

- Objective: To explore the potential for using a basic text search of routine emergency department (ED) data to identify product-related injury in infants, and to compare the patterns from routine ED data and specialised injury surveillance data.
- Methods: Data were sourced from the Emergency Department Information System (EDIS) and the Queensland Injury Surveillance Unit (QISU) for all injured infants between 2009 and 2011. A basic text search was developed to identify the top five infant products in QISU. Sensitivity, specificity, and positive predictive value were calculated, and a refined search was then applied to EDIS. Results were manually reviewed to assess validity. Descriptive analysis was conducted to examine patterns between datasets.
- Results: The basic text search for all products showed high sensitivity and specificity, and most searches showed high positive predictive value. EDIS patterns were similar to QISU patterns, with strikingly similar month-of-age injury peaks, admission proportions and types of injuries.
- Conclusions: This study demonstrated the capacity to identify a sample of valid cases of product-related injuries for specified products using simple text searching of routine ED data.
- Implications: As the capacity for large datasets grows and the capability to reliably mine text improves, opportunities for expanded sources of injury surveillance data increase. This will ultimately assist stakeholders such as consumer product safety regulators and child safety advocates to appropriately target prevention initiatives.
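
For reference, the screening metrics reported above follow directly from the 2x2 table of text-search hits against the manually reviewed classification; a minimal sketch (the counts below are illustrative, not from the QISU/EDIS data):

    def screening_metrics(tp, fp, fn, tn):
        """Standard 2x2 screening metrics for a text-search case-identification rule.
        tp/fp/fn/tn: true/false positives/negatives against the manual reference review."""
        sensitivity = tp / (tp + fn)   # proportion of true product-related cases found
        specificity = tn / (tn + fp)   # proportion of non-cases correctly excluded
        ppv = tp / (tp + fp)           # proportion of search hits that are true cases
        return sensitivity, specificity, ppv

    # Illustrative numbers only:
    print(screening_metrics(tp=180, fp=20, fn=15, tn=785))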

Relevance:

10.00%

Publisher:

Abstract:

Selection criteria and misspecification tests for the intra-cluster correlation structure (ICS) in longitudinal data analysis are considered. In particular, the asymptotic distribution of the correlation information criterion (CIC) is derived, and a new method for selecting a working ICS is proposed by standardizing the selection criterion as a p-value. The CIC test is found to be powerful in detecting misspecification of the working ICS, while for working ICS selection the standardized CIC test also shows satisfactory performance. Simulation studies and applications to two real longitudinal datasets illustrate how these criteria and tests might be useful.
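
The paper's contribution is the asymptotic distribution of the CIC and its standardisation as a p-value, which is not reproduced here. As a rough sketch of the underlying criterion only, the CIC is commonly written as the trace of the model-based information under working independence multiplied by the robust (sandwich) covariance under the candidate structure, with smaller values preferred; assuming those matrices have already been produced by a GEE fit:

    import numpy as np

    def cic(model_info_indep, robust_cov):
        """Correlation information criterion, trace(Omega_I @ V_R).
        model_info_indep : inverse of the model-based covariance of beta-hat
                           under the working-independence structure
        robust_cov       : sandwich covariance of beta-hat under the candidate
                           working intra-cluster correlation structure"""
        return float(np.trace(model_info_indep @ robust_cov))

    def select_working_structure(model_info_indep, robust_covs):
        """Pick the working ICS with the smallest CIC.
        robust_covs: dict mapping structure name -> sandwich covariance matrix."""
        scores = {name: cic(model_info_indep, v) for name, v in robust_covs.items()}
        return min(scores, key=scores.get), scores

This illustrates only the basic comparison step, not the standardized test proposed in the paper.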

Relevance:

10.00%

Publisher:

Abstract:

We consider the development of statistical models for predicting constituent concentrations of riverine pollutants, which is a key step in load estimation from frequent flow-rate data and less frequently collected concentration data. We consider how to capture the impacts of past flow patterns via the average discounted flow (ADF), which discounts past flux according to the time elapsed, so that more recent fluxes are given more weight. However, the effectiveness of the ADF depends critically on the choice of the discount factor, which reflects the unknown environmental accumulation process of the concentration compounds. We propose to choose the discount factor by maximizing the adjusted R² value or the Nash-Sutcliffe model efficiency coefficient; the R² values are adjusted to take account of the number of parameters in the model fit. The resulting optimal discount factor can be interpreted as a measure of the constituent exhaustion rate during flood events. To evaluate the performance of the proposed regression estimators, we examine two different sampling scenarios by resampling fortnightly and opportunistically from two real daily datasets, which come from two United States Geological Survey (USGS) gaging stations located in the Des Plaines River and Illinois River basins. The generalized rating-curve approach produces biased estimates of the total sediment loads, ranging from -30% to 83%, whereas the new approaches produce much lower biases, ranging from -24% to 35%. This substantial improvement in the estimates of the total load is due to the fact that the predictability of concentration is greatly improved by the additional predictors.
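
One plausible reading of the ADF construction and of the discount-factor selection by adjusted R² is sketched below; the exact ADF normalisation and the regression form are assumptions, not taken from the paper:

    import numpy as np

    def adf(flow, delta):
        """Average discounted flow: an exponentially discounted average of past
        flows, giving more weight to recent fluxes (normalisation is an assumption)."""
        flow = np.asarray(flow, dtype=float)
        out = np.empty_like(flow)
        acc, wsum = 0.0, 0.0
        for t, q in enumerate(flow):
            acc = delta * acc + q
            wsum = delta * wsum + 1.0
            out[t] = acc / wsum
        return out

    def adjusted_r2(y, X):
        """Adjusted R^2 of an OLS fit of y on X (with intercept)."""
        X1 = np.column_stack([np.ones(len(y)), X])
        beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
        resid = y - X1 @ beta
        r2 = 1.0 - (resid @ resid) / (((y - y.mean()) ** 2).sum())
        n, p = X1.shape
        return 1.0 - (1.0 - r2) * (n - 1) / (n - p)

    def choose_discount(conc, flow, deltas=np.linspace(0.5, 0.99, 50)):
        """Grid-search the discount factor that maximises adjusted R^2 of a
        regression of concentration on flow and ADF (illustrative model form)."""
        return max(deltas,
                   key=lambda d: adjusted_r2(conc, np.column_stack([flow, adf(flow, d)])))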

Relevance:

10.00%

Publisher:

Abstract:

The 2008 US election has been heralded as the first presidential election of the social media era, but it took place at a time when social media were still in a state of comparative infancy; so much so that the most important platform was not Facebook or Twitter, but the purpose-built campaign site my.barackobama.com, which became the central vehicle for the most successful electoral fundraising campaign in American history. By 2012, the social media landscape had changed: Facebook and, to a somewhat lesser extent, Twitter are now well established as the leading social media platforms in the United States, and were used extensively by the campaign organisations of both candidates. As third-party spaces controlled by independent commercial entities, however, their use necessarily differs from that of home-grown, party-controlled sites: from the point of view of the platform itself, a @BarackObama or @MittRomney is technically no different from any other account, except for the very high follower count and an exceptional volume of @mentions. In spite of the significant social media experience which Democrat and Republican campaign strategists had already accumulated during the 2008 campaign, therefore, the translation of such experience to the use of Facebook and Twitter in their 2012 incarnations still required a substantial amount of new work, experimentation, and evaluation.

This chapter examines the Twitter strategies of the leading accounts operated by both campaign headquarters: the ‘personal’ candidate accounts @BarackObama and @MittRomney as well as @JoeBiden and @PaulRyanVP, and the campaign accounts @Obama2012 and @TeamRomney. Drawing on datasets which capture all tweets from and at these accounts during the final months of the campaign (from early September 2012 to the immediate aftermath of the election night), we reconstruct the campaigns’ approaches to using Twitter for electioneering from the quantitative and qualitative patterns of their activities, and explore the resonance which these accounts have found with the wider Twitter userbase.

A particular focus of our investigation in this context will be on the tweeting styles of these accounts: the mixture of original messages, @replies, and retweets, and the level and nature of engagement with everyday Twitter followers. We will examine whether the accounts chose to respond (by @replying) to the messages of support or criticism which were directed at them, whether they retweeted any such messages (and whether there was any preferential retweeting of influential or, alternatively, demonstratively ordinary users), and/or whether they were used mainly to broadcast and disseminate prepared campaign messages. Our analysis will highlight any significant differences between the accounts we examine, trace changes in style over the course of the final campaign months, and correlate such stylistic differences with the respective electoral positioning of the candidates. Further, we examine the use of these accounts during moments of heightened attention (such as the presidential and vice-presidential debates, or in the context of controversies such as that caused by the publication of the Romney “47%” video; additional case studies may emerge over the remainder of the campaign) to explore how they were used to present or defend key talking points, and to exploit or avert damage from campaign gaffes.

A complementary analysis of the messages directed at the campaign accounts (in the form of @replies or retweets) will also provide further evidence for the extent to which these talking points were picked up and disseminated by the wider Twitter population. Finally, we also explore the use of external materials (links to articles, images, videos, and other content on the campaign sites themselves, in the mainstream media, or on other platforms) by the campaign accounts, and the resonance which these materials had with the wider follower base of these accounts. This provides an indication of the integration of Twitter into the overall campaigning process, by highlighting how the platform was used as a means of encouraging the viral spread of campaign propaganda (such as advertising materials) or of directing user attention towards favourable media coverage.

By building on comprehensive, large datasets of Twitter activity (as of early October, our combined datasets comprise some 3.8 million tweets), which we process and analyse using custom-designed social media analytics tools, and by using our initial quantitative analysis to guide further qualitative evaluation of Twitter activity around these campaign accounts, we are able to provide an in-depth picture of the use of Twitter in political campaigning during the 2012 US election, one which offers detailed new insights into social media use in contemporary elections. This analysis will then also be able to serve as a touchstone for the analysis of social media use in subsequent elections, in the USA as well as in other developed nations where Twitter and other social media platforms are utilised in electioneering.
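
The tweeting-style breakdown described above (original tweets versus @replies versus retweets per account) can be approximated from captured tweet objects with a simple classification rule; a sketch assuming a list of dicts with 'user', 'text' and optional 'retweeted_status'/'in_reply_to' fields (the field names are illustrative, not the chapter's data model):

    from collections import Counter, defaultdict

    def tweet_style(tweet):
        """Very rough style classification used for per-account activity profiles."""
        if tweet.get("retweeted_status") or tweet["text"].startswith("RT @"):
            return "retweet"
        if tweet.get("in_reply_to") or tweet["text"].startswith("@"):
            return "@reply"
        return "original"

    def style_profile(tweets):
        """Counts of original tweets, @replies and retweets for each account."""
        profile = defaultdict(Counter)
        for t in tweets:
            profile[t["user"]][tweet_style(t)] += 1
        return profile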

Relevance:

10.00%

Publisher:

Abstract:

Robust estimation often relies on a dispersion function that is more slowly varying at large values than the square function. However, the choice of tuning constant in the dispersion function can affect estimation efficiency to a great extent. For a given family of dispersion functions, such as the Huber family, we suggest obtaining the "best" tuning constant from the data so that the asymptotic efficiency is maximized. This data-driven approach can automatically adjust the value of the tuning constant to provide the necessary resistance against outliers. Simulation studies show that substantial efficiency can be gained by this data-dependent approach compared with the traditional approach, in which the tuning constant is fixed. We briefly illustrate the proposed method using two datasets.
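
For the Huber family, the asymptotic variance of the location M-estimator at tuning constant c is E[psi_c^2] / (E[psi_c'])^2, so a data-driven c can be found by minimising the empirical version of this variance over a grid. The sketch below covers only the location case with a MAD scale, not the paper's full treatment, and is an illustration rather than the authors' procedure:

    import numpy as np

    def huber_psi(r, c):
        return np.clip(r, -c, c)

    def estimated_avar(x, c):
        """Empirical asymptotic variance of the Huber location estimator at c:
        mean(psi^2) / mean(psi')^2, computed on MAD-standardised residuals."""
        med = np.median(x)
        scale = 1.4826 * np.median(np.abs(x - med))       # MAD scale estimate
        r = (x - med) / scale
        psi = huber_psi(r, c)
        dpsi = (np.abs(r) <= c).astype(float)
        return np.mean(psi ** 2) / np.mean(dpsi) ** 2

    def best_tuning_constant(x, grid=np.linspace(0.5, 3.0, 26)):
        """Data-driven choice of c: the grid value minimising the estimated
        asymptotic variance, i.e. maximising asymptotic efficiency."""
        return min(grid, key=lambda c: estimated_avar(x, c))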

Relevance:

10.00%

Publisher:

Abstract:

The Macroscopic Fundamental Diagram (MFD) relates space-mean density and flow. Since the MFD represents area-wide network traffic performance, studies on perimeter control strategies and network-wide traffic state estimation utilising the MFD concept have been reported. Most previous works have utilised data from fixed sensors, such as inductive loops, to estimate the MFD, which can cause biased estimation in urban networks due to queue spillovers at intersections. To overcome this limitation, recent literature reports the use of trajectory data obtained from probe vehicles. However, these studies have been conducted using simulated datasets; few works have discussed the limitations of real datasets and their impact on variable estimation. This study compares two methods for estimating the traffic state variables of signalised arterial sections: one based on cumulative vehicle counts (CUPRITE), and one based on vehicle trajectories from taxi Global Positioning System (GPS) logs. The comparisons reveal some characteristics of the taxi trajectory data available in Brisbane, Australia. The current trajectory data are limited in quantity (i.e., penetration rate), due to which the traffic state variables tend to be underestimated. Nevertheless, the trajectory-based method successfully captures the features of the traffic states, which suggests that taxi trajectories can be a good estimator of network-wide traffic states.
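
As a reference point for the trajectory-based estimation discussed above, Edie's generalised definitions give space-mean flow and density over a section of length L and time window T directly from probe trajectories, with a penetration-rate correction; a simplified sketch (uniform penetration is assumed, which, as the study notes for taxis, is a strong assumption):

    def edie_estimates(trajectories, section_length, window, penetration_rate):
        """Edie's generalised definitions on a time-space region of area L*T.
        trajectories     : list of per-vehicle dicts with 'distance' travelled [m]
                           and 'time' spent [s] inside the section during the window
        section_length   : L in metres
        window           : T in seconds
        penetration_rate : fraction of all vehicles that are probes (0-1)
        Returns (flow [veh/s], density [veh/m]), scaled up by 1/penetration_rate."""
        area = section_length * window
        total_distance = sum(t["distance"] for t in trajectories)
        total_time = sum(t["time"] for t in trajectories)
        flow = total_distance / area / penetration_rate
        density = total_time / area / penetration_rate
        return flow, density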

Relevance:

10.00%

Publisher:

Abstract:

The oncogene MDM4, also known as MDMX or HDMX, contributes to cancer susceptibility and progression through its capacity to negatively regulate a range of genes with tumour-suppressive functions. As part of a recent genome-wide association study it was determined that the A-allele of the rs4245739 SNP (A>C), located in the 3'-UTR of MDM4, is associated with an increased risk of prostate cancer. Computational predictions revealed that the rs4245739 SNP is located within a predicted binding site for three microRNAs (miRNAs): miR-191-5p, miR-887 and miR-3669. Herein, we show using reporter gene assays and endogenous MDM4 expression analyses that miR-191-5p and miR-887 have a specific affinity for the rs4245739 SNP C-allele in prostate cancer. These miRNAs do not affect MDM4 mRNA levels; rather, they inhibit its translation in C-allele-containing PC3 cells but not in LNCaP cells homozygous for the A-allele. By analysing gene expression datasets from patient cohorts, we found that MDM4 is associated with metastasis and prostate cancer progression, and that targeting this gene with miR-191-5p or miR-887 decreases PC3 cell viability. This study is the first, to our knowledge, to demonstrate regulation of the MDM4 rs4245739 SNP C-allele by two miRNAs in prostate cancer, and thereby to identify a mechanism by which the MDM4 rs4245739 SNP A-allele may be associated with an increased risk of prostate cancer.

Relevance:

10.00%

Publisher:

Abstract:

Background: Fusion transcripts are found in many tissues and have the potential to create novel functional products. Here, we investigate the genomic sequences around fusion junctions to better understand the transcriptional mechanisms mediating fusion transcription/splicing. We analyzed data from prostate (cancer) cells, as previous studies have shown extensively that these cells readily undergo fusion transcription.
Results: We used the FusionMap program to identify high-confidence fusion transcripts from RNAseq data. The RNAseq datasets were from our (N = 8) and other (N = 14) clinical prostate tumors with adjacent non-cancer cells, and from the LNCaP prostate cancer cell line that was mock-, androgen- (DHT) and anti-androgen- (bicalutamide, enzalutamide) treated. In total, 185 fusion transcripts were identified from all RNAseq datasets. The majority (76%) of these fusion transcripts were ‘read-through chimeras’ derived from adjacent genes in the genome. Characterization of sequences at fusion loci was carried out using a combination of the FusionMap program, custom Perl scripts, and the RNAfold program. Our computational analysis indicated that most fusion junctions (76%) use the consensus GT-AG intron donor-acceptor splice site, and that most fusion transcripts (85%) maintain the open reading frame. We assessed whether the parental genes of fusion transcripts have the potential to form complementary base pairing, which might bring them into physical proximity. Our computational analysis of sequences flanking fusion junctions at parental loci indicates that these loci have a similar propensity to hybridize as non-fusion loci. The abundance of repetitive sequences at fusion and non-fusion loci was also investigated, given that SINE repeats are involved in aberrant gene transcription. We found few instances of repetitive sequences at either fusion or non-fusion junctions. Finally, RT-qPCR was performed on RNA from both clinical prostate tumors and adjacent non-cancer cells (N = 7), and from LNCaP cells treated as above, to validate the expression of seven fusion transcripts and their respective parental genes. We reveal that fusion transcript expression is similar to the expression of the parental genes.
Conclusions: Fusion transcripts maintain the open reading frame and likely use the same transcriptional machinery as non-fusion transcripts, as they share many genomic features at splice/fusion junctions.
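
Two of the sequence-level checks above (canonical GT-AG donor-acceptor sites and open-reading-frame maintenance across the fusion junction) are simple to express in code; a minimal sketch with hypothetical inputs, noting that the study itself used FusionMap output, custom Perl scripts and RNAfold rather than this code:

    def has_gt_ag_splice(donor_intron_start, acceptor_intron_end):
        """Check the consensus splice signal: the intron begins with GT (donor side)
        and ends with AG (acceptor side). Inputs are short genomic strings taken
        from the parental loci flanking the fusion junction."""
        return donor_intron_start.upper().startswith("GT") and \
               acceptor_intron_end.upper().endswith("AG")

    def maintains_reading_frame(upstream_cds_len, downstream_offset):
        """Simplified frame check: the fusion keeps the downstream gene in frame
        if the retained upstream coding length and the offset into the downstream
        CDS agree modulo 3 (stop codons created at the junction are ignored)."""
        return upstream_cds_len % 3 == downstream_offset % 3

    # Illustrative use with made-up sequences/lengths:
    print(has_gt_ag_splice("GTAAGT", "TTCAG"))   # True
    print(maintains_reading_frame(1203, 0))      # True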

Relevance:

10.00%

Publisher:

Abstract:

This paper describes a vision-only system for place recognition in environments that are traversed at different times of day, when changing conditions drastically affect visual appearance, and at different speeds, where places aren’t visited at a consistent linear rate. The major contribution is the removal of wheel-based odometry from the previously presented algorithm (SMART), allowing the technique to operate on any camera-based device; in our case a mobile phone. While we show that the direct application of visual odometry to our night-time datasets does not achieve the level of performance typically needed, the VO requirements of SMART are orthogonal to typical usage: firstly, only the magnitude of the velocity is required, and secondly, the calculated velocity signal only needs to be repeatable in any one part of the environment over day and night cycles, but not necessarily globally consistent. Our results show that the smoothing effect of motion constraints is highly beneficial for achieving a locally consistent, lighting-independent velocity estimate. We also show that the advantage of our patch-based technique used previously for frame recognition, surprisingly, does not transfer to VO, where SIFT demonstrates equally good performance. Nevertheless, we present the SMART system using only vision, which performs sequence-based place recognition in extreme low-light conditions where standard 6-DOF VO fails, and which improves place recognition performance over odometry-less benchmarks, approaching that of wheel odometry.
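
The point that only a locally repeatable velocity magnitude is needed (rather than full 6-DOF VO) can be illustrated with a crude proxy: the mean dense optical-flow magnitude per frame pair, smoothed over a short window. This is not the SMART pipeline itself, just a sketch of the kind of signal it consumes, using OpenCV's Farneback flow on 8-bit grayscale frames:

    import cv2
    import numpy as np

    def velocity_magnitude_signal(gray_frames, smooth_window=15):
        """Per-frame speed proxy: mean dense optical-flow magnitude between
        consecutive grayscale frames, moving-average smoothed. The scale is
        arbitrary; only local repeatability over day/night traverses matters."""
        mags = []
        for prev, nxt in zip(gray_frames[:-1], gray_frames[1:]):
            flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
            mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
            mags.append(float(mag.mean()))
        kernel = np.ones(smooth_window) / smooth_window
        return np.convolve(np.array(mags), kernel, mode="same")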

Relevance:

10.00%

Publisher:

Abstract:

This paper addresses the following predictive business process monitoring problem: given the execution trace of an ongoing case, and given a set of traces of historical (completed) cases, predict the most likely outcome of the ongoing case. In this context, a trace refers to a sequence of events with corresponding payloads, where a payload consists of a set of attribute-value pairs. Meanwhile, an outcome refers to a label associated with completed cases, for example a label indicating that a given case completed “on time” (with respect to a given desired duration) or “late”, or a label indicating whether a given case led to a customer complaint. The paper tackles this problem via a two-phased approach. In the first phase, prefixes of historical cases are encoded using complex symbolic sequences and clustered. In the second phase, a classifier is built for each of the clusters. To predict the outcome of an ongoing case at runtime, given its (uncompleted) trace, we select the cluster(s) closest to the trace in question and apply the respective classifier(s), taking into account the Euclidean distance of the trace from the centres of the clusters. We consider two families of clustering algorithms – hierarchical clustering and k-medoids – and use random forests for classification. The approach was evaluated on four real-life datasets.
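
A compressed sketch of the two-phase scheme follows: encode prefixes as fixed-length vectors, cluster them, train one random forest per cluster, and at runtime route a running trace to the nearest cluster centre. The prefix encoding is assumed to be done already, and k-means stands in for the paper's hierarchical and k-medoids variants, so this is an illustration rather than the evaluated method:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.ensemble import RandomForestClassifier

    def train_two_phase(prefix_vectors, outcomes, n_clusters=3):
        """Phase 1: cluster encoded historical prefixes. Phase 2: train one
        classifier per cluster on that cluster's prefixes and case outcomes."""
        clustering = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(prefix_vectors)
        classifiers = {}
        for c in range(n_clusters):
            idx = clustering.labels_ == c
            clf = RandomForestClassifier(n_estimators=200, random_state=0)
            clf.fit(prefix_vectors[idx], outcomes[idx])
            classifiers[c] = clf
        return clustering, classifiers

    def predict_outcome(clustering, classifiers, running_trace_vector):
        """Route the encoded uncompleted trace to the closest cluster centre by
        Euclidean distance and apply that cluster's random forest."""
        dists = np.linalg.norm(clustering.cluster_centers_ - running_trace_vector, axis=1)
        c = int(dists.argmin())
        return classifiers[c].predict(running_trace_vector.reshape(1, -1))[0]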

Relevance:

10.00%

Publisher:

Abstract:

Deep convolutional network models have dominated recent work in human action recognition as well as image classification. However, these methods are often unduly influenced by the image background, learning and exploiting the presence of cues in typical computer vision datasets. For unbiased robotics applications, the degree of variation and novelty in action backgrounds is far greater than in computer vision datasets. To address this challenge, we propose an “action region proposal” method that, informed by optical flow, extracts image regions likely to contain actions for input into the network during both training and testing. In a range of experiments, we demonstrate that manually segmenting the background is not enough; however, through active action region proposals during training and testing, state-of-the-art or better performance can be achieved on individual spatial and temporal video components. Finally, we show that, by focusing attention through action region proposals, we can further improve upon the existing state-of-the-art in spatio-temporally fused action recognition performance.
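
A crude stand-in for a flow-informed action region proposal (not the authors' method) is: compute dense optical flow between consecutive frames, threshold the flow magnitude, and crop to the bounding box of the strongly moving pixels before feeding the crop to the network; a sketch using OpenCV:

    import cv2
    import numpy as np

    def flow_region_proposal(prev_gray, next_gray, mag_quantile=0.9, min_pixels=50):
        """Return an (x, y, w, h) box around strongly moving pixels, or None.
        prev_gray/next_gray: consecutive 8-bit grayscale frames."""
        flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        mask = mag > np.quantile(mag, mag_quantile)      # keep the fastest-moving pixels
        ys, xs = np.nonzero(mask)
        if len(xs) < min_pixels:
            return None                                  # too little motion: no proposal
        x, y = xs.min(), ys.min()
        return int(x), int(y), int(xs.max() - x), int(ys.max() - y)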

Relevance:

10.00%

Publisher:

Abstract:

Road traffic emissions are often considered the main source of ultrafine particles (UFP, diameter smaller than 100 nm) in urban environments. However, recent studies worldwide have shown that, in high-insolation urban regions at least, new particle formation events can also contribute to UFP. In order to quantify such events we systematically studied three cities located in predominantly sunny environments: Barcelona (Spain), Madrid (Spain) and Brisbane (Australia). Three long-term datasets (1-2 years) of fine and ultrafine particle number size distributions (measured by SMPS, Scanning Mobility Particle Sizer) were analysed. Compared to total particle number concentrations, aerosol size distributions offer far more information on the type, origin and atmospheric evolution of the particles. By applying k-Means clustering analysis, we categorized the collected aerosol size distributions into three main categories: “Traffic” (prevailing 44-63% of the time), “Nucleation” (14-19%) and “Background pollution and Specific cases” (7-22%). Measurements from Rome (Italy) and Los Angeles (California) were also included to complement the study. The daily variation of the average UFP concentrations for a typical nucleation day at each site revealed a similar pattern for all cities, with three distinct particle bursts. A morning and an evening spike reflected the traffic rush hours, whereas a third one at midday showed nucleation events. The photochemically nucleated particle bursts lasted 1-4 hours, with particles reaching sizes of 30-40 nm. On average, particle size spectra dominated by nucleation events occurred 16% of the time, showing the importance of this process as a source of UFP in urban environments exposed to high solar radiation. Nucleation events lasting 2 hours or more occurred on 55% of the days, extending to more than 4 hours on 28% of the days, demonstrating that atmospheric conditions in urban environments are not favourable to the growth of photochemically nucleated particles. In summary, although traffic remains the main source of UFP in urban areas, in developed countries with high insolation, urban nucleation events are also a major source of UFP. If traffic-related particle concentrations are reduced in the future, nucleation events will likely increase in urban areas due to the reduced urban condensation sinks.
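
The k-Means categorisation step referred to above is straightforward to reproduce in outline: each SMPS scan is a vector of number concentrations per size bin, which is normalised and clustered, and the resulting clusters are then interpreted manually as traffic, nucleation or background/specific cases. A minimal sketch, in which the normalisation and the choice of k are assumptions rather than the study's exact settings:

    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_size_distributions(spectra, n_clusters=9):
        """spectra: (n_scans, n_size_bins) particle number size distributions.
        Each scan is normalised to unit total so clusters reflect spectral shape
        rather than absolute concentration, then grouped with k-means."""
        shapes = spectra / spectra.sum(axis=1, keepdims=True)
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(shapes)
        return km.labels_, km.cluster_centers_

    # The cluster-mean spectra are then inspected (modal diameter, diurnal
    # occurrence) and mapped to categories such as "Traffic" or "Nucleation".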

Relevance:

10.00%

Publisher:

Abstract:

Purpose: Emotional intelligence (EI) is an increasingly important aspect of a health professional’s skill set. It is strongly associated with empathy, reflection and resilience, all key aspects of radiotherapy practice. Previous work in other disciplines has reached contradictory conclusions concerning the development of EI over time. This study aimed to determine the extent to which EI can develop during an undergraduate radiotherapy course and to identify factors affecting this.
Methods and materials: This study used anonymous coded Likert-style surveys to gather longitudinal data from radiotherapy students relating to a range of self-perceived EI traits during their 3-year degree. Data were gathered from the whole cohort at various points throughout the course.
Results: A total of 26 students provided data, with 14 completing the full series of datasets. There was a 17.2% increase in self-reported EI score (p < 0.0001). Social awareness and relationship skills exhibited the greatest increase in scores compared with self-awareness. Variance of scores decreased over time, and there was a smaller change in EI for mature students, who tended to have higher initial scores. The increase in EI was most evident immediately after clinical placements.
Conclusions: Radiotherapy students increase their EI scores during a 3-year course. Students reported higher levels of EI immediately after their clinical placement; radiotherapy curricula should seek to make the most of these learning opportunities.

Relevance:

10.00%

Publisher:

Abstract:

This article presents a method for checking the conformance between an event log capturing the actual execution of a business process, and a model capturing its expected or normative execution. Given a business process model and an event log, the method returns a set of statements in natural language describing the behavior allowed by the process model but not observed in the log and vice versa. The method relies on a unified representation of process models and event logs based on a well-known model of concurrency, namely event structures. Specifically, the problem of conformance checking is approached by folding the input event log into an event structure, unfolding the process model into another event structure, and comparing the two event structures via an error-correcting synchronized product. Each behavioral difference detected in the synchronized product is then verbalized as a natural language statement. An empirical evaluation shows that the proposed method scales up to real-life datasets while producing more concise and higher-level difference descriptions than state-of-the-art conformance checking methods.
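
The event-structure folding, unfolding and error-correcting synchronized product described above are beyond a short sketch. As a deliberately simpler stand-in that only conveys the flavour of log-versus-model behavioural comparison and verbalisation, the snippet below diffs directly-follows relations extracted from log traces and from model traces; it is explicitly not the article's method:

    def directly_follows(traces):
        """Set of (a, b) pairs where activity b directly follows a in some trace."""
        pairs = set()
        for trace in traces:
            pairs.update(zip(trace, trace[1:]))
        return pairs

    def verbalise_differences(log_traces, model_traces):
        """Natural-language statements for behaviour seen only in the log or only
        in the model, at the coarse level of directly-follows relations."""
        log_df, model_df = directly_follows(log_traces), directly_follows(model_traces)
        statements = []
        for a, b in sorted(model_df - log_df):
            statements.append(f"In the model, '{b}' can directly follow '{a}', "
                              f"but this was never observed in the log.")
        for a, b in sorted(log_df - model_df):
            statements.append(f"In the log, '{b}' directly follows '{a}', "
                              f"which the model does not allow.")
        return statements

    # Example with toy traces:
    print(verbalise_differences(
        log_traces=[["register", "check", "pay"], ["register", "pay"]],
        model_traces=[["register", "check", "pay"]]))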