678 results for Bootstrap (Statistics)
Abstract:
Although many incidents of fake online consumer reviews have been reported, very few studies have examined the trustworthiness of online consumer reviews to date. One reason is the lack of an effective computational method to separate untruthful reviews (i.e., spam) from legitimate ones (i.e., ham), given that prominent spam features are often missing in online reviews. The main contribution of our research is the development of a novel review spam detection method underpinned by an unsupervised inferential language modeling framework. Another contribution is the development of a high-order concept association mining method, which provides the essential term association knowledge to bootstrap the performance of untruthful review detection. Our experimental results confirm that the proposed inferential language model, equipped with high-order concept association knowledge, is effective in untruthful review detection when compared with other baseline methods.
Abstract:
Lower fruit and vegetable intake among socioeconomically disadvantaged groups has been well documented, and may be a consequence of a higher consumption of take-out foods. This study examined whether, and to what extent, take-out food consumption mediated (explained) the association between socioeconomic position and fruit and vegetable intake. A cross-sectional postal survey was conducted among 1500 randomly selected adults aged 25–64 years in Brisbane, Australia in 2009 (response rate = 63.7%, N = 903). A food frequency questionnaire assessed usual daily servings of fruits and vegetables (0 to 6), overall take-out consumption (times/week) and the consumption of 22 specific take-out items (never to ≥once/day). These specific take-out items were grouped into “less healthy” and “healthy” choices and indices were created for each type of choice (0 to 100). Socioeconomic position was ascertained by education. The analyses were performed using linear regression, and a bootstrap re-sampling approach estimated the statistical significance of the mediated effects. Mean daily serves of fruits and vegetables were 1.89 (SD 1.05) and 2.47 (SD 1.12) respectively. The least educated group consumed fewer serves of fruit (B= –0.39, p<0.001) and vegetables (B= –0.43, p<0.001) than the highest educated group. The consumption of “less healthy” take-out food partly explained (mediated) education differences in fruit and vegetable intake; however, no mediating effects were observed for overall and “healthy” take-out consumption. Regular consumption of “less healthy” take-out items may contribute to socioeconomic differences in fruit and vegetable intake, possibly by displacing these foods.
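As an illustration of how a bootstrap can assess a mediated effect, the following is a minimal sketch and not the authors' analysis. It assumes a single exposure `x`, a single mediator `m` and an outcome `y`, estimates the indirect effect as the product of two regression coefficients, and uses a percentile bootstrap interval. The function names and the simulated data are hypothetical.

```python
import random
import statistics


def slope(x, y):
    """Least-squares slope from regressing y on a single predictor x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def residuals(x, y):
    """Residuals of y after removing its linear dependence on x."""
    b = slope(x, y)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    return [(yi - my) - b * (xi - mx) for xi, yi in zip(x, y)]


def indirect_effect(x, m, y):
    """Product-of-coefficients estimate a*b of the mediated effect."""
    a = slope(x, m)  # exposure -> mediator
    # mediator -> outcome, adjusted for the exposure (residualise both on x)
    b = slope(residuals(x, m), residuals(x, y))
    return a * b


def bootstrap_indirect_ci(x, m, y, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample respondents with replacement."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(
            indirect_effect([x[i] for i in idx], [m[i] for i in idx], [y[i] for i in idx])
        )
    estimates.sort()
    return estimates[int(alpha / 2 * n_boot)], estimates[int((1 - alpha / 2) * n_boot) - 1]


def simulate(n=300, seed=1):
    """Hypothetical data: x raises m by 2, and each unit of m lowers y by 0.5."""
    rng = random.Random(seed)
    x = [rng.randrange(2) for _ in range(n)]
    m = [2.0 * xi + rng.gauss(0, 1) for xi in x]
    y = [-0.5 * mi + rng.gauss(0, 1) for mi in m]
    return x, m, y
```

An interval that excludes zero would be read as evidence of mediation under these assumptions.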
Abstract:
Since manually constructing domain-specific sentiment lexicons is extremely time consuming and may not even be feasible for domains where linguistic expertise is not available, research on the automatic construction of domain-specific sentiment lexicons has become a hot topic in recent years. The main contribution of this paper is the illustration of a novel semi-supervised learning method which exploits both term-to-term and document-to-term relations hidden in a corpus for the construction of domain-specific sentiment lexicons. More specifically, the proposed two-pass pseudo labeling method combines shallow linguistic parsing and corpus-based statistical learning to make domain-specific sentiment extraction scalable with respect to the sheer volume of opinionated documents archived on the Internet these days. Another novelty of the proposed method is that it can utilize the readily available user-contributed labels of opinionated documents (e.g., the user ratings of product reviews) to bootstrap the performance of sentiment lexicon construction. Our experiments show that the proposed method can generate high quality domain-specific sentiment lexicons as directly assessed by human experts. Moreover, the system-generated domain-specific sentiment lexicons can improve polarity prediction tasks at the document level by 2.18% when compared to other well-known baseline methods. Our research opens the door to the development of practical and scalable methods for domain-specific sentiment analysis.
Abstract:
This paper seeks to explain the lagging productivity in Singapore’s manufacturing noted in the statements of the Economic Strategies Committee Report 2010. Two methods are employed: the Malmquist productivity index to measure total factor productivity change, and Simar and Wilson’s (J Econ, 136:31–64, 2007) bootstrapped truncated regression approach. In the first stage, nonparametric data envelopment analysis is used to measure technical efficiency. To quantify the economic drivers underlying inefficiencies, the second stage employs a bootstrapped truncated regression whereby bias-corrected efficiency estimates are regressed against explanatory variables. The findings reveal that growth in total factor productivity was attributed to efficiency change with no technical progress. Most industries were technically inefficient throughout the period except for ‘Pharmaceutical Products’. Sources of efficiency were attributed to worker quality and flexible work arrangements, while incessant use of foreign workers lowered efficiency.
Abstract:
Monitoring the natural environment is increasingly important as habitat degradation and climate change reduce the world’s biodiversity. We have developed software tools and applications to assist ecologists with the collection and analysis of acoustic data at large spatial and temporal scales. One of our key objectives is automated animal call recognition, and our approach has three novel attributes. First, we work with raw environmental audio, contaminated by noise and artefacts and containing calls that vary greatly in volume depending on the animal’s proximity to the microphone. Second, initial experimentation suggested that no single recognizer could deal with the enormous variety of calls. Therefore, we developed a toolbox of generic recognizers to extract invariant features for each call type. Third, many species are cryptic and offer little data with which to train a recognizer. Many popular machine learning methods require large volumes of training and validation data and considerable time and expertise to prepare. Consequently, we adopt bootstrap techniques that can be initiated with little data and refined subsequently. In this paper, we describe our recognition tools and present results for real ecological problems.
Abstract:
Aims: To identify risk factors for major Adverse Events (AEs) and to develop a nomogram to predict the probability of such AEs in individual patients who have surgery for apparent early stage endometrial cancer. Methods: We used data from 753 patients who were randomized to either total laparoscopic hysterectomy or total abdominal hysterectomy in the LACE trial. Serious adverse events that prolonged hospital stay or postoperative adverse events (using common terminology criteria 3+, CTCAE V3) were considered major AEs. We analyzed pre-surgical characteristics that were associated with the risk of developing major AEs by multivariate logistic regression. We identified a parsimonious model by backward stepwise logistic regression. The six most significant or clinically important variables were included in the nomogram to predict the risk of major AEs within 6 weeks of surgery and the nomogram was internally validated. Results: Overall, 132 (17.5%) patients had at least one major AE. An open surgical approach (laparotomy), higher Charlson’s medical co-morbidities score, moderately differentiated tumours on curettings, higher baseline ECOG score, higher body mass index and low haemoglobin levels were associated with AE and were used in the nomogram. The bootstrap corrected concordance index of the nomogram was 0.63 and it showed good calibration. Conclusions: Six pre-surgical factors independently predicted the risk of major AEs. This research might form the basis to develop risk reduction strategies to minimize the risk of AEs among patients undergoing surgery for apparent early stage endometrial cancer.
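A bootstrap-corrected concordance index is commonly obtained by estimating the "optimism" of the apparent index and subtracting it. The sketch below shows that general idea only and is not the nomogram from the study. The toy scoring model `fit_mean_difference`, the simulated rows and all function names are hypothetical stand-ins for a fitted logistic regression.

```python
import random
import statistics


def c_index(scores, outcomes):
    """Concordance index for a binary outcome (equivalent to the ROC AUC)."""
    events = [s for s, o in zip(scores, outcomes) if o == 1]
    others = [s for s, o in zip(scores, outcomes) if o == 0]
    if not events or not others:
        raise ValueError("need at least one event and one non-event")
    # Count event/non-event pairs ranked correctly; ties count one half.
    concordant = sum((e > s) + 0.5 * (e == s) for e in events for s in others)
    return concordant / (len(events) * len(others))


def fit_mean_difference(rows):
    """Toy model (assumption): weight each predictor by its mean difference
    between event and non-event rows, and score by the weighted sum."""
    events = [x for x, y in rows if y == 1]
    others = [x for x, y in rows if y == 0]
    weights = [
        statistics.fmean([e[j] for e in events]) - statistics.fmean([o[j] for o in others])
        for j in range(len(rows[0][0]))
    ]
    return lambda x: sum(w * v for w, v in zip(weights, x))


def evaluate(model, rows):
    return c_index([model(x) for x, _ in rows], [y for _, y in rows])


def optimism_corrected_c_index(rows, fit, n_boot=200, seed=0):
    """Return (apparent, corrected) concordance using bootstrap optimism."""
    rng = random.Random(seed)
    apparent = evaluate(fit(rows), rows)
    optimism = []
    for _ in range(n_boot):
        sample = [rng.choice(rows) for _ in rows]  # resample patients
        model = fit(sample)
        # Performance on the bootstrap sample minus performance on the original data.
        optimism.append(evaluate(model, sample) - evaluate(model, rows))
    return apparent, apparent - statistics.fmean(optimism)


def simulate_rows(n=120, seed=1):
    """Hypothetical data: the first predictor is informative, the second is noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = (rng.gauss(0, 1), rng.gauss(0, 1))
        y = 1 if x[0] + rng.gauss(0, 1) > 0.8 else 0
        rows.append((x, y))
    return rows
```

Calling `optimism_corrected_c_index(simulate_rows(), fit_mean_difference)` returns the apparent index alongside its corrected value.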
Abstract:
Sequencing of mba gene fragments of reference strains of Ureaplasma urealyticum serovars 1, 3, 6, 14, in addition to 33 clinical U. urealyticum isolates is reported. A phylogenetic tree deduced from an alignment of these sequences clearly demonstrates two major clusters (confidence limit 100%), which equate to the parvo and T960 biovars, and five types which we have designated mba 1, 3, 6, 8 and X. These relationships are supported by bootstrap analysis. Polymorphisms within the mba fragment of types mba 1, 3, and 6 were used to define nine subtypes (mba 1a, 1b, 3a, 3b, 3c, 3d, 3e, 6a, and 6b) thus facilitating high resolution typing of U. urealyticum. Inclusion of the reference strains for serovars 1, 3, 6, and 8 in the mba typing scheme showed that the results of this analysis are broadly consistent with currently accepted serotyping. In addition a ure gene fragment from nine of the clinical isolates was amplified and sequenced. Comparisons of the sequences clearly distinguished the two biovars of U. urealyticum; however this fragment was invariant within the parvo biovar. This study has shown that the sequence of the mba can reveal the fine details of the relationships between U. urealyticum isolates and also supports the significant evolutionary gap between the two biovars.
Abstract:
Carrion-breeding Sarcophagidae (Diptera) can be used to estimate the post-mortem interval (PMI) in forensic cases. Difficulties with accurate morphological identifications at any life stage and a lack of documented thermobiological profiles have limited the current usefulness of these flies. The molecular-based approach of DNA barcoding, which utilises a 648-bp fragment of the mitochondrial cytochrome oxidase subunit I gene, was previously evaluated in a pilot study for the discrimination between 16 Australian sarcophagids. The current study comprehensively evaluated DNA barcoding on a larger taxon set of 588 adult Australian sarcophagids. A total of 39 of the 84 known Australian species were represented by 580 specimens, which includes 92% of potentially forensically important species. A further eight specimens could not be reliably identified, but were included as six unidentifiable taxa. A neighbour-joining phylogenetic tree was generated and nucleotide sequence divergences were calculated using the Kimura-two-parameter distance model. All species except Sarcophaga (Fergusonimyia) bancroftorum, known for high morphological variability, were resolved as reciprocally monophyletic (99.2% of cases), with most having bootstrap support of 100. Excluding S. bancroftorum, the mean intraspecific and interspecific variation ranged from 0.00-1.12% and 2.81-11.23%, respectively, allowing for species discrimination. DNA barcoding was therefore validated as a suitable method for the molecular identification of the Australian Sarcophagidae, which will aid in the implementation of this fauna in forensic entomology.
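For readers unfamiliar with the Kimura two-parameter model, a minimal sketch of the pairwise distance follows. It uses the standard formula d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the proportions of transitions and transversions. The function name is hypothetical, and skipping gapped or ambiguous sites is an assumption about preprocessing, not a detail taken from the study.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned DNA sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # assumption: ignore gaps and ambiguity codes
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1  # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            transversions += 1  # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    # math.log raises ValueError if the sequences are too divergent (saturated).
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

A matrix of such distances is the usual input to neighbour-joining tree construction.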
Abstract:
Recent literature has argued that environmental efficiency (EE), which is built on the materials balance (MB) principle, is more suitable than other EE measures in situations where the law of mass conservation regulates production processes. In addition, the MB-based EE method is particularly useful in analysing possible trade-offs between cost and environmental performance. Identifying determinants of MB-based EE can provide useful information to decision makers but there are very few empirical investigations into this issue. This article proposes the use of data envelopment analysis and stochastic frontier analysis techniques to analyse variation in MB-based EE. Specifically, the article develops a stochastic nutrient frontier and nutrient inefficiency model to analyse determinants of MB-based EE. The empirical study applies both techniques to investigate MB-based EE of 96 rice farms in South Korea. The size of land, fertiliser consumption intensity, cost allocative efficiency, and the share of owned land out of total land are found to be correlated with MB-based EE. The results confirm the presence of a trade-off between MB-based EE and cost allocative efficiency, a finding that favours policy interventions to help farms simultaneously achieve cost efficiency and MB-based EE.
Abstract:
Phylogenetic inference from sequences can be misled by both sampling (stochastic) error and systematic error (nonhistorical signals where reality differs from our simplified models). A recent study of eight yeast species using 106 concatenated genes from complete genomes showed that even small internal edges of a tree received 100% bootstrap support. This effective negation of stochastic error from large data sets is important, but longer sequences exacerbate the potential for biases (systematic error) to be positively misleading. Indeed, when we analyzed the same data set using minimum evolution optimality criteria, an alternative tree received 100% bootstrap support. We identified a compositional bias as responsible for this inconsistency and showed that it is reduced effectively by coding the nucleotides as purines and pyrimidines (RY-coding), reinforcing the original tree. Thus, a comprehensive exploration of potential systematic biases is still required, even though genome-scale data sets greatly reduce sampling error.
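Two operations mentioned in this abstract are simple to illustrate: RY-coding, which recodes purines (A, G) as R and pyrimidines (C, T) as Y, and the nonparametric bootstrap, which resamples alignment columns with replacement before a tree is rebuilt. The sketch below is a hedged illustration with hypothetical function names. Tree construction itself is out of scope here.

```python
import random

# Purines (A, G) become R; pyrimidines (C, T, and U for RNA) become Y.
RY_MAP = str.maketrans({"A": "R", "G": "R", "C": "Y", "T": "Y", "U": "Y"})


def ry_code(seq):
    """Recode a nucleotide sequence as purines/pyrimidines; other symbols are kept."""
    return seq.upper().translate(RY_MAP)


def bootstrap_columns(alignment, rng):
    """One bootstrap pseudo-replicate: sample alignment columns with replacement.

    The same sampled columns are applied to every sequence so that sites stay aligned.
    """
    n_sites = len(alignment[0])
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[c] for c in columns) for seq in alignment]
```

Bootstrap support for an edge is then the percentage of pseudo-replicate trees that contain it. RY-coding discards transition information, which is why it can reduce compositional bias.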
Abstract:
Bactrocera dorsalis sensu stricto, B. papayae, B. philippinensis and B. carambolae are serious pest fruit fly species of the B. dorsalis complex that predominantly occur in south-east Asia and the Pacific. Identifying molecular diagnostics has proven problematic for these four taxa, a situation that confounds biosecurity and quarantine efforts and which may be the result of at least some of these taxa representing the same biological species. We therefore conducted a phylogenetic study of these four species (and closely related outgroup taxa) based on the individuals collected from a wide geographic range; sequencing six loci (cox1, nad4-3′, CAD, period, ITS1, ITS2) for approximately 20 individuals from each of 16 sample sites. Data were analysed within maximum likelihood and Bayesian phylogenetic frameworks for individual loci and concatenated data sets for which we applied multiple monophyly and species delimitation tests. Species monophyly was measured by clade support, posterior probability or bootstrap resampling for Bayesian and likelihood analyses respectively, Rosenberg's reciprocal monophyly measure, P(AB), Rodrigo's (P(RD)) and the genealogical sorting index, gsi. We specifically tested whether there was phylogenetic support for the four 'ingroup' pest species using a data set of multiple individuals sampled from a number of populations. Based on our combined data set, Bactrocera carambolae emerges as a distinct monophyletic clade, whereas B. dorsalis s.s., B. papayae and B. philippinensis are unresolved. These data add to the growing body of evidence that B. dorsalis s.s., B. papayae and B. philippinensis are the same biological species, which poses consequences for quarantine, trade and pest management.
Abstract:
This paper seeks to explain the lagging productivity in Singapore’s manufacturing noted in the statements of the Economic Strategies Committee Report 2010. Two methods are employed: the Malmquist productivity index to measure total factor productivity (TFP) change, and Simar and Wilson’s (2007) bootstrapped truncated regression approach, which first derives bias-corrected efficiency estimates and then regresses them against explanatory variables to help quantify sources of inefficiency. The findings reveal that growth in total factor productivity was attributed to efficiency change with no technical progress. Sources of efficiency were attributed to worker quality and flexible work arrangements, while the use of foreign workers lowered efficiency.
Abstract:
This paper evaluates the performance of prediction intervals generated from alternative time series models, in the context of tourism forecasting. The forecasting methods considered include the autoregressive (AR) model, the AR model using the bias-corrected bootstrap, seasonal ARIMA models, innovations state space models for exponential smoothing, and Harvey’s structural time series models. We use thirteen monthly time series for the number of tourist arrivals to Hong Kong and Australia. The mean coverage rates and widths of the alternative prediction intervals are evaluated in an empirical setting. It is found that all models produce satisfactory prediction intervals, except for the autoregressive model. In particular, those based on the bias-corrected bootstrap perform best in general, providing tight intervals with accurate coverage rates, especially when the forecast horizon is long.
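The idea behind a bias-corrected bootstrap prediction interval can be sketched for the simplest case, an AR(1) model. This is a simplified illustration under stated assumptions, not the procedure evaluated in the paper. It estimates the small-sample bias of the AR coefficient by bootstrap, subtracts it, and then simulates future paths by resampling residuals. The function names are hypothetical, and the fallback used when the corrected coefficient is non-stationary is a crude placeholder.

```python
import random
import statistics


def fit_ar1(series):
    """OLS fit of y[t] = c + phi * y[t-1] + e[t]; returns (c, phi, residuals)."""
    x, y = series[:-1], series[1:]
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    phi = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    c = my - phi * mx
    resid = [yi - c - phi * xi for xi, yi in zip(x, y)]
    return c, phi, resid


def simulate_ar1(c, phi, resid, start, n, rng):
    """Generate a pseudo-series of length n by resampling residuals."""
    path = [start]
    for _ in range(n - 1):
        path.append(c + phi * path[-1] + rng.choice(resid))
    return path


def bias_corrected_phi(series, n_boot, rng):
    """Subtract the bootstrap estimate of the OLS bias from phi."""
    c, phi, resid = fit_ar1(series)
    boot = [
        fit_ar1(simulate_ar1(c, phi, resid, series[0], len(series), rng))[1]
        for _ in range(n_boot)
    ]
    corrected = phi - (statistics.fmean(boot) - phi)
    # Assumption: if the correction leaves the stationary region, keep the OLS value.
    return corrected if abs(corrected) < 1 else phi


def prediction_interval(series, horizon, n_boot=500, alpha=0.05, seed=0):
    """Percentile interval for the value `horizon` steps ahead."""
    rng = random.Random(seed)
    phi = bias_corrected_phi(series, n_boot, rng)
    x, y = series[:-1], series[1:]
    c = statistics.fmean(y) - phi * statistics.fmean(x)  # re-centred intercept
    resid = [yi - c - phi * xi for xi, yi in zip(x, y)]
    finals = []
    for _ in range(n_boot):
        value = series[-1]
        for _step in range(horizon):
            value = c + phi * value + rng.choice(resid)
        finals.append(value)
    finals.sort()
    return finals[int(alpha / 2 * n_boot)], finals[int((1 - alpha / 2) * n_boot) - 1]
```

The correction matters because the OLS estimate of an AR coefficient is biased in small samples, and that bias compounds as the forecast horizon grows.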
Abstract:
Time series classification has been extensively explored in many fields of study. Most methods are based on the historical or current information extracted from data. However, if interest is in a specific future time period, methods that directly relate to forecasts of time series are much more appropriate. An approach to time series classification is proposed based on a polarization measure of forecast densities of time series. By fitting autoregressive models, forecast replicates of each time series are obtained via the bias-corrected bootstrap, and a stationarity correction is considered when necessary. Kernel estimators are then employed to approximate forecast densities, and discrepancies of forecast densities of pairs of time series are estimated by a polarization measure, which evaluates the extent to which two densities overlap. Following the distributional properties of the polarization measure, a discriminant rule and a clustering method are proposed to conduct the supervised and unsupervised classification, respectively. The proposed methodology is applied to both simulated and real data sets, and the results show desirable properties.
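To make the overlap idea concrete, the sketch below estimates two densities with Gaussian kernels and integrates their pointwise minimum. A value near 1 indicates nearly identical densities and a value near 0 indicates almost no overlap. The paper's polarization measure may be defined differently, so this overlap integral is a stand-in. The rule-of-thumb bandwidth, the grid settings and the function names are likewise assumptions. In the setting described, the two samples would be bootstrap forecast replicates of two time series.

```python
import math
import random
import statistics


def rule_of_thumb_bandwidth(sample):
    """Normal-reference bandwidth (assumes the sample is not constant)."""
    return 1.06 * statistics.stdev(sample) * len(sample) ** (-1 / 5)


def gaussian_kde(sample, bandwidth):
    """Return a function evaluating the Gaussian kernel density estimate."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample)

    return density


def overlap(sample_a, sample_b, grid_size=400):
    """Approximate the integral of min(f_a, f_b) with the trapezoidal rule."""
    ha, hb = rule_of_thumb_bandwidth(sample_a), rule_of_thumb_bandwidth(sample_b)
    fa, fb = gaussian_kde(sample_a, ha), gaussian_kde(sample_b, hb)
    pad = 4 * max(ha, hb)  # extend the grid past the data to capture the tails
    lo = min(min(sample_a), min(sample_b)) - pad
    hi = max(max(sample_a), max(sample_b)) + pad
    step = (hi - lo) / (grid_size - 1)
    values = [min(fa(lo + i * step), fb(lo + i * step)) for i in range(grid_size)]
    return step * (sum(values) - 0.5 * (values[0] + values[-1]))
```

Pairwise overlaps of this kind could then feed a discriminant rule or a clustering algorithm, as the abstract describes.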