26 results for outliers

in Deakin Research Online - Australia


Relevance: 20.00%

Abstract:

We consider the use of Ordered Weighted Averaging (OWA) in linear regression. Our goal is to replace the traditional least squares, least absolute deviation, and maximum likelihood criteria with an OWA function of the residuals. We obtain several high-breakdown robust regression methods as special cases (least median, least trimmed squares, and trimmed likelihood methods). We also present new formulations of the regression problem. OWA-based regression is particularly useful in the presence of outliers.

Relevance: 10.00%

Abstract:

The aim of this study was to determine whether the behavioral characteristics demonstrated by rapists clustered together into groups that were similar to the common rapist typology in the literature: anger, power exploitative, power reassurance, and sadistic. Two studies were conducted to examine the evidence for this typology. Study 1 involved the analysis of data from 130 men charged with sexual assault and Study 2 involved the analysis of court transcripts from 50 rape cases tried through the court system. The results of Study 1 revealed that there was some validity to the characteristics usually associated with each of the four types of rape, especially for the power reassurance and sadistic rapists. However, there were some unexpected outliers within both the anger and power exploitative types of rapists, which may suggest that there is more than one type of anger rapist and more than two types of power rapists. The results of Study 2 very closely replicated the results of Study 1. Future research needs to focus on the behavioral, motivational, and cognitive characteristics associated with each of the types of rapists and research them separately, so that it is possible to further evaluate the evidence for the typology identified in this study.

Relevance: 10.00%

Abstract:

A retrospective assessment of exposure to benzene was carried out for a nested case-control study of lympho-haematopoietic cancers, including leukaemia, in the Australian petroleum industry. Each job or task in the industry was assigned a Base Estimate (BE) of exposure derived from task-based personal exposure assessments carried out by the company occupational hygienists. The BEs corresponded to the estimated arithmetic mean exposure to benzene for each job or task and were used in a deterministic algorithm to estimate the exposure of subjects in the study. Nearly all of the data sets underlying the BEs were found to contain some values below the limit of detection (LOD) of the sampling and analytical methods, and some were very heavily censored; up to 95% of the data were below the LOD in some data sets. It was necessary, therefore, to use a method of calculating the arithmetic mean exposures that took the censored data into account. Three different methods were compared in order to select the most appropriate method for the particular data in the study. A common method is to replace the missing (censored) values with half the detection limit. This method has been recommended for data sets where much of the data are below the limit of detection or where the data are highly skewed, with a geometric standard deviation of 3 or more. Another method, replacing the censored data with the limit of detection divided by the square root of 2, has been recommended when relatively few data are below the detection limit or where the data are not highly skewed. The third method examined was Cohen's method, which involves mathematical extrapolation of the left-hand tail of the distribution, based on the distribution of the uncensored data, and calculation of the maximum likelihood estimate of the arithmetic mean. When these three methods were applied to the data in this study, it was found that the first two simple methods gave similar results in most cases. Cohen's method, on the other hand, gave results that were generally, but not always, higher than the simpler methods, and in some cases gave extremely high and even implausible estimates of the mean. It appears that if the data deviate substantially from a simple log-normal distribution, particularly if high outliers are present, Cohen's method produces erratic and unreliable estimates. After examining these results, and both the distributions and proportions of censored data, it was decided that the half limit of detection method was most suitable in this particular study.
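
As an illustration of the three approaches described above, the sketch below computes an arithmetic mean from a censored data set using the two substitution rules (LOD/2 and LOD divided by the square root of 2) and a direct maximum-likelihood estimate under a left-censored log-normal model, which is in the spirit of Cohen's method but not necessarily the exact procedure used in the study. The benzene readings are hypothetical.

```python
import numpy as np
from scipy import stats, optimize

def mean_half_lod(detects, n_censored, lod):
    """Substitute LOD/2 for censored values, then take the arithmetic mean."""
    values = np.concatenate([detects, np.full(n_censored, lod / 2.0)])
    return values.mean()

def mean_lod_sqrt2(detects, n_censored, lod):
    """Substitute LOD/sqrt(2) for censored values, then take the arithmetic mean."""
    values = np.concatenate([detects, np.full(n_censored, lod / np.sqrt(2.0))])
    return values.mean()

def mean_lognormal_mle(detects, n_censored, lod):
    """Maximum-likelihood arithmetic mean assuming a left-censored log-normal
    distribution (a Cohen-style estimator, not necessarily the study's exact one)."""
    log_d = np.log(detects)

    def neg_loglik(params):
        mu, log_sigma = params
        sigma = np.exp(log_sigma)
        ll = stats.norm.logpdf(log_d, mu, sigma).sum()
        if n_censored:
            ll += n_censored * stats.norm.logcdf(np.log(lod), mu, sigma)
        return -ll

    res = optimize.minimize(neg_loglik, x0=[log_d.mean(), np.log(log_d.std() + 1e-6)])
    mu, sigma = res.x[0], np.exp(res.x[1])
    return np.exp(mu + sigma ** 2 / 2.0)   # arithmetic mean of a log-normal

# Hypothetical benzene readings (ppm) with 6 of 10 samples below an LOD of 0.1.
detects = np.array([0.12, 0.35, 0.18, 0.90])
print(mean_half_lod(detects, 6, 0.1), mean_lod_sqrt2(detects, 6, 0.1),
      mean_lognormal_mle(detects, 6, 0.1))
```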

Relevance: 10.00%

Abstract:

Overview and Aim
1. This report concerns an analysis of the cumulative data from 15 surveys using the Personal Wellbeing Index to measure Subjective Wellbeing. The total number of respondents is about 30,000 but not all data were available for all analyses due to changing variables between surveys.
2. The aim of this analysis is to determine those sub-groups with the highest and the lowest wellbeing.
Method
3. The definition of sub-groups is through the demographic variables of Income, Gender, Age, Household Composition, Relationship Status and Employment Status. Index domains are also included. While not every combination of demographic variables has been tested, the total number of combinations analysed was 3,277.
4. Extreme group mean scores are defined as lying above 79 points or below 70 points. These values are at least five standard deviations beyond the total sample mean score and are, therefore, extreme outliers. The minimum number of responses that could form such a group is one. Data are accumulated across surveys for corresponding groups.
Results
5. The initial search for the most extreme groups identified the 20 highest and the 20 lowest groups with a minimum N=10. These are termed the ‘Exclusive’ groups since they were based only on the previously identified extreme scores. In order to determine the true mean of each of these groups, a further analysis incorporated all respondents who met the definition of group membership. For example, an Exclusive group defined as [male, 76+ years] would contain only the accumulation of scores from individual surveys that met the extreme score criterion (<70 or >79). The Inclusive group included the scores from all survey respondents who matched the group definition of male, 76+ years.
6. The results revealed a dominance by the domains of the Personal Wellbeing Index. The extreme high groups were predicted by high scores on all domains except safety and relationships. The low groups were defined by low scores on all seven domains.
7. A further search for extreme groups was undertaken that was restricted to the demographic descriptors. The 20 highest and 20 lowest groups were identified based on a minimum cell content of N=10. The corresponding Inclusive group means were then calculated as before.
8. In order to increase the reliability of the final groups, a minimum cell content of N=20 cases was imposed.
9. Six extreme high groups were identified. These are dominated by high income and the presence of a partner. Five extreme low groups were identified. These are dominated by very low income, the absence of a partner, and unemployment.
Conclusions
10. The conclusions drawn from these analyses are as follows:
10.1 The central defining characteristics of people forming the extreme high wellbeing groups are high household income and living with a partner.
10.2 The central defining risk factors for people forming the extreme low wellbeing groups are very low household income, not living with a partner, and unemployment.
10.3 None of these five demographic characteristics is sufficient to define extreme wellbeing groups on its own; they act only in combinations of at least two factors.

Relevance: 10.00%

Abstract:

Recently, much attention has been given to mass spectrometry (MS) based disease classification, diagnosis, and protein biomarker identification. As with microarray-based investigations, proteomic data generated by such high-throughput experiments often have a high feature-to-sample ratio. Moreover, biological information and patterns are compounded with data noise, redundancy, and outliers. Thus, the development of algorithms and procedures for the analysis and interpretation of such data is of paramount importance. In this paper, we propose a hybrid system for analysing such high dimensional data. The proposed method uses a k-means clustering based feature extraction and selection procedure to bridge filter and wrapper selection methods. The potentially informative mass/charge (m/z) markers selected by filters are subjected to k-means clustering for correlation and redundancy reduction, and a multi-objective genetic algorithm selector is then employed to identify discriminative m/z markers from the resulting clusters. Experimental results obtained with the proposed method indicate that it is suitable for m/z biomarker selection and MS-based sample classification.
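
The hybrid scheme described above can be sketched as follows, assuming a matrix X of m/z intensities and class labels y. A univariate F-score filter, scikit-learn's KMeans for redundancy reduction, and a greedy forward-selection wrapper are used here; the greedy wrapper is only a simple stand-in for the paper's multi-objective genetic algorithm selector.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_selection import f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def hybrid_mz_selection(X, y, n_filter=200, n_clusters=20, n_final=10):
    """Filter -> k-means redundancy reduction -> simple wrapper."""
    # 1) Filter: keep the m/z features with the largest univariate F-scores.
    f_scores, _ = f_classif(X, y)
    filt_idx = np.argsort(f_scores)[-n_filter:]

    # 2) Cluster the retained features (as vectors across samples) so that highly
    #    correlated m/z markers fall into the same cluster, and keep the
    #    best-scoring representative of each cluster.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X[:, filt_idx].T)
    reps = []
    for c in range(n_clusters):
        members = filt_idx[km.labels_ == c]
        if members.size:
            reps.append(members[np.argmax(f_scores[members])])

    # 3) Wrapper: greedy forward selection by cross-validated accuracy
    #    (a stand-in for the paper's multi-objective genetic algorithm).
    selected = []
    clf = LogisticRegression(max_iter=1000)
    while len(selected) < n_final and reps:
        scores = [cross_val_score(clf, X[:, selected + [r]], y, cv=5).mean() for r in reps]
        best = int(np.argmax(scores))
        selected.append(reps.pop(best))
    return selected

# Hypothetical data: 60 spectra x 1000 m/z bins, two classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 1000))
y = np.repeat([0, 1], 30)
X[y == 1, :5] += 1.5                       # a few informative m/z markers
print(hybrid_mz_selection(X, y, n_filter=100, n_clusters=15, n_final=5))
```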

Relevance: 10.00%

Abstract:

Prior research by Bouman and Jacobsen (2002) documents unusually high monthly returns over the period November-April for both United States (U.S.) and foreign stock markets and labels this phenomenon the Halloween effect. The implication is that the Halloween effect represents an exploitable anomaly, which has negative implications for stock market efficiency. We extend this research to the S&P 500 futures contract and find no evidence of an exploitable Halloween effect over the period April 1982 through April 2003. Re-examining Bouman and Jacobsen’s empirical results for the U.S., we find that two outliers drive their results. These two outliers are associated with the “Crash” of October 1987 and the collapse of the hedge fund Long-Term Capital Management in August 1998. After inserting a dummy variable to account for the impact of the two identified outliers, the Halloween effect disappears.

Relevance: 10.00%

Abstract:

Examining the years 1970 to 1998, Bouman and Jacobsen (2002) document unusually high monthly returns during the November-April periods for both United States (U.S.) and foreign stock markets and label this phenomenon the Halloween effect. Their research suggests that the Halloween effect represents an exploitable anomaly and has negative implications for claims of stock market efficiency.

Re-examining Bouman and Jacobsen’s empirical results for the U.S. reveals that their results are driven by two outliers, the “Crash” of October 1987 and the collapse of the hedge fund Long-Term Capital Management in August 1998. After inserting a dummy variable to account for the impact of the two identified outliers, the Halloween effect becomes statistically insignificant. This anomaly is not economically exploitable for U.S. equity markets. We extend the research to the S&P 500 futures contract and find no evidence of an exploitable Halloween effect over the period April 1982-April 2003.
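
A minimal sketch of this kind of test regression is given below: monthly returns are regressed on a November-April dummy, first alone and then together with dummies for October 1987 and August 1998. The return series here is randomly generated purely to illustrate the specification; it does not reproduce the empirical results.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical monthly return series; in the study this would be the S&P 500
# (or futures) returns from the periods analysed by Bouman and Jacobsen.
idx = pd.date_range("1970-01-31", "1998-12-31", freq="M")
rng = np.random.default_rng(0)
df = pd.DataFrame({"ret": rng.normal(0.005, 0.04, len(idx))}, index=idx)

# Halloween dummy: 1 for November-April, 0 for May-October.
df["halloween"] = df.index.month.isin([11, 12, 1, 2, 3, 4]).astype(int)
# Outlier dummies for the October 1987 crash and the August 1998 LTCM collapse.
df["oct87"] = ((df.index.year == 1987) & (df.index.month == 10)).astype(int)
df["aug98"] = ((df.index.year == 1998) & (df.index.month == 8)).astype(int)

base = sm.OLS(df["ret"], sm.add_constant(df[["halloween"]])).fit()
with_dummies = sm.OLS(df["ret"], sm.add_constant(df[["halloween", "oct87", "aug98"]])).fit()
print(base.params["halloween"], base.pvalues["halloween"])
print(with_dummies.params["halloween"], with_dummies.pvalues["halloween"])
```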

Relevance: 10.00%

Abstract:

We consider an application of fuzzy logic connectives to statistical regression. We replace the standard least squares, least absolute deviation, and maximum likelihood criteria with an ordered weighted averaging (OWA) function of the residuals. Depending on the choice of the weights, we obtain the standard regression problems, high-breakdown robust methods (least median, least trimmed squares, and trimmed likelihood methods), as well as new formulations. We present various approaches to numerical solution of such regression problems. OWA-based regression is particularly useful in the presence of outliers, and we illustrate the performance of the new methods on several instances of linear regression problems with multiple outliers.
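
A minimal sketch of OWA-based regression is shown below: the coefficients minimise a weighted sum of the ordered squared residuals, and putting unit weight on the h smallest residuals yields least trimmed squares as a special case. A general-purpose multi-start Nelder-Mead search is used here for simplicity; it is not one of the specialised solution approaches presented in the paper.

```python
import numpy as np
from scipy.optimize import minimize

def owa_regression(X, y, weights):
    """Fit coefficients minimising an OWA function of the squared residuals:
    sum_i w_i * r2_(i), where r2_(1) <= ... <= r2_(n) are the ordered values."""
    Xc = np.column_stack([np.ones(len(y)), X])           # add an intercept
    weights = np.asarray(weights, dtype=float)

    def objective(beta):
        return weights @ np.sort((y - Xc @ beta) ** 2)   # ordered squared residuals

    # Multi-start Nelder-Mead to soften the non-convexity of trimmed objectives.
    best = None
    for seed in range(20):
        rng = np.random.default_rng(seed)
        beta0 = np.linalg.lstsq(Xc, y, rcond=None)[0] + rng.normal(0, 0.5, Xc.shape[1])
        res = minimize(objective, beta0, method="Nelder-Mead")
        if best is None or res.fun < best.fun:
            best = res
    return best.x

# Least trimmed squares as a special case: weight 1 on the h smallest squared
# residuals and 0 on the rest (here h is 75% of a sample of n = 100).
n, h = 100, 75
w_lts = np.r_[np.ones(h), np.zeros(n - h)]

rng = np.random.default_rng(1)
X = rng.normal(size=(n, 1))
y = 2.0 + 3.0 * X[:, 0] + rng.normal(scale=0.3, size=n)
y[:10] += 15.0                                           # plant gross outliers
print(owa_regression(X, y, w_lts))                       # stays close to [2, 3]
```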

Relevance: 10.00%

Abstract:

The exact distribution of the maximum and minimum frequencies of Multinomial/Dirichlet and Multivariate Hypergeometric distributions of n balls in m urns is compactly represented as a product of stochastic matrices. This representation does not require equal urn probabilities, is invariant to urn order, and permits rapid calculation of exact probabilities. The exact distribution of the range is also obtained. These algorithms satisfy a long-standing need for routines to compute exact Multinomial/Dirichlet and Multivariate Hypergeometric maximum, minimum, and range probabilities in statistical computation libraries and software packages.
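
One way to realise such a computation is the sequential conditional-binomial recursion sketched below, in which each urn contributes a (sub)stochastic transition over the running total of balls placed, and the product of these transitions gives P(max ≤ t) exactly. This is a generic implementation consistent with the description above, not necessarily the authors' algorithm; the minimum and range follow analogously.

```python
import numpy as np
from scipy.stats import binom

def prob_max_le(t, n, p):
    """P(max_i N_i <= t) for (N_1,...,N_m) ~ Multinomial(n, p), computed via the
    decomposition N_i | N_1..N_{i-1} ~ Binomial(n - placed, p_i / tail_i)."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    tails = np.cumsum(p[::-1])[::-1]             # tails[i] = p_i + ... + p_m
    f = np.zeros(n + 1)
    f[0] = 1.0                                   # mass over the running total of balls
    for i in range(m):
        q = p[i] / tails[i]
        g = np.zeros(n + 1)
        for a in range(n + 1):
            if f[a] == 0.0:
                continue
            kmax = min(t, n - a)                 # urn i may hold at most t balls
            ks = np.arange(kmax + 1)
            g[a + ks] += f[a] * binom.pmf(ks, n - a, q)
        f = g
    return f[n]                                  # all n balls placed, every count <= t

# Sanity check against a tiny case: 5 balls in 3 equal urns gives 90/243.
print(prob_max_le(2, 5, [1/3, 1/3, 1/3]))
```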

Relevance: 10.00%

Abstract:

In this paper, we propose a new augmented Dickey–Fuller-type test for unit roots which accounts for two structural breaks. We consider two different specifications: (a) two breaks in the level of a trending data series and (b) two breaks in the level and slope of a trending data series. The breaks, whose time of occurrence is assumed to be unknown, are modeled as innovational outliers and thus take effect gradually. Using Monte Carlo simulations, we show that our proposed test has correct size, stable power, and identifies the structural breaks accurately.
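
The sketch below illustrates the general form of such a test for specification (a): level-break dummies and break-date impulse dummies enter the ADF-type regression directly (the innovational-outlier form), and the break dates are located by a grid search that minimises the t-statistic on the lagged level. This is a generic illustration only; the test's critical values are non-standard and come from the simulations reported in the paper.

```python
import numpy as np
import statsmodels.api as sm

def two_break_adf(y, lags=4, trim=0.10, step=5):
    """ADF-type statistic allowing two level breaks modelled as innovational
    outliers. Returns the minimum t-statistic on y_{t-1} and the located break
    dates. A coarse grid (step) keeps the sketch fast."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    dy = np.diff(y)
    lo, hi = int(trim * T), int((1 - trim) * T)

    best = (np.inf, None, None)
    for tb1 in range(lo, hi, step):
        for tb2 in range(tb1 + 2 * step, hi, step):
            rows, resp = [], []
            for t in range(lags + 1, T):
                du1, du2 = float(t > tb1), float(t > tb2)   # step dummies
                d1, d2 = float(t == tb1 + 1), float(t == tb2 + 1)  # impulse dummies
                rows.append([1.0, t, du1, du2, d1, d2, y[t - 1]]
                            + [dy[t - 1 - j] for j in range(1, lags + 1)])
                resp.append(dy[t - 1])            # dependent variable: delta y_t
            res = sm.OLS(np.array(resp), np.array(rows)).fit()
            tstat = res.tvalues[6]                # t-statistic on y_{t-1}
            if tstat < best[0]:
                best = (tstat, tb1, tb2)
    return best

# Hypothetical series of length 200 with two level shifts.
rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=200))
y[80:] += 4.0
y[150:] += 4.0
print(two_break_adf(y))
```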

Relevance: 10.00%

Abstract:

Determination of the optimal operating conditions for the injection moulding process has been of special interest to many researchers. To determine the optimal settings, one first has to derive a model of the injection moulding process that maps the relationship between the input process control factors and the output responses. One of the most popular modelling techniques is linear least squares regression, owing to its effectiveness and completeness. However, least squares regression is very sensitive to outliers and fails to provide a reliable model when the control variables are highly correlated with each other. To address this problem, a new modelling method based on principal component regression is proposed in this paper. The distinguishing feature of the proposed method is that it considers not only the variance of the covariance matrix of the control variables but also the correlation coefficients between the control variables and the target variables to be optimised. The method has been implemented in a commercial optimisation software package, and field test results demonstrate its performance.
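
A minimal sketch of this kind of principal component regression is given below: the control variables are standardised, principal components are extracted, and the components kept for the regression are ranked by their correlation with the target rather than by explained variance alone. The selection rule and the moulding data are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def pcr_target_correlated(X, y, n_components=3):
    """Principal component regression where components are chosen not only by
    the variance they explain but also by their correlation with the target."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)        # standardise control variables
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xs, rowvar=False))
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xs @ eigvecs                            # component scores

    # Rank components by |correlation with y|, not variance alone.
    corr = np.array([np.corrcoef(scores[:, j], y)[0, 1] for j in range(scores.shape[1])])
    keep = np.argsort(np.abs(corr))[::-1][:n_components]

    Z = np.column_stack([np.ones(len(y)), scores[:, keep]])
    gamma, *_ = np.linalg.lstsq(Z, y, rcond=None)    # regress y on the kept components
    return gamma, eigvecs[:, keep]                   # coefficients and loadings

# Hypothetical moulding data: 50 runs, 5 correlated process settings, one response.
rng = np.random.default_rng(2)
base = rng.normal(size=(50, 2))
X = np.column_stack([base, base @ rng.normal(size=(2, 3))]) + 0.05 * rng.normal(size=(50, 5))
y = X[:, 0] - 2 * X[:, 3] + 0.1 * rng.normal(size=50)
print(pcr_target_correlated(X, y)[0])
```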

Relevance: 10.00%

Abstract:

We advance the theory of aggregation operators and introduce non-monotone aggregation methods based on minimising a penalty for disagreement among the inputs. The application we have in mind is processing data sets which may contain noisy values. Our aim is to filter out noise while at the same time preserving signs of unusual values. We review various methods of robust estimation of location and then introduce a new estimator based on penalty minimisation.
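
As an illustration, the sketch below defines a location estimate as the minimiser of a total penalty on the deviations of the inputs from the output: a squared penalty recovers the arithmetic mean, an absolute penalty recovers the median, and a bounded (non-convex) penalty caps the influence of the outlier. The specific bounded penalty is an illustrative choice, not the estimator introduced in the paper.

```python
import numpy as np

def penalty_location(x, penalty, grid_size=2001):
    """Location estimate: the value minimising the total penalty on the
    deviations of the inputs from the output. A grid search is used because
    the penalty may be non-convex."""
    x = np.asarray(x, dtype=float)
    grid = np.linspace(x.min(), x.max(), grid_size)
    totals = np.array([np.sum(penalty(x - a)) for a in grid])
    return grid[np.argmin(totals)]

x = np.array([4.9, 5.1, 5.0, 5.2, 4.8, 20.0])          # one gross outlier

squared = lambda r: r ** 2                              # recovers the arithmetic mean
absolute = lambda r: np.abs(r)                          # recovers the median
bounded = lambda r: np.minimum(r ** 2, 1.0)             # caps each input's influence

print(penalty_location(x, squared),                     # ~ 7.5, dragged by the outlier
      penalty_location(x, absolute),                    # ~ 5.05, the median region
      penalty_location(x, bounded))                     # ~ 5.0, outlier effectively ignored
```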

Relevance: 10.00%

Abstract:

Regression lies at the heart of statistics; it is one of the most important branches of multivariate techniques available for extracting knowledge in almost every field of study and research. Nowadays it has drawn huge interest from fields such as machine learning, pattern recognition, and data mining. Investigating outliers (exceptional observations) is a century-long problem for data analysts and researchers. Blind application of data can have dangerous consequences, leading to the discovery of meaningless patterns and to imperfect knowledge. As a result of the digital revolution and the growth of the Internet and intranets, data continue to accumulate at an exponential rate, and the detection of outliers and the study of their costs and benefits as a tool for reliable knowledge discovery therefore demand careful attention. Investigating outliers in regression has received great attention over the last few decades within two schools of thought: robust regression and regression diagnostics. Robust regression first fits a regression to the majority of the data and then identifies outliers as those points with large residuals from the robust fit, whereas regression diagnostics first finds the outliers, deletes or corrects them, and then fits the remaining data by classical methods. Initially there was much confusion, but researchers have now reached a consensus: robustness and diagnostics are two complementary approaches to the analysis of data, and neither alone is sufficient. In this chapter, we discuss both approaches under the single umbrella of regression diagnostics. The chapter explains the necessity of and perspectives on regression diagnostics and presents several contemporary methods from each of these categories through numerical examples in linear regression, together with current challenges and possible future research directions. Our aim is to make the chapter self-contained while maintaining its general accessibility.
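
The contrast between the two approaches can be illustrated with standard tools, as in the sketch below: an OLS fit followed by studentized residuals and Cook's distance (the diagnostics route), versus a robust fit followed by inspection of its residuals (the robust route, with Huber M-estimation standing in for high-breakdown methods such as least trimmed squares). The data and thresholds are illustrative.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import HuberRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=60)
y[:5] += 12.0                                   # plant five vertical outliers

# Regression diagnostics: fit OLS first, then flag suspicious points.
ols = sm.OLS(y, sm.add_constant(X)).fit()
infl = ols.get_influence()
cooks_d = infl.cooks_distance[0]
student = infl.resid_studentized_external
flagged = np.where((cooks_d > 4 / len(y)) | (np.abs(student) > 2.5))[0]
print("diagnostics flag:", flagged)

# Robust regression: fit to the bulk of the data first, then inspect residuals
# from the robust fit.
rob = HuberRegressor().fit(X, y)
resid = y - rob.predict(X)
mad_scale = np.median(np.abs(resid)) / 0.6745
print("robust fit flags:", np.where(np.abs(resid) > 3 * mad_scale)[0])
```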

Relevance: 10.00%

Abstract:

Compressed sensing (CS) is a new information sampling theory for acquiring sparse or compressible data with far fewer measurements than otherwise required by the Nyquist/Shannon counterpart. This is particularly important for imaging applications such as magnetic resonance imaging or astronomy. However, in the existing CS formulation, the use of the ℓ2 norm on the residuals is not particularly efficient when the noise is impulsive. This could lead to an increase in the upper bound of the recovery error. To address this problem, we consider a robust formulation for CS to suppress outliers in the residuals. We propose an iterative algorithm for solving the robust CS problem that exploits the power of existing CS solvers. We also show that the upper bound on the recovery error in the case of non-Gaussian noise is reduced, and we demonstrate the efficacy of the method through numerical studies.
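
One generic way to realise such an iterative scheme is sketched below: each iteration re-solves a weighted sparse recovery problem with an off-the-shelf solver (Lasso used as a stand-in for a CS solver), with Huber-type weights that downweight measurements carrying large residuals. This is an illustration of the general idea rather than the specific algorithm or error bound developed in the paper.

```python
import numpy as np
from sklearn.linear_model import Lasso

def robust_cs(A, b, alpha=0.01, n_iter=10, delta=None):
    """Sparse recovery under impulsive measurement noise via iterative
    reweighting around a standard sparse solver (Lasso as a stand-in)."""
    m, n = A.shape
    w = np.ones(m)
    x = np.zeros(n)
    for _ in range(n_iter):
        sw = np.sqrt(w)[:, None]
        model = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
        model.fit(sw * A, np.sqrt(w) * b)           # weighted least-squares data term
        x = model.coef_
        r = b - A @ x
        scale = np.median(np.abs(r)) / 0.6745 + 1e-12
        d = delta if delta is not None else 1.345 * scale
        w = np.where(np.abs(r) <= d, 1.0, d / np.abs(r))   # Huber-type weights
    return x

# Hypothetical setup: 80 random measurements of a 10-sparse signal in R^200,
# with a few measurements corrupted by large impulses.
rng = np.random.default_rng(4)
A = rng.normal(size=(80, 200)) / np.sqrt(80)
x_true = np.zeros(200)
x_true[rng.choice(200, 10, replace=False)] = rng.normal(0, 3, 10)
b = A @ x_true + 0.01 * rng.normal(size=80)
b[:5] += 10.0                                       # impulsive noise on five measurements
print(np.linalg.norm(robust_cs(A, b) - x_true))     # reconstruction error
```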