89 resultados para Query clustering


Relevância:

20.00% 20.00%

Publicador:

Resumo:

As end-user computing becomes more pervasive, an organization's success increasingly depends on the ability of end-users, usually in managerial positions, to extract appropriate data from both internal and external sources. Many of these data sources include or are derived from the organization's accounting information systems. Managerial end-users with different personal characteristics and approaches are likely to compose queries of differing levels of accuracy when searching the data contained within these accounting information systems. This research investigates how cognitive style elements of personality influence managerial end-user performance in database querying tasks. A laboratory experiment was conducted in which participants generated queries to retrieve information from an accounting information system to satisfy typical information requirements. The experiment investigated the influence of personality on the accuracy of queries of varying degrees of complexity. Relying on the Myers–Briggs personality instrument, results show that perceiving individuals (as opposed to judging individuals) who rely on intuition (as opposed to sensing) composed queries more accurately. As expected, query complexity and academic performance also explain the success of data extraction tasks.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

In microarray studies, the application of clustering techniques is often used to derive meaningful insights into the data. In the past, hierarchical methods have been the primary clustering tool employed to perform this task. The hierarchical algorithms have been mainly applied heuristically to these cluster analysis problems. Further, a major limitation of these methods is their inability to determine the number of clusters. Thus there is a need for a model-based approach to these. clustering problems. To this end, McLachlan et al. [7] developed a mixture model-based algorithm (EMMIX-GENE) for the clustering of tissue samples. To further investigate the EMMIX-GENE procedure as a model-based -approach, we present a case study involving the application of EMMIX-GENE to the breast cancer data as studied recently in van 't Veer et al. [10]. Our analysis considers the problem of clustering the tissue samples on the basis of the genes which is a non-standard problem because the number of genes greatly exceed the number of tissue samples. We demonstrate how EMMIX-GENE can be useful in reducing the initial set of genes down to a more computationally manageable size. The results from this analysis also emphasise the difficulty associated with the task of separating two tissue groups on the basis of a particular subset of genes. These results also shed light on why supervised methods have such a high misallocation error rate for the breast cancer data.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

We consider the problem of assessing the number of clusters in a limited number of tissue samples containing gene expressions for possibly several thousands of genes. It is proposed to use a normal mixture model-based approach to the clustering of the tissue samples. One advantage of this approach is that the question on the number of clusters in the data can be formulated in terms of a test on the smallest number of components in the mixture model compatible with the data. This test can be carried out on the basis of the likelihood ratio test statistic, using resampling to assess its null distribution. The effectiveness of this approach is demonstrated on simulated data and on some microarray datasets, as considered previously in the bioinformatics literature. (C) 2004 Elsevier Inc. All rights reserved.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Stable social aggregations are rarely recorded in lizards, but have now been reported from several species in the Australian scincid genus Egernia. Most of those examples come from species using rock crevice refuges that are relatively easy to observe. But for many other Egernia species that occupy different habitats and are more secretive, it is hard to gather the observational data needed to deduce their social structure. Therefore, we used genotypes at six polymorphic microsatellite DNA loci of 229 individuals of Egernia frerei, trapped in 22 sampling sites over 3500 ha of eucalypt forest on Fraser Island, Australia. Each sampling site contained 15 trap locations in a 100 x 50 m grid. We estimated relatedness among pairs of individuals and found that relatedness was higher within than between sites. Relatedness of females within sites was higher than relatedness of males, and was higher than relatedness between males and females. Within sites we found that juvenile lizards were highly related to other juveniles and to adults trapped at the same location, or at adjacent locations, but relatedness decreased with increasing trap separation. We interpreted the results as suggesting high natal philopatry among juvenile lizards and adult females. This result is consistent with stable family group structure previously reported in rock dwelling Egernia species, and suggests that social behaviour in this genus is not habitat driven.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Objectives: The objectives of this study were to examine the extent of clustering of smoking, high levels of television watching, overweight, and high blood pressure among adolescents and whether this clustering varies by socioeconomic position and Cognitive function. Methods: This study was a cross-sectional analysis of 3613 (1742 females) participants of an Australian birth cohort who were examined at age 14. Results: Three hundred fifty-three (9.8%) of the participants had co-occurrence of three or four risk factors. Risk factors clustered in these adolescents with a greater number of participants than would be predicted by assumptions of independence having no risk factors and three or four risk factors. The extent of clustering tended to be greater in those from lower-income families and among those with lower cognitive function. The age-adjusted ratio of observed to expected cooccurrence of three or four risk factors was 2.70 (95% confidence interval [Cl], 1.80-4.06) among those from low-income families and 1.70 (95% Cl, 1.34-2.16) among those from more affluent families. The ratio among those with low Raven's scores (nonverbal reasoning) was 2.36 (95% Cl, 1.69-3.30) and among those with higher scores was 1.51 (95% Cl, 1.19-1.92); similar results for the WRAT 3 score (reading ability) were 2.69 (95% Cl, 1.85-3.94) and 1.68 (95% Cl, 1.34-2.11). Clustering did not differ by sex. Conclusion: Among adolescents, coronary heart disease risk factors cluster, and there is some evidence that this clustering is greater among those from families with low income and those who have lower cognitive function.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Motivation: The clustering of gene profiles across some experimental conditions of interest contributes significantly to the elucidation of unknown gene function, the validation of gene discoveries and the interpretation of biological processes. However, this clustering problem is not straightforward as the profiles of the genes are not all independently distributed and the expression levels may have been obtained from an experimental design involving replicated arrays. Ignoring the dependence between the gene profiles and the structure of the replicated data can result in important sources of variability in the experiments being overlooked in the analysis, with the consequent possibility of misleading inferences being made. We propose a random-effects model that provides a unified approach to the clustering of genes with correlated expression levels measured in a wide variety of experimental situations. Our model is an extension of the normal mixture model to account for the correlations between the gene profiles and to enable covariate information to be incorporated into the clustering process. Hence the model is applicable to longitudinal studies with or without replication, for example, time-course experiments by using time as a covariate, and to cross-sectional experiments by using categorical covariates to represent the different experimental classes. Results: We show that our random-effects model can be fitted by maximum likelihood via the EM algorithm for which the E(expectation) and M(maximization) steps can be implemented in closed form. Hence our model can be fitted deterministically without the need for time-consuming Monte Carlo approximations. The effectiveness of our model-based procedure for the clustering of correlated gene profiles is demonstrated on three real datasets, representing typical microarray experimental designs, covering time-course, repeated-measurement and cross-sectional data. In these examples, relevant clusters of the genes are obtained, which are supported by existing gene-function annotation. A synthetic dataset is considered too.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Multiresolution Triangular Mesh (MTM) models are widely used to improve the performance of large terrain visualization by replacing the original model with a simplified one. MTM models, which consist of both original and simplified data, are commonly stored in spatial database systems due to their size. The relatively slow access speed of disks makes data retrieval the bottleneck of such terrain visualization systems. Existing spatial access methods proposed to address this problem rely on main-memory MTM models, which leads to significant overhead during query processing. In this paper, we approach the problem from a new perspective and propose a novel MTM called direct mesh that is designed specifically for secondary storage. It supports available indexing methods natively and requires no modification to MTM structure. Experiment results, which are based on two real-world data sets, show an average performance improvement of 5-10 times over the existing methods.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Quantile computation has many applications including data mining and financial data analysis. It has been shown that an is an element of-approximate summary can be maintained so that, given a quantile query d (phi, is an element of), the data item at rank [phi N] may be approximately obtained within the rank error precision is an element of N over all N data items in a data stream or in a sliding window. However, scalable online processing of massive continuous quantile queries with different phi and is an element of poses a new challenge because the summary is continuously updated with new arrivals of data items. In this paper, first we aim to dramatically reduce the number of distinct query results by grouping a set of different queries into a cluster so that they can be processed virtually as a single query while the precision requirements from users can be retained. Second, we aim to minimize the total query processing costs. Efficient algorithms are developed to minimize the total number of times for reprocessing clusters and to produce the minimum number of clusters, respectively. The techniques are extended to maintain near-optimal clustering when queries are registered and removed in an arbitrary fashion against whole data streams or sliding windows. In addition to theoretical analysis, our performance study indicates that the proposed techniques are indeed scalable with respect to the number of input queries as well as the number of items and the item arrival rate in a data stream.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

We have undertaken two-dimensional gel electrophoresis proteomic profiling on a series of cell lines with different recombinant antibody production rates. Due to the nature of gel-based experiments not all protein spots are detected across all samples in an experiment, and hence datasets are invariably incomplete. New approaches are therefore required for the analysis of such graduated datasets. We approached this problem in two ways. Firstly, we applied a missing value imputation technique to calculate missing data points. Secondly, we combined a singular value decomposition based hierarchical clustering with the expression variability test to identify protein spots whose expression correlates with increased antibody production. The results have shown that while imputation of missing data was a useful method to improve the statistical analysis of such data sets, this was of limited use in differentiating between the samples investigated, and highlighted a small number of candidate proteins for further investigation. (c) 2006 Elsevier B.V. All rights reserved.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

In many advanced applications, data are described by multiple high-dimensional features. Moreover, different queries may weight these features differently; some may not even specify all the features. In this paper, we propose our solution to support efficient query processing in these applications. We devise a novel representation that compactly captures f features into two components: The first component is a 2D vector that reflects a distance range ( minimum and maximum values) of the f features with respect to a reference point ( the center of the space) in a metric space and the second component is a bit signature, with two bits per dimension, obtained by analyzing each feature's descending energy histogram. This representation enables two levels of filtering: The first component prunes away points that do not share similar distance ranges, while the bit signature filters away points based on the dimensions of the relevant features. Moreover, the representation facilitates the use of a single index structure to further speed up processing. We employ the classical B+-tree for this purpose. We also propose a KNN search algorithm that exploits the access orders of critical dimensions of highly selective features and partial distances to prune the search space more effectively. Our extensive experiments on both real-life and synthetic data sets show that the proposed solution offers significant performance advantages over sequential scan and retrieval methods using single and multiple VA-files.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Quality of life has been shown to be poor among people living with chronic hepatitis C However, it is not clear how this relates to the presence of symptoms and their severity. The aim of this study was to describe the typology of a broad array of symptoms that were attributed to hepatitis C virus (HCV) infection. Phase I used qualitative methods to identify symptoms. In Phase 2, 188 treatment-naive people living with HCV participated in a quantitative survey. The most prevalent symptom was physical tiredness (86%) followed by irritability (75%), depression (70%), mental tiredness (70%), and abdominal pain (68%). Temporal clustering of symptoms was reported in 62% of participants. Principal components analysis identified four symptom clusters: neuropsychiatric (mental tiredness, poor concentration, forgetfulness, depression, irritability, physical tiredness, and sleep problems); gastrointestinal (day sweats, nausea, food intolerance, night sweats, abdominal pain, poor appetite, and diarrhea); algesic (joint pain, muscle pain, and general body pain); and dysesthetic (noise sensitivity, light sensitivity, skin. problems, and headaches). These data demonstrate that symptoms are prevalent in treatment-naive people with HCV and support the hypothesis that symptom clustering occurs.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Semantic data models provide a map of the components of an information system. The characteristics of these models affect their usefulness for various tasks (e.g., information retrieval). The quality of information retrieval has obvious important consequences, both economic and otherwise. Traditionally, data base designers have produced parsimonious logical data models. In spite of their increased size, ontologically clearer conceptual models have been shown to facilitate better performance for both problem solving and information retrieval tasks in experimental settings. The experiments producing evidence of enhanced performance for ontologically clearer models have, however, used application domains of modest size. Data models in organizational settings are likely to be substantially larger than those used in these experiments. This research used an experiment to investigate whether the benefits of improved information retrieval performance associated with ontologically clearer models are robust as the size of the application domains increase. The experiment used an application domain of approximately twice the size as tested in prior experiments. The results indicate that, relative to the users of the parsimonious implementation, end users of the ontologically clearer implementation made significantly more semantic errors, took significantly more time to compose their queries, and were significantly less confident in the accuracy of their queries.