137 resultados para document clustering


Relevância:

20.00% 20.00%

Publicador:

Resumo:

In this thesis we investigate the use of quantum probability theory for ranking documents. Quantum probability theory is used to estimate the probability of relevance of a document given a user's query. We posit that quantum probability theory can lead to a better estimation of the probability of a document being relevant to a user's query than the common approach, i. e. the Probability Ranking Principle (PRP), which is based upon Kolmogorovian probability theory. Following our hypothesis, we formulate an analogy between the document retrieval scenario and a physical scenario, that of the double slit experiment. Through the analogy, we propose a novel ranking approach, the quantum probability ranking principle (qPRP). Key to our proposal is the presence of quantum interference. Mathematically, this is the statistical deviation between empirical observations and expected values predicted by the Kolmogorovian rule of additivity of probabilities of disjoint events in configurations such that of the double slit experiment. We propose an interpretation of quantum interference in the document ranking scenario, and examine how quantum interference can be effectively estimated for document retrieval. To validate our proposal and to gain more insights about approaches for document ranking, we (1) analyse PRP, qPRP and other ranking approaches, exposing the assumptions underlying their ranking criteria and formulating the conditions for the optimality of the two ranking principles, (2) empirically compare three ranking principles (i. e. PRP, interactive PRP, and qPRP) and two state-of-the-art ranking strategies in two retrieval scenarios, those of ad-hoc retrieval and diversity retrieval, (3) analytically contrast the ranking criteria of the examined approaches, exposing similarities and differences, (4) study the ranking behaviours of approaches alternative to PRP in terms of the kinematics they impose on relevant documents, i. e. by considering the extent and direction of the movements of relevant documents across the ranking recorded when comparing PRP against its alternatives. Our findings show that the effectiveness of the examined ranking approaches strongly depends upon the evaluation context. In the traditional evaluation context of ad-hoc retrieval, PRP is empirically shown to be better or comparable to alternative ranking approaches. However, when we turn to examine evaluation contexts that account for interdependent document relevance (i. e. when the relevance of a document is assessed also with respect to other retrieved documents, as it is the case in the diversity retrieval scenario) then the use of quantum probability theory and thus of qPRP is shown to improve retrieval and ranking effectiveness over the traditional PRP and alternative ranking strategies, such as Maximal Marginal Relevance, Portfolio theory, and Interactive PRP. This work represents a significant step forward regarding the use of quantum theory in information retrieval. It demonstrates in fact that the application of quantum theory to problems within information retrieval can lead to improvements both in modelling power and retrieval effectiveness, allowing the constructions of models that capture the complexity of information retrieval situations. Furthermore, the thesis opens up a number of lines for future research. These include: (1) investigating estimations and approximations of quantum interference in qPRP; (2) exploiting complex numbers for the representation of documents and queries, and; (3) applying the concepts underlying qPRP to tasks other than document ranking.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Upon overexpression of integrin αvβ3 and its engagement by vitronectin, we previously showed enhanced adhesion, proliferation, and motility of human ovarian cancer cells. By studying differential expression of genes possibly related to these tumor biological events, we identified the epidermal growth-factor receptor (EGF-R) to be under control of αvβ3 expression levels. Thus in the present study we characterized αvβ3-dependent changes of EGF-R and found significant upregulation of its expression and activity which was reflected by prominent changes of EGF-R promoter activity. Upon disruption of DNA-binding motifs for the transcription factors p53, ETF, the repressor ETR, p50, and c-rel, respectively, we sought to identify DNA elements contributing to αvβ3-mediated EGF-R promoter induction. Both, the p53- and ETF-mutant, while exhibiting considerably lower EGF-R promoter activity than the wild type promoter, retained inducibility by αvβ3. Mutation of the repressor motif ETR, as expected, enhanced EGF-R promoter activity with a further moderate increase upon αvβ3 elevation. The p50-mutant displayed EGF-R promoter activity almost comparable to that of the wild type promoter with no impairment of induction by αvβ3. However, the activity of an EGF-R promoter mutant displaying a disrupted c-rel-binding motif did not only prominently decline, but, moreover, was not longer responsive to enhanced αvβ3, involving this DNA element in αvβ3-dependent EGF-R upregulation. Moreover, αvβ3 did not only increase the EGF-R but, moreover, also led to obvious co-clustering on the cancer cell surface. By studying αvβ3/EGF-R-effects on the focal adhesion kinase (FAK) and the mitogen activated protein kinases (MAPK) p44/42 (erk−1/erk−2), having important functions in synergistic crosstalk between integrins and growth-factor receptors, we found for both significant enhancement of expression and activity upon αvβ3/VN interaction and cell stimulation by EGF. Upregulation of the EGF-R by integrin αvβ3, both receptor molecules with a well-defined role as targets for cancer treatment, might represent an additional mechanism to adapt synergistic receptor signaling and crosstalk in response to an altered tumor cell microenvironment during ovarian cancer progression.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Automated process discovery techniques aim at extracting process models from information system logs. Existing techniques in this space are effective when applied to relatively small or regular logs, but generate spaghetti-like and sometimes inaccurate models when confronted to logs with high variability. In previous work, trace clustering has been applied in an attempt to reduce the size and complexity of automatically discovered process models. The idea is to split the log into clusters and to discover one model per cluster. This leads to a collection of process models – each one representing a variant of the business process – as opposed to an all-encompassing model. Still, models produced in this way may exhibit unacceptably high complexity and low fitness. In this setting, this paper presents a two-way divide-and-conquer process discovery technique, wherein the discovered process models are split on the one hand by variants and on the other hand hierarchically using subprocess extraction. Splitting is performed in a controlled manner in order to achieve user-defined complexity or fitness thresholds. Experiments on real-life logs show that the technique produces collections of models substantially smaller than those extracted by applying existing trace clustering techniques, while allowing the user to control the fitness of the resulting models.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Newsletter ACM SIGIR Forum: The Seventeenth Australian Document Computing Symposium was held in Dunedin, New Zealand on the 5th and 6th of December 2012. In total twenty four papers were submitted. From those eleven were accepted for full presentation and 8 for short presentation. A poster session was held jointly with the Australasian Language Technology Workshop.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

In this paper we describe the approaches adopted to generate the runs submitted to ImageCLEFPhoto 2009 with an aim to promote document diversity in the rankings. Four of our runs are text based approaches that employ textual statistics extracted from the captions of images, i.e. MMR [1] as a state of the art method for result diversification, two approaches that combine relevance information and clustering techniques, and an instantiation of Quantum Probability Ranking Principle. The fifth run exploits visual features of the provided images to re-rank the initial results by means of Factor Analysis. The results reveal that our methods based on only text captions consistently improve the performance of the respective baselines, while the approach that combines visual features with textual statistics shows lower levels of improvements.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Time series classification has been extensively explored in many fields of study. Most methods are based on the historical or current information extracted from data. However, if interest is in a specific future time period, methods that directly relate to forecasts of time series are much more appropriate. An approach to time series classification is proposed based on a polarization measure of forecast densities of time series. By fitting autoregressive models, forecast replicates of each time series are obtained via the bias-corrected bootstrap, and a stationarity correction is considered when necessary. Kernel estimators are then employed to approximate forecast densities, and discrepancies of forecast densities of pairs of time series are estimated by a polarization measure, which evaluates the extent to which two densities overlap. Following the distributional properties of the polarization measure, a discriminant rule and a clustering method are proposed to conduct the supervised and unsupervised classification, respectively. The proposed methodology is applied to both simulated and real data sets, and the results show desirable properties.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The primary aim of this paper was to investigate heterogeneity in language abilities of children with a confirmed diagnosis of an ASD (N = 20) and children with typical development (TD; N = 15). Group comparisons revealed no differences between ASD and TD participants on standard clinical assessments of language ability, reading ability or nonverbal intelligence. However, a hierarchical cluster analysis based on spoken nonword repetition and sentence repetition identified two clusters within the combined group of ASD and TD participants. The first cluster (N = 6) presented with significantly poorer performances than the second cluster (N = 29) on both of the clustering variables in addition to single word and nonword reading. The significant differences between the two clusters occur within a context of Cluster 1 having language impairment and a tendency towards more severe autistic symptomatology. Differences between the oral language abilities of the first and second clusters are considered in light of diagnosis, attention and verbal short term memory skills and reading impairment.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

This article presents a study of how humans perceive and judge the relevance of documents. Humans are adept at making reasonably robust and quick decisions about what information is relevant to them, despite the ever increasing complexity and volume of their surrounding information environment. The literature on document relevance has identified various dimensions of relevance (e.g., topicality, novelty, etc.), however little is understood about how these dimensions may interact. We performed a crowdsourced study of how human subjects judge two relevance dimensions in relation to document snippets retrieved from an internet search engine. The order of the judgment was controlled. For those judgments exhibiting an order effect, a q–test was performed to determine whether the order effects can be explained by a quantum decision model based on incompatible decision perspectives. Some evidence of incompatibility was found which suggests incompatible decision perspectives is appropriate for explaining interacting dimensions of relevance in such instances.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Given the drawbacks for using geo-political areas in mapping outcomes unrelated to geo-politics, a compromise is to aggregate and analyse data at the grid level. This has the advantage of allowing spatial smoothing and modelling at a biologically or physically relevant scale. This article addresses two consequent issues: the choice of the spatial smoothness prior and the scale of the grid. Firstly, we describe several spatial smoothness priors applicable for grid data and discuss the contexts in which these priors can be employed based on different aims. Two such aims are considered, i.e., to identify regions with clustering and to model spatial dependence in the data. Secondly, the choice of the grid size is shown to depend largely on the spatial patterns. We present a guide on the selection of spatial scales and smoothness priors for various point patterns based on the two aims for spatial smoothing.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

We present a novel method for improving hierarchical speaker clustering in the tasks of speaker diarization and speaker linking. In hierarchical clustering, a tree can be formed that demonstrates various levels of clustering. We propose a ratio that expresses the impact of each cluster on the formation of this tree and use this to rescale cluster scores. This provides score normalisation based on the impact of each cluster. We use a state-of-the-art speaker diarization and linking system across the SAIVT-BNEWS corpus to show that our proposed impact ratio can provide a relative improvement of 16% in diarization error rate (DER).

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Empirical evidence shows that repositories of business process models used in industrial practice contain significant amounts of duplication. This duplication arises for example when the repository covers multiple variants of the same processes or due to copy-pasting. Previous work has addressed the problem of efficiently retrieving exact clones that can be refactored into shared subprocess models. This article studies the broader problem of approximate clone detection in process models. The article proposes techniques for detecting clusters of approximate clones based on two well-known clustering algorithms: DBSCAN and Hi- erarchical Agglomerative Clustering (HAC). The article also defines a measure of standardizability of an approximate clone cluster, meaning the potential benefit of replacing the approximate clones with a single standardized subprocess. Experiments show that both techniques, in conjunction with the proposed standardizability measure, accurately retrieve clusters of approximate clones that originate from copy-pasting followed by independent modifications to the copied fragments. Additional experiments show that both techniques produce clusters that match those produced by human subjects and that are perceived to be standardizable.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

PURPOSE The purpose of this study was to demonstrate the potential of near infrared (NIR) spectroscopy for characterizing the health and degenerative state of articular cartilage based on the components of the Mankin score. METHODS Three models of osteoarthritic degeneration induced in laboratory rats by anterior cruciate ligament (ACL) transection, meniscectomy (MSX), and intra-articular injection of monoiodoacetate (1 mg) (MIA) were used in this study. Degeneration was induced in the right knee joint; each model group consisted of 12 rats (N = 36). After 8 weeks, the animals were euthanized and knee joints were collected. A custom-made diffuse reflectance NIR probe of 5-mm diameter was placed on the tibial and femoral surfaces, and spectral data were acquired from each specimen in the wave number range of 4,000 to 12,500 cm(-1). After spectral data acquisition, the specimens were fixed and safranin O staining (SOS) was performed to assess disease severity based on the Mankin scoring system. Using multivariate statistical analysis, with spectral preprocessing and wavelength selection technique, the spectral data were then correlated to the structural integrity (SI), cellularity (CEL), and matrix staining (SOS) components of the Mankin score for all the samples tested. RESULTS ACL models showed mild cartilage degeneration, MSX models had moderate degeneration, and MIA models showed severe cartilage degenerative changes both morphologically and histologically. Our results reveal significant linear correlations between the NIR absorption spectra and SI (R(2) = 94.78%), CEL (R(2) = 88.03%), and SOS (R(2) = 96.39%) parameters of all samples in the models. In addition, clustering of the samples according to their level of degeneration, with respect to the Mankin components, was also observed. CONCLUSIONS NIR spectroscopic probing of articular cartilage can potentially provide critical information about the health of articular cartilage matrix in early and advanced stages of osteoarthritis (OA). CLINICAL RELEVANCE This rapid nondestructive method can facilitate clinical appraisal of articular cartilage integrity during arthroscopic surgery.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Description of a patient's injuries is recorded in narrative text form by hospital emergency departments. For statistical reporting, this text data needs to be mapped to pre-defined codes. Existing research in this field uses the Naïve Bayes probabilistic method to build classifiers for mapping. In this paper, we focus on providing guidance on the selection of a classification method. We build a number of classifiers belonging to different classification families such as decision tree, probabilistic, neural networks, and instance-based, ensemble-based and kernel-based linear classifiers. An extensive pre-processing is carried out to ensure the quality of data and, in hence, the quality classification outcome. The records with a null entry in injury description are removed. The misspelling correction process is carried out by finding and replacing the misspelt word with a soundlike word. Meaningful phrases have been identified and kept, instead of removing the part of phrase as a stop word. The abbreviations appearing in many forms of entry are manually identified and only one form of abbreviations is used. Clustering is utilised to discriminate between non-frequent and frequent terms. This process reduced the number of text features dramatically from about 28,000 to 5000. The medical narrative text injury dataset, under consideration, is composed of many short documents. The data can be characterized as high-dimensional and sparse, i.e., few features are irrelevant but features are correlated with one another. Therefore, Matrix factorization techniques such as Singular Value Decomposition (SVD) and Non Negative Matrix Factorization (NNMF) have been used to map the processed feature space to a lower-dimensional feature space. Classifiers with these reduced feature space have been built. In experiments, a set of tests are conducted to reflect which classification method is best for the medical text classification. The Non Negative Matrix Factorization with Support Vector Machine method can achieve 93% precision which is higher than all the tested traditional classifiers. We also found that TF/IDF weighting which works well for long text classification is inferior to binary weighting in short document classification. Another finding is that the Top-n terms should be removed in consultation with medical experts, as it affects the classification performance.