14 results for Vector Space IR, Search Engines, Document Clustering, Document

in Aston University Research Archive


Relevance:

100.00% 100.00%

Publisher:

Abstract:

In this paper, we propose a text mining method called LRD (latent relation discovery), which extends the traditional vector space model of document representation in order to improve information retrieval (IR) on documents and document clustering. Our LRD method extracts terms and entities, such as person, organization, or project names, and discovers relationships between them by taking into account their co-occurrence in textual corpora. Given a target entity, LRD discovers other entities closely related to the target effectively and efficiently. With respect to such relatedness, a measure of relation strength between entities is defined. LRD uses relation strength to enhance the vector space model, and uses the enhanced vector space model for query-based IR on documents and for clustering documents in order to discover complex relationships among terms and entities. Our experiments on a standard dataset for query-based IR show that our LRD method performed significantly better than the traditional vector space model and five other standard statistical methods for vector expansion.
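The idea of enriching a vector space model with co-occurrence information can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give LRD's relation strength formula, so the sketch substitutes a Jaccard overlap of document frequencies, and the function names, the `alpha` damping parameter, and the toy corpus are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def relation_strength(docs):
    """Score term pairs by document co-occurrence.

    Assumption: Jaccard overlap stands in for the paper's own measure,
    which the abstract does not specify.
    """
    doc_freq = Counter()
    pair_freq = Counter()
    for tokens in docs:
        terms = sorted(set(tokens))
        doc_freq.update(terms)
        pair_freq.update(combinations(terms, 2))
    strength = {}
    for (a, b), both in pair_freq.items():
        score = both / (doc_freq[a] + doc_freq[b] - both)
        strength[(a, b)] = score
        strength[(b, a)] = score
    return strength


def expand(tokens, strength, alpha=0.5):
    """Term-frequency vector plus alpha-weighted mass on related terms."""
    vector = Counter(tokens)
    expanded = dict(vector)
    for (a, b), score in strength.items():
        if a in vector:
            expanded[b] = expanded.get(b, 0.0) + alpha * score * vector[a]
    return expanded


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy corpus of already-extracted terms and entities.
corpus = [
    ["alice", "project_x", "budget"],
    ["alice", "project_x", "deadline"],
    ["bob", "project_y", "budget"],
]
strength = relation_strength(corpus)
query = ["alice"]
doc = ["project_x", "deadline"]
plain_score = cosine(Counter(query), Counter(doc))
expanded_score = cosine(expand(query, strength), Counter(doc))
```

In this toy case the plain query shares no term with the document, so its cosine score is zero, while the expanded query picks up weight on the related entities and yields a positive score.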

Relevance:

100.00% 100.00%

Publisher:

Abstract:

When a query is passed to multiple search engines, each search engine returns a ranked list of documents. Researchers have demonstrated that combining results, in the form of a "metasearch engine", produces a significant improvement in coverage and search effectiveness. This paper proposes a linear programming mathematical model for optimizing the ranked list result of a given group of Web search engines for an issued query. An application with a numerical illustration shows the advantages of the proposed method. © 2011 Elsevier Ltd. All rights reserved.
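For readers unfamiliar with metasearch, the fusion step can be illustrated with a simple baseline. The sketch below is not the linear programming model proposed in the paper, whose formulation the abstract does not give; it uses a Borda count, a standard rank-aggregation heuristic, and the function name and document identifiers are hypothetical.

```python
from collections import defaultdict


def borda_fuse(ranked_lists):
    """Fuse ranked lists with a Borda count.

    A document at 0-based position p in a list earns n - p points, where n is
    the number of distinct documents overall; a document missing from a list
    earns nothing from that engine.
    """
    universe = {doc for ranking in ranked_lists for doc in ranking}
    n = len(universe)
    scores = defaultdict(int)
    for ranking in ranked_lists:
        for position, doc in enumerate(ranking):
            scores[doc] += n - position
    # Sort by descending score, then by identifier so ties are deterministic.
    return sorted(universe, key=lambda doc: (-scores[doc], doc))


# Hypothetical result lists returned by three engines for one query.
engine_results = [
    ["d1", "d2", "d3"],
    ["d2", "d1", "d4"],
    ["d2", "d3", "d1"],
]
fused = borda_fuse(engine_results)
```

An optimization-based model such as the one the paper proposes would replace the fixed positional scores with values chosen by solving a linear program.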

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Web document cluster analysis plays an important role in information retrieval by organizing large amounts of documents into a small number of meaningful clusters. Traditional web document clustering is based on the Vector Space Model (VSM), which takes into account only two-level (document and term) knowledge granularity but ignores the bridging paragraph granularity. However, this two-level granularity may lead to unsatisfactory clustering results with “false correlation”. In order to deal with the problem, a Hierarchical Representation Model with Multi-granularity (HRMM), which consists of a five-layer representation of data and a two-phase clustering process, is proposed based on granular computing and article structure theory. To deal with the zero-valued similarity problem resulting from the sparse term-paragraph matrix, an ontology-based strategy and a tolerance-rough-set-based strategy are introduced into HRMM. By using granular computing, structural knowledge hidden in documents can be more efficiently and effectively captured in HRMM and thus web document clusters with higher quality can be generated. Extensive experiments show that HRMM, HRMM with the tolerance-rough-set strategy, and HRMM with ontology all significantly outperform VSM and a representative non-VSM-based algorithm, WFP, in terms of the F-Score.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

This work contributes to the development of search engines that self-adapt their size in response to fluctuations in workload. Deploying a search engine in an Infrastructure as a Service (IaaS) cloud facilitates allocating or deallocating computational resources to or from the engine. In this paper, we focus on the problem of regrouping the metric-space search index when the number of virtual machines used to run the search engine is modified to reflect changes in workload. We propose an algorithm for incrementally adjusting the index to fit the varying number of virtual machines. We tested its performance using a custom-built prototype search engine deployed in the Amazon EC2 cloud, while calibrating the results to compensate for the performance fluctuations of the platform. Our experiments show that, when compared with computing the index from scratch, the incremental algorithm speeds up the index computation 2–10 times while maintaining a similar search performance.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

For a query submitted to multiple search engines, finding relevant results is an important task. This paper formulates the problem of aggregating and ranking the results of multiple search engines as a minimax linear programming model. Besides the novel application, this study detects the most relevant information among a returned set of ranked lists of documents retrieved by distinct search engines. Furthermore, two numerical examples are used to illustrate the usefulness of the proposed approach.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

This paper summarizes the scientific work presented at the 32nd European Conference on Information Retrieval. It demonstrates that information retrieval (IR) as a research area continues to thrive, with progress being made in three complementary sub-fields: IR theory and formal methods, together with indexing and query representation issues; Web IR as a primary application area; and research into evaluation methods and metrics. It is the combination of these areas that gives IR its solid scientific foundations. The paper also illustrates that significant progress has been made in other areas of IR. The keynote speakers addressed three such subject fields: social search engines using personalization and recommendation technologies, the renewed interest in applying natural language processing to IR, and multimedia IR as another fast-growing area.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Ontology search and reuse is becoming increasingly important as the quest for methods to reduce the cost of constructing such knowledge structures continues. A number of ontology libraries and search engines are coming into existence to facilitate locating and retrieving potentially relevant ontologies. The number of ontologies available for reuse is steadily growing, and so is the need for methods to evaluate and rank existing ontologies in terms of their relevance to the needs of the knowledge engineer. This paper presents AKTiveRank, a prototype system for ranking ontologies based on a number of structural metrics.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Representing knowledge using domain ontologies has been shown to be a useful mechanism and format for managing and exchanging information. Due to the difficulty and cost of building ontologies, a number of ontology libraries and search engines are coming into existence to facilitate reusing such knowledge structures. The need for ontology ranking techniques is becoming crucial as the number of ontologies available for reuse continues to grow. In this paper we present AKTiveRank, a prototype system for ranking ontologies based on the analysis of their structures. We describe the metrics used in the ranking system and present an experiment on ranking ontologies returned by a popular search engine for an example query.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Text classification is essential for narrowing down the number of documents relevant to a particular topic for further pursuit, especially when searching through large biomedical databases. Protein-protein interactions are an example of such a topic, with databases being devoted specifically to them. This paper proposes a semi-supervised learning algorithm via local learning with class priors (LL-CP) for biomedical text classification, where unlabeled data points are classified in a vector space based on their proximity to labeled nodes. The algorithm has been evaluated on a corpus of biomedical documents to identify abstracts containing information about protein-protein interactions, with promising results. Experimental results show that LL-CP outperforms traditional semi-supervised learning algorithms such as SVM, and it also performs better than local learning without incorporating class priors.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Development of mass spectrometry techniques to detect protein oxidation, which contributes to signalling and inflammation, is important. Label-free approaches have the advantage of reduced sample manipulation, but are challenging in complex samples owing to undirected analysis of large data sets using statistical search engines. To identify oxidised proteins in biological samples, we previously developed a targeted approach involving precursor ion scanning for diagnostic MS3 ions from oxidised residues. Here, we tested this approach for other oxidations, and compared it with an alternative approach involving the use of extracted ion chromatograms (XICs) generated from high-resolution MSMS data using very narrow mass windows. This accurate mass XIC data methodology was effective at identifying nitrotyrosine, chlorotyrosine, and oxidative deamination of lysine, and for tyrosine oxidations highlighted more modified peptide species than precursor ion scanning or statistical database searches. Although some false positive peaks still occurred in the XICs, these could be identified by comparative assessment of the peak intensities. The method has the advantage that a number of different modifications can be analysed simultaneously in a single LC-MSMS run. This article is part of a Special Issue entitled: Posttranslational Protein modifications in biology and Medicine. Biological significance: The use of accurate mass extracted product ion chromatograms to detect oxidised peptides could improve the identification of oxidatively damaged proteins in inflammatory conditions. © 2013 Elsevier B.V.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Determining the Ordered Weighted Averaging (OWA) operator weights is important in decision making applications. Several approaches have been proposed in the literature to obtain the associated weights. This paper provides an alternative disparity model to identify the OWA operator weights. The proposed mathematical model extends the existing disparity approaches by minimizing the sum of the deviation between two distinct OWA weights. The proposed disparity model can be used for a preference ranking aggregation. A numerical example in preference ranking and an application in search engines prove the usefulness of the generated OWA weights.
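As background for the weights discussed above, the sketch below shows how an OWA operator applies a given weight vector: the inputs are sorted in descending order and the j-th weight multiplies the j-th largest value. It does not implement the disparity model from the paper, which the abstract does not spell out; the function names and example scores are hypothetical.

```python
def owa(values, weights):
    """Ordered weighted average: weights apply to values sorted in descending order."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    ordered = sorted(values, reverse=True)
    return sum(w * v for w, v in zip(weights, ordered))


def orness(weights):
    """Yager's orness degree: 1 for max-like weights, 0 for min-like weights.

    Requires at least two weights.
    """
    n = len(weights)
    return sum((n - i) * w for i, w in enumerate(weights, start=1)) / (n - 1)


# Hypothetical scores assigned to one alternative by three sources.
scores = [0.2, 0.9, 0.5]
as_max = owa(scores, [1.0, 0.0, 0.0])       # behaves like max
as_min = owa(scores, [0.0, 0.0, 1.0])       # behaves like min
as_mean = owa(scores, [1 / 3, 1 / 3, 1 / 3])  # behaves like the arithmetic mean
```

Weight-determination models, including disparity approaches, choose the weight vector subject to a target orness level; the aggregation step itself stays as shown.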

Relevance:

100.00% 100.00%

Publisher:

Abstract:

This article presents a new method for data collection in regional dialectology based on site-restricted web searches. The method measures the usage and determines the distribution of lexical variants across a region of interest using common web search engines, such as Google or Bing. The method involves estimating the proportions of the variants of a lexical alternation variable over a series of cities by counting the number of webpages that contain the variants on newspaper websites originating from these cities through site-restricted web searches. The method is evaluated by mapping the 26 variants of 10 lexical variables with known distributions in American English. In almost all cases, the maps based on site-restricted web searches align closely with traditional dialect maps based on data gathered through questionnaires, demonstrating the accuracy of this method for the observation of regional linguistic variation. However, unlike collecting dialect data using traditional methods, which is a relatively slow process, the use of site-restricted web searches allows for dialect data to be collected from across a region as large as the United States in a matter of days.
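The proportion estimate described above amounts to simple arithmetic once the page counts are in hand. The sketch below assumes the counts have already been collected by hand or by some search tool; it does not query any search engine, and the city names, variant labels, and counts are hypothetical.

```python
def variant_proportions(page_counts):
    """Convert per-city page counts into per-city variant proportions."""
    proportions = {}
    for city, counts in page_counts.items():
        total = sum(counts.values())
        if total == 0:
            continue  # no hits for any variant, so no estimate for this city
        proportions[city] = {variant: n / total for variant, n in counts.items()}
    return proportions


# Hypothetical hit counts from site-restricted searches on one newspaper site per city.
page_counts = {
    "city_a": {"sneakers": 300, "tennis shoes": 100},
    "city_b": {"sneakers": 50, "tennis shoes": 150},
    "city_c": {"sneakers": 0, "tennis shoes": 0},
}
proportions = variant_proportions(page_counts)
```

Normalizing within each city keeps the estimate comparable across newspaper sites of very different sizes, which is presumably why the method works with proportions rather than raw counts.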

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Phospholipid oxidation can generate reactive and electrophilic products that are capable of modifying proteins, especially at cysteine, lysine and histidine residues. Such lipoxidation reactions are known to alter protein structure and function, both with gain of function and loss of activity effects. As well as potential importance in the redox regulation of cell behaviour, lipoxidation products in plasma could also be useful biomarkers for stress conditions. Although studies with antibodies suggested the occurrence of lipoxidation adducts on ApoB-100, these products had not previously been characterized at a molecular level. We have developed new mass spectrometry-based approaches to detect and locate adducts of oxidized phospholipids in plasma proteins, as well as direct oxidation modifications of proteins, which avoid some of the problems typically encountered with database search engines leading to erroneous identifications of oxidative PTMs. This approach uses accurate mass extracted ion chromatograms (XICs) of fragment ions from peptides containing oxPTMs, and allows multiple modifications to be examined regardless of the protein that contains them. For example, a reporter ion at 184.074 Da/e corresponding to phosphocholine indicated the presence of oxidized phosphatidylcholine adducts, while 2 reporter ions at 100.078 and 82.025 Da/e were selective for allysine. ApoB-100-oxidized phospholipid adducts were detected even in healthy human samples, as well as LDL from patients with inflammatory disease. Lipidomic studies showed that more than 350 different species of lipid were present in LDL, and were altered in disease conditions. LDL clearly represents a very complex carrier system and one that offers a rich source of information about systemic conditions, with potential as indicators of oxidative damage in ageing or inflammatory diseases.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Oxidative post-translational modifications (oxPTMs) can alter the function of proteins, and are important in the redox regulation of cell behaviour. The most informative technique to detect and locate oxPTMs within proteins is mass spectrometry (MS). However, proteomic MS data are usually searched against theoretical databases using statistical search engines, and the occurrence of unspecified or multiple modifications, or other unexpected features, can lead to failure to detect the modifications and erroneous identifications of oxPTMs. We have developed a new approach for mining data from accurate mass instruments that allows multiple modifications to be examined. Accurate mass extracted ion chromatograms (XICs) for specific reporter ions from peptides containing oxPTMs were generated from standard LC-MSMS data acquired on a rapid-scanning high-resolution mass spectrometer (ABSciex 5600 Triple TOF). The method was tested using proteins from human plasma or isolated LDL. A variety of modifications including chlorotyrosine, nitrotyrosine, kynurenine, oxidation of lysine, and oxidized phospholipid adducts were detected. For example, a reporter ion at 184.074 Da/e, corresponding to phosphocholine, was used to identify for the first time intact oxidized phosphatidylcholine adducts on LDL. In all cases the modifications were confirmed by manual sequencing. ApoB-100 containing oxidized lipid adducts was detected even in healthy human samples, as well as LDL from patients with chronic kidney disease. The accurate mass XIC method gave a lower false positive rate than normal database searching using statistical search engines, and identified more oxidatively modified peptides. A major advantage was that additional modifications could be searched after data collection, and multiple modifications on a single peptide identified. The oxPTMs present on albumin and ApoB-100 have potential as indicators of oxidative damage in ageing or inflammatory diseases.
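The core of an accurate mass XIC can be sketched as summing, for each MS/MS scan, the intensity of fragment peaks that fall within a narrow window around a reporter ion. This is a simplified illustration, not the authors' processing pipeline: the 10 ppm tolerance, the scan data layout, and the example peaks are assumptions, while the 184.074 reporter value comes from the abstract.

```python
def extract_xic(scans, target_mz, tol_ppm=10.0):
    """Sum fragment-ion intensity within a narrow m/z window for each scan.

    tol_ppm is an assumed tolerance; the abstract does not state the window width.
    """
    half_width = target_mz * tol_ppm / 1e6
    low, high = target_mz - half_width, target_mz + half_width
    xic = []
    for retention_time, peaks in scans:
        intensity = sum(i for mz, i in peaks if low <= mz <= high)
        xic.append((retention_time, intensity))
    return xic


PHOSPHOCHOLINE_MZ = 184.074  # reporter ion value quoted in the abstract

# Hypothetical MS/MS scans: (retention time in minutes, [(m/z, intensity), ...]).
scans = [
    (10.0, [(184.0741, 500.0), (184.2000, 900.0)]),
    (10.1, [(184.0735, 1200.0), (300.1500, 50.0)]),
    (10.2, [(175.1190, 400.0)]),
]
xic = extract_xic(scans, PHOSPHOCHOLINE_MZ)
```

Because the window is applied after acquisition, the same scans can be re-mined for another reporter ion simply by calling the function with a different target value, which matches the advantage the abstract describes.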