944 results for Document signature
Abstract:
Clustering is an important technique for organising and categorising web-scale documents. The main challenges faced in clustering the billions of documents available on the web are the processing power required and the sheer size of the datasets available. More importantly, it is nigh impossible to generate labels for a general web document collection containing billions of documents and a vast taxonomy of topics. However, document clusters are most commonly evaluated by comparison to a ground truth set of labels for documents. This paper presents a clustering and labeling solution in which Wikipedia is clustered and hundreds of millions of web documents in ClueWeb12 are mapped onto those clusters. This solution is based on the assumption that Wikipedia contains such a wide range of diverse topics that it represents a small-scale web. We found that it was possible to perform the web-scale document clustering and labeling process on one desktop computer in under a couple of days for the Wikipedia clustering solution containing about 1000 clusters. It takes longer to execute a solution with finer-granularity clusters, such as 10,000 or 50,000. These results were evaluated using a set of external data.
Abstract:
Performance comparisons between File Signatures and Inverted Files for text retrieval have previously shown several significant shortcomings of file signatures relative to inverted files. The inverted file approach underpins most state-of-the-art search engine algorithms, such as Language and Probabilistic models. It has been widely accepted that traditional file signatures are inferior alternatives to inverted files. This paper describes TopSig, a new approach to the construction of file signatures. Many advances in semantic hashing and dimensionality reduction have been made in recent times, but these have not so far been linked to general-purpose, signature-file-based search engines. This paper introduces a different signature file approach that builds upon and extends these recent advances. We are able to demonstrate significant improvements in the performance of signature-file-based indexing and retrieval, performance that is comparable to that of state-of-the-art inverted-file-based systems, including Language models and BM25. These findings suggest that file signatures offer a viable alternative to inverted files in suitable settings and position the file signatures model in the class of Vector Space retrieval models.
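The general idea of a document signature built by dimensionality reduction can be illustrated with a minimal sketch. This is not TopSig's implementation: it assumes a plain random-projection scheme in which each term is hashed to a pseudo-random vector of +1/-1 values, the vectors are summed per document, and the sign of each component becomes one bit. The names `SIG_BITS`, `term_vector`, `signature`, `hamming` and `search` are hypothetical, and the 64-bit width is chosen only for readability.

```python
import hashlib
import random

SIG_BITS = 64  # hypothetical signature width, chosen for illustration only


def term_vector(term, bits=SIG_BITS):
    """Map a term to a reproducible pseudo-random vector of +1/-1 values."""
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [rng.choice((-1, 1)) for _ in range(bits)]


def signature(text, bits=SIG_BITS):
    """Sum the term vectors of a document and keep only the sign of each component."""
    totals = [0] * bits
    for term in text.lower().split():
        for i, value in enumerate(term_vector(term, bits)):
            totals[i] += value
    sig = 0
    for i, total in enumerate(totals):
        if total > 0:
            sig |= 1 << i  # set bit i when the summed component is positive
    return sig


def hamming(a, b):
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


def search(query, docs, bits=SIG_BITS):
    """Rank documents by Hamming distance between their signature and the query's."""
    q = signature(query, bits)
    return sorted(docs, key=lambda doc: hamming(q, signature(doc, bits)))
```

Calling `search("file signatures", docs)` on a list of strings returns the same strings ordered from nearest to farthest signature. A practical system would precompute and store the document signatures rather than recomputing them per query, and would weight terms rather than counting each occurrence equally.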
Abstract:
Topographically and chemically modified titanium implants are recognized to have improved osteogenic properties; however, the molecular regulation of this process remains unknown. This study aimed to determine the microRNA profile and the potential regulation of osteogenic differentiation following early exposure of osteoprogenitor cells to sand-blasted, large-grit acid-etched (SLA) and hydrophilic SLA (modSLA) surfaces. Firstly, the osteogenic characteristics of the primary osteoprogenitor cells were confirmed using ALP activity and Alizarin Red S staining. The effect of smooth (SMO), SLA and modSLA surfaces on the TGF-β/BMP (BMP2, BMP6, ACVR1) and non-canonical WNT/Ca2+ (WNT5A, FZD6) pathways, as well as the integrins ITGB1 and ITGA2, was determined. It was revealed that the modified titanium surfaces could induce the activation of TGF-β/BMP and non-canonical WNT/Ca2+ signaling genes. The expression pattern of microRNAs (miRNAs) related to cell differentiation was evaluated. Statistical analysis of the differentially regulated miRNAs indicated that 35 and 32 miRNAs were down-regulated on the modSLA and SLA surfaces respectively, when compared with the smooth surface (SMO). Thirty-one miRNAs that were down-regulated were common to both modSLA and SLA. There were 10 miRNAs up-regulated on modSLA and nine on SLA surfaces, amongst which eight were the same as observed on modSLA. TargetScan predictions for the down-regulated miRNAs revealed genes of the TGF-β/BMP and non-canonical Ca2+ pathways as targets. This study demonstrated that modified titanium implant surfaces induce differential regulation of miRNAs, which potentially regulate the TGF-β/BMP and WNT/Ca2+ pathways during osteogenic differentiation on modified titanium implant surfaces.
Abstract:
This paper analyses the pairwise distances of signatures produced by the TopSig retrieval model on two document collections. The distribution of the distances is compared with that of purely random signatures. The analysis explains why TopSig is only competitive with state-of-the-art retrieval models at early precision. Only the local neighbourhood of the signatures is interpretable. We suggest this is a common property of vector space models.
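A random baseline of the kind mentioned here can be sketched as follows. For signatures whose bits are independent and uniformly random, the Hamming distance between any pair follows a Binomial(n, 1/2) distribution, with mean n/2 and standard deviation sqrt(n)/2 for n-bit signatures. The function names, signature width, sample size and seed below are illustrative assumptions, not values taken from the paper.

```python
import random
from itertools import combinations
from statistics import mean, pstdev


def random_signatures(count, bits, seed=0):
    """Generate `count` signatures whose bits are independent and uniformly random."""
    rng = random.Random(seed)
    return [rng.getrandbits(bits) for _ in range(count)]


def pairwise_distances(signatures):
    """Hamming distance for every unordered pair of signatures."""
    return [bin(a ^ b).count("1") for a, b in combinations(signatures, 2)]


def summarize(distances):
    """Return the mean and population standard deviation of the distances."""
    return mean(distances), pstdev(distances)


if __name__ == "__main__":
    bits = 1024  # hypothetical signature width
    distances = pairwise_distances(random_signatures(200, bits))
    observed_mean, observed_sd = summarize(distances)
    print("observed mean:", observed_mean, "expected:", bits / 2)
    print("observed sd:", observed_sd, "expected:", (bits ** 0.5) / 2)
```

Distances computed from real document signatures can be passed to the same `summarize` function; any excess of pairs far below n/2 relative to this baseline would correspond to the local neighbourhoods the abstract describes.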
Abstract:
The proliferation of the web presents an unsolved problem of automatically analyzing billions of pages of natural language. We introduce a scalable algorithm that clusters hundreds of millions of web pages into hundreds of thousands of clusters. It does this on a single mid-range machine using efficient algorithms and compressed document representations. It is applied to two web-scale crawls covering tens of terabytes. ClueWeb09 and ClueWeb12 contain 500 and 733 million web pages and were clustered into 500,000 to 700,000 clusters. To the best of our knowledge, such fine grained clustering has not been previously demonstrated. Previous approaches clustered a sample that limits the maximum number of discoverable clusters. The proposed EM-tree algorithm uses the entire collection in clustering and produces several orders of magnitude more clusters than the existing algorithms. Fine grained clustering is necessary for meaningful clustering in massive collections where the number of distinct topics grows linearly with collection size. These fine-grained clusters show an improved cluster quality when assessed with two novel evaluations using ad hoc search relevance judgments and spam classifications for external validation. These evaluations solve the problem of assessing the quality of clusters where categorical labeling is unavailable and unfeasible.
Abstract:
This thesis studies document signatures, which are small representations of documents and other objects that can be stored compactly and compared for similarity. This research finds that document signatures can be effectively and efficiently used to both search and understand relationships between documents in large collections, scalable enough to search a billion documents in a fraction of a second. Deliverables arising from the research include an investigation of the representational capacity of document signatures, the publication of an open-source signature search platform and an approach for scaling signature retrieval to operate efficiently on collections containing hundreds of millions of documents.
Abstract:
Extracting text areas from document images with complex content and layout is a challenging task. A few texture-based techniques have already been proposed for extracting such text blocks. Most of these techniques are computationally expensive and hence far from realizable in real-time implementations. In this work, we propose a modification to two of the existing texture-based techniques to reduce the computation. This is accomplished with Harris corner detectors. The efficiency of these two texture-based algorithms, one based on Gabor filters and the other on log-polar wavelet signatures, is compared. A combination of Gabor-feature-based texture classification performed on a smaller set of Harris-corner-detected points is observed to deliver both accuracy and efficiency.
Abstract:
In many applications, when communicating with a host, we may or may not be concerned about the privacy of the data but are mainly concerned about the integrity of the data being transmitted. This paper presents a simple algorithm based on zero-knowledge proof by which the receiver can confirm the integrity of data without the sender having to send the digital signature of the message directly. Also, if the same document is sent by the same user multiple times, this scheme results in a different digital signature each time, thus making it a practical one-time signature scheme.
Abstract:
An abstract in English is also available. This document was submitted for the degree of Master of Laws (Maîtrise en droit). The thesis was accepted and ranked in the top 5% of the discipline.
Abstract:
Internal and external computer network attacks or security threats occur according to standards and follow a set of subsequent steps, allowing profiles or patterns to be established. This well-known behavior is the basis of signature analysis intrusion detection systems. This work presents a new attack signature model to be applied to the engines of network-based intrusion detection systems. The AISF (ACME! Intrusion Signature Format) model is built upon XML technology and covers the handling and analysis of intrusion signatures, from storage to manipulation. Using this new model, the process of storing and analyzing information about intrusion signatures for further use by an IDS becomes less difficult and more standardized.