590 results for Automatized Indexing
Abstract:
For the first time in human history, large volumes of spoken audio are being broadcast, made available on the internet, archived, and monitored for surveillance every day. New technologies are urgently required to unlock these vast and powerful stores of information. Spoken Term Detection (STD) systems provide access to speech collections by detecting individual occurrences of specified search terms. The aim of this work is to develop improved STD solutions based on phonetic indexing. In particular, this work aims to develop phonetic STD systems for applications that require open-vocabulary search, fast indexing and search speeds, and accurate term detection. Within this scope, novel contributions are made within two research themes, that is, accommodating phone recognition errors and, secondly, modelling uncertainty with probabilistic scores. A state-of-the-art Dynamic Match Lattice Spotting (DMLS) system is used to address the problem of accommodating phone recognition errors with approximate phone sequence matching. Extensive experimentation on the use of DMLS is carried out and a number of novel enhancements are developed that provide for faster indexing, faster search, and improved accuracy. Firstly, a novel comparison of methods for deriving a phone error cost model is presented to improve STD accuracy, resulting in up to a 33% improvement in the Figure of Merit. A method is also presented for drastically increasing the speed of DMLS search by at least an order of magnitude with no loss in search accuracy. An investigation is then presented of the effects of increasing indexing speed for DMLS, by using simpler modelling during phone decoding, with results highlighting the trade-off between indexing speed, search speed and search accuracy. The Figure of Merit is further improved by up to 25% using a novel proposal to utilise word-level language modelling during DMLS indexing. 
Analysis shows that this use of language modelling can, however, be unhelpful or even disadvantageous for terms with a very low language model probability. The DMLS approach to STD involves generating an index of phone sequences using phone recognition. An alternative approach to phonetic STD is also investigated that instead indexes probabilistic acoustic scores in the form of a posterior-feature matrix. A state-of-the-art system is described and its use for STD is explored through several experiments on spontaneous conversational telephone speech. A novel technique and framework is proposed for discriminatively training such a system to directly maximise the Figure of Merit. This results in a 13% improvement in the Figure of Merit on held-out data. The framework is also found to be particularly useful for index compression in conjunction with the proposed optimisation technique, providing for a substantial index compression factor in addition to an overall gain in the Figure of Merit. These contributions significantly advance the state-of-the-art in phonetic STD, by improving the utility of such systems in a wide range of applications.
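The approximate phone sequence matching described in this abstract can be illustrated with a weighted edit distance. This is a minimal sketch, not the DMLS implementation: the phone symbols, the substitution costs in `SUB_COST`, the flat `INS_DEL_COST`, and the `threshold` are all assumed values chosen for illustration, whereas the thesis derives its phone error cost model from data.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical substitution costs between confusable phones (assumed values).
SUB_COST: Dict[Tuple[str, str], float] = {
    ("p", "b"): 0.3,
    ("t", "d"): 0.3,
    ("s", "z"): 0.4,
}
INS_DEL_COST = 1.0  # assumed flat cost for an inserted or deleted phone


def sub_cost(a: str, b: str) -> float:
    """Cost of finding phone `b` in the index where the query expects `a`."""
    if a == b:
        return 0.0
    return SUB_COST.get((a, b), SUB_COST.get((b, a), 1.0))


def match_cost(query: Sequence[str], indexed: Sequence[str]) -> float:
    """Minimum edit distance between two phone sequences under the cost model."""
    prev = [j * INS_DEL_COST for j in range(len(indexed) + 1)]
    for i, q in enumerate(query, start=1):
        curr = [i * INS_DEL_COST]
        for j, p in enumerate(indexed, start=1):
            curr.append(min(
                prev[j] + INS_DEL_COST,        # query phone missing from the index
                curr[j - 1] + INS_DEL_COST,    # extra phone in the index
                prev[j - 1] + sub_cost(q, p),  # match or substitution
            ))
        prev = curr
    return prev[-1]


def is_hit(query: Sequence[str], indexed: Sequence[str], threshold: float = 0.5) -> bool:
    """Report a putative term occurrence when the match cost is low enough."""
    return match_cost(query, indexed) <= threshold
```

With these assumed costs, a query `["k", "ae", "t"]` still matches an indexed sequence `["k", "ae", "d"]`, because the single t/d substitution is cheap. Raising `threshold` trades more detections for more false alarms.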
Abstract:
As organizations reach higher levels of business process management maturity, they often find themselves maintaining repositories of hundreds or even thousands of process models, representing valuable knowledge about their operations. Over time, process model repositories tend to accumulate duplicate fragments (also called clones) as new process models are created or extended by copying and merging fragments from other models. This calls for methods to detect clones in process models, so that these clones can be refactored as separate subprocesses in order to improve maintainability. This paper presents an indexing structure to support the fast detection of clones in large process model repositories. The proposed index is based on a novel combination of a method for process model decomposition (specifically the Refined Process Structure Tree) with established graph canonization and string matching techniques. Experiments show that the algorithm scales to repositories with hundreds of models. The experimental results also show that a significant number of non-trivial clones can be found in process model repositories taken from industrial practice.
Abstract:
This work proposes to improve spoken term detection (STD) accuracy by optimising the Figure of Merit (FOM). In this article, the index takes the form of a phonetic posterior-feature matrix. Accuracy is improved by formulating STD as a discriminative training problem and directly optimising the FOM, through its use as an objective function to train a transformation of the index. The outcome of indexing is then a matrix of enhanced posterior-features that are directly tailored for the STD task. The technique is shown to improve the FOM by up to 13% on held-out data. Additional analysis explores the effect of the technique on phone recognition accuracy, examines the actual values of the learned transform, and demonstrates that using an extended training data set results in further improvement in the FOM.
Abstract:
Performance comparisons between File Signatures and Inverted Files for text retrieval have previously shown several significant shortcomings of file signatures relative to inverted files. The inverted file approach underpins most state-of-the-art search engine algorithms, such as Language and Probabilistic models. It has been widely accepted that traditional file signatures are inferior alternatives to inverted files. This paper describes TopSig, a new approach to the construction of file signatures. Many advances in semantic hashing and dimensionality reduction have been made in recent times, but these have not so far been linked to general-purpose, signature-file-based search engines. This paper introduces a different signature file approach that builds upon and extends these recent advances. We are able to demonstrate significant improvements in the performance of signature-file-based indexing and retrieval, performance that is comparable to that of state-of-the-art inverted-file-based systems, including Language models and BM25. These findings suggest that file signatures offer a viable alternative to inverted files in suitable settings and position the file signatures model in the class of Vector Space retrieval models.
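To make the idea of a dimensionality-reducing file signature concrete, the sketch below builds a fixed-width bit signature by summing pseudo-random +1/-1 term vectors and keeping only the sign of each component. This is a generic illustration under stated assumptions, not TopSig's published construction: the 64-bit width, the hash-seeded term vectors, and the absence of term weighting are all simplifications introduced here.

```python
import hashlib
import random
from typing import Iterable, List

SIG_BITS = 64  # assumed width for illustration; practical signatures are much wider


def term_vector(term: str, bits: int = SIG_BITS) -> List[int]:
    """Deterministic pseudo-random +1/-1 vector for a term, seeded by its hash."""
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [rng.choice((-1, 1)) for _ in range(bits)]


def signature(terms: Iterable[str], bits: int = SIG_BITS) -> int:
    """Sum the term vectors and keep the sign of each component as one bit."""
    totals = [0] * bits
    for term in terms:
        for i, value in enumerate(term_vector(term, bits)):
            totals[i] += value
    sig = 0
    for i, total in enumerate(totals):
        if total > 0:
            sig |= 1 << i
    return sig


def hamming(a: int, b: int) -> int:
    """Number of differing bits; smaller means more similar signatures."""
    return bin(a ^ b).count("1")
```

Retrieval in such a scheme would rank documents by the Hamming distance between a query signature and each document signature, which is why signatures of this kind behave like a quantised vector space model.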
Abstract:
With the growing number of XML documents on the Web it becomes essential to effectively organise these XML documents in order to retrieve useful information from them. A possible solution is to apply clustering on the XML documents to discover knowledge that promotes effective data management, information retrieval and query processing. However, many issues arise in discovering knowledge from these types of semi-structured documents due to their heterogeneity and structural irregularity. Most of the existing research on clustering techniques focuses only on one feature of the XML documents, this being either their structure or their content, due to scalability and complexity problems. The knowledge gained in the form of clusters based on the structure or the content is not suitable for real-life datasets. It therefore becomes essential to include both the structure and content of XML documents in order to improve the accuracy and meaning of the clustering solution. However, the inclusion of both these kinds of information in the clustering process results in a huge overhead for the underlying clustering algorithm because of the high dimensionality of the data. The overall objective of this thesis is to address these issues by: (1) proposing methods to utilise frequent pattern mining techniques to reduce the dimension; (2) developing models to effectively combine the structure and content of XML documents; and (3) utilising the proposed models in clustering. This research first determines the structural similarity in the form of frequent subtrees and then uses these frequent subtrees to represent the constrained content of the XML documents in order to determine the content similarity. A clustering framework with two types of models, implicit and explicit, is developed. The implicit model uses a Vector Space Model (VSM) to combine the structure and the content information.
The explicit model uses a higher order model, namely a 3-order Tensor Space Model (TSM), to explicitly combine the structure and the content information. This thesis also proposes a novel incremental technique to decompose large-sized tensor models to utilise the decomposed solution for clustering the XML documents. The proposed framework and its components were extensively evaluated on several real-life datasets exhibiting extreme characteristics to understand the usefulness of the proposed framework in real-life situations. Additionally, this research evaluates the outcome of the clustering process on the collection selection problem in the information retrieval on the Wikipedia dataset. The experimental results demonstrate that the proposed frequent pattern mining and clustering methods outperform the related state-of-the-art approaches. In particular, the proposed framework of utilising frequent structures for constraining the content shows an improvement in accuracy over content-only and structure-only clustering results. The scalability evaluation experiments conducted on large-scale datasets clearly show the strengths of the proposed methods over state-of-the-art methods. In particular, this thesis work contributes to effectively combining the structure and the content of XML documents for clustering, in order to improve the accuracy of the clustering solution. In addition, it also contributes by addressing the research gaps in frequent pattern mining to generate efficient and concise frequent subtrees with various node relationships that could be used in clustering.
Abstract:
As organizations reach higher levels of business process management maturity, they often find themselves maintaining very large process model repositories, representing valuable knowledge about their operations. A common practice within these repositories is to create new process models, or extend existing ones, by copying and merging fragments from other models. We contend that if these duplicate fragments, a.k.a. exact clones, can be identified and factored out as shared subprocesses, the repository’s maintainability can be greatly improved. With this purpose in mind, we propose an indexing structure to support fast detection of clones in process model repositories. Moreover, we show how this index can be used to efficiently query a process model repository for fragments. This index, called RPSDAG, is based on a novel combination of a method for process model decomposition (namely the Refined Process Structure Tree) with established graph canonization and string matching techniques. We evaluated the RPSDAG with large process model repositories from industrial practice. The experiments show that a significant number of non-trivial clones can be efficiently found in such repositories, and that fragment queries can be handled efficiently.
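The combination of decomposition, canonization and string matching described in this abstract can be sketched in miniature. The code below is not the RPSDAG: it assumes each process model has already been decomposed into a tree of fragments, represents a fragment as a hypothetical `(label, children)` tuple, and uses a sorted string code as a stand-in for graph canonization. The labels `seq` and `and` and the `UNORDERED` set are illustrative assumptions.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Tuple

# A fragment is (label, child fragments); a simplified stand-in for a node of
# the decomposition tree, assumed here for illustration only.
Fragment = Tuple[str, tuple]

UNORDERED = {"and", "xor"}  # assumed: branch order is irrelevant for these labels


def canonical_code(fragment: Fragment) -> str:
    """String code that is identical for structurally identical fragments."""
    label, children = fragment
    codes = [canonical_code(child) for child in children]
    if label in UNORDERED:
        codes.sort()
    return label + "(" + ",".join(codes) + ")"


def index_fragments(model_id: str, fragment: Fragment,
                    index: DefaultDict[str, List[str]]) -> None:
    """Record every non-trivial sub-fragment of a model under its code."""
    label, children = fragment
    if children:  # single-node fragments are trivial and are skipped
        index[canonical_code(fragment)].append(model_id)
    for child in children:
        index_fragments(model_id, child, index)


def find_clones(index: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Codes that occur more than once are candidate clones."""
    return {code: models for code, models in index.items() if len(models) > 1}


# Usage: two models that share one parallel block with its branches reordered.
m1: Fragment = ("seq", (("and", (("A", ()), ("B", ()))), ("C", ())))
m2: Fragment = ("seq", (("D", ()), ("and", (("B", ()), ("A", ())))))
index: DefaultDict[str, List[str]] = defaultdict(list)
index_fragments("m1", m1, index)
index_fragments("m2", m2, index)
clones = find_clones(index)
```

Here `clones` maps the code `and(A(),B())` to both models. Because lookups are by exact code, the same index can answer a fragment query by computing the query's code and reading its entry.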
Abstract:
Background: When observers are asked to identify two targets in rapid sequence, they often suffer profound performance deficits for the second target, even when the spatial location of the targets is known. This attentional blink (AB) is usually attributed to the time required to process a previous target, implying that a link should exist between individual differences in information processing speed and the AB.
Methodology/Principal Findings: The present work investigated this question by examining the relationship between a rapid automatized naming task typically used to assess information-processing speed and the magnitude of the AB. The results indicated that faster processing actually resulted in a greater AB, but only when targets were presented amongst high similarity distractors. When target-distractor similarity was minimal, processing speed was unrelated to the AB.
Conclusions/Significance: Our findings indicate that information-processing speed is unrelated to target processing efficiency per se, but rather to individual differences in observers' ability to suppress distractors. This is consistent with evidence that individuals who are able to avoid distraction are more efficient at deploying temporal attention, but argues against a direct link between general processing speed and efficient information selection.
Abstract:
Electronic Health Record (EHR) retrieval processes are complex and demand exponentially growing Information Technology (IT) resources, in particular memory. A Database-as-a-Service (DAS) model is proposed to meet the scalability requirements of EHR retrieval processes. A simulation study using a range of EHR records with the DAS model is presented. A bucket-indexing model incorporating partitioning fields and Bloom filters, implemented with a Singleton design pattern, was used to build a custom database encryption system. Across the DAS, built-in encryption and plain-text DBMS configurations, it provided faster responses for range queries than for the other query types tested, such as aggregation queries. The study also presents constraints on the approach that should be considered for other practical applications.
Abstract:
In the medical and healthcare arena, patients' data is not just their own personal history but also a valuable large dataset for finding solutions for diseases. While electronic medical records are becoming popular and are used in healthcare work places like hospitals, as well as insurance companies, and by major stakeholders such as physicians and their patients, the accessibility of such information should be dealt with in a way that preserves privacy and security. Thus, finding the best way to keep the data secure has become an important issue in the area of database security. Sensitive medical data should be encrypted in databases. There are many encryption/decryption techniques and algorithms with regard to preserving privacy and security. Currently, their performance is an important factor while the medical data is being managed in databases. Another important factor is that the stakeholders should decide on more cost-effective ways to reduce the total cost of ownership. As an alternative, DAS (Data as Service) is a popular outsourcing model that satisfies the cost-effectiveness requirement, but it requires that the encryption/decryption modules be handled by trustworthy stakeholders. This research project focuses on the query response times in a DAS model (AES-DAS) and compares the outsourcing model with the in-house model, which incorporates the Microsoft built-in encryption scheme in SQL Server. This research project includes building a prototype of medical database schemas. The project is carried out in two stages of simulations. The first stage includes 6 databases in order to carry out simulations to measure the performance between plain-text, Microsoft built-in encryption and AES-DAS (Data as Service). In particular, the AES-DAS incorporates implementations of symmetric key encryption such as AES (Advanced Encryption Standard) and a Bucket indexing processor using a Bloom filter.
The results are categorised by character type, numeric type, range queries, range queries using the Bucket Index and aggregate queries. The second stage is a scalability test from 5K to 2560K records. The main result of these simulations is that, as an outsourcing model, AES-DAS using the Bucket index is around 3.32 times faster than a normal AES-DAS with 70 partitions and 10K-record databases. Retrieving Numeric typed data takes less time than Character typed data in AES-DAS. The aggregation query response time in AES-DAS is not as consistent as that in the MS built-in encryption scheme. The scalability test shows that once the DBMS reaches a certain threshold, the query response time rapidly becomes slower. However, there is more to investigate in order to bring about other outcomes and to construct a secured EMR (Electronic Medical Record) more efficiently from these simulations.
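The bucket index with a Bloom filter mentioned in this abstract can be illustrated as follows. This is a sketch of one plausible arrangement, not the thesis's AES-DAS design: the abstract does not say how the Bloom filter is attached to the buckets, so the per-bucket filter is an assumption, encryption is omitted entirely (values are stored in the clear here), and the class and method names are hypothetical.

```python
import hashlib
from typing import Dict, Iterator, List


class BloomFilter:
    """Compact set membership test with no false negatives."""

    def __init__(self, size_bits: int = 256, num_hashes: int = 3) -> None:
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item: str) -> Iterator[int]:
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


class BucketIndex:
    """Partition a numeric field into equal-width buckets (assumed layout)."""

    def __init__(self, low: float, high: float, partitions: int) -> None:
        self.low = low
        self.width = (high - low) / partitions
        self.partitions = partitions
        self.buckets: Dict[int, List[float]] = {}
        self.filters: Dict[int, BloomFilter] = {}

    def bucket_of(self, value: float) -> int:
        bucket = int((value - self.low) / self.width)
        return min(max(bucket, 0), self.partitions - 1)

    def insert(self, value: float) -> None:
        bucket = self.bucket_of(value)
        self.buckets.setdefault(bucket, []).append(value)
        self.filters.setdefault(bucket, BloomFilter()).add(str(value))

    def range_query(self, lo: float, hi: float) -> List[float]:
        """Fetch only the buckets overlapping [lo, hi], then filter exactly."""
        hits: List[float] = []
        for bucket in range(self.bucket_of(lo), self.bucket_of(hi) + 1):
            hits.extend(v for v in self.buckets.get(bucket, []) if lo <= v <= hi)
        return sorted(hits)

    def might_contain(self, value: float) -> bool:
        """Cheap pre-check before fetching a bucket for an exact-match lookup."""
        bloom = self.filters.get(self.bucket_of(value))
        return bloom is not None and bloom.might_contain(str(value))
```

In an outsourced setting the server would see only bucket identifiers, and the exact filtering step would run on the trusted side after decryption. The benefit for range queries comes from touching only the overlapping buckets instead of decrypting every row.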
Abstract:
Bananas are one of the world's most important food crops, providing sustenance and income for millions of people in developing countries and supporting large export industries. Viruses are considered major constraints to banana production, germplasm multiplication and exchange, and to genetic improvement of banana through traditional breeding. In Africa, the two most important virus diseases are bunchy top, caused by Banana bunchy top virus (BBTV), and banana streak disease, caused by Banana streak virus (BSV). BBTV is a serious production constraint in a number of countries within/bordering East Africa, such as Burundi, Democratic Republic of Congo, Malawi, Mozambique, Rwanda and Zambia, but is not present in Kenya, Tanzania and Uganda. Additionally, epidemics of banana streak disease are occurring in Kenya and Uganda. The rapidly growing tissue culture (TC) industry within East Africa, aiming to provide planting material to banana farmers, has stimulated discussion about the need for virus indexing to certify planting material as virus-free. Diagnostic methods for BBTV and BSV have been reported and, for BBTV, PCR-based assays are reliable and relatively straightforward. However for BSV, high levels of serological and genetic variability and the presence of endogenous virus sequences within the banana genome complicate diagnosis. Uganda has been shown to contain the greatest diversity in BSV isolates found anywhere in the world. A broad-spectrum diagnostic test for BSV detection, which can discriminate between endogenous and episomal BSV sequences, is a priority. This PhD project aimed to establish diagnostic methods for banana viruses, with a particular focus on the development of novel methods for BSV detection, and to use these diagnostic methods for the detection and characterisation of banana viruses in East Africa. A novel rolling-circle amplification (RCA) method was developed for the detection of BSV. 
Using samples of Banana streak MY virus (BSMYV) and Banana streak OL virus (BSOLV) from Australia, this method was shown to distinguish between endogenous and episomal BSV sequences in banana plants. The RCA assay was used to screen a collection of 56 banana samples from south-west Uganda for BSV. RCA detected at least five distinct BSV isolates in these samples, including BSOLV and Banana streak GF virus (BSGFV) as well as three BSV isolates (Banana streak Uganda-I, -L and -M virus) for which only partial sequences had been previously reported. These latter three BSV had only been detected using immuno-capture (IC)-PCR and thus were possible endogenous sequences. In addition to its ability to detect BSV, the RCA protocol was also demonstrated to detect other viruses within the family Caulimoviridae, including Sugar cane bacilliform virus, and Cauliflower mosaic virus. Using the novel RCA method, three distinct BSV isolates from both Kenya and Uganda were identified and characterised. The complete genome of these isolates was sequenced and annotated. All six isolates were shown to have a characteristic badnavirus genome organisation with three open reading frames (ORFs) and the large polyprotein encoded by ORF 3 was shown to contain conserved amino acid motifs for movement, aspartic protease, reverse transcriptase and ribonuclease H activities. As well, several sequences important for expression and replication of the virus genome were identified including the conserved tRNAmet primer binding site present in the intergenic region of all badnaviruses. Based on the International Committee on Taxonomy of Viruses (ICTV) guidelines for species demarcation in the genus Badnavirus, these six isolates were proposed as distinct species, and named Banana streak UA virus (BSUAV), Banana streak UI virus (BSUIV), Banana streak UL virus (BSULV), Banana streak UM virus (BSUMV), Banana streak CA virus (BSCAV) and Banana streak IM virus (BSIMV). 
Using PCR with species-specific primers designed to each isolate, a genotypically diverse collection of 12 virus-free banana cultivars were tested for the presence of endogenous sequences. For five of the BSV no amplification was observed in any cultivar tested, while for BSIMV, four positive samples were identified in cultivars with a B-genome component. During field visits to Kenya, Tanzania and Uganda, 143 samples were collected and assayed for BSV. PCR using nine sets of species-specific primers, and RCA, were compared for BSV detection. For five BSV species with no known endogenous counterpart (namely BSCAV, BSUAV, BSUIV, BSULV and BSUMV), PCR was used to detect 30 infections from the 143 samples. Using RCA, 96.4% of these samples were considered positive, with one additional sample detected using RCA which was not positive using PCR. For these five BSV, PCR and RCA were both useful for identifying infected samples, irrespective of the host cultivar genotype (Musa A- or B-genome components). For four additional BSV with known endogenous counterparts in the M. balbisiana genome (BSOLV, BSGFV, BSMYV and BSIMV), PCR was shown to detect 75 infections from the 143 samples. In 30 samples from cultivars with an A-only genome component there was 96.3% agreement between PCR positive samples and detection using RCA, again demonstrating either PCR or RCA are suitable methods for detection. However, in 45 samples from cultivars with some B-genome component, the level of agreement between PCR positive samples and RCA positive samples was 70.5%. This suggests that, in cultivars with some B-genome component, many infections were detected using PCR which were the result of amplification of endogenous sequences. In these latter cases, RCA or another method which discriminates between endogenous and episomal sequences, such as immuno-capture PCR, is needed to diagnose episomal BSV infection. 
Field visits were made to Malawi and Rwanda to collect local isolates of BBTV for validation of a PCR-based diagnostic assay. The presence of BBTV in samples of bananas with bunchy top disease was confirmed in 28 out of 39 samples from Malawi and all nine samples collected in Rwanda, using PCR and RCA. For three isolates, one from Malawi and two from Rwanda, the complete nucleotide sequences were determined and shown to have a similar genome organisation to previously published BBTV isolates. The two isolates from Rwanda had at least 98.1% nucleotide sequence identity between each of the six DNA components, while the similarity between isolates from Rwanda and Malawi was between 96.2% and 99.4% depending on the DNA component. At the amino acid level, similarities in the putative proteins encoded by DNA-R, -S, -M, -C and -N were found to range between 98.8% and 100%. In a phylogenetic analysis, the three East African isolates clustered together within the South Pacific subgroup of BBTV isolates. Nucleotide sequence comparison to isolates of BBTV from outside Africa identified India as the possible origin of East African isolates of BBTV.
Abstract:
There is increased interest in Uninhabited Aerial Vehicle (UAV) operations and in research into advanced methods for commanding and controlling multiple heterogeneous UAVs. Research into areas of supervisory control has rapidly increased. Past research has investigated various approaches of autonomous control and operator limitation to improve mission commanders' Situation Awareness (SA) and cognitive workload. The aim of this paper is to address this challenge through a visualisation framework of UAV information constructed from Information Abstraction (IA). This paper presents the concept and process of IA, the visualisation framework (constructed using IA), the concept associated with the Level Of Detail (LOD) indexing method, and the visualisation of an example of the framework. Experiments will test the hypothesis that the operator will be able to achieve increased SA and reduced cognitive load with the proposed framework.
Abstract:
This paper addresses the issue of analogical inference, and its potential role as the mediator of new therapeutic discoveries, by using disjunction operators based on quantum connectives to combine many potential reasoning pathways into a single search expression. In it, we extend our previous work in which we developed an approach to analogical retrieval using the Predication-based Semantic Indexing (PSI) model, which encodes both concepts and the relationships between them in high-dimensional vector space. As in our previous work, we leverage the ability of PSI to infer predicate pathways connecting two example concepts, in this case comprising known therapeutic relationships. For example, given that drug x TREATS disease z, we might infer the predicate pathway drug x INTERACTS WITH gene y ASSOCIATED WITH disease z, and use this pathway to search for drugs related to another disease in similar ways. As biological systems tend to be characterized by networks of relationships, we evaluate the ability of quantum-inspired operators to mediate inference and retrieval across multiple relations, by testing the ability of different approaches to recover known therapeutic relationships. In addition, we introduce a novel complex-vector-based implementation of PSI, based on Plate’s Circular Holographic Reduced Representations, which we utilize for all experiments in addition to the binary-vector-based approach we have applied in our previous research.
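The binding operations behind circular holographic reduced representations can be sketched briefly. The code below shows generic operations on vectors of unit-magnitude complex numbers (binding as element-wise multiplication, release as multiplication by the conjugate). It is not the paper's PSI implementation: the 512-dimensional size, the function names, and the way a drug vector is assembled from two predications are assumptions made for illustration.

```python
import cmath
import math
import random
from typing import List

Vector = List[complex]


def random_vector(dim: int, rng: random.Random) -> Vector:
    """Elemental vector: each component is a random phase on the unit circle."""
    return [cmath.exp(1j * rng.uniform(-math.pi, math.pi)) for _ in range(dim)]


def bind(a: Vector, b: Vector) -> Vector:
    """Bind two vectors by element-wise multiplication (phase addition)."""
    return [x * y for x, y in zip(a, b)]


def release(bound: Vector, b: Vector) -> Vector:
    """Undo a binding with `b` by multiplying with its complex conjugate."""
    return [x * y.conjugate() for x, y in zip(bound, b)]


def superpose(*vectors: Vector) -> Vector:
    """Combine several bound pairs into one vector by component-wise addition."""
    return [sum(components) for components in zip(*vectors)]


def similarity(a: Vector, b: Vector) -> float:
    """Mean cosine of the phase difference between corresponding components."""
    total = 0.0
    for x, y in zip(a, b):
        if abs(x) > 0 and abs(y) > 0:
            total += ((x / abs(x)) * (y / abs(y)).conjugate()).real
    return total / len(a)


# Usage: encode "drug TREATS disease" and "drug INTERACTS_WITH gene" in one
# vector, then ask what the drug TREATS.
rng = random.Random(7)
TREATS, INTERACTS_WITH, disease, gene, unrelated = (
    random_vector(512, rng) for _ in range(5)
)
drug = superpose(bind(TREATS, disease), bind(INTERACTS_WITH, gene))
answer = release(drug, TREATS)
```

The released `answer` is a noisy copy of `disease`: its similarity to `disease` is clearly higher than its similarity to `unrelated`, which is what allows a predicate pathway to be followed by nearest-neighbour search.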
Abstract:
Bananas are one of the world's most important crops, serving as a staple food and an important source of income for millions of people in the subtropics. Pests and diseases are a major constraint to banana production. To prevent the spread of pests and disease, farmers are encouraged to use disease- and insect-free planting material obtained by micropropagation. This option, however, does not always exclude viruses and concern remains on the quality of planting material. Therefore, there is a demand for effective and reliable virus indexing procedures for tissue culture (TC) material. Reliable diagnostic tests are currently available for all of the economically important viruses of bananas with the exception of Banana streak viruses (BSV, Caulimoviridae, Badnavirus). Development of a reliable diagnostic test for BSV is complicated by the significant serological and genetic variation reported for BSV isolates, and the presence of endogenous BSV (eBSV). Current PCR- and serological-based diagnostic methods for BSV may not detect all species of BSV, and PCR-based methods may give false positives because of the presence of eBSV. Rolling circle amplification (RCA) has been reported as a technique to detect BSV which can also discriminate between episomal and endogenous BSV sequences. However, the method is too expensive for large scale screening of samples in developing countries, and little information is available regarding its sensitivity. Therefore the development of reliable PCR-based assays is still considered the most appropriate option for large scale screening of banana plants for BSV.
This MSc project aimed to refine and optimise the protocols for BSV detection, with a particular focus on developing reliable PCR-based diagnostics. Initially, the appropriateness and reliability of PCR and RCA as diagnostic tests for BSV detection were assessed by testing 45 field samples of banana collected from nine districts in the Eastern region of Uganda in February 2010. This research was also aimed at investigating the diversity of BSV in eastern Uganda, identifying the BSV species present and characterising any new BSV species. Out of the 45 samples tested, 38 and 40 samples were considered positive by PCR and RCA, respectively. Six different species of BSV, namely Banana streak IM virus (BSIMV), Banana streak MY virus (BSMYV), Banana streak OL virus (BSOLV), Banana streak UA virus (BSUAV), Banana streak UL virus (BSULV) and Banana streak UM virus (BSUMV), were detected by PCR and confirmed by RCA and sequencing. No new species were detected, but this was the first report of BSMYV in Uganda. Although RCA was demonstrated to be suitable for broad-range detection of BSV, it proved time-consuming and laborious for identification in field samples. Due to the disadvantages associated with RCA, attempts were made to develop a reliable PCR-based assay for the specific detection of episomal BSOLV, Banana streak GF virus (BSGFV), BSMYV and BSIMV. For BSOLV and BSGFV, the integrated sequences exist in rearranged, repeated and partially inverted portions at their site of integration. Therefore, for these two viruses, primer sets were designed by mapping previously published sequences of their endogenous counterparts onto published sequences of the episomal genomes. For BSOLV, two primer sets were designed while, for BSGFV, a single primer set was designed.
The episomal specificity of these primer sets was assessed by testing 106 plant samples collected during surveys in Kenya and Uganda, and 33 leaf samples from a wide range of banana cultivars maintained in TC at the Maroochy Research Station of the Department of Employment, Economic Development and Innovation (DEEDI), Queensland. All of these samples had previously been tested for episomal BSV by RCA and for both BSOLV and BSGFV by PCR using published primer sets. The outcome from these analyses was that the newly designed primer sets for BSOLV and BSGFV were able to distinguish between episomal BSV and eBSV in most cultivars with some B-genome component. In some samples, however, amplification was observed using the putative episomal-specific primer sets where episomal BSV was not identified using RCA. This may reflect a difference in the sensitivity of PCR compared to RCA, or possibly the presence of an eBSV sequence of different conformation. Since the sequences of the respective eBSV for BSMYV and BSIMV in the M. balbisiana genome are not available, a series of random primer combinations were tested in an attempt to find potential episomal-specific primer sets for BSMYV and BSIMV. Of an initial 20 primer combinations screened for BSMYV detection on a small number of control samples, 11 primer sets appeared to be episomal-specific. However, subsequent testing of two of these primer combinations on a larger number of control samples gave some inconsistent results which will require further investigation. Testing of the 25 primer combinations for episomal-specific detection of BSIMV on a number of control samples showed that none were able to discriminate between episomal and endogenous BSIMV. The final component of this research project was the development of an infectious clone of a BSV endemic in Australia, namely BSMYV. This was considered important to enable the generation of large amounts of diseased plant material needed for further research.
A terminally redundant fragment (1.3 × BSMYV genome) was cloned and transformed into Agrobacterium tumefaciens strain AGL1, and used to inoculate 12 healthy banana plants of the cultivar Cavendish (Williams) by three different methods. At 12 weeks post-inoculation, (i) four of the five banana plants inoculated by corm injection showed characteristic BSV symptoms while the remaining plant was wilting/dying, (ii) three of the five banana plants inoculated by needle-pricking of the stem showed BSV symptoms, one plant was symptomless while the remaining plant had died and (iii) both banana plants inoculated by leaf infiltration were symptomless. When banana leaf samples were tested for BSMYV by PCR and RCA, BSMYV was confirmed in all banana plants showing symptoms, including those that were wilting and/or dying. The results from this research have provided several avenues for further research. By completely sequencing all variants of eBSOLV and eBSGFV and fully sequencing the eBSIMV and eBSMYV regions, episomal BSV-specific primer sets for all eBSVs could potentially be designed that could avoid all integrants of that particular BSV species. Furthermore, the development of an infectious BSV clone will enable large numbers of BSV-infected plants to be generated for the further testing of the sensitivity of RCA compared to other more established assays such as PCR. The development of infectious clones also opens the possibility for virus induced gene silencing studies in banana.
Abstract:
This paper analyses the pairwise distances of signatures produced by the TopSig retrieval model on two document collections. The distribution of the distances is compared to that of purely random signatures. The analysis explains why TopSig is only competitive with state-of-the-art retrieval models at early precision: only the local neighbourhood of the signatures is interpretable. We suggest this is a common property of vector space models.
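The random baseline used in such a comparison is easy to reproduce in outline. The sketch below generates purely random bit signatures and collects their pairwise Hamming distances, which concentrate around half the signature width. The signature count, the width, and the function names are assumptions for illustration, and no TopSig signatures are involved.

```python
import random
from itertools import combinations
from statistics import mean
from typing import Dict, List


def random_signatures(count: int, bits: int, seed: int = 0) -> List[int]:
    """Purely random signatures: every bit is an independent fair coin flip."""
    rng = random.Random(seed)
    return [rng.getrandbits(bits) for _ in range(count)]


def pairwise_hamming(signatures: List[int]) -> List[int]:
    """Hamming distance for every unordered pair of signatures."""
    return [bin(a ^ b).count("1") for a, b in combinations(signatures, 2)]


def histogram(distances: List[int], bin_width: int = 8) -> Dict[int, int]:
    """Count distances per bin, keyed by the lower edge of each bin."""
    counts: Dict[int, int] = {}
    for d in distances:
        edge = (d // bin_width) * bin_width
        counts[edge] = counts.get(edge, 0) + 1
    return dict(sorted(counts.items()))


baseline = pairwise_hamming(random_signatures(count=50, bits=256, seed=1))
baseline_mean = mean(baseline)  # close to 256 / 2 for random signatures
```

Signatures of real documents would be passed through the same `pairwise_hamming` and `histogram` functions. Pairs that sit well below the random mean are the close neighbours that carry retrieval signal, while distances near the random mean are uninformative.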
Abstract:
The aim of this paper is to provide a comparison of various algorithms and parameters to build reduced semantic spaces. The effect of dimension reduction, the stability of the representation and the effect of word order are examined in the context of five algorithms bearing on semantic vectors: random projection (RP), singular value decomposition (SVD), non-negative matrix factorization (NMF), permutations and holographic reduced representations (HRR). The quality of semantic representation was tested by means of a synonym-finding task using the TOEFL test on the TASA corpus. Dimension reduction was found to improve the quality of semantic representation, but it is hard to find the optimal parameter settings. Even though dimension reduction by RP was found to be more generally applicable than SVD, the semantic vectors produced by RP are somewhat unstable. Encoding word order into the semantic vector representation via HRR did not lead to any increase in scores over vectors constructed from word co-occurrence in context information. In this regard, very small context windows resulted in better semantic vectors for the TOEFL test.
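Of the algorithms compared, random projection is the simplest to show in code. The sketch below reduces a co-occurrence vector by multiplying it with a random Gaussian matrix and compares words by cosine similarity, as a synonym test would. It is a generic illustration under stated assumptions: the Gaussian entries, the dimensions, and the function names are choices made here, not the paper's settings.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def projection_matrix(in_dim: int, out_dim: int, seed: int = 0) -> Matrix:
    """Random Gaussian matrix scaled so that vector lengths are roughly kept."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
            for _ in range(out_dim)]


def project(vector: List[float], matrix: Matrix) -> List[float]:
    """Map a high-dimensional co-occurrence vector to the reduced space."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_synonym(target: List[float], candidates: List[List[float]]) -> int:
    """Index of the candidate closest to the target, as in a multiple-choice test."""
    scores = [cosine(target, c) for c in candidates]
    return scores.index(max(scores))
```

Because the matrix depends on the seed, two runs with different seeds give different reduced vectors for the same word. That seed dependence is one way to see why representations built by random projection can be less stable than those from a deterministic factorisation such as SVD.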