842 resultados para Graph DBMS, BenchMarking, OLAP, NoSQL
Resumo:
Bioinformatics involves analyses of biological data such as DNA sequences, microarrays and protein-protein interaction (PPI) networks. Its two main objectives are the identification of genes or proteins and the prediction of their functions. Biological data often contain uncertain and imprecise information. Fuzzy theory provides useful tools to deal with this type of information, hence has played an important role in analyses of biological data. In this thesis, we aim to develop some new fuzzy techniques and apply them on DNA microarrays and PPI networks. We will focus on three problems: (1) clustering of microarrays; (2) identification of disease-associated genes in microarrays; and (3) identification of protein complexes in PPI networks. The first part of the thesis aims to detect, by the fuzzy C-means (FCM) method, clustering structures in DNA microarrays corrupted by noise. Because of the presence of noise, some clustering structures found in random data may not have any biological significance. In this part, we propose to combine the FCM with the empirical mode decomposition (EMD) for clustering microarray data. The purpose of EMD is to reduce, preferably to remove, the effect of noise, resulting in what is known as denoised data. We call this method the fuzzy C-means method with empirical mode decomposition (FCM-EMD). We applied this method on yeast and serum microarrays, and the silhouette values are used for assessment of the quality of clustering. The results indicate that the clustering structures of denoised data are more reasonable, implying that genes have tighter association with their clusters. Furthermore we found that the estimation of the fuzzy parameter m, which is a difficult step, can be avoided to some extent by analysing denoised microarray data. The second part aims to identify disease-associated genes from DNA microarray data which are generated under different conditions, e.g., patients and normal people. We developed a type-2 fuzzy membership (FM) function for identification of diseaseassociated genes. This approach is applied to diabetes and lung cancer data, and a comparison with the original FM test was carried out. Among the ten best-ranked genes of diabetes identified by the type-2 FM test, seven genes have been confirmed as diabetes-associated genes according to gene description information in Gene Bank and the published literature. An additional gene is further identified. Among the ten best-ranked genes identified in lung cancer data, seven are confirmed that they are associated with lung cancer or its treatment. The type-2 FM-d values are significantly different, which makes the identifications more convincing than the original FM test. The third part of the thesis aims to identify protein complexes in large interaction networks. Identification of protein complexes is crucial to understand the principles of cellular organisation and to predict protein functions. In this part, we proposed a novel method which combines the fuzzy clustering method and interaction probability to identify the overlapping and non-overlapping community structures in PPI networks, then to detect protein complexes in these sub-networks. Our method is based on both the fuzzy relation model and the graph model. We applied the method on several PPI networks and compared with a popular protein complex identification method, the clique percolation method. For the same data, we detected more protein complexes. We also applied our method on two social networks. The results showed our method works well for detecting sub-networks and give a reasonable understanding of these communities.
Resumo:
Complex networks have been studied extensively due to their relevance to many real-world systems such as the world-wide web, the internet, biological and social systems. During the past two decades, studies of such networks in different fields have produced many significant results concerning their structures, topological properties, and dynamics. Three well-known properties of complex networks are scale-free degree distribution, small-world effect and self-similarity. The search for additional meaningful properties and the relationships among these properties is an active area of current research. This thesis investigates a newer aspect of complex networks, namely their multifractality, which is an extension of the concept of selfsimilarity. The first part of the thesis aims to confirm that the study of properties of complex networks can be expanded to a wider field including more complex weighted networks. Those real networks that have been shown to possess the self-similarity property in the existing literature are all unweighted networks. We use the proteinprotein interaction (PPI) networks as a key example to show that their weighted networks inherit the self-similarity from the original unweighted networks. Firstly, we confirm that the random sequential box-covering algorithm is an effective tool to compute the fractal dimension of complex networks. This is demonstrated on the Homo sapiens and E. coli PPI networks as well as their skeletons. Our results verify that the fractal dimension of the skeleton is smaller than that of the original network due to the shortest distance between nodes is larger in the skeleton, hence for a fixed box-size more boxes will be needed to cover the skeleton. Then we adopt the iterative scoring method to generate weighted PPI networks of five species, namely Homo sapiens, E. coli, yeast, C. elegans and Arabidopsis Thaliana. By using the random sequential box-covering algorithm, we calculate the fractal dimensions for both the original unweighted PPI networks and the generated weighted networks. The results show that self-similarity is still present in generated weighted PPI networks. This implication will be useful for our treatment of the networks in the third part of the thesis. The second part of the thesis aims to explore the multifractal behavior of different complex networks. Fractals such as the Cantor set, the Koch curve and the Sierspinski gasket are homogeneous since these fractals consist of a geometrical figure which repeats on an ever-reduced scale. Fractal analysis is a useful method for their study. However, real-world fractals are not homogeneous; there is rarely an identical motif repeated on all scales. Their singularity may vary on different subsets; implying that these objects are multifractal. Multifractal analysis is a useful way to systematically characterize the spatial heterogeneity of both theoretical and experimental fractal patterns. However, the tools for multifractal analysis of objects in Euclidean space are not suitable for complex networks. In this thesis, we propose a new box covering algorithm for multifractal analysis of complex networks. This algorithm is demonstrated in the computation of the generalized fractal dimensions of some theoretical networks, namely scale-free networks, small-world networks, random networks, and a kind of real networks, namely PPI networks of different species. Our main finding is the existence of multifractality in scale-free networks and PPI networks, while the multifractal behaviour is not confirmed for small-world networks and random networks. As another application, we generate gene interactions networks for patients and healthy people using the correlation coefficients between microarrays of different genes. Our results confirm the existence of multifractality in gene interactions networks. This multifractal analysis then provides a potentially useful tool for gene clustering and identification. The third part of the thesis aims to investigate the topological properties of networks constructed from time series. Characterizing complicated dynamics from time series is a fundamental problem of continuing interest in a wide variety of fields. Recent works indicate that complex network theory can be a powerful tool to analyse time series. Many existing methods for transforming time series into complex networks share a common feature: they define the connectivity of a complex network by the mutual proximity of different parts (e.g., individual states, state vectors, or cycles) of a single trajectory. In this thesis, we propose a new method to construct networks of time series: we define nodes by vectors of a certain length in the time series, and weight of edges between any two nodes by the Euclidean distance between the corresponding two vectors. We apply this method to build networks for fractional Brownian motions, whose long-range dependence is characterised by their Hurst exponent. We verify the validity of this method by showing that time series with stronger correlation, hence larger Hurst exponent, tend to have smaller fractal dimension, hence smoother sample paths. We then construct networks via the technique of horizontal visibility graph (HVG), which has been widely used recently. We confirm a known linear relationship between the Hurst exponent of fractional Brownian motion and the fractal dimension of the corresponding HVG network. In the first application, we apply our newly developed box-covering algorithm to calculate the generalized fractal dimensions of the HVG networks of fractional Brownian motions as well as those for binomial cascades and five bacterial genomes. The results confirm the monoscaling of fractional Brownian motion and the multifractality of the rest. As an additional application, we discuss the resilience of networks constructed from time series via two different approaches: visibility graph and horizontal visibility graph. Our finding is that the degree distribution of VG networks of fractional Brownian motions is scale-free (i.e., having a power law) meaning that one needs to destroy a large percentage of nodes before the network collapses into isolated parts; while for HVG networks of fractional Brownian motions, the degree distribution has exponential tails, implying that HVG networks would not survive the same kind of attack.
Resumo:
While the phrase “six degrees of separation” is widely used to characterize a variety of humanderived networks, in this study we show that in patent citation network, related patents are connected with an average distance of 6, whereas an average distance for a random pair of nodes in the graph is approximately 15. We use this information to improve the recall level in prior-art retrieval in the setting of blind relevance feedback without any textual knowledge.
Resumo:
In practice, parallel-machine job-shop scheduling (PMJSS) is very useful in the development of standard modelling approaches and generic solution techniques for many real-world scheduling problems. In this paper, based on the analysis of structural properties in an extended disjunctive graph model, a hybrid shifting bottleneck procedure (HSBP) algorithm combined with Tabu Search metaheuristic algorithm is developed to deal with the PMJSS problem. The original-version SBP algorithm for the job-shop scheduling (JSS) has been significantly improved to solve the PMJSS problem with four novelties: i) a topological-sequence algorithm is proposed to decompose the PMJSS problem into a set of single-machine scheduling (SMS) and/or parallel-machine scheduling (PMS) subproblems; ii) a modified Carlier algorithm based on the proposed lemmas and the proofs is developed to solve the SMS subproblem; iii) the Jackson rule is extended to solve the PMS subproblem; iv) a Tabu Search metaheuristic algorithm is embedded under the framework of SBP to optimise the JSS and PMJSS cases. The computational experiments show that the proposed HSBP is very efficient in solving the JSS and PMJSS problems.
Resumo:
In the structure of the title compound C16H26N+ Cl-, the salt of a precursor in the synthesis of an isoindolin-2-yloxyl free-radical trapping agent, the cations and anions form discrete centrosymetric cyclic dimers through N---H...Cl hydrogen-bonding associations [graph set R2/4(8)].
Resumo:
In the title salt, racemic C6H12N2O+ C8H11O4- from the reaction of cis-cyclohexane-1,2-dicarboxylic anhydride with isonipecotamide, the cations are linked into duplex chain substructures through both centrosymmetric cyclic head-to-head 'amide motif' hydrogen-bonding associations [graph set R2/2(8)] and 'side-by-side' R2/2(14) associations. The anions are incorporated into the chains through cyclic R3/4(10) interactions involving amide and piperidinium N-H...O(carboxyl) hydrogen bonds which, together with inter-anion carboxylic acid O-H...O(carboxyl) hydrogen bonds, give a two-dimensional layered structure extending along (011).
Resumo:
Nowadays, Opinion Mining is getting more important than before especially in doing analysis and forecasting about customers’ behavior for businesses purpose. The right decision in producing new products or services based on data about customers’ characteristics means profit for organization/company. This paper proposes a new architecture for Opinion Mining, which uses a multidimensional model to integrate customers’ characteristics and their comments about products (or services). The key step to achieve this objective is to transfer comments (opinions) to a fact table that includes several dimensions, such as, customers, products, time and locations. This research presents a comprehensive way to calculate customers’ orientation for all possible products’ attributes. A use case study is also presented in this paper to show the advantages of using OLAP and data cubes to analyze costumers’ opinions.
Resumo:
This paper describes the evaluation in benchmarking the effectiveness of cross-lingual link discovery (CLLD). Cross lingual link discovery is a way of automatically finding prospective links between documents in different languages, which is particularly helpful for knowledge discovery of different language domains. A CLLD evaluation framework is proposed for system performance benchmarking. The framework includes standard document collections, evaluation metrics, and link assessment and evaluation tools. The evaluation methods described in this paper have been utilised to quantify the system performance at NTCIR-9 Crosslink task. It is shown that using the manual assessment for generating gold standard can deliver a more reliable evaluation result.
Resumo:
Recommender systems are one of the recent inventions to deal with ever growing information overload in relation to the selection of goods and services in a global economy. Collaborative Filtering (CF) is one of the most popular techniques in recommender systems. The CF recommends items to a target user based on the preferences of a set of similar users known as the neighbours, generated from a database made up of the preferences of past users. With sufficient background information of item ratings, its performance is promising enough but research shows that it performs very poorly in a cold start situation where there is not enough previous rating data. As an alternative to ratings, trust between the users could be used to choose the neighbour for recommendation making. Better recommendations can be achieved using an inferred trust network which mimics the real world "friend of a friend" recommendations. To extend the boundaries of the neighbour, an effective trust inference technique is required. This thesis proposes a trust interference technique called Directed Series Parallel Graph (DSPG) which performs better than other popular trust inference algorithms such as TidalTrust and MoleTrust. Another problem is that reliable explicit trust data is not always available. In real life, people trust "word of mouth" recommendations made by people with similar interests. This is often assumed in the recommender system. By conducting a survey, we can confirm that interest similarity has a positive relationship with trust and this can be used to generate a trust network for recommendation. In this research, we also propose a new method called SimTrust for developing trust networks based on user's interest similarity in the absence of explicit trust data. To identify the interest similarity, we use user's personalised tagging information. However, we are interested in what resources the user chooses to tag, rather than the text of the tag applied. The commonalities of the resources being tagged by the users can be used to form the neighbours used in the automated recommender system. Our experimental results show that our proposed tag-similarity based method outperforms the traditional collaborative filtering approach which usually uses rating data.
Resumo:
As organizations reach higher levels of business process management maturity, they often find themselves maintaining very large process model repositories, representing valuable knowledge about their operations. A common practice within these repositories is to create new process models, or extend existing ones, by copying and merging fragments from other models. We contend that if these duplicate fragments, a.k.a. ex- act clones, can be identified and factored out as shared subprocesses, the repository’s maintainability can be greatly improved. With this purpose in mind, we propose an indexing structure to support fast detection of clones in process model repositories. Moreover, we show how this index can be used to efficiently query a process model repository for fragments. This index, called RPSDAG, is based on a novel combination of a method for process model decomposition (namely the Refined Process Structure Tree), with established graph canonization and string matching techniques. We evaluated the RPSDAG with large process model repositories from industrial practice. The experiments show that a significant number of non-trivial clones can be efficiently found in such repositories, and that fragment queries can be handled efficiently.
Resumo:
In the asymmetric unit of the title co-crystal, C12H14N4O2S . C7H5NO4 there are two independent but conformationally similar heterodimers, which are formed through intermolecular N-H...O(carboxy) and carboxyl O-H...N hydrogen-bond pairs, giving a cyclic motif [graph set R2/2(8)]. The dihedral angles between the rings in the sulfonamide molecules are 78.77(8) and 82.33(9)deg. while the dihedral angles between the ring and the CO2H group in the acids are 2.19(9) and 7.02(10)deg. A two-dimensional structure parallel to the ab plane is generated from the heterodimer units through hydrogen-bonding associations between NH2 and sulfone groups. Between neighbouring two-dimensional arrays there are two types of aromatic pi-pi stacking interactions involving either one of the pyrimidine rings and a 4-nitrobenzoic acid molecule [minimum ring centroid separation = 3.5886(9)A] or two acid molecules [minimum ring centroid separation = 3.7236(10)A].
Resumo:
Electronic Health Record (EHR) retrieval processes are complex demanding Information Technology (IT) resources exponentially in particular memory usage. Database-as-a-service (DAS) model approach is proposed to meet the scalability factor of EHR retrieval processes. A simulation study using ranged of EHR records with DAS model was presented. The bucket-indexing model incorporated partitioning fields and bloom filters in a Singleton design pattern were used to implement custom database encryption system. It effectively provided faster responses in the range query compared to different types of queries used such as aggregation queries among the DAS, built-in encryption and the plain-text DBMS. The study also presented with constraints around the approach should consider for other practical applications.
Resumo:
While researchers strive to improve automatic face recognition performance, the relationship between image resolution and face recognition performance has not received much attention. This relationship is examined systematically and a framework is developed such that results from super-resolution techniques can be compared. Three super-resolution techniques are compared with the Eigenface and Elastic Bunch Graph Matching face recognition engines. Parameter ranges over which these techniques provide better recognition performance than interpolated images is determined.
Resumo:
In the medical and healthcare arena, patients‟ data is not just their own personal history but also a valuable large dataset for finding solutions for diseases. While electronic medical records are becoming popular and are used in healthcare work places like hospitals, as well as insurance companies, and by major stakeholders such as physicians and their patients, the accessibility of such information should be dealt with in a way that preserves privacy and security. Thus, finding the best way to keep the data secure has become an important issue in the area of database security. Sensitive medical data should be encrypted in databases. There are many encryption/ decryption techniques and algorithms with regard to preserving privacy and security. Currently their performance is an important factor while the medical data is being managed in databases. Another important factor is that the stakeholders should decide more cost-effective ways to reduce the total cost of ownership. As an alternative, DAS (Data as Service) is a popular outsourcing model to satisfy the cost-effectiveness but it takes a consideration that the encryption/ decryption modules needs to be handled by trustworthy stakeholders. This research project is focusing on the query response times in a DAS model (AES-DAS) and analyses the comparison between the outsourcing model and the in-house model which incorporates Microsoft built-in encryption scheme in a SQL Server. This research project includes building a prototype of medical database schemas. There are 2 types of simulations to carry out the project. The first stage includes 6 databases in order to carry out simulations to measure the performance between plain-text, Microsoft built-in encryption and AES-DAS (Data as Service). Particularly, the AES-DAS incorporates implementations of symmetric key encryption such as AES (Advanced Encryption Standard) and a Bucket indexing processor using Bloom filter. The results are categorised such as character type, numeric type, range queries, range queries using Bucket Index and aggregate queries. The second stage takes the scalability test from 5K to 2560K records. The main result of these simulations is that particularly as an outsourcing model, AES-DAS using the Bucket index shows around 3.32 times faster than a normal AES-DAS under the 70 partitions and 10K record-sized databases. Retrieving Numeric typed data takes shorter time than Character typed data in AES-DAS. The aggregation query response time in AES-DAS is not as consistent as that in MS built-in encryption scheme. The scalability test shows that the DBMS reaches in a certain threshold; the query response time becomes rapidly slower. However, there is more to investigate in order to bring about other outcomes and to construct a secured EMR (Electronic Medical Record) more efficiently from these simulations.
Resumo:
The time consuming and labour intensive task of identifying individuals in surveillance video is often challenged by poor resolution and the sheer volume of stored video. Faces or identifying marks such as tattoos are often too coarse for direct matching by machine or human vision. Object tracking and super-resolution can then be combined to facilitate the automated detection and enhancement of areas of interest. The object tracking process enables the automatic detection of people of interest, greatly reducing the amount of data for super-resolution. Smaller regions such as faces can also be tracked. A number of instances of such regions can then be utilized to obtain a super-resolved version for matching. Performance improvement from super-resolution is demonstrated using a face verification task. It is shown that there is a consistent improvement of approximately 7% in verification accuracy, using both Eigenface and Elastic Bunch Graph Matching approaches for automatic face verification, starting from faces with an eye to eye distance of 14 pixels. Visual improvement in image fidelity from super-resolved images over low-resolution and interpolated images is demonstrated on a small database. Current research and future directions in this area are also summarized.