812 results for Hierarchical clustering
Abstract:
Performance comparisons between File Signatures and Inverted Files for text retrieval have previously shown several significant shortcomings of file signatures relative to inverted files. The inverted file approach underpins most state-of-the-art search engine algorithms, such as Language and Probabilistic models. It has been widely accepted that traditional file signatures are inferior alternatives to inverted files. This paper describes TopSig, a new approach to the construction of file signatures. Many advances in semantic hashing and dimensionality reduction have been made in recent times, but these have not so far been linked to general-purpose, signature-file-based search engines. This paper introduces a different signature file approach that builds upon and extends these recent advances. We are able to demonstrate significant improvements in the performance of signature-file-based indexing and retrieval, performance that is comparable to that of state-of-the-art inverted-file-based systems, including Language models and BM25. These findings suggest that file signatures offer a viable alternative to inverted files in suitable settings and position the file signatures model in the class of Vector Space retrieval models.
Abstract:
We consider the problem of choosing, sequentially, a map which assigns elements of a set A to a few elements of a set B. On each round, the algorithm suffers some cost associated with the chosen assignment, and the goal is to minimize the cumulative loss of these choices relative to the best map on the entire sequence. Even though the offline problem of finding the best map is provably hard, we show that there is an equivalent online approximation algorithm, Randomized Map Prediction (RMP), that is efficient and performs nearly as well. While drawing upon results from the "Online Prediction with Expert Advice" setting, we show how RMP can be utilized as an online approach to several standard batch problems. We apply RMP to online clustering as well as online feature selection and, surprisingly, RMP often outperforms the standard batch algorithms on these problems.
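The "Online Prediction with Expert Advice" setting mentioned above can be illustrated with the standard exponential-weights (Hedge) forecaster. This is a minimal sketch of that generic technique, not the RMP algorithm from the paper; the function name `hedge`, the learning rate `eta`, and the shape of the loss table are assumptions made for illustration.

```python
import math
import random


def hedge(loss_rounds, eta=0.5, seed=0):
    """Exponential-weights forecaster over a fixed set of experts.

    loss_rounds: list of rounds; each round is a list of losses in [0, 1],
                 one per expert (hypothetical toy input).
    Returns (total loss suffered by the learner, final normalized weights).
    """
    rng = random.Random(seed)
    n = len(loss_rounds[0])
    weights = [1.0] * n
    total_loss = 0.0
    for losses in loss_rounds:
        # Sample one expert with probability proportional to its weight.
        choice = rng.choices(range(n), weights=weights, k=1)[0]
        total_loss += losses[choice]
        # Multiplicatively penalize every expert by its observed loss.
        weights = [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]
    norm = sum(weights)
    return total_loss, [w / norm for w in weights]
```

In a map-prediction setting, each "expert" would correspond to one candidate assignment; since the number of maps is exponential, a practical algorithm needs a more efficient representation than the explicit weight list used here.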
Abstract:
We have used microarray gene expression profiling and machine learning to predict the presence of BRAF mutations in a panel of 61 melanoma cell lines. The BRAF gene was found to be mutated in 42 samples (69%) and intragenic mutations of the NRAS gene were detected in seven samples (11%). No cell line carried mutations of both genes. Using support vector machines, we have built a classifier that differentiates between melanoma cell lines based on BRAF mutation status. As few as 83 genes are able to discriminate between BRAF mutant and BRAF wild-type samples with clear separation observed using hierarchical clustering. Multidimensional scaling was used to visualize the relationship between a BRAF mutation signature and that of a generalized mitogen-activated protein kinase (MAPK) activation (either BRAF or NRAS mutation) in the context of the discriminating gene list. We observed that samples carrying NRAS mutations lie somewhere between those with or without BRAF mutations. These observations suggest that there are gene-specific mutation signals in addition to a common MAPK activation that result from the pleiotropic effects of either BRAF or NRAS on other signaling pathways, leading to measurably different transcriptional changes.
Abstract:
This thesis investigates profiling and differentiating customers through the use of statistical data mining techniques. The business application of our work centres on examining individuals' seldom studied yet critical consumption behaviour over an extensive time period within the context of the wireless telecommunication industry; consumption behaviour (as opposed to purchasing behaviour) is behaviour that has been performed so frequently that it has become habitual and involves minimal intention or decision making. Key variables investigated are the activity initialised timestamp and cell tower location as well as the activity type and usage quantity (e.g., voice call with duration in seconds); the research focuses on customers' spatial and temporal usage behaviour. The main methodological emphasis is on the development of clustering models based on Gaussian mixture models (GMMs), which are fitted with the use of the recently developed variational Bayesian (VB) method. VB is an efficient deterministic alternative to the popular but computationally demanding Markov chain Monte Carlo (MCMC) methods. The standard VB-GMM algorithm is extended by allowing component splitting such that it is robust to initial parameter choices and can automatically and efficiently determine the number of components. The new algorithm we propose allows more effective modelling of individuals' highly heterogeneous and spiky spatial usage behaviour, or more generally human mobility patterns; the term spiky describes data patterns with large areas of low probability mixed with small areas of high probability. Customers are then characterised and segmented based on the fitted GMM, which corresponds to how each of them uses the products/services spatially in their daily lives; this is essentially their likely lifestyle and occupational traits.
Other significant research contributions include fitting GMMs using VB to circular data, i.e., the temporal usage behaviour, and developing clustering algorithms suitable for high-dimensional data based on the use of VB-GMM.
Abstract:
Continuous user authentication with keystroke dynamics uses character sequences as features. Since users can type characters in any order, it is imperative to find character sequences (n-graphs) that are representative of user typing behavior. Contemporary feature selection approaches do not guarantee selecting frequently typed features, which may cause less accurate statistical user representation. Furthermore, the selected features do not inherently reflect user typing behavior. We propose four statistics-based feature selection techniques that mitigate the limitations of existing approaches. The first technique selects the most frequently occurring features. The other three consider different user typing behaviors by selecting: n-graphs that are typed quickly; n-graphs that are typed with consistent time; and n-graphs that have large time variance among users. We use Gunetti's keystroke dataset and the k-means clustering algorithm for our experiments. The results show that, among the proposed techniques, the most-frequent feature selection technique can effectively find user-representative features. We further substantiate our results by comparing the most-frequent feature selection technique with three existing approaches (popular Italian words, common n-graphs, and least frequent n-graphs). We find that it performs better than the existing approaches after selecting a certain number of most-frequent n-graphs.
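The first technique in the abstract above, selecting the most frequently occurring n-graphs, can be sketched as follows. This is an illustrative sketch only: the function name `most_frequent_ngraphs`, the restriction of n-graphs to within a word, and the alphabetical tie-breaking are assumptions, and the timing-based techniques are not shown.

```python
from collections import Counter


def most_frequent_ngraphs(text, n=2, k=3):
    """Return the k most frequent character n-graphs in `text`."""
    counts = Counter()
    # Assumption: n-graphs are counted inside each word, never across spaces.
    for word in text.lower().split():
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    # Sort by descending count, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:k]]
```

The selected n-graphs would then serve as the feature set whose typing times are fed to a clustering algorithm such as k-means.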
Abstract:
The strain-induced self-assembly of suitable semiconductor pairs is an attractive natural route to nanofabrication. To bring to fruition their full potential for actual applications, individual nanostructures need to be combined into ordered patterns in which the location of each single unit is coupled with others and the surrounding environment. Within the Ge/Si model system, we analyze a number of examples of bottom-up strategies in which the shape, positioning, and actual growth mode of epitaxial nanostructures are tailored by manipulating the intrinsic physical processes of heteroepitaxy. The possibility of controlling elastic interactions and, hence, the configuration of self-assembled quantum dots by modulating surface orientation with the miscut angle is discussed. We focus on the use of atomic steps and step bunching as natural templates for nanodot clustering. Then, we consider several different patterning techniques which allow one to harness the natural self-organization dynamics of the system, such as: scanning tunneling nanolithography, focused ion beam and nanoindentation patterning. By analyzing the evolution of the dot assembly by scanning probe microscopy, we follow the pathway which leads to lateral ordering, discussing the thermodynamic and kinetic effects involved in selective nucleation on patterned substrates.
Abstract:
Photochemistry has made significant contributions to our understanding of many important natural processes as well as the scientific discoveries of the man-made world. The measurements from such studies are often complex and may require advanced data interpretation with the use of multivariate or chemometrics methods. In general, such methods have been applied successfully for data display, classification, multivariate curve resolution and prediction in analytical chemistry, environmental chemistry, engineering, medical research and industry. However, in photochemistry, by comparison, applications of such multivariate approaches were found to be less frequent although a variety of methods have been used, especially with spectroscopic photochemical applications. The methods include Principal Component Analysis (PCA; data display), Partial Least Squares (PLS; prediction), Artificial Neural Networks (ANN; prediction) and several models for multivariate curve resolution related to Parallel Factor Analysis (PARAFAC; decomposition of complex responses). Applications of such methods are discussed in this overview and typical examples include photodegradation of herbicides, prediction of antibiotics in human fluids (fluorescence spectroscopy), non-destructive in- and on-line monitoring (near infrared spectroscopy) and fast-time resolution of spectroscopic signals from photochemical reactions. It is also quite clear from the literature that the scope of spectroscopic photochemistry was enhanced by the application of chemometrics. To highlight and encourage further applications of chemometrics in photochemistry, several additional chemometrics approaches are discussed using data collected by the authors. The use of a PCA biplot is illustrated with an analysis of a matrix containing data on the performance of photocatalysts developed for water splitting and hydrogen production. 
In addition, the applications of the Multi-Criteria Decision Making (MCDM) ranking methods and Fuzzy Clustering are demonstrated with an analysis of a water quality data matrix. Other examples of topics include the application of simultaneous kinetic spectroscopic methods for prediction of pesticides, and the use of a response fingerprinting approach for classification of medicinal preparations. In general, the overview endeavours to emphasise the advantages of chemometric interpretation of multivariate photochemical data, and an Appendix of references and summaries of common and less usual chemometrics methods noted in this work is provided. Crown Copyright © 2010.
Abstract:
Obesity is a major public health problem in both developed and developing countries. The body mass index (BMI) is the most common index used to define obesity. The universal application of the same BMI classification across different ethnic groups is being challenged due to the inability of the index to differentiate fat mass (FM) and fat-free mass (FFM) and the recognized ethnic differences in body composition. A better understanding of the body composition of Asian children from different backgrounds would help to better understand the obesity-related health risks of people in this region. Moreover, the limitations of the BMI underscore the necessity to use, where possible, more accurate measures of body fat assessment in research and clinical settings in addition to BMI, particularly in relation to the monitoring of prevention and treatment efforts. The aim of the first study was to determine the ethnic difference in the relationship between BMI and percent body fat (%BF) in pre-pubertal Asian children from China, Lebanon, Malaysia, the Philippines, and Thailand. A total of 1039 children aged 8-10 y were recruited using a non-random purposive sampling approach aiming to encompass a wide BMI range from the five countries. Percent body fat (%BF) was determined using the deuterium dilution technique to quantify total body water (TBW) and subsequently derive proportions of FM and FFM. The study highlighted the sex and ethnic differences between BMI and %BF in Asian children from different countries. Girls had approximately 4.0% higher %BF compared with boys at a given BMI. Filipino boys tended to have a lower %BF than their Chinese, Lebanese, Malay and Thai counterparts at the same age and BMI level (corrected mean %BF was 25.7±0.8%, 27.4±0.4%, 27.1±0.6%, 27.7±0.5%, 28.1±0.5% for Filipino, Chinese, Lebanese, Malay and Thai boys, respectively), although they differed significantly from Thai and Malay boys.
Thai girls had approximately 2.0% higher %BF values than Chinese, Lebanese, Filipino and Malay counterparts (however no significant difference was seen among the four ethnic groups) at a given BMI (corrected mean %BF was 31.1±0.5%, 28.6±0.4%, 29.2±0.6%, 29.5±0.6%, 29.5±0.5% for Thai, Chinese, Lebanese, Malay and Filipino girls, respectively). However, the ethnic difference in the BMI-%BF relationship varied by BMI. Compared with Caucasians, Asian children had a BMI 3-6 units lower for a given %BF. More than one third of obese Asian children in the study were not identified using the WHO classification and more than half were not identified using the International Obesity Task Force (IOTF) classification. However, use of the Chinese classification increased the sensitivity by 19.7%, 18.1%, 2.3%, 2.3%, and 11.3% for Chinese, Lebanese, Malay, Filipino and Thai girls, respectively. A further aim of the first study was to determine the ethnic difference in body fat distribution in pre-pubertal Asian children from China, Lebanon, Malaysia, and Thailand. The skin fold thicknesses, height, weight, waist circumference (WC) and total adiposity (as determined by the deuterium dilution technique) of 922 children from the four countries were assessed. Chinese boys and girls had a similar trunk-to-extremity skin fold thickness ratio to Thai counterparts and both groups had higher ratios than the Malays and Lebanese at a given total FM. At a given BMI, both Chinese and Thai boys and girls had a higher WC than Malays and Lebanese (corrected mean WC was 68.1±0.2 cm, 67.8±0.3 cm, 65.8±0.4 cm, 64.1±0.3 cm for Chinese, Thai, Lebanese and Malay boys, respectively; 64.2±0.2 cm, 65.0±0.3 cm, 62.9±0.4 cm, 60.6±0.3 cm for Chinese, Thai, Lebanese and Malay girls, respectively). Chinese boys and girls had a lower trunk fat adjusted subscapular/suprailiac skinfold ratio compared with Lebanese and Malay counterparts.
The second study aimed to develop and cross-validate bioelectrical impedance analysis (BIA) prediction equations of TBW and FFM for Asian pre-pubertal children from China, Lebanon, Malaysia, the Philippines, and Thailand. Data on height, weight, age, gender, resistance and reactance measured by BIA were collected from 948 Asian children (492 boys and 456 girls) aged 8-10 y from the five countries. The deuterium dilution technique was used as the criterion method for the estimation of TBW and FFM. The BIA equations were developed from the validation group (630 children randomly selected from the total sample) using stepwise multiple regression analysis and cross-validated in a separate group (318 children) using the Bland-Altman approach. Age, gender and ethnicity influenced the relationship between the resistance index (RI = height²/resistance), TBW and FFM. The BIA prediction equation for the estimation of TBW was: TBW (kg) = 0.231 × Height² (cm)/resistance (Ω) + 0.066 × Height (cm) + 0.188 × Weight (kg) + 0.128 × Age (yr) + 0.500 × Sex (male=1, female=0) - 0.316 × Ethnicity (Thai ethnicity=1, others=0) - 4.574, and for the estimation of FFM: FFM (kg) = 0.299 × Height² (cm)/resistance (Ω) + 0.086 × Height (cm) + 0.245 × Weight (kg) + 0.260 × Age (yr) + 0.901 × Sex (male=1, female=0) - 0.415 × Ethnicity (Thai ethnicity=1, others=0) - 6.952. The R² was 88.0% (root mean square error, RMSE = 1.3 kg) and 88.3% (RMSE = 1.7 kg) for the TBW and FFM equations, respectively. No significant difference between measured and predicted TBW and between measured and predicted FFM for the whole cross-validation sample was found (bias = -0.1±1.4 kg, pure error = 1.4±2.0 kg for TBW and bias = -0.2±1.9 kg, pure error = 1.8±2.6 kg for FFM). However, the prediction equations tended to overestimate TBW/FFM at lower levels and underestimate them at higher levels of TBW/FFM.
Accuracy of the general equation for TBW and FFM compared favorably with both BMI-specific and ethnic-specific equations. There were significant differences between predicted TBW and FFM from external BIA equations derived from Caucasian populations and measured values in Asian children. There were three specific aims of the third study. The first was to explore the relationship between obesity and metabolic syndrome and abnormalities in Chinese children. A total of 608 boys and 800 girls aged 6-12 y were recruited from four cities in China. Three definitions of pediatric metabolic syndrome and abnormalities were used, including the International Diabetes Federation (IDF) and National Cholesterol Education Program (NCEP) definition for adults modified by Cook et al. and de Ferranti et al. The prevalence of metabolic syndrome varied with the definition used: it was highest using the de Ferranti definition (5.4%, 24.6% and 42.0%, respectively for normal-weight, overweight and obese children), followed by the Cook definition (1.5%, 8.1%, and 25.1%, respectively), and the IDF definition (0.5%, 1.8% and 8.3%, respectively). Overweight and obese children had a higher risk of developing the metabolic syndrome compared to normal-weight children (odds ratio varied with different definitions from 3.958 to 6.866 for overweight children, and from 12.640 to 26.007 for obese children). Overweight and obesity also increased the risk of developing metabolic abnormalities. Central obesity and high triglycerides (TG) were the most common while hyperglycemia was the least frequent in Chinese children regardless of different definitions. The second purpose was to determine the best obesity index for the prediction of cardiovascular (CV) risk factor clustering across a 2-y follow-up among BMI, %BF, WC and waist-to-height ratio (WHtR) in Chinese children.
Height, weight, WC, %BF as determined by BIA, blood pressure, TG, high-density lipoprotein cholesterol (HDL-C), and fasting glucose were collected at baseline and 2 years later in 292 boys and 277 girls aged 8-10 y. The results showed the percentage of children who remained overweight/obese defined on the basis of BMI, WC, WHtR and %BF was 89.7%, 93.5%, 84.5%, and 80.4%, respectively after 2 years. Obesity indices at baseline significantly correlated with TG, HDL-C, and blood pressure at both baseline and 2 years later with a similar strength of correlations. BMI at baseline explained the greatest variance of later blood pressure. WC at baseline explained the greatest variance of later HDL-C and glucose, while WHtR at baseline was the main predictor of later TG. Receiver-operating characteristic (ROC) analysis explored the ability of the four indices to identify the later presence of CV risk. The overweight/obese children defined on the basis of BMI, WC, WHtR or %BF were more likely to develop CV risk 2 years later with relative risk (RR) scores of 3.670, 3.762, 2.767, and 2.804, respectively. The final purpose of the third study was to develop age- and gender-specific percentiles of WC and WHtR and cut-off points of WC and WHtR for the prediction of CV risk in Chinese children. Smoothed percentile curves of WC and WHtR were produced in 2830 boys and 2699 girls aged 6-12 y randomly selected from southern and northern China using the LMS method. The optimal age- and gender-specific thresholds of WC and WHtR for the prediction of cardiovascular risk factors clustering were derived in a sub-sample (n=1845) by ROC analysis. Age- and gender-specific WC and WHtR percentiles were constructed. The WC thresholds were at the 90th and 84th percentiles for Chinese boys and girls, respectively, with sensitivity and specificity ranging from 67.2% to 83.3%.
The WHtR thresholds were at the 91st and 94th percentiles for Chinese boys and girls, respectively, with sensitivity and specificity ranging from 78.6% to 88.9%. The cut-offs of both WC and WHtR were age- and gender-dependent. In conclusion, the current thesis quantifies the ethnic differences in the BMI-%BF relationship and body fat distribution between Asian children from different origins and confirms the necessity to consider ethnic differences in body composition when developing BMI and other obesity index criteria for obesity in Asian children. Moreover, ethnicity is also important in BIA prediction equations. In addition, WC and WHtR percentiles and thresholds for the prediction of CV risk in Chinese children differ from other populations. Although there was no advantage of WC or WHtR over BMI or %BF in the prediction of CV risk, obese children had a higher risk of developing the metabolic syndrome and abnormalities than normal-weight children regardless of the obesity index used.
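The two BIA prediction equations quoted in the thesis abstract above can be written out as plain functions. This is a sketch for illustration only: the function and parameter names are hypothetical, and the sign of the ethnicity term in the TBW equation is garbled in the source text, so it is assumed here to be negative by analogy with the FFM equation.

```python
def resistance_index(height_cm, resistance_ohm):
    """Resistance index RI = height^2 / resistance, as defined in the abstract."""
    return height_cm ** 2 / resistance_ohm


def predict_tbw_kg(height_cm, resistance_ohm, weight_kg, age_yr, male, thai):
    """Total body water (kg); `male` and `thai` are 1 or 0."""
    ri = resistance_index(height_cm, resistance_ohm)
    # Assumption: the ethnicity coefficient is subtracted (sign unclear in source).
    return (0.231 * ri + 0.066 * height_cm + 0.188 * weight_kg
            + 0.128 * age_yr + 0.500 * male - 0.316 * thai - 4.574)


def predict_ffm_kg(height_cm, resistance_ohm, weight_kg, age_yr, male, thai):
    """Fat-free mass (kg); `male` and `thai` are 1 or 0."""
    ri = resistance_index(height_cm, resistance_ohm)
    return (0.299 * ri + 0.086 * height_cm + 0.245 * weight_kg
            + 0.260 * age_yr + 0.901 * male - 0.415 * thai - 6.952)
```

Given a predicted FFM, percent body fat would follow as 100 × (weight − FFM) / weight. The equations were developed for children aged 8-10 y from the five countries studied, so inputs outside that range fall outside their stated scope.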
Abstract:
Most recommendation methods employ item-item similarity measures or use ratings data to generate recommendations. These methods use traditional two-dimensional models to find inter-relationships between similar users and products. This paper proposes a novel recommendation method that uses a multi-dimensional model, the tensor, to group similar users based on common search behaviour and then finds associations within such groups to make effective inter-group recommendations. Web log data is multi-dimensional data. Unlike vector-based methods, tensors have the ability to highly correlate and find latent relationships between such similar instances, consisting of users and searches. Non-redundant rules from such associations of user-searches are then used for making recommendations to the users.
Abstract:
Video surveillance systems using Closed Circuit Television (CCTV) cameras are one of the fastest growing areas in the field of security technologies. However, the existing video surveillance systems are still not at a stage where they can be used for crime prevention. The systems rely heavily on human observers and are therefore limited by factors such as fatigue and monitoring capabilities over long periods of time. This work attempts to address these problems by proposing an automatic suspicious behaviour detection system which utilises contextual information. The utilisation of contextual information is done via three main components: a context space model, a data stream clustering algorithm, and an inference algorithm. The utilisation of contextual information is still limited in the domain of suspicious behaviour detection. Furthermore, it is nearly impossible to correctly understand human behaviour without considering the context where it is observed. This work presents experiments using video feeds taken from the CAVIAR dataset and a camera mounted on one of the buildings (Z-Block) at the Queensland University of Technology, Australia. From these experiments, it is shown that by exploiting contextual information, the proposed system is able to make more accurate detections, especially of those behaviours which are only suspicious in some contexts while being normal in others. Moreover, this information gives critical feedback to the system designers to refine the system.
Abstract:
The multifractal properties of two indices of geomagnetic activity, Dst (representative of low latitudes) and ap (representative of the global geomagnetic activity), together with the solar X-ray brightness, Xl, during the period from 1 March 1995 to 17 June 2003 are examined using multifractal detrended fluctuation analysis (MF-DFA). The h(q) curves of Dst and ap in the MF-DFA are similar to each other, but they are different from that of Xl, indicating that the scaling properties of Xl are different from those of Dst and ap. Hence, one should not predict the magnitude of magnetic storms directly from solar X-ray observations. However, a strong relationship exists between the classes of the solar X-ray irradiance (the classes being chosen to separate solar flares of class X-M, class C, and class B or less, including no flares) in hourly measurements and the geomagnetic disturbances (large to moderate, small, or quiet) seen in Dst and ap during the active period. Each time series was converted into a symbolic sequence using three classes. The frequency of the substrings in the symbolic sequences, which yields the measure representations, then characterizes the pattern of space weather events. Using the MF-DFA method and traditional multifractal analysis, we calculate the h(q), D(q), and τ(q) curves of the measure representations. The τ(q) curves indicate that the measure representations of these three indices are multifractal. On the basis of this three-class clustering, we find that the h(q), D(q), and τ(q) curves of the measure representations of these three indices are similar to each other for positive values of q. Hence, a positive flare-storm class dependence is reflected in the scaling exponents h(q) in the MF-DFA and the multifractal exponents D(q) and τ(q). This finding indicates that the use of the solar flare classes could improve the prediction of the Dst classes.
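The step described above, converting a time series into a three-class symbolic sequence and counting substring frequencies, can be sketched as follows. The function names and the two numeric thresholds are hypothetical; the paper derives its classes from flare classes and disturbance levels rather than from arbitrary cut-offs, and the multifractal analysis itself is not shown.

```python
from collections import Counter


def to_symbols(values, low, high):
    """Map each value to a class: 0 (below low), 1 (low to high), 2 (above high)."""
    return [0 if v < low else (1 if v <= high else 2) for v in values]


def substring_frequencies(symbols, k):
    """Relative frequency of every length-k substring in the symbolic sequence."""
    windows = [tuple(symbols[i:i + k]) for i in range(len(symbols) - k + 1)]
    counts = Counter(windows)
    total = len(windows)
    return {window: count / total for window, count in counts.items()}
```

With three classes there are 3^k possible substrings of length k; arranging their frequencies in a fixed order gives a probability measure of the kind the abstract calls a measure representation.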
Abstract:
Automatic species recognition plays an important role in assisting ecologists to monitor the environment. One critical issue in this research area is that software developers need prior knowledge of the specific targets people are interested in to build templates for these targets. This paper proposes a novel approach for automatic species recognition based on generic knowledge about acoustic events to detect species. Acoustic component detection is the most critical and fundamental part of this proposed approach. This paper gives clear definitions of acoustic components and presents three clustering algorithms for detecting four acoustic components in sound recordings: whistles, clicks, slurs, and blocks. The experimental results demonstrate that these acoustic component recognisers achieve high precision and recall rates.
Abstract:
Diversity techniques have long been used to combat channel fading in wireless communications systems. Recently, cooperative communications has attracted a lot of attention due to the many benefits it offers. Thus, cooperative routing protocols with diversity transmission can be developed to exploit the random nature of the wireless channels and improve network efficiency by selecting multiple cooperative nodes to forward data. In this paper we analyze and evaluate the performance of a novel routing protocol with multiple cooperative nodes which share multiple channels. The multiple shared channels cooperative (MSCC) routing protocol achieves a diversity advantage by using cooperative transmission. It unites a clustering hierarchy with a bandwidth reuse scheme to mitigate co-channel interference. Theoretical analyses of the average packet reception rate and network throughput of the MSCC protocol are presented and compared with simulated results.
Abstract:
A new relationship type of social network, online dating, is gaining popularity. With a large member base, users of a dating network are overloaded with choices about their ideal partners. Recommendation methods can be utilized to overcome this problem. However, traditional recommendation methods do not work effectively for online dating networks where the dataset is sparse and large, and two-way matching is required. This paper applies social networking concepts to solve the problem of developing a recommendation method for online dating networks. We propose a method that uses clustering, SimRank and adapted SimRank algorithms to recommend matching candidates. Empirical results show that the proposed method can achieve nearly double the performance of the traditional collaborative filtering and common neighbor methods of recommendation.
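For readers unfamiliar with SimRank, which the abstract above builds on, the basic iterative form is sketched below. This is the textbook algorithm only, not the adapted SimRank variant proposed in the paper; the function name, the decay constant of 0.8, and the dictionary-based graph format are assumptions for illustration.

```python
def simrank(in_neighbors, decay=0.8, iterations=10):
    """Basic SimRank on a directed graph.

    in_neighbors: dict mapping every node to a list of its in-neighbors;
                  each in-neighbor must itself be a key of the dict.
    Returns a dict {(a, b): similarity score}.
    """
    nodes = list(in_neighbors)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        new_sim = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new_sim[(a, b)] = 1.0  # a node is maximally similar to itself
                    continue
                ia, ib = in_neighbors[a], in_neighbors[b]
                if not ia or not ib:
                    new_sim[(a, b)] = 0.0  # no in-neighbors, no evidence
                    continue
                # Average similarity over all pairs of in-neighbors, then decay.
                total = sum(sim[(i, j)] for i in ia for j in ib)
                new_sim[(a, b)] = decay * total / (len(ia) * len(ib))
        sim = new_sim
    return sim
```

The intuition is that two nodes are similar if they are referenced by similar nodes; in a dating network an edge might stand for one user contacting another, though that mapping is an illustrative assumption rather than the paper's stated construction.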
Abstract:
Handling information overload online is, from the user's point of view, a big challenge, especially when the number of websites is growing rapidly due to growth in e-commerce and other related activities. Personalization based on user needs is the key to solving the problem of information overload. Personalization methods help in identifying relevant information which may be liked by a user. User profiles and object profiles are the important elements of a personalization system. When creating user and object profiles, most of the existing methods adopt two-dimensional similarity methods based on vector or matrix models in order to find inter-user and inter-object similarity. Moreover, for recommending similar objects to users, personalization systems use the users-users, items-items and users-items similarity measures. In most cases, similarity measures such as Euclidean, Manhattan, cosine and many others based on vector or matrix methods are used to find the similarities. Web logs are high-dimensional datasets, consisting of multiple users and multiple searches with many attributes each. Two-dimensional data analysis methods may often overlook latent relationships that may exist between users and items. In contrast to other studies, this thesis utilises tensors, the high-dimensional data models, to build user and object profiles and to find the inter-relationships between users-users and users-items. To create an improved personalized Web system, this thesis proposes to build three types of profiles (individual user, group user and object profiles) utilising decomposition factors of tensor data models. A hybrid recommendation approach utilising group profiles (forming the basis of a collaborative filtering method) and object profiles (forming the basis of a content-based method) in conjunction with individual user profiles (forming the basis of a model-based approach) is proposed for making effective recommendations.
A tensor-based clustering method is proposed that utilises the outcomes of popular tensor decomposition techniques such as PARAFAC, Tucker and HOSVD to group similar instances. An individual user profile, showing the user's highest interest, is represented by the top dimension values, extracted from the component matrix obtained after tensor decomposition. A group profile, showing similar users and their highest interest, is built by clustering similar users based on tensor decomposed values. A group profile is represented by the top association rules (containing various unique object combinations) that are derived from the searches made by the users of the cluster. An object profile is created to represent similar objects clustered on the basis of their similarity of features. Depending on the category of a user (known, anonymous or frequent visitor to the website), any of the profiles or their combinations is used for making personalized recommendations. A ranking algorithm is also proposed that utilizes the personalized information to order and rank the recommendations. The proposed methodology is evaluated on data collected from a real life car website. Empirical analysis confirms the effectiveness of recommendations made by the proposed approach over other collaborative filtering and content-based recommendation approaches based on two-dimensional data analysis methods.