116 resultados para height partition clustering


Relevância:

20.00% 20.00%

Publicador:

Resumo:

Dealing with product yield and quality in manufacturing industries is getting more difficult due to the increasing volume and complexity of data and quicker time to market expectations. Data mining offers tools for quick discovery of relationships, patterns and knowledge in large databases. Growing self-organizing map (GSOM) is established as an efficient unsupervised datamining algorithm. In this study some modifications to the original GSOM are proposed for manufacturing yield improvement by clustering. These modifications include introduction of a clustering quality measure to evaluate the performance of the programme in separating good and faulty products and a filtering index to reduce noise from the dataset. Results show that the proposed method is able to effectively differentiate good and faulty products. It will help engineers construct the knowledge base to predict product quality automatically from collected data and provide insights for yield improvement.

Relevância:

20.00% 20.00%

Publicador:

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Obesity is a major public health problem in both developed and developing countries. The body mass index (BMI) is the most common index used to define obesity. The universal application of the same BMI classification across different ethnic groups is being challenged due to the inability of the index to differentiate fat mass (FM) and fat�]free mass (FFM) and the recognized ethnic differences in body composition. A better understanding of the body composition of Asian children from different backgrounds would help to better understand the obesity�]related health risks of people in this region. Moreover, the limitations of the BMI underscore the necessity to use where possible, more accurate measures of body fat assessment in research and clinical settings in addition to BMI, particularly in relation to the monitoring of prevention and treatment efforts. The aim of the first study was to determine the ethnic difference in the relationship between BMI and percent body fat (%BF) in pre�]pubertal Asian children from China, Lebanon, Malaysia, the Philippines, and Thailand. A total of 1039 children aged 8�]10 y were recruited using a non�]random purposive sampling approach aiming to encompass a wide BMI range from the five countries. Percent body fat (%BF) was determined using the deuterium dilution technique to quantify total body water (TBW) and subsequently derive proportions of FM and FFM. The study highlighted the sex and ethnic differences between BMI and %BF in Asian children from different countries. Girls had approximately 4.0% higher %BF compared with boys at a given BMI. Filipino boys tended to have a lower %BF than their Chinese, Lebanese, Malay and Thai counterparts at the same age and BMI level (corrected mean %BF was 25.7�}0.8%, 27.4�}0.4%, 27.1�}0.6%, 27.7�}0.5%, 28.1�}0.5% for Filipino, Chinese, Lebanese, Malay and Thai boys, respectively), although they differed significantly from Thai and Malay boys. Thai girls had approximately 2.0% higher %BF values than Chinese, Lebanese, Filipino and Malay counterparts (however no significant difference was seen among the four ethnic groups) at a given BMI (corrected mean %BF was 31.1�}0.5%, 28.6�}0.4%, 29.2�}0.6%, 29.5�}0.6%, 29.5�}0.5% for Thai, Chinese, Lebanese, Malay and Filipino girls, respectively). However, the ethnic difference in BMI�]%BF relationship varied by BMI. Compared with Caucasians, Asian children had a BMI 3�]6 units lower for a given %BF. More than one third of obese Asian children in the study were not identified using the WHO classification and more than half were not identified using the International Obesity Task Force (IOTF) classification. However, use of the Chinese classification increased the sensitivity by 19.7%, 18.1%, 2.3%, 2.3%, and 11.3% for Chinese, Lebanese, Malay, Filipino and Thai girls, respectively. A further aim of the first study was to determine the ethnic difference in body fat distribution in pre�]pubertal Asian children from China, Lebanon, Malaysia, and Thailand. The skin fold thicknesses, height, weight, waist circumference (WC) and total adiposity (as determined by deuterium dilution technique) of 922 children from the four countries was assessed. Chinese boys and girls had a similar trunk�]to�]extremity skin fold thickness ratio to Thai counterparts and both groups had higher ratios than the Malays and Lebanese at a given total FM. At a given BMI, both Chinese and Thai boys and girls had a higher WC than Malays and Lebanese (corrected mean WC was 68.1�}0.2 cm, 67.8�}0.3 cm, 65.8�}0.4 cm, 64.1�}0.3 cm for Chinese, Thai, Lebanese and Malay boys, respectively; 64.2�}0.2 cm, 65.0�}0.3 cm, 62.9�}0.4 cm, 60.6�}0.3 cm for Chinese, Thai, Lebanese and Malay girls, respectively). Chinese boys and girls had lower trunk fat adjusted subscapular/suprailiac skinfold ratio compared with Lebanese and Malay counterparts. The second study aimed to develop and cross�]validate bioelectrical impedance analysis (BIA) prediction equations of TBW and FFM for Asian pre�]pubertal children from China, Lebanon, Malaysia, the Philippines, and Thailand. Data on height, weight, age, gender, resistance and reactance measured by BIA were collected from 948 Asian children (492 boys and 456 girls) aged 8�]10 y from the five countries. The deuterium dilution technique was used as the criterion method for the estimation of TBW and FFM. The BIA equations were developed from the validation group (630 children randomly selected from the total sample) using stepwise multiple regression analysis and cross�]validated in a separate group (318 children) using the Bland�]Altman approach. Age, gender and ethnicity influenced the relationship between the resistance index (RI = height2/resistance), TBW and FFM. The BIA prediction equation for the estimation of TBW was: TBW (kg) = 0.231�~Height2 (cm)/resistance (ƒ¶) + 0.066�~Height (cm) + 0.188�~Weight (kg) + 0.128�~Age (yr) + 0.500�~Sex (male=1, female=0) . 0.316�~Ethnicity (Thai ethnicity=1, others=0) �] 4.574, and for the estimation of FFM: FFM (kg) = 0.299�~Height2 (cm)/resistance (ƒ¶) + 0.086�~Height (cm) + 0.245�~Weight (kg) + 0.260�~Age (yr) + 0.901�~Sex (male=1, female=0) �] 0.415�~Ethnicity (Thai ethnicity=1, others=0) �] 6.952. The R2 was 88.0% (root mean square error, RSME = 1.3 kg), 88.3% (RSME = 1.7 kg) for TBW and FFM equation, respectively. No significant difference between measured and predicted TBW and between measured and predicted FFM for the whole cross�]validation sample was found (bias = �]0.1�}1.4 kg, pure error = 1.4�}2.0 kg for TBW and bias = �]0.2�}1.9 kg, pure error = 1.8�}2.6 kg for FFM). However, the prediction equation for estimation of TBW/FFM tended to overestimate TBW/FFM at lower levels while underestimate at higher levels of TBW/FFM. Accuracy of the general equation for TBW and FFM compared favorably with both BMI�]specific and ethnic�]specific equations. There were significant differences between predicted TBW and FFM from external BIA equations derived from Caucasian populations and measured values in Asian children. There were three specific aims of the third study. The first was to explore the relationship between obesity and metabolic syndrome and abnormalities in Chinese children. A total of 608 boys and 800 girls aged 6�]12 y were recruited from four cities in China. Three definitions of pediatric metabolic syndrome and abnormalities were used, including the International Diabetes Federation (IDF) and National Cholesterol Education Program (NCEP) definition for adults modified by Cook et al. and de Ferranti et al. The prevalence of metabolic syndrome varied with different definitions, was highest using the de Ferranti definition (5.4%, 24.6% and 42.0%, respectively for normal�]weight, overweight and obese children), followed by the Cook definition (1.5%, 8.1%, and 25.1%, respectively), and the IDF definition (0.5%, 1.8% and 8.3%, respectively). Overweight and obese children had a higher risk of developing the metabolic syndrome compared to normal�]weight children (odds ratio varied with different definitions from 3.958 to 6.866 for overweight children, and 12.640�]26.007 for obese children). Overweight and obesity also increased the risk of developing metabolic abnormalities. Central obesity and high triglycerides (TG) were the most common while hyperglycemia was the least frequent in Chinese children regardless of different definitions. The second purpose was to determine the best obesity index for the prediction of cardiovascular (CV) risk factor clustering across a 2�]y follow�]up among BMI, %BF, WC and waist�]to�]height ratio (WHtR) in Chinese children. Height, weight, WC, %BF as determined by BIA, blood pressure, TG, high�]density lipoprotein cholesterol (HDL�]C), and fasting glucose were collected at baseline and 2 years later in 292 boys and 277 girls aged 8�]10 y. The results showed the percentage of children who remained overweight/obese defined on the basis of BMI, WC, WHtR and %BF was 89.7%, 93.5%, 84.5%, and 80.4%, respectively after 2 years. Obesity indices at baseline significantly correlated with TG, HDL�]C, and blood pressure at both baseline and 2 years later with a similar strength of correlations. BMI at baseline explained the greatest variance of later blood pressure. WC at baseline explained the greatest variance of later HDL�]C and glucose, while WHtR at baseline was the main predictor of later TG. Receiver�]operating characteristic (ROC) analysis explored the ability of the four indices to identify the later presence of CV risk. The overweight/obese children defined on the basis of BMI, WC, WHtR or %BF were more likely to develop CV risk 2 years later with relative risk (RR) scores of 3.670, 3.762, 2.767, and 2.804, respectively. The final purpose of the third study was to develop age�] and gender�]specific percentiles of WC and WHtR and cut�]off points of WC and WHtR for the prediction of CV risk in Chinese children. Smoothed percentile curves of WC and WHtR were produced in 2830 boys and 2699 girls aged 6�]12 y randomly selected from southern and northern China using the LMS method. The optimal age�] and gender�]specific thresholds of WC and WHtR for the prediction of cardiovascular risk factors clustering were derived in a sub�]sample (n=1845) by ROC analysis. Age�] and gender�]specific WC and WHtR percentiles were constructed. The WC thresholds were at the 90th and 84th percentiles for Chinese boys and girls, respectively, with sensitivity and specificity ranging from 67.2% to 83.3%. The WHtR thresholds were at the 91st and 94th percentiles for Chinese boys and girls, respectively, with sensitivity and specificity ranging from 78.6% to 88.9%. The cut�]offs of both WC and WHtR were age�] and gender�]dependent. In conclusion, the current thesis quantifies the ethnic differences in the BMI�]%BF relationship and body fat distribution between Asian children from different origins and confirms the necessity to consider ethnic differences in body composition when developing BMI and other obesity index criteria for obesity in Asian children. Moreover, ethnicity is also important in BIA prediction equations. In addition, WC and WHtR percentiles and thresholds for the prediction of CV risk in Chinese children differ from other populations. Although there was no advantage of WC or WHtR over BMI or %BF in the prediction of CV risk, obese children had a higher risk of developing the metabolic syndrome and abnormalities than normal�]weight children regardless of the obesity index used.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

This paper proposes an innovative instance similarity based evaluation metric that reduces the search map for clustering to be performed. An aggregate global score is calculated for each instance using the novel idea of Fibonacci series. The use of Fibonacci numbers is able to separate the instances effectively and, in hence, the intra-cluster similarity is increased and the inter-cluster similarity is decreased during clustering. The proposed FIBCLUS algorithm is able to handle datasets with numerical, categorical and a mix of both types of attributes. Results obtained with FIBCLUS are compared with the results of existing algorithms such as k-means, x-means expected maximization and hierarchical algorithms that are widely used to cluster numeric, categorical and mix data types. Empirical analysis shows that FIBCLUS is able to produce better clustering solutions in terms of entropy, purity and F-score in comparison to the above described existing algorithms.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

We propose to use the Tensor Space Modeling (TSM) to represent and analyze the user’s web log data that consists of multiple interests and spans across multiple dimensions. Further we propose to use the decomposition factors of the Tensors for clustering the users based on similarity of search behaviour. Preliminary results show that the proposed method outperforms the traditional Vector Space Model (VSM) based clustering.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Purpose: Web search engines are frequently used by people to locate information on the Internet. However, not all queries have an informational goal. Instead of information, some people may be looking for specific web sites or may wish to conduct transactions with web services. This paper aims to focus on automatically classifying the different user intents behind web queries. Design/methodology/approach: For the research reported in this paper, 130,000 web search engine queries are categorized as informational, navigational, or transactional using a k-means clustering approach based on a variety of query traits. Findings: The research findings show that more than 75 percent of web queries (clustered into eight classifications) are informational in nature, with about 12 percent each for navigational and transactional. Results also show that web queries fall into eight clusters, six primarily informational, and one each of primarily transactional and navigational. Research limitations/implications: This study provides an important contribution to web search literature because it provides information about the goals of searchers and a method for automatically classifying the intents of the user queries. Automatic classification of user intent can lead to improved web search engines by tailoring results to specific user needs. Practical implications: The paper discusses how web search engines can use automatically classified user queries to provide more targeted and relevant results in web searching by implementing a real time classification method as presented in this research. Originality/value: This research investigates a new application of a method for automatically classifying the intent of user queries. There has been limited research to date on automatically classifying the user intent of web queries, even though the pay-off for web search engines can be quite beneficial. © Emerald Group Publishing Limited.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

In the last few years we have observed a proliferation of approaches for clustering XML docu- ments and schemas based on their structure and content. The presence of such a huge amount of approaches is due to the different applications requiring the XML data to be clustered. These applications need data in the form of similar contents, tags, paths, structures and semantics. In this paper, we first outline the application contexts in which clustering is useful, then we survey approaches so far proposed relying on the abstract representation of data (instances or schema), on the identified similarity measure, and on the clustering algorithm. This presentation leads to draw a taxonomy in which the current approaches can be classified and compared. We aim at introducing an integrated view that is useful when comparing XML data clustering approaches, when developing a new clustering algorithm, and when implementing an XML clustering compo- nent. Finally, the paper moves into the description of future trends and research issues that still need to be faced.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

With the growing number of XML documents on theWeb it becomes essential to effectively organise these XML documents in order to retrieve useful information from them. A possible solution is to apply clustering on the XML documents to discover knowledge that promotes effective data management, information retrieval and query processing. However, many issues arise in discovering knowledge from these types of semi-structured documents due to their heterogeneity and structural irregularity. Most of the existing research on clustering techniques focuses only on one feature of the XML documents, this being either their structure or their content due to scalability and complexity problems. The knowledge gained in the form of clusters based on the structure or the content is not suitable for reallife datasets. It therefore becomes essential to include both the structure and content of XML documents in order to improve the accuracy and meaning of the clustering solution. However, the inclusion of both these kinds of information in the clustering process results in a huge overhead for the underlying clustering algorithm because of the high dimensionality of the data. The overall objective of this thesis is to address these issues by: (1) proposing methods to utilise frequent pattern mining techniques to reduce the dimension; (2) developing models to effectively combine the structure and content of XML documents; and (3) utilising the proposed models in clustering. This research first determines the structural similarity in the form of frequent subtrees and then uses these frequent subtrees to represent the constrained content of the XML documents in order to determine the content similarity. A clustering framework with two types of models, implicit and explicit, is developed. The implicit model uses a Vector Space Model (VSM) to combine the structure and the content information. The explicit model uses a higher order model, namely a 3- order Tensor Space Model (TSM), to explicitly combine the structure and the content information. This thesis also proposes a novel incremental technique to decompose largesized tensor models to utilise the decomposed solution for clustering the XML documents. The proposed framework and its components were extensively evaluated on several real-life datasets exhibiting extreme characteristics to understand the usefulness of the proposed framework in real-life situations. Additionally, this research evaluates the outcome of the clustering process on the collection selection problem in the information retrieval on the Wikipedia dataset. The experimental results demonstrate that the proposed frequent pattern mining and clustering methods outperform the related state-of-the-art approaches. In particular, the proposed framework of utilising frequent structures for constraining the content shows an improvement in accuracy over content-only and structure-only clustering results. The scalability evaluation experiments conducted on large scaled datasets clearly show the strengths of the proposed methods over state-of-the-art methods. In particular, this thesis work contributes to effectively combining the structure and the content of XML documents for clustering, in order to improve the accuracy of the clustering solution. In addition, it also contributes by addressing the research gaps in frequent pattern mining to generate efficient and concise frequent subtrees with various node relationships that could be used in clustering.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Recently, Software as a Service (SaaS) in Cloud computing, has become more and more significant among software users and providers. To offer a SaaS with flexible functions at a low cost, SaaS providers have focused on the decomposition of the SaaS functionalities, or known as composite SaaS. This approach has introduced new challenges in SaaS resource management in data centres. One of the challenges is managing the resources allocated to the composite SaaS. Due to the dynamic environment of a Cloud data centre, resources that have been initially allocated to SaaS components may be overloaded or wasted. As such, reconfiguration for the components’ placement is triggered to maintain the performance of the composite SaaS. However, existing approaches often ignore the communication or dependencies between SaaS components in their implementation. In a composite SaaS, it is important to include these elements, as they will directly affect the performance of the SaaS. This paper will propose a Grouping Genetic Algorithm (GGA) for multiple composite SaaS application component clustering in Cloud computing that will address this gap. To the best of our knowledge, this is the first attempt to handle multiple composite SaaS reconfiguration placement in a dynamic Cloud environment. The experimental results demonstrate the feasibility and the scalability of the GGA.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

In this paper, the goal of identifying disease subgroups based on differences in observed symptom profile is considered. Commonly referred to as phenotype identification, solutions to this task often involve the application of unsupervised clustering techniques. In this paper, we investigate the application of a Dirichlet Process mixture (DPM) model for this task. This model is defined by the placement of the Dirichlet Process (DP) on the unknown components of a mixture model, allowing for the expression of uncertainty about the partitioning of observed data into homogeneous subgroups. To exemplify this approach, an application to phenotype identification in Parkinson’s disease (PD) is considered, with symptom profiles collected using the Unified Parkinson’s Disease Rating Scale (UPDRS). Clustering, Dirichlet Process mixture, Parkinson’s disease, UPDRS.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Divergence from a random baseline is a technique for the evaluation of document clustering. It ensures cluster quality measures are performing work that prevents ineffective clusterings from giving high scores to clusterings that provide no useful result. These concepts are defined and analysed using intrinsic and extrinsic approaches to the evaluation of document cluster quality. This includes the classical clusters to categories approach and a novel approach that uses ad hoc information retrieval. The divergence from a random baseline approach is able to differentiate ineffective clusterings encountered in the INEX XML Mining track. It also appears to perform a normalisation similar to the Normalised Mutual Information (NMI) measure but it can be applied to any measure of cluster quality. When it is applied to the intrinsic measure of distortion as measured by RMSE, subtraction from a random baseline provides a clear optimum that is not apparent otherwise. This approach can be applied to any clustering evaluation. This paper describes its use in the context of document clustering evaluation.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Software as a Service (SaaS) in Cloud is getting more and more significant among software users and providers recently. A SaaS that is delivered as composite application has many benefits including reduced delivery costs, flexible offers of the SaaS functions and decreased subscription cost for users. However, this approach has introduced a new problem in managing the resources allocated to the composite SaaS. The resource allocation that has been done at the initial stage may be overloaded or wasted due to the dynamic environment of a Cloud. A typical data center resource management usually triggers a placement reconfiguration for the SaaS in order to maintain its performance as well as to minimize the resource used. Existing approaches for this problem often ignore the underlying dependencies between SaaS components. In addition, the reconfiguration also has to comply with SaaS constraints in terms of its resource requirements, placement requirement as well as its SLA. To tackle the problem, this paper proposes a penalty-based Grouping Genetic Algorithm for multiple composite SaaS components clustering in Cloud. The main objective is to minimize the resource used by the SaaS by clustering its component without violating any constraint. Experimental results demonstrate the feasibility and the scalability of the proposed algorithm.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

This paper introduces PartSS, a new partition-based fil- tering for tasks performing string comparisons under edit distance constraints. PartSS offers improvements over the state-of-the-art method NGPP with the implementation of a new partitioning scheme and also improves filtering abil- ities by exploiting theoretical results on shifting and scaling ranges, thus accelerating the rate of calculating edit distance between strings. PartSS filtering has been implemented within two major tasks of data integration: similarity join and approximate membership extraction under edit distance constraints. The evaluation on an extensive range of real-world datasets demonstrates major gain in efficiency over NGPP and QGrams approaches.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

This paper proposes the use of Bayesian approaches with the cross likelihood ratio (CLR) as a criterion for speaker clustering within a speaker diarization system, using eigenvoice modeling techniques. The CLR has previously been shown to be an effective decision criterion for speaker clustering using Gaussian mixture models. Recently, eigenvoice modeling has become an increasingly popular technique, due to its ability to adequately represent a speaker based on sparse training data, as well as to provide an improved capture of differences in speaker characteristics. The integration of eigenvoice modeling into the CLR framework to capitalize on the advantage of both techniques has also been shown to be beneficial for the speaker clustering task. Building on that success, this paper proposes the use of Bayesian methods to compute the conditional probabilities in computing the CLR, thus effectively combining the eigenvoice-CLR framework with the advantages of a Bayesian approach to the diarization problem. Results obtained on the 2002 Rich Transcription (RT-02) Evaluation dataset show an improved clustering performance, resulting in a 33.5% relative improvement in the overall Diarization Error Rate (DER) compared to the baseline system.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Capacity probability models of generating units are commonly used in many power system reliability studies, at hierarchical level one (HLI). Analytical modelling of a generating system with many units or generating units with many derated states in a system, can result in an extensive number of states in the capacity model. Limitations on available memory and computational time of present computer facilities can pose difficulties for assessment of such systems in many studies. A cluster procedure using the nearest centroid sorting method was used for IEEE-RTS load model. The application proved to be very effective in producing a highly similar model with substantially fewer states. This paper presents an extended application of the clustering method to include capacity probability representation. A series of sensitivity studies are illustrated using IEEE-RTS generating system and load models. The loss of load and energy expectations (LOLE, LOEE), are used as indicators to evaluate the application