869 resultados para height partition clustering


Relevância:

20.00% 20.00%

Publicador:

Resumo:

The XML Document Mining track was launched for exploring two main ideas: (1) identifying key problems and new challenges of the emerging field of mining semi-structured documents, and (2) studying and assessing the potential of Machine Learning (ML) techniques for dealing with generic ML tasks in the structured domain, i.e., classification and clustering of semi-structured documents. This track has run for six editions during INEX 2005, 2006, 2007, 2008, 2009 and 2010. The first five editions have been summarized in previous editions and we focus here on the 2010 edition. INEX 2010 included two tasks in the XML Mining track: (1) unsupervised clustering task and (2) semi-supervised classification task where documents are organized in a graph. The clustering task requires the participants to group the documents into clusters without any knowledge of category labels using an unsupervised learning algorithm. On the other hand, the classification task requires the participants to label the documents in the dataset into known categories using a supervised learning algorithm and a training set. This report gives the details of clustering and classification tasks.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The traditional Vector Space Model (VSM) is not able to represent both the structure and the content of XML documents. This paper introduces a novel method of representing XML documents in a Tensor Space Model (TSM) and then utilizing it for clustering. Empirical analysis shows that the proposed method is scalable for large-sized datasets; as well, the factorized matrices produced from the proposed method help to improve the quality of clusters through the enriched document representation of both structure and content information.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

This paper proposes the use of eigenvoice modeling techniques with the Cross Likelihood Ratio (CLR) as a criterion for speaker clustering within a speaker diarization system. The CLR has previously been shown to be a robust decision criterion for speaker clustering using Gaussian Mixture Models. Recently, eigenvoice modeling techniques have become increasingly popular, due to its ability to adequately represent a speaker based on sparse training data, as well as an improved capture of differences in speaker characteristics. This paper hence proposes that it would be beneficial to capitalize on the advantages of eigenvoice modeling in a CLR framework. Results obtained on the 2002 Rich Transcription (RT-02) Evaluation dataset show an improved clustering performance, resulting in a 35.1% relative improvement in the overall Diarization Error Rate (DER) compared to the baseline system.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Dealing with product yield and quality in manufacturing industries is getting more difficult due to the increasing volume and complexity of data and quicker time to market expectations. Data mining offers tools for quick discovery of relationships, patterns and knowledge in large databases. Growing self-organizing map (GSOM) is established as an efficient unsupervised datamining algorithm. In this study some modifications to the original GSOM are proposed for manufacturing yield improvement by clustering. These modifications include introduction of a clustering quality measure to evaluate the performance of the programme in separating good and faulty products and a filtering index to reduce noise from the dataset. Results show that the proposed method is able to effectively differentiate good and faulty products. It will help engineers construct the knowledge base to predict product quality automatically from collected data and provide insights for yield improvement.

Relevância:

20.00% 20.00%

Publicador:

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Obesity is a major public health problem in both developed and developing countries. The body mass index (BMI) is the most common index used to define obesity. The universal application of the same BMI classification across different ethnic groups is being challenged due to the inability of the index to differentiate fat mass (FM) and fat�]free mass (FFM) and the recognized ethnic differences in body composition. A better understanding of the body composition of Asian children from different backgrounds would help to better understand the obesity�]related health risks of people in this region. Moreover, the limitations of the BMI underscore the necessity to use where possible, more accurate measures of body fat assessment in research and clinical settings in addition to BMI, particularly in relation to the monitoring of prevention and treatment efforts. The aim of the first study was to determine the ethnic difference in the relationship between BMI and percent body fat (%BF) in pre�]pubertal Asian children from China, Lebanon, Malaysia, the Philippines, and Thailand. A total of 1039 children aged 8�]10 y were recruited using a non�]random purposive sampling approach aiming to encompass a wide BMI range from the five countries. Percent body fat (%BF) was determined using the deuterium dilution technique to quantify total body water (TBW) and subsequently derive proportions of FM and FFM. The study highlighted the sex and ethnic differences between BMI and %BF in Asian children from different countries. Girls had approximately 4.0% higher %BF compared with boys at a given BMI. Filipino boys tended to have a lower %BF than their Chinese, Lebanese, Malay and Thai counterparts at the same age and BMI level (corrected mean %BF was 25.7�}0.8%, 27.4�}0.4%, 27.1�}0.6%, 27.7�}0.5%, 28.1�}0.5% for Filipino, Chinese, Lebanese, Malay and Thai boys, respectively), although they differed significantly from Thai and Malay boys. Thai girls had approximately 2.0% higher %BF values than Chinese, Lebanese, Filipino and Malay counterparts (however no significant difference was seen among the four ethnic groups) at a given BMI (corrected mean %BF was 31.1�}0.5%, 28.6�}0.4%, 29.2�}0.6%, 29.5�}0.6%, 29.5�}0.5% for Thai, Chinese, Lebanese, Malay and Filipino girls, respectively). However, the ethnic difference in BMI�]%BF relationship varied by BMI. Compared with Caucasians, Asian children had a BMI 3�]6 units lower for a given %BF. More than one third of obese Asian children in the study were not identified using the WHO classification and more than half were not identified using the International Obesity Task Force (IOTF) classification. However, use of the Chinese classification increased the sensitivity by 19.7%, 18.1%, 2.3%, 2.3%, and 11.3% for Chinese, Lebanese, Malay, Filipino and Thai girls, respectively. A further aim of the first study was to determine the ethnic difference in body fat distribution in pre�]pubertal Asian children from China, Lebanon, Malaysia, and Thailand. The skin fold thicknesses, height, weight, waist circumference (WC) and total adiposity (as determined by deuterium dilution technique) of 922 children from the four countries was assessed. Chinese boys and girls had a similar trunk�]to�]extremity skin fold thickness ratio to Thai counterparts and both groups had higher ratios than the Malays and Lebanese at a given total FM. At a given BMI, both Chinese and Thai boys and girls had a higher WC than Malays and Lebanese (corrected mean WC was 68.1�}0.2 cm, 67.8�}0.3 cm, 65.8�}0.4 cm, 64.1�}0.3 cm for Chinese, Thai, Lebanese and Malay boys, respectively; 64.2�}0.2 cm, 65.0�}0.3 cm, 62.9�}0.4 cm, 60.6�}0.3 cm for Chinese, Thai, Lebanese and Malay girls, respectively). Chinese boys and girls had lower trunk fat adjusted subscapular/suprailiac skinfold ratio compared with Lebanese and Malay counterparts. The second study aimed to develop and cross�]validate bioelectrical impedance analysis (BIA) prediction equations of TBW and FFM for Asian pre�]pubertal children from China, Lebanon, Malaysia, the Philippines, and Thailand. Data on height, weight, age, gender, resistance and reactance measured by BIA were collected from 948 Asian children (492 boys and 456 girls) aged 8�]10 y from the five countries. The deuterium dilution technique was used as the criterion method for the estimation of TBW and FFM. The BIA equations were developed from the validation group (630 children randomly selected from the total sample) using stepwise multiple regression analysis and cross�]validated in a separate group (318 children) using the Bland�]Altman approach. Age, gender and ethnicity influenced the relationship between the resistance index (RI = height2/resistance), TBW and FFM. The BIA prediction equation for the estimation of TBW was: TBW (kg) = 0.231�~Height2 (cm)/resistance (ƒ¶) + 0.066�~Height (cm) + 0.188�~Weight (kg) + 0.128�~Age (yr) + 0.500�~Sex (male=1, female=0) . 0.316�~Ethnicity (Thai ethnicity=1, others=0) �] 4.574, and for the estimation of FFM: FFM (kg) = 0.299�~Height2 (cm)/resistance (ƒ¶) + 0.086�~Height (cm) + 0.245�~Weight (kg) + 0.260�~Age (yr) + 0.901�~Sex (male=1, female=0) �] 0.415�~Ethnicity (Thai ethnicity=1, others=0) �] 6.952. The R2 was 88.0% (root mean square error, RSME = 1.3 kg), 88.3% (RSME = 1.7 kg) for TBW and FFM equation, respectively. No significant difference between measured and predicted TBW and between measured and predicted FFM for the whole cross�]validation sample was found (bias = �]0.1�}1.4 kg, pure error = 1.4�}2.0 kg for TBW and bias = �]0.2�}1.9 kg, pure error = 1.8�}2.6 kg for FFM). However, the prediction equation for estimation of TBW/FFM tended to overestimate TBW/FFM at lower levels while underestimate at higher levels of TBW/FFM. Accuracy of the general equation for TBW and FFM compared favorably with both BMI�]specific and ethnic�]specific equations. There were significant differences between predicted TBW and FFM from external BIA equations derived from Caucasian populations and measured values in Asian children. There were three specific aims of the third study. The first was to explore the relationship between obesity and metabolic syndrome and abnormalities in Chinese children. A total of 608 boys and 800 girls aged 6�]12 y were recruited from four cities in China. Three definitions of pediatric metabolic syndrome and abnormalities were used, including the International Diabetes Federation (IDF) and National Cholesterol Education Program (NCEP) definition for adults modified by Cook et al. and de Ferranti et al. The prevalence of metabolic syndrome varied with different definitions, was highest using the de Ferranti definition (5.4%, 24.6% and 42.0%, respectively for normal�]weight, overweight and obese children), followed by the Cook definition (1.5%, 8.1%, and 25.1%, respectively), and the IDF definition (0.5%, 1.8% and 8.3%, respectively). Overweight and obese children had a higher risk of developing the metabolic syndrome compared to normal�]weight children (odds ratio varied with different definitions from 3.958 to 6.866 for overweight children, and 12.640�]26.007 for obese children). Overweight and obesity also increased the risk of developing metabolic abnormalities. Central obesity and high triglycerides (TG) were the most common while hyperglycemia was the least frequent in Chinese children regardless of different definitions. The second purpose was to determine the best obesity index for the prediction of cardiovascular (CV) risk factor clustering across a 2�]y follow�]up among BMI, %BF, WC and waist�]to�]height ratio (WHtR) in Chinese children. Height, weight, WC, %BF as determined by BIA, blood pressure, TG, high�]density lipoprotein cholesterol (HDL�]C), and fasting glucose were collected at baseline and 2 years later in 292 boys and 277 girls aged 8�]10 y. The results showed the percentage of children who remained overweight/obese defined on the basis of BMI, WC, WHtR and %BF was 89.7%, 93.5%, 84.5%, and 80.4%, respectively after 2 years. Obesity indices at baseline significantly correlated with TG, HDL�]C, and blood pressure at both baseline and 2 years later with a similar strength of correlations. BMI at baseline explained the greatest variance of later blood pressure. WC at baseline explained the greatest variance of later HDL�]C and glucose, while WHtR at baseline was the main predictor of later TG. Receiver�]operating characteristic (ROC) analysis explored the ability of the four indices to identify the later presence of CV risk. The overweight/obese children defined on the basis of BMI, WC, WHtR or %BF were more likely to develop CV risk 2 years later with relative risk (RR) scores of 3.670, 3.762, 2.767, and 2.804, respectively. The final purpose of the third study was to develop age�] and gender�]specific percentiles of WC and WHtR and cut�]off points of WC and WHtR for the prediction of CV risk in Chinese children. Smoothed percentile curves of WC and WHtR were produced in 2830 boys and 2699 girls aged 6�]12 y randomly selected from southern and northern China using the LMS method. The optimal age�] and gender�]specific thresholds of WC and WHtR for the prediction of cardiovascular risk factors clustering were derived in a sub�]sample (n=1845) by ROC analysis. Age�] and gender�]specific WC and WHtR percentiles were constructed. The WC thresholds were at the 90th and 84th percentiles for Chinese boys and girls, respectively, with sensitivity and specificity ranging from 67.2% to 83.3%. The WHtR thresholds were at the 91st and 94th percentiles for Chinese boys and girls, respectively, with sensitivity and specificity ranging from 78.6% to 88.9%. The cut�]offs of both WC and WHtR were age�] and gender�]dependent. In conclusion, the current thesis quantifies the ethnic differences in the BMI�]%BF relationship and body fat distribution between Asian children from different origins and confirms the necessity to consider ethnic differences in body composition when developing BMI and other obesity index criteria for obesity in Asian children. Moreover, ethnicity is also important in BIA prediction equations. In addition, WC and WHtR percentiles and thresholds for the prediction of CV risk in Chinese children differ from other populations. Although there was no advantage of WC or WHtR over BMI or %BF in the prediction of CV risk, obese children had a higher risk of developing the metabolic syndrome and abnormalities than normal�]weight children regardless of the obesity index used.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

This paper proposes an innovative instance similarity based evaluation metric that reduces the search map for clustering to be performed. An aggregate global score is calculated for each instance using the novel idea of Fibonacci series. The use of Fibonacci numbers is able to separate the instances effectively and, in hence, the intra-cluster similarity is increased and the inter-cluster similarity is decreased during clustering. The proposed FIBCLUS algorithm is able to handle datasets with numerical, categorical and a mix of both types of attributes. Results obtained with FIBCLUS are compared with the results of existing algorithms such as k-means, x-means expected maximization and hierarchical algorithms that are widely used to cluster numeric, categorical and mix data types. Empirical analysis shows that FIBCLUS is able to produce better clustering solutions in terms of entropy, purity and F-score in comparison to the above described existing algorithms.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

We propose to use the Tensor Space Modeling (TSM) to represent and analyze the user’s web log data that consists of multiple interests and spans across multiple dimensions. Further we propose to use the decomposition factors of the Tensors for clustering the users based on similarity of search behaviour. Preliminary results show that the proposed method outperforms the traditional Vector Space Model (VSM) based clustering.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Purpose: Web search engines are frequently used by people to locate information on the Internet. However, not all queries have an informational goal. Instead of information, some people may be looking for specific web sites or may wish to conduct transactions with web services. This paper aims to focus on automatically classifying the different user intents behind web queries. Design/methodology/approach: For the research reported in this paper, 130,000 web search engine queries are categorized as informational, navigational, or transactional using a k-means clustering approach based on a variety of query traits. Findings: The research findings show that more than 75 percent of web queries (clustered into eight classifications) are informational in nature, with about 12 percent each for navigational and transactional. Results also show that web queries fall into eight clusters, six primarily informational, and one each of primarily transactional and navigational. Research limitations/implications: This study provides an important contribution to web search literature because it provides information about the goals of searchers and a method for automatically classifying the intents of the user queries. Automatic classification of user intent can lead to improved web search engines by tailoring results to specific user needs. Practical implications: The paper discusses how web search engines can use automatically classified user queries to provide more targeted and relevant results in web searching by implementing a real time classification method as presented in this research. Originality/value: This research investigates a new application of a method for automatically classifying the intent of user queries. There has been limited research to date on automatically classifying the user intent of web queries, even though the pay-off for web search engines can be quite beneficial. © Emerald Group Publishing Limited.