962 resultados para Clustering a large document collection


Relevância:

40.00% 40.00%

Publicador:

Resumo:

Web document cluster analysis plays an important role in information retrieval by organizing large amounts of documents into a small number of meaningful clusters. Traditional web document clustering is based on the Vector Space Model (VSM), which takes into account only two-level (document and term) knowledge granularity but ignores the bridging paragraph granularity. However, this two-level granularity may lead to unsatisfactory clustering results with “false correlation”. In order to deal with the problem, a Hierarchical Representation Model with Multi-granularity (HRMM), which consists of five-layer representation of data and a twophase clustering process is proposed based on granular computing and article structure theory. To deal with the zero-valued similarity problemresulted from the sparse term-paragraphmatrix, an ontology based strategy and a tolerance-rough-set based strategy are introduced into HRMM. By using granular computing, structural knowledge hidden in documents can be more efficiently and effectively captured in HRMM and thus web document clusters with higher quality can be generated. Extensive experiments show that HRMM, HRMM with tolerancerough-set strategy, and HRMM with ontology all outperform VSM and a representative non VSM-based algorithm, WFP, significantly in terms of the F-Score.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

2000 Mathematics Subject Classification: 62H30

Relevância:

40.00% 40.00%

Publicador:

Resumo:

With the explosive growth of the volume and complexity of document data (e.g., news, blogs, web pages), it has become a necessity to semantically understand documents and deliver meaningful information to users. Areas dealing with these problems are crossing data mining, information retrieval, and machine learning. For example, document clustering and summarization are two fundamental techniques for understanding document data and have attracted much attention in recent years. Given a collection of documents, document clustering aims to partition them into different groups to provide efficient document browsing and navigation mechanisms. One unrevealed area in document clustering is that how to generate meaningful interpretation for the each document cluster resulted from the clustering process. Document summarization is another effective technique for document understanding, which generates a summary by selecting sentences that deliver the major or topic-relevant information in the original documents. How to improve the automatic summarization performance and apply it to newly emerging problems are two valuable research directions. To assist people to capture the semantics of documents effectively and efficiently, the dissertation focuses on developing effective data mining and machine learning algorithms and systems for (1) integrating document clustering and summarization to obtain meaningful document clusters with summarized interpretation, (2) improving document summarization performance and building document understanding systems to solve real-world applications, and (3) summarizing the differences and evolution of multiple document sources.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Charles Edward Perry (Chuck), 1937-1999, was the founding president of Florida International University in Miami, Florida. He grew up in Logan County, West Virginia and received his bachelor's and masters's degrees from Bowling Green State University. He married Betty Laird in 1960. In 1969, at the age of 32, Perry was the youngest president of any university in the nation. The name of the university reflects Perry’s desire for a title that would not limit the scope of the institution and would support his vision of having close ties to Latin America. Perry and a founding corps opened FIU to 5,667 students in 1972 with only one large building housing six different schools. Perry left the office of President of FIU in 1976 when the student body had grown to 10,000 students and the university had six buildings, offered 134 different degrees and was fully accredited. Charles Perry died on August 30, 1999 at his home in Rockwall, Texas. He is buried on the FIU campus in front of the Graham Center entrance.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Charles Edward Perry (Chuck), 1937-1999, was the founding president of Florida International University in Miami, Florida. He grew up in Logan County, West Virginia and graduated from Bowling Green State University. He married Betty Laird in 1960. In 1969, at the age of 32, Perry was the youngest president of any university in the nation. The name of the university reflects Perry’s desire for a title that would not limit the scope of the institution and would support his vision of having close ties to Latin America. Perry and a founding corps opened FIU to 5,667 students in 1972 with only one large building housing six different schools. Perry left the office of President of FIU in 1976 when the student body had grown to 10,000 students and the university had six buildings, offered 134 different degrees and was fully accredited. Charles Perry died on August 30, 1999 at his home in Rockwall, Texas. He is buried on the FIU campus in front of the Graham Center entrance.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Online Social Network (OSN) services provided by Internet companies bring people together to chat, share the information, and enjoy the information. Meanwhile, huge amounts of data are generated by those services (they can be regarded as the social media ) every day, every hour, even every minute, and every second. Currently, researchers are interested in analyzing the OSN data, extracting interesting patterns from it, and applying those patterns to real-world applications. However, due to the large-scale property of the OSN data, it is difficult to effectively analyze it. This dissertation focuses on applying data mining and information retrieval techniques to mine two key components in the social media data — users and user-generated contents. Specifically, it aims at addressing three problems related to the social media users and contents: (1) how does one organize the users and the contents? (2) how does one summarize the textual contents so that users do not have to go over every post to capture the general idea? (3) how does one identify the influential users in the social media to benefit other applications, e.g., Marketing Campaign? The contribution of this dissertation is briefly summarized as follows. (1) It provides a comprehensive and versatile data mining framework to analyze the users and user-generated contents from the social media. (2) It designs a hierarchical co-clustering algorithm to organize the users and contents. (3) It proposes multi-document summarization methods to extract core information from the social network contents. (4) It introduces three important dimensions of social influence, and a dynamic influence model for identifying influential users.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Nigerian scam, also known as advance fee fraud or 419 scam, is a prevalent form of online fraudulent activity that causes financial loss to individuals and businesses. Nigerian scam has evolved from simple non-targeted email messages to more sophisticated scams targeted at users of classifieds, dating and other websites. Even though such scams are observed and reported by users frequently, the community’s understanding of Nigerian scams is limited since the scammers operate “underground”. To better understand the underground Nigerian scam ecosystem and seek effective methods to deter Nigerian scam and cybercrime in general, we conduct a series of active and passive measurement studies. Relying upon the analysis and insight gained from the measurement studies, we make four contributions: (1) we analyze the taxonomy of Nigerian scam and derive long-term trends in scams; (2) we provide an insight on Nigerian scam and cybercrime ecosystems and their underground operation; (3) we propose a payment intervention as a potential deterrent to cybercrime operation in general and evaluate its effectiveness; and (4) we offer active and passive measurement tools and techniques that enable in-depth analysis of cybercrime ecosystems and deterrence on them. We first created and analyze a repository of more than two hundred thousand user-reported scam emails, stretching from 2006 to 2014, from four major scam reporting websites. We select ten most commonly observed scam categories and tag 2,000 scam emails randomly selected from our repository. Based upon the manually tagged dataset, we train a machine learning classifier and cluster all scam emails in the repository. From the clustering result, we find a strong and sustained upward trend for targeted scams and downward trend for non-targeted scams. We then focus on two types of targeted scams: sales scams and rental scams targeted users on Craigslist. We built an automated scam data collection system and gathered large-scale sales scam emails. Using the system we posted honeypot ads on Craigslist and conversed automatically with the scammers. Through the email conversation, the system obtained additional confirmation of likely scam activities and collected additional information such as IP addresses and shipping addresses. Our analysis revealed that around 10 groups were responsible for nearly half of the over 13,000 total scam attempts we received. These groups used IP addresses and shipping addresses in both Nigeria and the U.S. We also crawled rental ads on Craigslist, identified rental scam ads amongst the large number of benign ads and conversed with the potential scammers. Through in-depth analysis of the rental scams, we found seven major scam campaigns employing various operations and monetization methods. We also found that unlike sales scammers, most rental scammers were in the U.S. The large-scale scam data and in-depth analysis provide useful insights on how to design effective deterrence techniques against cybercrime in general. We study underground DDoS-for-hire services, also known as booters, and measure the effectiveness of undermining a payment system of DDoS Services. Our analysis shows that the payment intervention can have the desired effect of limiting cybercriminals’ ability and increasing the risk of accepting payments.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Garlic is a spice and a medicinal plant; hence, there is an increasing interest in 'developing' new varieties with different culinary properties or with high content of nutraceutical compounds. Phenotypic traits and dominant molecular markers are predominantly used to evaluate the genetic diversity of garlic clones. However, 24 SSR markers (codominant) specific for garlic are available in the literature, fostering germplasm researches. In this study, we genotyped 130 garlic accessions from Brazil and abroad using 17 polymorphic SSR markers to assess the genetic diversity and structure. This is the first attempt to evaluate a large set of accessions maintained by Brazilian institutions. A high level of redundancy was detected in the collection (50 % of the accessions represented eight haplotypes). However, non-redundant accessions presented high genetic diversity. We detected on average five alleles per locus, Shannon index of 1.2, HO of 0.5, and HE of 0.6. A core collection was set with 17 accessions, covering 100 % of the alleles with minimum redundancy. Overall FST and D values indicate a strong genetic structure within accessions. Two major groups identified by both model-based (Bayesian approach) and hierarchical clustering (UPGMA dendrogram) techniques were coherent with the classification of accessions according to maturity time (growth cycle): early-late and midseason accessions. Assessing genetic diversity and structure of garlic collections is the first step towards an efficient management and conservation of accessions in genebanks, as well as to advance future genetic studies and improvement of garlic worldwide.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Universidade Estadual de Campinas . Faculdade de Educação Física

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Universidade Estadual de Campinas . Faculdade de Educação Física

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Rectangular dropshafts, commonly used in sewers and storm water systems, are characterised by significant flow aeration. New detailed air-water flow measurements were conducted in a near-full-scale dropshaft at large discharges. In the shaft pool and outflow channel, the results demonstrated the complexity of different competitive air entrainment mechanisms. Bubble size measurements showed a broad range of entrained bubble sizes. Analysis of streamwise distributions of bubbles suggested further some clustering process in the bubbly flow although, in the outflow channel, bubble chords were in average smaller than in the shaft pool. A robust hydrophone was tested to measure bubble acoustic spectra and to assess its field application potential. The acoustic results characterised accurately the order of magnitude of entrained bubble sizes, but the transformation from acoustic frequencies to bubble radii did not predict correctly the probability distribution functions of bubble sizes.