8 resultados para document clustering
em Helda - Digital Repository of University of Helsinki
Resumo:
The Minimum Description Length (MDL) principle is a general, well-founded theoretical formalization of statistical modeling. The most important notion of MDL is the stochastic complexity, which can be interpreted as the shortest description length of a given sample of data relative to a model class. The exact definition of the stochastic complexity has gone through several evolutionary steps. The latest instantation is based on the so-called Normalized Maximum Likelihood (NML) distribution which has been shown to possess several important theoretical properties. However, the applications of this modern version of the MDL have been quite rare because of computational complexity problems, i.e., for discrete data, the definition of NML involves an exponential sum, and in the case of continuous data, a multi-dimensional integral usually infeasible to evaluate or even approximate accurately. In this doctoral dissertation, we present mathematical techniques for computing NML efficiently for some model families involving discrete data. We also show how these techniques can be used to apply MDL in two practical applications: histogram density estimation and clustering of multi-dimensional data.
Resumo:
Online content services can greatly benefit from personalisation features that enable delivery of content that is suited to each user's specific interests. This thesis presents a system that applies text analysis and user modeling techniques in an online news service for the purpose of personalisation and user interest analysis. The system creates a detailed thematic profile for each content item and observes user's actions towards content items to learn user's preferences. A handcrafted taxonomy of concepts, or ontology, is used in profile formation to extract relevant concepts from the text. User preference learning is automatic and there is no need for explicit preference settings or ratings from the user. Learned user profiles are segmented into interest groups using clustering techniques with the objective of providing a source of information for the service provider. Some theoretical background for chosen techniques is presented while the main focus is in finding practical solutions to some of the current information needs, which are not optimally served with traditional techniques.
Resumo:
Electronic document management (EDM) technology has the potential to enhance the information management in construction projects considerably, without radical changes to current practice. Over the past fifteen years this topic has been overshadowed by building product modelling in the construction IT research world, but at present EDM is quickly being introduced in practice, in particular in bigger projects. Often this is done in the form of third party services available over the World Wide Web. In the paper, a typology of research questions and methods is presented, which can be used to position the individual research efforts which are surveyed in the paper. Questions dealt with include: What features should EMD systems have? How much are they used? Are there benefits from use and how should these be measured? What are the barriers to wide-spread adoption? Which technical questions need to be solved? Is there scope for standardisation? How will the market for such systems evolve?
Resumo:
Triggered by the very quick proliferation of Internet connectivity, electronic document management (EDM) systems are now rapidly being adopted for managing the documentation that is produced and exchanged in construction projects. Nevertheless there are still substantial barriers to the efficient use of such systems, mainly of a psychological nature and related to insufficient training. This paper presents the results of empirical studies carried out during 2002 concerning the current usage of EDM systems in the Finnish construction industry. The studies employed three different methods in order to provide a multifaceted view of the problem area, both on the industry and individual project level. In order to provide an accurate measurement of overall usage volume in the industry as a whole telephone interviews with key personnel from 100 randomly chosen construction projects were conducted. The interviews showed that while around 1/3 of big projects already have adopted the use of EDM, very few small projects have adopted this technology. The barriers to introduction were investigated through interviews with representatives for half a dozen of providers of systems and ASP-services. These interviews shed a lot of light on the dynamics of the market for this type of services and illustrated the diversity of business strategies adopted by vendors. In the final study log files from a project which had used an EDM system were analysed in order to determine usage patterns. The results illustrated that use is yet incomplete in coverage and that only a part of the individuals involved in the project used the system efficiently, either as information producers or consumers. The study also provided feedback on the usefulness of the log files.
Resumo:
This paper investigates the clustering pattern in the Finnish stock market. Using trading volume and time as factors capturing the clustering pattern in the market, the Keim and Madhavan (1996) and the Engle and Russell (1998) model provide the framework for the analysis. The descriptive and the parametric analysis provide evidences that an important determinant of the famous U-shape pattern in the market is the rate of information arrivals as measured by large trading volumes and durations at the market open and close. Precisely, 1) the larger the trading volume, the greater the impact on prices both in the short and the long run, thus prices will differ across quantities. 2) Large trading volume is a non-linear function of price changes in the long run. 3) Arrival times are positively autocorrelated, indicating a clustering pattern and 4) Information arrivals as approximated by durations are negatively related to trading flow.