838 resultados para text and data mining


Relevância:

100.00% 100.00%

Publicador:

Resumo:

For the last decade, high-resolution (HR)-MS has been associated with qualitative analyses while triple quadrupole MS has been associated with routine quantitative analyses. However, a shift of this paradigm is taking place: quantitative and qualitative analyses will be increasingly performed by HR-MS, and it will become the common 'language' for most mass spectrometrists. Most analyses will be performed by full-scan acquisitions recording 'all' ions entering the HR-MS with subsequent construction of narrow-width extracted-ion chromatograms. Ions will be available for absolute quantification, profiling and data mining. In parallel to quantification, metabotyping will be the next step in clinical LC-MS analyses because it should help in personalized medicine. This article is aimed to help analytical chemists who perform targeted quantitative acquisitions with triple quadrupole MS make the transition to quantitative and qualitative analyses using HR-MS. Guidelines for the acceptance criteria of mass accuracy and for the determination of mass extraction windows in quantitative analyses are proposed.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

This master's thesis coversthe concepts of knowledge discovery, data mining and technology forecasting methods in telecommunications. It covers the various aspects of knowledge discoveryin data bases and discusses in detail the methods of data mining and technologyforecasting methods that are used in telecommunications. Main concern in the overall process of this thesis is to emphasize the methods that are being used in technology forecasting for telecommunications and data mining. It tries to answer to some extent to the question of do forecasts create a future? It also describes few difficulties that arise in technology forecasting. This thesis was done as part of my master's studies in Lappeenranta University of Technology.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Yritysten syvällinen ymmärrys työntekijöistä vaatii yrityksiltä monipuolista panostusta tiedonhallintaan. Tämän yhdistäminen ennakoivaan analytiikkaan ja tiedonlouhintaan mahdollistaa yrityksille uudenlaisen ulottuvuuden kehittää henkilöstöhallinnon toimintoja niin työntekijöiden kuin yrityksen etujen mukaisesti. Tutkielman tavoitteena oli selvittää tiedonlouhinnan hyödyntämistä henkilöstöhallinnossa. Tutkielma toteutettiin konstruktiivistä menetelmää hyödyntäen. Teoreettinen viitekehys keskittyi ennakoivan analytiikan ja tiedonlouhinnan konseptin ymmärtämiseen. Tutkielman empiriaosuus rakentui kvalitatiiviseen ja kvantitatiiviseen osiin. Kvalitatiivinen osa koostui tutkielman esitutkimuksesta, jossa käsiteltiin ennakoivan analytiikan ja tiedonlouhinnan hyödyntämistä. Kvantitatiivinen osa rakentui tiedonlouhintaprojektiin, joka toteutettiin henkilöstöhallintoon tutkien henkilöstövaihtuvuutta. Esitutkimuksen tuloksena tiedonlouhinnan hyödyntämisen haasteiksi ilmeni muun muassa tiedon omistajuus, osaaminen ja ymmärrys mahdollisuuksista. Tiedonlouhintaprojektin tuloksena voidaan todeta, että tutkimuksessa sovelletuista korrelaatioiden tutkimisista ja logistisesta regressioanalyysistä oli havaittavissa tilastollisia riippuvuuksia vapaaehtoisesti poistuvien työntekijöiden osalta.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Leveraging cloud services, companies and organizations can significantly improve their efficiency, as well as building novel business opportunities. Cloud computing offers various advantages to companies while having some risks for them too. Advantages offered by service providers are mostly about efficiency and reliability while risks of cloud computing are mostly about security problems. Problems with security of the cloud still demand significant attention in order to tackle the potential problems. Security problems in the cloud as security problems in any area of computing, can not be fully tackled. However creating novel and new solutions can be used by service providers to mitigate the potential threats to a large extent. Looking at the security problem from a very high perspective, there are two focus directions. Security problems that threaten service user’s security and privacy are at one side. On the other hand, security problems that threaten service provider’s security and privacy are on the other side. Both kinds of threats should mostly be detected and mitigated by service providers. Looking a bit closer to the problem, mitigating security problems that target providers can protect both service provider and the user. However, the focus of research community mostly is to provide solutions to protect cloud users. A significant research effort has been put in protecting cloud tenants against external attacks. However, attacks that are originated from elastic, on-demand and legitimate cloud resources should still be considered seriously. The cloud-based botnet or botcloud is one of the prevalent cases of cloud resource misuses. Unfortunately, some of the cloud’s essential characteristics enable criminals to form reliable and low cost botclouds in a short time. In this paper, we present a system that helps to detect distributed infected Virtual Machines (VMs) acting as elements of botclouds. Based on a set of botnet related system level symptoms, our system groups VMs. Grouping VMs helps to separate infected VMs from others and narrows down the target group under inspection. Our system takes advantages of Virtual Machine Introspection (VMI) and data mining techniques.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Feature selection plays an important role in knowledge discovery and data mining nowadays. In traditional rough set theory, feature selection using reduct - the minimal discerning set of attributes - is an important area. Nevertheless, the original definition of a reduct is restrictive, so in one of the previous research it was proposed to take into account not only the horizontal reduction of information by feature selection, but also a vertical reduction considering suitable subsets of the original set of objects. Following the work mentioned above, a new approach to generate bireducts using a multi--objective genetic algorithm was proposed. Although the genetic algorithms were used to calculate reduct in some previous works, we did not find any work where genetic algorithms were adopted to calculate bireducts. Compared to the works done before in this area, the proposed method has less randomness in generating bireducts. The genetic algorithm system estimated a quality of each bireduct by values of two objective functions as evolution progresses, so consequently a set of bireducts with optimized values of these objectives was obtained. Different fitness evaluation methods and genetic operators, such as crossover and mutation, were applied and the prediction accuracies were compared. Five datasets were used to test the proposed method and two datasets were used to perform a comparison study. Statistical analysis using the one-way ANOVA test was performed to determine the significant difference between the results. The experiment showed that the proposed method was able to reduce the number of bireducts necessary in order to receive a good prediction accuracy. Also, the influence of different genetic operators and fitness evaluation strategies on the prediction accuracy was analyzed. It was shown that the prediction accuracies of the proposed method are comparable with the best results in machine learning literature, and some of them outperformed it.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Decision trees are very powerful tools for classification in data mining tasks that involves different types of attributes. When coming to handling numeric data sets, usually they are converted first to categorical types and then classified using information gain concepts. Information gain is a very popular and useful concept which tells you, whether any benefit occurs after splitting with a given attribute as far as information content is concerned. But this process is computationally intensive for large data sets. Also popular decision tree algorithms like ID3 cannot handle numeric data sets. This paper proposes statistical variance as an alternative to information gain as well as statistical mean to split attributes in completely numerical data sets. The new algorithm has been proved to be competent with respect to its information gain counterpart C4.5 and competent with many existing decision tree algorithms against the standard UCI benchmarking datasets using the ANOVA test in statistics. The specific advantages of this proposed new algorithm are that it avoids the computational overhead of information gain computation for large data sets with many attributes, as well as it avoids the conversion to categorical data from huge numeric data sets which also is a time consuming task. So as a summary, huge numeric datasets can be directly submitted to this algorithm without any attribute mappings or information gain computations. It also blends the two closely related fields statistics and data mining

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Este texto contribuirá a que la institución de salud se organice y prepare la información necesaria para emprender el largo y tortuoso camino de la determinación de la razón costo/beneficio y de la acreditación. Además, podrá ser muy útil para los estudiantes de los programas de pregrado y posgrado de ingeniería biomédica que se quieran especializar en la gestión de tecnologías del equipamiento biomédico y la ingeniería clínica. También podrá ser usado como guía de referencia por personas que estén directamente vinculadas al sector de la salud en departamentos de mantenimiento, ingeniería clínica o de servicios hospitalarios.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

In a world where massive amounts of data are recorded on a large scale we need data mining technologies to gain knowledge from the data in a reasonable time. The Top Down Induction of Decision Trees (TDIDT) algorithm is a very widely used technology to predict the classification of newly recorded data. However alternative technologies have been derived that often produce better rules but do not scale well on large datasets. Such an alternative to TDIDT is the PrismTCS algorithm. PrismTCS performs particularly well on noisy data but does not scale well on large datasets. In this paper we introduce Prism and investigate its scaling behaviour. We describe how we improved the scalability of the serial version of Prism and investigate its limitations. We then describe our work to overcome these limitations by developing a framework to parallelise algorithms of the Prism family and similar algorithms. We also present the scale up results of a first prototype implementation.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The domain of Knowledge Discovery (KD) and Data Mining (DM) is of growing importance in a time where more and more data is produced and knowledge is one of the most precious assets. Having explored both the existing underlying theory, the results of the ongoing research in academia and the industry practices in the domain of KD and DM, we have found that this is a domain that still lacks some systematization. We also found that this systematization exists to a greater degree in the Software Engineering and Requirements Engineering domains, probably due to being more mature areas. We believe that it is possible to improve and facilitate the participation of enterprise stakeholders in the requirements engineering for KD projects by systematizing requirements engineering process for such projects. This will, in turn, result in more projects that end successfully, that is, with satisfied stakeholders, including in terms of time and budget constraints. With this in mind and based on all information found in the state-of-the art, we propose SysPRE - Systematized Process for Requirements Engineering in KD projects. We begin by proposing an encompassing generic description of the KD process, where the main focus is on the Requirements Engineering activities. This description is then used as a base for the application of the Design and Engineering Methodology for Organizations (DEMO) so that we can specify a formal ontology for this process. The resulting SysPRE ontology can serve as a base that can be used not only to make enterprises become aware of their own KD process and requirements engineering process in the KD projects, but also to improve such processes in reality, namely in terms of success rate.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Pós-graduação em Desenvolvimento Humano e Tecnologias - IBRC

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The ubiquity of time series data across almost all human endeavors has produced a great interest in time series data mining in the last decade. While dozens of classification algorithms have been applied to time series, recent empirical evidence strongly suggests that simple nearest neighbor classification is exceptionally difficult to beat. The choice of distance measure used by the nearest neighbor algorithm is important, and depends on the invariances required by the domain. For example, motion capture data typically requires invariance to warping, and cardiology data requires invariance to the baseline (the mean value). Similarly, recent work suggests that for time series clustering, the choice of clustering algorithm is much less important than the choice of distance measure used.In this work we make a somewhat surprising claim. There is an invariance that the community seems to have missed, complexity invariance. Intuitively, the problem is that in many domains the different classes may have different complexities, and pairs of complex objects, even those which subjectively may seem very similar to the human eye, tend to be further apart under current distance measures than pairs of simple objects. This fact introduces errors in nearest neighbor classification, where some complex objects may be incorrectly assigned to a simpler class. Similarly, for clustering this effect can introduce errors by “suggesting” to the clustering algorithm that subjectively similar, but complex objects belong in a sparser and larger diameter cluster than is truly warranted.We introduce the first complexity-invariant distance measure for time series, and show that it generally produces significant improvements in classification and clustering accuracy. We further show that this improvement does not compromise efficiency, since we can lower bound the measure and use a modification of triangular inequality, thus making use of most existing indexing and data mining algorithms. We evaluate our ideas with the largest and most comprehensive set of time series mining experiments ever attempted in a single work, and show that complexity-invariant distance measures can produce improvements in classification and clustering in the vast majority of cases.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Máster Universitario en Sistemas Inteligentes y Aplicaciones Numéricas en Ingeniería (SIANI)