775 results for mining data streams
Abstract:
Existing business process drift detection methods do not work with event streams. As such, they are designed to detect inter-trace drifts only, i.e. drifts that occur between complete process executions (traces), as recorded in event logs. However, process drift may also occur during the execution of a process, and may impact ongoing executions. Existing methods either do not detect such intra-trace drifts, or detect them with a long delay. Moreover, they do not perform well with unpredictable processes, i.e. processes whose logs exhibit a high ratio of distinct executions to the total number of executions. We address these two issues by proposing a fully automated and scalable method for online detection of process drift from event streams. We perform statistical tests over distributions of behavioral relations between events, as observed in two adjacent windows of adaptive size, sliding along with the stream. An extensive evaluation on synthetic and real-life logs shows that our method is fast and accurate in the detection of typical change patterns, and performs significantly better than the state of the art.
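The idea of testing relation distributions in two adjacent windows can be sketched as follows. This is only an illustrative sketch, not the method of the abstract: it assumes a fixed window size (the abstract uses adaptive sizes), it uses the directly-follows relation as the behavioral relation, and it uses a Pearson chi-square statistic as the test. The names `directly_follows`, `chi_square` and `drift_scores` are hypothetical.

```python
from collections import Counter
from typing import Iterator, Sequence, Tuple

Event = Tuple[int, str]  # (case id, activity label); an assumed event format


def directly_follows(events: Sequence[Event]) -> Counter:
    """Count (a, b) pairs where b directly follows a within the same case."""
    last_by_case = {}
    counts = Counter()
    for case_id, activity in events:
        if case_id in last_by_case:
            counts[(last_by_case[case_id], activity)] += 1
        last_by_case[case_id] = activity
    return counts


def chi_square(ref: Counter, det: Counter) -> Tuple[float, int]:
    """Pearson chi-square statistic and degrees of freedom for a 2 x K table."""
    relations = sorted(set(ref) | set(det))
    n_ref, n_det = sum(ref.values()), sum(det.values())
    if n_ref == 0 or n_det == 0 or len(relations) < 2:
        return 0.0, 0
    total = n_ref + n_det
    stat = 0.0
    for rel in relations:
        column = ref[rel] + det[rel]
        for observed, row in ((ref[rel], n_ref), (det[rel], n_det)):
            expected = row * column / total
            stat += (observed - expected) ** 2 / expected
    return stat, len(relations) - 1


def drift_scores(stream: Sequence[Event], window: int) -> Iterator[Tuple[int, float, int]]:
    """Yield (position, statistic, dof) for each pair of adjacent windows."""
    for i in range(2 * window, len(stream) + 1):
        reference = directly_follows(stream[i - 2 * window:i - window])
        detection = directly_follows(stream[i - window:i])
        stat, dof = chi_square(reference, detection)
        yield i, stat, dof
```

A caller would compare each statistic with the chi-square critical value for the reported degrees of freedom and flag a drift when it is exceeded. Relations that span a window boundary are ignored in this sketch.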
Abstract:
Gene mapping is a systematic search for genes that affect observable characteristics of an organism. In this thesis we offer computational tools to improve the efficiency of (disease) gene-mapping efforts. In the first part of the thesis we propose an efficient simulation procedure for generating realistic genetical data from isolated populations. Simulated data is useful for evaluating hypothesised gene-mapping study designs and computational analysis tools. As an example of such evaluation, we demonstrate how a population-based study design can be a powerful alternative to traditional family-based designs in association-based gene-mapping projects. In the second part of the thesis we consider a prioritisation of a (typically large) set of putative disease-associated genes acquired from an initial gene-mapping analysis. Prioritisation is necessary to be able to focus on the most promising candidates. We show how to harness the current biomedical knowledge for the prioritisation task by integrating various publicly available biological databases into a weighted biological graph. We then demonstrate how to find and evaluate connections between entities, such as genes and diseases, from this unified schema by graph mining techniques. Finally, in the last part of the thesis, we define the concept of reliable subgraph and the corresponding subgraph extraction problem. Reliable subgraphs concisely describe strong and independent connections between two given vertices in a random graph, and hence they are especially useful for visualising such connections. We propose novel algorithms for extracting reliable subgraphs from large random graphs. The efficiency and scalability of the proposed graph mining methods are backed by extensive experiments on real data. While our application focus is in genetics, the concepts and algorithms can be applied to other domains as well. 
We demonstrate this generality by considering coauthor graphs in addition to biological graphs in the experiments.
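To make the notion of connection strength in a random graph concrete, the sketch below computes the probability that two vertices are connected when each edge exists independently with a given probability. It is an assumption-laden illustration, not one of the thesis algorithms: it enumerates every edge subset exactly, so it is only usable on very small graphs, and the function names and the example vertices are hypothetical.

```python
from itertools import product
from typing import Dict, Iterable, Tuple

Edge = Tuple[str, str]


def connected(edges: Iterable[Edge], source: str, target: str) -> bool:
    """Depth-first search over an undirected edge list."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    seen, stack = {source}, [source]
    while stack:
        node = stack.pop()
        if node == target:
            return True
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return False


def two_terminal_reliability(edge_probs: Dict[Edge, float], source: str, target: str) -> float:
    """Sum the probability of every edge realisation that links source and target."""
    edges = list(edge_probs)
    reliability = 0.0
    for present in product((False, True), repeat=len(edges)):
        weight = 1.0
        for edge, keep in zip(edges, present):
            p = edge_probs[edge]
            weight *= p if keep else 1.0 - p
        realised = [edge for edge, keep in zip(edges, present) if keep]
        if connected(realised, source, target):
            reliability += weight
    return reliability


# Hypothetical example: a gene linked to a disease through two independent paths.
example = {
    ("gene", "A"): 0.5, ("A", "disease"): 0.5,
    ("gene", "B"): 0.5, ("B", "disease"): 0.5,
}
```

In this toy graph each path works with probability 0.25, and the two independent paths together give a higher connection probability than either alone, which is the intuition behind preferring subgraphs with strong and independent connections.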
Abstract:
The high temperature region of the MnO-Al2O3 phase diagram has been redetermined to resolve some discrepancies reported in the literature regarding the melting behaviour of MnAl2O4. This spinel was found to melt congruently at 2108 (± 15) K. The activity of MnO in MnO-Al2O3 melts and in the two-phase regions, melt + MnAl2O4 and MnAl2O4 + Al2O3, has been determined by measuring the manganese concentration in platinum foils in equilibrium under controlled oxygen potentials. The activity of MnO obtained in this study for MnO-Al2O3 melts is in fair agreement with the results of Sharma and Richardson. However, the alumina-rich melt is found to be in equilibrium with MnAl2O4 rather than Al2O3, as suggested by Sharma and Richardson. The value for the activity of MnO in the MnAl2O4 + Al2O3 two-phase region permits a rigorous application of the Gibbs-Duhem equation for calculating the activity of Al2O3 and the integral Gibbs' energy of mixing of MnO-Al2O3 melts, which are significantly different from those reported in the literature.
Abstract:
Frequent episode discovery is a popular framework for temporal pattern discovery in event streams. An episode is a partially ordered set of nodes with each node associated with an event type. Currently, algorithms exist for episode discovery only when the associated partial order is a total order (serial episode) or trivial (parallel episode). In this paper, we propose efficient algorithms for discovering frequent episodes with unrestricted partial orders when the associated event types are unique. These algorithms can be easily specialized to discover only serial or parallel episodes. Also, the algorithms are flexible enough to be specialized for mining in the space of certain interesting subclasses of partial orders. We point out that frequency alone is not a sufficient measure of interestingness in the context of partial order mining. We propose a new interestingness measure for episodes with unrestricted partial orders which, when used along with frequency, results in an efficient scheme of data mining. Simulations are presented to demonstrate the effectiveness of our algorithms.
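As a rough illustration of how serial, parallel and general partial-order episodes relate, the sketch below greedily counts non-overlapped occurrences of an episode whose event types are unique. It is not the algorithm proposed in the paper: the frequency definition (non-overlapped occurrences), the greedy scan, and the `predecessors` encoding are all assumptions made for this example, and the count is not claimed to be maximal.

```python
from typing import Dict, Iterable, Set


def count_occurrences(events: Iterable[str], predecessors: Dict[str, Set[str]]) -> int:
    """Greedily count non-overlapped occurrences of a partial-order episode.

    predecessors[t] holds the event types that must occur before type t;
    every event type of the episode appears as a key.
    """
    seen: Set[str] = set()
    count = 0
    for event in events:
        if event in predecessors and event not in seen and predecessors[event] <= seen:
            seen.add(event)
            if len(seen) == len(predecessors):
                # The episode is complete; start looking for the next occurrence.
                count += 1
                seen = set()
    return count


# A total order gives a serial episode, no constraints give a parallel episode.
serial = {"A": set(), "B": {"A"}, "C": {"B"}}
parallel = {"A": set(), "B": set(), "C": set()}
partial = {"A": set(), "B": {"A"}, "C": {"A"}}  # A first, then B and C in any order
```

The same function covers all three cases because only the precedence sets change, which mirrors the remark in the abstract that an algorithm for unrestricted partial orders can be specialized to serial or parallel episodes.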
Abstract:
Over the past decade, many powerful data mining techniques have been developed to analyze temporal and sequential data. The time is now fertile for addressing problems of larger scope under the purview of temporal data mining. The fourth SIGKDD workshop on temporal data mining focused on the question: What can we infer about the structure of a complex dynamical system from observed temporal data? The goals of the workshop were to critically evaluate the need in this area by bringing together leading researchers from industry and academia, and to identify promising technologies and methodologies for doing the same. We provide a brief summary of the workshop proceedings and ideas arising out of the discussions.
Abstract:
Song-selection and mood are interdependent. If we capture a song’s sentiment, we can determine the mood of the listener, which can serve as a basis for recommendation systems. Songs are generally classified according to genres, which don’t entirely reflect sentiments. Thus, we require an unsupervised scheme to mine them. Sentiments are classified into either two (positive/negative) or multiple (happy/angry/sad/...) classes, depending on the application. We are interested in analyzing the feelings invoked by a song, involving multi-class sentiments. To mine the hidden sentimental structure behind a song, in terms of “topics”, we consider its lyrics and use Latent Dirichlet Allocation (LDA). Each song is a mixture of moods. Topics mined by LDA can represent moods. Thus we get a scheme of collecting similar-mood songs. For validation, we use a dataset of songs containing 6 moods annotated by users of a particular website.
Abstract:
The classification of time series data is an interesting problem in the field of data mining. Although several algorithms have been proposed for time series classification, we have developed an innovative algorithm which is computationally fast and, in several cases, accurate when compared with the 1NN classifier. Our method calculates the fuzzy membership of each test pattern in each class. We have experimented with 6 benchmark datasets and compared our method with the 1NN classifier.
Abstract:
Today's programming languages are supported by powerful third-party APIs. For a given application domain, it is common to have many competing APIs that provide similar functionality. Programmer productivity therefore depends heavily on the programmer's ability to discover suitable APIs both during an initial coding phase, as well as during software maintenance. The aim of this work is to support the discovery and migration of math APIs. Math APIs are at the heart of many application domains ranging from machine learning to scientific computations. Our approach, called MATHFINDER, combines executable specifications of mathematical computations with unit tests (operational specifications) of API methods. Given a math expression, MATHFINDER synthesizes pseudo-code comprised of API methods to compute the expression by mining unit tests of the API methods. We present a sequential version of our unit test mining algorithm and also design a more scalable data-parallel version. We perform extensive evaluation of MATHFINDER (1) for API discovery, where math algorithms are to be implemented from scratch and (2) for API migration, where client programs utilizing a math API are to be migrated to another API. We evaluated the precision and recall of MATHFINDER on a diverse collection of math expressions, culled from algorithms used in a wide range of application areas such as control systems and structural dynamics. In a user study to evaluate the productivity gains obtained by using MATHFINDER for API discovery, the programmers who used MATHFINDER finished their programming tasks twice as fast as their counterparts who used the usual techniques like web and code search, IDE code completion, and manual inspection of library documentation. For the problem of API migration, as a case study, we used MATHFINDER to migrate Weka, a popular machine learning library. 
Overall, our evaluation shows that MATHFINDER is easy to use, provides highly precise results across several math APIs and application domains even with a small number of unit tests per method, and scales to large collections of unit tests.
Abstract:
Progress report from the Mining Biodiversity project, Digging into Data Challenge round 3.
Abstract:
In contrast to cost modeling activities, the pricing of services must be simple and transparent. Calculating and thus knowing price structures would not only help identify the level of detail required for cost modeling of individual institutions, but also help develop a "public" market for services as well as clarify the division of tasks and the modeling of funding and revenue streams for data preservation of public institutions. This workshop has built on the results from the workshop "The Costs and Benefits of Keeping Knowledge", which took place on 11 June 2012 in Copenhagen. This expert workshop aimed at:
• Identifying ways for data repositories to abstract from their complicated cost structures and arrive at one transparent pricing structure which can be aligned with available and plausible funding schemes. Those repositories will probably need a stable institutional funding stream for data management and preservation. Are there any estimates for this, absolute or as a percentage of overall cost? Part of the revenue will probably have to come through data management fees upon ingest. How could that be priced? Per dataset, per GB or as a percentage of research cost? Will it be necessary to charge access prices, as they contradict the open science paradigm?
• What are the price components for pricing individual services, and which prices are currently being paid, e.g. to commercial providers? What are the description and conditions of the service(s) delivered and guaranteed?
• What types of risks are inherent in these pricing schemes?
• How can services and prices be defined in an all-inclusive and simple manner, so as to enable researchers to apply for a specific amount when asking for funding of data-intensive projects? Please
Abstract:
Few detailed studies have been made on the ecology of the chalk streams. A complex community of plants and animals is present and much more information is required to achieve an understanding of the requirements and interactions of all the species. It is important that the rivers affected by this scheme should be studied and kept under continued observation so that any effects produced by the scheme can be detected. The report gives a synopsis of work carried out between 1971 and 1979 focusing on the present phase 1978-1979. It assumes some familiarity with the investigations carried out on the River Lambourn during the preceding years. The aims of the present phase of the project may be divided into two broad aspects. The first involves collecting further information in the field and includes three objectives: a continuation of studies on the Lambourn sites at Bagnor; comparative studies on other chalk streams; and a comparative study on a limestone stream. The second involves detailed analyses of data previously collected to document the recovery of the Lambourn from operational pumping and to attempt to develop simple conceptual and predictive models applicable over a wide range of physical and geographical variables. (PDF contains 43 pages)
Abstract:
188 p.
Abstract:
Eight streams from the North West of England were stocked with Atlantic salmon (Salmo salar L.) fed fry at densities ranging from 1 to 4/m² over a period of up to three years to evaluate survival to the end of the first and second growing periods and hence assess the value of stocking as a management practice. Survival to the end of the first growing period (mean duration of 108 days) was found to vary between 7.8 and 41.3% with a mean of 22% and CV of 0.44. Survival from the end of the first growing period to the end of the second growing period (mean duration of 384 days) ranged from 19.9 to 34.1% with a mean of 26.3% and CV of 0.21. Survival was found to be positively related to 0+ trout density (P < 0.05) and negatively related to altitude (P < 0.05). A comparison of the raw survival data (non-standardised with respect to duration of experiments) with that from other studies in relation to stocking densities revealed a negative relationship between fry survival and stocking density (P < 0.05). Densities in excess of 5/m² tended to result in lower levels of survival. Post-stocking fry dispersal patterns were examined for the 1991 data. On average 86.7% of the number of fry surviving remained within the stocked zone by the end of the first growing period. With the exception of one stream there was little in the way of dispersal beyond the stocked zone. The dispersal pattern approximated to the normal distribution (P < 0.05). It was estimated that stocking can result in a net gain of fish to a river system compared with natural productivity; however, the numerical significance of this gain and its cost effectiveness need to be determined on a river-specific basis.
Abstract:
Depth data from archival tags on northern rock sole (Lepidopsetta polyxystra) were examined to assess whether fish used tidal currents to aid horizontal migration. Two northern rock sole, out of 115 released with archival tags in the eastern Bering Sea, were recovered 314 and 667 days after release. Both fish made periodic excursions away from the bottom during mostly night-time hours, but also during particular phases of the tide cycle. One fish that was captured and released in an area of rotary currents made vertical excursions that were correlated with tidal current direction. To test the hypothesis that the fish made vertical excursions to use tidal currents to aid migration, a hypothetical migratory path was calculated using a tide model to predict the current direction and speed during periods when the fish was off the bottom. This migration included limited movements from July through December, followed by a 200-km southern migration from January through February, then a return northward in March and April. The successful application of tidal current information to predict a horizontal migratory path not only provides evidence of selective tidal stream transport but indicates that vertical excursions were conducted primarily to assist horizontal migration.
Abstract:
This is the report "Evaluation of the impact of cypermethrin use in forestry on Welsh streams" from the University of Plymouth, published in September 2010 by the Environment Agency South West. The report focuses on cypermethrin, a highly active synthetic pyrethroid insecticide effective against a wide range of pests in agriculture, public health, and animal husbandry. It is also used in forestry to control the pine weevil, Hylobius abietis. Cypermethrin is very toxic to aquatic invertebrates and fish at nanogram per litre concentrations. This project checks the effectiveness of current best practice measures in minimising the risk of pollution associated with the use of cypermethrin in forestry in Wales. Chemical results from the intensive studies show that cypermethrin entered minor watercourses draining treated areas at two of the eight sites. In one of these cases the level was well in excess of the short-term Predicted No Effect Concentration. The absence of a buffer area at the other site resulted in the cypermethrin reaching a main drain. However, dilution appeared to be sufficient to prevent any impact on water quality or on the invertebrate community in the main stream. Invertebrate and chemical data from the extensive survey showed little evidence of pollution due to wider use of cypermethrin in Welsh forestry. Finally, a number of recommendations are made for further tightening controls on forestry practice to minimise the risk of cypermethrin entering the aquatic environment.