34 results for Data stream mining
Abstract:
A system for temporal data mining includes a computer readable medium having an application configured to receive at an input module a temporal data series having events with start times and end times, a set of allowed dwelling times and a threshold frequency. The system is further configured to identify, using a candidate identification and tracking module, one or more occurrences in the temporal data series of a candidate episode and increment a count for each identified occurrence. The system is also configured to produce at an output module an output for those episodes whose count of occurrences results in a frequency exceeding the threshold frequency.
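The counting step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed system: it treats a candidate episode as an ordered list of event types, counts non-overlapping occurrences with a single greedy scan, and defines frequency as the count divided by the number of events. The names `count_occurrences` and `frequent_episodes` and that frequency definition are hypothetical.

```python
def count_occurrences(events, episode, allowed_dwell):
    """Count non-overlapping occurrences of `episode` in `events`.

    events: list of (event_type, start, end), sorted by start time.
    episode: non-empty list of event types to match in order (assumed form).
    allowed_dwell: list of (lo, hi) intervals for the dwelling time end - start.
    """
    def dwell_ok(start, end):
        d = end - start
        return any(lo <= d <= hi for lo, hi in allowed_dwell)

    count = 0
    pos = 0  # index of the episode node we are waiting for
    for etype, start, end in events:
        if etype == episode[pos] and dwell_ok(start, end):
            pos += 1
            if pos == len(episode):  # full occurrence seen
                count += 1
                pos = 0
    return count


def frequent_episodes(events, candidates, allowed_dwell, threshold):
    """Return candidates whose frequency exceeds `threshold`.

    Frequency is taken as count / number of events (an assumption).
    """
    if not events:
        return {}
    result = {}
    for ep in candidates:
        freq = count_occurrences(events, ep, allowed_dwell) / len(events)
        if freq > threshold:
            result[tuple(ep)] = freq
    return result


events = [("A", 0, 2), ("B", 3, 4), ("A", 5, 9),
          ("B", 10, 11), ("A", 12, 13), ("B", 14, 15)]
frequent = frequent_episodes(events, [["A", "B"], ["B", "A"]], [(1, 2)], 0.3)
```

Here the event `("A", 5, 9)` is ignored because its dwelling time of 4 falls outside the allowed interval, so only the `A` then `B` episode passes the threshold.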
Abstract:
We address the problem of mining targeted association rules over multidimensional market-basket data. Here, each transaction has, in addition to the set of purchased items, ancillary dimension attributes associated with it. Based on these dimensions, transactions can be visualized as distributed over cells of an n-dimensional cube. In this framework, a targeted association rule is of the form {X -> Y} R, where R is a convex region in the cube and X -> Y is a traditional association rule within region R. We first describe the TOARM algorithm, based on classical techniques, for identifying targeted association rules. Then, we discuss the concepts of bottom-up aggregation and cubing, leading to the CellUnion technique. This approach is further extended, using notions of cube-count interleaving and credit-based pruning, to derive the IceCube algorithm. Our experiments demonstrate that IceCube consistently provides the best execution time performance, especially for large and complex data cubes.
Abstract:
The rapid growth in the field of data mining has led to the development of various methods for outlier detection. Though detection of outliers has been well explored in the context of numerical data, dealing with categorical data is still evolving. In this paper, we propose a two-phase algorithm for detecting outliers in categorical data based on a novel definition of outliers. In the first phase, this algorithm explores a clustering of the given data, followed by the ranking phase for determining the set of most likely outliers. The proposed algorithm is expected to perform better as it can identify different types of outliers, employing two independent ranking schemes based on the attribute value frequencies and the inherent clustering structure in the given data. Unlike some existing methods, the computational complexity of this algorithm is not affected by the number of outliers to be detected. The efficacy of this algorithm is demonstrated through experiments on various public domain categorical data sets.
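A frequency-based ranking of categorical records can be sketched as below. This is only an illustration of the general idea of scoring by attribute value frequencies, not the paper's two-phase algorithm: each record is scored by the mean frequency of its attribute values, and lower scores are treated as more outlier-like. The function names and the scoring rule are assumptions.

```python
from collections import Counter


def frequency_scores(records):
    """Score each record by the mean frequency of its attribute values."""
    n_attrs = len(records[0])
    # One frequency table per attribute (column).
    counts = [Counter(r[j] for r in records) for j in range(n_attrs)]
    return [sum(counts[j][r[j]] for j in range(n_attrs)) / n_attrs
            for r in records]


def rank_outliers(records):
    """Return record indices ordered from most to least outlier-like."""
    scores = frequency_scores(records)
    return sorted(range(len(records)), key=lambda i: scores[i])


records = [("a", "x"), ("a", "x"), ("a", "x"), ("b", "y")]
ranking = rank_outliers(records)
```

Because every record is scored once, the cost of this ranking does not depend on how many outliers are requested; the caller simply takes a prefix of `ranking`.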
Abstract:
This paper primarily intends to develop a GIS (geographical information system)-based data mining approach for optimally selecting the locations and determining installed capacities for setting up distributed biomass power generation systems in the context of decentralized energy planning for rural regions. The optimal locations within a cluster of villages are obtained by matching the installed capacity needed with the demand for power, minimizing the cost of transportation of biomass from dispersed sources to the power generation system, and the cost of distribution of electricity from the power generation system to demand centers or villages. The methodology was validated by using it to develop an optimal plan for implementing distributed biomass-based power systems for meeting the rural electricity needs of Tumkur district in India, which consists of 2700 villages. The approach uses a k-medoid clustering algorithm to divide the total region into clusters of villages and locate biomass power generation systems at the medoids. The optimal value of k is determined iteratively by running the algorithm for the entire search space for different values of k along with demand-supply matching constraints. The optimal value of k is chosen such that it minimizes the total cost of system installation, transportation of biomass, and transmission and distribution. A smaller region, consisting of 293 villages, was selected to study the sensitivity of the results to varying demand and supply parameters. The results of clustering are represented on a GIS map for the region.
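The idea of sweeping k and keeping the cheapest plan can be sketched as follows. This is a toy illustration under simplifying assumptions, not the paper's method: medoids are found by exhaustive search (feasible only for a handful of points), the cost model is a flat installation cost per plant plus a transport cost proportional to straight-line distance, and demand-supply constraints are omitted. All names and cost parameters are hypothetical.

```python
import math
from itertools import combinations


def assignment_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)


def best_plan(points, k_values, install_cost, unit_transport_cost):
    """Try every k and every medoid set; return (total_cost, k, medoids)."""
    best = None
    for k in k_values:
        for medoids in combinations(range(len(points)), k):
            total = (k * install_cost
                     + unit_transport_cost * assignment_cost(points, medoids))
            if best is None or total < best[0]:
                best = (total, k, medoids)
    return best


# Six villages in two well-separated groups along a line.
villages = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
total, k, medoids = best_plan(villages, range(1, 5), 5.0, 1.0)
```

With these numbers, adding a second plant saves far more in transport than it costs to install, while a third does not, so the sweep settles on two plants at the middle village of each group. A realistic instance with thousands of villages would need a heuristic k-medoid procedure in place of the exhaustive search.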
Abstract:
The disclosure of information and its misuse in Privacy Preserving Data Mining (PPDM) systems is a concern to the parties involved. In PPDM systems, data is available amongst multiple parties collaborating to achieve cumulative mining accuracy. The vertically partitioned data available with the individual parties cannot provide mining results as accurate as the collaborative mining results. To overcome the privacy issue in data disclosure, this paper describes a Key Distribution-Less Privacy Preserving Data Mining (KDLPPDM) system in which the local association rules generated by the parties are published. The association rules are securely combined to form the combined rule set using the Commutative RSA algorithm. The combined rule sets established are used to classify or mine the data. The results discussed in this paper compare the accuracy of the rules generated using the C4.5-based KDLPPDM system and the C5.0-based KDLPPDM system using receiver operating characteristic (ROC) curves.
Abstract:
Online Social Networks (OSNs) make it easy to create and spread information rapidly, influencing others to participate and propagandize. This work proposes a novel method of profiling an Influential Blogger (IB), one who influences various other bloggers in a Social Blog Network (SBN), based on the activities performed on that blogger's blog documents. After constructing a social blogging site, the SBN is analyzed with appropriate parameters to obtain the Influential Blog Power (IBP) of each blogger in the network and to demonstrate that profiling IBs is adequate and accurate. With the proposed Profiling Influential Blogger (PIB) algorithm, the survival rate of IBs is high and stable. (C) 2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Abstract:
The critical stream power criterion may be used to describe the incipient motion of cohesionless particles of plane sediment beds. The governing equation relating "critical stream power" to "shear Reynolds number" is developed by using the present experimental data as well as the data from several other sources. Simultaneously, a resistance equation, relating the "particle Reynolds number" to the "shear Reynolds number", is developed for plane sediment beds in wide channels with little or no transport. By making use of these relations, a procedure is developed to design plane sediment beds such that any two of the four design variables, including particle size, energy/friction slope, flow depth, and discharge per unit width in the channel, should be known to predict the remaining two variables. Finally, a straightforward design procedure using design tables/design curves and analytical methods is presented to solve six possible design problems.
Abstract:
The StreamIt programming model has been proposed to exploit parallelism in streaming applications on general purpose multicore architectures. The StreamIt graphs describe task, data and pipeline parallelism which can be exploited on accelerators such as Graphics Processing Units (GPUs) or CellBE which support abundant parallelism in hardware. In this paper, we describe a novel method to orchestrate the execution of a StreamIt program on a multicore platform equipped with an accelerator. The proposed approach identifies, using profiling, the relative benefits of executing a task on the superscalar CPU cores and the accelerator. We formulate the problem of partitioning the work between the CPU cores and the GPU, taking into account the latencies for data transfers and the required buffer layout transformations associated with the partitioning, as an integrated Integer Linear Program (ILP) which can then be solved by an ILP solver. We also propose an efficient heuristic algorithm for the work-partitioning between the CPU and the GPU, which provides solutions that are within 9.05% of the optimal solution on average across the benchmark suite. The partitioned tasks are then software pipelined to execute on the multiple CPU cores and the Streaming Multiprocessors (SMs) of the GPU. The software pipelining algorithm orchestrates the execution between CPU cores and the GPU by emitting the code for the CPU and the GPU, and the code for the required data transfers. Our experiments on a platform with 8 CPU cores and a GeForce 8800 GTS 512 GPU show a geometric mean speedup of 6.94X with a maximum of 51.96X over a single-threaded CPU execution across the StreamIt benchmarks. This is an 18.9% improvement over a partitioning strategy that maps only the filters that cannot be executed on the GPU - the filters with state that is persistent across firings - onto the CPU.
Abstract:
The StreamIt programming model has been proposed to exploit parallelism in streaming applications on general purpose multi-core architectures. This model allows programmers to specify the structure of a program as a set of filters that act upon data, and a set of communication channels between them. The StreamIt graphs describe task, data and pipeline parallelism which can be exploited on modern Graphics Processing Units (GPUs), as they support abundant parallelism in hardware. In this paper, we describe the challenges in mapping StreamIt to GPUs and propose an efficient technique to software pipeline the execution of stream programs on GPUs. We formulate this problem - both scheduling and assignment of filters to processors - as an efficient Integer Linear Program (ILP), which is then solved using ILP solvers. We also describe a novel buffer layout technique for GPUs which facilitates exploiting the high memory bandwidth available in GPUs. The proposed scheduling utilizes both the scalar units in the GPU, to exploit data parallelism, and the multiprocessors, to exploit task and pipeline parallelism. Further, it takes into consideration the synchronization and bandwidth limitations of GPUs, and yields speedups between 1.87X and 36.83X over a single-threaded CPU.
Abstract:
The impact of riparian land use on the stream insect communities was studied at Kudremukh National Park located within Western Ghats, a tropical biodiversity hotspot in India. The diversity and community composition of stream insects varied across streams with different riparian land use types. The rarefied family and generic richness was highest in streams with natural semi evergreen forests as riparian vegetation. However, when the streams had human habitations and areca nut plantations as riparian land use type, the rarefied richness was higher than that of streams with natural evergreen forests and grasslands. The streams with scrub lands and iron ore mining as the riparian land use had the lowest rarefied richness. Within a landscape, the streams with the natural riparian vegetation had similar community composition. However, streams with natural grasslands as the riparian vegetation, had low diversity and the community composition was similar to those of paddy fields. We discuss how stream insect assemblages differ due to varied riparian land use patterns, reflecting fundamental alterations in the functioning of stream ecosystems. This understanding is vital to conserve, manage and restore tropical riverine ecosystems.
Abstract:
The problem of two-stream instability in plasma is studied by specifying the importance of initial magnetic field associated with the motion of the charged particles and the boundary effects. In Part I the accurate initial steady state is studied when the streams of electrons and ions move with different uniform speeds in plasmas with plane and cylindrical geometry. In Part II, in order to show the effects of finiteness and inhomogeneity of the system, small transverse plasma oscillations are studied in the case of plane plasmas. The role of plasma-sheath oscillations at the boundaries is found to be very important in driving the instabilities associated with the electromagnetic modes. The numerical estimates of the growth rates of the instability are given for the specific case of the physical data in discharge tubes.
Abstract:
The effect of vibration on heat transfer from a horizontal copper cylinder, 0.344 in. in diameter and 6 in. long, was investigated. The cylinder was placed normal to an air stream and was sinusoidally vibrated in a direction perpendicular to the direction of the air stream. The flow velocity varied from 19 ft/s to 92 ft/s; the double amplitude of vibration from 0.75 cm to 3.2 cm, and the frequency of vibration from 200 to 2800 cycles/min. A transient technique was used to determine the heat transfer coefficients. The experimental data in the absence of vibration is expressed by $N_{Nu} = 0.226\,N_{Re}^{0.6}$ in the range $2500 < N_{Re} < 15\,000$. By imposing vibrational velocities as high as 20 per cent of the flow velocity, no appreciable change in the heat transfer coefficient was observed. An analysis using the resultant of the vibration and the flow velocity explains the observed phenomenon.
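The no-vibration correlation quoted above is easy to evaluate numerically. The short sketch below simply applies the stated power law inside its stated Reynolds number range; the function name and the choice to raise an error outside that range are assumptions made for illustration.

```python
def nusselt_no_vibration(reynolds):
    """Nusselt number from N_Nu = 0.226 * N_Re**0.6 (no vibration).

    The correlation is reported only for 2500 < N_Re < 15000, so values
    outside that range are rejected rather than extrapolated.
    """
    if not 2500 < reynolds < 15000:
        raise ValueError("correlation reported only for 2500 < N_Re < 15000")
    return 0.226 * reynolds ** 0.6


nu_mid_range = nusselt_no_vibration(10000)  # roughly 56.8
```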
Abstract:
A new breed of processors like the Cell Broadband Engine, the Imagine stream processor and the various GPU processors emphasize data-level parallelism (DLP) and thread-level parallelism (TLP) as opposed to traditional instruction-level parallelism (ILP). This allows them to achieve order-of-magnitude improvements over conventional superscalar processors for many workloads. However, it is unclear how much parallelism of these types exists in current programs. Most earlier studies have largely concentrated on the amount of ILP in a program, without differentiating DLP or TLP. In this study, we investigate the extent of data-level parallelism available in programs in the MediaBench suite. By packing instructions in a SIMD fashion, we observe reductions of up to 91% (84% on average) in the number of dynamic instructions, indicating a very high degree of DLP in several applications.
Abstract:
The high temperature region of the MnO-Al2O3 phase diagram has been redetermined to resolve some discrepancies reported in the literature regarding the melting behaviour of MnAl2O4. This spinel was found to melt congruently at 2108 (±15) K. The activity of MnO in MnO-Al2O3 melts and in the two-phase regions, melt + MnAl2O4 and MnAl2O4 + Al2O3, has been determined by measuring the manganese concentration in platinum foils in equilibrium under controlled oxygen potentials. The activity of MnO obtained in this study for MnO-Al2O3 melts is in fair agreement with the results of Sharma and Richardson. However, the alumina-rich melt is found to be in equilibrium with MnAl2O4 rather than Al2O3, as suggested by Sharma and Richardson. The value for the activity of MnO in the MnAl2O4 + Al2O3 two-phase region permits a rigorous application of the Gibbs-Duhem equation for calculating the activity of Al2O3 and the integral Gibbs' energy of mixing of MnO-Al2O3 melts, which are significantly different from those reported in the literature.
Abstract:
Over the past decade, many powerful data mining techniques have been developed to analyze temporal and sequential data. The time is now fertile for addressing problems of larger scope under the purview of temporal data mining. The fourth SIGKDD workshop on temporal data mining focused on the question: What can we infer about the structure of a complex dynamical system from observed temporal data? The goals of the workshop were to critically evaluate the need in this area by bringing together leading researchers from industry and academia, and to identify promising technologies and methodologies for doing the same. We provide a brief summary of the workshop proceedings and ideas arising out of the discussions.