775 results for mining data streams
Abstract:
Due to rapid advances in computing and sensing technologies, enormous amounts of data are generated every day in various applications. The integration of data mining and data visualization has been widely used to analyze these massive and complex data sets and discover hidden patterns. For both data mining and visualization to be effective, it is important to include visualization techniques in the mining process and to present the discovered patterns in a more comprehensive visual view. In this dissertation, four related problems are studied to explore the integration of data mining and data visualization: dimensionality reduction for visualizing high-dimensional datasets, visualization-based clustering evaluation, interactive document mining, and exploration of multiple clusterings. In particular, we 1) propose an efficient feature selection method (reliefF + mRMR) for preprocessing high-dimensional datasets; 2) present DClusterE, which integrates cluster validation with user interaction and provides rich visualization tools for users to examine document clustering results from multiple perspectives; 3) design two interactive document summarization systems that involve users' efforts and generate customized summaries from 2D sentence layouts; and 4) propose a new framework that organizes the different input clusterings into a hierarchical tree structure and allows interactive exploration of multiple clustering solutions.
Abstract:
Electronic database handling of business information has gradually gained popularity in the hospitality industry. This article provides an overview of the fundamental concepts of a hotel database and investigates the feasibility of incorporating computer-assisted data mining techniques into hospitality database applications. The author also exposes some potential myths associated with data mining in hospitality database applications.
Abstract:
Ensemble stream modeling and data cleaning are sensor information processing systems that have different training and testing methods by which their goals are cross-validated. This research examines a mechanism that seeks to extract novel patterns by generating ensembles from data. The main goal of label-less stream processing is to process the sensed events so as to eliminate uncorrelated noise and to choose the most likely model without overfitting, thus obtaining higher model confidence. Higher quality streams can be realized by combining many short streams into an ensemble that has the desired quality. The framework for the investigation is an existing data mining tool. First, to accommodate feature extraction for an event such as a bush or natural forest fire, we take the burnt area (BA*), the sensed ground truth obtained from logs, as our target variable. Even though this is an obvious model choice, the results are disappointing. There are two reasons for this: one, the histogram of fire activity is highly skewed; two, the measured sensor parameters are highly correlated. Since using non-descriptive features does not yield good results, we resort to temporal features. By doing so we carefully eliminate the averaging effects; the resulting histogram is more satisfactory and conceptual knowledge is learned from sensor streams. Second is the process of feature induction by cross-validating attributes with single or multi-target variables to minimize training error. We use the F-measure score, which combines precision and recall, to determine the false alarm rate of fire events. The multi-target data-cleaning trees use the information purity of the target leaf nodes to learn higher order features. A sensitive variance measure such as the F-test is performed during each node's split to select the best attribute. The ensemble stream model approach proved to improve when complicated features were used with a simpler tree classifier.
The ensemble framework for data cleaning and the enhancements to quantify quality of fitness (30% spatial, 10% temporal, and 90% mobility reduction) of sensors led to the formation of streams for sensor-enabled applications, which further motivates the novelty of stream quality labeling and its importance in handling the vast amounts of real-time mobile streams generated today.
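The F-measure named in this abstract is conventionally the weighted harmonic mean of precision and recall. The following is a minimal sketch of that standard definition, not the authors' implementation; the function name `f_measure`, the class label "fire", and the toy label lists are hypothetical and chosen only for illustration.

```python
def f_measure(actual, predicted, positive="fire", beta=1.0):
    """Return (precision, recall, F-beta) for one positive class.

    The label "fire" and the input lists are illustrative assumptions.
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0

    b2 = beta * beta
    f_score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_score


# Toy ground truth and predictions (invented, not data from the study).
actual = ["fire", "fire", "none", "none", "fire", "none"]
predicted = ["fire", "none", "none", "fire", "fire", "none"]
precision, recall, f1 = f_measure(actual, predicted)
```

A false alarm here is a predicted "fire" on a non-fire event, so a low precision signals a high false alarm rate.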
Abstract:
Online Social Network (OSN) services provided by Internet companies bring people together to chat and to share and enjoy information. Meanwhile, huge amounts of data are generated by those services (which can be regarded as social media) every day, every hour, even every minute and every second. Currently, researchers are interested in analyzing OSN data, extracting interesting patterns from it, and applying those patterns to real-world applications. However, because of the large scale of OSN data, it is difficult to analyze effectively. This dissertation focuses on applying data mining and information retrieval techniques to mine two key components of social media data: users and user-generated content. Specifically, it aims to address three problems related to social media users and content: (1) how does one organize the users and the content? (2) how does one summarize the textual content so that users do not have to go over every post to capture the general idea? (3) how does one identify the influential users in social media to benefit other applications, e.g., marketing campaigns? The contributions of this dissertation are briefly summarized as follows. (1) It provides a comprehensive and versatile data mining framework to analyze the users and user-generated content from social media. (2) It designs a hierarchical co-clustering algorithm to organize the users and content. (3) It proposes multi-document summarization methods to extract core information from social network content. (4) It introduces three important dimensions of social influence, and a dynamic influence model for identifying influential users.
Abstract:
The Buchans ore bodies of central Newfoundland represent some of the highest grade VMS deposits ever mined. These Kuroko-type deposits are also known for the well developed and preserved nature of the mechanically transported deposits. The deposits are hosted in Cambro-Ordovician, dominantly calc-alkaline, bimodal volcanic and epiclastic sequences of the Notre Dame Subzone, Newfoundland Appalachians. Stratigraphic relationships in this zone are complicated by extensively developed, brittle-dominated Silurian thrust faulting. Hydrothermal alteration of host rocks is a common feature of nearly all VMS deposits, and the recognition of these zones has been a key exploration tool. Alteration of host rocks has long been described as spatially associated with the Buchans ore bodies, most notably with the larger in-situ deposits. This report represents a base-line study in which a complete documentation of the geochemical variance, in terms of both primary (igneous) and alteration effects, is presented from altered volcanic rocks in the vicinity of the Lucky Strike deposit (LSZ), the largest in-situ deposit in the Buchans camp. Packages of altered rocks also occur away from the immediate mining areas and constitute new targets for exploration. These zones, identified mostly by recent and previous drilling, represent untested targets and include the Powerhouse (PHZ), Woodmans Brook (WBZ) and Airport (APZ) alteration zones, as well as the Middle Branch alteration zone (MBZ), which represents a more distal alteration facies related to Buchans ore-formation. Data from each of these zones were compared to those from the LSZ in order to evaluate their relative prospectivity. Derived lithogeochemical data served two functions: to define (i) primary (igneous) trends and (ii) secondary alteration trends. Primary trends were established using immobile, or conservative, elements (i.e., HFSE, REE, Th, TiO₂, Al₂O₃, P₂O₅).
From these, altered volcanic rocks were interpreted in terms of composition (e.g., basalt - rhyodacite) and magmatic affinity (e.g., calc-alkaline vs. tholeiitic). The information suggests that bimodality is a common feature of all zones, with most rocks plotting as either basalt/andesite or dacite (or rhyodacite); andesitic sensu stricto compositions are rare. Magmatic affinities are more varied and complex, but indicate that all units are arc volcanic sequences. Rocks from the LSZ/MBZ represent a transitional to calc-alkalic sequence; however, a slight shift in key geochemical discriminants occurs from the foot-wall to the hanging-wall. Specifically, mafic and felsic lavas of the foot-wall are of transitional (or mildly calc-alkaline) affinity, whereas the hanging-wall rocks are relatively more strongly calc-alkaline, as indicated by enriched LREE/HREE and higher ZrN, NbN and other ratios in the latter. The geochemical variations also serve as a means to separate the units (at least the felsic rocks) into hanging-wall and foot-wall sequences, therefore providing a valuable exploration tool. Volcanic rocks from the WBZ/PHZ (and probably the APZ) are more typical of tholeiitic to transitional suites, yielding flatter mantle-normalized REE patterns and lower ZrN ratios. Thus, the relationships between the immediate mining area (represented by LSZ/MBZ) and the Buchans East (PHZ/WBZ) and the APZ are uncertain. Host rocks for all zones consist of mafic to felsic volcanic rocks, though the proportion of pyroclastic and epiclastic rocks is greatest at the LSZ. Phenocryst assemblages and textures are common in all zones, with minor exceptions, and are not useful for discrimination purposes. Felsic rocks from all zones are dominated by sericite-clay ± silica alteration, whereas mafic rocks are dominated by chlorite-quartz-sericite alteration. Pyrite is ubiquitous in all moderately altered rocks and minor associated base metal sulphides occur locally.
The exception is at Lucky Strike, where stockwork quartz veining contains abundant base-metal mineralization and barite. Rocks completely comprised of chlorite (chloritite) also occur in the LSZ foot-wall. In addition, K-feldspar alteration occurs in felsic volcanic rocks at the MBZ associated with Zn-Pb-Ba and, notably, without chlorite. This zone represents a peripheral, but proximal, zone of alteration induced by lower temperature hydrothermal fluids, presumably with little influence from seawater. Alteration geochemistry was interpreted from raw data as well as from mass balanced (recalculated) data derived from immobile element pairs. The data from the LSZ/MBZ indicate a range in the degree of alteration from only minor to severe modification of precursor compositions. Ba tends to show a strong positive correlation with K₂O, although most Ba occurs as barite. With respect to mass changes, Al₂O₃, TiO₂ and P₂O₅ were shown to be immobile. Nearly all rocks display mass loss of Na₂O, CaO, and Sr, reflecting feldspar destruction. These trends are usually mirrored by K₂O-Rb and MgO addition, indicating sericitic and chloritic alteration, respectively. More substantial gains of K₂O often occur in rocks with K-feldspar alteration, whereas a few samples also displayed excessive MgO enrichment and represent chloritites. Fe₂O₃ indicates both chlorite and sulphide formation. SiO₂ addition is almost always the case for the altered mafic rocks, as silica often infills amygdules and replaces the finer tuffaceous material. The felsic rocks display more variability in SiO₂. Silicic, sericitic and chloritic alteration trends were observed from the other zones, but not K-feldspar, chloritite, or barite.
Microprobe analyses of chlorites, sericites and carbonates indicate: (i) sericites from all zones are defined as muscovite and are not phengitic; (ii) at the LSZ, chlorites ranged from Fe-Mg chlorites (pycnochlorite) to Mg-rich chlorite (penninite), with the latter occurring in the stockwork zone and more proximal alteration facies; (iii) chlorites from the WBZ were typical of those from the more distal alteration facies of the LSZ, plotting as ripidolite to pycnochlorite; (iv) conversely, chlorite from the PHZ plots with Mg-Al-rich compositions (clinochlore to penninite); and (v) carbonate species from each zone are also varied, with calcite occurring in each zone, in addition to dolomite and ankerite in the PHZ and WBZ, respectively. Lead isotope ratios for galena separates from the various zones, when combined with data from older studies, tend to cluster into four distinctive fields. Overall, the data plot on a broad mixing line and indicate evolution in a relatively low-μ environment. Data from sulphide stringers in altered MBZ rocks, as well as from clastic sulphides (Sandfill prospect), plot in the Buchans ore field, as do the data for galena from altered rocks in the APZ. Samples from the Buchans East area are even more primitive than the Buchans ores, with lead from the PHZ plotting with the Connel Option prospect and data from the WBZ matching that of the Skidder prospect. A sample from a newly discovered debris flow-type sulphide occurrence (Middle Branch East) yields lead isotope ratios that are slightly more radiogenic than Buchans and plot with the Mary March alteration zone. Data within each cluster are interpreted to represent derivation from individual hydrothermal systems in which metals were derived from a common source.
Abstract:
Peer reviewed
Abstract:
Peer reviewed
Abstract:
Data mining can be defined as the extraction of implicit, previously unknown, and potentially useful information from data. Numerous researchers have been developing security technology and exploring new methods to detect cyber-attacks with the DARPA 1998 dataset for Intrusion Detection and its modified versions, KDDCup99 and NSL-KDD, but until now no one has examined the performance of the Top 10 data mining algorithms selected by experts in data mining. The classification learning algorithms compared in this thesis are C4.5, CART, k-NN and Naïve Bayes. The performance of these algorithms is compared in terms of accuracy, error rate and average cost on modified versions of the NSL-KDD train and test datasets, in which the instances are classified into normal and four cyber-attack categories: DoS, Probing, R2L and U2R. Additionally, the most important features for detecting cyber-attacks in all categories and in each category are evaluated with Weka’s Attribute Evaluator and ranked according to Information Gain. The results show that the classification algorithm with the best performance on the dataset is the k-NN algorithm. The most important features for detecting cyber-attacks are basic features such as the number of seconds of a network connection, the protocol used for the connection, the network service used, the normal or error status of the connection and the number of data bytes sent. The most important features for detecting DoS, Probing and R2L attacks are basic features, and the least important features are content features. For U2R attacks, by contrast, the content features are the most important features for detecting attacks.
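Ranking features by Information Gain, as described in this abstract, can be illustrated with a short sketch. It computes the textbook entropy-based gain for categorical features only and does not reproduce Weka's evaluator, which also handles numeric attributes. The function names and the toy connection records are hypothetical and are not taken from NSL-KDD.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, feature, target="label"):
    """Entropy of the target minus its expected entropy after splitting on feature."""
    base = entropy([row[target] for row in rows])
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Invented toy records; feature names and values are assumptions for illustration.
rows = [
    {"protocol": "tcp", "service": "http", "label": "normal"},
    {"protocol": "tcp", "service": "http", "label": "normal"},
    {"protocol": "icmp", "service": "ecr_i", "label": "DoS"},
    {"protocol": "icmp", "service": "ecr_i", "label": "DoS"},
    {"protocol": "tcp", "service": "ftp", "label": "R2L"},
    {"protocol": "tcp", "service": "telnet", "label": "normal"},
]

# Features sorted from most to least informative on this toy data.
ranking = sorted(["protocol", "service"],
                 key=lambda f: information_gain(rows, f),
                 reverse=True)
```

In this toy data every service value maps to a single class, so `service` receives the full entropy of the labels as its gain and ranks above `protocol`.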
Abstract:
Following inspections in 2013 of all police forces, Her Majesty’s Inspectorate of Constabulary found that one-third of forces could not provide data on repeat victims of domestic abuse (DA) and concluded that in general there were ambiguities around the term ‘repeat victim’ and that there was a need for consistent and comparable statistics on DA. Using an analysis of police-recorded DA data from two forces, an argument is made for including both offences and non-crime incidents when identifying repeat victims of DA. Furthermore, for statistical purposes the counting period for repeat victimizations should be taken as a rolling 12 months from the first recorded victimization. Examples are given of summary statistics that can be derived from these data down to Community Safety Partnership level. To reinforce the need to include both offences and incidents in analyses, repeat victim chronologies from police-recorded data are also used to briefly examine cases of escalation to homicide, as an example of how they can offer new insights and greater scope for evaluating risk and the effectiveness of interventions.
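One way to read the counting rule proposed in this abstract is sketched below: for each victim, take the first recorded victimization, open a 12-month window from that date, and count every record in the window whether it is an offence or a non-crime incident. This is only one interpretation under stated assumptions. The record layout `(victim_id, date, kind)`, the function names, and the sample records are hypothetical, and the treatment of window boundaries is a choice made here, not something the abstract specifies.

```python
from collections import defaultdict
from datetime import date


def add_one_year(d):
    """Return the same calendar day one year later (29 Feb maps to 28 Feb)."""
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)


def repeat_victims(records):
    """Map victim_id to the number of records in the 12 months from the first one.

    Offences and non-crime incidents are counted alike; only victims with
    two or more records inside the window are returned.
    """
    by_victim = defaultdict(list)
    for victim_id, when, _kind in records:
        by_victim[victim_id].append(when)

    counts = {}
    for victim_id, dates in by_victim.items():
        first = min(dates)
        window_end = add_one_year(first)  # window is [first, window_end)
        in_window = sum(1 for d in dates if first <= d < window_end)
        if in_window >= 2:
            counts[victim_id] = in_window
    return counts


# Invented sample records, not police data.
records = [
    ("V1", date(2012, 1, 10), "incident"),
    ("V1", date(2012, 6, 2), "offence"),
    ("V1", date(2013, 3, 1), "offence"),   # outside V1's window
    ("V2", date(2012, 2, 1), "offence"),   # single record, not a repeat victim
    ("V3", date(2012, 5, 5), "incident"),
    ("V3", date(2013, 5, 4), "incident"),  # last day inside V3's window
]
result = repeat_victims(records)
```

Counting only offences would miss V3 entirely, which is the kind of undercount the abstract argues against.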
Abstract:
Data mining, a much-discussed term, has been studied in various fields. Its potential for refining the decision-making process, revealing patterns and creating valuable knowledge has won the attention of scholars and practitioners. However, fewer studies combine data mining and libraries, where data generation occurs all the time. This thesis therefore aims to fill that gap. Meanwhile, potential opportunities created by data mining are explored to enhance one of the most important elements of libraries: reference service. In order to thoroughly demonstrate the feasibility and applicability of data mining, literature is reviewed to establish a critical understanding of data mining in libraries and to ascertain the current status of library reference service. The literature review indicates that free online data resources, other than data generated on social media, are rarely considered in current library data mining mandates. This result motivates the presented study to utilize free online resources. Furthermore, the natural match between data mining and libraries is established. The natural match is explained by emphasizing the reality of data richness and by considering data mining as one kind of knowledge, an easy choice for libraries, and a wise method to overcome reference service challenges. The natural match, especially the aspect that data mining could be helpful for library reference service, lays the main theoretical foundation for the empirical work in this study. Turku Main Library was selected as the case to answer the research question: whether data mining is feasible and applicable for reference service improvement. In this case, the daily visits from 2009 to 2015 in Turku Main Library are considered as the resource for data mining. In addition, corresponding weather conditions are collected from Weather Underground, which is freely available online.
Before being analyzed, the collected dataset is cleansed and preprocessed in order to ensure the quality of data mining. Multiple regression analysis is employed to mine the final dataset. Hourly visits are the dependent variable, and weather conditions, the Discomfort Index and the seven days of the week are the independent variables. In the end, four models for different seasons are established to predict visiting situations in each season. Patterns are identified in different seasons and implications are drawn from the discovered patterns. In addition, library-climate points are generated by a clustering method, which simplifies the process for librarians using weather data to forecast the library visiting situation. Then the data mining result is interpreted from the perspective of improving reference service. After this data mining work, the result of the case study is presented to librarians so as to collect professional opinions regarding the possibility of employing data mining to improve reference services. In the end, positive opinions are collected, which implies that it is feasible to utilize data mining as a tool to enhance library reference service.
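The multiple regression step in this abstract can be sketched as an ordinary least squares fit with visit counts as the response and weather and day-type variables as predictors. The sketch solves the normal equations in pure Python. It simplifies the seven day-of-week variables to a single hypothetical weekend indicator, omits the Discomfort Index, and uses invented numbers rather than Turku Main Library data.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(features, targets):
    """Return [intercept, b1, b2, ...] minimising the squared prediction error."""
    x = [[1.0] + list(row) for row in features]  # prepend intercept column
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(x, targets)) for i in range(k)]
    return solve(xtx, xty)


# Invented toy data: (temperature in deg C, is_weekend) -> visits.
features = [(0, 0), (10, 0), (20, 0), (5, 1), (15, 1), (25, 1)]
visits = [120, 150, 180, 95, 125, 155]

intercept, per_degree, weekend_effect = fit_ols(features, visits)
predicted = intercept + per_degree * 12 + weekend_effect * 1  # a 12 deg C weekend day
```

Fitting one such model per season, as the abstract describes, would amount to calling `fit_ols` separately on each season's rows.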
Abstract:
Following the workshop on new developments in daily licensing practice in November 2011, we brought together fourteen representatives from national consortia (from Denmark, Germany, the Netherlands and the UK) and publishers (Elsevier, SAGE and Springer), who met in Copenhagen on 9 March 2012 to discuss provisions in licences to accommodate new developments. The one-day workshop aimed to: present background and ideas regarding the provisions the KE Licensing Expert Group developed; introduce and explain the provisions the invited publishers currently use; ascertain agreement on the wording for long-term preservation, continuous access and course packs; give insight into and more clarity about the use of open access provisions in licences; discuss a roadmap for inclusion of the provisions in the publishers’ licences; and result in a report to disseminate the outcome of the meeting. Participants of the workshop were: United Kingdom: Lorraine Estelle (Jisc Collections); Denmark: Lotte Eivor Jørgensen (DEFF), Lone Madsen (Southern University of Denmark), Anne Sandfær (DEFF/Knowledge Exchange); Germany: Hildegard Schaeffler (Bavarian State Library), Markus Brammer (TIB); The Netherlands: Wilma Mossink (SURF), Nol Verhagen (University of Amsterdam), Marc Dupuis (SURF/Knowledge Exchange); Publishers: Alicia Wise (Elsevier), Yvonne Campfens (Springer), Bettina Goerner (Springer), Leo Walford (SAGE); Knowledge Exchange: Keith Russell. The main outcome of the workshop was that it would be valuable to have a standard set of clauses that could be used in negotiations; this would make concluding licences a lot easier and more efficient. The comments on the model provisions the Licensing Expert Group had drafted will be taken into account and the provisions will be reformulated. Data and text mining is a new development, and demand for access to allow for this is growing. It would be easier if there was a simpler way to access materials so they could be more easily mined.
However, there are still outstanding questions on how authors of articles that have been mined can be properly attributed.
Abstract:
The incredibly rapid growth to huge volumes of air travel, mainly due to the jet airliners that appeared in the skies in the 1950s, created the need for systematic aviation safety research and for collecting data about air traffic. Structured data can be analysed easily using database queries and running these results through graphic tools. However, analysing narratives, which often give more accurate information about a case, requires mining tools. The analysis of textual data with computers was not possible until data mining tools were developed. Their use, at least in aviation, is still at a moderate level. The research aims at discovering lethal trends in flight safety reports. The narratives of 1,200 flight safety reports in Finnish from the years 1994 – 1996 were processed with three text mining tools. One of them was totally language independent, another had a specific configuration for Finnish, and the third was originally created for English, but encouraging results had been achieved with Spanish, which is why a Finnish test was undertaken, too. The global rate of accidents is stabilising and the situation can now be regarded as satisfactory, but because of the growth in air traffic, the absolute number of fatal accidents per year might increase if flight safety is not improved. Data collection and reporting systems have reached their top level. The focal point in increasing flight safety is analysis. Air traffic has generally been forecast to grow 5 – 6 per cent annually over the next two decades. During this period, global air travel will probably double, even with relatively conservative expectations of economic growth. This development makes airline management confront growing pressure due to increasing competition, a significant rise in fuel prices and the need to reduce the incident rate due to the expected growth in air traffic volumes.
All this emphasises the urgent need for new tools and methods. All systems provided encouraging results but also revealed challenges still to be overcome. Flight safety can be improved through the development and utilisation of sophisticated analysis tools and methods, such as data mining, whose results support the decision process of executives.
Abstract:
Double Degree