752 resultados para Data streams
Resumo:
This paper presents a single pass algorithm for mining discriminative Itemsets in data streams using a novel data structure and the tilted-time window model. Discriminative Itemsets are defined as Itemsets that are frequent in one data stream and their frequency in that stream is much higher than the rest of the streams in the dataset. In order to deal with the data structure size, we propose a pruning process that results in the compact tree structure containing discriminative Itemsets. Empirical analysis shows the sound time and space complexity of the proposed method.
Resumo:
Technological advances have led to an influx of affordable hardware that supports sensing, computation and communication. This hardware is increasingly deployed in public and private spaces, tracking and aggregating a wealth of real-time environmental data. Although these technologies are the focus of several research areas, there is a lack of research dealing with the problem of making these capabilities accessible to everyday users. This thesis represents a first step towards developing systems that will allow users to leverage the available infrastructure and create custom tailored solutions. It explores how this notion can be utilized in the context of energy monitoring to improve conventional approaches. The project adopted a user-centered design process to inform the development of a flexible system for real-time data stream composition and visualization. This system features an extensible architecture and defines a unified API for heterogeneous data streams. Rather than displaying the data in a predetermined fashion, it makes this information available as building blocks that can be combined and shared. It is based on the insight that individual users have diverse information needs and presentation preferences. Therefore, it allows users to compose rich information displays, incorporating personally relevant data from an extensive information ecosystem. The prototype was evaluated in an exploratory study to observe its natural use in a real-world setting, gathering empirical usage statistics and conducting semi-structured interviews. The results show that a high degree of customization does not warrant sustained usage. Other factors were identified, yielding recommendations for increasing the impact on energy consumption.
Resumo:
Amphibian is an 10’00’’ musical work which explores new musical interfaces and approaches to hybridising performance practices from the popular music, electronic dance music and computer music traditions. The work is designed to be presented in a range of contexts associated with the electro-acoustic, popular and classical music traditions. The work is for two performers using two synchronised laptops, an electric guitar and a custom designed gestural interface for vocal performers - the e-Mic (Extended Mic-stand Interface Controller). This interface was developed by one of the co-authors, Donna Hewitt. The e-Mic allows a vocal performer to manipulate the voice in real time through the capture of physical gestures via an array of sensors - pressure, distance, tilt - along with ribbon controllers and an X-Y joystick microphone mount. Performance data are then sent to a computer, running audio-processing software, which is used to transform the audio signal from the microphone. In this work, data is also exchanged between performers via a local wireless network, allowing performers to work with shared data streams. The duo employs the gestural conventions of guitarist and singer (i.e. 'a band' in a popular music context), but transform these sounds and gestures into new digital music. The gestural language of popular music is deliberately subverted and taken into a new context. The piece thus explores the nexus between the sonic and performative practices of electro acoustic music and intelligent electronic dance music (‘idm’). This work was situated in the research fields of new musical interfacing, interaction design, experimental music composition and performance. The contexts in which the research was conducted were live musical performance and studio music production. The work investigated new methods for musical interfacing, performance data mapping, hybrid performance and compositional practices in electronic music. The research methodology was practice-led. New insights were gained from the iterative experimental workshopping of gestural inputs, musical data mapping, inter-performer data exchange, software patch design, data and audio processing chains. In respect of interfacing, there were innovations in the design and implementation of a novel sensor-based gestural interface for singers, the e-Mic, one of the only existing gestural controllers for singers. This work explored the compositional potential of sharing real time performance data between performers and deployed novel methods for inter-performer data exchange and mapping. As regards stylistic and performance innovation, the work explored and demonstrated an approach to the hybridisation of the gestural and sonic language of popular music with recent ‘post-digital’ approaches to laptop based experimental music The development of the work was supported by an Australia Council Grant. Research findings have been disseminated via a range of international conference publications, recordings, radio interviews (ABC Classic FM), broadcasts, and performances at international events and festivals. The work was curated into the major Australian international festival, Liquid Architecture, and was selected by an international music jury (through blind peer review) for presentation at the International Computer Music Conference in Belfast, N. Ireland.
Resumo:
An information filtering (IF) system monitors an incoming document stream to find the documents that match the information needs specified by the user profiles. To learn to use the user profiles effectively is one of the most challenging tasks when developing an IF system. With the document selection criteria better defined based on the users’ needs, filtering large streams of information can be more efficient and effective. To learn the user profiles, term-based approaches have been widely used in the IF community because of their simplicity and directness. Term-based approaches are relatively well established. However, these approaches have problems when dealing with polysemy and synonymy, which often lead to an information overload problem. Recently, pattern-based approaches (or Pattern Taxonomy Models (PTM) [160]) have been proposed for IF by the data mining community. These approaches are better at capturing sematic information and have shown encouraging results for improving the effectiveness of the IF system. On the other hand, pattern discovery from large data streams is not computationally efficient. Also, these approaches had to deal with low frequency pattern issues. The measures used by the data mining technique (for example, “support” and “confidences”) to learn the profile have turned out to be not suitable for filtering. They can lead to a mismatch problem. This thesis uses the rough set-based reasoning (term-based) and pattern mining approach as a unified framework for information filtering to overcome the aforementioned problems. This system consists of two stages - topic filtering and pattern mining stages. The topic filtering stage is intended to minimize information overloading by filtering out the most likely irrelevant information based on the user profiles. A novel user-profiles learning method and a theoretical model of the threshold setting have been developed by using rough set decision theory. The second stage (pattern mining) aims at solving the problem of the information mismatch. This stage is precision-oriented. A new document-ranking function has been derived by exploiting the patterns in the pattern taxonomy. The most likely relevant documents were assigned higher scores by the ranking function. Because there is a relatively small amount of documents left after the first stage, the computational cost is markedly reduced; at the same time, pattern discoveries yield more accurate results. The overall performance of the system was improved significantly. The new two-stage information filtering model has been evaluated by extensive experiments. Tests were based on the well-known IR bench-marking processes, using the latest version of the Reuters dataset, namely, the Reuters Corpus Volume 1 (RCV1). The performance of the new two-stage model was compared with both the term-based and data mining-based IF models. The results demonstrate that the proposed information filtering system outperforms significantly the other IF systems, such as the traditional Rocchio IF model, the state-of-the-art term-based models, including the BM25, Support Vector Machines (SVM), and Pattern Taxonomy Model (PTM).
Resumo:
Investigates the use of temporal lip information, in conjunction with speech information, for robust, text-dependent speaker identification. We propose that significant speaker-dependent information can be obtained from moving lips, enabling speaker recognition systems to be highly robust in the presence of noise. The fusion structure for the audio and visual information is based around the use of multi-stream hidden Markov models (MSHMM), with audio and visual features forming two independent data streams. Recent work with multi-modal MSHMMs has been performed successfully for the task of speech recognition. The use of temporal lip information for speaker identification has been performed previously (T.J. Wark et al., 1998), however this has been restricted to output fusion via single-stream HMMs. We present an extension to this previous work, and show that a MSHMM is a valid structure for multi-modal speaker identification
Resumo:
Currently, the GNSS computing modes are of two classes: network-based data processing and user receiver-based processing. A GNSS reference receiver station essentially contributes raw measurement data in either the RINEX file format or as real-time data streams in the RTCM format. Very little computation is carried out by the reference station. The existing network-based processing modes, regardless of whether they are executed in real-time or post-processed modes, are centralised or sequential. This paper describes a distributed GNSS computing framework that incorporates three GNSS modes: reference station-based, user receiver-based and network-based data processing. Raw data streams from each GNSS reference receiver station are processed in a distributed manner, i.e., either at the station itself or at a hosting data server/processor, to generate station-based solutions, or reference receiver-specific parameters. These may include precise receiver clock, zenith tropospheric delay, differential code biases, ambiguity parameters, ionospheric delays, as well as line-of-sight information such as azimuth and elevation angles. Covariance information for estimated parameters may also be optionally provided. In such a mode the nearby precise point positioning (PPP) or real-time kinematic (RTK) users can directly use the corrections from all or some of the stations for real-time precise positioning via a data server. At the user receiver, PPP and RTK techniques are unified under the same observation models, and the distinction is how the user receiver software deals with corrections from the reference station solutions and the ambiguity estimation in the observation equations. Numerical tests demonstrate good convergence behaviour for differential code bias and ambiguity estimates derived individually with single reference stations. With station-based solutions from three reference stations within distances of 22–103 km the user receiver positioning results, with various schemes, show an accuracy improvement of the proposed station-augmented PPP and ambiguity-fixed PPP solutions with respect to the standard float PPP solutions without station augmentation and ambiguity resolutions. Overall, the proposed reference station-based GNSS computing mode can support PPP and RTK positioning services as a simpler alternative to the existing network-based RTK or regionally augmented PPP systems.
Resumo:
Autonomous underwater vehicles (AUVs) are becoming commonplace in the study of inshore coastal marine habitats. Combined with shipboard systems, scientists are able to make in-situ measurements of water column and benthic properties. In CSIRO, autonomous gliders are used to collect water column data, while surface vessels are used to collect bathymetry information through the use of swath mapping, bottom grabs, and towed video systems. Although these methods have provided good data coverage for coastal and deep waters beyond 50m, there has been an increasing need for autonomous in-situ sampling in waters less than 50m deep. In addition, the collection of benthic and water column data has been conducted separately, requiring extensive post-processing to combine data streams. As such, a new AUV was developed for in-situ observations of both benthic habitat and water column properties in shallow waters. This paper provides an overview of the Starbug X AUV system, its operational characteristics including vision-based navigation and oceanographic sensor integration.
Resumo:
The eastern Australian rainforests have experienced several cycles of range contraction and expansion since the late Miocene that are closely correlated with global glaciation events. Together with ongoing aridification of the continent, this has resulted in current distributions of native closed forest that are highly fragmented along the east coast. Several closed forest endemic taxa exhibit patterns of population genetic structure that are congruent with historical isolation of populations in discrete refugia and reflect evolutionary histories dramatically affected by vicariance. Currently, limited data are available regarding the impact of these past climatic fluctuations on freshwater invertebrate taxa. The non-biting midge species Echinocladius martini Cranston is distributed along the east coast and inhabits predominantly montane streams in closed forest habitat. Phylogeographic structure in E. martini was resolved here at a continental scale by incorporating data from a previous pilot study and expanding the sampling design to encompass populations in the Wet Tropics of north-eastern Queensland, south-east Queensland, New South Wales and Victoria. Patterns of phylogeographic structure revealed several deeply divergent mitochondrial lineages from central and south-eastern Australia that were previously unrecognised and were geographically endemic to closed forest refugia. Estimated divergence times were congruent with late Miocene onset of rainforest contractions across the east coast of Australia. This suggested that dispersal and gene flow among E. martini populations isolated in refugia has been highly restricted historically. Moreover, these data imply, in contrast to existing preconceptions about freshwater invertebrates, that this taxon may be acutely susceptible to habitat fragmentation.
Resumo:
Big Data presents many challenges related to volume, whether one is interested in studying past datasets or, even more problematically, attempting to work with live streams of data. The most obvious challenge, in a ‘noisy’ environment such as contemporary social media, is to collect the pertinent information; be that information for a specific study, tweets which can inform emergency services or other responders to an ongoing crisis, or give an advantage to those involved in prediction markets. Often, such a process is iterative, with keywords and hashtags changing with the passage of time, and both collection and analytic methodologies need to be continually adapted to respond to this changing information. While many of the data sets collected and analyzed are preformed, that is they are built around a particular keyword, hashtag, or set of authors, they still contain a large volume of information, much of which is unnecessary for the current purpose and/or potentially useful for future projects. Accordingly, this panel considers methods for separating and combining data to optimize big data research and report findings to stakeholders. The first paper considers possible coding mechanisms for incoming tweets during a crisis, taking a large stream of incoming tweets and selecting which of those need to be immediately placed in front of responders, for manual filtering and possible action. The paper suggests two solutions for this, content analysis and user profiling. In the former case, aspects of the tweet are assigned a score to assess its likely relationship to the topic at hand, and the urgency of the information, whilst the latter attempts to identify those users who are either serving as amplifiers of information or are known as an authoritative source. Through these techniques, the information contained in a large dataset could be filtered down to match the expected capacity of emergency responders, and knowledge as to the core keywords or hashtags relating to the current event is constantly refined for future data collection. The second paper is also concerned with identifying significant tweets, but in this case tweets relevant to particular prediction market; tennis betting. As increasing numbers of professional sports men and women create Twitter accounts to communicate with their fans, information is being shared regarding injuries, form and emotions which have the potential to impact on future results. As has already been demonstrated with leading US sports, such information is extremely valuable. Tennis, as with American Football (NFL) and Baseball (MLB) has paid subscription services which manually filter incoming news sources, including tweets, for information valuable to gamblers, gambling operators, and fantasy sports players. However, whilst such services are still niche operations, much of the value of information is lost by the time it reaches one of these services. The paper thus considers how information could be filtered from twitter user lists and hash tag or keyword monitoring, assessing the value of the source, information, and the prediction markets to which it may relate. The third paper examines methods for collecting Twitter data and following changes in an ongoing, dynamic social movement, such as the Occupy Wall Street movement. It involves the development of technical infrastructure to collect and make the tweets available for exploration and analysis. A strategy to respond to changes in the social movement is also required or the resulting tweets will only reflect the discussions and strategies the movement used at the time the keyword list is created — in a way, keyword creation is part strategy and part art. In this paper we describe strategies for the creation of a social media archive, specifically tweets related to the Occupy Wall Street movement, and methods for continuing to adapt data collection strategies as the movement’s presence in Twitter changes over time. We also discuss the opportunities and methods to extract data smaller slices of data from an archive of social media data to support a multitude of research projects in multiple fields of study. The common theme amongst these papers is that of constructing a data set, filtering it for a specific purpose, and then using the resulting information to aid in future data collection. The intention is that through the papers presented, and subsequent discussion, the panel will inform the wider research community not only on the objectives and limitations of data collection, live analytics, and filtering, but also on current and in-development methodologies that could be adopted by those working with such datasets, and how such approaches could be customized depending on the project stakeholders.
Resumo:
In studies using macroinvertebrates as indicators for monitoring rivers and streams, species level identifications in comparison with lower resolution identifications can have greater information content and result in more reliable site classifications and better capacity to discriminate between sites, yet many such programmes identify specimens to the resolution of family rather than species. This is often because it is cheaper to obtain family level data than species level data. Choice of appropriate taxonomic resolution is a compromise between the cost of obtaining data at high taxonomic resolutions and the loss of information at lower resolutions. Optimum taxonomic resolution should be determined by the information required to address programme objectives. Costs saved in identifying macroinvertebrates to family level may not be justified if family level data can not give the answers required and expending the extra cost to obtain species level data may not be warranted if cheaper family level data retains sufficient information to meet objectives. We investigated the influence of taxonomic resolution and sample quantification (abundance vs. presence/absence) on the representation of aquatic macroinvertebrate species assemblage patterns and species richness estimates. The study was conducted in a physically harsh dryland river system (Condamine-Balonne River system, located in south-western Queensland, Australia), characterised by low macroinvertebrate diversity. Our 29 study sites covered a wide geographic range and a diversity of lotic conditions and this was reflected by differences between sites in macroinvertebrate assemblage composition and richness. The usefulness of expending the extra cost necessary to identify macroinvertebrates to species was quantified via the benefits this higher resolution data offered in its capacity to discriminate between sites and give accurate estimates of site species richness. We found that very little information (<6%) was lost by identifying taxa to family (or genus), as opposed to species, and that quantifying the abundance of taxa provided greater resolution for pattern interpretation than simply noting their presence/absence. Species richness was very well represented by genus, family and order richness, so that each of these could be used as surrogates of species richness if, for example, surveying to identify diversity hot-spots. It is suggested that sharing of common ecological responses among species within higher taxonomic units is the most plausible mechanism for the results. Based on a cost/benefit analysis, family level abundance data is recommended as the best resolution for resolving patterns in macroinvertebrate assemblages in this system. The relevance of these findings are discussed in the context of other low diversity, harsh, dryland river systems.