2 resultados para Content analysis (Communication) -- Data processing
em CORA - Cork Open Research Archive - University College Cork - Ireland
Resumo:
A substantial amount of information on the Internet is present in the form of text. The value of this semi-structured and unstructured data has been widely acknowledged, with consequent scientific and commercial exploitation. The ever-increasing data production, however, pushes data analytic platforms to their limit. This thesis proposes techniques for more efficient textual big data analysis suitable for the Hadoop analytic platform. This research explores the direct processing of compressed textual data. The focus is on developing novel compression methods with a number of desirable properties to support text-based big data analysis in distributed environments. The novel contributions of this work include the following. Firstly, a Content-aware Partial Compression (CaPC) scheme is developed. CaPC makes a distinction between informational and functional content in which only the informational content is compressed. Thus, the compressed data is made transparent to existing software libraries which often rely on functional content to work. Secondly, a context-free bit-oriented compression scheme (Approximated Huffman Compression) based on the Huffman algorithm is developed. This uses a hybrid data structure that allows pattern searching in compressed data in linear time. Thirdly, several modern compression schemes have been extended so that the compressed data can be safely split with respect to logical data records in distributed file systems. Furthermore, an innovative two layer compression architecture is used, in which each compression layer is appropriate for the corresponding stage of data processing. Peripheral libraries are developed that seamlessly link the proposed compression schemes to existing analytic platforms and computational frameworks, and also make the use of the compressed data transparent to developers. The compression schemes have been evaluated for a number of standard MapReduce analysis tasks using a collection of real-world datasets. In comparison with existing solutions, they have shown substantial improvement in performance and significant reduction in system resource requirements.
Resumo:
Two concepts in rural economic development policy have been the focus of much research and policy action: the identification and support of clusters or networks of firms and the availability and adoption by rural businesses of Information and Communication Technologies (ICT). From a theoretical viewpoint these policies are based on two contrasting models, with clustering seen as a process of economic agglomeration, and ICT-mediated communication as a means of facilitating economic dispersion. The study’s conceptual framework is based on four interrelated elements: location, interaction, knowledge, and advantage, together with the concept of networks which is employed as an operationally and theoretically unifying concept. The research questions are developed in four successive categories: Policy, Theory, Networks, and Method. The questions are approached using a study of two contrasting groups of rural small businesses in West Cork, Ireland: (a) Speciality Foods, and (b) firms in Digital Products and Services. The study combines Social Network Analysis (SNA) with Qualitative Thematic Analysis, using data collected from semi-structured interviews with 58 owners or managers of these businesses. Data comprise relational network data on the firms’ connections to suppliers, customers, allies and competitors, together with linked qualitative data on how the firms established connections, and how tacit and codified knowledge was sourced and utilised. The research finds that the key characteristics identified in the cluster literature are evident in the sample of Speciality Food businesses, in relation to flows of tacit knowledge, social embedding, and the development of forms of social capital. In particular the research identified the presence of two distinct forms of collective social capital in this network, termed “community” and “reputation”. By contrast the sample of Digital Products and Services businesses does not have the form of a cluster, but matches more closely to dispersive models, or “chain” structures. Much of the economic and social structure of this set of firms is best explained in terms of “project organisation”, and by the operation of an individual rather than collective form of “reputation”. The rural setting in which these firms are located has resulted in their being service-centric, and consequently they rely on ICT-mediated communication in order to exchange tacit knowledge “at a distance”. It is this factor, rather than inputs of codified knowledge, that most strongly influences their operation and their need for availability and adoption of high quality communication technologies. Thus the findings have applicability in relation to theory in Economic Geography and to policy and practice in Rural Development. In addition the research contributes to methodological questions in SNA, and to methodological questions about the combination or mixing of quantitative and qualitative methods.