791 results for Web Data Mining


Relevance: 80.00%

Abstract:

In recent years the online presence of people involved in politics has grown with the proliferation of social networks, and Twitter is the network with the greatest media impact in this field. Studying how politicians behave on Twitter, and how citizens respond to them, provides very valuable information for analysing electoral campaigns. It makes it possible to study the real impact of their messages on election results and to identify the behaviours that citizens accept most readily. Advances in text mining provide the tools needed to analyse a large volume of text and extract useful information from it. This project aims to collect a significant sample of Twitter messages from the candidates of the main political parties standing in the 2015 regional (autonomic) elections in Madrid. These messages, together with the replies from other users, have been analysed using machine learning algorithms and the most suitable text mining techniques. The results obtained for each politician have been examined in depth and presented in tables and graphs to make them easier to understand.

Relevance: 80.00%

Abstract:

Being able to classify accurately the application or program that generates each flow of Internet traffic in a network gives companies and organizations a useful tool for managing network resources, and lets them set policies that prohibit or prioritize specific traffic in order to optimize the available bandwidth. The proliferation of new applications and techniques in recent years has made it harder to identify applications from the well-known values assigned by the IANA (Internet Assigned Numbers Authority), such as UDP and TCP ports. P2P (peer-to-peer) networks, the use of unknown or random ports, and the masking of many applications' traffic as HTTP and HTTPS in order to traverse firewalls and NATs (Network Address Translation), among other factors, create the need for new traffic detection methods.
The aim of this study is to develop a set of practices that classify traffic using techniques that go beyond observing ports and other well-known values. Several methodologies exist. Deep Packet Inspection (DPI) searches for signatures: patterns in the content of the packets, including the payload, that characterize each application. Machine learning methods learn from statistical parameters of the flows and estimate the likelihood that a flow belongs to a given application. Heuristic techniques, finally, rely on the researcher's intuition or own knowledge of network traffic. Specifically, this work proposes using some of these techniques together with data mining techniques, namely Principal Component Analysis (PCA) and clustering, applied to statistics extracted from the flows in network traffic capture files. Configuring these techniques involves several parameters and will require an iterative process of trial and error to reach a reliable traffic classification. The ideal result would place each application present in the traffic in a distinct cluster, or in clusters that group applications of a similar nature.
To do this, traffic captures will be created in a controlled environment, so that each capture is labelled with its corresponding application, and the flows will then be extracted from the captures. Selected parameters of the packets in each flow will be obtained, such as the arrival date and time or the length of the IP packet in octets. These parameters will be loaded into a MySQL database, where each packet is assigned to its flow and each flow to its application, and will be used to compute statistics that describe each flow as well as possible. PCA and clustering will then be applied to these statistics with the RapidMiner software. Finally, the data mining results will be compared with the real classification of the flows stored in the database by means of a confusion matrix, which measures the accuracy of the classification process.
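
The per-flow statistics and the final confusion matrix described above can be sketched as follows. This is a minimal illustration, not the study's implementation: the function names, the chosen statistics and the toy data are all assumptions, and the PCA and clustering steps (done in RapidMiner in the study) are replaced by hard-coded cluster labels.

```python
from collections import Counter, defaultdict
from statistics import mean


def flow_statistics(packets):
    """Summarize one flow given (arrival_time_seconds, ip_length_octets) tuples."""
    times = sorted(t for t, _ in packets)
    lengths = [length for _, length in packets]
    # Inter-arrival gaps between consecutive packets of the same flow.
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "packets": len(packets),
        "mean_length": mean(lengths),
        "mean_interarrival": mean(gaps) if gaps else 0.0,
    }


def confusion_matrix(true_apps, clusters):
    """Count how many flows of each known application landed in each cluster."""
    matrix = defaultdict(Counter)
    for app, cluster in zip(true_apps, clusters):
        matrix[app][cluster] += 1
    return matrix


# Hypothetical toy flows: a bulk transfer and a sparse interactive flow.
flows = {
    "flow-1": [(0.00, 1500), (0.01, 1500), (0.02, 1400)],
    "flow-2": [(0.00, 80), (0.50, 120), (1.00, 100)],
}
stats = {name: flow_statistics(packets) for name, packets in flows.items()}

# Hypothetical ground truth and cluster assignments (the clustering step itself
# is not implemented here).
true_apps = ["video", "video", "chat", "chat", "chat"]
clusters = [0, 0, 1, 1, 0]
matrix = confusion_matrix(true_apps, clusters)

for app, row in matrix.items():
    print(app, dict(row))
```

Each row of the matrix corresponds to a known application and each column to a cluster, so a clean classification shows one dominant cluster per row.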

Relevance: 80.00%

Abstract:

Technological progress in recent years has increased the need to store enormous amounts of data, which has led to disorder in the storage process, to outdated data and to more complex analysis. This situation has generated great interest among organizations in finding an approach for obtaining relevant information from these large data stores. Business intelligence emerged in response: a set of tools, procedures and strategies for carrying out "knowledge extraction", the term commonly used for extracting information that is useful to the organization itself. Specifically, this project uses the Knowledge Discovery in Databases (KDD) approach, which makes it possible to identify patterns and to manage efficiently the anomalies that may appear in a communications network. The approach covers every stage from the selection of the raw data to the final analysis that determines the patterns. The core of the KDD approach is data mining, which provides the technology needed to identify those patterns and extract knowledge. The RapidMiner tool is used for this, in its free version, because it is more complete and easier to use than other tools such as KNIME or WEKA. Network management covers the whole process of deployment, supervision and maintenance. It is in this process that all anomalies occurring in the network are collected and monitored, and they can be stored in a repository. The goal of this project is to develop a theoretical approach, implement a prototype and carry out several experiments that identify patterns in network anomaly records. The MAWI Lab repository, which stores daily anomalies, has been studied with the aim of finding characteristic yearly signs by detecting patterns. The experiments and procedures in this study aim to demonstrate the usefulness of business intelligence for extracting information from a massive data store, for later analysis or future studies.

Relevance: 80.00%

Abstract:

We live in a society in which information has acquired vital importance. The use of the Internet and the development of new information systems have generated keen interest among companies and institutions in finding new patterns that give them the key to success. Business Analytics (BA) brings together tools, strategies and techniques for exploiting information in order to create useful knowledge within a framework and to optimize the resources of companies and institutions. This project falls within what is known as Educational Management. It applies an architecture and working model similar to those used in recent years in the business world under Business Intelligence (BI). The aim is to improve the quality of teaching, speed up decision-making within the academic institution, strengthen the skills of the teaching staff and, ultimately, support student learning. To achieve this, the project follows the stages of Knowledge Discovery in Databases (KDD), one of the best-known methodologies in BI, which describes the procedure from selecting the information and loading it into storage systems to applying data mining techniques to obtain new knowledge. The studies use information on user activity in Moodle, the e-learning platform (Learning Management System, LMS) of the Universidad Politécnica de Madrid. The raw database is extracted and pre-processed, and data mining techniques are then applied. One of the most important factors when applying data mining techniques is the kind of information to be processed. For this reason the work uses Educational Data Mining (EDM), which consists of applying mining techniques optimized for the information generated in educational environments. Among the possibilities EDM offers, the studies focus on predictive analytics. The main objective is to understand the influence of student-platform interactions on final grades and to discover new rules describing behaviours that help teachers tell whether a student is going to pass or fail the course, so that measures can be taken to improve the student's performance. All the information processed in this project was anonymized beforehand to protect the privacy of the participants in the study.

Relevance: 80.00%

Abstract:

Stream mining is a set of cutting-edge techniques designed to process streams of data in real time in order to extract knowledge. In the particular case of classification, stream mining has to adapt its behaviour to volatile underlying data distributions, a phenomenon known as concept drift. Concept drift may lead to situations where predictive models become invalid and therefore have to be updated to represent the actual concepts in the data. In this context there is a specific type of concept drift, known as recurrent concept drift, where the concepts represented by the data have already appeared in the past. In those cases the learning process could be avoided, or at least minimized, by applying a previously trained model. This could be extremely useful in ubiquitous environments characterized by resource-constrained devices. To deal with this scenario, meta-models can be used to enhance the drift detection mechanisms of data stream algorithms by representing and predicting when the change will occur. There are real-world situations where a concept reappears, as in intrusion detection systems (IDS), where the same incidents, or adaptations of them, usually reappear over time. In these environments, early prediction of drift through better knowledge of past models can help to anticipate the change, thus reducing the number of training instances the model needs. Using meta-models as a recurrent drift detection mechanism opens up the ability to share concept representations among different data mining processes. Such exchanges could improve the accuracy of the resulting local model, which may benefit from patterns similar to the local concept that were observed in other scenarios but not yet locally.
They would also make more efficient use of training instances during classification, since exchanging models helps to apply already trained recurrent models that any of the collaborating devices has previously seen. In other words, the scope of recurrence detection and representation is broadened. In fact, the detection, representation and exchange of concept drift patterns would be extremely useful for law enforcement activities fighting cyber crime. Since information exchange is one of the main pillars of cooperation, national units would benefit from the experience and knowledge gained by third parties. Moreover, in the specific field of critical infrastructure protection it is crucial to have information exchange mechanisms, from both a strategic and a technical perspective. The exchange of concept drift detection schemes in cyber security environments would help in preventing, detecting and effectively responding to threats in cyberspace. Furthermore, as a complement to meta-models, a mechanism to assess the similarity between classification models is also needed when dealing with recurrent concepts. In this context, when a previously trained model is reused, a rough comparison between concepts is usually made by applying boolean logic. Introducing fuzzy logic comparisons between models could lead to more efficient reuse of previously seen concepts, by applying not just equal models but also similar ones. This work addresses these open issues by means of: the MMPRec system, which integrates a meta-model mechanism and a fuzzy similarity function; a collaborative environment for sharing meta-models between different devices; and a recurrent drift generator that makes it possible to test the usefulness of recurrent drift systems such as MMPRec. Moreover, this thesis presents an experimental validation of the proposed contributions using synthetic and real datasets.
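
The contrast between a boolean comparison and a graded comparison of models can be sketched as below. This is only an illustration under stated assumptions: the abstract does not describe MMPRec's fuzzy similarity function, so a simple agreement ratio in [0, 1] stands in for it here, and the function names, the 0.8 threshold and the toy models are hypothetical.

```python
def similarity(model_a, model_b, instances):
    """Degree in [0, 1] to which two classifiers agree on a set of instances."""
    if not instances:
        return 0.0
    agreements = sum(1 for x in instances if model_a(x) == model_b(x))
    return agreements / len(instances)


def find_reusable(current, stored_models, instances, threshold=0.8):
    """Name of the most similar stored model at or above the threshold, else None."""
    best_name, best_score = None, threshold
    for name, model in stored_models.items():
        score = similarity(current, model, instances)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Hypothetical toy classifiers over integers; each returns a class label.
def current(x):
    return x > 5


stored_models = {
    "old_a": lambda x: x > 6,  # nearly the same concept as `current`
    "old_b": lambda x: x < 3,  # a different concept
}
instances = list(range(10))

print(similarity(current, stored_models["old_a"], instances))
print(find_reusable(current, stored_models, instances))
```

A strict equality test would reject both stored models, because neither matches the current concept on every instance; the graded score lets the nearly identical one be reused.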

Relevance: 80.00%

Abstract:

This thesis presents the design and application of a methodology for determining the parameters for planning logistics nodes and infrastructure in a territory. It also considers their impact on the different territorial components, as well as on population development, economic development and the environment, and thus represents an advance in integrated territorial planning. The proposed methodology is based on data mining, which allows the discovery of patterns behind large volumes of previously processed data. The characteristics of data about a territory and its components make territorial studies an ideal field for applying some data mining techniques, such as decision trees and Bayesian networks. Decision trees represent and categorize schematically a series of predictor variables that help to analyse a target variable. Bayesian networks represent, in a directed acyclic graph, a probabilistic model of variables arranged as parents and children, together with the statistical inference that determines the probability that a stated hypothesis is true; that is, they allow joint probability models to be built that show graphically the relevant dependencies in a data set. As with decision trees, the division of the territory into administrative units makes Bayesian networks a potential tool for defining the physical characteristics of a specific type of logistics infrastructure, taking into account the territorial, population and economic characteristics of the area where it is planned and the possible synergies with other logistics nodes and infrastructure.
The case study selected for applying the methodology is the Republic of Panama, a country with some unique features: a high concentration of population in Panama City, which in turn has concentrated the country's economic activity; a high percentage of protected areas, which has limited the structuring of the territory; and the Panama Canal and the container ports adjacent to it. The methodology is divided into three main phases. Phase 1, definition of the working scenario: (1) review of the state of the art; (2) determination and collection of the study variables. Phase 2, development of the artificial intelligence model: (3) construction of the decision trees; (4) construction of the Bayesian networks. Phase 3, conclusions: (5) determination of the conclusions.
Applying the methodology to the case study established a model composed of 47 variables that define logistics planning for Panama; once these are known, the remaining variables are defined from them. This planning model, established through the Bayesian network, considers the economic, social and environmental aspects of sustainable planning, which create synergy with the planning of logistics nodes and infrastructure.
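
The parent and child structure of a Bayesian network can be illustrated with the smallest possible case: one parent and one child. The sketch below is not taken from the thesis; the variables (`port`, `hub`) and every probability are made up for illustration. It shows the standard factorization of the joint distribution into P(parent) times P(child | parent), and inference on the parent by enumeration.

```python
# Hypothetical two-node network: port -> hub (parent -> child).
P_PORT = {True: 0.3, False: 0.7}  # P(port)
P_HUB_GIVEN_PORT = {
    True: {True: 0.8, False: 0.2},   # P(hub | port present)
    False: {True: 0.1, False: 0.9},  # P(hub | port absent)
}


def joint(port, hub):
    """P(port, hub) = P(port) * P(hub | port): the network factorization."""
    return P_PORT[port] * P_HUB_GIVEN_PORT[port][hub]


def posterior_port_given_hub(hub=True):
    """P(port = True | hub) obtained by enumerating the joint distribution."""
    evidence = sum(joint(port, hub) for port in (True, False))
    return joint(True, hub) / evidence


print(round(posterior_port_given_hub(True), 3))
```

With more variables the same idea applies: the joint probability is the product, over all nodes, of each node's probability given its parents in the graph.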

Relevance: 80.00%

Abstract:

This dissertation maps the use of information systems theories, using information retrieval techniques and data and text mining methodologies. The theories addressed were Transaction Cost Economics (TCE), the Resource-Based View of the Firm (RBV) and Institutional Theory (IT), chosen because they are highly relevant to studies of investment allocation and implementation in information systems. The data set was the textual content (in English) of the abstracts and theoretical reviews of articles in the journals Information Systems Research (ISR), Management Information Systems Quarterly (MISQ) and Journal of Management Information Systems (JMIS) from 2000 to 2008. The results of the text mining technique combined with data mining were compared with the EBSCO advanced search tool and showed greater efficiency in identifying content. Articles grounded in the three theories represented 10% of all articles in the three journals, and the most prolific publication period was that of 2001 and 2007.

Relevance: 80.00%

Abstract:

The Biomolecular Interaction Network Database (BIND; http://binddb.org) is a database designed to store full descriptions of interactions, molecular complexes and pathways. Development of the BIND 2.0 data model has led to the incorporation of virtually all components of molecular mechanisms including interactions between any two molecules composed of proteins, nucleic acids and small molecules. Chemical reactions, photochemical activation and conformational changes can also be described. Everything from small molecule biochemistry to signal transduction is abstracted in such a way that graph theory methods may be applied for data mining. The database can be used to study networks of interactions, to map pathways across taxonomic branches and to generate information for kinetic simulations. BIND anticipates the coming large influx of interaction information from high-throughput proteomics efforts including detailed information about post-translational modifications from mass spectrometry. Version 2.0 of the BIND data model is discussed as well as implementation, content and the open nature of the BIND project. The BIND data specification is available as ASN.1 and XML DTD.
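
As an illustration of applying graph theory methods to interaction data, the sketch below treats each pairwise interaction as an undirected edge and uses breadth-first search to find the shortest chain of interactions between two molecules. It does not use the BIND data model or its ASN.1/XML formats; the pair-list input, the function names and the molecule identifiers are all hypothetical.

```python
from collections import defaultdict, deque


def build_graph(interactions):
    """Build an undirected adjacency map from (molecule_a, molecule_b) pairs."""
    graph = defaultdict(set)
    for a, b in interactions:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the shortest chain of interactions or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk back through the predecessors to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in sorted(graph.get(node, ())):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical interaction records between placeholder molecules.
interactions = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "E")]
graph = build_graph(interactions)
path = shortest_path(graph, "A", "D")
print(path)
```

The same adjacency structure supports other network queries, such as listing every molecule reachable from a given starting point.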

Relevance: 80.00%

Abstract:

We set out to define patterns of gene expression during kidney organogenesis by using high-density DNA array technology. Expression analysis of 8,740 rat genes revealed five discrete patterns or groups of gene expression during nephrogenesis. Group 1 consisted of genes with very high expression in the early embryonic kidney, many with roles in protein translation and DNA replication. Group 2 consisted of genes that peaked in midembryogenesis and contained many transcripts specifying proteins of the extracellular matrix. Many additional transcripts allied with groups 1 and 2 had known or proposed roles in kidney development and included LIM1, POD1, GFRA1, WT1, BCL2, Homeobox protein A11, timeless, pleiotrophin, HGF, HNF3, BMP4, TGF-α, TGF-β2, IGF-II, met, FGF7, BMP4, and ganglioside-GD3. Group 3 consisted of transcripts that peaked in the neonatal period and contained a number of retrotransposon RNAs. Group 4 contained genes that steadily increased in relative expression levels throughout development, including many genes involved in energy metabolism and transport. Group 5 consisted of genes with relatively low levels of expression throughout embryogenesis but with markedly higher levels in the adult kidney; this group included a heterogeneous mix of transporters, detoxification enzymes, and oxidative stress genes. The data suggest that the embryonic kidney is committed to cellular proliferation and morphogenesis early on, followed sequentially by extracellular matrix deposition and acquisition of markers of terminal differentiation. The neonatal burst of retrotransposon mRNA was unexpected and may play a role in a stress response associated with birth. Custom analytical tools were developed including “The Equalizer” and “eBlot,” which contain improved methods for data normalization, significance testing, and data mining.