Case Study overview
We will outline and explain in detail the implementation of a framework built on top of Spark that enables agile, iterative data discovery across legacy systems and new data sources generated by IoT devices.
The internet of things (IoT) is certainly bringing new challenges for data practitioners. It is basically a network of connected objects (the “things”) that can communicate and exchange information. The concept is not new; however, there is a significant shift in the list of objects that could be part of this network. Very soon, it will include nearly every object on the planet. A few applications are already emerging around the internet of things, such as smart cities, smart agriculture, domotics and home automation, and smart industrial control.
The data* for our case study contain various types of information, such as:
• Call Detail Records (CDRs) generated by the Telecom Italia cellular network over the Province of Trento.
• Measurements of temperature, precipitation, and wind speed/direction taken at 36 weather stations.
• Amount of current flowing through the electrical grid of the Trentino province at specific instants.
Framework overview
The primary focus of the framework is not to collect data from the devices in real time, although it can be extended to support such a capability thanks to Spark.
The ultimate goal of the framework approach is to speed up the development effort required to ingest new data sources and to reduce time to market for data consumers. It aims to avoid rewriting new scripts for every new data source and enables a team of data engineers to easily collaborate on a project using the same core engine.
Framework architecture & technologies
The framework is composed of two independent modules and one administration module that contains common functions. The choice has been made to create independent modules, as they serve very different logical functions:
- Data collection: get the various data sources as they become available and store the data in HDFS.
- Data integration: apply transformations to the data to extract meaningful information and store the results in Hive.
Framework Modules
Administration module
The administration module contains common functions that are not directly related to data collection and ingestion, such as installation scripts, generic scripts to handle log messages, and global properties files.
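A minimal sketch of what such shared helpers could look like, assuming a simple key=value properties file; the file name framework.properties and the function names are illustrative, not part of the actual framework code:

```python
import logging

def get_logger(name):
    """Return a logger with a consistent format shared by all modules."""
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(name)s %(levelname)s %(message)s")
    return logging.getLogger(name)

def load_properties(path="framework.properties"):
    """Parse a simple key=value properties file into a dict of global settings."""
    props = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, value = line.split("=", 1)
                props[key.strip()] = value.strip()
    return props
```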
Data collection module
This module is designed to collect data from various data sources as they become available and store them in the Hadoop file system (HDFS).
The data sources to be ingested are described in a data catalog that contains each data source's properties, such as its type (local, AWS, API, etc.) and connection properties, as sketched below.
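For illustration only, the data catalog could be represented as a list of entries like the following, with the collection module dispatching on the source type; the field names, source names, and dispatch logic are assumptions rather than the framework's actual schema:

```python
# Hypothetical data catalog: one entry per data source to be collected.
data_catalog = [
    {"name": "telecom_cdr", "type": "local",
     "path": "/landing/cdr/", "format": "csv"},
    {"name": "weather_stations", "type": "api",
     "url": "https://example.org/weather", "format": "json"},
]

def collect(source):
    """Route each catalog entry to a collector based on its type."""
    if source["type"] == "local":
        print(f"copying {source['path']} into the HDFS hub")
    elif source["type"] == "api":
        print(f"calling {source['url']} and landing the response in the HDFS hub")
    elif source["type"] == "aws":
        print(f"pulling {source['name']} from S3 into the HDFS hub")

for source in data_catalog:
    collect(source)
```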
Data integration module
The data ingestion is realized in two steps:
Step 1: Collect all data available in the HDFS hub, convert them to Parquet, and save them in a staging folder. The goal of this step is to build a source-agnostic representation of the data: Spark does not really need to know whether the data was originally in CSV, TXT, JSON, or another format. The Parquet format has been chosen to enable fast execution of Spark queries.
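A minimal PySpark sketch of this step, assuming the collection module has already landed the raw files under a hub directory; the paths and the hard-coded source list below stand in for what would normally come from the data catalog:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("staging-conversion").getOrCreate()

# Hypothetical excerpt of the data catalog: (source name, raw format).
sources = [("cdr", "csv"), ("weather", "json")]

for name, fmt in sources:
    raw = (spark.read
           .format(fmt)
           .option("header", "true")        # only meaningful for csv/txt sources
           .option("inferSchema", "true")
           .load(f"/data/hub/{name}/"))
    # Persist a source-agnostic, columnar copy for fast downstream Spark queries.
    raw.write.mode("overwrite").parquet(f"/data/staging/{name}/")
```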
This framework concept aims to avoid rewriting new scripts for every new data source that becomes available. Spark has been identified as a good candidate, as it offers the ability to execute various workloads and supports multiple types of computation.
The framework has been designed with flexibility in mind. The current two core modules are independent and can be replaced or customized without impacting the whole solution.
In addition, it enables a team of data engineers to easily collaborate on a project using the same core engine. Each member can focus on writing their own templates and updating configuration files, and everyone benefits from evolutions of the core modules.
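As a hedged illustration of this template-plus-configuration split, an engineer might contribute a small transformation function and a matching configuration entry, while the core engine handles reading from staging and writing to Hive; the column names, paths, and table name below are assumptions, not part of the actual framework:

```python
from pyspark.sql import SparkSession, DataFrame
import pyspark.sql.functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

def weather_daily_template(df: DataFrame) -> DataFrame:
    """Example template: average temperature per station and day."""
    return (df.groupBy("station_id", F.to_date("timestamp").alias("day"))
              .agg(F.avg("temperature").alias("avg_temperature")))

# Hypothetical configuration entry mapping a staging path to a template
# and a target Hive table (the "iot" database is assumed to exist).
job = {"input": "/data/staging/weather/",
       "template": weather_daily_template,
       "hive_table": "iot.weather_daily"}

result = job["template"](spark.read.parquet(job["input"]))
result.write.mode("overwrite").saveAsTable(job["hive_table"])
```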
Many companies are looking to enrich their conventional data warehouse or data platform with new sources of information that usually come in a variety of formats. With the rise of the internet of things, this concern will certainly become more and more important, as companies will want to ingest and derive value from these new datasets, and they need to do it in an agile way.
A framework is definitely the way to go. The key benefits are: