An Aberdeen survey (Oct 2017) reported that organisations that implemented a Data Lake outperformed similar companies by 9% in organic revenue growth.
A Data Lake is a storage repository that holds large amounts of structured, semi-structured, and unstructured data. It is a place to store every type of data in its native format, with no fixed limits on account or file size. Data Lakes make large quantities of data available in one place in order to increase the analytic performance of enterprise applications.
With a Data Lake, users can gain insight and make better business decisions, using big data processing, real-time analytics, and machine learning.
Traditionally, organisations build data warehouses, making a considerable investment in analysing data sources, understanding business processes, and profiling data. The result is a highly structured data model designed for reporting. A large part of this process involves deciding which data to include in the warehouse and which to exclude. Generally, if data isn’t used to answer specific questions or appear in a defined report, it may be excluded from the warehouse.
With cloud technology, servers have become a commodity and storage has become very cheap. It is no longer prohibitive to store terabytes of data per day and build petabyte-scale repositories. It has become affordable to build repositories that store data from various sources in a given domain even where the purpose or value of the data has not yet been clearly defined. It is within this landscape of changing IT infrastructure economics that the Data Lake approach has emerged.
To further clarify the concept, here are a few characteristics of the Data Lake approach contrasted with the data warehouse approach. In a Data Lake, all business data is stored regardless of source or structure. In a data warehouse, by contrast, data is extracted from transactional systems and consists of quantitative metrics and the attributes that describe them. In a Data Lake, additional information can be stored, such as web server logs, sensor data, social network activity, and text and images from a website. The data is stored in raw form, ready to be transformed, and its value may not yet be determined at the time it is stored.
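As a minimal sketch of what storing data in raw form can look like in practice, the following Python snippet uses boto3 to land both a structured transactional event and an unstructured web server log in the lake. The bucket name and key layout are hypothetical; a "raw" prefix partitioned by source and date is a common convention, not a requirement.

```python
import json

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket name; replace with your own Data Lake bucket.
BUCKET = "example-data-lake"

# A structured event from a transactional system, kept in its native JSON form.
event = {"order_id": 1234, "amount": 99.95, "currency": "GBP"}
s3.put_object(
    Bucket=BUCKET,
    Key="raw/orders/2017/10/01/order-1234.json",
    Body=json.dumps(event).encode("utf-8"),
)

# An unstructured web server log is stored as-is; no schema is imposed at
# write time, so the value of the data can be decided later.
with open("access.log", "rb") as f:
    s3.put_object(Bucket=BUCKET, Key="raw/weblogs/2017/10/01/access.log", Body=f)
```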
A Data Lake adapts to change easily. For example, in a data warehouse, a considerable investment is required to develop a new source of data to fulfil a business need, owing to the complexity of the data loading process and the work required to make analysis and reporting easy. In a Data Lake, since all data is stored in its raw form and is always accessible to anyone who needs it, people are empowered to go beyond the structure of the warehouse and explore data in novel ways. Analysts can rapidly explore new applications of data and run experiments before deploying them for use by broader audiences in the business.
A Data Lake supports all users equally. In a data warehouse, different groups of users are served differently: most consume prepared reports and dashboards, a smaller group of analysts performs deeper analysis, and the few users who want to ask questions the data model was not designed for are often constrained by its structure. In a Data Lake, all users access the data at the same level, and different views over the same raw data are used to support different applications of the data.
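As an illustration of serving different applications through views over the same raw data, here is a hedged sketch that uses boto3 to create an Athena view. The database, table, view, and output location names are all hypothetical.

```python
import boto3

athena = boto3.client("athena")

# A view shaped for one team's use of the data; the underlying raw table
# remains shared and unchanged. All names below are illustrative.
DDL = """
CREATE OR REPLACE VIEW marketing_weblogs AS
SELECT request_time, url, referrer
FROM weblogs_raw
WHERE status_code = 200
"""

athena.start_query_execution(
    QueryString=DDL,
    QueryExecutionContext={"Database": "datalake"},
    ResultConfiguration={"OutputLocation": "s3://example-data-lake/athena-results/"},
)
```

Each team can be given its own view (or set of views) without any data being copied or reshaped in the underlying storage.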
By default, the resources provisioned in the cloud are not accessible to any external entity. The system is configured so that only authorised computing resources are granted permission to access data stored in the Data Lake.
For example, in the AWS cloud, the Data Lake is an S3 bucket that is queried through Amazon Athena, an interactive query service. Athena is in turn accessed by the Power BI application via a data gateway. We apply the principle of least privilege by granting each resource only the access it needs to perform its job. In the AWS cloud, IAM user policies are used to manage access permissions for resources such as AWS Lambda, AWS Glue, Amazon Athena, and Amazon SageMaker.
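To make the least-privilege idea concrete, the following sketch creates an IAM policy that grants read-only access to a single prefix of the Data Lake bucket: just enough for a reporting workload, and nothing more. The bucket, prefix, and policy names are illustrative assumptions, not part of the original design.

```python
import json

import boto3

iam = boto3.client("iam")

# Read-only access, limited to the "curated" prefix of a hypothetical bucket.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-data-lake/curated/*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::example-data-lake",
            "Condition": {"StringLike": {"s3:prefix": ["curated/*"]}},
        },
    ],
}

iam.create_policy(
    PolicyName="DataLakeCuratedReadOnly",
    PolicyDocument=json.dumps(policy_document),
)
```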
Although there are controls in place to manage who can see and access data in the Data Lake, it is also important to ensure that anyone who inadvertently or maliciously gains access to those data assets cannot read or use them. The Data Lake is therefore configured so that data assets are encrypted at rest, and encrypted connections (HTTPS, SSL/TLS, FTPS) are used for data in transit.
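A minimal sketch of both controls, assuming a hypothetical bucket named example-data-lake: the first call enables default encryption at rest (SSE-S3), and the bucket policy denies any request that does not arrive over an encrypted (TLS) connection.

```python
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "example-data-lake"  # hypothetical bucket name

# Encrypt all new objects at rest with S3-managed keys (SSE-S3).
s3.put_bucket_encryption(
    Bucket=BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
    },
)

# Deny any request that is not made over an encrypted (TLS) connection.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                f"arn:aws:s3:::{BUCKET}",
                f"arn:aws:s3:::{BUCKET}/*",
            ],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }
    ],
}
s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```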
Audit trails are enabled in the Data Lake to record who accessed it, what data was accessed, and when the access events took place. This allows the organisation to monitor usage of the Data Lake, detect unusual usage patterns, and raise an alarm if a security breach has occurred.
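One way to realise such an audit trail on AWS is CloudTrail. The sketch below records management events plus object-level reads and writes on the Data Lake bucket. The trail name and logging bucket are hypothetical, and the logging bucket must already exist and carry the standard CloudTrail bucket policy.

```python
import boto3

cloudtrail = boto3.client("cloudtrail")

# Hypothetical trail and logging bucket names.
cloudtrail.create_trail(
    Name="data-lake-audit-trail",
    S3BucketName="example-audit-logs",
)

# Record object-level reads and writes on the Data Lake bucket,
# in addition to management events.
cloudtrail.put_event_selectors(
    TrailName="data-lake-audit-trail",
    EventSelectors=[
        {
            "ReadWriteType": "All",
            "IncludeManagementEvents": True,
            "DataResources": [
                {
                    "Type": "AWS::S3::Object",
                    "Values": ["arn:aws:s3:::example-data-lake/"],
                }
            ],
        }
    ],
)

cloudtrail.start_logging(Name="data-lake-audit-trail")
```

The resulting log events can themselves be queried (for example with Athena) to look for unusual access patterns, and alarms can be attached to suspicious ones.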
The data gateway is the server used to access the Data Lake, and it is protected at the network layer by placing it in a private subnet. Access to the data gateway from the outside world is restricted to a single jump box (bastion host) over a secure connection (SSH or RDP). A NAT gateway is provisioned to let the data gateway send outgoing traffic to the internet, for example to fetch operating system updates, while blocking all inbound traffic from the internet.
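As a sketch of the network-layer control, the following boto3 call adds an ingress rule to the data gateway's security group that accepts SSH only from the bastion host's security group; no rule admits traffic from the public internet. The security group IDs are placeholders.

```python
import boto3

ec2 = boto3.client("ec2")

# Placeholder security group IDs.
GATEWAY_SG = "sg-0123456789abcdef0"  # data gateway's security group
BASTION_SG = "sg-0fedcba9876543210"  # bastion host's security group

# Allow SSH into the data gateway only from instances in the bastion's group.
ec2.authorize_security_group_ingress(
    GroupId=GATEWAY_SG,
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 22,
            "ToPort": 22,
            "UserIdGroupPairs": [{"GroupId": BASTION_SG}],
        }
    ],
)
```

A route in the private subnet's route table pointing 0.0.0.0/0 at the NAT gateway completes the outbound-only path to the internet.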
While the Data Lake provides several advantages over the data warehouse, it should be viewed as a complement rather than a replacement. Organisations use data warehouses to effectively support processing of structured information such as financial transactions and CRM and ERP data. The Data Lake outperforms the data warehouse when it comes to handling high volumes of unstructured data that are difficult to model, such as social media activity, web server logs, and sensor data. An organisation should deploy a suitable combination of both technologies to extract maximum value from its data assets.
The security of data is a critical risk factor for today’s heavily data-centric organisations. With the use of cloud technology, protecting data through isolation is no longer an option, and simply adding more security tools is not sufficient to contain the risks of an evolving threat environment. It is critical that the organisation’s information systems are architected with security considered at every level of the design.