07 Sep, 2020 | Alan Chu | 4 min read

Data Lake and Security Considerations

An Aberdeen survey (Oct 2017) reported that organisations that implemented a Data Lake outperformed similar companies by 9% in organic revenue growth.


What is a Data Lake?

A Data Lake is a storage repository that holds large amounts of structured, semi-structured, and unstructured data. It is a place to store every type of data in its native format, with no fixed limits on account size or file size. By making large quantities of data available in one place, a Data Lake can improve the analytic performance of enterprise applications.

With a Data Lake, users can gain insight and make better business decisions, using big data processing, real-time analytics, and machine learning.


Traditionally, organisations have built data warehouses, where a considerable investment is made in analysing data sources, understanding business processes and profiling data. The result is a highly structured data model designed for reporting. A large part of this process involves deciding what data to include in and exclude from the warehouse. Generally, if data is not used to answer specific questions or feed a defined report, it may be excluded from the warehouse.

With cloud technology, servers have become a commodity and storage very cheap. It is no longer prohibitive to store terabytes of data per day and build petabyte-scale repositories. It has become affordable to build repositories that store data from various sources in a given domain, even where the purpose or value of the data has not yet been clearly defined. It is against this backdrop of changing IT infrastructure economics that the Data Lake approach has emerged.

To further clarify the concept, here are a few characteristics of the Data Lake approach contrasted with the data warehouse approach. In a Data Lake, all business data is stored regardless of source or structure. In a data warehouse, by contrast, data is extracted from transactional systems and consists of quantitative metrics and the attributes that describe them. A Data Lake can additionally store information such as web server logs, sensor data, social network activity, and text and images from a website. The data is kept in raw form, ready to be transformed, and its value may not yet be determined at the time it is stored.

A Data Lake adapts to change easily. In a data warehouse, for example, a considerable investment is required to bring in a new source of data to fulfil a business need, because of the complexity of the data loading process and the work required to make analysis and reporting easy. In a Data Lake, since all data is stored in its raw form and is always accessible to anyone who needs it, people are empowered to go beyond the structure of the warehouse and explore data in novel ways. Analysts can rapidly explore new applications of data and run experiments before deploying them for use by broader audiences in the business.

A Data Lake supports all users equally. In a data warehouse, for example:

  1. the operations staff uses a reporting application built on top of the data warehouse;
  2. the data analyst builds reports based on the data warehouse; and
  3. the data scientist builds new data sources using statistical and analytic tools based on the available data sources.

In a Data Lake, all users access the same underlying data at the same level, and different views are built on top of it to support different applications of the data.
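As a concrete sketch of this idea in AWS, the snippet below (Python with boto3) defines a purpose-specific view over the same raw data using Athena, so that a reporting application and a data science team can each work from a view suited to their needs. The database, table, column and result-bucket names are hypothetical and only illustrate the pattern.

import boto3

athena = boto3.client("athena")

# Hypothetical database, table and output location, for illustration only.
create_view_sql = """
CREATE OR REPLACE VIEW sales_reporting AS
SELECT order_id, order_date, total_amount
FROM raw_orders
WHERE order_status = 'COMPLETED'
"""

# The view is defined over the raw data in the Data Lake, so the
# underlying objects are never copied or modified.
athena.start_query_execution(
    QueryString=create_view_sql,
    QueryExecutionContext={"Database": "data_lake_db"},
    ResultConfiguration={"OutputLocation": "s3://example-query-results/"},
)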

The challenges of building and deploying a Data Lake

  1. Making data available to many more people within the business creates a need for new skills to analyse and make sense of that data.
  2. With a traditional database management system, the information security team might handle network security and access control but do little with the data once it enters the system. Data Lakes do not come with all of the governance capabilities and policies associated with a traditional database management system, so new security measures must be put in place to address internal and external threats.

How we addressed the challenge of securing a Data Lake

AWS Data Lake

By default, resources provisioned in the cloud are not accessible to external entities. The system is configured so that only authorised computing resources are granted permission to access data stored in the Data Lake.

For example, in the AWS cloud the Data Lake is an S3 bucket that is queried by a serverless query service called AWS Athena. Athena is in turn accessed by the PowerBI application via a data gateway. We apply the principle of least privilege, granting each resource only the access it needs to perform its job. In the AWS cloud, IAM policies are used to manage access permissions for resources such as AWS Lambda, AWS Glue, AWS Athena and SageMaker.
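To make the principle of least privilege concrete, the sketch below (Python with boto3) attaches a minimal IAM policy to an analytics user, granting read-only access to a single Data Lake bucket plus permission to run Athena queries and nothing more. The bucket and user names are hypothetical, and the policies used in practice are more fine-grained than this.

import json
import boto3

iam = boto3.client("iam")

# Hypothetical names used for illustration only.
DATA_LAKE_BUCKET = "example-data-lake-bucket"
ANALYTICS_USER = "example-analytics-user"

# Least-privilege policy: read-only access to the Data Lake bucket
# plus permission to run Athena queries, and nothing else.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{DATA_LAKE_BUCKET}",
                f"arn:aws:s3:::{DATA_LAKE_BUCKET}/*",
            ],
        },
        {
            "Effect": "Allow",
            "Action": ["athena:StartQueryExecution", "athena:GetQueryResults"],
            "Resource": "*",
        },
    ],
}

iam.put_user_policy(
    UserName=ANALYTICS_USER,
    PolicyName="data-lake-read-only",
    PolicyDocument=json.dumps(policy_document),
)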

Although controls are in place to manage who can see and access data in the Data Lake, it is also important to ensure that anyone who inadvertently or maliciously gains access to the underlying data assets cannot read or use them. The Data Lake is configured so that data assets are encrypted at rest, and encrypted connections (HTTPS, SSL, TLS, FTPS) are used for data in transit.
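The sketch below (Python with boto3) shows how these two protections can be configured in AWS: default server-side encryption with a KMS key for data at rest, and a bucket policy that rejects any request not made over TLS for data in transit. The bucket name and key alias are hypothetical.

import json
import boto3

s3 = boto3.client("s3")
BUCKET = "example-data-lake-bucket"  # hypothetical bucket name

# Encrypt data at rest: apply default server-side encryption with a KMS key.
s3.put_bucket_encryption(
    Bucket=BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "alias/example-data-lake-key",  # hypothetical key alias
                }
            }
        ]
    },
)

# Protect data in transit: deny any request that does not use TLS.
tls_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                f"arn:aws:s3:::{BUCKET}",
                f"arn:aws:s3:::{BUCKET}/*",
            ],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }
    ],
}
s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(tls_only_policy))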

Audit trails are enabled in the Data Lake to record who has accessed the Data Lake, what data was accessed, and when the access events took place. This allows the organisation to monitor usage of the Data Lake and detect unusual usage patterns, and an alarm is raised if a security breach has occurred.
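One way to capture such an audit trail in AWS is with CloudTrail data events, sketched below with boto3. The trail and bucket names are hypothetical; alarms on unusual activity would typically be layered on top of these logs.

import boto3

cloudtrail = boto3.client("cloudtrail")

# Hypothetical names used for illustration only.
TRAIL_NAME = "data-lake-audit-trail"
LOG_BUCKET = "example-audit-log-bucket"        # bucket that stores the audit logs
DATA_LAKE_BUCKET = "example-data-lake-bucket"  # the Data Lake being audited

# Create a trail that writes its audit logs to a separate bucket.
cloudtrail.create_trail(Name=TRAIL_NAME, S3BucketName=LOG_BUCKET)

# Record object-level (data) events for the Data Lake bucket, so every
# read and write of a data asset is captured: who, what and when.
cloudtrail.put_event_selectors(
    TrailName=TRAIL_NAME,
    EventSelectors=[
        {
            "ReadWriteType": "All",
            "IncludeManagementEvents": True,
            "DataResources": [
                {
                    "Type": "AWS::S3::Object",
                    "Values": [f"arn:aws:s3:::{DATA_LAKE_BUCKET}/"],
                }
            ],
        }
    ],
)

cloudtrail.start_logging(Name=TRAIL_NAME)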

The data gateway is the server used to access the Data Lake, and it is protected at the network layer by placing it in a private subnet. Access to the data gateway from the outside world is restricted to a single jump box (bastion host) over a secure connection (SSH or RDP). A NAT gateway is provisioned so that the data gateway can send outgoing traffic to the internet, for example for operating system updates, while inbound traffic from the internet remains blocked.
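A minimal sketch of these network controls with boto3 is shown below, assuming hypothetical security group, subnet and route table IDs. Inbound access to the data gateway is limited to the bastion host's security group, and a NAT gateway gives the private subnet outbound-only internet access.

import boto3

ec2 = boto3.client("ec2")

# Hypothetical resource IDs used for illustration only.
GATEWAY_SG_ID = "sg-0aaaexamplegateway"      # security group of the data gateway
BASTION_SG_ID = "sg-0bbbexamplebastion"      # security group of the jump box
PRIVATE_ROUTE_TABLE_ID = "rtb-0cccexample"
PUBLIC_SUBNET_ID = "subnet-0dddexample"
ELASTIC_IP_ALLOCATION_ID = "eipalloc-0eeeexample"

# Allow inbound RDP to the data gateway only from the bastion host.
ec2.authorize_security_group_ingress(
    GroupId=GATEWAY_SG_ID,
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 3389,
            "ToPort": 3389,
            "UserIdGroupPairs": [{"GroupId": BASTION_SG_ID}],
        }
    ],
)

# A NAT gateway in a public subnet lets the private subnet reach the
# internet for operating system updates while blocking unsolicited
# inbound traffic from the internet.
nat = ec2.create_nat_gateway(
    SubnetId=PUBLIC_SUBNET_ID,
    AllocationId=ELASTIC_IP_ALLOCATION_ID,
)
ec2.create_route(
    RouteTableId=PRIVATE_ROUTE_TABLE_ID,
    DestinationCidrBlock="0.0.0.0/0",
    NatGatewayId=nat["NatGateway"]["NatGatewayId"],
)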

Conclusion

While the Data Lake provides several advantages over the data warehouse, it should be viewed as a complement rather than a replacement. Organisations use data warehouses to effectively support the processing of structured information such as financial transactions, CRM and ERP data. The Data Lake outperforms the data warehouse when it comes to handling high volumes of unstructured data that are difficult to model, such as social media, web server logs, and sensor data. An organisation should deploy a suitable combination of both technologies to extract maximum value from its data assets.

The security of data is a critical risk factor for today's heavily data-centric organisations. With the use of cloud technology, protecting data through isolation is no longer an option, and simply adding more security tools is not sufficient to contain the risks associated with today's evolving threat environment. It is critical that an organisation's information systems are architected with security considered at every level of their design.
