By Leonid Shpaner
To the outside world, the lines between data science and data engineering are often blurred. The two disciplines are frequently misconstrued as synonymous, a scenario with a parallel in the medical sciences, where opticians are often confused with optometrists and ophthalmologists. While all three deal with eye care, opticians fit the frames for lenses, whereas optometrists and ophthalmologists set the stage for patient care: examining, diagnosing, and prescribing corrective actions. Only ophthalmologists are medically licensed to perform surgical treatments.
So why the immediate segue into two distinct careers? After all, data is data, and medicine is medicine. Effectively, data engineers are to data scientists what opticians are to ophthalmologists: they create the infrastructure used by data scientists, who in turn use the data to draw inferences, derive meaningful insights, make predictions, and prescribe solutions to a target audience (stakeholders). Data engineers, however, work from the requirements given to them by data scientists, so a robust two-way stream of communication between the two is essential. In 2022, a competitive data engineer will not only speak the language of a data scientist but will also understand how data gets stored, its structure and function (purpose), and how to ingest it into a global environment.
Amazon Web Services (AWS)
Familiarity with the framework of a cloud computing environment is an essential skill. Data engineers and data scientists with access to Amazon Web Services (AWS) can load data (e.g., flat .CSV files) into an S3 bucket for storage, use Amazon Redshift and/or Athena to create databases, prepare the data for transformation, and query it using Structured Query Language (SQL). A data scientist well versed in the principles and practices of data engineering will be adept at taking the data a step further by ingesting it into Amazon SageMaker (a cloud-based Jupyter Notebook environment), where the creation of databases and SQL transformations can commence. This is made possible by importing the requisite Python packages into the environment: Pandas for data analysis and ingestion (i.e., reading .CSV files, running SQL queries, storing data frames, etc.) and the Boto3 software development kit for working with S3 buckets. For context and clarity, a Jupyter Notebook is part and parcel of a larger live interactive development environment (JupyterLab) that can easily accommodate data science projects in various programming languages, most notably Python.
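As a minimal sketch of this ingestion workflow, the snippet below reads a flat CSV into a Pandas data frame and transforms it with a SQL query. The CSV contents and table name are hypothetical, and a local in-memory SQLite database stands in for the cloud warehouse; in a real SageMaker notebook, the file would be fetched from an S3 bucket with Boto3 instead.

```python
import io
import sqlite3
import pandas as pd

# Hypothetical CSV contents; in practice this would be downloaded
# from an S3 bucket using the Boto3 SDK rather than built in memory.
csv_data = io.StringIO(
    "movie_id,title,rating\n"
    "1,Example One,4.5\n"
    "2,Example Two,3.8\n"
)

# Ingest the flat file into a data frame.
df = pd.read_csv(csv_data)

# Load the frame into a SQL table, then transform it with a query
# (SQLite stands in here for a cloud warehouse such as Redshift).
conn = sqlite3.connect(":memory:")
df.to_sql("movies", conn, index=False)
top = pd.read_sql("SELECT title FROM movies WHERE rating > 4.0", conn)
print(top["title"].tolist())
```

The same read-then-query pattern carries over to Athena or Redshift; only the connection object and storage layer change.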
Similarly, Microsoft Azure is a cloud-based solution that allows users to load data into Blob Storage, much like S3 storage on AWS, and subsequently access it from Synapse Studio, SageMaker's rough notebook-like equivalent. From there, data scientists can run their EDA, pre-processing, and predictive models commensurate with the scale, complexity, and needs of the project. It is most important to understand, however, that data science and data engineering are not mutually exclusive, regardless of the framework or architecture being used; one directly depends on the other. The data engineer ingests the data (e.g., multiple .CSV files) into a storage medium, defines the table schema in an organized and logical manner, and then combines the data into one uniform, readable file format. This is the basic idea. Now we can pack it in, call it a day, and go to the beach, right? Not quite. Data is not always static, and if it were that easy, there would be a long line of newly minted data engineers filling any vast enterprise at annual salaries in the hundreds of thousands of dollars. When data is constantly updated on the fly, a pipeline must be crafted to accommodate these regular updates. Data engineers can use tools like dbt in conjunction with warehouses (e.g., Google BigQuery) to architect automated solutions for extracting, transforming, and loading the data.
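The step of combining multiple files under a defined schema can be sketched in a few lines of Pandas. The weekly file names and column types below are hypothetical, and a temporary directory stands in for cloud storage:

```python
import tempfile
from pathlib import Path
import pandas as pd

# Hypothetical weekly extracts; in practice these would land in
# cloud storage (S3 or Blob Storage) rather than a temp directory.
tmp = Path(tempfile.mkdtemp())
(tmp / "week_01.csv").write_text("user_id,views\n1,10\n2,7\n")
(tmp / "week_02.csv").write_text("user_id,views\n1,4\n3,12\n")

# Define a uniform schema so every file is read consistently.
schema = {"user_id": "int64", "views": "int64"}

# Combine the files into one uniform, readable format.
frames = [
    pd.read_csv(path, dtype=schema) for path in sorted(tmp.glob("week_*.csv"))
]
combined = pd.concat(frames, ignore_index=True)
combined.to_csv(tmp / "combined.csv", index=False)
```

In an automated pipeline, a scheduler or a tool like dbt would rerun this kind of consolidation each time a new weekly file arrives.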
Successful data scientists, though not fully or regularly immersed in the daily practices of data engineers, should know (or become well versed in) the language of data engineering: pipelines, ETL, and of course, SQL. ETL means Extracting the data (i.e., moving it from a SQL server to a staging area), Transforming it (cleaning the data), and Loading it into a data warehouse. The global COVID-19 pandemic has led to a shift from on-site work toward distributed teams dedicated to handling on-premises and cloud-based infrastructures alike. Whether the end user logs in to AWS or Microsoft Azure or uses a VPN to remotely access a server, data scientists must be able to effectively communicate their organization's project requirements, scope, and breadth to their data engineers. How else can the engineers build a pipeline, let alone understand a schema? Conversely, how can data scientists understand the limitations faced by data engineers? Is the schema set up correctly? Can the right kinds of joins be performed on the data? The data scientist should be able to examine the data's viability.
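The extract, transform, and load steps described above can be sketched end to end; in this sketch an in-memory SQLite database stands in for both the source SQL server and the warehouse, and every table and column name is hypothetical:

```python
import sqlite3
import pandas as pd

# Extract: pull raw rows from a source system (an in-memory SQLite
# database stands in for a production SQL server here).
source = sqlite3.connect(":memory:")
source.executescript(
    """
    CREATE TABLE ratings (user_id INTEGER, movie TEXT, score REAL);
    INSERT INTO ratings VALUES (1, 'Example One', 4.0),
                               (1, 'Example One', 4.0),  -- duplicate row
                               (2, 'Example Two', NULL); -- missing score
    """
)
raw = pd.read_sql("SELECT * FROM ratings", source)

# Transform: clean the data by removing duplicates and null scores.
clean = raw.drop_duplicates().dropna(subset=["score"])

# Load: write the cleaned table into the warehouse.
warehouse = sqlite3.connect(":memory:")
clean.to_sql("ratings_clean", warehouse, index=False)
```

A well-designed schema at the warehouse stage is what makes the "right kinds of joins" possible later; if keys are missing or inconsistent, the data scientist's downstream queries will suffer.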
A Simple Case Study
For example, a company that deals with weekly movie recommendations may decide to adopt a batch prediction pipeline, and a data engineer goes into AWS and selects an ml.m2.medium instance type. The data scientist then explains to the data engineer that, for the same money, the company can use the larger ml.m5.large instance type. While the smaller instance may be what the vendor recommends, understanding cost and its relationship to computational complexity reaches beyond the immediate needs of the project. It is the kind of frugal, efficient, and, let it be proclaimed, necessary thinking that scales and transforms organizations into twenty-first-century, idea-centric powerhouses.
Now that the symbiotic relationship between data scientists and data engineers has been introduced, it is worth making one final attempt at expanding on these comparisons. Data engineers prepare the framework within which data scientists conduct meaningful analyses. Engineers are the builders of analytical systems; data scientists are the adopters, analysts, and statisticians who use those systems to predict outcomes that, once deployed, provide scalable business solutions across a wide array of enterprises and their use cases.
The University of San Diego’s Master of Science in Applied Data Science program provides practical courses on data engineering and cloud computing that allow students to hone their skills in these areas.