Before a model is built, before the data is cleaned and made ready for exploration, even before the role of a data scientist begins – this is where data engineers come into the picture. Every data-driven business needs to have a framework in place for the data science pipeline, otherwise it’s a setup for failure.
Most people enter the data science world with the aim of becoming a data scientist, without ever realizing what a data engineer is, or what that role entails. Data engineers are a vital part of any data science project, and demand for them is growing rapidly in the current data-rich environment.
There is currently no coherent or formal path available for data engineers. Most folks in this role got there by learning on the job, rather than following a detailed route. My aim in writing this article is to help anyone who wants to become a data engineer but doesn’t know where to start or where to find study resources.
In this article, I have put together a list of things every aspiring data engineer needs to know. First, we’ll see what a data engineer is and how the role differs from a data scientist. Then, we’ll move on to the core skills you should have in your skillset before being considered a good fit for the role. I have also mentioned some industry-recognized certifications you should consider.
Right, let’s dive right into it.
Table of Contents
- So, what is a Data Engineer?
- The Difference between a Data Scientist and a Data Engineer
- The Different Roles in Data Engineering
- Core Data Engineering Skills and Resources to Learn Them
- Basic Language Requirements
- In-Depth Database Knowledge
- Data Warehousing/Big Data Skills
- Hadoop and MapReduce
- Hive and PIG
- Apache Spark
- Courses with a mixture of the above frameworks
- Data Engineering Certifications
So, what is a Data Engineer?
A data engineer is responsible for building and maintaining the data architecture of a data science project. These engineers have to ensure that there is an uninterrupted flow of data between servers and applications. Some of the responsibilities of a data engineer include improving foundational data procedures, integrating new data management technologies and software into the existing system, and building data collection pipelines, among various other things.
One of the most sought-after skills in data engineering is the ability to design and build data warehouses. This is where all the raw data is collected, stored and retrieved from. Without data warehouses, the tasks that a data scientist does become either too expensive or too difficult to scale.
ETL (Extract, Transform, and Load) are the steps which a data engineer follows to build the data pipelines. ETL is essentially a blueprint for how the collected raw data is processed and transformed into data ready for analysis.
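As a rough illustration of the three ETL steps, here is a minimal Python sketch that uses only the standard library. The sample CSV, the `users` table and the cleaning rules are all made up for this example; a real pipeline would pull from actual sources and load into a proper data warehouse rather than an in-memory SQLite database.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from files, APIs or logs.
RAW_CSV = """user_id,signup_date,country
1,2018-01-05,us
2,2018-01-06,IN
3,,de
"""


def extract(raw_text):
    """Extract: parse the raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: drop rows with no signup date and upper-case country codes."""
    cleaned = []
    for row in rows:
        if not row["signup_date"]:
            continue  # incomplete record, not ready for analysis
        cleaned.append(
            (int(row["user_id"]), row["signup_date"], row["country"].upper())
        )
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into a table that analysts can query."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users "
        "(user_id INTEGER PRIMARY KEY, signup_date TEXT, country TEXT)"
    )
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(raw_text, conn):
    """Run the three steps in order and return the loaded rows."""
    load(transform(extract(raw_text)), conn)
    return conn.execute(
        "SELECT user_id, signup_date, country FROM users ORDER BY user_id"
    ).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print(run_etl(RAW_CSV, connection))
```

Keeping each step in its own function makes it easy to test the transformation rules separately from the source and the destination.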
Data engineers usually come from engineering backgrounds. Unlike data science, this role does not require much academic or scientific background. Developers or engineers who are interested in building large-scale structures and architectures are ideally suited to thrive in this role.
The Difference between a Data Scientist and a Data Engineer
It is important to know the distinction between these two roles. Broadly speaking, a data scientist builds models using a combination of statistics, mathematics, machine learning and domain-based knowledge. They have to code and build these models using the tools, languages and frameworks that the organization supports.
A data engineer on the other hand has to build and maintain data structures and architectures for data ingestion, processing, and deployment for large-scale data-intensive applications. To build a pipeline for data collection and storage, to funnel the data to the data scientists, to put the model into production – these are just some of the tasks a data engineer has to perform.
For any large scale data science project to succeed, data scientists and data engineers need to work hand-in-hand. Otherwise things can go wrong very quickly!
To learn more about the difference between these 2 roles, head over to our detailed infographic here.
The Different Roles in Data Engineering
- Data Architect: A data architect lays down the foundation for data management systems to ingest, integrate and maintain all the data sources. This role requires knowledge of tools like SQL, XML, Hive, Pig, Spark, etc.
- Database Administrator: As the name suggests, a person working in this role requires extensive knowledge of databases. Responsibilities entail ensuring the databases are available to all the required users, are maintained properly and function without any hiccups when new features are added.
- Data Engineer: The master of the lot. A data engineer, as we’ve already seen, needs to have knowledge of database tools, languages like Python and Java, distributed systems like Hadoop, among other things. It’s a combination of tasks into one single role.
Core Data Engineering Skills and Resources to Learn Them
- Introduction to Data Engineering
- Basic Language Requirement: Python
- Solid Knowledge of Operating Systems
- Heavy, In-Depth Database Knowledge – SQL and NoSQL
- Data Warehousing – Hadoop, MapReduce, HIVE, PIG, Apache Spark, Kafka
- Basic Machine Learning Familiarity
Introduction to Data Engineering
It’s essential to first understand what data engineering actually is, before diving into the different facets of the role. What are the different functions a data engineer performs day-to-day? What do the top technology companies look for in a data engineer? Are you expected to know just about everything under the sun or just enough to be a good fit for a specific role? My aim is to provide you an answer to these questions (and more) in the resources below.
A Beginner’s Guide to Data Engineering (Part 1): A very popular post on data engineering from a data scientist at Airbnb. The author first explains why data engineering is such a critical aspect of any machine learning project, and then deep dives into the various components of this subject. I consider this a compulsory read for all aspiring data engineers AND data scientists.
A Beginner’s Guide to Data Engineering (Part 2): Continuing on from the above post, part 2 looks at data modeling, data partitioning, Airflow, and best practices for ETL.
A Beginner’s Guide to Data Engineering (Part 3): The final part of this amazing series looks at the concept of a data engineering framework. Throughout the series, the author keeps relating the theory to practical concepts at Airbnb, and that trend continues here. A truly exquisitely written series of articles.
O’Reilly’s Suite of Free Data Engineering E-Books: O’Reilly is known for their excellent books, and this collection is no exception to that. Except, these books are free! Scroll down to the ‘Big Data Architecture’ section and check out the books there. Some of these require a bit of knowledge regarding Big Data infrastructure, but these books will help you get acquainted with the intricacies of data engineering tasks.
Basic Language Requirement: Python
While other programming languages are widely used in data engineering (like Java and Scala), we’ll be focusing on Python in this article. The industry has clearly shifted towards Python, and its adoption is growing rapidly. It’s become an essential part of a data engineer’s (and a data scientist’s) skillset.
There are tons of resources online to learn Python. I have mentioned a few of them below.
A complete tutorial to learn Data Science with Python from Scratch: This article by Kunal Jain covers a list of resources you can use to begin and advance your Python journey. A must-read resource.
Introduction to Data Science using Python: This is Analytics Vidhya’s most popular course that covers the basics of Python. We additionally cover core statistics concepts and predictive modeling methods to solidify your grasp on Python and basic data science.
Codecademy’s Learn Python course: This course assumes no prior knowledge of programming. It starts from the absolute basics of Python and is a good starting point.
If you prefer learning through books, below are a couple of free ebooks to get you started:
Think Python by Allen Downey: A comprehensive go-through of the Python language. Perfect for newcomers and even non-programmers.
Non-Programmer’s Tutorial for Python 3: As the name suggests, it’s a perfect starting point for folks coming from a non-IT background or a non-technical background. There are plenty of examples in each chapter to test your knowledge.
Solid Knowledge of Operating Systems
A key cog in the entire data science machine, operating systems are what make the pipelines tick. A data engineer is expected to know the ins and outs of infrastructure components, such as virtual machines, networks, application services, etc. How well versed are you with server management? Do you know Linux well enough to navigate around different configurations? How familiar are you with access control methods? These are just some of the questions you’ll face as a data engineer.
Linux Server Management and Security: This Coursera offering is designed for folks looking to understand how Linux works in the enterprise. The course is divided into 4 weeks (and a project at the end) and covers the basics well enough.
CS401: Operating Systems: As comprehensive a course as any around operating systems. This contains nine sections dedicated to different aspects of an operating system. The primary focus is on UNIX-based systems, though Windows is covered as well.
Raspberry Pi Platform and Python Programming for the Raspberry Pi: A niche topic, for sure, but the demand for this one is off the charts these days. This course aims to make you familiar with the Raspberry Pi environment and get you started with basic Python code on the Raspberry Pi.
In-Depth Database Knowledge
To become a data engineer, you need a very strong grasp of database languages and tools. This is another very basic requirement. You need to be able to collect, store and query information from these databases in real time. There are tons of databases available, but I have listed resources for the ones that are most widely used in the industry today. These are divided into SQL and NoSQL databases.
Learn SQL for Free: Another Codecademy entry, where you can learn the absolute basics of SQL. Topics like manipulation, queries, aggregate functions and multiple tables are covered from the ground up. If you’re completely new to this field, there are not many better places to kick things off.
Quick SQL Cheatsheet: An ultra helpful GitHub repository with regularly updated SQL queries and examples. Ensure you star/bookmark this repository as a reference point anytime you quickly need to check a command.
MySQL Tutorial: MySQL was created over two decades ago, and still remains a popular choice in the industry. This resource is a text-based tutorial, presented in an easy-to-follow manner. The cool thing about this site is that practical examples with SQL scripts (and screenshots) accompany each topic.
Learn Microsoft SQL Server: This text tutorial explores SQL Server concepts starting from the basics to more advanced topics. Concepts have been explained using code and detailed screenshots.
PostgreSQL Tutorial: An incredibly detailed guide to get you started and well acquainted with PostgreSQL. The tutorial has been divided into 16 sections so you can imagine how well this subject has been covered.
Oracle Live SQL: Who better to learn Oracle’s SQL database than the creators themselves? The platform is really well designed and makes for a great end user experience. You can view scripts and tutorials to get your feet wet, and then start coding on the same platform. Sounds awesome!
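To give a flavour of the SQL topics these tutorials cover (queries, aggregate functions and joins across multiple tables), here is a small sketch using Python’s built-in sqlite3 module. The `customers` and `orders` tables and their contents are invented purely for illustration, and other databases such as MySQL or PostgreSQL may differ slightly in syntax.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two small, made-up tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customers(id),
            amount REAL
        );
        INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben');
        INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 35.5), (3, 2, 10.0);
        """
    )
    return conn


def total_spend_per_customer(conn):
    """Join the two tables and aggregate order amounts per customer."""
    query = """
        SELECT c.name, COUNT(o.id) AS n_orders, SUM(o.amount) AS total
        FROM customers AS c
        JOIN orders AS o ON o.customer_id = c.id
        GROUP BY c.id, c.name
        ORDER BY total DESC
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for name, n_orders, total in total_spend_per_customer(build_demo_db()):
        print(name, n_orders, total)
```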
MongoDB from MongoDB: This is currently the most popular NoSQL Database out there. And as with the Oracle training mentioned above, MongoDB is best learned from the masters themselves. I have linked their entire course catalogue here, so you can pick and choose which trainings you want to take.
Introduction to MongoDB: This course will get you up and running with MongoDB quickly, and teach you how to leverage its power for data analytics. It’s a short three-week course but has plenty of exercises to make you feel like an expert by the time you’re finished!
Learn Cassandra: If you’re looking for an excellent text-based and beginner-friendly introduction to Cassandra, this is the perfect resource. Topics like Cassandra’s architecture, installation, key operations, etc. are covered here. The tutorial also has dedicated chapters to explain the data types and collections available in CQL and how to make use of user-defined data types.
Redis Enterprise: There are not many resources out there to learn about Redis Databases, but this one site is enough. There are multiple courses and beautifully designed videos to make the learning experience engaging and interactive. And it’s free!
Google Bigtable: For a Google offering, there are surprisingly few resources available to learn how Bigtable works. I have linked a Coursera course that includes plenty of Google Cloud topics but you can scroll down and select Bigtable (or BigQuery). I would, however, recommend going through the full course as it provides valuable insights into how Google’s entire Cloud offerings work.
Couchbase: Multiple trainings are available here (scroll down to see the free trainings), and they range from beginner to advanced. If Couchbase is your organization’s database of choice, this is where you’ll learn everything about it.
Data Warehousing/Big Data Tools
Distributed file systems like Hadoop (HDFS) can be found in any data engineer job description these days. It’s a common role requirement and one you should be intimately familiar with. Apart from that, you need to gain an understanding of platforms and frameworks like Apache Spark, Hive, PIG, Kafka, etc. I have listed the resources for all these topics in this section.
Hadoop and MapReduce
Hadoop Fundamentals: This is essentially a learning path for Hadoop. It includes 5 courses that will give you a solid understanding of what Hadoop is, the architecture and components that define it, how to use it, its applications and a whole lot more.
Hadoop Starter Kit: This is a really good and comprehensive free course for anyone looking to get started with Hadoop. It includes topics like HDFS, MapReduce, Pig and HIVE with free access to clusters for practising what you’ve learned.
Hortonworks Tutorials: As one of the main contributors to Hadoop, Hortonworks has a well-respected set of courses for learning various things related to Hadoop. From beginner to advanced, this page has a very comprehensive list of tutorials. Ensure you check this out.
Introduction to MapReduce: Before reading this article, you need to have some basic knowledge of how Hadoop works. Once done, come back and take a deep dive into the world of MapReduce.
Hadoop Beyond Traditional MapReduce – Simplified: This article covers an overview of the Hadoop ecosystem that goes beyond simply MapReduce.
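If you want a feel for the MapReduce idea before working through these resources, the classic word count can be sketched in plain Python. This is only a single-machine toy, not how Hadoop is actually invoked: the function names are my own, and a real cluster would run the map and reduce steps in parallel across many nodes and handle the shuffle for you.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by key, between map and reduce."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce over a list of text documents."""
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big pipelines", "data pipelines"]))
```

Because each document is mapped independently and each key is reduced independently, both phases can be spread over many machines, which is what makes the model suitable for very large datasets.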
Prefer books? No worries, I have you covered! Below are a few free ebooks that cover Hadoop and its components.
Hadoop Explained: A basic introduction to the complicated world of Hadoop. It gives a high-level overview of how Hadoop works, its advantages, applications in real-life scenarios, among other things.
Hadoop: What you Need to Know: This one is on similar lines to the above book. As the description says, the book covers just about enough to ensure you can make informed and intelligent decisions about Hadoop.
Data-Intensive Text Processing with MapReduce: This free ebook covers the basics of MapReduce, its algorithm design, and then deep dives into examples and applications you should know about. It’s recommended that you take the above courses first before reading this book.
You should also join the Hadoop LinkedIn group to keep yourself up-to-date and to ask any queries you might have.
Apache Spark
Comprehensive Guide to Apache Spark, RDDs and Dataframes (using PySpark): This is the ultimate article to get you started with Apache Spark. It covers the history of Apache Spark, how to install it using Python, RDD/Dataframes/Datasets and then rounds off by solving a machine learning problem. A must-read guide.
Step by Step Guide for Beginners to Learn SparkR: In case you are an R user, this one is for you! You can of course use Spark with R and this article will be your guide.
Spark Fundamentals: This course covers the basics of Spark, its components, how to work with them, interactive examples of using Spark, introduction to various Spark libraries and finally understanding the Spark cluster. What more could you ask for from one course?
Introduction to Apache Spark and AWS: This is a practical, hands-on course. You will work with data from Project Gutenberg, a large open collection of ebooks. You will need knowledge of Python and the Unix command line to extract the most out of this course.
Courses covering Hadoop, Spark, HIVE and Spark SQL
Big Data Essentials: HDFS, MapReduce and Spark RDD: This course takes real-life datasets to teach you basic Big Data technologies – HDFS, MapReduce and Spark. It’s a typical Coursera course – detailed, filled with examples and useful datasets, and taught by excellent instructors.
Big Data Analysis: Hive, Spark SQL, DataFrames and GraphFrames: MapReduce and Spark only partially tackle the issue of working with Big Data. Learn high-level tools with this intuitive course, where you’ll build up your knowledge of Hive and Spark SQL, among other things.
Big Data Applications: Real-Time Streaming: One of the challenges of working with enormous amounts of data is not just having the computational power to process it, but doing so as quickly as possible. Applications like recommendation engines require real-time data processing, and storing and querying this amount of data requires knowledge of systems like Kafka, Cassandra and Redis, which this course provides. But to take this course, you need a working knowledge of Hadoop, Hive, Python, Spark and Spark SQL.
Kafka
Simplifying Data Pipelines with Apache Kafka: Get the lowdown on what Apache Kafka is, its architecture and how to use it. You need a basic understanding of Hadoop, Spark and Python to truly gain the most from this course.
Kafka’s Official Documentation: This is an excellent intuitive introduction to how Kafka works and the various components that go toward making it work. This page also includes a nice explanation of what a distributed streaming platform is.
Putting the Power of Kafka into the Hands of Data Scientists: Not quite a learning resource per se, but a very interesting and detailed article on how data engineers at Stitch Fix built a platform tailored to the requirements of their data scientists.
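To make the core Kafka idea from the resources above a little more concrete, here is a toy, in-memory model of a single topic: producers append messages to a log, and each consumer group keeps track of its own read position (offset). This is emphatically not the Kafka API; the `ToyTopic` class and its methods are hypothetical and exist only to illustrate the concept, while a real deployment is distributed, partitioned and persisted to disk.

```python
class ToyTopic:
    """In-memory stand-in for one topic: an append-only list of messages."""

    def __init__(self):
        self._log = []       # messages in the order they were produced
        self._offsets = {}   # consumer group name -> next offset to read

    def produce(self, message):
        """Append a message to the log and return its offset."""
        self._log.append(message)
        return len(self._log) - 1

    def consume(self, group, max_messages=10):
        """Return unread messages for a group and advance its offset."""
        start = self._offsets.get(group, 0)
        batch = self._log[start:start + max_messages]
        self._offsets[group] = start + len(batch)
        return batch


if __name__ == "__main__":
    topic = ToyTopic()
    topic.produce("page_view:home")
    topic.produce("page_view:cart")
    print(topic.consume("analytics"))  # both messages
    topic.produce("purchase:42")
    print(topic.consume("analytics"))  # only the new message
    print(topic.consume("audit"))      # a new group reads from the start
```

Because messages are never removed when they are read, several independent consumers can process the same stream at their own pace.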
Basic Machine Learning Familiarity
While machine learning is primarily considered the domain of a data scientist, a data engineer needs to be well versed with certain techniques as well. Why, you ask? Getting models into production and making pipelines for data collection or generation need to be streamlined, and these require at least a basic understanding of machine learning algorithms.
Machine Learning Basics for a Newbie: A superb introduction to the world of machine learning by Kunal Jain. The aim of the article is to do away with all the jargon you’ve heard or read about. The guide cuts straight to the heart of the matter, and you end up appreciating that style of writing.
Essentials of Machine Learning Algorithms: This is an excellent article that provides a high-level understanding of various machine learning algorithms. It includes an implementation of these techniques in R and Python as well – a perfect place to start your journey.
Must-Read Books for Beginners on Machine Learning and Artificial Intelligence: If books are more to your taste, then check out this article! This is a collection of the best of the best, so even if you read only a few of these books, you’ll have gone a long way towards your dream career.
24 Ultimate Data Science Projects to Boost your Knowledge and Skills: Once you’ve acquired a certain amount of knowledge and skill, it’s always highly recommended to put your theoretical knowledge into practice. Check out these datasets, ranked in order of their difficulty, and get your hands dirty.
Data Engineering Certifications
Google’s data engineering certification is one of the premier data engineering certifications available today. To earn this certification, you need to successfully clear a challenging 2-hour multiple-choice exam. You can find the general outline of what to expect on this link. Also available are links to get hands-on practice with Google Cloud technologies. Ensure you check this out!
To attain IBM’s certification, you need to pass one exam – this one. The exam contains 54 questions, of which you have to answer 44 correctly. I recommend going through what IBM expects you to know before you sit for the exam. The exam link also contains further links to study materials you can refer to for preparing yourself.
Cloudera’s certification is another globally recognized one, and a pretty challenging one for a newcomer. Your concepts need to be up-to-date and in-depth, and you should have some hands-on experience with data engineering tools like Hadoop, Oozie, AWS Sandbox, etc. But if you clear this exam, you are looking at a very promising start in this field of work!
Cloudera has mentioned that it would help if you took their training for Apache Spark and Hadoop since the exam is heavily based on these two tools.