Posted in Information Technology

Popular Programming Languages

 

With over 300 standardized coding languages, where do you start? The programming languages you learn will depend on your career goals. Here are a handful of the most popular programming languages and how they’re used most often.

C

C is a general-purpose coding language, originally created for Unix systems. It’s commonly used in cross-platform systems, Unix coding and game coding. It’s often selected because it is a smaller language than C++ and produces fast, compact programs. It’s the second most common programming language following Java. C is the grandfather of many other coding languages including C#, Java, JavaScript, Perl, PHP and Python.

C++

C++ is an intermediate-level, object-oriented coding language. It derives from C, but it has add-ons and enhancements that make it a more multifaceted coding language. It’s well suited to large projects, as the code can be broken up into parts, enabling easy collaboration. It’s used by some of the world’s best-known tech companies including Adobe, Google, Mozilla, and Microsoft.

Objective-C

Like most of the coding languages on this list, Objective-C derives from C. It’s a general-purpose, high-level language that adds message passing to C. It was the coding language of choice for Apple’s OS X and iOS apps until Swift began to replace it.

Java

Java is currently the most popular and widely used language in the world. Though it was originally created for interactive TV, it’s become known as the language of choice for Android devices. It’s also the coding language of choice for enterprise-level software. It’s a good multi-purpose coding language because it can be used cross-platform (meaning it’s just as easily used on smartphone apps as on desktop apps). It resembles C++ in syntax and structure, making it easy to pick up if you already know C-family languages.

JavaScript

JavaScript was created as an add-on code to extend the functionality of web pages. It adds dynamic features such as submission forms, interactivity, animations, user-tracking, etc. It’s mostly used for front-end development, or for coding solutions that customers and clients interact with. It’s compatible with all browsers, making it a good general-purpose web development code, though it’s also known to be difficult to debug.

Data Engineering

 

From helping cars drive themselves to helping Facebook tag you in photos, data science has attracted a lot of buzz recently. Data scientists have become extremely sought after, and for good reason — a skilled data scientist can add incredible value to a business. But what about data engineers? Who are they, and what do they do?

A data scientist is only as good as the data they have access to. Most companies store their data in a variety of formats across databases and text files. This is where data engineers come in: they build pipelines that transform that data into formats that data scientists can use. Data engineers are just as important as data scientists, but they tend to be less visible because they work further from the end product of the analysis.

A good analogy is a race car builder vs a race car driver. The driver gets the excitement of speeding along a track, and the thrill of victory in front of a crowd. But the builder gets the joy of tuning engines, experimenting with different exhaust setups, and creating a powerful, robust machine. If you’re the type of person that likes building and tweaking systems, data engineering might be right for you. In this post, we’ll explore the day to day of a data engineer, and discuss the skills required for the role.

The Data Engineer Role

The data science field is incredibly broad, encompassing everything from cleaning data to deploying predictive models. However, it’s rare for any single data scientist to be working across the spectrum day to day. Data scientists usually focus on a few areas, and are complemented by a team of other scientists and analysts.

Data engineering is also a broad field, but any individual data engineer doesn’t need to know the whole spectrum of skills. In this section, we’ll sketch the broad outlines of data engineering, then walk through more specific descriptions that illustrate specific data engineering roles.

A data engineer transforms data into a useful format for analysis. Imagine that you’re a data engineer working on a simple competitor to Uber called Rebu. Your users have an app on their device through which they access your service. They request a ride to a destination through your app, which gets routed to a driver, who then picks them up and drops them off. After the ride, they’re charged, and have the option to rate their driver.

In order to maintain a service like this, you need:

  • A mobile app for users
  • A mobile app for drivers
  • A server that can pass requests from users to drivers, and handle other details like updating payment information

Here’s a diagram showing the communication:

[Diagram: the user app and the driver app communicate through the server]

As you may expect, this kind of system will generate huge amounts of data. You’ll have a few different data stores:

  • The database that backs your main app. This contains user and driver information.
  • Server analytics logs
    • Server access logs. These contain one line per request made to the server from the app.
    • Server error logs. These contain all the server-side errors generated by your app.
  • App analytics logs
    • App event logs. These contain information about what actions users and drivers took in the app. For example, you’d log when they clicked a button or updated their payment information.
    • App error logs. These contain information about errors in the app.
  • Ride database. This contains information about a single ride for a user/driver pair, along with status information on the ride.
  • Customer service database. This contains information about customer interactions by customer service agents. It can include voice transcripts and email logs.

Here’s an updated diagram showing the data sources:

[Diagram: the same system with its data stores added]

Let’s say a data scientist wants to analyze a user’s action history with your service, and see what actions correlate with users who spend more. To enable this analysis, you’ll need to combine information from the server access logs and the app event logs. You’ll need to:

  • Gather app analytics logs from user devices regularly
  • Combine the app analytics logs with any server log entries that reference the user
  • Create an API endpoint that returns the event history of any user

In order to solve this, you’ll need to create a pipeline that can ingest mobile app logs and server logs in real-time, parse them, and attach them to a specific user. You’ll then need to store the parsed logs in a database, so they can easily be queried by the API. You’ll need to spin up several servers behind a load balancer to process the incoming logs.
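As a rough illustration of the parse-and-attach step, here is a minimal sketch in Python. The two log formats, the field names, and the function names are assumptions made up for this example (the post does not specify Rebu's formats), and a real pipeline would read from a stream and write to a database rather than work on in-memory lists.

```python
import json
from collections import defaultdict

# Hypothetical formats, assumed for illustration only:
#   server access log: "<ISO timestamp> <user_id> <HTTP method> <path>"
#   app event log:     one JSON object per line with "ts", "user_id", "event"


def parse_server_line(line):
    """Turn one server access log line into a normalized event dict."""
    ts, user_id, method, path = line.split()
    return {"ts": ts, "user_id": user_id, "source": "server",
            "detail": f"{method} {path}"}


def parse_app_line(line):
    """Turn one JSON app event line into a normalized event dict."""
    record = json.loads(line)
    return {"ts": record["ts"], "user_id": record["user_id"],
            "source": "app", "detail": record["event"]}


def build_user_histories(server_lines, app_lines):
    """Attach every parsed event to its user, ordered by timestamp."""
    histories = defaultdict(list)
    for line in server_lines:
        event = parse_server_line(line)
        histories[event["user_id"]].append(event)
    for line in app_lines:
        event = parse_app_line(line)
        histories[event["user_id"]].append(event)
    for events in histories.values():
        # ISO 8601 timestamps in one timezone sort correctly as strings.
        events.sort(key=lambda e: e["ts"])
    return dict(histories)


server_lines = [
    "2018-01-01T08:00:05 u1 POST /rides",
    "2018-01-01T08:01:00 u2 GET /profile",
]
app_lines = [
    '{"ts": "2018-01-01T08:00:01", "user_id": "u1", "event": "tap_request_ride"}',
]

histories = build_user_histories(server_lines, app_lines)
for event in histories["u1"]:
    print(event["ts"], event["source"], event["detail"])
```

An API endpoint that returns a user's event history could then be a thin lookup over whatever store holds these per-user lists.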

Most of the issues that you’ll run into will be around reliability and distributed systems. For example, if you have millions of devices to gather logs from, and variable demand (in the morning, you get a ton of logs, but not as many at midnight), you’ll need a system that can automatically scale your server count up and down.

Running servers behind a load balancer. Servers are registered with the load balancer, and the load balancer sends traffic to them based on how busy they are. This means servers can be added or removed as needed.
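To make the registration idea concrete, here is a toy sketch of a load balancer that sends each request to the least-busy registered server. The `LoadBalancer` class and the server names are hypothetical, and "busy" is assumed to mean the number of in-flight requests; production load balancers use richer health and load signals.

```python
class LoadBalancer:
    """Toy load balancer: routes each request to the least-busy server."""

    def __init__(self):
        # server name -> number of in-flight requests
        self.active = {}

    def register(self, server):
        """Add a server to the pool (for example, when scaling up)."""
        self.active.setdefault(server, 0)

    def deregister(self, server):
        """Remove a server from the pool (for example, when scaling down)."""
        self.active.pop(server, None)

    def route(self):
        """Pick the server with the fewest in-flight requests."""
        if not self.active:
            raise RuntimeError("no servers registered")
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def finish(self, server):
        """Mark one request on the given server as completed."""
        if self.active.get(server, 0) > 0:
            self.active[server] -= 1


lb = LoadBalancer()
lb.register("log-worker-1")
lb.register("log-worker-2")
first = lb.route()
second = lb.route()
lb.register("log-worker-3")  # a new server joins during the morning peak
third = lb.route()
print(first, second, third)
```

Because servers only need to register and deregister, an autoscaler can change the server count without the clients noticing.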

Roughly, the operations in a data pipeline consist of the following phases:

  • Ingestion — this involves gathering in the needed data.
  • Processing — this involves processing the data to get the end results you want.
  • Storage — this involves storing the end results for fast retrieval.
  • Access — you’ll need to enable a tool or user to access the end results of the pipeline.

A data pipeline — input data is transformed in a series of phases into output data.
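The four phases can be sketched as plain functions chained together. This is only an illustration under simple assumptions: the input is a list of hypothetical "user_id,event" lines, the "database" is a Python dict, and the function names are chosen for this example.

```python
def ingest(raw_lines):
    """Ingestion: gather the raw records, skipping blank lines."""
    return [line.strip() for line in raw_lines if line.strip()]


def process(lines):
    """Processing: count events per user from "user_id,event" lines."""
    counts = {}
    for line in lines:
        user_id, _event = line.split(",", 1)
        counts[user_id] = counts.get(user_id, 0) + 1
    return counts


def store(results, database):
    """Storage: keep the results keyed by user for fast retrieval."""
    database.update(results)
    return database


def access(database, user_id):
    """Access: let a tool or user read the end results."""
    return database.get(user_id, 0)


raw = ["u1,open_app\n", "\n", "u1,request_ride\n", "u2,open_app\n"]
db = store(process(ingest(raw)), {})
print(access(db, "u1"))
```

Keeping each phase as a separate step makes it easier to scale or replace one phase without touching the others.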

Finding bad quality rides

For a more complex example, imagine that a data scientist wants to build a system that finds all rides that ended prematurely due to app or driver issues. One way to do this is to look at the customer service database to see which rides ended with issues, and analyze their language along with some data about the ride.

Before the data scientist can do this, they need a way to match up the logs in the customer service database with specific rides. As a data engineer, you’ll want to create an API endpoint that allows the data scientist to query for all customer service messages related to a particular ride. In order to do this, you’ll need to:

  • Create a system that pulls data from the ride database, and figures out information about the ride, such as how long it was, and whether the destination matched the user’s initial request.
  • Combine the computed statistics on each ride with user information, such as name and user id.
  • Extract error information from the app and server analytics logs pertaining to the user during the time period of the ride.
  • Find all customer service queries by a user.
  • Create some heuristic to match rides with customer service queries (a simple example is that a customer service query is always about the previous ride)
  • Store values as needed to ensure that the API performs quickly, even for future rides.
  • Create an API that returns all customer service messages related to a particular ride.

A skilled data engineer will be able to build a pipeline that performs each of the above steps every time a new ride is added. This will ensure that the data served by the API is always up to date, and that whatever analysis the data scientist does is valid.
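The matching heuristic from the list above (a customer service query is always about the user's previous ride) could look roughly like this. The record fields, sample data, and function names are hypothetical, and a real system would read from the ride and customer service databases instead of in-memory lists.

```python
from datetime import datetime

# Hypothetical records, assumed for illustration only.
rides = [
    {"ride_id": "r1", "user_id": "u1", "ended_at": datetime(2018, 1, 1, 9, 0)},
    {"ride_id": "r2", "user_id": "u1", "ended_at": datetime(2018, 1, 2, 18, 30)},
    {"ride_id": "r3", "user_id": "u2", "ended_at": datetime(2018, 1, 2, 12, 0)},
]
queries = [
    {"user_id": "u1", "created_at": datetime(2018, 1, 1, 9, 20),
     "text": "Driver ended the ride early"},
    {"user_id": "u1", "created_at": datetime(2018, 1, 3, 8, 0),
     "text": "App crashed before drop-off"},
    {"user_id": "u2", "created_at": datetime(2018, 1, 2, 11, 0),
     "text": "How do I update my card?"},
]


def match_query_to_ride(query, rides):
    """Heuristic: a query is about the user's most recent finished ride."""
    candidates = [
        ride for ride in rides
        if ride["user_id"] == query["user_id"]
        and ride["ended_at"] <= query["created_at"]
    ]
    if not candidates:
        return None  # no earlier ride, so the query stays unmatched
    return max(candidates, key=lambda ride: ride["ended_at"])["ride_id"]


def messages_by_ride(queries, rides):
    """Build the lookup an API could serve: ride id -> related messages."""
    result = {}
    for query in queries:
        ride_id = match_query_to_ride(query, rides)
        if ride_id is not None:
            result.setdefault(ride_id, []).append(query["text"])
    return result


lookup = messages_by_ride(queries, rides)
print(lookup)
```

Precomputing this lookup whenever a ride or query is added is one way to keep the API fast, at the cost of some extra storage.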

Data engineering skills

A data engineer needs to be good at:

  • Architecting distributed systems
  • Creating reliable pipelines
  • Combining data sources
  • Architecting data stores
  • Collaborating with data science teams and building the right solutions for them

Note that we didn’t mention any tools above. Although tools like Hadoop and Spark and languages like Scala and Python are important to data engineering, it’s more important to understand the concepts well and know how to build real-world systems. We’ll continue this focus on concepts over tools throughout this series on data engineering.

Data Engineering Roles

Although data engineers need to have the skills listed above, the day to day of a data engineer will vary depending on the type of company they work for. Broadly, you can classify data engineers into a few categories:

  • Generalist
  • Pipeline-centric
  • Database-centric

Let’s go through each one of these categories.

Generalist

A generalist data engineer typically works on a small team. Without a data engineer, data analysts and scientists don’t have anything to analyze, making a data engineer a critical first member of a data science team.

When a data engineer is the only data-focused person at a company, they usually end up having to do more end-to-end work. For example, a generalist data engineer may have to do everything from ingesting the data to processing it to doing the final analysis. This requires more data science skill than most data engineers have. However, it also requires less systems architecture knowledge — small teams and companies don’t have a ton of users, so engineering for scale isn’t as important. This is a good role for a data scientist who wants to transition into data engineering.

When our hypothetical Uber competitor, Rebu, is small, a data engineer might be asked to create a dashboard that shows the number of rides taken for each day in the past month, along with a forecast for the next month.
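The numbers behind such a dashboard can be sketched in a few lines. The sample ride dates are made up, and the forecast here is a deliberately naive stand-in (the average of the most recent days), chosen only to show the shape of the task; a real forecast would use a proper time-series model.

```python
from collections import Counter
from datetime import date


def rides_per_day(ride_dates):
    """Count rides for each day, ordered by date."""
    return dict(sorted(Counter(ride_dates).items()))


def naive_forecast(daily_counts, window=3):
    """Naive assumption: a future day looks like the recent average."""
    recent = list(daily_counts.values())[-window:]
    if not recent:
        return 0.0
    return sum(recent) / len(recent)


# Hypothetical ride dates pulled from the ride database.
ride_dates = [
    date(2018, 1, 1), date(2018, 1, 1),
    date(2018, 1, 2), date(2018, 1, 2), date(2018, 1, 2),
    date(2018, 1, 3), date(2018, 1, 3), date(2018, 1, 3), date(2018, 1, 3),
]

counts = rides_per_day(ride_dates)
forecast = naive_forecast(counts)
print(counts)
print(forecast)
```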

Pipeline-centric

Pipeline-centric data engineers tend to be necessary in mid-sized companies that have complex data science needs. A pipeline-centric data engineer will work with teams of data scientists to transform data into a useful format for analysis. This entails in-depth knowledge of distributed systems and computer science.

As Rebu grows, a pipeline-centric data engineer might be asked to create a tool that enables data scientists to query metadata about rides to use in a predictive algorithm.

Database-centric

A database-centric data engineer is focused on setting up and populating analytics databases. This involves some work with pipelines, but more work with tuning databases for fast analysis and creating table schemas. This involves ETL work to get data into warehouses. This type of data engineer is usually found at larger companies with many data analysts that have their data distributed across databases.

After Rebu takes over the world, a database-centric data engineer might design an analytics database, then create scripts to pull information from the main app database into the analytics database.

A data warehouse takes in data, then makes it easy for others to query it.
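A script that pulls information from the main app database into an analytics database might look roughly like the sketch below. It uses two in-memory SQLite databases as stand-ins; the table names, columns, and sample rows are assumptions for this example, and here the extract and transform steps are done together in one SQL query.

```python
import sqlite3

# Stand-in for the main app database (hypothetical schema and rows).
app_db = sqlite3.connect(":memory:")
app_db.execute(
    "CREATE TABLE rides ("
    "id INTEGER PRIMARY KEY, user_id TEXT, fare_cents INTEGER, status TEXT)"
)
app_db.executemany(
    "INSERT INTO rides (user_id, fare_cents, status) VALUES (?, ?, ?)",
    [
        ("u1", 1250, "completed"),
        ("u1", 800, "completed"),
        ("u2", 2300, "completed"),
        ("u2", 0, "cancelled"),
    ],
)

# Stand-in for the analytics database, with a table shaped for analysis.
analytics_db = sqlite3.connect(":memory:")
analytics_db.execute(
    "CREATE TABLE user_spend ("
    "user_id TEXT PRIMARY KEY, ride_count INTEGER, total_fare_cents INTEGER)"
)


def run_etl(source, target):
    """Extract completed rides, aggregate per user, load into analytics."""
    rows = source.execute(
        "SELECT user_id, COUNT(*), SUM(fare_cents) FROM rides "
        "WHERE status = 'completed' GROUP BY user_id"
    ).fetchall()
    with target:  # commits on success, rolls back on error
        target.executemany(
            "INSERT OR REPLACE INTO user_spend VALUES (?, ?, ?)", rows
        )
    return len(rows)


loaded = run_etl(app_db, analytics_db)
result = analytics_db.execute(
    "SELECT user_id, ride_count, total_fare_cents "
    "FROM user_spend ORDER BY user_id"
).fetchall()
print(loaded, result)
```

Using `INSERT OR REPLACE` keyed on `user_id` means the script can be rerun on a schedule without creating duplicate rows.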

Data Engineering Skills

In this post, we covered data engineering and the skills needed to practice it at a high level. If you’re interested in architecting large-scale systems, or working with huge amounts of data, then data engineering is a good field for you.

It can be very exciting to see your autoscaling data pipeline suddenly handle a traffic spike, or get to work with machines that have terabytes of RAM. There’s satisfaction in building a robust system that can work for months or years with minimal tweaking.

Because data engineering is about learning to deal with scale and efficiency, it can be hard to find good practice material on your own. But don’t give up hope — it’s very possible to learn data engineering on your own and get a job in the field.

We’ve recently launched our new interactive Data Engineering Path at Dataquest, designed to teach you the skills you need to become a data engineer. If you’re interested, you can sign up and do the first mission of every course for free.

Want to Become a Data Engineer? Here’s a Comprehensive List of Resources to get Started

https://www.analyticsvidhya.com/blog/2018/11/data-engineer-comprehensive-list-resources-get-started/
Introduction

Before a model is built, before the data is cleaned and made ready for exploration, even before the role of a data scientist begins – this is where data engineers come into the picture. Every data-driven business needs to have a framework in place for the data science pipeline, otherwise it’s a setup for failure.

Most people enter the data science world with the aim of becoming a data scientist, without ever realizing what a data engineer is, or what that role entails. Data engineers are vital to any data science project, and demand for them in the industry is growing rapidly in the current data-rich environment.

There is currently no coherent or formal path available for data engineers. Most folks in this role got there by learning on the job, rather than following a detailed route. My aim for writing this article was to help anyone who wants to become a data engineer but doesn’t know where to start and where to find study resources.

In this article, I have put together a list of things every aspiring data engineer needs to know. Initially we’ll see what a data engineer is and how the role differs from a data scientist. Then, we’ll move on to the core skills you should have in your skillset before being considered a good fit for the role. I have also mentioned some industry recognized certifications you should consider.

Right, let’s dive right into it.

 

Table of Contents

  1. So, what is a Data Engineer?
  2. The Difference between a Data Scientist and a Data Engineer
  3. The Different Roles in Data Engineering
  4. Core Data Engineering Skills and Resources to Learn Them
    1. Basic Language Requirements
    2. In-Depth Database Knowledge
    3. Data Warehousing/Big Data Skills
      1. Hadoop and MapReduce
      2. Hive and PIG
      3. Apache Spark
      4. Courses with a mixture of the above frameworks
      5. Kafka
  5. Data Engineering Certifications

 

So, what is a Data Engineer?

A data engineer is responsible for building and maintaining the data architecture of a data science project. These engineers have to ensure that there is an uninterrupted flow of data between servers and applications. Some of the responsibilities of a data engineer include improving foundational data procedures, integrating new data management technologies and software into the existing system, and building data collection pipelines, among other things.

One of the most sought-after skills in data engineering is the ability to design and build data warehouses. This is where all the raw data is collected, stored and retrieved from. Without data warehouses, all the tasks that a data scientist does will become either too expensive or too large to scale.

ETL (Extract, Transform, and Load) describes the steps a data engineer follows to build data pipelines. ETL is essentially a blueprint for how the collected raw data is processed and transformed into data ready for analysis.

Data engineers usually come from engineering backgrounds. Unlike the data scientist role, this role does not require much academic or scientific understanding. Developers or engineers who are interested in building large-scale structures and architectures are ideally suited to thrive in this role.

 

The Difference between a Data Scientist and a Data Engineer

It is important to know the distinction between these two roles. Broadly speaking, a data scientist builds models using a combination of statistics, mathematics, machine learning and domain-based knowledge. They have to code and build these models using the same tools, languages and frameworks that the organization supports.

A data engineer on the other hand has to build and maintain data structures and architectures for data ingestion, processing, and deployment for large-scale data-intensive applications. To build a pipeline for data collection and storage, to funnel the data to the data scientists, to put the model into production – these are just some of the tasks a data engineer has to perform.

For any large scale data science project to succeed, data scientists and data engineers need to work hand-in-hand. Otherwise things can go wrong very quickly!

To learn more about the difference between these 2 roles, head over to our detailed infographic here.

 

The Different Roles in Data Engineering

  • Data Architect: A data architect lays down the foundation for data management systems to ingest, integrate and maintain all the data sources. This role requires knowledge of tools like SQL, XML, Hive, Pig, Spark, etc.
  • Database Administrator: As the name suggests, a person working in this role requires extensive knowledge of databases. Responsibilities include ensuring that the databases are available to all the required users, are maintained properly, and function without any hiccups when new features are added.
  • Data Engineer: The master of the lot. A data engineer, as we’ve already seen, needs to have knowledge of database tools, languages like Python and Java, distributed systems like Hadoop, among other things. It’s a combination of tasks into one single role.

 

Core Data Engineering Skills and Resources to Learn Them

  • Introduction to Data Engineering
  • Basic Language Requirement: Python
  • Solid Knowledge of Operating Systems
  • Heavy, In-Depth Database Knowledge – SQL and NoSQL
  • Data Warehousing – Hadoop, MapReduce, HIVE, PIG, Apache Spark, Kafka
  • Basic Machine Learning Familiarity

 

Introduction to Data Engineering

It’s essential to first understand what data engineering actually is, before diving into the different facets of the role. What are the different functions a data engineer performs day-to-day? What do the top technology companies look for in a data engineer? Are you expected to know just about everything under the sun or just enough to be a good fit for a specific role? My aim is to provide you an answer to these questions (and more) in the resources below.

 

A Beginner’s Guide to Data Engineering (Part 1): A very popular post on data engineering from a data scientist at Airbnb. The author first explains why data engineering is such a critical aspect of any machine learning project, and then deep dives into the various components of this subject. I consider this a compulsory read for all aspiring data engineers AND data scientists.

A Beginner’s Guide to Data Engineering (Part 2): Continuing on from the above post, part 2 looks at data modeling, data partitioning, Airflow, and best practices for ETL.

A Beginner’s Guide to Data Engineering (Part 3): The final part of this amazing series looks at the concept of a data engineering framework. Throughout the series, the author keeps relating the theory to practical concepts at Airbnb, and that trend continues here. A truly exquisitely written series of articles.

O’Reilly’s Suite of Free Data Engineering E-Books: O’Reilly is known for their excellent books, and this collection is no exception to that. Except, these books are free! Scroll down to the ‘Big Data Architecture’ section and check out the books there. Some of these require a bit of knowledge regarding Big Data infrastructure, but these books will help you get acquainted with the intricacies of data engineering tasks.

 

Basic Language Requirement: Python

While there are other programming languages commonly used in data engineering (like Java and Scala), we’ll be focusing on Python in this article. We have seen a clear shift in the industry towards Python, and it is seeing a rapid adoption rate. It’s become an essential part of a data engineer’s (and a data scientist’s) skillset.

There are tons of resources online to learn Python. I have mentioned a few of them below.

A complete tutorial to learn Data Science with Python from Scratch: This article by Kunal Jain covers a list of resources you can use to begin and advance your Python journey. A must-read resource.

Introduction to Data Science using Python: This is Analytics Vidhya’s most popular course that covers the basics of Python. We additionally cover core statistics concepts and predictive modeling methods to solidify your grasp on Python and basic data science.

Codecademy’s Learn Python course: This course assumes no prior knowledge of programming. It starts from the absolute basics of Python and is a good starting point.

If you prefer learning through books, below are a couple of free ebooks to get you started:

Think Python by Allen Downey: A comprehensive go-through of the Python language. Perfect for newcomers and even non-programmers.

Non-Programmer’s Tutorial for Python 3: As the name suggests, it’s a perfect starting point for folks coming from a non-IT background or a non-technical background. There are plenty of examples in each chapter to test your knowledge.

 

Operating Systems

A key cog in the entire data science machine, operating systems are what make the pipelines tick. A data engineer is expected to know the ins and outs of infrastructure components, such as virtual machines, networks, application services, etc. How well versed are you with server management? Do you know Linux well enough to navigate around different configurations? How familiar are you with access control methods? These are just some of the questions you’ll face as a data engineer.

 

Linux Server Management and Security: This Coursera offering is designed for folks looking to understand how Linux works in the enterprise. The course is divided into 4 weeks (and a project at the end) and covers the basics well enough.

CS401: Operating Systems: As comprehensive a course as any around operating systems. This contains nine sections dedicated to different aspects of an operating system. The primary focus is on UNIX-based systems, though Windows is covered as well.

Raspberry Pi Platform and Python Programming for the Raspberry Pi: A niche topic, for sure, but the demand for this one is off the charts these days. This course aims to make you familiar with the Raspberry Pi environment and get you started with basic Python code on the Raspberry Pi.

 

In-Depth Database Knowledge

In order to become a data engineer, you need to have a very strong grasp of database languages and tools. This is another very basic requirement. You need to be able to collect, store and query information from these databases in real-time. There are tons of databases available, but I have listed resources for the ones that are widely used in the industry today. These are divided into SQL and NoSQL databases.

 

SQL Databases

Learn SQL for Free: Another Codecademy entry, you can learn the absolute basics of SQL here. Topics like manipulation, queries, aggregate functions and multiple tables are covered from the ground up. If you’re completely new to this field, there are not many better places than this to kick things off.

Quick SQL Cheatsheet: An ultra helpful GitHub repository with regularly updated SQL queries and examples. Ensure you star/bookmark this repository as a reference point anytime you quickly need to check a command.

MySQL Tutorial: MySQL was created over two decades ago, and still remains a popular choice in the industry. This resource is a text-based tutorial, presented in an easy-to-follow manner. The cool thing about this site is that practical examples with SQL scripts (and screenshots) accompany each topic.

Learn Microsoft SQL Server: This text tutorial explores SQL Server concepts starting from the basics to more advanced topics. Concepts have been explained using codes and detailed screenshots.

PostgreSQL Tutorial: An incredibly detailed guide to get you started and well acquainted with PostgreSQL. The tutorial has been divided into 16 sections so you can imagine how well this subject has been covered.

Oracle Live SQL: Who better to learn Oracle’s SQL database than the creators themselves? The platform is really well designed and makes for a great end user experience. You can view scripts and tutorials to get your feet wet, and then start coding on the same platform. Sounds awesome!

 

NoSQL Databases

MongoDB from MongoDB: This is currently the most popular NoSQL Database out there. And as with the Oracle training mentioned above, MongoDB is best learned from the masters themselves. I have linked their entire course catalogue here, so you can pick and choose which trainings you want to take.

Introduction to MongoDB: This course will get you up and running with MongoDB quickly, and teach you how to leverage its power for data analytics. It’s a short three-week course but has plenty of exercises to make you feel like an expert by the time you’re finished!

Learn Cassandra: If you’re looking for an excellent text-based and beginner-friendly introduction to Cassandra, this is the perfect resource. Topics like Cassandra’s architecture, installation, key operations, etc. are covered here. The tutorial also has dedicated chapters to explain the data types and collections available in CQL and how to make use of user-defined data types.

Redis Enterprise: There are not many resources out there to learn about Redis Databases, but this one site is enough. There are multiple courses and beautifully designed videos to make the learning experience engaging and interactive. And it’s free!

Google Bigtable: Being Google’s offering, there are surprisingly sparse resources available to learn how Bigtable works. I have linked a Coursera course that includes plenty of Google Cloud topics but you can scroll down and select Bigtable (or BigQuery). I would, however, recommend going through the full course as it provides valuable insights into how Google’s entire Cloud offerings work.

Couchbase: Multiple trainings are available here (scroll down to see the free trainings), and they range from beginner to advanced. If Couchbase is your organization’s database of choice, this is where you’ll learn everything about it.

 

Data Warehousing/Big Data Tools

Distributed file systems like the Hadoop Distributed File System (HDFS) can be found in almost any data engineer job description these days. It’s a common role requirement and one you should be intimately familiar with. Apart from that, you need to gain an understanding of platforms and frameworks like Apache Spark, Hive, PIG, Kafka, etc. I have listed the resources for all these topics in this section.

 

Hadoop and MapReduce

Hadoop Fundamentals: This is essentially a learning path for Hadoop. It includes 5 courses that will give you a solid understanding of what Hadoop is, the architecture and components that define it, how to use it, its applications and a whole lot more.

Hadoop Starter Kit: This is a really good and comprehensive free course for anyone looking to get started with Hadoop. It includes topics like HDFS, MapReduce, Pig and HIVE with free access to clusters for practising what you’ve learned.

Hortonworks Tutorials: Hortonworks was founded by engineers who worked on Hadoop at Yahoo, and it has a well-respected set of courses for learning various things related to Hadoop. From beginner to advanced, this page has a very comprehensive list of tutorials. Ensure you check this out.

Introduction to MapReduce: Before reading this article, you need to have some basic knowledge of how Hadoop works. Once done, come back and take a deep dive into the world of MapReduce.

Hadoop Beyond Traditional MapReduce – Simplified: This article covers an overview of the Hadoop ecosystem that goes beyond simply MapReduce.

 

Prefer books? No worries, I have you covered! Below are a few free ebooks that cover Hadoop and its components.

Hadoop Explained: A basic introduction to the complicated world of Hadoop. It gives a high-level overview of how Hadoop works, its advantages, applications in real-life scenarios, among other things.

Hadoop: What you Need to Know: This one is on similar lines to the above book. As the description says, the book covers just about enough to ensure you can make informed and intelligent decisions about Hadoop.

Data-Intensive Text Processing with MapReduce: This free ebook covers the basics of MapReduce, its algorithm design, and then deep dives into examples and applications you should know about. It’s recommended that you take the above courses first before reading this book.

You should also join the Hadoop LinkedIn group to keep yourself up-to-date and to ask any queries you might have.

 

Apache Spark

Comprehensive Guide to Apache Spark, RDDs and Dataframes (using PySpark): This is the ultimate article to get you started with Apache Spark. It covers the history of Apache Spark, how to install it using Python, RDD/Dataframes/Datasets and then rounds off by solving a machine learning problem. A must-read guide.

Step by Step Guide for Beginners to Learn SparkR: In case you are an R user, this one is for you! You can of course use Spark with R and this article will be your guide.

Spark Fundamentals: This course covers the basics of Spark, its components, how to work with them, interactive examples of using Spark, introduction to various Spark libraries and finally understanding the Spark cluster. What more could you ask for from one course?

Introduction to Apache Spark and AWS: This is a practical and practice-focused course. You will work with Project Gutenberg data, the world’s largest open collection of ebooks. You will need knowledge of Python and the Unix command line to extract the most out of this course.

 

Courses covering Hadoop, Spark, HIVE and Spark SQL

Big Data Essentials: HDFS, MapReduce and Spark RDD: This course takes real-life datasets to teach you basic Big Data technologies – HDFS, MapReduce and Spark. It’s a typical Coursera course – detailed, filled with examples and useful datasets, and taught by excellent instructors.

Big Data Analysis: Hive, Spark SQL, DataFrames and GraphFrames: MapReduce and Spark only partially address the challenge of working with Big Data. Learn high-level tools with this intuitive course, where you’ll build your knowledge of Hive and Spark SQL, among other things.

Big Data Applications: Real-Time Streaming: One of the challenges of working with enormous amounts of data is not just finding the computational power to process it, but doing so as quickly as possible. Applications like recommendation engines require real-time data processing, and storing and querying this amount of data requires knowledge of systems like Kafka, Cassandra and Redis, which this course provides. But to take this course, you need a working knowledge of Hadoop, Hive, Python, Spark and Spark SQL.

 

Kafka

Simplifying Data Pipelines with Apache Kafka: Get the low down on what Apache Kafka is, its architecture and how to use it. You need a basic understanding of Hadoop, Spark and Python to truly gain the most from this course.

Kafka’s Official Documentation: This is an excellent intuitive introduction to how Kafka works and the various components that go toward making it work. This page also includes a nice explanation of what a distributed streaming platform is.

Putting the Power of Kafka into the Hands of Data Scientists: Not quite a learning resource per se, but a very interesting and detailed article on how data engineers at Stitch Fix built a platform tailored to the requirements of their data scientists.

 

Basic Machine Learning Familiarity

While machine learning is primarily considered the domain of a data scientist, a data engineer needs to be well versed with certain techniques as well. Why, you ask? Getting models into production and making pipelines for data collection or generation need to be streamlined, and these require at least a basic understanding of machine learning algorithms.

 

Machine Learning Basics for a Newbie: A superb introduction to the world of machine learning by Kunal Jain. The aim of the article is to do away with all the jargon you’ve heard or read about. The guide cuts straight to the heart of the matter, and you end up appreciating that style of writing.

Essentials of Machine Learning Algorithms: This is an excellent article that provides a high-level understanding of various machine learning algorithms. It includes an implementation of these techniques in R and Python as well – a perfect place to start your journey.

Must-Read Books for Beginners on Machine Learning and Artificial Intelligence: If books are more to your taste, then check out this article! This is a collection of the best of the best, so even if you read only a few of these books, you’ll have gone a long way towards your dream career.

24 Ultimate Data Science Projects to Boost your Knowledge and Skills: Once you’ve acquired a certain amount of knowledge and skill, it’s always highly recommended to put your theoretical knowledge into practice. Check out these datasets, ranked in order of their difficulty, and get your hands dirty.

 

Data Engineering Certifications

Google’s Certified Professional

Source: Fourcast.io

This is one of the premier data engineering certifications available today. To earn this certification, you need to pass a challenging two-hour multiple-choice exam. You can find the general outline of what to expect at this link. Also available are links to get hands-on practice with Google Cloud technologies. Ensure you check this out!

 

IBM Certified Data Engineer

To attain this certification, you need to pass one exam – this one. The exam contains 54 questions out of which you have to answer 44 correctly. I recommend going through what IBM expects you to know before you sit for the exam. The exam link also contains further links to study materials you can refer to for preparing yourself.

 

Cloudera’s CCP Data Engineer

This is another globally recognized certification, and a pretty challenging one for a newcomer. Your concepts need to be up-to-date and in-depth, and you should have some hands-on experience with data engineering tools like Hadoop, Oozie, AWS Sandbox, etc. But if you clear this exam, you are looking at a very promising start to this field of work!

Cloudera has mentioned that it would help if you took their training for Apache Spark and Hadoop since the exam is heavily based on these two tools.

Posted in Information Technology

14 Best Sites for Taking Online Classes That’ll Boost Your Skills and Get You Ahead

Looking to pick up a new skill, but don’t have the time to do so? Do you want to go back to school but need to take some classes beforehand? Or, do you not want to go to school at all, but are looking to change careers? We’ve got the answer for all those problems: online classes.

They’re shorter than a college semester, they’re typically self-regulated, and they cover just about every skill, topic, or hobby you can possibly imagine.

But with this luxury comes great responsibility—mainly, the task of finding a site that works best for you. Have no fear, we’ve done all the hard work for you and compiled the ultimate list of resources that offer free, cheap, and quality classes right here on the internet.

Now all you have to do is sign up for one!

1. ALISON

ALISON has a large range of free, comprehensive classes on financial literacy, personal and soft skills, digital skills, entrepreneurship and then some. It targets all kinds of learners, from professionals and managers to teachers and freelancers.

2. Udemy

Udemy has plenty to offer for the learner on a budget, from completely free courses taught by experts, professors, entrepreneurs, and professionals, to frequent discounts and class specials. In addition to classes in tech, business, and marketing, you can also explore options in productivity, health, hobbies, and lifestyle.

3. Coursera

If you want to receive a college education without the high cost of tuition, Coursera is the best stop. This website offers amazing courses in all kinds of fields, from professional development to psychology, history, and literature—all created and taught by professors at top institutions nationally and across the globe. Their universities include Princeton, Johns Hopkins, Stanford, and plenty more.

4. edX

Just like Coursera, edX offers anyone, anywhere the chance to take university classes in various departments—and get certified. Some of their big partners include Harvard, Berkeley, Dartmouth, Georgetown, and the University of Chicago (and that’s not all!).

5. Udacity

Udacity focuses on software development, offering free courses in programming, data science, and web development. The website also offers a nanodegree program for individuals who want to master a skillset or pursue a full-time career in tech.

6. Lynda

By subscribing to Lynda, you’ll have access to thousands of courses in business, design, art, education, and tech. And it offers a free 10-day trial so you can test the waters!

7. General Assembly

General Assembly offers both online and in-person classes, as well as full-time and part-time options. It focuses mainly on digital skills, covering subjects such as digital marketing, iOS and Android development, data analytics, and JavaScript.

8. Skillshare

Skillshare provides “bite-sized” classes to learners who only have 15 minutes a day. It has over 500 free classes and several thousand premium classes to choose from in topics such as film, writing, tech, lifestyle, and more.

9. LearnSmart

LearnSmart’s orientated toward career development, which is why it’s a great place to learn about IT and security, project management, Office, HR, and business.

10. Codecademy

Codecademy wants to teach you how to, well, code—and for free. It covers all kinds of programming, including JavaScript, Ruby, HTML, CSS, and Python.

11. Pluralsight

After subscribing to Pluralsight (or using its free trial!), you’ll be able to explore classes in software, 3D development, VFX, design, game design, web design, and CAD software.

12. Adobe TV

Not sure how to use Photoshop or InDesign? Don’t worry, Adobe TV will walk you through all its programs with tutorials, manuals, and more.

13. FutureLearn

FutureLearn’s completely free, with classes taught by universities and special organizations. Its big topics are business and management, creative arts, law, health, politics, science, digital skills, sports and leisure, and teaching.

14. Academic Earth

And if you’re looking solely for academic classes, this website is perfect for you. It has courses in the arts, science, humanities, economics, computer science, and more, all for free.

Still don’t know where to start? Try Class Central—it personalizes your class search by asking you from the get-go what you’re interested in learning and from whom. Then, it pairs you with options from Coursera, edX, and other forums to find what best suits your needs, making the process even easier!

Posted in Information Technology

Linux Distributions Optimized for Hosting Docker

You can run Docker containers on any modern Linux distribution. But some specialized Linux-based operating systems are designed specifically for running Docker. If you want to host containers, these Linux platforms may be a better fit than an all-purpose Linux distribution.

Choosing a Linux Distribution for Docker

When you’re choosing a Linux distribution to host Docker, you should keep these considerations in mind:

  • Easy Docker support. You want your distribution to be able to support Docker, of course. You also want that support to come easily. You don’t want to have to install Docker manually; you want to be able to use a package manager that will keep it up-to-date for you.
  • Extensibility and customizability. Part of the appeal of containers is that they provide a framework for building agile environments. Being agile means being able to use whichever tools or processes you like, rather than being constrained by hardware or software limitations. For this reason, look for a distribution that is easy to extend or customize according to your needs. Distributions with few core dependencies and extensive package repositories fit this bill.
  • Efficiency. Another part of the appeal of containers is that they consume resources more efficiently than virtual machines. To make sure you’re getting the most out of your investment in efficient infrastructure, you should look for a Linux distribution that is efficient from the perspective of resource consumption. It should be able to do what you need while consuming as few resources as possible.
  • Security. You want to be sure your Linux distribution is secure by design. This is true no matter what type of workload you’re hosting, but it’s so important that it’s worth emphasizing on this list.
  • Startup time. Depending on your goals, fast startup time may be a priority for you. This is less important if you plan to boot your Docker host system only once and let it run. Conversely, if you will be spinning up virtual machines constantly to host Docker, you’ll want them to be able to boot quickly.
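
One way to apply the considerations above is to weight each one and score the candidates you are comparing. The sketch below is purely illustrative: the weights, the candidate names ("distro-a", "distro-b"), and the 1-to-5 ratings are made-up placeholders rather than measurements of any real distribution, so substitute your own priorities and findings.

```python
# Illustrative only: the weights and ratings are hypothetical placeholders.
CRITERIA_WEIGHTS = {
    "docker_support": 3,
    "extensibility": 2,
    "efficiency": 2,
    "security": 3,
    "startup_time": 1,
}


def score(ratings, weights=CRITERIA_WEIGHTS):
    """Return the weighted sum of 1-5 ratings; a missing criterion counts as 0."""
    return sum(weight * ratings.get(name, 0) for name, weight in weights.items())


def rank(candidates, weights=CRITERIA_WEIGHTS):
    """Return candidate names sorted from highest to lowest weighted score."""
    return sorted(
        candidates,
        key=lambda name: score(candidates[name], weights),
        reverse=True,
    )


if __name__ == "__main__":
    # Hypothetical ratings for two unnamed candidates.
    candidates = {
        "distro-a": {"docker_support": 5, "extensibility": 3, "efficiency": 4,
                     "security": 4, "startup_time": 5},
        "distro-b": {"docker_support": 4, "extensibility": 5, "efficiency": 3,
                     "security": 4, "startup_time": 2},
    }
    for name in rank(candidates):
        print(name, score(candidates[name]))
```

If you boot your host once and leave it running, you might lower the startup_time weight; if you spin up hosts constantly, raise it.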

Linux Distributions Designed for Docker

If you’re looking for a Linux distribution that meets the criteria outlined above, you’re in luck. Several distributions have appeared over the past several years that are designed with these priorities in mind. Some were created specifically for hosting Docker, while others just happen to align well with Docker hosting needs.

Docker-friendly Linux distributions include:

  • Alpine Linux. This is the distribution that Docker uses by default to build its packages. Alpine wasn’t designed specifically for hosting containers, but its small footprint and focus on security make it a good fit.
  • Container Linux. Container Linux is the new name for the operating system formerly known as CoreOS (which is still a project). As the name implies, Container Linux is designed very much with containers in mind.
  • RancherOS. A Linux distribution from Rancher that is tailored to host containers. One of RancherOS’s main features is that it runs as a container itself to make deployment simple and fast.
  • Atomic Host. One of the earliest container-centric Linux distributions. Okay, technically, Atomic Host is not itself a Linux distribution; it’s a project that builds the foundation for Linux distributions like Fedora Atomic Host and RHEL Atomic Host.
  • Boot2Docker. This lightweight Linux distribution does what its name promises: It boots your computer into a Docker environment. It does this in about five seconds, while consuming minimal resources and running only in memory. The big catch is that Boot2Docker is designed to run from a Windows or macOS machine. It’s a way to test Docker from a developer’s workstation, rather than host Docker workloads in the data center.
  • Ubuntu Core. A lightweight variant of Ubuntu Linux, Ubuntu Core can be a good solution for hosting container workloads when you want to leave a small footprint.

Don’t like any of these Linux distributions for running Docker? You can always make your own. Freedom is what Linux is all about, after all.

Posted in Information Technology

Best Smart Hub (updated 2019)

https://www.lifewire.com/best-smart-hubs-4140443

If you own several Samsung devices or own some of the 200+ SmartThings-compatible smart devices on the market, the router you should invest in for your home is the Samsung SmartThings Wi-Fi + Hub. This router lets you connect all your smart devices to one place, allowing you to control all the devices through the SmartThings app for iOS or Android. A single Samsung SmartThings Wi-Fi + Hub will cover up to 1,500 square feet in your apartment or home, but if you want 4,500 square feet of coverage, you can buy three of these and connect them together. With positive reception so far, the SmartThings Wi-Fi + Hub has been praised for covering a large home with wireless signal and for delivering fast speeds to many devices.
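
The coverage arithmetic above (1,500 square feet per unit, 4,500 with three) can be sketched as a quick calculation. This is a rough estimate that assumes rated coverage adds up linearly; real coverage depends on walls, layout, and interference, and the function name is just an illustration.

```python
import math

# Figure quoted in the article: one unit covers up to 1,500 square feet.
COVERAGE_PER_UNIT_SQFT = 1500


def units_needed(home_sqft, per_unit_sqft=COVERAGE_PER_UNIT_SQFT):
    """Return the minimum number of units whose combined rated coverage reaches home_sqft."""
    if home_sqft <= 0:
        raise ValueError("home_sqft must be positive")
    return math.ceil(home_sqft / per_unit_sqft)


if __name__ == "__main__":
    print(units_needed(4500))  # 3
    print(units_needed(2000))  # 2
```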

With a compact seven-inch display, the Google Home Hub is a perfect fit for smaller spaces like a bedroom or office, or in an apartment where there isn’t much surface space to spare. The display is well lit, easy to read, and offers crisp resolution to show off your pictures. The Hub is also compatible with products from over 400 different smart product brands, allowing you to sync all your smart devices to one central unit including internet-connected thermostats, light bulbs, security cameras, and more. Plus, the setup is hassle-free.

In many ways, the Google Home Hub is a cheaper version of the Amazon Echo. The voice-activated Google Assistant provides the same hands-free convenience offered by Amazon Alexa. It’s easy to hook up to your PC for more granular control. This device makes a few concessions to keep the price low, like settling for sub-par audio quality and skipping the front-facing camera (so you can’t make video calls). For most people, that’s a small price to pay considering the actual money you’ll be saving.

Watch out, Alexa — Lenovo has introduced a new contender to the digital assistant hub market. The 10” Smart Display and its built-in Google Assistant are there to meet all your needs, whether you want to check your meetings for the day, watch a YouTube video, or turn up the air conditioning. Lenovo’s Smart Display is fully voice-controlled. Treat it as a hub for your home, commanding it to turn smart lights on or off, adjust thermostats, or display feeds from smart home cameras. While it doesn’t have the strongest speakers, the 10-watt driver delivers a crisp, clean sound that pairs well with its 1920×1200 screen resolution for when your needs are just pure entertainment: Google Cast allows it to act like a TV, or you can run compatible apps like YouTube, Spotify, or Netflix. With a white body and a bamboo back, the Lenovo Smart Display easily blends into any décor for a sophisticated and smart addition to any room in the house. The 1.8GHz Qualcomm Snapdragon 624 processor is sufficient for daily use, but it should be treated as a device to make your life easier, rather than store apps — if you’re expecting the latter, you won’t get much out of its 2GB of RAM.

The Google Home Max combines big smarts with big volume. Packed inside its streamlined profile are a pair of 4.5-inch woofers and dual 0.7-inch custom tweeters. And when you crank the volume, it really shines, delivering high, undistorted volume with a booming bass. And it’s easy to set up. Just plug in the speaker to a power supply, open up the Google Home app, and it will walk you through the steps to get your speaker up and running.

Fortunately, heart-pumping sound isn’t the only claim to fame for the Home Max. The built-in Google Assistant makes it a two-in-one product. Just say “Hey, Google” to play music, control a variety of smart home devices or answer any search inquiry. Six far-field microphones will pick up your voice from nearly anywhere in a room even if you’re playing music. The Home Max is also capable of recognizing up to six different voices, synchronizing music between rooms, and playing music from a multitude of streaming services.

The latest Amazon Echo Plus comes standard with a seamless protocol that lets it function as a true smart hub: Zigbee. The Echo Plus’s Zigbee support allows it to more directly and seamlessly control devices like smart thermostats and smart light bulbs, making it a widely-compatible and user-friendly choice for a smart hub.

The Echo Plus is technically a follow-up to the original Amazon Echo, but it does bring with it a few new design features. First of all, you now have your choice of three different neutral colors: charcoal, heather gray, and a very light gray called “sandstone” on Amazon. It looks a bit more premium, blends in with all kinds of decor, and with that signature gradient light ring on the top, it fits in with the rest of the Echo line.

The Alexa functionality is also still there, meaning you’ll be able to ask questions, check the weather and a whole lot more (Amazon has built an Alexa library that is more than 50,000 skills strong). But the real standout feature here, in our eyes, is the speaker setup. The smaller Amazon Echos sometimes struggle with quiet, thin sound quality. The Echo Plus here offers a three-inch neodymium woofer alongside a tweeter that’s nearly an inch big. For reference, the smaller Dot line sports a single woofer that’s only about one and a half inches. What this bigger speaker will most likely amount to is a louder, fuller sound that will more adequately fill a room.

If you can’t quite shell out the cash for Amazon’s new flagship Echo Plus, you can still get the latest tech in a smaller, less expensive package. The newest generation of Echo Dot gives you a pretty solid amount of features without breaking the bank. You do end up sacrificing the Zigbee smart home connectivity and the massive speaker set of the new Echo Plus (the speaker size here is 1.6 inches, which is admittedly small). But it does seem like Amazon has exerted some effort to update the speaker to give you more oomph and volume. The Alexa functionality is present here, too, allowing you to stream music, run searches, or teach it up to 50,000 different skills from Amazon’s ever-expanding library. There’s a line out or Bluetooth connectivity, and that new premium mesh-grill look is present here. The three new colors carry through as well: charcoal, heather gray, or sandstone. And because it’s a flat, smaller device, it’s perfect for smaller rooms or anywhere that you want Alexa capabilities without a big screen or big speakers.

The second generation of the Echo Show demonstrates a renewed commitment from Amazon to create a solid smart display hub. This version looks quite a bit more premium than the prior model, with a 10.1-inch screen with 1280×800 resolution (that amounts to 720p HD sharpness). Add that to the built-in 5MP camera, and you basically have a tablet’s worth of visual features sitting on your counter.

But this gets our pick for best display-based hub because of just how much more it can do. The wedged back housing contains a serious set of speakers — dual two-inch neodymium drivers, to be precise. Those 10W speakers are Dolby-tuned which means that you’ll get full, carrying sound. Naturally, you also get all the Alexa features — voice commands, day-to-day weather checking, and 50,000 other teachable, third-party skills — that you’d expect from an Echo device.

With a sleek Scandinavian design, the Libratone Zipp 2 Smart Wireless speaker is an eye-catching smart hub with plenty to offer.

The cylindrical speaker comes in four different colors, including black, gray, red, and green. But if you want to switch it up, you can unzip the main sleeve and replace it with a new one in a different color. On top of the speaker, you’ll find a few simple light-up buttons that control it. On the side, you’ll find a handy leather strap for extra portability.

The Zipp 2 includes support for Alexa, but not Google Assistant or Siri. Libratone has said it is considering adding support for Google Assistant at a later date. But for now, this isn’t the smart hub for you if you need Google Assistant. It is, however, compatible with Airplay 2, meaning that it can be paired with other Airplay-compatible speakers, such as a HomePod or Sonos speaker, in the same house. In fact, you can connect up to 10 different speakers using Airplay 2.

In short, the Zipp 2 will be particularly useful for people already using Airplay 2 and for those who want Alexa.

The LG ThinQ WK9 may not be as sleek as some of its competitors, but it’s a powerful and efficient addition to the smart hub ranks. The eight-inch touch screen is bright and colorful, making it ideal for video calls or cooking demonstrations, or just for playing your favorite podcast. It also features a front-facing camera and audio tuned by Meridian Audio. The two built-in woofers deliver a total of 20W of power. The smart hub can connect over both Bluetooth and Wi-Fi, and you can stream audio or video to the WK9 from any Bluetooth-equipped device.

The WK9 is far from a one-trick pony: with Google Assistant, it lets you check the weather or traffic, browse the web, monitor your schedule, or sync it with other Assistant devices. With built-in Chromecast, you’ll also be able to pair it with other Chromecast speakers in your house.

Logitech’s Harmony Hub isn’t your typical smart hub, but it’s compatible with more than 270,000 devices. With a simple setup that can have you online and connected to up to eight devices within minutes, the Harmony Hub works great with your TV, satellite, cable box, Blu-ray player, Apple TV, Roku, game consoles and more.

Creating customized activities is a breeze through the downloadable Harmony App for both Android and iOS. Tapping a pre-programmed button on the app can immediately turn down your Philips Hue smart lights, turn on your connected speaker and TV, launch Netflix and let date night begin instantaneously with one click. And the Harmony Hub adds Amazon Alexa support as well for voice control, so you can do this just by talking. Beyond voice control, the Logitech truly stands out with closed cabinet control, which allows it to send commands to connected devices through infrared commands that don’t require direct line-of-sight to function.
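
The idea of an activity, where one tap triggers several device commands in order, can be sketched as follows. This is a toy model only: the `Device` and `Activity` classes and the command strings are hypothetical and are not Logitech’s API or the Harmony app’s actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    """A stand-in for a controllable device; it only records the commands it receives."""
    name: str
    log: list = field(default_factory=list)

    def send(self, command):
        self.log.append(command)
        return f"{self.name}: {command}"


@dataclass
class Activity:
    """A named sequence of (device, command) steps triggered together."""
    name: str
    steps: list

    def run(self):
        # Execute the steps in the order they were defined.
        return [device.send(command) for device, command in self.steps]


if __name__ == "__main__":
    lights, speaker, tv = Device("lights"), Device("speaker"), Device("tv")
    date_night = Activity("date night", [
        (lights, "dim"),
        (speaker, "power on"),
        (tv, "power on"),
        (tv, "launch Netflix"),
    ])
    for line in date_night.run():
        print(line)
```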

The Wink 2 is a second-generation smart hub that connects to an impressive number of gadgets, including Amazon’s Alexa, Google Home, Z-Wave, Zigbee, Lutron Clear Connect and Kidde devices. Inside the Wink and its sleekly designed 7.25 x 7.25 x 1.75-inch frame are powerful Wi-Fi radio and Ethernet ports for rock-solid Internet connection. Fortunately, the ease of setup matches the quality of its design, thanks to a straightforward smartphone app for both Android and iOS. In less than five minutes you’ll be connected to smart devices such as Philips Hue lighting, Ecobee thermostats or the Nest camera.

The four main features (control, automate, monitor and schedule) round out the Wink 2’s full spread of capabilities. All told, across those four functions, the Wink 2 can support up to 530 devices paired at once with seamless integration. Additionally, the separately purchased Wink Relay plugs directly into your wall, allowing control of all Wink-ready devices without a smartphone.

With 6,000 square feet of combined coverage, the Tenda Nova MW6 3-pack ensures powerful, high-speed internet in every inch of your house. Don’t quite need 6,000 square feet of coverage? Tenda also sells two-packs and single modems to fit your home’s specific needs. Doubling as a Wi-Fi router and a home automation system, it is compatible with all major internet providers (AT&T, Comcast, Verizon, Spectrum, etc.), and supports a stable connection for up to 90 devices simultaneously. The Tenda Nova also links to your favorite smart devices, like the Amazon Echo and Alexa, as well as your smart TV, security system, and other advanced appliances.

The set-up process is as simple as plugging in the modem and following the app’s instructions. From there, the Tenda’s system can be controlled from anywhere, and on any device: adjust temperature settings, listen to music from your home entertainment center, or even set timed-restrictions on Wi-Fi use for your children and teens.

How We Tested

We bought and tested four top-rated smart hubs. Our reviewers spent more than 500 hours setting them up and experimenting with their various capabilities. We asked our testers to consider the most important features when using these smart hubs, from their automation software to their coverage areas. We’ve outlined the key points here so that you, too, know what to look for when shopping.

What to Look for in a Smart Hub

Compatibility – There are different standards for communicating with smart home devices, including Zigbee and Z-Wave. Before purchasing a smart home hub, ensure that it supports the standards used by your existing smart home devices.

Automation – Some hubs will include automation software for your smartphone or computer. If you’re looking to have the lights in your home automatically turn on at a specific time or have the thermostat adjust itself depending on the weather, you’ll want to make sure your hub includes the needed software.

Coverage – Depending on the size of your home, you may need to check that the hub you’re purchasing will provide sufficient coverage. If it isn’t powerful enough to transmit a signal throughout your entire space, you might find that some of your smart devices won’t respond.
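
The compatibility check described above boils down to comparing the standards a hub supports with the standards each of your devices uses. Here is a minimal sketch; the function name, the device names, and the protocol lists are invented for illustration, so check the real specifications of your own hub and devices.

```python
def unsupported_devices(hub_protocols, devices):
    """Return names of devices that share no protocol with the hub, sorted alphabetically."""
    # Compare case-insensitively so "ZigBee" and "Zigbee" match.
    hub = {protocol.lower() for protocol in hub_protocols}
    return sorted(
        name
        for name, protocols in devices.items()
        if not hub & {protocol.lower() for protocol in protocols}
    )


if __name__ == "__main__":
    # Hypothetical hub and device inventory.
    hub = ["Zigbee", "Wi-Fi"]
    devices = {
        "hallway bulb": ["Zigbee"],
        "door sensor": ["Z-Wave"],
        "thermostat": ["Wi-Fi"],
    }
    print(unsupported_devices(hub, devices))  # ['door sensor']
```

An empty result means every device on your list shares at least one standard with the hub.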

Samsung SmartThings Wi-Fi + Hub

What We Like

  • Easy to use
  • Seamless
  • Great-quality Wi-Fi

What We Don’t Like

  • Time-consuming setup

One of our testers, who used this device with an Amazon Alexa and a collection of Philips Hue lightbulbs, reported that it changed the way he interacted with his home: “The highlight of this product is that, once it was set up, we didn’t think about it,” he said. “We slowly but seamlessly integrated it, and its associated ‘smart’ products, into our lives.” One reviewer also felt that its Wi-Fi signal was excellent: “I was able to completely eliminate dead spots in my apartment,” he said. “I didn’t realize that my home Wi-Fi could be so fast.” On the other hand, our testers noted that even though the setup process was fairly straightforward, it was still time-consuming.

Google Home Hub

What We Like

  • Easy setup
  • Great display quality
  • Versatile

What We Don’t Like

  • Sound quality could be better
  • Can’t power via USB

This device won over one of our testers with its simple setup process and great-quality screen display: “I love how the screen brightness adjusts to light,” she explained. “You don’t have to fiddle with the brightness settings.” One reviewer — who set up this device in her kitchen and used it primarily to look up recipes, watch instructional cooking videos, and control her other devices — was also impressed with its versatility. “There’s so much you can do with it,” she said. “I’m discovering new features every day.” The downsides? Our testers wished its sound quality were better and also that it had the capability to be powered via USB: “Pretty much all devices nowadays do that,” one reviewer explained, “and it’s useful in cases where you may want to run it without a full plug (like in an RV).”

Lenovo Smart Display

What We Like

  • Easy setup
  • Sharp display
  • Amazing sound quality

What We Don’t Like

  • Can’t customize home screen

“This display really has made spending time in the kitchen more exciting,” declared one of our reviewers. The highlights? Its easy setup process, its “sharp and vivid” display, and its sound quality: “It was loud enough to fill the kitchen with music,” one tester explained. “The higher ranges provide deep, robust sound without sounding tinny.” In terms of negatives, one of our reviewers wished that the device’s home screen had more customization options: “I’d rather tap into different apps to find my personal information instead of having my calendar and recent Google Assistant queries readily listed for everyone to see.”

Google Home Max

What We Like

  • Simple setup
  • Excellent sound quality
  • Easy to interact with Google Assistant

What We Don’t Like

  • Slight delay when executing voice commands
  • Heavy

The Google Home Max stood out to one of our testers because of its excellent sound quality: “It filled my living room and it was easy to adjust the bass and treble on your mobile app as you please,” he reported. “I like this product so much that I may consider purchasing another one for pairing usage — just imagine the extra sound I’d get!” Our reviewers also liked that it was easy to set up. Although one tester enjoyed interacting with Google Assistant, he noted that it was slightly slow to execute voice commands. One reviewer also thought that the device’s weight made it less portable.