Posted in Information Technology

5 Keys to Becoming a Successful Consultant as a Software Developer

https://simpleprogrammer.com/effective-software-consultant/

Why I Became a Consultant

When I left my chief information officer (CIO) position in 2007, I wanted to return to the purity of software development. The work of software development—programming—was my first love by a very wide margin, and after a few years as a CIO, I came to hate the day-to-day tediousness of budgeting and more budgeting, and defending that budget; performance evaluations; personnel management; endless meetings to plan and organize, and then scheduling the next round of meetings to ultimately arrange yet another series of meetings. Rinse and repeat.

In my earlier senior management days, I was ecstatic to participate in guiding the steps of the company and our department. I loved being involved in decision-making processes for software and hardware acquisition, high-level business planning, and the subsequent staffing and project development processes to ultimately meet the goals of the business. All of that was fun and exciting. For a while.

After a period of time, though, I realized that I am not the guy who can evolve completely or comfortably out of software development. I know lots of people who have happily hung up their programming duties in favor of a management position. That’s great for them, but I am not that guy.

I found that I could no longer participate in the things that advanced my career in the first place. The things that got me promoted through the ranks and to the roles that I aspired to were the very things I had to relinquish. Being a good software developer got me promoted right out of being a software developer. After a while, it just didn’t make sense to me.

I missed software development more than I liked the status, money, and responsibility that came with my nice big title. I started asking myself what I really wanted to do. I had more or less achieved what I thought was the endgame for software professionals. Heck, I was a CIO. What was left?

That was a hollow thought for quite a while. I was miserable.

By that time, I had been hiring consultants, contract programmers, freelancers—call it what you will—for years. I, too, had been in and out of contract and consulting work at times over the years.

Then it hit me. Why not consult? Why not start my own consulting and contracting business? Wasn’t that the dream in the back of my mind all along?

I got excited! I mused, I plotted, and I daydreamed: I would write code for various customers. I would help companies make good systems decisions and then I would help them create their software. I would bring my experience as a software developer and a senior manager together and help companies achieve their software and technology goals. I would have fun again! I would be great at it! They would love me!

It all made perfect sense. It seemed so simple.

Just one problem: How? And where do I start?

How and Where Do I Start?

As I thought about it, there were endless “hows” to deal with:

  • How do I stay busy enough to survive, to thrive?
  • How do I market and sell?
  • How do I close deals when I find customers?
  • How do I price my offerings?
  • How do I find additional programming resources?
  • How do I manage finance, taxation?
  • How do I handle legal issues?
  • How do I grow my company?
  • How do I manage contracts?
  • How do I stay relevant technologically?

And that is only 10 questions! I could increase this list ad infinitum.

Within each of these “how” questions is a conceivably endless list of more questions.

Gulp …

Back in 2007, when we first started our company, Pinch Hitter Solutions, there was not a ton of information on the web about how to start a software development consultancy.

I had my career as a software developer, and my experience as a senior manager and CIO. I knew what buying assets and services looked like. I was skilled as a Java and JavaScript developer and I had a ton of AS/400 and RPG experience; I knew how to write code, manage teams, and run projects.

Even with all my experience up to that point, considering the list of “how” questions above, the skills I started with were not nearly enough. Not even close.

Nevertheless, I began my journey with what I had to work with, which was really just myself. My wife and family believed in me, and my experience in corporate America was certainly helpful, but ultimately I had a lot to learn. A whole lot. And that is a vast understatement as I look back.

By the end of 2017, I began to reflect on all of the lessons I had learned in my journey as a consultant. And there had been many!

I asked myself how I could be helpful to others. What could I do for those that might be considering a walk down the same path? That would be the path of leaving a relatively stable life in corporate America to live in the wild, Wild West of consulting, contract programming, self-employment, and business ownership. And, oh yeah, marketing and sales.

Earlier this year, after a lot of thought, consideration, and research as to how I might do this, I started a blog and a YouTube channel of the same name (Motivated Code Pro—links below), both of which were inspired and encouraged in part by the work of John Sonmez—his SimpleProgrammer.com blog and YouTube channel were extremely helpful to me.

In the interest of space and time, I have distilled a lot of experience down to just five keys that will help you become a successful software development consultant, and I’ve included links to their corresponding videos below.

1. Your Skill and Experience

Obvious, right? We present ourselves to potential clients based on the merits of our skill and experience. I want you to have at least five years of work history in your technologies of choice before trying to sell those skills to potential customers.

Experience is the one thing that can’t be taught or rushed. Time takes time and it is the day-to-day, mile-after-mile, step-by-step work that earns you that experience. There is no substitute for putting in the time.

Companies that engage your services are counting on you to complete time-sensitive projects with efficiency. I know from years on the buying side of many contracts that people with at least five years of solid experience in their disciplines tend to come with robust knowledge of their chosen languages, frameworks, and technologies. They are able to resolve complex problems, work independently or on a team, and can learn new disciplines quickly.

Companies will hire you based on your experience (and your ability to sell and represent that experience), but they will also expect you to consume and apply new information very quickly.

Maybe they work in a business that is different from where you came from. Trucking instead of health care, for example, so you will need to apply your technology skills in a business setting that is new to you. You need to be mentally limber, eager to learn what is new, and quick to apply what you know.

Not only that, with all of the Java, PHP, Python, and JavaScript frameworks—not to mention the many other languages, design patterns, and the sheer number of people using them—you will definitely encounter techniques and solutions to common problems that are different than what you are used to. Fast assimilation of new things is part of the fun and also part of the challenge.

That said, there is always the temptation to wander way off course and learn something completely new and sexy. You may want to learn it because it’s getting a lot of press, because it’s the hot new thing, the technology du jour.

Pause! Stop!

Don’t do that. Wait until buying customers are engaging with that technology before you burn countless hours of your life to learn something that may not take off and therefore will not get you paid.

Remember, you are in business now. You need to make smart business choices with your talent and time. Be sensible with where you spend it; make choices based on what is going to get you paid. Period.

As for further sharpening your existing tools, it’s wise to spend at least 30 minutes every single workday pushing your skills ahead—learning a new framework, for example (maybe new to you, but tried-and-true in the marketplace), within one of your main languages. This is important and is a good investment.

I am a fan of going deeper with the skill set that is already getting you paid. For example, if you’re working in web development today, learning another popular JavaScript framework is a good idea.

Choose wisely, and if you decide to dig into something totally new, and I know some of you will, just be sure it’s a worthy investment of your time and talent. Remember: You are a business.

Skill and experience will get you paid, but time is your greatest asset. Treat it like money.

Key 1 on YouTube

2. Writing for Blogs and Posting on Social Media

Getting your name out there is important. Writing for blogs and posting on social media are great ways to let prospects and even existing customers know who you are, what you do, and that you really know your stuff.

Taking the initiative to write also gives them a look at your ability to communicate using the written word. Despite living in a world of highly abridged communication styles via text and chat tools, being able to write effectively is incredibly valuable if you’re going to be a consultant—in my humble opinion. IMHO.

This is good old-fashioned marketing for the modern era.

When I first started as a consultant, I wrote an article for a popular AS/400 magazine—still in paper form in those days—encouraging their audience to try Java and to try the NetBeans IDE. It was a guest post, about 500 words.

After the article was published, a company in Tennessee, not too far from where I live, contacted me with a barrage of questions related to Java. After several phone calls and a site visit, we began a nearly two-year consulting engagement that would not and could not have happened without that article.

Sometimes, writing for a blog, yours or someone else’s, feels like pouring cups of water into the ocean, but I know from experience that those tiny contributions to the sea of digital media can make a big difference to you. You never know who might be watching, who might be reading.

All of the social platforms are used for business these days. As for me, I use LinkedIn, and I have a YouTube channel, a personal blog, and a Facebook business page.

I encourage you to start a blog and write for other blogs; write short articles on LinkedIn or other sites in your space; consider creating a YouTube channel where you can be helpful and informative to people who might use your services or work in your space.

Do some homework. What are successful people in your space doing to attract attention and help others?

It’s never too soon or too late to start. This is something that you should give attention to, even if you don’t consider yourself savvy on the various social platforms.

I am guilty of a decade of indifference where social media is concerned and have only recently really, really tried to get traction. And that was way after our business had been established. Don’t wait.

Think of social media and blogging as a step in trusting the process. Your hard work is never wasted. Have faith.

Social media and blogging are an investment in your future, but they are something you have to do today without assurance of a payoff—again, this is an exercise in faith. Immediate return is not the goal; creating an online presence takes time and effort without the immediacy of a paycheck. This is an investment in yourself and your business, and will ultimately help you find work and stay busy.

Key 2 on YouTube

3. Being Able to Find Work

The biggest way people fail as consultants or contract programmers is in not being able to find work. Specifically, the second and subsequent contracts are harder to obtain. When someone leaves their full-time job to consult, they usually have an arrangement waiting with another company. The difficulty comes when that first contract ends.

One interesting thing about contracts that I want to share is this: I have started multiple contracts over the years that had a stated duration. Ninety days or six months, let’s say. That is not a lot of time and not something I would recommend leaving a full-time position for.

However, I have never been on a contract that lasted only the stated duration. (If you listen carefully, you can hear me knocking on wood.) My point is that contract durations are often contract starting points. There is never a guarantee, but that is my experience and the experience of many other consultants.

Remember that contract work is designed to end. That is perhaps the most important statement in this article. In a sense, you are there to work yourself out of a job. With that in mind, you should have your eye on where work will come from after your current engagement ends.

Let me say it a different way: You need to look for future work and future relationships, and tend to your current relationships, your social media platforms, and your own and other blogs, or you will be out of business. Finished.

Attending user groups is a great way to meet people who work in your space and are doing more or less the same things that you are. This is where you can make friends, potentially help others, and meet people with influence and hiring ability.

Connecting with consultancies around the country that offer the same types of services you do is a great way to build valuable relationships. As you know, it’s not just finding work, it’s being able to staff that work as well. Other consultancies have this challenge, too, and will look for people like you to meet their customer needs. This is a big win-win in that you can conceivably find work and potentially have access to their resources to staff projects you might sell into.

Something to consider when working through other consultancies is that you will not get the same rate you will get from a direct client—this could be seen as a negative. The positive is that you will typically not have to wait an eternity to get paid. You are also benefiting from their sales efforts.

Working with direct customers is more profitable but harder to sell into. Also, billing with direct customers can be painful and adds another task to an already busy work life—Welcome to Accounts Receivable.

Our company has never had a problem getting paid through our consultancy relationships, but I have frequently had to make phone calls, send emails, etc. to remind our direct customers that we have not received payment.

Stay in touch with people you have worked with over the years. People move around, take new jobs, learn new skills, and may need someone like you to help with their projects, their company, and their software products. Keep your contacts warm. Phone calls, emails, lunches, and other friendly gestures are what keep these relationships alive.

You will feel the gravitational pull of your billable work trying to distract you from the road ahead. This is a dangerous trap. Yes, you need to work and bill to get paid today, but time will pass and you need to be ready for that too.

Strive for balance in tending to your present work while tending to your future possibilities as well. Contracts are built to end. Get comfortable with change. Stay connected!

Key 3 on YouTube

4. A Good CPA and Business Attorney

Think of your certified public accountant (CPA) as an integral part of your team, even if you’re a team of one. CPAs are licensed by the state or states they do business in and can help you with your local, state, and federal tax issues.

There are a number of things your CPA does or can do for you.

Your CPA:

  • Will help you select the right corporate designation: limited liability company (LLC), S corporation, C corporation, sole proprietorship—this can change for various reasons over time and your CPA will help if the time comes to change your designation.
  • Is a great source of advice when you get started; when you shrink or grow; if you struggle financially; and when you want to borrow money, purchase equipment, or make other financial decisions. Here are some questions I have asked countless times:
    • Is “x thing” deductible?
    • Should I buy this equipment now or wait until next year?
    • Based on my gross income, how much money should I set aside for taxes?
    • Based on my income, am I in a good position to hire someone at x rate of pay?
    • Should I hire this person as a W2 employee or a 1099 contractor?
  • Is generally connected to people in your community who can help you with bookkeeping, banking, credit, payroll and direct deposit, 401(k), and even health care. Your CPA may not do these things themselves, but they will generally know who the right people are for a given discipline.

The time to hire a CPA is when you first start your business. Interview a few local CPAs as you’re getting started. Tell them you’re a new business owner and that they should assume you know nothing about accounting for a small business.

Unless you have an accounting degree or an actual background in accounting, do not assume that your exposure to big-company corporate accounting is going to get you through. Remember, you are a software professional, not an accountant. Don’t devote valuable time to someone else’s expertise.

Your CPA can guide you or direct you to helpful resources that will start you on the right footing from a tax and accounting standpoint and can guide you as you grow. Make sure your potential CPA is easy to access; returns phone calls, texts, and emails promptly; and is someone you like. I cannot overemphasize this point. Over time, your CPA is someone you will have a lot of contact with. If you groan every time they show up on your caller ID, you have the wrong CPA.

A business attorney is a bit more remote but still integral to your team—once more, even if the team is just you to begin with.

As a software development consultant, contractor, or freelancer—call it what you like—you are self-employed, and as a matter of course you will submit proposals and contracts to clients, potential clients, and vendors, as well as to contract or consultative workers.

Sometimes, your clients will insist that they author the contract. The same thing might be true for your 1099 professionals and vendors. All of that is well and good, but you are not a lawyer and therefore you are not qualified to review the legalese of a contract. Your business lawyer can and should do this for you.

A few common agreements consultants need to produce or agree to:

  • NDA – Nondisclosure agreement
  • Contractor agreement
  • Subcontractor agreement
  • Intellectual property rights

In most cases, these agreements will be boilerplate or one-offs from a boilerplate so you don’t have to have these written and rewritten again and again. Having an attorney to quickly review agreements and changes asked for by your clients, vendors, and 1099 professionals is essential to your peace of mind.

To close this section, I want to remind you that you are a software professional. You are not a CPA or an attorney. Spend your time with your customers, your prospects, your software product, your blog, your social media, etc. You have plenty to do already. Leave the accounting and lawyering to the experts while you tend to your own expertise.

Key 4 on YouTube

5. Welcome to Sales and Marketing

This is where the reality of being a self-employed entrepreneur really sinks in. Of the five keys, this will require the most from you and take you the furthest from your comfort zone.

This is the one thing that keeps otherwise capable and talented software professionals from taking the step to becoming an independent consultant. I have spoken to many people over the years who have simply said, “I can’t sell.”

Let’s face it, if you have to risk your home and family life as you know it and everything else you’ve worked hard for up to this point, the very real possibility of being without work for periods of time can be a nonstarter.

Staying with a company you like, doing a job you love, and maintaining a lower risk in your life is not a bad thing. There is nothing wrong with that.

The word entrepreneur carries with it the connotation of glamour and success. While this is true for some, the brutal truth for all is that it is extremely hard work. Late nights, early mornings, financial risks, and abandoned personal life are a few of the better-known, though infrequently discussed, hazards of entrepreneurship.

Let me say it this way: Sometimes it sucks, and for every ounce of success, there is a pound of stress.

Being an entrepreneur is not for everyone. Taking a risk like this with your livelihood takes a close examination of who you really are at this point in your life, your tolerance for risk in general, and a super long look at the reality of your skill set in the marketplace.

Does the “suggestion” that you have at least five years of experience make perfect sense now?

Some software people choose to work for an established consultancy. This is often seen as the perfect middle ground between a traditional 9-to-5 job and being self-employed but with less risk and no requirement to “ask for the sale,” because a sales pro is likely handling that. This is a decent choice, maybe a great way to put your toe in the water. Maybe a great place to spend your career.

If you want to be an entrepreneur and self-employed as an independent software professional, then you will need to sell yourself, your services, and your company.

Like a lot of things, selling is a skill that takes time and patience, and for deeply technical people does not tend to come naturally. It has to be developed. Just like learning to code, it takes practice, practice, practice.

I think the best piece of advice I’ve ever gotten is to just be myself and to talk to prospects the way I would talk to a new friend. Ultimately, they are buying you, not some salesy version of you.

You can’t control the way prospects respond to you; some will be fun and engaging, some will be difficult, but you have total control over yourself and your response. So be yourself. It’s what you know best. It’s what you already have.

Even if you’re a bit klutzy, your sense of authenticity will come through and be easier to warm up to than a super-polished, over-rehearsed Zig Ziglar version of yourself.

With all that said, there is nothing wrong with learning as much about marketing and sales as you can—in the end, though, just be yourself as you apply the techniques and lessons you’ve learned.

Finally, when it comes to selling, at some point in the process you are going to have to ask for the sale, to say it costs x amount of money, it is going to take x long to complete, the payment terms are thus, and that you are the right person to help them. Sign here, please.

Your preparation for this moment could take weeks or months, but you need to be ready.

The most important thing to know at this point in the process is that everything is a negotiation. I mean everything. It might look something like this:

You throw out a fair number, but they say it’s high (and your throat is in your stomach).

They counter with a low number (now you have un-swallowed your throat).

You negotiate to the middle.

Sometimes, you get it just right, but other times you will hear things like, “It’s outside of the financial allocation I have for this project,” or “It’s more than I’m comfortable selling to my boss down the hall.” It could be anything, but it’s always a negotiation and you need to prepare yourself mentally for that process. You need to know what your bottom line is.

This gets to the heart of closing deals and it is something you will have to work on over time. You don’t get to close deals every day, and it is these moments that coin phrases like the closer.

These conversations need to happen in person if possible or over the phone at least. Email is not the medium to handle a serious negotiation. I say this based on real experience and real money. The value of a voice-to-voice, or preferably a face-to-face, conversation cannot be overstated.

To wrap up this section, I want to remind you that selling is a strange bedfellow for software development professionals and it takes practice, practice, and still more practice. You have to really want it.

The well is deep when it comes to marketing and selling. There are infinite resources available on the web and in books. My goal here was to scratch the surface and present the reality of sales and marketing from the perspective of a software professional who has been there, who in fact lives it every day.

Key 5 on YouTube

Take the Leap Into Successful Consulting

Stepping into your own consultancy is a big step indeed—one to be taken with care and caution. Make sure you are ready and that the timing is right with the rest of your personal priorities.

After years of software work, it’s your skill and experience that got you here. Stay sharp and go even deeper with the software development choices that paved the way for you.

Writing for blogs and posting on social media will help keep you here. Carve out time for this activity so it becomes part of your daily business process.

Staying in touch with your software development friends and keeping your business contacts warm will keep your prospects hot for finding new work.

Good businesses function as a team. Your CPA and business attorney will keep the business end of your venture healthy and accountable.

In the end, your willingness to market and sell will not only keep you in business, but will also allow you to grow your consulting practice and thrive. You have to get paid to succeed, so your ability to ask for the sale will make you a true entrepreneur and a successful consultant as a software developer.

As I implied at the beginning of this article, each of these five keys is a deep topic, worthy of a lot more ink than I have spilled here.

In closing, I hope you are inspired to dig deep and ask yourself the important questions about becoming a successful consultant as a software developer.

 

Posted in Information Technology

Consultant vs Software Engineer

A consultant works more with people and less with software, though you need strong technical skills to be any good at it. A developer (engineer) spends most of their time doing problem-solving while a consultant spends most of their time communicating. Both roles involve analysis.

Following the engineer path probably offers slightly better job security (it’s very easy to hire and fire consultants since they tend to work on projects for relatively short periods). However, the consultant path probably offers more flexibility, as you get experience of a lot of different workplaces. It would also usually involve far more travel and time away from home.

Right out of college, there will be far fewer options for consultancy as most consulting companies will want more proof that you’re good enough technically. Improving your skills as a software developer in no way reduces your options to become a consultant at a later stage when you’re on a higher level technically. It’s easier to go from being a software engineer to being a consultant than the other way round – first get the knowledge, then share it. If you’ve been a consultant for a few years and ended up not doing so much programming, that might make it harder to get a programming job (on the other hand, you might be lucky and find an employer who would value your wider experience).

Senior consultants are paid WAY more than senior software engineers in most of the industry – the exception being some of the huge software companies that really value having some top class senior engineers. However, at the junior levels and certainly straight out of college, there’s no real difference. Either way, you’re there to learn and add a small part to a team effort.

 

Posted in Information Technology

Data Science vs. Big Data vs. Data Analytics

Data is everywhere. In fact, the amount of digital data that exists is growing at a rapid rate, doubling every two years, and changing the way we live. According to IBM, 2.5 billion gigabytes (GB) of data was generated every day in 2012.

An article by Forbes states that data is growing faster than ever before, and that by the year 2020 about 1.7 megabytes of new information will be created every second for every human being on the planet.

This makes it extremely important to at least know the basics of the field. After all, this is where our future lies.

In this article, we will differentiate between Data Science, Big Data, and Data Analytics based on what each is, where it is used, the skills you need to become a professional in the field, and the salary prospects in each field.

Let’s first start off with understanding what these concepts are.

What They Are

Data Science: Dealing with unstructured and structured data, Data Science is a field that comprises everything related to data cleansing, preparation, and analysis.

Data Science is the combination of statistics, mathematics, programming, problem-solving, capturing data in ingenious ways, the ability to look at things differently, and the activity of cleansing, preparing and aligning the data.

In simple terms, it is the umbrella of techniques used when trying to extract insights and information from data.

Big Data: Big Data refers to humongous volumes of data that cannot be processed effectively with the traditional applications that exist. The processing of Big Data begins with the raw data that isn’t aggregated and is most often impossible to store in the memory of a single computer.

A buzzword that is used to describe immense volumes of data, both unstructured and structured, Big Data inundates a business on a day-to-day basis. Big Data can be analyzed for insights that lead to better decisions and strategic business moves.

The definition of Big Data, given by Gartner is, “Big data is high-volume, and high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation”.

Data Analytics: Data Analytics is the science of examining raw data with the purpose of drawing conclusions about that information.

Data Analytics involves applying an algorithmic or mechanical process to derive insights. For example, you might run through a number of data sets to look for meaningful correlations between them.
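To make that “mechanical process” concrete, here is a minimal sketch in Java that computes a Pearson correlation between two small series. The data and column meanings are purely hypothetical, and a real analysis would usually rely on a statistics library rather than hand-rolled code:

import java.util.stream.IntStream;

public class CorrelationSketch {

    // Pearson correlation coefficient between two equally sized series.
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = IntStream.range(0, n).mapToDouble(i -> x[i]).average().orElse(0);
        double meanY = IntStream.range(0, n).mapToDouble(i -> y[i]).average().orElse(0);
        double cov = 0, varX = 0, varY = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        // Hypothetical data: daily ad spend vs. units sold.
        double[] adSpend   = {120, 150, 90, 200, 170};
        double[] unitsSold = {14, 18, 10, 25, 21};
        System.out.printf("correlation = %.3f%n", pearson(adSpend, unitsSold));
    }
}

A value close to +1 or -1 suggests a relationship worth investigating further; a value near 0 suggests none.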

It is used in a number of industries to allow organizations and companies to make better decisions, as well as to verify or disprove existing theories or models.

The focus of Data Analytics lies in inference, which is the process of deriving conclusions that are solely based on what the researcher already knows.

The Applications of Each Field

Applications of Data Science:

  • Internet search: Search engines make use of data science algorithms to deliver the best results for search queries in a fraction of a second.

  • Digital Advertisements: The entire digital marketing spectrum uses data science algorithms – from display banners to digital billboards. This is the main reason digital ads get a higher CTR than traditional advertisements.

  • Recommender systems: Recommender systems not only make it easy to find relevant products among the billions available but also add a lot to the user experience. A lot of companies use these systems to promote their products and suggestions in accordance with the user’s demands and the relevance of the information. The recommendations are based on the user’s previous search results.

Applications of Big Data:

  • Big Data for financial services: Credit card companies, retail banks, private wealth management advisories, insurance firms, venture funds, and institutional investment banks use big data for their financial services. The common problem among them all is the massive amounts of multi-structured data living in multiple disparate systems, which big data can solve. Thus big data is used in a number of ways, such as:

    • Customer analytics
    • Compliance analytics
    • Fraud analytics
    • Operational analytics
  • Big Data in communications: Gaining new subscribers, retaining customers, and expanding within current subscriber bases are top priorities for telecommunication service providers. The solutions to these challenges lie in the ability to combine and analyze the masses of customer-generated data and machine-generated data that is being created every day.

  • Big Data for Retail: Whether a brick-and-mortar retailer or an online e-tailer, the answer to staying in the game and being competitive is understanding the customer better in order to serve them. This requires the ability to analyze all the disparate data sources that companies deal with every day, including weblogs, customer transaction data, social media, store-branded credit card data, and loyalty program data.

Applications of Data Analysis:

  • Healthcare: The main challenge for hospitals, as cost pressures tighten, is to treat as many patients as they can efficiently while improving the quality of care. Instrument and machine data is being used increasingly to track and optimize patient flow, treatment, and equipment use in hospitals. It is estimated that a 1% efficiency gain could yield more than $63 billion in global healthcare savings.

  • Travel: Data analytics can optimize the buying experience through mobile/weblog and social media data analysis. Travel sites can gain insights into the customer’s desires and preferences. Products can be up-sold by correlating current sales with subsequent browsing, increasing browse-to-buy conversions via customized packages and offers. Personalized travel recommendations can also be delivered by data analytics based on social media data.

  • Gaming: Data Analytics helps in collecting data to optimize spend within as well as across games. Game companies gain insight into the likes, dislikes, and relationships of their users.

  • Energy Management: Most firms are using data analytics for energy management, including smart-grid management, energy optimization, energy distribution, and building automation in utility companies. The application here is centered on controlling and monitoring network devices, dispatching crews, and managing service outages. Utilities gain the ability to integrate millions of data points on network performance, and analytics lets engineers monitor the network.

The Skills you Require

To become a Data Scientist:

  • Education: 88% have a Master’s Degree and 46% have PhDs

  • In-depth knowledge of SAS and/or R: For Data Science, R is generally preferred.

  • Python coding: Python is the most common coding language that is used in data science along with Java, Perl, C/C++.

  • Hadoop platform: Although not always a requirement, knowing the Hadoop platform is still preferred for the field. Having a bit of experience in Hive or Pig is also a huge selling point.

  • SQL database/coding: Though NoSQL and Hadoop have become a major part of the Data Science background, it is still preferred if you can write and execute complex queries in SQL.

  • Working with unstructured data: It is most important that a Data Scientist is able to work with unstructured data, whether it comes from social media, video feeds, or audio.

To become a Big Data professional:

  • Analytical skills: The ability to make sense of the piles of data that you get. With analytical abilities, you will be able to determine which data is relevant to your solution; it is much like problem-solving.

  • Creativity: You need the ability to create new methods to gather, interpret, and analyze a data strategy. This is an extremely useful skill to possess.

  • Mathematics and statistical skills: Good, old-fashioned “number crunching”. This is extremely necessary, be it in data science, data analytics, or big data.

  • Computer science: Computers are the workhorses behind every data strategy. Programmers will have a constant need to come up with algorithms to process data into insights.

  • Business skills: Big Data professionals will need to have an understanding of the business objectives that are in place, as well as the underlying processes that drive the growth of the business and its profit.

To become a Data Analyst:

  • Programming skills: Knowledge of programming languages such as R and Python is extremely important for any data analyst.

  • Statistical skills and mathematics: Descriptive and inferential statistics and experimental design are a must for data analysts.

  • Machine learning skills

  • Data wrangling skills: The ability to map raw data and convert it into another format that allows for a more convenient consumption of the data.

  • Communication and Data Visualization skills

  • Data Intuition: It is extremely important for a professional to be able to think like a data analyst.

Posted in Information Technology

ITIL vs PMP

https://www.educba.com/itil-vs-pmp/

ITIL VS PMP – In the modern era of professional expertise, it has become increasingly important for those involved in management of critical projects and processes to gain an edge over the competition in terms of their knowledge and skill sets along with the ability to inspire confidence in the stakeholders.

If observed closely, a project manager is someone involved in defining and implementing effective work strategies in order to achieve quantifiable success in terms of pre-defined goals at every stage of a project.

In a modern setting, where cross-industry global projects are more of a norm than exception, large scale corporations are wary of hiring someone they cannot rely on completely with such a role, especially with the kind of technological, financial and human resources involved.

This is one of the primary reasons why along with professional experience, requisite certifications have nearly become the order of the day for project managers and those in equivalent positions in order to make an impression with their employers.

However, there are a number of professional certifications that have aggressively acquired this space and it is no mean task to choose the most suitable credential for a professional.

Two of the leading certifications which attract the attention of aspiring and existing project managers include ITIL Certification and PMP Certification, each with its own distinct set of advantages.

ITIL vs PMP Infographics

Why ITIL & PMP?

The answer is that since both of these certifications speak of adopting a structured approach to managing specific projects and related tasks, they naturally gain relevance for professionals in the field of project management.

Today, project managers have a difficult time choosing which certification would meet their professional needs adequately.

This is why a comparative approach is being adopted to study the inherent advantages and disadvantages underlying these certifications and their specific methodologies.

This should help any professional better understand what exactly each of these credentials offers and choose accordingly. However, first we need to discuss each of them individually, and only then try to compare them.

ITIL Certification:

ITIL stands for Information Technology Infrastructure Library and the certification is designed to help professionals become acquainted with the principles of enterprise IT management.

In order to achieve this objective, candidates are introduced to a standard IT service management framework as a part of comprehensive ITIL training programs.

The purpose of ITIL is to help professionals become acquainted with the best codes and practices in the field of IT service management.

ITIL’s Philosophy:

This leads to enhanced efficiency in execution of IT projects and assists organizations in delivering greater value to their customers. The whole concept of ITIL revolves around IT service management and is primarily based on a life cycle approach.

Needless to say, this methodology has been especially useful for organizations engaged in providing IT services. However, this unique framework can be adapted to specific enterprise management needs of non-IT organizations as well.

PMP Certification:

PMP is a highly valued credential offered by Project Management Institute (PMI), designed to help project managers imbibe the fundamental principles and practices of project management as embodied in PMBOK (Project Management Body of Knowledge).

Painstakingly developed by PMI over the years, PMBOK serves as a kind of universal guide for project managers across the world, detailing a highly effective framework for project management that sets it apart from most other project management certifications.

PMP’s Philosophy:

PMP is focused on the processes, tools and methodologies to be adopted for successful completion of a project and is wider in scope as compared to ITIL in the sense that its principles and methodologies are applicable to projects of any scope and size in almost any industry.

Instead of focusing on a life cycle approach and the services aspect of an organization, it focuses on techniques for efficiently carrying out specific projects with a finite scope, time and budget.

Different Schools of Thought:

Some experts consider ITIL and PMP to be such different methodologies that their individual advantages and disadvantages must be weighed to choose the right certification based on its relevance for any given professional.

However, there is a different school of thought, which suggests that although both these credentials advocate different methodologies, they can be combined to create greater value and to complete highly complex projects and undertakings with a high level of precision.

Next we will analyze in what ways ITIL and PMP are similar and at the same time what makes them unique, before going ahead to study whether the methodologies they represent can be brought together successfully.

What is Common in ITIL & PMP?

There are a number of similarities between the ITIL and PMP frameworks despite their uniqueness, including but not limited to their heavy reliance on a comprehensive set of tools and processes for accomplishing complex tasks, resulting in enhanced efficiency within the specific context of a project or an organization.

Where ITIL addresses the needs of an entire organization in terms of streamlining its service management and improving processes through a lifecycle approach, PMP is more focused on managing an individual project or set of projects within an organization instead of dealing with organizational operations as a whole.

What makes them Different?

So far it sounds like the only key difference between these methodologies is that one deals with the service aspect of an organization instead of each individual project, whereas the other is about managing individual projects efficiently.

However, another major difference lies in the fact that where ITIL is concerned with managing the service and other aspects of only IT enterprises, PMP speaks of managing projects related to almost any industry.

Another important difference is in the way their methodologies proceed to define the core set of processes involved. This is where their methodologies diverge, as the following breakdown shows.

Breakdown of Processes in PMP & ITIL Methodologies:

ITIL Methodology:

As already discussed, ITIL is primarily concerned with defining a comprehensive and coherent set of best practices for IT service management.

Although primarily meant for IT organizations, its unique framework can be adapted and implemented by organizations of almost any kind to help deliver value and improve organizational strategy in terms of service management.

ITIL breaks down IT service management into two primary areas, IT service support and IT service delivery, each with sub-processes of its own.

IT Service Support:

This area of service management is concerned with application of principles which hold the key to providing IT services in an effective manner without compromising on quality or other aspects.

IT Service Support includes 6 processes:

  1. Configuration Management
  2. Incident Management
  3. Problem Management
  4. Change Management
  5. Service/Help Desk
  6. Release Management

IT Service Delivery is concerned with the application of principles for ensuring the quality of deliverables in terms of IT services to the customer.

IT Service Delivery includes no fewer than 5 processes:

  1. Service Level Management
  2. Capacity Management
  3. Continuity Management
  4. Availability Management
  5. IT Financial Management

PMP Methodology:

PMP considers a project as a complete and closed entity with specific needs of its own, which are decidedly different from the organization-wide needs in any case. Each project is bound and defined by its finite nature in terms of its pre-allocated budget, time and scope.

The goal of a project manager implementing this methodology is to complete any given project within these predefined parameters while ensuring that any unforeseen changes effected in the duration of a project are accommodated without any adverse impact on the outcome of the project.

In order to achieve its project management objectives, breakdown of processes is carried out in a completely different manner. PMBOK defines all these processes and the principles behind them for guiding project management professionals around the world.

PMBOK defines 5 core process groups including:

  1. Initiating
  2. Planning
  3. Executing
  4. Monitoring & Controlling
  5. Closing

With a view to study and comprehend processes better, they are divided into 10 Knowledge Areas (KAs) in PMBOK Guide:

  1. Project Integration Management
  2. Project Scope Management
  3. Project Time Management
  4. Project Cost Management
  5. Project Quality Management
  6. Project Human Resource Management
  7. Project Communications Management
  8. Project Risk Management
  9. Project Procurement Management
  10. Project Stakeholders Management

It is worth pointing out here that there are a total of 47 processes outlined in the 5th Edition of the PMBOK Guide. Both of these classifications, process groups and KAs, only serve to organize the processes in different ways for the purpose of understanding their relevance in varying contexts.

Where process groups classify these processes in terms of their relevance in different project phases, KAs classify them in terms of functional areas when studying these processes.

Certification Levels:

PMP Certification Path:

This is one of the most important aspects to be considered when comparing these two certifications. As far as PMP is concerned, it’s just a single, stand-alone certification, but highly valued, no doubt.

There are no basic or advanced levels of PMP, although beginners in the field of project management can opt for CAPM (Certified Associate in Project Management), which requires relatively less experience than what is required for undertaking the PMP exam.

ITIL Certification Path:

On the other hand, ITIL offers a certification path to accommodate the learning needs of IT professionals of different levels. This includes:

  • ITIL Foundation
  • ITIL Intermediate Level
  • ITIL Managing Across the Lifecycle
  • ITIL Expert Level

Here, we are primarily discussing the merits of the ITIL Foundation Level as compared with PMP for any project management professional. To begin with, the Foundation Level imparts knowledge of the fundamental principles and concepts related to the ITIL Service Lifecycle.

It helps candidates become acquainted with the basic ideas of the ITIL methodology and get started in the right direction.

Prerequisites:

What it takes for a PMP?

Professionals require at least 7,500 hours or 4,500 hours of documented project management experience, depending on whether they have a secondary-level diploma or a four-year bachelor’s degree (or its global equivalent), respectively. In addition to that, 35 PDUs, or Professional Development Units, have to be earned by a professional through a contact learning program.

What it takes for ITIL Foundation Level?

For ITIL Foundation Level, there is no such eligibility criterion in terms of qualification or work experience for aspiring candidates. It can be a good choice of certification for IT professionals or business managers.

However, advanced levels of ITIL certification do require one to have earned the preceding certifications, starting with the ITIL Foundation Level certification.

What Does All of This Mean?

With almost no prerequisites, it is evident that it is difficult to consider ITIL Foundation Level certification on par with PMP and in fact, it may not be fair to compare them as competing certifications, speaking in terms of professional experience.

However, it is the certification path which ITIL Foundation Level opens up for professionals which holds greater significance.

Going a step further, it may be a good idea for someone holding a PMP to opt for ITIL Foundation Level certification which would not only help gain a basic understanding of IT service management, but also help earn a good number of PDUs required to retain the PMP certification.

Any PMP is required to earn 60 PDUs within 3 years to be able to retain the certification and going in for ITIL Foundation Level would help get 17-25 PDUs depending on the education provider.

Making the Hard Choice:

Keeping in mind all the factors we have discussed so far, we can now attempt to compare ITIL and PMP and find out which kinds of professionals each holds relevance for.

It may be made clear at the outset that any project management professional who has nothing to do with the IT industry can safely stay away from ITIL.

However, as discussed above, ITIL Foundation would be a good choice for professionals holding a PMP and interested in learning more about IT service management.

Having said that, it may also be useful to keep in mind which certification is more valued and favored by leading employers in your geographical location.

Combining ITIL & PMP Methodologies:

Even otherwise, professionals with a PMP can learn a great deal from ITIL and combine the two approaches for better managing IT-related projects and understanding the intricacies of organization-level service management.

For someone in the IT industry, ITIL is the natural choice, but it is recommended not to stop at the Foundation Level and to go on to earn advanced ITIL certifications, which will serve them much better in terms of professional recognition and expertise.

Additionally, someone in the IT industry aspiring to be a project manager, either within the IT industry or elsewhere, would definitely benefit from earning a PMP.

Conclusion:

To sum it up, although ITIL and PMP represent two completely different methodologies which find application in different industry-based contexts, they do have their areas of overlapping functionalities.

If they are brought together, as some experts aggressively advocate, it can serve as a fitting example of the whole being greater than the sum of its parts.

This is because a synergistic combination of these approaches would not only help manage individual projects better and improve service management but would ultimately lead to creation of value-driven services both within the finite scope of a project and in the organizational service management as a whole.

 

Posted in Information Technology

Internet of Things (IoT) with Redis, NodeMCU and Spring boot

https://medium.com/@dinethchalitha/internet-of-things-iot-with-redis-nodemcu-and-spring-boot-3d3291484b11

Connecting a thing to a network doesn’t make it “smart”. Truly smart devices provide valuable services, are trusted, and are easy to use. Make your life easier by implementing your own IoT solution the smart way.

Redis is an open-source, in-memory database, cache, and message broker. Redis supports various types of data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, HyperLogLogs, and geospatial indexes with radius queries.

Most other databases require a considerable amount of resources to handle millions of transactions, and real-time analytics can be difficult for them. Redis uses a minimal amount of resources, and its built-in data structures and modules are an advantage in delivering reliable IoT solutions.

NodeMCU is an open-source IoT platform. The NodeMCU development board is a powerful solution for programming microcontrollers. It is a cute module with a microcontroller and an integrated Wi-Fi receiver and transmitter.

Let’s start implementing a scenario using a smart database and a cute module. 🙂

This is a very simple scenario. If you are not familiar with the NodeMCU/ESP8266, you can refer to this guide (https://www.instructables.com/id/Quick-Start-to-Nodemcu-ESP8266-on-Arduino-IDE/).

Scenario —

Here, the NodeMCU board is connected to a DHT11 sensor, the Redis DB, and the home Wi-Fi network. The sensor sends its readings to the NodeMCU, which updates the Redis cache directly via a TCP call. The Redis DB runs on another host (a PC). A Spring Boot app then reads the data from the Redis cache and exposes it as a REST service.

High level architecture.

Hardware implementation

Required resources:

  • NodeMCU development board.
  • DHT11 sensor module.
  • Breadboard and jumper wires (optional).
Hardware deployment diagram:
  • Pin 1 of the DHT11 goes into +3V.
  • Pin 2 of the DHT11 goes into Digital Pin D4.
  • Pin 3 of the DHT11 goes into Ground Pin (GND).
Hardware implementation

The Arduino code implements the following functions:

  • Connect to the Wi-Fi network.
  • Connect to Redis.
  • Read data from the DHT11 sensor.
  • Get the current time using NTPClient.
  • Send the data to the Redis cache.

Full Arduino source code:

//network SSID (name)
#define WIFI_SSID "DCB"
#define WIFI_PASS "dcbdcb123bcd"
//redis config
#define REDISHOST "192.168.8.101"
#define REDISPORT 6379
#include <ESP8266WiFi.h>
#include <DHT.h>
#include <NTPClient.h>
#include <WiFiUdp.h>
DHT dht;
#define NTP_OFFSET   60*60*5// In seconds
#define NTP_INTERVAL 60 * 1000    // In miliseconds
#define NTP_ADDRESS  "www.sltime.org"
WiFiClient redis;
WiFiUDP ntpUDP;
NTPClient timeClient(ntpUDP, NTP_ADDRESS, NTP_OFFSET,  NTP_INTERVAL);
String dayStamp;
String timeStamp;
String formattedTime;
String formattedDate;
float humidity;
float temperature;
String key;
String sensorData;
void setup() {
  Serial.begin(115200);
  
  Serial.println("Serial initialized.");
  Serial.print("Connecting to ");
  Serial.print(WIFI_SSID);
  WiFi.begin(WIFI_SSID, WIFI_PASS);
  
  while (WiFi.status() != WL_CONNECTED) {  //Wait for the WiFI connection completion 
    delay(500);
    Serial.println("Waiting for connection"); 
  }
  Serial.println("");
  Serial.print("WiFi (");
  Serial.print(WiFi.macAddress());
  Serial.print(") connected with IP ");
  Serial.println(WiFi.localIP());
  Serial.print("DNS0 : ");
  Serial.println(WiFi.dnsIP(0));
  Serial.print("DNS1 : ");
  Serial.println(WiFi.dnsIP(1));
//set the DHT11 output datapin 
  dht.setup(D4);
  timeClient.begin();
  timeClient.setTimeOffset(3600); 
}
void loop() {
  timeClient.update();
  formattedTime = timeClient.getFormattedTime();
  formattedDate = timeClient.getFormattedDate();
//Extract Time and Date
  int splitT = formattedDate.indexOf("T");
  dayStamp = formattedDate.substring(0, splitT);
  timeStamp = formattedDate.substring(splitT+1, formattedDate.length()-1);
if (!redis.connected()) {
      Serial.print("Redis not connected, connecting...");
      if (!redis.connect(REDISHOST,REDISPORT)) {
        Serial.print  ("Redis connection failed...");
        Serial.println("Waiting for next read");
        return; 
      } else
        Serial.println("OK");
    }
humidity = dht.getHumidity();/* Get humidity value */
    temperature = dht.getTemperature();
    key = "DHT11:AreaX1";
    sensorData= "DATE: "+dayStamp+" TIME:"+timeStamp+" humidity:"+String(humidity)+" temperature:"+String(temperature);
    
    Serial.print("Time Formatted : ");
    Serial.println(formattedTime);
//
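// The following call writes "LPUSH <key> <sensorData>" using the raw RESP protocol:
// "*3" announces an array of 3 elements, and each element is framed as "$<length>\r\n<value>\r\n".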
redis.print(
      String("*3\r\n")
      +"$5\r\n"+"LPUSH\r\n"
      +"$"+key.length()+"\r\n"+key+"\r\n"
      +"$"+sensorData.length()+"\r\n"+sensorData+"\r\n"
    );
    
  while (redis.available() != 0)
    Serial.print((char)redis.read());
  delay(10000);/* Delay equal to the sampling period */
}

Redis Database setup

Download redis — https://redis.io/download

wget http://download.redis.io/redis-stable.tar.gz
tar xvzf redis-stable.tar.gz
cd redis-stable
make

Now, copy redis-server and redis-cli from redis-stable/src and paste them into two different directories.

Start the redis-server

./redis-server

Next, navigate to the redis-cli directory and connect to the Redis server by running the following command.

./redis-cli

After entering the MONITOR command in the CLI, you can watch the Redis operations as they arrive.
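
As a quick sanity check (a sketch using the redis-py client, which is not part of the original setup), you can also read the most recent entries back from the PC:

# pip install redis  (redis-py client; assumed here, not part of the original setup)
import redis

r = redis.Redis(host="192.168.8.101", port=6379, decode_responses=True)
# LPUSH puts the newest reading at the head of the list, so index 0 is the latest entry
for entry in r.lrange("DHT11:AreaX1", 0, 2):
    print(entry)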

Spring App

This service handles POST requests to /getSensorData, with a reqData parameter that specifies how many of the latest entries to read from the Redis cache. The POST request returns the requested number of entries as a JSON response to the client.

Here, I used the Jedis client to connect with redis database.

@Controller
@RequestMapping("/")
public class WebController {

    @Autowired
    private RedisUtil util;

    private JedisPool jedisPool = null;

    @RequestMapping(value = "/getSensorData", method = RequestMethod.POST)
    public @ResponseBody List<String> findValue(@RequestBody SensorData sensorData,
                                                @RequestParam("reqData") int reqData) {
        List<String> retrieveMap = null;
        jedisPool = util.getJedisPool();
        try (Jedis jedis = jedisPool.getResource()) {
            String key = getListKey(sensorData.getSensorType(), sensorData.getSensorLocation());
            retrieveMap = jedis.lrange(key, 0, reqData);
        }
        return retrieveMap;
    }

    private String getListKey(String sensorType, String location) {
        return sensorType + ":" + location;
    }
}

Sample request:

POST /getSensorData?reqData=3 HTTP/1.1
Host: localhost:8080
Content-Type: application/json
{
 "sensorType":"DHT11",
 "sensorLocation":"AreaX1"
}

Response to the above request:

[
    "DATE: 2018-08-18 TIME:17:19:46 humidity:93.00 temperature:28.00",
    "DATE: 2018-08-18 TIME:17:19:36 humidity:92.00 temperature:28.00",
    "DATE: 2018-08-18 TIME:17:19:26 humidity:92.00 temperature:28.00",
    "DATE: 2018-08-18 TIME:17:19:16 humidity:93.00 temperature:28.00"
]

 

Posted in Information Technology

Intro to Apache Kafka with Spring

1. Overview

Apache Kafka is a distributed and fault-tolerant stream processing system.

In this article, we’ll cover Spring support for Kafka and the level of abstractions it provides over native Kafka Java client APIs.

Spring Kafka brings the simple and typical Spring template programming model with a KafkaTemplate and Message-driven POJOs via @KafkaListener annotation.

2. Installation and Setup

To download and install Kafka, please refer to the official guide here.

We also need to add the spring-kafka dependency to our pom.xml:

<dependency>
    <groupId>org.springframework.kafka</groupId>
    <artifactId>spring-kafka</artifactId>
    <version>2.2.2.RELEASE</version>
</dependency>

The latest version of this artifact can be found here.

Our example application will be a Spring Boot application.

This article assumes that the server is started using the default configuration and no server ports are changed.

3. Configuring Topics

Previously we used to run command line tools to create topics in Kafka such as:

$ bin/kafka-topics.sh --create \
  --zookeeper localhost:2181 \
  --replication-factor 1 --partitions 1 \
  --topic mytopic

But with the introduction of AdminClient in Kafka, we can now create topics programmatically.

We need to add the KafkaAdmin Spring bean, which will automatically add topics for all beans of type NewTopic:

@Configuration
public class KafkaTopicConfig {
    
    @Value(value = "${kafka.bootstrapAddress}")
    private String bootstrapAddress;
    @Bean
    public KafkaAdmin kafkaAdmin() {
        Map<String, Object> configs = new HashMap<>();
        configs.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapAddress);
        return new KafkaAdmin(configs);
    }
    
    @Bean
    public NewTopic topic1() {
         return new NewTopic("baeldung", 1, (short) 1);
    }
}

4. Producing Messages

To create messages, first, we need to configure a ProducerFactory which sets the strategy for creating Kafka Producer instances.

Then we need a KafkaTemplate which wraps a Producer instance and provides convenience methods for sending messages to Kafka topics.

Producer instances are thread-safe and hence using a single instance throughout an application context will give higher performance. Consequently, KafkaTemplate instances are also thread-safe and use of one instance is recommended.

4.1. Producer Configuration

@Configuration
public class KafkaProducerConfig {
    @Bean
    public ProducerFactory<String, String> producerFactory() {
        Map<String, Object> configProps = new HashMap<>();
        configProps.put(
          ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,
          bootstrapAddress);
        configProps.put(
          ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
          StringSerializer.class);
        configProps.put(
          ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
          StringSerializer.class);
        return new DefaultKafkaProducerFactory<>(configProps);
    }
    @Bean
    public KafkaTemplate<String, String> kafkaTemplate() {
        return new KafkaTemplate<>(producerFactory());
    }
}

4.2. Publishing Messages

We can send messages using the KafkaTemplate class:

@Autowired
private KafkaTemplate<String, String> kafkaTemplate;
public void sendMessage(String msg) {
    kafkaTemplate.send(topicName, msg);
}

The send API returns a ListenableFuture object. If we want to block the sending thread and get the result about the sent message, we can call the get API of the ListenableFuture object. The thread will wait for the result, but it will slow down the producer.

Kafka is a fast stream processing platform. So it’s a better idea to handle the results asynchronously so that the subsequent messages do not wait for the result of the previous message. We can do this through a callback:

public void sendMessage(String message) {
            
    ListenableFuture<SendResult<String, String>> future = kafkaTemplate.send(topicName, message);
    
    future.addCallback(new ListenableFutureCallback<SendResult<String, String>>() {
        @Override
        public void onSuccess(SendResult<String, String> result) {
            System.out.println("Sent message=[" + message +
              "] with offset=[" + result.getRecordMetadata().offset() + "]");
        }
        @Override
        public void onFailure(Throwable ex) {
            System.out.println("Unable to send message=["
              + message + "] due to : " + ex.getMessage());
        }
    });
}

5. Consuming Messages

5.1. Consumer Configuration

For consuming messages, we need to configure a ConsumerFactory and a KafkaListenerContainerFactory. Once these beans are available in the Spring bean factory, POJO based consumers can be configured using @KafkaListener annotation.

The @EnableKafka annotation is required on the configuration class to enable detection of the @KafkaListener annotation on Spring-managed beans:

@EnableKafka
@Configuration
public class KafkaConsumerConfig {
    @Bean
    public ConsumerFactory<String, String> consumerFactory() {
        Map<String, Object> props = new HashMap<>();
        props.put(
          ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,
          bootstrapAddress);
        props.put(
          ConsumerConfig.GROUP_ID_CONFIG,
          groupId);
        props.put(
          ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
          StringDeserializer.class);
        props.put(
          ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
          StringDeserializer.class);
        return new DefaultKafkaConsumerFactory<>(props);
    }
    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String>
      kafkaListenerContainerFactory() {
   
        ConcurrentKafkaListenerContainerFactory<String, String> factory
          = new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory());
        return factory;
    }
}

5.2. Consuming Messages

@KafkaListener(topics = "topicName", groupId = "foo")
public void listen(String message) {
    System.out.println("Received Message in group foo: " + message);
}

Multiple listeners can be implemented for a topic, each with a different group Id. Furthermore, one consumer can listen for messages from various topics:

@KafkaListener(topics = "topic1, topic2", groupId = "foo")

Spring also supports retrieval of one or more message headers using the @Header annotation in the listener:

@KafkaListener(topics = "topicName")
public void listenWithHeaders(
  @Payload String message,
  @Header(KafkaHeaders.RECEIVED_PARTITION_ID) int partition) {
      System.out.println(
        "Received Message: " + message
        + " from partition: " + partition);
}

5.3. Consuming Messages from a Specific Partition

As you may have noticed, we had created the topic baeldung with only one partition. However, for a topic with multiple partitions, a @KafkaListener can explicitly subscribe to a particular partition of a topic with an initial offset:

@KafkaListener(
  topicPartitions = @TopicPartition(topic = "topicName",
  partitionOffsets = {
    @PartitionOffset(partition = "0", initialOffset = "0"),
    @PartitionOffset(partition = "3", initialOffset = "0")
}))
public void listenToPartition(
  @Payload String message,
  @Header(KafkaHeaders.RECEIVED_PARTITION_ID) int partition) {
      System.out.println(
        "Received Message: " + message
        + " from partition: " + partition);
}

Since the initialOffset has been set to 0 in this listener, all the previously consumed messages from partitions 0 and 3 will be re-consumed every time this listener is initialized. If setting the offset is not required, we can use the partitions property of the @TopicPartition annotation to set only the partitions without the offset:

@KafkaListener(topicPartitions
  = @TopicPartition(topic = "topicName", partitions = { "0", "1" }))

5.4. Adding Message Filter for Listeners

Listeners can be configured to consume specific types of messages by adding a custom filter. This can be done by setting a RecordFilterStrategy to the KafkaListenerContainerFactory:

@Bean
public ConcurrentKafkaListenerContainerFactory<String, String>
  filterKafkaListenerContainerFactory() {
    ConcurrentKafkaListenerContainerFactory<String, String> factory
      = new ConcurrentKafkaListenerContainerFactory<>();
    factory.setConsumerFactory(consumerFactory());
    factory.setRecordFilterStrategy(
      record -> record.value().contains("World"));
    return factory;
}

A listener can then be configured to use this container factory:

@KafkaListener(
  topics = "topicName",
  containerFactory = "filterKafkaListenerContainerFactory")
public void listen(String message) {
    // handle message
}

In this listener, all the messages matching the filter will be discarded.

6. Custom Message Converters

So far we have only covered sending and receiving Strings as messages. However, we can also send and receive custom Java objects. This requires configuring appropriate serializer in ProducerFactory and deserializer in ConsumerFactory.

Let’s look at a simple bean class, which we will send as messages:

public class Greeting {
    private String msg;
    private String name;
    // standard getters, setters and constructor
}

6.1. Producing Custom Messages

In this example, we will use JsonSerializer. Let’s look at the code for ProducerFactory and KafkaTemplate:

@Bean
public ProducerFactory<String, Greeting> greetingProducerFactory() {
    // ...
    configProps.put(
      ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
      JsonSerializer.class);
    return new DefaultKafkaProducerFactory<>(configProps);
}
@Bean
public KafkaTemplate<String, Greeting> greetingKafkaTemplate() {
    return new KafkaTemplate<>(greetingProducerFactory());
}

This new KafkaTemplate can be used to send the Greeting message:

kafkaTemplate.send(topicName, new Greeting("Hello", "World"));

6.2. Consuming Custom Messages

Similarly, let’s modify the ConsumerFactory and KafkaListenerContainerFactory to deserialize the Greeting message correctly:

@Bean
public ConsumerFactory<String, Greeting> greetingConsumerFactory() {
    // ...
    return new DefaultKafkaConsumerFactory<>(
      props,
      new StringDeserializer(),
      new JsonDeserializer<>(Greeting.class));
}
@Bean
public ConcurrentKafkaListenerContainerFactory<String, Greeting>
  greetingKafkaListenerContainerFactory() {
    ConcurrentKafkaListenerContainerFactory<String, Greeting> factory
      = new ConcurrentKafkaListenerContainerFactory<>();
    factory.setConsumerFactory(greetingConsumerFactory());
    return factory;
}

The spring-kafka JSON serializer and deserializer use the Jackson library, which is also an optional Maven dependency of the spring-kafka project. So let’s add it to our pom.xml:

<dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-databind</artifactId>
    <version>2.9.7</version>
</dependency>

Instead of using the latest version of Jackson, it’s recommended to use the version that spring-kafka declares in its pom.xml.

Finally, we need to write a listener to consume Greeting messages:

@KafkaListener(
  topics = "topicName",
  containerFactory = "greetingKafkaListenerContainerFactory")
public void greetingListener(Greeting greeting) {
    // process greeting message
}

7. Conclusion

In this article, we covered the basics of Spring support for Apache Kafka. We had a brief look at the classes which are used for sending and receiving messages.

Complete source code for this article can be found over on GitHub. Before executing the code, please make sure that the Kafka server is running and that the topics have been created manually.

Posted in Information Technology

15 data science certifications that will pay off

Data scientist is one of the hottest jobs in IT. What’s more, it’s the best job you can get, period, according to data from Glassdoor. If you’re looking to get into this field, or you want to stand out against the competition, look no further than data science certifications.

Data science is important to nearly every company and industry, but the skills that recruiters are looking for will vary across businesses and industries. Certifications are a great way to gain an edge because they allow you to develop skills that are hard to find in your desired industry. They’re also a way to back up your skills, so recruiters and hiring managers know what they’re getting if they hire you.

Whether you’re looking to earn a certification from an accredited university, gain some experience as a new grad, hone vendor-specific skills or demonstrate your broad knowledge of data analytics, at least one of these certifications (presented here in alphabetical order) will work for you.

Top 15 data science certifications

  • Applied AI with DeepLearning, IBM Watson IoT Data Science Certificate
  • Certified Analytics Professional (CAP)
  • Cloudera Certified Associate: Data Analyst
  • Cloudera Certified Professional: CCP Data Engineer
  • Data Science Council of America (DASCA)
  • Dell Technologies Data Scientist Associate (DCA-DS)
  • Dell Technologies Data Scientist Advanced Analytics Specialist (DCS-DS)
  • HDP Data Science
  • IBM Certified Data Architect
  • Microsoft MCSE: Data Management and Analytics
  • Microsoft Certified Azure Data Scientist Associate
  • Microsoft Professional Program in Data Science
  • SAS Certified Advanced Analytics Professional
  • SAS Certified Big Data Professional
  • SAS Certified Data Scientist

Applied AI with DeepLearning, IBM Watson IoT Data Science Certificate

To earn IBM’s Watson IoT Data Science Certification, you’ll need some experience coding, preferably in Python, but they will consider any programming language as a place to start. Math skills, especially with linear algebra, are recommended but the course promises to cover the topics within the first week. It’s aimed at those with more advanced data science skills and classes are offered through Coursera.

Cost: $49 per month for a subscription to Coursera
Location: Online
Duration: Self-paced
Expiration: Does not expire

Certified Analytics Professional (CAP)

CAP offers a vendor-neutral certification and promises to help you “transform complex data into valuable insights and actions,” which is exactly what businesses are looking for in a data scientist: someone who not only understands the data but can draw logical conclusions and then express to key stakeholders why those data points are significant. If you’re new to data analytics, you can start with the entry-level Associate Certified Analytics Professional (aCAP) exam and then move on to your CAP certification.

Cost: $495 for INFORMS members, $695 for non-members; team pricing for organizations is available on request
Location: In person at designated test centers
Duration: Self-paced
Expiration: Valid for three years

Cloudera Certified Associate: Data Analyst

The CCA exam demonstrates your foundational knowledge as a developer, data analyst and administrator of Cloudera’s enterprise software. Passing a CCA exam and earning your certification will show employers that you have a handle on the basic skills required to be a data scientist. It’s also a great way to prove your skills if you’re just starting out and lack a strong portfolio or past work experience.

Cost: $295 per exam specialty and per attempt
Location: Online
Duration: Self-paced
Expiration: Valid for two years

Cloudera Certified Professional: CCP Data Engineer

Once you earn your CCA, you can move on to the CCP exam, which Cloudera touts as one of the most rigorous and “demanding performance-based certifications.” According to the website, those looking to earn their CCP need to bring “in-depth experience developing data engineering solutions” to the table, as well as a “high-level of mastery” of common data science skills. The exam consists of eight to 12 customer problems that you will have to solve hands-on using a Cloudera Enterprise cluster. The exam lasts 120 minutes and you’ll need to earn a 70 percent or higher to pass.

Cost: $600 per attempt — each attempt includes three exams
Location: Online
Duration: Self-paced
Expiration: Valid for three years

Data Science Council of America (DASCA)

The Data Science Council of America offers a data scientist certification that was designed to address “credentialing requirements of senior, accomplished professionals who specialize in managing and leading Big Data strategies and programs for organizations,” according to DASCA. The certification track includes paths for earning your Senior Data Scientist (SDS) and the more advanced Principal Data Scientist (PDS) credentials. Both exams last 100 minutes and consist of 85 and 100 multiple-choice questions for the SDS and PDS exams, respectively. You’ll need at least six or more years of big data analytics or engineering experience to start on the SDS track and 10 or more years of experience to qualify for the PDS exam.

Cost: $520 per exam
Location: Online
Duration: Self-paced
Expiration: Valid for five years

Dell Technologies Data Scientist Associate (DCA-DS)

The DCA-DS certification is an entry-level data science designation that is designed for those new to the industry or who want to make a career switch to work as a data scientist. While the exam is designed for those without a strong background in machine learning, statistics, math or analytics, it’s still a requirement for the more advanced certification. So even if you’re already an experienced data scientist, you’ll still need to pass this exam before you can move on to the Advanced Analytics Specialist designation.

Cost: $230 per Proven Professional certification exam; you’ll also need to purchase any books or other course material
Location: Online via Pearson VUE
Duration: Self-paced
Expiration: Does not expire

Dell Technologies Data Scientist Advanced Analytics Specialist (DCS-DS)

The DCS-DS certification builds on the entry-level associate certification and covers general knowledge of big data analytics across different industries and technologies. It doesn’t specifically focus on one product or industry, so it’s a good option if you aren’t sure where you want to go with your data career or if you just want a more generalized certification for your resume. The exam covers advanced analytical methods, social network analysis, natural language processing, data visualization methods and popular data tools like Hadoop, Pig, Hive and HBase.

Cost: $230 per Proven Professional certification exam; you’ll also need to purchase any books or other course material
Location: Online via Pearson VUE
Duration: Self-paced
Expiration: Does not expire

HDP Data Science

The HDP Data Science certification course from Hortonworks covers data science topics like machine learning and natural language processing. It also covers popular concepts and algorithms used in classification, regression, clustering, dimensionality reduction and neural networks. The course will also get you up to speed on the latest tools and frameworks, including Python, NumPy, pandas, SciPy, Scikit-learn, NLTK, TensorFlow, Jupyter, Spark MLlib, Stanford CoreNLP, TensorFlowOnSpark/Horovod/MLeap and Apache Zeppelin. Half of the course is lecture and discussion and the other half consists of hands-on labs, which you’ll complete before taking the exam.

Cost: $250 per attempt
Location: Online
Duration: 4 days
Expiration: Valid for two years

IBM Certified Data Architect

IBM’s Certified Data Architect certification isn’t for everyone — it’s geared toward seasoned professionals and experts in the field. IBM recommends that you have knowledge of the data layer and associated risk and challenges, cluster management, network requirement, important interfaces, data modeling, latency, scalability, high availability, data replication and synchronization, disaster recovery, data lineage and governance, LDAP security and general big data best practices. You will also need prior experience with software such as BigInsights, BigSQL, Hadoop and Cloudant (NoSQL), among others. You can see the long list of prerequisites on IBM’s website, but it’s safe to say you’ll need a solid background in data science to qualify for this exam.

The certification exam consists of 55 questions across five sections, covering areas such as requirements (16%), use cases (46%), applying technologies (16%) and recoverability (11%); you will have 90 minutes to complete the exam. IBM offers web-based and in-classroom training courses on InfoSphere BigInsights, BigInsights Analytics for Programmers and Big SQL for developers.

Cost: $200 
Location: Online
Duration: 90 minutes
Expiration: N/A

Microsoft MCSE: Data Management and Analytics

MCSE certifications cover a wide variety of IT specialties and skills, including data science. For data science certifications, Microsoft offers two courses, one that focuses on business applications, and another that focuses on data management and analytics. However, each course requires prior certification under the MCSE Certification program, so you’ll want to make sure you check the requirements first.

Cost: $165 per exam, per attempt
Location: Online
Duration: Self-paced
Expiration: Valid for three years

Microsoft Certified Azure Data Scientist Associate

The Azure Data Scientist Associate certification from Microsoft focuses on your ability to use machine learning to “train, evaluate and deploy models that solve business problems,” according to Microsoft. Candidates for the exam are tested on machine learning, AI solutions, natural language processing, computer vision and predictive analytics. The exam focuses on defining and preparing the development environment, data modeling, feature engineering and developing models.

Cost: $165 
Location: Online
Duration: Self-paced
Expiration: Credentials do not expire

Microsoft Professional Program in Data Science

The Microsoft Professional Program in Data Science focuses on eight specific data science skills, including T-SQL, Microsoft Excel, PowerBI, Python, R, Azure Machine Learning, HDInsight and Spark. Microsoft claims there are over 1.5 million open jobs looking for these skills. Courses run for three months every quarter and you don’t have to take them in order; it’s self-paced with a recommended commitment of two to four hours per week.

Cost: Must purchase credits through EdX, some materials are free
Location: Online
Duration: 6 weeks
Expiration: Does not expire

SAS Certified Advanced Analytics Professional

This program covers machine learning, predictive modeling techniques, working with big data sets, finding patterns, optimizing data techniques and time series forecasting. The certification program consists of nine courses and three exams that you’ll have to pass to earn the designation. You’ll need at least six months of programming experience in SAS or another language and it’s also recommended that you have at least six months of experience using mathematics or statistics in a business setting.

Cost: $299 per month subscription 
Location: Online
Duration: Self-paced
Expiration: Credentials do not expire

SAS Certified Big Data Professional

The SAS Big Data certification includes two modules with a total of nine courses. You’ll need to pass two exams to earn the designation. The course covers SAS programming skills, working with data, improving data quality, communication skills, fundamentals of statistics and analytics, data visualization and popular data tools such as Hadoop, Hive, Pig and SAS. To qualify for the exam, you’ll need at least six months of programming experience in SAS or another language.

Cost: $299 per month subscription 
Location: Online
Duration: Self-paced
Expiration: Credentials do not expire

SAS Certified Data Scientist

The SAS Certified Data Scientist certification is a combination of the other two data certifications offered through SAS. It covers programming skills; managing and improving data; transforming, accessing and manipulating data; and working with popular data visualization tools. Once you earn both the Big Data Professional and Advanced Analytics Professional certifications, you can qualify to earn your SAS Certified Data Scientist designation. You’ll need to complete all 18 courses and pass the five exams between the two separate certifications.

Cost: $299 per month subscription 
Location: Online
Duration: Self-paced
Expiration: Credentials do not expire

Posted in Information Technology

An Implementation and Explanation of the Random Forest in Python

https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76

Fortunately, with libraries such as Scikit-Learn, it’s now easy to implement hundreds of machine learning algorithms in Python. It’s so easy that we often don’t need any underlying knowledge of how the model works in order to use it. While knowing all the details is not necessary, it’s still helpful to have an idea of how a machine learning model works under the hood. This lets us diagnose the model when it’s underperforming or explain how it makes decisions, which is crucial if we want to convince others to trust our models.

In this article, we’ll look at how to build and use the Random Forest in Python. In addition to seeing the code, we’ll try to get an understanding of how this model works. Because a random forest is made of many decision trees, we’ll start by understanding how a single decision tree makes classifications on a simple problem. Then, we’ll work our way to using a random forest on a real-world data science problem. The complete code for this article is available as a Jupyter Notebook on GitHub.

Note: this article originally appeared on enlight, a community-driven, open-source platform with tutorials for those looking to study machine learning.


Understanding a Decision Tree

A decision tree is the building block of a random forest and is an intuitive model. We can think of a decision tree as a series of yes/no questions asked about our data eventually leading to a predicted class (or continuous value in the case of regression). This is an interpretable model because it makes classifications much like we do: we ask a sequence of queries about the available data we have until we arrive at a decision (in an ideal world).

The technical details of a decision tree are in how the questions about the data are formed. In the CART algorithm, a decision tree is built by determining the questions (called splits of nodes) that, when answered, lead to the greatest reduction in Gini Impurity. What this means is the decision tree tries to form nodes containing a high proportion of samples (data points) from a single class by finding values in the features that cleanly divide the data into classes.

We’ll talk in low-level detail about Gini Impurity later, but first, let’s build a Decision Tree so we can understand it on a high level.

Decision Tree on Simple Problem

We’ll start with a very simple binary classification problem as shown below:

The goal is to divide the data points into their respective classes.

Our data only has two features (predictor variables), x1 and x2 with 6 data points — samples — divided into 2 different labels. Although this problem is simple, it’s not linearly separable, which means that we can’t draw a single straight line through the data to classify the points.

We can however draw a series of straight lines that divide the data points into boxes, which we’ll call nodes. In fact, this is what a decision tree does during training. Effectively, a decision tree is a non-linear model built by constructing many linear boundaries.

To create a decision tree and train (fit) it on the data, we use Scikit-Learn.
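
The original post shows this step as an embedded code cell; here is a minimal sketch of what it looks like with Scikit-Learn (the X values are illustrative, and the labels match the 2-versus-4 class split described later):

from sklearn.tree import DecisionTreeClassifier

# Six 2-feature points (illustrative values) and their binary labels
X = [[0.5, 0.7], [1.2, 0.9], [2.1, 1.8], [2.9, 2.4], [3.3, 0.4], [3.9, 2.9]]
y = [0, 0, 1, 1, 1, 1]

# No max_depth limit, so the tree can fit the training data perfectly
tree = DecisionTreeClassifier()
tree.fit(X, y)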

During training we give the model both the features and the labels so it can learn to classify points based on the features. (We don’t have a testing set for this simple problem, but when testing, we only give the model the features and have it make predictions about the labels.)

We can test the accuracy of our model on the training data:
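
A sketch of that check, continuing from the snippet above:

# The unrestricted tree memorizes the training set, so training accuracy is 1.0
print(f'Model accuracy: {tree.score(X, y)}')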

We see that it gets 100% accuracy, which is what we expect because we gave it the answers (y) for training and did not limit the depth of the tree. It turns out this ability to completely learn the training data can be a downside of a decision tree because it may lead to overfitting as we’ll discuss later.


Visualizing a Decision Tree

So what’s actually going on when we train a decision tree? I find a helpful way to understand the decision tree is by visualizing it, which we can do using a Scikit-Learn function (for details check out the notebook or this article).
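
One way to produce such a visual (a sketch; the notebook uses a similar Graphviz-based export) is:

from sklearn.tree import export_graphviz

# Writes a Graphviz .dot description of the fitted tree; render it with the
# graphviz package or the `dot` command-line tool to get an image
export_graphviz(tree, out_file='tree.dot',
                feature_names=['x1', 'x2'],
                class_names=['0', '1'],
                filled=True, rounded=True)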

Simple decision tree

All the nodes, except the leaf nodes (colored terminal nodes), have 5 parts:

  1. Question asked about the data based on a value of a feature. Each question has either a True or False answer that splits the node. Based on the answer to the question, a data point moves down the tree.
  2. gini: The Gini Impurity of the node. The average weighted Gini Impurity decreases as we move down the tree.
  3. samples: The number of observations in the node.
  4. value: The number of samples in each class. For example, the top node has 2 samples in class 0 and 4 samples in class 1.
  5. class: The majority classification for points in the node. In the case of leaf nodes, this is the prediction for all samples in the node.

The leaf nodes do not have a question because these are where the final predictions are made. To classify a new point, simply move down the tree, using the features of the point to answer the questions until you arrive at a leaf node where the class is the prediction.

To see the tree in a different way, we can draw the splits built by the decision tree on the original data.

Splits made by the decision tree.

Each split is a single line that divides data points into nodes based on feature values. For this simple problem and with no limit on the maximum depth, the divisions place each point in a node with only points of the same class. (Again, later we’ll see that this perfect division of the training data might not be what we want because it can lead to overfitting.)


Gini Impurity

At this point it’ll be helpful to dive into the concept of Gini Impurity (the math is not intimidating!). The Gini Impurity of a node is the probability that a randomly chosen sample in the node would be incorrectly labeled if it was labeled according to the distribution of samples in the node. For example, in the top (root) node, there is a 44.4% chance of incorrectly classifying a data point chosen at random based on the sample labels in the node. We arrive at this value using the following equation:

Gini impurity of a node n.

The Gini Impurity of a node n is 1 minus the sum, over all J classes (for a binary classification task, J is 2), of the squared fraction of examples p_i in each class. That might be a little confusing in words, so let’s work out the Gini Impurity of the root node.

Gini Impurity of the root node
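
Since the equation figures do not reproduce here, the formula and the root-node calculation described in the text can be written out as:

I_G(n) = 1 - \sum_{i=1}^{J} p_i^2

I_G(\text{root}) = 1 - \left[ \left(\tfrac{2}{6}\right)^2 + \left(\tfrac{4}{6}\right)^2 \right] = 1 - \tfrac{5}{9} \approx 0.444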

At each node, the decision tree searches through the features for the value to split on that results in the greatest reduction in Gini Impurity. (An alternative for splitting nodes is using the information gain, a related concept).

It then repeats this splitting process in a greedy, recursive procedure until it reaches a maximum depth, or until each node contains only samples from one class. The weighted total Gini Impurity at each level of the tree must decrease. At the second level of the tree, the total weighted Gini Impurity is 0.333:

(The Gini Impurity of each node is weighted by the fraction of points from the parent node in that node.) You can continue to work out the Gini Impurity for each node (check the visual for the answers). Out of some basic math, a powerful model emerges!
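
In symbols, the weighted total at a given level of the tree (with n_k samples in node k out of n total samples) can be written as:

I_G^{\text{weighted}} = \sum_{k} \frac{n_k}{n} \, I_G(k)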

Eventually, the weighted total Gini Impurity of the last layer goes to 0 meaning each node is completely pure and there is no chance that a point randomly selected from that node would be misclassified. While this may seem like a positive, it means that the model may potentially be overfitting because the nodes are constructed only using training data.

Overfitting: Or Why a Forest is better than One Tree

You might be tempted to ask why not just use one decision tree? It seems like the perfect classifier since it did not make any mistakes! A critical point to remember is that the tree made no mistakes on the training data. We expect this to be the case since we gave the tree the answers and didn’t limit the max depth (number of levels). The objective of a machine learning model is to generalize well to new data it has never seen before.

Overfitting occurs when we have a very flexible model (the model has a high capacity) which essentially memorizes the training data by fitting it closely. The problem is that the model learns not only the actual relationships in the training data, but also any noise that is present. A flexible model is said to have high variance because the learned parameters (such as the structure of the decision tree) will vary considerably with the training data.

On the other hand, an inflexible model is said to have high bias because it makes assumptions about the training data (it’s biased towards pre-conceived ideas of the data.) For example, a linear classifier makes the assumption that the data is linear and does not have the flexibility to fit non-linear relationships. An inflexible model may not have the capacity to fit even the training data and in both cases — high variance and high bias — the model is not able to generalize well to new data.

The balance between creating a model that is so flexible it memorizes the training data versus an inflexible model that can’t learn the training data is known as the bias-variance tradeoff and is a foundational concept in machine learning.


The reason the decision tree is prone to overfitting when we don’t limit the maximum depth is because it has unlimited flexibility, meaning that it can keep growing until it has exactly one leaf node for every single observation, perfectly classifying all of them. If you go back to the image of the decision tree and limit the maximum depth to 2 (making only a single split), the classifications are no longer 100% correct. We have reduced the variance of the decision tree but at the cost of increasing the bias.

As an alternative to limiting the depth of the tree, which reduces variance (good) and increases bias (bad), we can combine many decision trees into a single ensemble model known as the random forest.

Random Forest

The random forest is a model made up of many decision trees. Rather than just simply averaging the predictions of the trees (which we could call a “forest”), this model uses two key concepts that give it the name random:

  1. Random sampling of training data points when building trees
  2. Random subsets of features considered when splitting nodes

Random sampling of training observations

When training, each tree in a random forest learns from a random sample of the data points. The samples are drawn with replacement, known as bootstrapping, which means that some samples will be used multiple times in a single tree. The idea is that by training each tree on different samples, although each tree might have high variance with respect to a particular set of the training data, overall, the entire forest will have lower variance but not at the cost of increasing the bias.

At test time, predictions are made by averaging the predictions of each decision tree. This procedure of training each individual learner on different bootstrapped subsets of the data and then averaging the predictions is known as bagging, short for bootstrap aggregating.

Random Subsets of features for splitting nodes

The other main concept in the random forest is that only a subset of all the features are considered for splitting each node in each decision tree. Generally this is set to sqrt(n_features) for classification meaning that if there are 16 features, at each node in each tree, only 4 random features will be considered for splitting the node. (The random forest can also be trained considering all the features at every node as is common in regression. These options can be controlled in the Scikit-Learn Random Forest implementation).


If you can comprehend a single decision tree, the idea of bagging, and random subsets of features, then you have a pretty good understanding of how a random forest works:

The random forest combines hundreds or thousands of decision trees, trains each one on a slightly different set of the observations, splitting nodes in each tree considering a limited number of the features. The final predictions of the random forest are made by averaging the predictions of each individual tree.

To understand why a random forest is better than a single decision tree imagine the following scenario: you have to decide whether Tesla stock will go up and you have access to a dozen analysts who have no prior knowledge about the company. Each analyst has low bias because they don’t come in with any assumptions, and is allowed to learn from a dataset of news reports.

This might seem like an ideal situation, but the problem is that the reports are likely to contain noise in addition to real signals. Because the analysts are basing their predictions entirely on the data — they have high flexibility — they can be swayed by irrelevant information. The analysts might come up with differing predictions from the same dataset. Moreover, each individual analyst has high variance and would come up with drastically different predictions if given a different training set of reports.

The solution is to not rely on any one individual, but pool the votes of each analyst. Furthermore, like in a random forest, allow each analyst access to only a section of the reports and hope the effects of the noisy information will be cancelled out by the sampling. In real life, we rely on multiple sources (never trust a solitary Amazon review), and therefore, not only is a decision tree intuitive, but so is the idea of combining them in a random forest.


Random Forest in Practice

Next, we’ll build a random forest in Python using Scikit-Learn. Instead of learning a simple problem, we’ll use a real-world dataset split into a training and testing set. We use a test set as an estimate of how the model will perform on new data which also lets us determine how much the model is overfitting.

Dataset

The problem we’ll solve is a binary classification task with the goal of predicting an individual’s health. The features are socioeconomic and lifestyle characteristics of individuals and the label is 0 for poor health and 1 for good health. This dataset was collected by the Centers for Disease Control and Prevention and is available here.

Sample of Data

Generally, 80% of a data science project is spent cleaning, exploring, and making features out of the data. However, for this article, we’ll stick to the modeling. (For details of the other steps, look at this article).

This is an imbalanced classification problem, so accuracy is not an appropriate metric. Instead we’ll measure the Receiver Operating Characteristic Area Under the Curve (ROC AUC), a measure from 0 (worst) to 1 (best) with a random guess scoring 0.5. We can also plot the ROC curve to assess a model.


The notebook contains the implementation for both the decision tree and the random forest, but here we’ll just focus on the random forest. After reading in the data, we can instantiate and train a random forest as follows:
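
The original embeds this as a code cell; a minimal sketch, assuming the cleaned data is already split into train / train_labels arrays (names are assumptions):

from sklearn.ensemble import RandomForestClassifier

# 100 trees; classification uses sqrt(n_features) at each split by default
model = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=50)
model.fit(train, train_labels)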

After a few minutes to train, the model is ready to make predictions on the testing data as follows:
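
A sketch of the prediction step (test and test_labels are assumed to be the held-out features and labels):

from sklearn.metrics import roc_auc_score

# Hard class predictions and predicted probabilities for the positive class
rf_predictions = model.predict(test)
rf_probs = model.predict_proba(test)[:, 1]

# ROC AUC is computed from the probabilities, not the hard predictions
print(roc_auc_score(test_labels, rf_probs))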

We make class predictions (predict) as well as predicted probabilities (predict_proba) to calculate the ROC AUC. Once we have the testing predictions, we can calculate the ROC AUC.

Results

The final testing ROC AUC for the random forest was 0.87 compared to 0.67 for the single decision tree with an unlimited max depth. If we look at the training scores, both models achieved 1.0 ROC AUC, which again is as expected because we gave these models the training answers and did not limit the maximum depth of each tree.

Although the random forest overfits (doing better on the training data than on the testing data), it is able to generalize much better to the testing data than the single decision tree. The random forest has lower variance (good) while maintaining the same low bias (also good) of a decision tree.

We can also plot the ROC curve for the single decision tree (top) and the random forest (bottom). A curve to the top and left is a better model:

Decision Tree ROC Curve
Random Forest ROC Curve

The random forest significantly outperforms the single decision tree.

Another diagnostic measure of the model we can take is to plot the confusion matrix for the testing predictions (see the notebook for details):

This shows the predictions the model got correct in the top left and bottom right corners and the predictions missed by the model in the lower left and upper right. We can use plots such as these to diagnose our model and decide whether it’s doing well enough to put into production.


Feature Importances

The feature importances in a random forest indicate the sum of the reduction in Gini Impurity over all the nodes that are split on that feature. We can use these to try and figure out what predictor variables the random forest considers most important. The feature importances can be extracted from a trained random forest and put into a Pandas dataframe as follows:
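
A sketch of that extraction (features is assumed to be the training dataframe, so its column names line up with the importances):

import pandas as pd

# Pair each feature name with its importance and sort descending
fi = (pd.DataFrame({'feature': features.columns,
                    'importance': model.feature_importances_})
        .sort_values('importance', ascending=False))
print(fi.head())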

Feature importances can give us insight into a problem by telling us what variables are the most discerning between classes. For example, here DIFFWALK, indicating whether the patient has difficulty walking, is the most important feature which makes sense in the problem context.

Feature importances can be used for feature engineering by building additional features from the most important. We can also use feature importances for feature selection by removing low importance features.

Visualize Tree in Forest

Finally, we can visualize a single decision tree in the forest. This time, we have to limit the depth of the tree otherwise it will be too large to be converted into an image. To make the figure below, I limited the maximum depth to 6. This still results in a large tree that we can’t completely parse! However, given our deep dive into the decision tree, we grasp how our model is working.

Single decision tree in random forest.

Next Steps

A further step is to optimize the random forest, which we can do through random search using RandomizedSearchCV in Scikit-Learn. Optimization refers to finding the best hyperparameters for a model on a given dataset. The best hyperparameters will vary between datasets, so we have to perform optimization (also called model tuning) separately on each dataset.

I like to think of model tuning as finding the best settings for a machine learning algorithm. Examples of what we might optimize in a random forest are the number of decision trees, the maximum depth of each decision tree, the maximum number of features considered for splitting each node, and the maximum number of data points required in a leaf node.
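
A minimal sketch of such a random search (the hyperparameter ranges here are illustrative, not the ones from the notebook, and train / train_labels are the assumed training arrays):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    'n_estimators': [50, 100, 200, 400],
    'max_depth': [None, 5, 10, 20],
    'max_features': ['sqrt', 'log2', None],
    'min_samples_leaf': [1, 2, 4, 8],
}

# Try 20 random combinations, scored by cross-validated ROC AUC
search = RandomizedSearchCV(RandomForestClassifier(random_state=50),
                            param_distributions=param_dist,
                            n_iter=20, scoring='roc_auc', cv=3, random_state=50)
search.fit(train, train_labels)
print(search.best_params_)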

For an implementation of random search for model optimization of the random forest, refer to the Jupyter Notebook.

Complete Running Example

The below code is created with repl.it and presents a complete interactive running example of the random forest in Python. Feel free to run and change the code (loading the packages might take a few moments).

Complete Python example of random forest.

Conclusions

While we can build powerful machine learning models in Python without understanding anything about them, I find it’s more effective to have knowledge about what is occurring behind the scenes. In this article, we not only built and used a random forest in Python, but we also developed an understanding of the model by starting with the basics.

We first looked at an individual decision tree, the building block of a random forest, and then saw how we can overcome the high variance of a single decision tree by combining hundreds of them in an ensemble model known as a random forest. The random forest uses the concepts of random sampling of observations, random sampling of features, and averaging predictions.

The key concepts to understand from this article are:

  1. Decision tree: an intuitive model that makes decisions based on a sequence of questions asked about feature values. Has low bias and high variance leading to overfitting the training data.
  2. Gini Impurity: a measure that the decision tree tries to minimize when splitting each node. Represents the probability that a randomly selected sample from a node will be incorrectly classified according to the distribution of samples in the node.
  3. Bootstrapping: sampling random sets of observations with replacement.
  4. Random subsets of features: selecting a random set of the features when considering splits for each node in a decision tree.
  5. Random Forest: ensemble model made of many decision trees using bootstrapping, random subsets of features, and average voting to make predictions. This is an example of a bagging ensemble.
  6. Bias-variance tradeoff: a core issue in machine learning describing the balance between a model with high flexibility (high variance) that learns the training data very well, at the cost of not being able to generalize to new data, and an inflexible model (high bias) that cannot even learn the training data. A random forest reduces the variance of a single decision tree, leading to better predictions on new data.

Hopefully this article has given you the confidence and understanding needed to start using the random forest on your projects. The random forest is a powerful machine learning model, but that should not prevent us from knowing how it works. The more we know about a model, the better equipped we will be to use it effectively and explain how it makes predictions.

Posted in Information Technology

How to initialize a Neural Network

https://towardsdatascience.com/how-to-initialize-a-neural-network-27564cfb5ffc

Training a neural net is far from a straightforward task, as the slightest mistake leads to non-optimal results without any warning. Training depends on many factors and parameters and thus requires a thoughtful approach.

It is known that the beginning of training (i.e., the first few iterations) is very important. When done improperly, you get bad results — sometimes, the network won’t even learn anything at all! For this reason, the way you initialize the weights of the neural network is one of the key factors to good training.

The goal of this article is to explain why initialization matters and to present a number of different ways to implement it efficiently. We will test our approaches against practical examples.
The code uses the fastai library (based on pytorch) and lessons from the last fastai MOOC (which, by the way, is really great!). All experiment notebooks are available in this github repository.

Why is initialization important?

Neural-net training essentially consists in repeating the two following steps:

  • A forward step that consists of a huge number of matrix multiplications between weights and inputs / activations (we call activations the outputs of a layer, which become the inputs of the next layer, i.e., the hidden activations)
  • A backward step that consists of updating the weights of the network in order to minimize the loss function (using the gradients of the parameters)

During the forward step, the activations (and then the gradients) can quickly get really big or really small — this is due to the fact that we repeat a lot of matrix multiplications. More specifically, we might get either:

  • very big activations and hence large gradients that shoot towards infinity
  • very small activations and hence infinitesimal gradients, which may be canceled to zero due to numerical precision

Either of these effects is fatal for training. Below is an example of explosion with randomly initialized weights, on the first forward pass.

In this particular example, the mean and standard deviation are already huge at the 10th layer!

What makes things even trickier is that, in practice, you can still get non-optimal results after long periods of training even while avoiding explosion or vanishing effects. This is illustrated below on a simple convnet (experiments will be detailed in the second part of the article):

Notice that the default PyTorch approach is not the best one, and that random init does not learn a lot (also: this is only a 5-layer network, meaning that a deeper network would not learn anything).

How to initialize your network

Recall that the goal of a good initialization is to:

  • get random weights
  • keep the activations in a good range during the first forward passes (and so for the gradients in the backward passes)

What is a good range in practice? Quantitatively speaking, it means that multiplying the input vector by the weight matrix should produce an output vector (i.e., activations) with a mean near 0 and a standard deviation near 1. Each layer then propagates these statistics to the next, so even in a deep network you will have stable statistics during the first iterations.

We now discuss two approaches to do so.

The math approach: Kaiming init

So let’s picture the issue. If the initialized weights are too big at the beginning of training, then each matrix multiplication will exponentially increase the activations, leading to what we call gradient explosion.
Conversely, if the weights are too small, then each matrix multiplication will decrease the activations until they vanish completely.

So the key here is to scale the weights matrix to get outputs of matrix multiplication with a mean around 0 and a standard deviation of 1.

But then how to define the scale of the weights? Well, since each weight (as well as the input) is independent and distributed according to a normal distribution, we can get help by working out some math.

Two famous papers present good initialization schemes based on this idea: Xavier initialization (Glorot and Bengio) and Kaiming initialization (He et al.).

In practice, the two schemes are quite similar: the “main” difference is that Kaiming initialization takes into account the ReLU activation function following each matrix multiplication.

Nowadays, most neural nets use ReLU (or a similar function like leaky ReLU). Here, we only focus on the Kaiming initialization.

The simplified formula (for standard ReLU) is to scale the random weights (drawn from a standard normal distribution) by:
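
Written out (reconstructed from the description in the text), the scale is the square root of 2 divided by the size of the input vector:

W = \text{randn}(n_{\text{in}}, n_{\text{out}}) \times \sqrt{\frac{2}{n_{\text{in}}}}, \qquad b = 0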

For instance, if we have an input of size 512:
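
In code, that could look like this (a sketch; the notebook uses an equivalent expression):

import math
import torch

# Weight matrix for a 512 -> 512 linear layer, Kaiming-scaled for ReLU
w = torch.randn(512, 512) * math.sqrt(2 / 512)
b = torch.zeros(512)  # biases initialized to zero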

In addition, all bias parameters should be initialized to zeros.

Note that for Leaky ReLU the formula has an additional component, which we do not consider here (we refer the reader to the original paper).

Let’s check how this approach works on our previous example:

Notice that now we get an activation with mean 0.64 and standard deviation 0.87 after initialization. Obviously, this is not perfect (how could it be with random numbers?), but much better than normally-distributed random weights.

After 50 layers, we get a mean of 0.27 and a standard deviation of 0.464, so no more explosion or vanishing effects.

Optional: Quick explanation of Kaiming formula

The math derivations that lead to the magic scaling number of math.sqrt(2 / size of input vector) are provided in the Kaiming paper. In addition, we provide below some useful code, which the reader can skip entirely to proceed to the next section. Note that the code requires an understanding of how to do matrix multiplications and what variance / standard deviation is.

To understand the formula, we can think about what is the variance of the result of a matrix multiplication. In this example, we have a 512 vector multiplied by a 512×512 matrix, with an output of a 512 vector.

So in our case, the variance of the output of a matrix multiplication is around the size of the input vector. And, by definition, the standard deviation is the square root of that.

This is why dividing the weight matrix by the square root of the input vector size (512 in this example) gives us results with a standard deviation of 1.
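
A quick numerical check of that claim (a sketch using PyTorch tensors):

import math
import torch

x = torch.randn(512)          # input vector: mean ~0, std ~1
w = torch.randn(512, 512)     # unscaled random weights

y = w @ x
print(y.std().item())                      # roughly sqrt(512) ~ 22.6

y_scaled = (w / math.sqrt(512)) @ x
print(y_scaled.std().item())               # roughly 1 before the ReLU

print(torch.relu(y_scaled).std().item())   # ReLU removes the negatives, shrinking the std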

But where does the numerator of “2” come from? This is only to take into account the ReLU layer.

As you know, ReLU sets the negative numbers to 0 (it’s only max(0, input)). So, because we have numbers centered around a mean of 0, it basically removes half the variance. This is why we add a numerator of 2.

The downside of the Kaiming init

The Kaiming init works great in practice, so why consider another approach? It turns out that there are some downsides to Kaiming init:

  • The mean after a layer is not 0 but around 0.5. This is because of the ReLU activation function, which removes all the negative numbers, effectively shifting the mean
  • Kaiming init only works with ReLU activation functions. Hence, if you have a more complex architecture (not only matmul → ReLU layers), it won’t be able to keep a standard deviation around 1 on all the layers
  • The standard deviation after a layer is not exactly 1 but only close to 1. In a deep network, this may not be enough to keep the standard deviation close to one all the way through.

The algorithmic approach: LSUV

So what can we do to get a good initialization scheme, without manually customizing the Kaiming init for more complex architectures?

The paper All you need is a good init, from 2015, shows an interesting approach. It is called LSUV (Layer-sequential unit-variance).

The solution is a simple algorithm: first, initialize all the layers with orthogonal initialization. Then take a mini-batch of input and, for each layer in turn, compute the standard deviation of its output and divide the layer's weights by that standard deviation, resetting the output standard deviation to 1. Below is a simplified sketch of the algorithm (the paper presents it as short pseudocode):
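This PyTorch-style sketch uses a hypothetical helper named lsuv_init; it omits the orthogonal pre-initialization and the mean-centering variant mentioned below, and assumes a plain stack of Linear + ReLU layers.

import torch
import torch.nn as nn

def lsuv_init(model, xb, tol=1e-3, max_iters=10):
    # For each Linear layer in order, rescale its weights until the
    # layer's output has a standard deviation close to 1 on the batch xb.
    for layer in [m for m in model.modules() if isinstance(m, nn.Linear)]:
        stats = {}

        def hook(module, inp, out):
            stats["std"] = out.std().item()

        handle = layer.register_forward_hook(hook)
        for _ in range(max_iters):
            with torch.no_grad():
                model(xb)                              # forward pass records this layer's output std
            if abs(stats["std"] - 1.0) < tol:
                break
            layer.weight.data /= stats["std"]          # push the output std toward 1
        handle.remove()
    return model

# Usage sketch on a toy model and a random mini-batch
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(),
                      nn.Linear(512, 512), nn.ReLU())
xb = torch.randn(64, 512)
lsuv_init(model, xb)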

After some testing, I have found that orthogonal initialization gives results similar to (and sometimes worse than) doing a Kaiming init before ReLU.

Jeremy Howard, in the fastai MOOC, shows another implementation, which adds an update to the weights to keep a mean around 0. In my experiments, I also find that keeping the mean around 0 gives better results.

Now let’s compare the results of these two approaches.

Performance of initialization schemes

We will check the performance of the different initialization schemes on two architectures: a “simple” convnet with 5 layers, and a more complex resnet-like architecture.
The task is to do image classification on the imagenette dataset (a subset of 10 classes from the Imagenet dataset).

Simple architecture

This experiment can be found in this notebook. Note that because of randomness, the results could be slightly different each time (but it does not change the order and the big picture).

It uses a simple model, defined as:

# ConvLayer is a Conv2D layer followed by a ReLU
nn.Sequential(
    ConvLayer(3, 32, ks=5),
    ConvLayer(32, 64),
    ConvLayer(64, 128),
    ConvLayer(128, 128),
    nn.AdaptiveAvgPool2d(1),
    Flatten(),
    nn.Linear(128, data.c),
)

Below is a comparison of three initialization schemes: PyTorch's default init (a Kaiming init with some specific parameters), Kaiming init, and LSUV init.

Note that the performance of plain random init is so bad that we removed it from the results that follow.

Activation stats after init
The first question is: what do the activation stats look like after a forward pass on the first iteration? The closer we are to a mean of 0 and a standard deviation of 1, the better.

This figure shows the stats of the activations at each layer, after initialization (before training).

For the standard deviation (right figure), both the LSUV and Kaiming inits are close to one (LSUV is closer). For the PyTorch default, however, the standard deviation is much lower.

For the mean value, though, the Kaiming init fares worse. This is understandable because Kaiming init does not take into account the effect of ReLU on the mean, so the mean ends up around 0.5 rather than 0.

Complex architecture (resnet50)

Now let’s check if we get similar results on a more complex architecture.

The architecture is xresnet-50, as implemented in the fastai library. It has 10x more layers than our previous simple model.

We will check it in 2 steps:

  • without normalization layer: batchnorm will be disabled. Because this layer will modify the stats minibatch-wise, it should decrease the impact of the initialization
  • with normalization layer: batchnorm will be enabled

Step 1: Without batchnorm
This experiment can be found in this notebook.

Without batchnorm, the results for 10 epochs are:

The plot shows that the accuracy (y-axis) is 67% for LSUV, 57% for Kaiming init, and 48% for the PyTorch default. The difference is huge!

Let’s check the activations stats before training:

Let’s zoom to get a better scale:

We see that some layers have stats of 0: this is by design in xresnet50 and is independent of the init scheme. It is a trick from the paper Bag of Tricks for Image Classification with Convolutional Neural Networks (implemented in the fastai library).

We see that for:

  • PyTorch default init: the standard deviation and mean are close to 0. This is not good and points to a vanishing problem.
  • Kaiming init: we get a large mean and standard deviation.
  • LSUV init: we get good stats, not perfect but better than the other schemes.

We also see that the init scheme with the best stats gives much better results over the full training, even after 10 full epochs. This shows the importance of keeping good stats across the layers during the first iteration.

Step 2: with batchnorm layers

This experiment can be found in this notebook.

Because batchnorm is normalizing the output of a layer, we should expect the init schemes to have less impact.

The results show close accuracy for all init schemes, near 88%. Note that at each run the best init scheme may change depending on the random generator.

This shows that batchnorm layers make the network less sensitive to the initialization scheme.

The activations stats before training are the following:

As before, the best scheme seems to be the LSUV init (the only one to keep a mean around 0 as well as a standard deviation close to 1).

But the results show this has no impact on the accuracy, at least for this architecture and this dataset. It confirms one thing though: batchnorm makes the network much less sensitive to the quality of the initialization.

Conclusion

What to remember from this article?

  • The first iterations are very important and can have a lasting impact on the full training.
  • A good initialization scheme should keep the input stats (mean of 0 and standard deviation of 1) on the activations across all the layers of the network (for the first iteration).
  • Batchnorm layers reduce the neural net sensitivity to the initialization scheme.
  • Using Kaiming init + LSUV seems to be a good approach, especially when the network lacks a normalization layer.
  • Other kinds of architecture could have different behaviors regarding initialization.

 

Posted in Information Technology

Data Science: The Power of Visualization

https://towardsdatascience.com/the-power-of-visualization-in-data-science-1995d56e4208

This article focuses on the importance of visualization with data. The amount and complexity of information produced in science, engineering, business, and everyday human activity is increasing at staggering rates. Good visualizations not only present a visual interpretation of data, but do so by improving comprehension, communication, and decision making.

The importance of visualization is a topic taught to almost every data scientist in an entry-level course at university but is mastered by very few individuals. It is often regarded as obvious or unimportant due to its inherently subjective nature. In this article, I hope to dispel some of those thoughts and show you that visualization is incredibly important, not just in the field of data science, but for communicating any form of information.

I will aim to show the reader, through multiple examples, the impact a well-designed visualization can have when communicating an idea or piece of information. In addition, I will discuss best practices for making effective visualizations, how one can go about developing their own, and the resources available for doing so.

I hope you enjoy this visual journey and learn something in the process.


What is Visualization?

The American Heritage Dictionary defines visualization as:

(1) The formation of mental visual images.

In the context of visualization with data, it is necessary to add something to this definition, so that it becomes:

The formation of mental visual images to convey information through graphical representations of data.

If you are pursuing a career in data science, this is one of the most crucial skills that you can master, and it is transferable to virtually any discipline. Let us imagine that you are trying to convince your manager to invest in a company and you present them a spreadsheet full of numbers to explain to them why this is such a good investment opportunity. How would you respond if you were the manager?

If presented in visual form, information is often much easier to digest, especially if it makes use of patterns and structures that humans can interpret intuitively. If you want a quick and easy visualization that requires little to no effort, you can go with something like a pie chart or a bar chart. In terms of developing visualizations, this is often as far as most people go, and often, it is as far as they may need to go for their field of expertise.

Another factor that inhibits our use of visualizations is the amount of data we have available. How do I know if visualization is an appropriate method to communicate a message?

This is a difficult question to answer. One design study recommends that we assess the viability of using visualizations based on the clarity of our task and the location of the information.

Design Study Methodology: Reflections from the Trenches and the Stacks, Michael Sedlmair, Miriah Meyer, and Tamara Munzner. IEEE Trans. Visualization and Computer Graphics, 2012.

If we are in the top right corner of this diagram, it becomes feasible to develop and program interactive visualizations, which is the realm in which data scientists are now entering due to the persistently increasing scale of data resulting from the information explosion.

Information Explosion.

We are now living in a data-driven world, and it is only likely to become more data-driven. This is clear from multiple areas, such as important advances in developing large-scale sensor networks as well as artificial intelligence agents that interact with the world (such as self-driving cars).

In a world where data is sovereign, having the power to develop clear and impactful visualizations is becoming an increasingly necessary skill.


Good and Bad Visualizations

Humans have been creating visualizations for thousands of years, and whilst the drawings of cavemen are slightly less spectacular than what we have nowadays, it is still good to appreciate just how powerful some of the early visualizations were, as well as how impactful they have been on the modern world.

Take Leonardo da Vinci, for example: an Italian polymath who not only sketched early designs for inventions such as flying machines, a helicopter-like aerial screw, and an armored vehicle, but was also incredibly skilled at drawing. His engineering and anatomical drawings, like the ones below, are remarkably realistic and yet simple to understand.

Being skilled at drawing was very necessary for the purpose of visualizations back hundreds of years ago when we did not have computers to draw things for us. Take a moment to admire Galileo’s sketches of the moon during different phases of the lunar calendar.

It is not often that we really stare at ancient drawings of the Moon, so is there really still a need for these types of visualizations in the modern world? And if there is, can we not just leave it to artists, graphic designers, and the like?

The answer to the first question is obviously yes. Even ten or fifteen years ago, learning something like chemistry was considerably harder: even if you could picture molecules in your head, it was still tough to translate between complex scientific terms and your mental picture of what is occurring. Nowadays, one can go on YouTube, type in a few words, and watch a visualization or visual walkthrough of essentially any aspect of chemistry. The same idea applies to essentially any abstract idea in science.

So now we have convinced ourselves that visualizations are pretty useful for conveying information, and can also be used to explain complex ideas in a more interpretable manner.

What are some examples of good visualizations?

I currently live in Boston, so several of the following visualizations are related to the city of Boston. These are just some visualizations that I consider good, and, due to their subjective nature, you may disagree with me.

In Boston, we have an underground subway system called the T. As with any city subway system, there are a bunch of different lines and they go in various directions, and some of the lines take longer than others due to longer distances.

The visualization below captures not only the time taken to reach each stop from the city center, in the form of concentric rings, but also follows the correct direction of each line. Looking at this diagram, it is quick to work out which line to take, which direction it goes, and how long it will take to get there.

This second visualization shows the movement of individuals that are born in Massachusetts over the last century. We see that in 1940, 82% of the people born in Massachusetts were expected to live in Massachusetts. Now fast forward to the modern day, we see that this number has decreased to 64%, and we can get a reasonable idea of where these individuals have migrated.

One of the most famous visualizations ever made is by Charles Joseph Minard; it depicts the journey of Napoleon's army during his Russian campaign of 1812.

The illustration depicts Napoleon's army departing the Polish-Russian border. A thick band illustrates the size of his army at specific geographic points during the advance and retreat. It displays six types of data in two dimensions: the number of Napoleon's troops; the distance traveled; temperature; latitude and longitude; the direction of travel; and location relative to specific dates, all without ever mentioning Napoleon himself. Minard's interest lay with the travails and sacrifices of the soldiers. This type of band graph for illustrating flows was later called a Sankey diagram.

A trend you might have noticed with all of the above visualizations is that they convey multiple types of data in a relatively simple fashion. Doing this in an intelligible way is not an easy task.

Now let us consider the transformation of a poor visualization into one better suited for its purpose. This is easiest to do with subway maps, so I will consider the subway map of London, and we will see why it was changed and how the new design improved upon the original.

This was the original map of the London Underground, dating back to 1927. As you may have noticed, the major problem with this diagram is the large cluster of stations crowded together in the city center, which stems from the fact that the map is plotted based on the geographic locations of the stations. Further out of the city, meanwhile, vast amounts of space on the map are left unused.

In 1933, Harry Beck came up with a new design for the map of the London Underground. Beck decided that passengers on the underground were not concerned with geographical accuracy and were most interested in how to get from one station to another and where to change trains. He took inspiration from electrical diagrams and decided to show each of the lines in an individual color, and to show their connections to other lines as one would on an electrical diagram. The plot maintains the directional information for where each line goes, but the distance information is lost as it was deemed unnecessary by Beck.

As someone who used to live in London and used a modern version of this map to get around with relative ease, I can vouch for its brilliance.

We can see a very similar debate that occurred with the New York subway map. Which of these do you think is better?

Despite the fact we have looked at several subway maps, there is obviously not a clear-cut solution that can be applied to all situations. After all, it depends on what data is most relevant to the audience. Harry Beck decided that passengers did not care about distance or geographical information, only that they knew how to go from station A to station B and what connections to make. Maybe this idea would not work for New Yorkers because they are more concerned with knowing the distance and geographic location than Londoners.


Anscombe’s Quartet

Numbers can be incredibly misleading, as was demonstrated by Francis Anscombe in the form of what is now famously known as Anscombe's Quartet. The quartet is a set of four data samples that have exactly the same mean, variance, correlation, and linear regression line.

Anscombe’s Quartet in tabular form (Anscombe, 1973).

However, it is clear from a visual representation of the results that the distributions of the four sets of data are completely different.

Anscombe’s Quartet in visual form (Anscombe, 1973).

To reinforce this point I have developed six of my own data plots, all of which have the same mean, variance, correlation, and linear regression line. As you can see below, these are, again, completely different.
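To see the same point numerically, here is a small sketch using the copy of Anscombe's quartet that ships with seaborn (this assumes seaborn and pandas are installed; the dataset, x, and y column names come from that bundled copy, not from this article):

import matplotlib.pyplot as plt
import seaborn as sns

# The four classic data sets in long form, with columns: dataset, x, y
df = sns.load_dataset("anscombe")

# Near-identical summary statistics across the four sets...
print(df.groupby("dataset")[["x", "y"]].agg(["mean", "var"]))
print(df.groupby("dataset").apply(lambda g: g["x"].corr(g["y"])))

# ...but very different shapes once plotted
sns.lmplot(data=df, x="x", y="y", col="dataset", col_wrap=2)
plt.show()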

This idea, that visualizations can be used alongside quantitative metrics to make the structure of the data clearer and more meaningful, leads us naturally into the realm of misleading visualizations. Just as numbers can be used to mislead us about the structure of our data, as we have seen from Anscombe's Quartet, it also works the other way around: visualizations can be cleverly crafted to distort the underlying structure of the data. As we shall see, this is a very common occurrence, especially in areas prone to heated debate, such as politics and contested science.


Misleading Visualizations

There are numerous examples of people using statistics to mislead individuals. Indeed, this is an extremely common tactic used in politics. Possibly one of the most famous examples of this was related to tax cuts proposed by President George Bush, in which a 5% increase was made to look much larger by distorting the axis of a bar chart.

(Left) The visualization shown to viewers, and (right) a less deceptive visualization.

These deceptive tactics often involve distortion of the measurement axes, as in the above example. Here is another example of axis distortion related to job losses (ironically, also related to U.S. politics).

In reality, the plot should look like this.
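To illustrate the effect with made-up numbers (a 5% difference, not the original chart's data), compare the same two bars drawn with a truncated and a zero-based y-axis:

import matplotlib.pyplot as plt

labels = ["Before", "After"]            # hypothetical categories
values = [40.0, 42.0]                   # hypothetical values, a 5% increase

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

ax1.bar(labels, values)
ax1.set_ylim(39.5, 42.5)                # truncated axis: the increase looks enormous
ax1.set_title("Deceptive (truncated axis)")

ax2.bar(labels, values)
ax2.set_ylim(0, 45)                     # zero-based axis: the increase looks modest
ax2.set_title("Honest (zero-based axis)")

plt.tight_layout()
plt.show()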

Another way that people are deceived through the use of visualizations is through the omission of data.

I am an environmental scientist myself, so I am only too aware of the deception people can spread with carefully crafted visualizations. Here is a prime example of one purporting to show that global warming is a myth (which, just to clarify, it is not; the scientific consensus has not been seriously contested since the 1990s).

A less deceptive diagram is shown below.

So far, we have discussed visualizations that are actively designed to deceive us. What about when it is done accidentally?

Beware of Rainbows

Rainbow color maps are probably the most annoying visualizations I come across on a daily basis. As an environmental scientist, I see them pretty much everywhere. Not only are they problematic because colorblind people (such as myself) can have trouble differentiating many of the colors, but mapping quantitative values onto an arbitrary sequence of hues makes little sense.

The rainbow colormap is perceptually non-linear. Who decided that blue represents a quantitatively lower value than yellow or red? When do the transitions occur and how sudden are they?

Rainbow color maps.

The best way to tackle this is to stick to just two colors and use a linear change in color to represent quantitative values. In this sense, the plot can be colored but the quantitative nature of the plot is described by the brightness of the color, with darker regions typically indicating higher values. This idea is illustrated below.

This essentially makes them the same as a heat plot or a choropleth map like the one below.

Smoky Mountains

A nice example of this is shown for U.S. votes following the 2016 presidential election.
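As a small sketch of the difference (on synthetic data, purely for illustration), compare a rainbow colormap with a perceptually uniform sequential one such as matplotlib's viridis, where brightness tracks the values:

import matplotlib.pyplot as plt
import numpy as np

# A smooth synthetic 2D field
x, y = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
z = np.exp(-(x**2 + y**2) / 4)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))

im1 = ax1.imshow(z, cmap="jet")         # rainbow: hue jumps suggest boundaries that are not in the data
ax1.set_title("Rainbow (jet)")
fig.colorbar(im1, ax=ax1)

im2 = ax2.imshow(z, cmap="viridis")     # perceptually uniform: brightness increases with the value
ax2.set_title("Perceptually uniform (viridis)")
fig.colorbar(im2, ax=ax2)

plt.tight_layout()
plt.show()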

Color Blindness

It is always good to keep in mind the fact that a reasonable amount of people are colorblind and to avoid using color combinations that can be problematic for these individuals. Take the following visualization as an example.

The colors used in this diagram are a terrible combination for someone with red-green color blindness (whether deuteranopia or protanopia). At the very least, it is best to avoid combinations of red and green, since this type of color blindness is the most prevalent.

If you are interested in learning more about color blindness, there is a nifty website for simulating color blindness.

Edge Bundling

The idea behind edge bundling is essentially to wrap an elastic band around all of the paths that follow the same route from one node to another. It is used on network diagrams and has the advantage of making the visualization look less like a cluttered hairball and more visually pleasing, as we can see below.

The downside of this method is that you cannot follow the exact links or paths after bundling, which means that the visualization no longer faithfully reflects the underlying data (we forfeit some graphical integrity).


Overview of the Visualization Process

Now that we have looked at a bunch of visualizations and understand the difference between a good and bad visualization, it is a good time to discuss what actually makes a good visualization.

Visualization Goals

Essentially, there are three goals to visualization:

  • Data Exploration — find the unknown
  • Data Analysis — check hypotheses
  • Presentation — communicate and disseminate

That is essentially it. However, these terms are pretty vague, and it is thus quite easy to understand why it is so difficult for individuals to master the art of communicating through visualizations. It is, therefore, useful to have a model to follow to help us meet these goals.

The Five-Step Model

Visualization is often described by the following five-step model, a process that follows a fairly logical progression.

Firstly, one is required to isolate a specific target or question that is to be the subject of evaluation.

This is followed by data wrangling, which is 90% of what data scientists do when working with data. This step involves getting the data into a workable format and performing exploratory data analysis to understand the data set, which may involve various ways of summarizing or plotting the data.

The third stage is the design stage, which involves developing the story you want to tell with the data. This links closely back to the target we defined. What is the message we are trying to communicate? This will also likely depend on who your audience is, as well as on the objectivity of the analysis; for example, a politician is likely to want to present an exaggerated view of the data in order to make their opponent look bad.

The fourth step involves the implementation of the visualization, such as via programming of interactive web-based visualizations using D3. This is the part of the process that involves some coding, whereas the design stage involves thinking, drawing, ideation, and so on.

The fifth stage is essentially a review stage: you look at your implementation and decide whether it sends the message you want to communicate, or answers the question you set out to answer.

In reality, this is a non-linear process, although it is often presented as one. Here is a somewhat more realistic form of this model.

It seems simple, right? Well, there are actually a lot of ways you can screw this up, often without realizing it. Here are the four most common issues:

Domain situation — Did you correctly understand the users’ needs? Perhaps the wrong problem is being addressed. This is a problem associated with the target phase.

Data/task abstraction — Are you showing them the right thing? Perhaps the wrong abstraction is being used. This is also a problem associated with the target phase.

Visual encoding/interaction — Does the way you are showing the data work? Perhaps the wrong idiom or encoding is being used. This is a problem associated with the design phase.

Algorithm — Does your code break? Is your code too slow? Is it scalable? This is a problem with the implementation phase. Perhaps the wrong algorithm is being used.

It might be obvious when your code is breaking, but how do you assess the more subjective problems above, such as the domain situation or the visual encoding used? We can turn to evaluation metrics.

We can rely on both qualitative and quantitative metrics. Qualitative metrics are often the most useful for visualizations, since visualizations are developed to communicate information to people. Some examples of such metrics are:

  • Observational Studies (“Think Aloud”)
  • Expert Interviews (aka Design Critiques)
  • Focus Groups

The idea of these qualitative procedures is that individuals should be able to see the visualization and understand the message you are trying to convey without any additional information. These types of studies and metrics are commonly used in areas such as marketing and web design because they provide insight into how individuals will interpret and respond to their ideas or designs.


Rules of Thumb

Edward Tufte is a pioneer in the field of developing effective visualizations and has written multiple books on the topic (I will reference these at the end of the article).

Here are three of his rules for effective visualization:

  • Graphical integrity
  • Maximize data-ink ratio
  • Avoid chart junk

Graphical Integrity

We have already discussed this to some extent when discussing misleading visualizations. In general, it is bad practice, and somewhat harmful to society, to try to mislead people with statistics.

Maximize Data-Ink Ratio

This rule of thumb is about clarity and minimalism. In general, 3D plots tend to be less clear and can be misleading in some cases. Examine the differences between the two charts below and decide which you think is better.

Avoid Chart Junk

Extraneous visual elements distract people from the message being conveyed.


Interactive Visualizations

This topic can get incredibly complicated, so I will leave the discussion of developing visualizations for future articles. Unfortunately, D3 visualizations cannot be run within Medium, so you will have to visit the links to see the visualizations in action.

Here are a few of my favorite visualizations to whet the reader’s appetite.

Places in the Game of Thrones

Location names discussed in the Game of Thrones saga. Source

Gun Deaths in the U.S.

Gun deaths in the United States. Source

Road Safety in the U.K.

This visualization is built on deck.gl and is incredibly fun to play around with; there are multiple other interactive visualizations on this website that I recommend checking out.

Road Safety in the U.K. Source

Roads to State Capitals

This visualization is an interactive and color-coded map of the United States and all of the roads that lead to the capitals of each state.

A similar visualization is also available from the same website showing the roads to Rome.

United States Trade Deficit

This beautiful visualization shows the trade deficit of the United States from 2001 to 2013.

Linked Jazz Network Graph

This interactive graph shows some of the famous names in jazz and how they influenced other artists.

For more visualizations, check out the d3 gallery on GitHub.