Posted in Information Technology

SaaS Growth Trends in 2019

Gartner predicts strong growth for SaaS technologies, with revenue reaching $85 billion by the end of 2019. That figure represents a 17.8 percent increase over the previous year and accounts for a majority of the public cloud revenues forecast to reach $278 billion by 2021. The overall growth of the SaaS industry will remain consistent through these years as more companies adopt SaaS solutions for a variety of business functions, no longer limited to the core engineering and sales applications seen during the early years of Salesforce's popularity. The SaaS cloud application services market segment will reach $113.1 billion in 2021, nearly double the 2017 revenue of $58.8 billion.

Source: Gartner via ZDNet

The SaaS market generates more revenue than Infrastructure as a Service (IaaS) and Platform as a Service (PaaS) combined. Gartner predicts that the IaaS market will grow from $23.6 billion in 2017 to $63 billion in 2021, and the PaaS market from $11.9 billion in 2017 to $27.7 billion in 2021.

Competition in the SaaS industry has increased rapidly. Research suggests that SaaS firms starting around 2012 faced fewer than three competitors on average. By the end of 2017, every SaaS startup faced competition from nine other firms in the same market segment. In SaaS marketing solutions alone, the number of products increased from 500 to 8,500 between 2007 and 2017.

Source: PriceIntelligently by ProfitWell

Economic stability and investor interest in scalable cloud solutions have encouraged entrepreneurs, innovators and enterprises to develop new SaaS solutions. In 2018 alone, 169 SaaS acquisitions took place at an average price of $1.3 billion, an average pulled upward by the two largest deals of the year: $8 billion for Qualtrics International Inc. and $7.5 billion for GitHub. These figures position SaaS companies among the most expensive business entities in the industry. IPO valuations have also increased 3.5-fold since 2017, with SaaS IPOs collectively reaching $38.2 billion in 2018. These numbers are expected to keep growing at similar rates in the coming years as organizations increase public cloud spending, in line with Gartner's predictions for the near future.

Source: OpenView Venture Partners

Customers are increasingly adopting subscription-based pricing to satisfy growing IT needs within limited IT budgets. Established enterprises are responding by embracing the ‘as-a-Service’ business model to satisfy these needs. The result is a business environment that facilitates healthy competition among SaaS vendors while market demand continues to grow rapidly.

The number and variety of SaaS users have increased rapidly in recent years. It's not just small firms but also large enterprises leveraging SaaS technologies to power their business. Another study finds that organizations with over 250 employees use over 100 SaaS apps, whereas small firms of up to 50 employees use between 25 and 50 SaaS solutions on average. The rate of SaaS usage growth, however, is consistent across organizations of all sizes. The growth is driven both by internal requirements attributed to tooling-focused product development methodologies such as Agile and DevOps, and by the growing availability of useful SaaS products in the enterprise IT market segment.

Source: Blissfully Tech Inc

The proportionality of SaaS adoption to workforce size is attributed to several factors. Small organizations tend to work on a limited set of projects and naturally require a limited set of products to cover the necessary IT functionality. As an organization grows and the number of teams increases, users working on different projects may have their own requirements for SaaS tools. To avoid the issues resulting from Shadow IT – the adoption of unapproved software at the workplace – including those associated with cost and security, large organizations make it easy to provision as many SaaS resources as necessary. Additionally, the complexity of large-scale projects at the enterprise level means that no single SaaS solution delivers all necessary functionality. Users who rely on multiple SaaS solutions to address their technology requirements may therefore adopt several products designed for the same target audience and application use cases.

Source: Blissfully Tech Inc

The SaaS market is dominated by Microsoft, followed by Salesforce, Adobe and Oracle, according to a recent report by the Synergy Research Group. Microsoft also leads in annual growth rate at 45 percent, followed by Oracle at 43 percent and SAP at 36 percent. Although these enterprises are driving SaaS market growth, a significant proportion of their revenue still comes from selling on-premise software. With the prevalence of subscription-based pricing, companies such as Microsoft are rapidly migrating their customers toward the SaaS consumption model. While the on-premise software deployment model reached maturity several years ago, less than 15 percent of software budgets is spent on SaaS products. This means strong growth potential for SaaS in the coming years as its Total Cost of Ownership matches that of on-premise deployment models. Organizations dominating the enterprise software space, including IBM, Oracle, Microsoft and SAP, will likely maintain their market share for enterprise software products, as a growing number of customers can take advantage of the same product capabilities under the more feasible subscription-based pricing model.

Source: Synergy Research Group

From a customer perspective, SaaS products deliver higher strategic value than on-premise software deployments. Software deployment time has dropped from weeks or days to a few minutes with the SaaS model. The wealth of SaaS solutions available in the enterprise software market means that users have a diverse set of resources readily available to address varied demands. As a result, organizations are experiencing higher levels of employee engagement with feature-rich SaaS solutions designed for improved customer experience. SaaS vendors are also able to push feature improvements, bug fixes and security updates on the fly. With on-premise deployments, such changes previously had to pass through several layers of organizational protocol and governance before reaching end-users. SaaS technologies have made it easier for enterprises and software vendors to deliver the necessary features and functionality to end-users, ultimately contributing to the popularity of SaaS solutions over on-premise software products.


Posted in Information Technology

Jupyter Notebook for Beginners: A Tutorial

The Jupyter Notebook is an incredibly powerful tool for interactively developing and presenting data science projects. A notebook integrates code and its output into a single document that combines visualisations, narrative text, mathematical equations, and other rich media. The intuitive workflow promotes iterative and rapid development, making notebooks an increasingly popular choice at the heart of contemporary data science, analysis, and increasingly science at large. Best of all, as part of the open source Project Jupyter, they are completely free.

The Jupyter project is the successor to the earlier IPython Notebook, which was first published as a prototype in 2010. Although it is possible to use many different programming languages within Jupyter Notebooks, this article will focus on Python as it is the most common use case.

To get the most out of this tutorial you should be familiar with programming, specifically Python and pandas. That said, if you have experience with another language, the Python in this article shouldn’t be too cryptic, and pandas should be interpretable. Jupyter Notebooks can also act as a flexible platform for getting to grips with pandas and even Python, as will become apparent in this article.

We will:

  • Cover the basics of installing Jupyter and creating your first notebook
  • Delve deeper and learn all the important terminology
  • Explore how easily notebooks can be shared and published online. Indeed, this article is a Jupyter Notebook! Everything here was written in the Jupyter Notebook environment and you are viewing it in a read-only form.

Example data analysis in a Jupyter Notebook

We will walk through a sample analysis that answers a real-life question, so you can see how the flow of a notebook makes the task intuitive for us to work through, and easy for others to understand when we share it with them.

So, let’s say you’re a data analyst and you’ve been tasked with finding out how the profits of the largest companies in the US changed historically. You find a data set of Fortune 500 companies spanning over 50 years since the list’s first publication in 1955, put together from Fortune’s public archive. We’ve gone ahead and created a CSV of the data you can use here.

As we shall demonstrate, Jupyter Notebooks are perfectly suited for this investigation. First, let’s go ahead and install Jupyter.


The easiest way for a beginner to get started with Jupyter Notebooks is by installing Anaconda. Anaconda is the most widely used Python distribution for data science and comes pre-loaded with all the most popular libraries and tools. As well as Jupyter, some of the biggest Python libraries wrapped up in Anaconda include NumPy, pandas and Matplotlib, though the full 1,000+ list is extensive. This lets you hit the ground running in your own fully stocked data science workshop without the hassle of managing countless installations or worrying about dependencies and OS-specific (read: Windows-specific) installation issues.

To get Anaconda, simply:

  1. Download the latest version of Anaconda for Python 3 (ignore Python 2.7).
  2. Install Anaconda by following the instructions on the download page and/or in the executable.

If you are a more advanced user with Python already installed and prefer to manage your packages manually, you can just use pip:

pip3 install jupyter

Creating Your First Notebook

In this section, we’re going to see how to run and save notebooks, familiarise ourselves with their structure, and understand the interface. We’ll become intimate with some core terminology that will steer you towards a practical understanding of how to use Jupyter Notebooks by yourself and set us up for the next section, which steps through an example data analysis and brings everything we learn here to life.

Running Jupyter

On Windows, you can run Jupyter via the shortcut Anaconda adds to your start menu, which will open a new tab in your default web browser that should look something like the following screenshot.

Jupyter control panel

This isn’t a notebook just yet, but don’t panic! There’s not much to it. This is the Notebook Dashboard, specifically designed for managing your Jupyter Notebooks. Think of it as the launchpad for exploring, editing and creating your notebooks.

Be aware that the dashboard will give you access only to the files and sub-folders contained within Jupyter’s start-up directory; however, the start-up directory can be changed. It is also possible to start the dashboard on any system via the command prompt (or terminal on Unix systems) by entering the command jupyter notebook; in this case, the current working directory will be the start-up directory.

The astute reader may have noticed that the URL for the dashboard is something like http://localhost:8888/tree. Localhost is not a website, but indicates that the content is being served from your local machine: your own computer. Jupyter’s Notebooks and dashboard are web apps, and Jupyter starts up a local Python server to serve these apps to your web browser, making it essentially platform independent and opening the door to easier sharing on the web.

The dashboard’s interface is mostly self-explanatory — though we will come back to it briefly later. So what are we waiting for? Browse to the folder in which you would like to create your first notebook, click the “New” drop-down button in the top-right and select “Python 3” (or the version of your choice).

New notebook menu

Hey presto, here we are! Your first Jupyter Notebook will open in a new tab — each notebook uses its own tab because you can open multiple notebooks simultaneously. If you switch back to the dashboard, you will see the new file Untitled.ipynb and some green text telling you that your notebook is running.

What is an ipynb File?

It will be useful to understand what this file really is. Each .ipynb file is a text file that describes the contents of your notebook in a format called JSON. Each cell and its contents, including image attachments that have been converted into strings of text, is listed therein along with some metadata. You can edit this yourself — if you know what you are doing! — by selecting “Edit > Edit Notebook Metadata” from the menu bar in the notebook.
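To make this concrete, here is a minimal sketch of the JSON structure inside a .ipynb file, using only Python's standard library. The field names follow the nbformat schema, but the cell contents here are invented for illustration:

```python
import json

# A minimal notebook document as stored in a .ipynb file.
# Field names follow the nbformat 4 schema; the cell is a toy example.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["print('Hello World!')"],
            "outputs": [],
        }
    ],
}

# Serialise and parse it back, which is roughly what Jupyter does
# every time it saves or loads a notebook.
text = json.dumps(notebook, indent=1)
loaded = json.loads(text)
print(loaded["cells"][0]["cell_type"])  # -> code
```

Because the whole notebook is just JSON, it is plain text all the way down, which is also why image attachments must be converted into strings before they can be stored.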

You can also view the contents of your notebook files by selecting “Edit” from the controls on the dashboard, but the keyword here is “can”; there’s no reason other than curiosity to do so unless you really know what you are doing.

The notebook interface

Now that you have an open notebook in front of you, its interface will hopefully not look entirely alien; after all, Jupyter is essentially just an advanced word processor. Why not take a look around? Check out the menus to get a feel for it, especially take a few moments to scroll down the list of commands in the command palette, which is the small button with the keyboard icon (or Ctrl + Shift + P).

New Jupyter Notebook

There are two fairly prominent terms that you should notice, which are probably new to you: cells and kernels are key both to understanding Jupyter and to what makes it more than just a word processor. Fortunately, these concepts are not difficult to understand.

  • A kernel is a “computational engine” that executes the code contained in a notebook document.
  • A cell is a container for text to be displayed in the notebook or code to be executed by the notebook’s kernel.


We’ll return to kernels a little later, but first let’s come to grips with cells. Cells form the body of a notebook. In the screenshot of a new notebook in the section above, that box with the green outline is an empty cell. There are two main cell types that we will cover:

  • A code cell contains code to be executed in the kernel and displays its output below.
  • A Markdown cell contains text formatted using Markdown and displays its output in place when it is run.

The first cell in a new notebook is always a code cell. Let’s test it out with a classic hello world example. Type print('Hello World!') into the cell and click the run button in the toolbar above, or press Ctrl + Enter. The result should look like this:

print('Hello World!')
Hello World!

When you ran the cell, its output will have been displayed below and the label to its left will have changed from In [ ] to In [1]. The output of a code cell also forms part of the document, which is why you can see it in this article. You can always tell the difference between code and Markdown cells because code cells have that label on the left and Markdown cells do not. The “In” part of the label is simply short for “Input,” while the label number indicates when the cell was executed on the kernel — in this case the cell was executed first. Run the cell again and the label will change to In [2] because now the cell was the second to be run on the kernel. It will become clearer why this is so useful later on when we take a closer look at kernels.

From the menu bar, click Insert and select Insert Cell Below to create a new code cell underneath your first and try out the following code to see what happens. Do you notice anything different?

import time
time.sleep(3)

This cell doesn’t produce any output, but it does take three seconds to execute. Notice how Jupyter signifies that the cell is currently running by changing its label to In [*].

In general, the output of a cell comes from any text data specifically printed during the cell’s execution, as well as the value of the last line in the cell, be it a lone variable, a function call, or something else. For example:

def say_hello(recipient):
    return 'Hello, {}!'.format(recipient)

say_hello('Tim')
'Hello, Tim!'

You’ll find yourself using this almost constantly in your own projects, and we’ll see more of it later on.

Keyboard shortcuts

One final thing you may have observed when running your cells is that their border turned blue, whereas it was green while you were editing. There is always one “active” cell highlighted with a border whose colour denotes its current mode, where green means “edit mode” and blue is “command mode.”

So far we have seen how to run a cell with Ctrl + Enter, but there are plenty more. Keyboard shortcuts are a very popular aspect of the Jupyter environment because they facilitate a speedy cell-based workflow. Many of these are actions you can carry out on the active cell when it’s in command mode.

Below, you’ll find a list of some of Jupyter’s keyboard shortcuts. You’re not expected to pick them up immediately, but the list should give you a good idea of what’s possible.

  • Toggle between edit and command mode with Esc and Enter, respectively.
  • Once in command mode:
    • Scroll up and down your cells with your Up and Down keys.
    • Press A or B to insert a new cell above or below the active cell.
    • M will transform the active cell to a Markdown cell.
    • Y will set the active cell to a code cell.
    • D + D (D twice) will delete the active cell.
    • Z will undo cell deletion.
    • Hold Shift and press Up or Down to select multiple cells at once.
      • With multiple cells selected, Shift + M will merge your selection.
  • Ctrl + Shift + -, in edit mode, will split the active cell at the cursor.
  • You can also click and Shift + Click in the margin to the left of your cells to select them.

Go ahead and try these out in your own notebook. Once you’ve had a play, create a new Markdown cell and we’ll learn how to format the text in our notebooks.


Markdown is a lightweight, easy to learn markup language for formatting plain text. Its syntax has a one-to-one correspondence with HTML tags, so some prior knowledge here would be helpful but is definitely not a prerequisite. Remember that this article was written in a Jupyter notebook, so all of the narrative text and images you have seen so far were achieved in Markdown. Let’s cover the basics with a quick example.

# This is a level 1 heading

## This is a level 2 heading

This is some plain text that forms a paragraph.
Add emphasis via **bold** and __bold__, or *italic* and _italic_.

Paragraphs must be separated by an empty line.

* Sometimes we want to include lists.
  * Which can be indented.

1. Lists can also be numbered.
2. For ordered lists.

[It is possible to include hyperlinks](https://www.example.com)

Inline code uses single backticks: `foo()`, and code blocks use triple backticks:
```
bar()
```
Or can be indented by 4 spaces:

    foo()

And finally, adding images is easy: ![Alt text](https://www.example.com/image.jpg)

When attaching images, you have three options:

  • Use a URL to an image on the web.
  • Use a local URL to an image that you will be keeping alongside your notebook, such as in the same git repo.
  • Add an attachment via “Edit > Insert Image”; this will convert the image into a string and store it inside your notebook .ipynb file.
  • Note that this will make your .ipynb file much larger!

There is plenty more detail to Markdown, especially around hyperlinking, and it’s also possible to simply include plain HTML. Once you find yourself pushing the limits of the basics above, you can refer to the official guide from the creator, John Gruber, on his website.


Behind every notebook runs a kernel. When you run a code cell, that code is executed within the kernel and any output is returned back to the cell to be displayed. The kernel’s state persists over time and between cells — it pertains to the document as a whole and not individual cells.

For example, if you import libraries or declare variables in one cell, they will be available in another. In this way, you can think of a notebook document as being somewhat comparable to a script file, except that it is multimedia. Let’s try this out to get a feel for it. First, we’ll import a Python package and define a function.

import numpy as np

def square(x):
    return x * x

Once we’ve executed the cell above, we can reference np and square in any other cell.

x = np.random.randint(1, 10)
y = square(x)
print('%d squared is %d' % (x, y))
1 squared is 1

This will work regardless of the order of the cells in your notebook. You can try it yourself; let’s print out our variables again.

print('Is %d squared is %d?' % (x, y))
Is 1 squared is 1?

No surprises here! But now let’s change y.

y = 10

What do you think will happen if we run the cell containing our print statement again? We will get the output Is 1 squared is 10?!

Most of the time, the flow in your notebook will be top-to-bottom, but it’s common to go back to make changes. In this case, the order of execution stated to the left of each cell, such as In [6], will let you know whether any of your cells have stale output. And if you ever wish to reset things, there are several incredibly useful options from the Kernel menu:

  • Restart: restarts the kernel, thus clearing all the variables etc that were defined.
  • Restart & Clear Output: same as above but will also wipe the output displayed below your code cells.
  • Restart & Run All: same as above but will also run all your cells in order from first to last.

If your kernel is ever stuck on a computation and you wish to stop it, you can choose the Interrupt option.

Choosing a kernel

You may have noticed that Jupyter gives you the option to change kernel, and in fact there are many different options to choose from. Back when you created a new notebook from the dashboard by selecting a Python version, you were actually choosing which kernel to use.

Not only are there kernels for different versions of Python, but also for over 100 other languages, including Java, C, and even Fortran. Data scientists may be particularly interested in the kernels for R and Julia, as well as both imatlab and the Calysto MATLAB Kernel for MATLAB. The SoS kernel provides multi-language support within a single notebook. Each kernel has its own installation instructions, but will likely require you to run some commands on your computer.

Example analysis

Now we’ve looked at what a Jupyter Notebook is, it’s time to look at how they’re used in practice, which should give you a clearer understanding of why they are so popular. It’s finally time to get started with that Fortune 500 data set mentioned earlier. Remember, our goal is to find out how the profits of the largest companies in the US changed historically.

It’s worth noting that everyone will develop their own preferences and style, but the general principles still apply, and you can follow along with this section in your own notebook if you wish, which gives you the scope to play around.

Naming your notebooks

Before you start writing your project, you’ll probably want to give it a meaningful name. Perhaps somewhat confusingly, you cannot name or rename your notebooks from the notebook app itself, but must use either the dashboard or your file browser to rename the .ipynb file. We’ll head back to the dashboard to rename the file you created earlier, which will have the default notebook file name Untitled.ipynb.

You cannot rename a notebook while it is running, so you’ve first got to shut it down. The easiest way to do this is to select “File > Close and Halt” from the notebook menu. However, you can also shut down the kernel either by going to “Kernel > Shutdown” from within the notebook app or by selecting the notebook in the dashboard and clicking “Shutdown” (see image below).

A running notebook

You can then select your notebook and click “Rename” in the dashboard controls.

A running notebook

Note that closing the notebook tab in your browser will not “close” your notebook in the way closing a document in a traditional application will. The notebook’s kernel will continue to run in the background and needs to be shut down before it is truly “closed” — though this is pretty handy if you accidentally close your tab or browser! If the kernel is shut down, you can close the tab without worrying about whether it is still running or not.

Once you’ve named your notebook, open it back up and we’ll get going.


It’s common to start off with a code cell specifically for imports and setup, so that if you choose to add or change anything, you can simply edit and re-run the cell without causing any side-effects.

%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

We import pandas to work with our data, Matplotlib to plot charts, and Seaborn to make our charts prettier. It’s also common to import NumPy but in this case, although we use it via pandas, we don’t need to explicitly. And that first line isn’t a Python command, but uses something called a line magic to instruct Jupyter to capture Matplotlib plots and render them in the cell output; this is one of a range of advanced features that are out of the scope of this article.

Let’s go ahead and load our data.

df = pd.read_csv('fortune500.csv')

It’s sensible to also do this in a single cell in case we need to reload it at any point.
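If you are curious what read_csv is doing under the hood, here is a rough stdlib-only sketch that parses two rows in the same shape as fortune500.csv (the sample values are taken from the data preview shown later in this article); pandas does this far faster and with type inference, but the idea is the same:

```python
import csv
import io

# Two rows shaped like fortune500.csv.
raw = """year,rank,company,revenue (in millions),profit (in millions)
1955,1,General Motors,9823.5,806
1955,2,Exxon Mobil,5661.4,584.8
"""

# csv.DictReader maps each row to a dict keyed by the header line,
# analogous to the columns of the DataFrame that pd.read_csv builds.
rows = list(csv.DictReader(io.StringIO(raw)))
print(len(rows))           # -> 2
print(rows[0]['company'])  # -> General Motors
```

Note that the stdlib reader leaves every field as a string, which foreshadows the dtype issue we run into below.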

Save and Checkpoint

Now that we’ve got started, it’s best practice to save regularly. Pressing Ctrl + S will save your notebook by calling the “Save and Checkpoint” command, but what is this checkpoint thing?

Every time you create a new notebook, a checkpoint file is created as well as your notebook file; it will be located within a hidden subdirectory of your save location called .ipynb_checkpoints and is also a .ipynb file. By default, Jupyter will autosave your notebook every 120 seconds to this checkpoint file without altering your primary notebook file. When you “Save and Checkpoint,” both the notebook and checkpoint files are updated. Hence, the checkpoint enables you to recover your unsaved work in the event of an unexpected issue. You can revert to the checkpoint from the menu via “File > Revert to Checkpoint.”
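As a concrete illustration, the classic Notebook server stores the checkpoint next to the notebook under the hidden folder, with a "-checkpoint" suffix. A small pathlib sketch (the notebook path here is hypothetical):

```python
from pathlib import Path

notebook = Path('projects/fortune500.ipynb')  # hypothetical location

# The checkpoint copy lives in a hidden sibling directory and carries
# a "-checkpoint" suffix before the extension.
checkpoint = (notebook.parent / '.ipynb_checkpoints'
              / (notebook.stem + '-checkpoint' + notebook.suffix))
print(checkpoint)  # -> projects/.ipynb_checkpoints/fortune500-checkpoint.ipynb
```

Knowing this layout is handy if you ever need to recover work manually from the file system rather than through the “Revert to Checkpoint” menu.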

Investigating our data set

Now we’re really rolling! Our notebook is safely saved and we’ve loaded our data set df into the most-used pandas data structure, which is called a DataFrame and basically looks like a table. What does ours look like?

df.head()

year rank company revenue (in millions) profit (in millions)
0 1955 1 General Motors 9823.5 806
1 1955 2 Exxon Mobil 5661.4 584.8
2 1955 3 U.S. Steel 3250.4 195.4
3 1955 4 General Electric 2959.1 212.6
4 1955 5 Esmark 2510.8 19.1

df.tail()

year rank company revenue (in millions) profit (in millions)
25495 2005 496 Wm. Wrigley Jr. 3648.6 493
25496 2005 497 Peabody Energy 3631.6 175.4
25497 2005 498 Wendy’s International 3630.4 57.8
25498 2005 499 Kindred Healthcare 3616.6 70.6
25499 2005 500 Cincinnati Financial 3614.0 584

Looking good. We have the columns we need, and each row corresponds to a single company in a single year.

Let’s just rename those columns so we can refer to them later.

df.columns = ['year', 'rank', 'company', 'revenue', 'profit']

Next, we need to explore our data set. Is it complete? Did pandas read it as expected? Are any values missing?
len(df)
25500
Okay, that looks good — that’s 500 rows for every year from 1955 to 2005, inclusive.

Let’s check whether our data set has been imported as we would expect. A simple check is to see if the data types (or dtypes) have been correctly interpreted.

df.dtypes

year int64
rank int64
company object
revenue float64
profit object
dtype: object

Uh oh. It looks like there’s something wrong with the profit column — we would expect it to be a float64 like the revenue column. This indicates that it probably contains some non-numeric values, so let’s take a look.

non_numberic_profits = df.profit.str.contains('[^0-9.-]')
df.loc[non_numberic_profits].head()

year rank company revenue profit
228 1955 229 Norton 135.0 N.A.
290 1955 291 Schlitz Brewing 100.0 N.A.
294 1955 295 Pacific Vegetable Oil 97.9 N.A.
296 1955 297 Liebmann Breweries 96.0 N.A.
352 1955 353 Minneapolis-Moline 77.4 N.A.
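The regular expression passed to .str.contains simply flags any value containing a character that isn't a digit, a dot, or a minus sign. A pure-Python sketch of the same test using the standard library:

```python
import re

# Matches any character that is NOT a digit, a dot, or a minus sign,
# the same pattern handed to df.profit.str.contains above.
non_numeric = re.compile(r'[^0-9.-]')

print(bool(non_numeric.search('806')))    # -> False (purely numeric)
print(bool(non_numeric.search('584.8')))  # -> False
print(bool(non_numeric.search('N.A.')))   # -> True (contains letters)
```

Any row for which this test is True contains something other than a plain number, which is exactly what forced pandas to fall back to the object dtype.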

Just as we suspected! Some of the values are strings, which have been used to indicate missing data. Are there any other values that have crept in?

set(df.profit[non_numberic_profits])
{'N.A.'}
That makes it easy to interpret, but what should we do? Well, that depends how many values are missing.

len(df.profit[non_numberic_profits])
369
It’s a small fraction of our data set, though not completely inconsequential as it is still around 1.5%. If rows containing N.A. are, roughly, uniformly distributed over the years, the easiest solution would just be to remove them. So let’s have a quick look at the distribution.

bin_sizes, _, _ = plt.hist(df.year[non_numberic_profits], bins=range(1955, 2006))

Missing value distribution

At a glance, we can see that the largest number of invalid values in a single year is fewer than 25, and as there are 500 data points per year, removing these values would account for less than 4% of the data for the worst years. Indeed, other than a surge around the 90s, most years have fewer than half the missing values of the peak. For our purposes, let’s say this is acceptable, so let’s go ahead and remove these rows.

df = df.loc[~non_numberic_profits]
df.profit = df.profit.apply(pd.to_numeric)

We should check that worked.

df.dtypes

year int64
rank int64
company object
revenue float64
profit float64
dtype: object

Great! We have finished our data set setup.

If you were going to present your notebook as a report, you could get rid of the investigatory cells we created, which are included here as a demonstration of the flow of working with notebooks, and merge relevant cells (see the Advanced Functionality section below for more on this) to create a single data set setup cell. This would mean that if we ever mess up our data set elsewhere, we can just rerun the setup cell to restore it.

Plotting with matplotlib

Next, we can get to addressing the question at hand by plotting the average profit by year. We might as well plot the revenue as well, so first we can define some variables and a method to reduce our code.

group_by_year = df.loc[:, ['year', 'revenue', 'profit']].groupby('year')
avgs = group_by_year.mean()
x = avgs.index
y1 = avgs.profit

def plot(x, y, ax, title, y_label):
    ax.set_title(title)
    ax.set_ylabel(y_label)
    ax.plot(x, y)
    ax.margins(x=0, y=0)
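If you are curious what groupby('year').mean() actually computes, here is a stdlib-only sketch of the same aggregation over a few invented (year, profit) pairs standing in for DataFrame rows:

```python
from collections import defaultdict

# A few invented (year, profit) pairs standing in for DataFrame rows.
rows = [(1955, 806.0), (1955, 584.8), (1956, 700.0)]

sums = defaultdict(lambda: [0.0, 0])  # year -> [running total, count]
for year, profit in rows:
    sums[year][0] += profit
    sums[year][1] += 1

# Equivalent of df.groupby('year').profit.mean(): one average per year.
avg_profit = {year: total / n for year, (total, n) in sums.items()}
print(round(avg_profit[1955], 1))  # -> 695.4
```

pandas performs this split-apply-combine step in optimised C code, but the result is the same: one mean revenue and one mean profit per year, indexed by year.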

Now let’s plot!

fig, ax = plt.subplots()
plot(x, y1, ax, 'Increase in mean Fortune 500 company profits from 1955 to 2005', 'Profit (millions)')

Increase in mean Fortune 500 company profits from 1955 to 2005

Wow, that looks like an exponential, but it’s got some huge dips. They must correspond to the early 1990s recession and the dot-com bubble. It’s pretty interesting to see that in the data. But how come profits recovered to even higher levels post each recession?

Maybe the revenues can tell us more.

y2 = avgs.revenue
fig, ax = plt.subplots()
plot(x, y2, ax, 'Increase in mean Fortune 500 company revenues from 1955 to 2005', 'Revenue (millions)')

Increase in mean Fortune 500 company revenues from 1955 to 2005

That adds another side to the story. Revenues were nowhere near as badly hit; that’s some great accounting work from the finance departments.

With a little help from Stack Overflow, we can superimpose these plots with +/- their standard deviations.

def plot_with_std(x, y, stds, ax, title, y_label):
    ax.fill_between(x, y - stds, y + stds, alpha=0.2)
    plot(x, y, ax, title, y_label)

fig, (ax1, ax2) = plt.subplots(ncols=2)
title = 'Increase in mean and std Fortune 500 company %s from 1955 to 2005'
stds1 = group_by_year.std().profit.values
stds2 = group_by_year.std().revenue.values
plot_with_std(x, y1.values, stds1, ax1, title % 'profits', 'Profit (millions)')
plot_with_std(x, y2.values, stds2, ax2, title % 'revenues', 'Revenue (millions)')
fig.set_size_inches(14, 4)


That’s staggering, the standard deviations are huge. Some Fortune 500 companies make billions while others lose billions, and the risk has increased along with rising profits over the years. Perhaps some companies perform better than others; are the profits of the top 10% more or less volatile than the bottom 10%?

There are plenty of questions that we could look into next, and it’s easy to see how the flow of working in a notebook matches one’s own thought process, so now it’s time to draw this example to a close. This flow helped us to easily investigate our data set in one place without context switching between applications, and our work is immediately sharable and reproducible. If we wished to create a more concise report for a particular audience, we could quickly refactor our work by merging cells and removing intermediary code.

Sharing your notebooks

When people talk of sharing their notebooks, there are generally two paradigms they may be considering. Most often, individuals share the end result of their work, much like this article itself, which means sharing non-interactive, pre-rendered versions of their notebooks; however, it is also possible to collaborate on notebooks with the aid of version control systems such as Git.

That said, there are some nascent companies popping up on the web offering the ability to run interactive Jupyter Notebooks in the cloud.

Before you share

A shared notebook will appear exactly in the state it was in when you exported or saved it, including the output of any code cells. Therefore, to ensure that your notebook is share-ready, so to speak, there are a few steps you should take before sharing:

  1. Click “Cell > All Output > Clear”
  2. Click “Kernel > Restart & Run All”
  3. Wait for your code cells to finish executing and check they did so as expected

This will ensure your notebooks don’t contain intermediary output or a stale state, and that they were executed in order at the time of sharing.
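Because an .ipynb file is just JSON under the hood, you can also clear outputs programmatically. Here is a minimal sketch using only the standard library; the tiny notebook-shaped dict below is a stand-in for a real file read from disk:

```python
import json

def clear_outputs(nb):
    """Strip outputs and execution counts from every code cell, in place."""
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return nb

# A minimal notebook-shaped dict standing in for a real .ipynb file:
nb = json.loads("""
{"cells": [{"cell_type": "code", "source": "1 + 1",
            "outputs": [{"output_type": "execute_result"}],
            "execution_count": 3}]}
""")
cleared = clear_outputs(nb)
print(cleared["cells"][0]["outputs"])   # []
```

In practice you would `json.load` the notebook file, run it through a function like this, and `json.dump` it back, which is handy for pre-commit hooks.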

Exporting your notebooks

Jupyter has built-in support for exporting to HTML and PDF as well as several other formats, which you can find from the menu under “File > Download As.” If you wish to share your notebooks with a small private group, this functionality may well be all you need. Indeed, as many researchers in academic institutions are given some public or internal webspace, and because you can export a notebook to an HTML file, Jupyter Notebooks can be an especially convenient way for them to share their results with their peers.

But if sharing exported files doesn’t cut it for you, there are also some immensely popular methods of sharing .ipynb files more directly on the web.


GitHub

With the number of public notebooks on GitHub exceeding 1.8 million by early 2018, it is surely the most popular independent platform for sharing Jupyter projects with the world. GitHub has integrated support for rendering .ipynb files directly both in repositories and gists on its website. If you aren’t already aware, GitHub is a code hosting platform for version control and collaboration for repositories created with Git. You’ll need an account to use their services, but standard accounts are free.

Once you have a GitHub account, the easiest way to share a notebook on GitHub doesn’t actually require Git at all. Since 2008, GitHub has provided its Gist service for hosting and sharing code snippets, which each get their own repository. To share a notebook using Gists:

  1. Sign in and browse to
  2. Open your .ipynb file in a text editor, select all and copy the JSON inside.
  3. Paste the notebook JSON into the gist.
  4. Give your Gist a filename, remembering to add .ipynb or this will not work.
  5. Click either “Create secret gist” or “Create public gist.”

This should look something like the following:

Creating a Gist

If you created a public Gist, you will now be able to share its URL with anyone, and others will be able to fork and clone your work.

Creating your own Git repository and sharing this on GitHub is beyond the scope of this tutorial, but GitHub provides plenty of guides for you to get started on your own.

An extra tip for those using git is to add an exception to your .gitignore for those hidden .ipynb_checkpoints directories Jupyter creates, so as not to commit checkpoint files unnecessarily to your repo.
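The entry itself is a single line:

```
# .gitignore
.ipynb_checkpoints/
```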


NBViewer

Having grown to render hundreds of thousands of notebooks every week by 2015, NBViewer is the most popular notebook renderer on the web. If you already have somewhere to host your Jupyter Notebooks online, be it GitHub or elsewhere, NBViewer will render your notebook and provide a shareable URL along with it. It is provided as a free service as part of Project Jupyter, and is available at

Initially developed before GitHub’s Jupyter Notebook integration, NBViewer allows anyone to enter a URL, Gist ID, or GitHub username/repo/file, and it will render the notebook as a webpage. A Gist’s ID is the unique number at the end of its URL; for example, the string of characters after the last backslash. If you enter a GitHub username or username/repo, you will see a minimal file browser that lets you explore a user’s repos and their contents.

The URL NBViewer displays when rendering a notebook is based on the URL of the notebook itself, so you can share it with anyone and it will work as long as the original files remain online — NBViewer doesn’t cache files for very long.

Final Thoughts

Starting with the basics, we have come to grips with the natural workflow of Jupyter Notebooks, delved into IPython’s more advanced features, and finally learned how to share our work with friends, colleagues, and the world. And we accomplished all this from a notebook itself!

It should be clear how notebooks promote a productive working experience by reducing context switching and emulating a natural development of thoughts during a project. The power of Jupyter Notebooks should also be evident, and we covered plenty of leads to get you started exploring more advanced features in your own projects.

If you’d like further inspiration for your own notebooks, Jupyter has put together a gallery of interesting Jupyter Notebooks that you may find helpful, and the NBViewer homepage links to some really fancy examples of quality notebooks. Also check out our list of Jupyter Notebook tips.

Want to learn more about Jupyter Notebooks? We have a guided project you may be interested in.


Posted in Information Technology

What is NFC & how does it work?

NFC is becoming pretty commonplace thanks to the growth of online payment systems like Samsung Pay and Android Pay, especially on high-end devices and even many mid-rangers. You’ve likely heard the term before, but what is NFC exactly? In this piece, we run down what it is, how it works, and what it can be used for.

What is NFC?

NFC stands for “Near Field Communication” and, as the name implies, it enables short range communication between compatible devices. This requires at least one transmitting device, and another to receive the signal. A range of devices can use the NFC standard and will be considered either passive or active.

Passive NFC devices include tags and other small transmitters that can send information to other NFC devices without the need for a power source of their own. However, they don’t really process any information sent from other sources, and can’t connect to other passive components. These often take the form of interactive signs on walls or advertisements.

Active devices are able to both send and receive data, and can communicate with each other as well as with passive devices. Smartphones are by far the most common form of active NFC device. Public transport card readers and touch payment terminals are also good examples of the technology.

How does NFC work?

Now that we know what NFC is, how does it work? Just like Bluetooth and WiFi, and all manner of other wireless signals, NFC works on the principle of sending information over radio waves. Near Field Communication is another standard for wireless data transmissions. This means that devices must adhere to certain specifications in order to communicate with each other properly. The technology used in NFC is based on older RFID (radio-frequency identification) ideas, which used electromagnetic induction to transmit information.

This marks the one major difference between NFC and Bluetooth/WiFi. The former can be used to induce electric currents within passive components as well as just send data. This means that passive devices don’t require their own power supply. They can instead be powered by the electromagnetic field produced by an active NFC component when it comes into range. Unfortunately, NFC technology does not command enough inductance to charge our smartphones, but QI charging is based on the same principle.


The transmission frequency for data across NFC is 13.56 megahertz, and data can be sent at either 106, 212, or 424 kilobits per second. That’s quick enough for a range of data transfers — from contact details to swapping pictures and music.
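As a back-of-the-envelope illustration of what those rates mean in practice (the payload sizes here are made-up examples):

```python
# Rough transfer-time estimates at NFC's three data rates.
RATES_KBPS = (106, 212, 424)

def transfer_seconds(size_bytes, rate_kbps):
    """Seconds needed to move size_bytes at rate_kbps (kilobits per second)."""
    return size_bytes * 8 / (rate_kbps * 1000)

for rate in RATES_KBPS:
    t = transfer_seconds(2_000, rate)   # a contact card of roughly 2 KB
    print(f"{rate} kbit/s: {t:.2f} s for a 2 KB contact card")
```

A contact card moves in a fraction of a second even at the slowest rate, while a multi-megabyte photo would take minutes, which is why NFC is often used only to bootstrap a faster Bluetooth or WiFi connection for large transfers.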

To determine what sort of information will be exchanged between devices, the NFC standard currently has three distinct modes of operation. Perhaps the most common use in smartphones is the peer-to-peer mode. This allows two NFC-enabled devices to exchange various pieces of information between each other. In this mode both devices switch between active when sending data and passive when receiving.

Read/write mode, on the other hand, is a one-way data transmission. The active device, possibly your smartphone, links up with another device in order to read information from it. NFC advert tags use this mode.

The final mode of operation is card emulation. The NFC device can function as a smart or contactless credit card and make payments or tap into public transport systems.

Comparisons with Bluetooth

While we have answered the question “What is NFC?”, how does it compare with other wireless technologies? You might think that NFC is a bit unnecessary, considering that Bluetooth has been more widely available for many years. However, there are several important technological differences between the two that give NFC some significant benefits in certain circumstances. The major argument in favor of NFC is that it consumes much less power than Bluetooth. This makes NFC perfect for passive devices, such as the advertising tags mentioned earlier, as they can operate without a major power source.

However, this power saving does have some major drawbacks. Most notably, the range of transmission is much shorter than Bluetooth. While NFC has a range of around 10 cm, just a few inches, Bluetooth connections can transmit data up to 10 meters or more from the source. Another drawback is that NFC is quite a bit slower than Bluetooth. It transmits data at a maximum speed of just 424 kbit/s, compared to 2.1 Mbit/s with Bluetooth 2.1 or around 1 Mbit/s with Bluetooth Low Energy.

But NFC does have one major advantage: faster connectivity. Due to the use of inductive coupling, and the absence of manual pairing, it takes less than one tenth of a second to establish a connection between two devices. While modern Bluetooth connects pretty fast, NFC is still super handy for certain scenarios. Namely mobile payments.

Samsung Pay, Android Pay, and even Apple Pay use NFC technology — though Samsung Pay works a bit differently than the others. While Bluetooth works better for connecting devices together for file transfers, sharing connections to speakers, and more, we anticipate that NFC will always have a place in this world thanks to mobile payments — a quickly expanding technology.

Posted in Information Technology

Disaster Recovery for the Cloud

In recent decades, cloud computing has gained popularity due to its range of benefits to business organizations, ranging from cost optimization and access to high-performance IT infrastructure to security, compliance and ease of doing business. However, these advantages are only realized as long as the service is available and functioning to the expected reliability standards. To maximize the reliability of IT services delivered from off-site cloud datacenters, vendors and customers of cloud computing follow Disaster Recovery strategies. These practices are designed to mitigate the risks associated with operating mission-critical apps and data from cloud datacenters, which are not immune to natural disasters, cyber-attacks, power outages, networking issues and other technical or business challenges affecting service availability to end users.

Unplanned datacenter downtime costs businesses over $80,000 per hour, according to recent research. While large enterprises may be able to contain the financial damage associated with downtime incidents, small and midsize businesses experience the most damaging consequences. Research suggests that organizations without an adequate disaster recovery plan go into liquidation within 18 months of suffering a major downtime incident. This makes disaster recovery planning critical to business success amid growing dependence on cloud-enabled IT services, cybersecurity issues and power outage concerns.

Disaster Recovery (DR) is a component of security planning that constitutes the technologies, practices and policies to recover from a disaster that impacts the availability, functionality and performance of an IT service. The disaster may result from a human, technological or natural incident. Disaster Recovery is a subset of business continuity, which deals with the larger picture of keeping IT services running. While business continuity involves the processes and strategy to ensure a functioning IT service during and after a disaster, disaster recovery involves the measures and mechanisms that help regain application functionality and access to data following a disaster incident. The following is a brief guide to get you started with your disaster recovery planning initiatives:

Planning and Preparation

Disaster recovery planning is unique to every organization and depends on the metrics best suited to evaluating the recovery of an IT service following a disaster. Organizations need to identify the resilience level for their development, testing and production environments, and implement disaster recovery plans accordingly. The metrics in consideration could include Recovery Point Objective (RPO), the age limit of business information that must be recovered after the disaster, and Recovery Time Objective (RTO), the acceptable recovery time during which the IT service remains unavailable. These metrics should be aligned with the organizational goals of business continuity and must evolve over time as the organization scales and faces different sets of challenges in achieving these goals.
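To make the two metrics concrete, here is a small sketch of how an incident could be checked against RPO/RTO targets; the targets and timestamps are hypothetical:

```python
from datetime import datetime, timedelta

# Hypothetical targets: lose at most 15 minutes of data (RPO) and
# restore service within 1 hour (RTO).
RPO = timedelta(minutes=15)
RTO = timedelta(hours=1)

def evaluate_recovery(last_backup, disaster, service_restored):
    """Return (rpo_met, rto_met) for a single incident."""
    data_loss_window = disaster - last_backup   # age of newest recoverable data
    downtime = service_restored - disaster      # how long users were without service
    return data_loss_window <= RPO, downtime <= RTO

rpo_ok, rto_ok = evaluate_recovery(
    last_backup=datetime(2019, 3, 1, 9, 20),
    disaster=datetime(2019, 3, 1, 9, 30),
    service_restored=datetime(2019, 3, 1, 10, 10),
)
print(rpo_ok, rto_ok)   # True True (10 min of data lost, 40 min of downtime)
```

Tightening either target generally means more frequent backups or hotter standby infrastructure, which is exactly the cost/availability tradeoff discussed below.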

For customers of cloud infrastructure services, the requirements on these metrics should be defined in the SLA. High-availability architectures such as hybrid and multi-cloud environments offer improved operational performance in terms of service availability. However, the tradeoff between cost, availability, performance and other associated parameters should be considered for each investment option.

The following best practices should be employed in developing a disaster recovery program for your organization:

  • Understand how your organization defines a disaster.
  • Define your requirements: understand your RPO and RTO targets for different workloads and applications.
  • Re-evaluate disaster recovery on an ongoing basis to account for changing technical and business requirements.
  • Confirm the organization is capable of executing the disaster recovery plan in real scenarios; consider employee awareness and training, and disaster recovery exercises and drills.

Know Your Options

Disaster recovery solutions may involve a diverse range of options for different DR goals. A well-designed strategy focuses on an optimal tradeoff between cost, practicality and IT burden on one side and disaster recovery performance on the other. For instance, if a car risks a puncture while driving, would you rather run expensive run-flat tires; run a regular tire and keep a spare wheel with a replacement kit in the car; or run a regular tire, have no spare wheel and rely on roadside assistance to replace a flat? Each option has its own set of implications and requires a strategic assessment of the disaster recovery goals. It may be possible for organizations to follow a holistic disaster recovery plan that incorporates different disaster recovery patterns for different use cases as appropriate. For instance, a mission-critical app may require short RTO/RPO objectives, while an outage of an external marketing database may not impact business operations for a long duration.

Testing Your Disaster Recovery Capability

Organizations can develop the most applicable and appropriate disaster recovery program and yet fail to implement the measures in practical, real-world environments. These failures are often caused by limited employee training and by real-world situations that were overlooked during the disaster recovery planning stages. The proof is therefore in the testing of your disaster recovery program at frequent and regular intervals. These intervals may range from one to four times per year, although some fast-growing organizations may even resort to monthly testing exercises depending on their technical requirements or regulatory concerns.

The testing procedures should extend beyond the technology capabilities and encompass the people and processes. Disaster recovery simulations can help organizations understand how the technology will behave in transferring workloads across geographic locations if the primary datacenter is hit with a power outage. But what about the workforce responsible for executing the policies and procedures designed to streamline the disaster recovery process? This means that the disaster recovery program should also consider the education and training of employees responsible for executing key protocols to recover from a disaster situation.

Finally, it is important to keep up-to-date documentation of disaster recovery performance during exercises as well as real-world disaster incidents. Use this information as a feedback loop to tune your disaster recovery capabilities based on your organizational requirements. Disaster recovery for different cloud architecture models may be treated according to the impact on business and the technical requirements. For instance, multi-cloud environments may be less prone to disaster situations given appropriate SLAs covering multiple datacenter locations, RPO/RTO targets and other metrics. Organizations must therefore evaluate which cloud service model optimally fulfils their disaster recovery requirements for the different apps and data sets used to perform daily business operations.

Posted in Information Technology

Build Secure Microservices in Your Spring REST API

Users ask us for new features, bug fixes, and changes in domain logic for our applications every day. As any project (especially a monolith) grows, it often becomes difficult to maintain, and the barrier to entry for a new developer joining the project gets higher and higher.

In this tutorial, I’ll walk you through building a secure Spring REST API that tries to solve for some of these pain points using a microservices architecture.

In a typical microservices architecture, you divide your application into several smaller apps that can be more easily maintained and scaled, use different stacks, and support more teams working in parallel. But microservices are not a simple solution to every scaling and maintenance problem.

Microservices also present a number of architectural challenges that must be addressed:

  • How should those services communicate?
  • How should communication failures and availability be handled?
  • How can a user’s requests be traced between services?
  • And, how should you handle user authorization to access a single service?

Let’s dig in and find out how to address these challenges when building a Spring REST API.

Secure Your Spring REST API With OAuth 2.0

In OAuth 2.0, a resource server is a service designed to handle domain-logic requests and does not have any kind of login workflow or complex authentication mechanism: it receives a pre-obtained access token that guarantees a user has granted permission to access the server, and delivers the expected response.

In this post, you are going to build a simple Resource Server with Spring Boot and Okta to demonstrate how easy it is: a Resource Server that will receive and validate a JWT.

Add a Resource Server to Your Spring REST API

This example uses Okta to handle the whole authentication process. You can register for a free-forever developer account that will enable you to create as many users and applications as you need.

I have set up some things so we can get started easily. Please clone the following resource repository and check out the startup tag, as follows:

git clone -b startup secure-spring-rest-api
cd secure-spring-rest-api

This project has the following structure:

$ tree .
├── mvnw
├── mvnw.cmd
├── pom.xml
└── src
    ├── main
    │   ├── java
    │   │   └── net
    │   │       └── dovale
    │   │           └── okta
    │   │               └── secure_rest_api
    │   │                   ├──
    │   │                   ├──
    │   │                   └──
    │   └── resources
    │       └──
    └── test
        └── java
            └── net
                └── dovale
                    └── okta
                        └── secure_rest_api

14 directories, 9 files

I created it using the excellent Spring Initializr, adding the Web and Security dependencies. Spring Initializr provides an easy way to create a new Spring Boot service with some common auto-discovered dependencies. It also adds the Maven Wrapper: you use the command mvnw instead of mvn, and the tool will detect whether you have the designated Maven version installed and, if not, download it and then run the specified command.

The file HelloWorldController is a simple @RestController that outputs “Hello World.”

In a terminal, you can run the following command and see Spring Boot start:

mvnw spring-boot:run

TIP: If this command doesn’t work for you, try ./mvnw spring-boot:run instead.

Once it finishes loading, you’ll have a REST API ready and set to deliver to you a glorious Hello World message!

> curl http://localhost:8080/
Hello World

TIP: The curl command is not available by default for Windows users. You can download it from here.

Now, you need to properly create a protected Resource Server.

Set Up an OAuth 2.0 Resource Server

In the Okta dashboard, create an application of type Service. Service indicates a resource server that does not have a login page or any way to obtain new tokens.

Create new Service

Click Next, type the name of your service, and then click Done. You will be presented with a screen similar to the one below. Copy and paste your Client ID and Client Secret for later. They will be useful when you are configuring your application.

Service Created

Now, let’s code something!

Edit the pom.xml file and add dependencies for Spring Security and Okta. They will enable all the Spring AND Okta OAuth 2.0 goodness you need:

<!-- security - begin -->
<!-- security - end -->

By simply adding this dependency, your code is going to be like a locked house without a key. No one can access your API until you provide a key to your users. Run the command below again.

mvnw spring-boot:run

Now, try to access the Hello World resource:

> curl http://localhost:8080/

Add Spring Security to Your REST API

Spring Boot has a lot of classpath magic and is able to discover and automatically configure dependencies. Since you have added Spring Security, it automatically secured your resources. Now, you need to configure Spring Security so you can properly authenticate the requests.

NOTE: If you are struggling, you can check the modifications in Git branch step-1-security-dependencies.

For that, you need to modify application.properties as follows (use the client_id and client_secret provided by the Okta dashboard for your application):


Spring Boot uses annotations and code for configuring your application so you do not need to edit super boring XML files. This means you can use the Java compiler to validate your configuration!

I usually split configuration across different classes, each with its own purpose. Create the class net.dovale.okta.secure_rest_api.SecurityConfig as follows:

package net.dovale.okta.secure_rest_api;

import org.springframework.security.config.annotation.web.configuration.EnableWebSecurity;
import org.springframework.security.oauth2.config.annotation.web.configuration.EnableResourceServer;

@EnableWebSecurity
@EnableResourceServer
public class SecurityConfig {}

Allow me to explain what the annotations here do:

  • @EnableWebSecurity tells Spring we are going to use Spring Security to provide web security mechanisms
  • @EnableResourceServer is a convenient annotation that enables request authentication through OAuth 2.0 tokens. Normally, you would provide a ResourceServerConfigurer bean, but Okta’s Spring Boot starter conveniently provides one for you.

That’s it! Now, you have a completely configured and secured Spring REST API without any boilerplate!

Run Spring Boot again and check it with cURL.

mvnw spring-boot:run
# in another shell
curl http://localhost:8080/
{"error":"unauthorized","error_description":"Full authentication is required to access this resource"}

The message changed, but you still don’t have access… why? Because now the server expects an Authorization header with a valid token. In the next step, you’ll create an access token and use it to access your API.

NOTE: Check the Git branch step-2-security-configuration if you have any doubts.

Generate Tokens in Your Spring REST API

So… how do you obtain a token? A resource server has no responsibility for obtaining valid credentials: it will only check that the token is valid and proceed with the method execution.
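Structurally, a JWT is just three base64url-encoded segments separated by dots: header, payload (claims), and signature. A real resource server cryptographically verifies the signature, but to illustrate the structure, here is a sketch that decodes a hand-made, unsigned token; never skip verification in production:

```python
import base64
import json

def peek_jwt_payload(token):
    """Decode a JWT's payload segment WITHOUT verifying its signature."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)   # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))

def b64url(raw):
    """Base64url-encode bytes, dropping the padding as JWTs do."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

# A hand-assembled, unsigned token purely for illustration:
token = ".".join([b64url(b'{"alg":"none"}'),
                  b64url(b'{"sub":"user","scp":["profile"]}'),
                  ""])
print(peek_jwt_payload(token))   # {'sub': 'user', 'scp': ['profile']}
```

The claims inside (subject, expiry, scopes) are what the resource server inspects after the signature checks out.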

An easy way to obtain a token is to generate one using the OpenID Connect debugger.

First, you’ll need to create a new Web application in Okta:

New web application

Set the Login redirect URIs field to and Grant Type Allowed to Hybrid. Click Done and copy the client ID for the next step.

Now, on the OpenID Connect website, fill the form in like the picture below (do not forget to fill in the client ID for your recently created Okta web application):

OpenID connect

Submit the form to start the authentication process. You’ll receive an Okta login form if you are not logged in or you’ll see the screen below with your custom token.

OpenID connect - getting token

The token will be valid for one hour so you can do a lot of testing with your API. It’s simple to use the token — just copy it and modify the curl command to use it as follows:

> export TOKEN=${YOUR_TOKEN}
> curl http://localhost:8080 -H "Authorization: Bearer $TOKEN"
Hello World
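If you’d rather test from Python than curl, the same request can be sketched with the standard library; the token value below is a placeholder for the one you generated:

```python
import urllib.request

token = "eyJ...your-token-here"   # placeholder; paste your real token

req = urllib.request.Request(
    "http://localhost:8080/",
    headers={"Authorization": f"Bearer {token}"},
)
# urllib.request.urlopen(req) would perform the call once the server is up;
# here we only show that the bearer header is attached:
print(req.get_header("Authorization"))
```

This is handy for scripting smoke tests against the API once you have a valid token in an environment variable.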

Add OAuth 2.0 Scopes

OAuth 2.0 scopes are a feature that lets users decide whether an application is authorized to do something restricted. For example, you could have “read” and “write” scopes. If an application needs the write scope, it should ask the user for that specific scope. Scopes can be handled automatically by Okta’s authorization server.

A resource server can have different endpoints, each protected by a different scope. Next, you are going to learn how to set different scopes and how to test them.
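Conceptually, a scope check is just a gate in front of the request handler. A rough Python sketch of the idea (an analogy only, not how Spring implements it):

```python
from functools import wraps

def pre_authorize(required_scope):
    """Loose analogue of Spring's @PreAuthorize("#oauth2.hasScope(...)")."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(token_scopes, *args, **kwargs):
            if required_scope not in token_scopes:
                # Mirror the error body the resource server returns
                return {"error": "access_denied",
                        "error_description": "Access is denied"}
            return fn(token_scopes, *args, **kwargs)
        return wrapper
    return decorator

@pre_authorize("profile")
def hello_protected(token_scopes):
    return "Hello VIP"

print(hello_protected({"openid"}))              # access_denied error
print(hello_protected({"openid", "profile"}))   # Hello VIP
```

The handler itself stays oblivious to authorization; the gate inspects the scopes granted to the token and either rejects the call or lets it through, which is exactly the behavior you will see with curl below.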

Add a new annotation to your SecurityConfig class:

@EnableGlobalMethodSecurity(prePostEnabled = true)
public class SecurityConfig {}

The new @EnableGlobalMethodSecurity(prePostEnabled = true) annotation tells Spring to use AOP-like method security, and prePostEnabled = true will enable pre and post annotations. Those annotations will enable us to define security programmatically for each endpoint.

Now, make changes to HelloWorldController to create a scope-protected endpoint:

@PreAuthorize("#oauth2.hasScope('profile')")
@GetMapping("/protected/")
public String helloWorldProtected(Principal principal) {
    return "Hello VIP " + principal.getName();
}

Pay attention to @PreAuthorize("#oauth2.hasScope('profile')"). It says: before running this method, verify the request has authorization for the specified scope. The #oauth2 bit is added by Spring’s OAuth2SecurityExpressionMethods class (check the other methods available) and reaches your classpath through the spring-cloud-starter-oauth2 dependency.

OK! After a restart, your server will be ready! Make a new request to the endpoint using your current token:

> curl http://localhost:8080/protected/ -H "Authorization: Bearer $TOKEN"
{"error":"access_denied","error_description":"Access is denied"}

Since your token does not have the desired scope, you’ll receive an access is denied message. To fix this, head back over to OIDC Debugger and add the new scope.

Profile scope

Try again using the newly obtained token:

> curl http://localhost:8080/protected/ -H "Authorization: Bearer $TOKEN"
Hello VIP

That’s it! If you are in doubt of anything, check the latest repository branch finished_sample.

TIP: Since profile is a common OAuth 2.0 scope, you don’t need to change anything in your authorization server. Need to create a custom scope? See Simple Token Authentication for Java Apps.

Learn More About Spring and REST APIs

In this tutorial, you learned how to use Spring Boot to create a resource server and seamlessly integrate it with OAuth 2.0. Both Spring and REST APIs are huge topics, with lots to discuss and learn.

The source code for this tutorial is available on GitHub.

Here are some other posts that will help you further your understanding of both Spring and REST API security:

Like what you learned today? Follow us on Twitter, like us on Facebook, check us out on LinkedIn, and subscribe to our YouTube channel.

Create a Secure Spring REST API was originally published to the Okta developer blog on December 18, 2018 by Raphael do Vale.

Posted in Information Technology, Software Engineering

COBOL on AWS Lambda

AWS has joined with Blu Age, a vendor that offers tools to modernize applications, to support COBOL on Lambda, the cloud provider’s serverless computing platform. Blu Age’s software provides a COBOL runtime and takes advantage of AWS’ new Lambda Layers feature.

Developers can run COBOL-based functions on AWS’ native Java 8 runtime and use Blu Age’s compiler to build Lambda deployment packages from COBOL source code, Blu Age said. Support for COBOL applications is part of a wave of improvements to AWS Lambda revealed last week at its re:Invent conference.

There are massive amounts of COBOL code still operational today — some 220 billion lines of it, as Reuters reported last year. The venerable language is a linchpin of the financial industry’s transaction processing systems. For example, about 95% of ATM card swipes rely on COBOL code, Reuters said.

Blu Age’s COBOL runtime for Lambda means that users don’t have to manage servers or containers, and AWS handles scaling. Costs accrue for every 100 ms of code execution time, with no fees when code is idle.

The other advantage of Lambda is that it helps developers decompose COBOL applications into microservices to provide more agility and flexibility, Blu Age said.

Many current COBOL applications are already well-suited for serverless, said Ryan Marsh, a DevOps and serverless trainer and consultant based in Houston.

“Your typical COBOL application that you most often run into in the wild is an application that runs from time to time,” Marsh said. “It’s batch-oriented. It takes data from place to place, does things with it and puts it somewhere else or calls other COBOL applications.”


Marsh likened COBOL applications to Rube Goldberg machines; both are composed of a series of things deliberately chained together to perform complex tasks. Serverless applications follow this same model.

Few COBOL apps are monolithic, where all the functionality is in one executable or invocation that continuously runs, he said. That’s why a move to a serverless model makes sense for COBOL apps.

Some IT shops might be hesitant to move COBOL applications from on premises to the cloud, because those apps often run on hardware paid for long ago.

“Moving into something where I’m renting [compute resources] and paying for it again just doesn’t make sense,” Marsh said. However, there are bigger considerations for COBOL shops to mull.

“When I completely remove the ops headaches and I no longer have to think about virtual machines and instances and worry about disk space and things like that and I’m just thinking about my business logic and data, that makes perfect sense,” he said.

Meanwhile, some companies spend vast sums of money to rewrite their COBOL applications in languages such as Java, but that idea is wrongheaded, Marsh said. For one, documentation on these old applications is frequently poor, making a rewrite project much more fraught with pitfalls.

“What if you could skip all that and lift and shift it into the cloud?” Marsh said. “Moving to Lambda, versus a wholesale rewrite to technology and patterns that were cutting-edge 10 years ago, is much more advisable. You can skip two-plus generations of application development patterns. How often do you have that kind of opportunity in enterprise application development?”



Posted in Information Technology

From cloud-first to cloud-smart: How enterprises are getting smarter about their AWS cloud adoption

Today’s enterprises are getting smarter about the cloud. In the early days of digital transformation, the hype around public cloud made it seem almost magical—a way for IT to increase agility, lower cost, and enable instant scalability with the wave of a wand. Over time, though, a more nuanced reality became apparent. On-demand resources can be a boon for developers, business units, and the organization as a whole, but their radical ease of provisioning can also lead to out-of-control costs, security gaps, management headaches, and the rueful discovery that not every workload is truly cloud-friendly. Having learned a few painful but important lessons, IT leaders are now making more thoughtful decisions about when and how to use the cloud—and discovering that the resulting value can still seem close to supernatural.

At the recent AWS re:Invent 2018 event in Las Vegas, Nevada, David Cramer, President of Digital Service Operations at BMC Software, sat down with John Walls and Justin Warren of the Cube to discuss the state of cloud thinking today. The following Q&A is adapted from their conversation.

What path are people taking to the cloud these days, and what are they doing there?
We see companies at every point on their cloud journey. While some are still just getting started, most have already begun moving apps to the cloud as well as building new cloud-native apps. But there’s an important difference from the early days. A few years ago, the word of the day was cloud-first, but you can spend a lot of money that way if you’re not careful. Now people are stepping back and thinking: What can cloud services and infrastructure really give us, and how can we optimize what we put in the cloud?

There’s a balance to strike here. DevOps culture is all about releasing more software into the market, faster—new features, products, and innovations. To do that at a competitive speed, you can’t have top-down command-and-control; you have to let teams be autonomous. At the same time, that creates opportunity for waste, mistakes, misconfigurations, and security issues. You have to keep your eyes open and have a real strategy for keeping things from getting out of control.

How is BMC helping customers decide what to move to cloud and what to keep on-premises?
During the heyday of lift-and-shift, a lot of people wanted to replatform and rearchitect everything in sight, but some of the apps they were looking at weren’t actually going to benefit from the cloud’s elasticity. You can deliver a virtual machine more cheaply in-house anyway, so what are you basing that decision on?

BMC has a rich history in the data center, and we’ve developed ways to assess and identify which high-value assets can really save you money in the cloud, and which are more suited to on-premises. You have to start with a foundation of app discovery, capacity and performance data, and insight into your configurations. To move something to the cloud, you first need to know what it looks like and how it runs in the data center.

Security is fundamental in the cloud. What’s BMC’s cloud security story?
Our cloud security strategy is based in large part on our own experiences. A few years ago, as our customers started wanting to consume services in different ways, we began building new apps in Amazon using Lambda. That meant we also had to learn how to secure them. We focused in particular on optimizing the security and configuration of the cloud infrastructure or platform layer. A lot of companies don’t understand Amazon’s shared responsibility model and expect someone else to take care of these problems. Amazon does provide a lot of security, but not for the stuff you put in your environment—your middleware, software, the services you connect, and so on. So we’re helping customers secure the configuration of those services.

There’s also an organizational factor. DevOps teams are often developer-heavy, and they’re not all that interested in compliance, code-scanning, or testing—they want to build cool stuff. We need to build protection around them and their processes so they can focus on what they’re best at.

What about the cost factor?
Cost is definitely a key area to focus on. A developer with access to the Amazon suite of services is like a kid in a candy store. They want gold-plated everything—the best environment and database, the most advanced services. It takes a mature, fully empowered cloud cost control practice to make sure you’re spending the right money in the right ways, and we’re helping a lot of customers develop that capability.

The scale of modern data volumes is truly mind-boggling. Has it grown beyond our ability to keep up?
Most people would agree that we’re well past human scale at this point. There are millions of containers spun up every day for Google Maps alone. You can’t take old approaches to something like that. We used to talk about pushbutton automation, but it was smart people figuring out what to automate, and only then pushing the button. Now we’re getting to a new level of trust, allowing the automation routines to take over and run on their own, whether driven by machine learning and AI algorithms or just by simple policies. That’s still a hard shift to make, though—it runs against instinct. Just as in DevOps, CIOs are concerned about guardrails and governance. They want to allow autonomy but also need to restrict the potential for mistakes and cost overruns.

Taking the broad view, this is an exciting and dynamic time to be in cloud, and in IT ops in general. Ops teams are moving into DevOps and cloud ops, and more and more you have everyone working together as a team—ops everywhere, dev everywhere. IT ops people have always been good at running things, and now IT ops teams are getting the dev and coding skills they need to both run and reinvent. The cloud is central to that vision, from providing the resources to enable DevOps, to transforming the way organizations provision, optimize, and manage their infrastructure as a whole. We’re thrilled to be at the heart of it all, and we’re inspired every day by what we see our customers accomplishing.

Posted in Information Technology

Streaming MySQL tables in real-time to Kafka

Original writer: Prem Santosh Udaya Shankar, Software Engineer

As our engineering team grew, we realized we had to move away from a single monolithic application towards services that were easier to use and could scale with us. While there are many upsides to switching to this architecture, there are downsides as well, one of the bigger ones being: how do we provide services with the data they require as soon as possible?

In an ecosystem with hundreds of services, each managing millions and millions of rows in MySQL, communicating these data changes across services becomes a hard technical challenge. As mentioned in our previous blog post, you will quickly hit the N+1 query problem. For us, this meant building a system, called MySQLStreamer, to monitor for data changes and alert all subscribing services to them.

MySQLStreamer is a database change-data-capture and publishing system. It captures each individual database change, envelopes it in a message, and publishes it to Kafka. The ability to replay every changeset in order to reconstruct a table at a particular snapshot in time, and conversely to reproduce a stream by iterating over every change to a table, is referred to as stream-table duality.

In order to understand how we capture and publish database changes, it's essential to know a bit about our database infrastructure. At Yelp, we store most of our data in MySQL clusters. To handle a high volume of visits, Yelp operates a hierarchy of dozens of geographically distributed read replicas to spread the read load.

How does replication work?

In order for replication to work, events on the master database cluster are written to a special log called the binary log. When a replica connects to its master, it reads this binary log to begin or continue the process of replication, depending on its place in the replication hierarchy. This process is driven by two threads running on the replica, an IO thread and a SQL thread, as visualized in the figure below. The IO thread is primarily responsible for reading binary log events from the master as they arrive and copying them over to a local relay log on the replica. The SQL thread then reads these events and replays them in the same order in which they arrived.
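To make the two-thread split concrete, here is a toy in-memory model of the IO thread and SQL thread (illustrative only, not the real MySQL protocol; the table name is hypothetical):

```python
# Toy model of MySQL replication: the IO thread copies binlog events from
# the master into a local relay log, and the SQL thread replays relay-log
# events in arrival order. Real replication tracks binlog coordinates,
# not list indices.

def io_thread_step(master_binlog, relay_log, copied_so_far):
    """Copy any new master binlog events into the replica's relay log."""
    new_events = master_binlog[copied_so_far:]
    relay_log.extend(new_events)
    return copied_so_far + len(new_events)

def sql_thread_step(relay_log, applied_so_far, replica_state):
    """Replay relay-log events in order; may lag behind the IO thread."""
    for event in relay_log[applied_so_far:]:
        table, row = event
        replica_state.setdefault(table, []).append(row)
    return len(relay_log)

master_binlog = [("business", {"id": 1}), ("business", {"id": 2})]
relay_log, replica_state = [], {}
copied = io_thread_step(master_binlog, relay_log, 0)
applied = sql_thread_step(relay_log, 0, replica_state)
print(replica_state)  # {'business': [{'id': 1}, {'id': 2}]}
```

Replication lag is simply the case where `applied` trails `copied`: the IO thread has fetched events that the SQL thread has not yet replayed.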

Replication Overview

Replication in MySQL

It should be noted that the event being played by the replica's SQL thread is not necessarily the latest event logged in the master's binary log. This gap is shown in the figure above and is called replication lag. Yelp's MySQLStreamer acts as a read replica, persisting updates into Apache Kafka instead of materializing them into a table.

Types of MySQL replication

There are two ways of replicating a MySQL database:

  • Statement-based replication (SBR)
  • Row-based replication (RBR)

In statement-based replication, SQL statements are written to the binary log by the master, and the slave's SQL thread then replays these statements on the slave. There are a few disadvantages to using statement-based replication. One important disadvantage is the possibility of data inconsistency between the master and the slave. This is because the SQL thread on the slave is simply responsible for replaying log statements copied over from the master, but some statements generate non-deterministic output. Consider the following query:

INSERT INTO places (name, location)
(SELECT name, location FROM business)

This is a scenario where you want to SELECT certain rows and INSERT them into another table. Selecting multiple rows without an ORDER BY clause returns rows in no guaranteed order, so the order may differ each time the statement is replayed. If a column has AUTO_INCREMENT associated with it, the rows might also end up with different auto-increment values each time the statement is executed. Another example is using the RAND() or NOW() functions, which generate different results when played on different hosts in the replication topology. Due to these limitations, we use row-based replication on the database tailed by the MySQLStreamer. In row-based replication, each event shows how the individual rows of a table have been modified. UPDATE and DELETE events contain the original state of the row before it was modified. Hence, replaying these row changes keeps the data consistent.
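The non-determinism is easy to demonstrate with a toy simulation (plain Python, not MySQL; seeds and names are made up). Under statement-based replication each host re-executes the statement, so a call like RAND() produces different values per host; under row-based replication the replica replays the exact row values the master produced:

```python
import random

def run_statement(seed):
    """Simulate one host executing a statement that calls RAND()."""
    rng = random.Random(seed)
    return [("place", rng.random()) for _ in range(3)]

# Statement-based: master and replica each execute the statement themselves.
master_rows = run_statement(seed=1)
sbr_replica_rows = run_statement(seed=2)   # different host, different RAND()

# Row-based: the replica replays the concrete rows the master wrote.
rbr_replica_rows = list(master_rows)

print(master_rows == sbr_replica_rows)  # False: the replicas diverged
print(master_rows == rbr_replica_rows)  # True: row replay stays consistent
```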

Now we know what replication is, but why do we need the MySQLStreamer?

One of the main uses of Yelp's real-time streaming platform is to stream data changes and process them, in order to keep downstream systems up to date. There are two kinds of SQL change events that we have to be aware of:

  • DDL (Data Definition Language) statements, which define or alter a database structure or schema.
  • DML (Data Manipulation Language) statements, which manipulate data within schema objects.

The MySQLStreamer is responsible for:

  • Tailing the MySQL binary log, consuming both of these event types
  • Handling the events depending on their type, and publishing the DML events to Kafka topics

The MySQLStreamer publishes four distinct event types: Insert, Update, Delete and Refresh. The first three correspond to DML statements of the same type. Refresh events are generated by our bootstrap process, described in detail later. For each event type, we include the complete row contents. Update events include the full row, both before and after the update. This is particularly important for dealing with cycles. Imagine implementing a geocoding service that consumes Business updates, and triggers a latitude and longitude update on that same Business row if the address changes. Without the before row content, the service would have to store a significant amount of state to determine whether a row's address had actually changed, and to ignore latitude and longitude updates. With both before and after content, it's trivial to generate a diff and break these cycles without keeping any state.

Event Type | Message Contents
---------- | ----------------
Insert     | Full row
Update     | Full row before the update and full row after the update
Delete     | Full row before the delete
Refresh    | Full row

MySQLStreamer Event Types
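The cycle-breaking diff described above can be sketched in a few lines. This is a toy consumer, not Yelp's actual geocoding service, and the field names are hypothetical:

```python
# Given the before and after row images from an Update event, a consumer
# can act only when a field it cares about actually changed.

def changed_fields(before, after):
    """Return the set of column names whose values differ."""
    return {k for k in before if before.get(k) != after.get(k)}

def should_geocode(update_event):
    """Re-geocode only when the address changed, ignoring our own
    latitude/longitude writes (which would otherwise loop forever)."""
    changed = changed_fields(update_event["before"], update_event["after"])
    return "address" in changed

address_update = {
    "before": {"id": 7, "address": "123 Main St", "lat": 0.0, "lng": 0.0},
    "after":  {"id": 7, "address": "456 Oak Ave", "lat": 0.0, "lng": 0.0},
}
geocode_result = {
    "before": {"id": 7, "address": "456 Oak Ave", "lat": 0.0, "lng": 0.0},
    "after":  {"id": 7, "address": "456 Oak Ave", "lat": 37.8, "lng": -122.3},
}
print(should_geocode(address_update))  # True: the address changed
print(should_geocode(geocode_result))  # False: cycle broken, no state kept
```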

Database Topology

The MySQLStreamer is powered by three databases as shown below:

Database Topology

Database Topology

The Source Database

This is the upstream database whose data changes we capture. The MySQLStreamer tails it in order to stream those change events to downstream consumers. A binary log stream reader in the MySQLStreamer is responsible for parsing the binary log for new events. Our stream reader is an abstraction over the BinLogStreamReader from the python-mysql-replication package. This API provides three main functionalities: to peek at the next event, to pop the next event, and to resume reading the stream from a specific position.
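A minimal sketch of that peek/pop/resume interface over an in-memory event list (the real reader wraps python-mysql-replication's BinLogStreamReader and tracks binlog coordinates, not list indices):

```python
class StreamReader:
    """Toy stream reader: peek without consuming, pop to consume,
    and resume from a saved position via the constructor."""

    def __init__(self, events, position=0):
        self._events = list(events)
        self._pos = position  # resume reading from a specific position

    def peek(self):
        """Look at the next event without advancing."""
        if self._pos < len(self._events):
            return self._events[self._pos]
        return None

    def pop(self):
        """Consume and return the next event."""
        event = self.peek()
        if event is not None:
            self._pos += 1
        return event

reader = StreamReader(["ev1", "ev2", "ev3"], position=1)  # resume at offset 1
print(reader.peek())  # 'ev2' (peek does not advance)
print(reader.pop())   # 'ev2'
print(reader.pop())   # 'ev3'
```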

The Schema Tracker Database

The schema tracker database is analogous to a schema-only slave database. It is initially populated with schemas from the source database, and kept up to date by replaying DDL statements against it. This means it skips the data and stores the skeleton of all the tables. We lazily retrieve CREATE TABLE statements from this database to generate Avro schemas using the Schematizer service. Schema information is also necessary to map column name information to rows in the binary log. Because of replication lag, the database schema for the current replication position of the MySQLStreamer doesn’t necessarily match the current schema on the master. Hence, the schema used by the MySQLStreamer cannot be retrieved from the master. We chose to use a database for this to avoid re-implementing MySQL’s DDL engine.

Should the system fail during the execution of a DDL statement, the database could be left in a corrupted state, since DDL statements are not transactional. To circumvent this issue, we treat the entire database transactionally. Before applying any DDL event, we checkpoint, take a schema dump of the entire schema tracker database, and store it in the state database. The DDL event is then played. If it succeeds, the stored schema dump is deleted and another checkpoint is taken. A checkpoint basically consists of saving the binary log file name and position along with the Kafka offset information. After a failure, when the MySQLStreamer restarts it checks whether a schema dump exists. If it does, it replays the schema dump before handling the DDL event it failed on. Once the schema dump is replayed, the MySQLStreamer resumes tailing events from the checkpointed position, eventually catching up to real time.
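The dump-then-apply-then-delete cycle can be sketched as follows. Here a plain dict stands in for the schema tracker database and the "dump" is a deep copy; the real system dumps and replays actual MySQL schemas:

```python
import copy

def apply_ddl(schema_db, state_db, ddl):
    """Apply one DDL (a callable that mutates schema_db and may raise),
    protected by a schema dump stored in the state database."""
    state_db["schema_dump"] = copy.deepcopy(schema_db)  # dump before the DDL
    try:
        ddl(schema_db)
    except Exception:
        # Recovery: replay the stored dump so the tracker is stable again;
        # the dump is kept so the DDL can be re-handled after restart.
        schema_db.clear()
        schema_db.update(state_db["schema_dump"])
        return False
    del state_db["schema_dump"]  # success: the dump is no longer needed
    return True

schema, state = {"business": ["id", "name"]}, {}

def bad_ddl(db):
    db["business"].append("location")
    raise RuntimeError("crash mid-DDL")

ok = apply_ddl(schema, state, bad_ddl)
print(ok, schema)   # False {'business': ['id', 'name']} -- state restored

ok2 = apply_ddl(schema, state, lambda db: db["business"].append("location"))
print(ok2, schema)  # True {'business': ['id', 'name', 'location']}
```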

The State Database

The state database stores the MySQLStreamer’s internal state. It consists of three tables that store various pieces of state information:


  • Stores information about each topic and the corresponding last known published offset.


  • Stores the replication position. The position consists of the following fields:

heartbeat_signal | heartbeat_timestamp | log_file | log_position | offset

Position Information

One of the prerequisites for fail-safe replication is a unique identifier associated with every transaction. It is useful not only for recovery but also in hierarchical replication. The global transaction identifier (GTID) is one such identifier; it is identical across all the servers in a given replication setup. Though our code supports GTIDs, the version of MySQL we use does not. Hence, we needed an alternative way to store state that could easily be translated across the entire replication setup, which motivated us to piggyback on Yelp's heartbeat daemon. This Python daemon periodically writes heartbeat updates to the database, each consisting of a serial number and a timestamp, which are then replicated to all other replicas. The MySQLStreamer takes the heartbeat serial number and timestamp, attaches the log file name and log position it is currently working with, and stores them in the global_event_state table. If the current master fails for some reason, a batch job uses the heartbeat serial number and timestamp to find the corresponding log file and position on the new master.
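The key idea is that the same heartbeat lands at different binlog coordinates on different servers, so a saved heartbeat can be translated into local coordinates on whichever server becomes the new master. A toy lookup (file names and offsets are made up):

```python
def find_position(heartbeat_index, serial, timestamp):
    """Map a saved (serial, timestamp) heartbeat to this server's local
    binlog coordinates, or None if the heartbeat is unknown here."""
    return heartbeat_index.get((serial, timestamp))

# The same heartbeat event appears at different coordinates per server.
old_master = {(42, "2018-12-01 00:00:05"): ("binlog.000007", 1843)}
new_master = {(42, "2018-12-01 00:00:05"): ("binlog.000003", 912)}

saved = (42, "2018-12-01 00:00:05")       # checkpointed by the MySQLStreamer
print(find_position(new_master, *saved))  # ('binlog.000003', 912)
```

After failover, resuming from the translated position on the new master gives the same logical point in the replication stream that the checkpoint recorded on the old one.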


  • Stores the schema dump of the schema tracker database, used for restoring the database to a stable state after a failure.

How does the MySQLStreamer work?

MySQLStreamer Working

Working of MySQLStreamer

As the MySQLStreamer starts, it acquires a Zookeeper lock before it initiates processing on any of the incoming events. This step is necessary to prevent multiple instances of the MySQLStreamer from running on the same cluster. The problem with multiple instances running on the same cluster is that replication is inherently serial. In some applications we want to maintain order within and between tables, so we prevent multiple instances from running, preserving order and preventing message duplication.

As we have previously discussed, the MySQLStreamer receives events from the source database (Yelp database as seen in the figure above) via the binlog parser. If the event is a data event then the table schema for that event is extracted and sent to the Schematizer service. This service then returns the corresponding Avro schema and a Kafka topic. The Schematizer service is idempotent. It will return the exact same Avro schema and topic if it’s called with the same create table statement multiple times.
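The idempotence property can be sketched with a toy in-memory registry keyed by the create table statement. This is not the real Schematizer API; the hashing scheme and topic naming are invented for illustration:

```python
import hashlib

class ToySchematizer:
    """Toy registry: the same CREATE TABLE statement always maps to the
    same schema id and Kafka topic, no matter how often it is registered."""

    def __init__(self):
        self._registry = {}

    def register(self, create_table_stmt):
        key = hashlib.sha256(create_table_stmt.encode()).hexdigest()[:12]
        if key not in self._registry:
            self._registry[key] = {
                "schema_id": len(self._registry) + 1,
                "topic": "yelp.table.{}".format(key),
            }
        return self._registry[key]

s = ToySchematizer()
first = s.register("CREATE TABLE business (id INT, name VARCHAR(64))")
second = s.register("CREATE TABLE business (id INT, name VARCHAR(64))")
print(first == second)  # True: same statement, same schema and topic
```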

Data events are encoded with the received Avro schema and published to Kafka topics. The data pipeline’s Kafka Producer maintains an internal queue of events to be published to Kafka. If a schema event is received from the binlog parser, the MySQLStreamer first flushes all the events already present in the internal queue and then takes a checkpoint for the purposes of recovery in case of a failure. It then applies the schema event on the schema tracker database.

Failure handling of data events is slightly different compared to schema events. We checkpoint before processing any data event, and continue checkpointing as we publish batches of messages successfully. We trust successes, but never failures. If we encounter a failure we recover from it by inspecting the last checkpoint and Kafka high watermarks, and publishing only those messages that were not successfully published previously. On the Kafka side, we require acks from all in-sync replicas, and run with a high min.isr setting, trading availability for consistency. By methodically verifying and recovering from failures, we’re able to ensure that messages are published exactly once.
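The "trust successes, never failures" recovery rule can be modeled with a broker that is just a list: after a restart, compare the last checkpoint against Kafka's high watermark and republish only the gap. This is a simplification of the real protocol, which works per-topic against actual Kafka offsets:

```python
def recover_and_publish(broker, pending, checkpoint_offset):
    """Republish only the messages Kafka has not already stored."""
    high_watermark = len(broker)  # next offset the broker would assign
    already_published = high_watermark - checkpoint_offset
    for message in pending[already_published:]:
        broker.append(message)
    return broker

# Crash scenario: the checkpoint says offset 2, but Kafka already holds 3
# messages -- one publish succeeded after the checkpoint, before the crash.
broker = ["m1", "m2", "m3"]
pending_since_checkpoint = ["m3", "m4", "m5"]
recover_and_publish(broker, pending_since_checkpoint, checkpoint_offset=2)
print(broker)  # ['m1', 'm2', 'm3', 'm4', 'm5'] -- m3 is not duplicated
```

Because the high watermark is authoritative about what was actually stored, a publish whose acknowledgment was lost in the crash is skipped rather than duplicated, which is what makes the exactly-once guarantee possible.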

Bootstrapping a Topic

Yelp was founded in 2004. Many of the most interesting tables have existed for nearly as long as Yelp. We needed to find a way to bootstrap a Kafka topic with existing table contents. We engineered a procedure that can perform a consistent topic bootstrap while still processing live replication events.

Replication is recursive

Before we talk about bootstrapping, let's look at what the actual database replication topology looks like. Referencing the figure above, there is a master and a replica of the master called the intermediate master. Replicas of the intermediate master are called local masters. The MySQLStreamer is connected to a replica called the refresh primary, which in turn is a replica of one of the local masters. The refresh primary is set up with row-based replication, whereas all other replicas run statement-based replication.

Bootstrapping is initiated by creating a table like the original table we want to bootstrap on the MySQLStreamer’s refresh primary, using MySQL’s blackhole engine.

blackhole_table = create_blackhole_table(original_table)
while remaining_data(original_table):
    copy_batch(original_table, blackhole_table, batch_size)

Pseudo-code Bootstrap Process

The blackhole engine is like the /dev/null of database engines. The primary reason we chose it is that writes to blackhole tables aren't persisted, but are logged and replicated. This way we recreate the binary log events of the original table without having to store a duplicate of it.

Once we’ve created the blackhole table, we lock the original table to prevent any data changes during the bootstrap process. We then copy rows from the original table to the blackhole table in batches. As shown in the figure, the MySQLStreamer is connected to one of the leaf clusters. This is because, we do not want any change triggered by the bootstrap logic to trickle down to every child cluster. But we do want the original table to be updated with the latest changes during bootstrapping hence, between batches, we unlock the original table and wait for replication to catch up. Locking the original table can cause replication from a local master to the replica (refresh primary) to stall , but it guarantees that the data we’re copying into the blackhole table is consistent at that point in replication. By unlocking and allowing replication to catch up, the bootstrap process naturally throttles itself. The process is very fast, since the data never leaves the MySQL server. The replication delay we’re causing is measured in milliseconds per batch.

All of the complexity of this process happens in the database. Inside the MySQLStreamer, our code simply treats inserts into specially named blackhole tables as Refresh events on the original table. Refresh events are interleaved in topics with normal replication events, as regular Insert, Update, and Delete events are still published during the bootstrap. Semantically, many consumers treat Refresh events like upserts.

The Takeaway

Engineering time is valuable. A good principle for engineers to follow is "any repetitive task should be automated." For Yelp to scale, we had to engineer a single, flexible piece of infrastructure that could support a multitude of applications. With the data pipeline we can index data for search, warehouse data, and share transformed data with other internal services. The data pipeline proved to be immensely valuable and was a positive step toward achieving the required automation. The MySQLStreamer is a cardinal part of the data pipeline: it scrapes MySQL binary logs for change events and publishes those changes to Kafka. Once the changes are in Kafka topics, it's easy for downstream consumers to utilize them based on their individual use cases.


Posted in Information Technology

The Role of Predictive Analytics in Cloud Operations

Infrastructure operations in the era of complex multi-cloud technologies have become challenging and resource-intensive for the average cloud computing customer seeking to optimize cloud investments and resource performance. Organizations upgrade their infrastructure to operate at scale while cutting expenses and reducing performance and security issues. However, diagnosing, resolving and optimizing issues in the infrastructure has emerged as a challenge, considering the vast, dynamic and interconnected nature of the underlying hardware resources. Predictive analytics aims to modernize and simplify infrastructure operations by leveraging the vast deluge of data that IT operations generate.

Predictive analytics refers to the practice of using data to determine the future patterns and behavior of a system. Predictive analytics tools may use advanced machine learning algorithms and statistical analysis techniques to identify a system model with high accuracy. Prediction models applied to historical and present data can unveil insights into future trends in system behavior, and using this information, decisions about the system can be made proactively. Predictive analytics can help identify correlations between behaviors that might otherwise be overlooked or perceived as isolated. Algorithmic filtering also reduces the noisy data and false alarms that keep IT Ops teams sifting through vast data to identify the most useful insights. This capability offers immense opportunities in the IT infrastructure operations segment, where an isolated anomaly in network traffic behavior can translate into a large-scale data leak and remain hidden from sight until it's too late to react. Predictive analytics forms the building blocks of a modern AIOps architecture, which includes data ingestion, auto-discovery, correlation, visualization and automated actions.
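As a minimal illustration of statistical filtering versus fixed-threshold alerting, the sketch below flags a metric sample only when it deviates from its recent history by more than k standard deviations. The metric name, window size and threshold are assumptions, not a reference to any particular product:

```python
from statistics import mean, stdev

def anomalies(samples, window=5, k=3.0):
    """Return indices of samples more than k sigma from the trailing window."""
    flagged = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(samples[i] - mu) > k * sigma:
            flagged.append(i)
    return flagged

# A steady latency series with one spike: only the spike is flagged,
# instead of every sample that crosses some arbitrary fixed threshold.
latency_ms = [21, 20, 22, 21, 20, 21, 22, 95, 21, 20]
print(anomalies(latency_ms))  # [7]: the 95 ms spike stands out
```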

For cloud operations in particular, predictive analytics has a key role to play in the following domain applications:

Optimizing the Cloud Infrastructure

Many organizations use multiple cloud environments and, possibly, a range of siloed infrastructure monitoring, management and troubleshooting solutions. In trying to gain visibility into a hybrid multi-cloud environment, Cloud Ops teams using traditional analytics practices may rely on manual capabilities and overlook correlating pieces of information across the infrastructure environment. Data applications and IT workloads are increasingly dynamic in nature, and unpredictable changes in network traffic, infrastructure performance and scalability requirements impact IT operations decisions in real time. Making the right decisions proactively requires Cloud Ops to collect the necessary information from various sources and to correlate that information across siloed IT environments. Predictive analytics allows users to focus on the knowledge gleaned from data instead of collecting, processing and analyzing information from multiple cloud environments independently. Regardless of the complexity of the cloud network, the advanced machine learning algorithms that power predictive analytics capabilities provide the necessary abstraction between the complex underlying infrastructure and data analysis. Cloud Ops teams are ultimately able to use the collective insights to proactively make the right decisions on resource provisioning, storage capacity, server instance selection and load balancing, among other key cloud operations decisions.

Application Assurance and Uptime

Software applications are increasingly an integral component of business processes. When apps and IT services go down, business processes risk interruption. For this reason, IT shops continuously monitor an array of application and IT network performance metrics that correlate with business process performance. Any anomaly identified in IT performance impacts business operations. With predictive analytics solutions in place, IT can proactively prepare for possible downtime or infrastructure performance issues. The organization can establish pre-defined policies and corrective measures that are applied automatically well before application assurance and uptime are compromised. As a result, the organization reduces its dependence upon IT to troubleshoot issues and improves its Mean Time to Detect (MTTD) and Mean Time to Repair (MTTR). Predictive analytics algorithms further cut through the noise to ensure that only impactful metric thresholds cause the organization to adapt its business processes when needed.

Application Discovery and Insights

Enterprise networks are typically distributed across regions and contain diverse infrastructure components, often operating in disparate silos. A holistic knowledge of the infrastructure, and application discovery in particular, requires organizations to understand how those components interact and relate to each other, especially since network performance issues can spread across dependencies that are otherwise hidden from view. With predictive analytics solutions in place, organizations can collect data from across the network, analyze multiple data sources and understand how one infrastructure system can impact another. In hybrid IT environments, application and infrastructure discovery is a greater challenge considering the limited visibility and control available to customers of cloud-based services. Any lack of automated correlation between network incidents can limit the ability of organizations to steer cloud operations in real time while responding to potential application and infrastructure performance issues.

Audit, Compliance and Security

Strictly regulated industries are often required to comply with regulations on application uptime, assurance, MTTR, and end-user experience and satisfaction, among other parameters. Compliance becomes increasingly complex when these organizations have limited visibility and control over their IT network. Performing audit activities at scale may require organizations to invest greater resources in IT. Regular business operations may not justify the increased overhead, and organizations may be forced to cut corners in the audit, compliance and security of sensitive data, apps and the IT network. Organizations using advanced artificial intelligence technologies can automate these functions and glean insightful knowledge that translates into regulatory compliance for hybrid cloud IT environments without breaking the bank.

Security is another key enabler of regulatory compliance, and it requires more than automation to accurately identify the root cause of network traffic anomalies. Security infringements in the form of data leaks tend to remain under the radar until unauthorized data transfers or anomalous network behavior are identified, by which point it may be too late for organizations to respond without incurring data loss, non-compliance and, potentially, the loss of their ability to operate in security-sensitive industry segments such as healthcare, defense and finance. In complex cloud infrastructure environments, the role of predictive analytics is to unify the knowledge from diverse, disparate and distributed networks and empower organizations to make better, faster and well-informed decisions.


Posted in Information Technology, News

List of Websites Where You Can Download Free Things

Download free software, movies, music, templates, WordPress themes, Blogger themes and icon packs from these websites. These websites are totally free for any internet user. You can download the material of your choice without paying anything.

List of Websites Where You Can Download Free Things

1. – Download all types of Windows PC software, for all versions

2. – Download any YouTube video in HD

3. – Download a collection of PDF books

4. – Download trending photos from the internet

5. – Download Android APKs directly to your PC

6. – Download free PowerPoint templates and themes

7. – Download free Blogger templates for your Blogger blog

8. – Download free WordPress themes for your WordPress blog

9. – Download the latest music mix tracks

10. – Free vector icons in SVG, PNG, EPS and icon-font formats for Photoshop and Illustrator

11. – Download all types of drivers for your PC

12. – Download movies, music, pictures and software for free

13. Google Takeout – Download all of your Google profile, search and email data

14. Google Fonts – Download Google's free font database

15. Download Rainmeter Skins – Download free skins for the Rainmeter software

16. Wikimedia – Download Wikipedia's data to your computer

17. Github – Download the source code of open source software

18. Hdwallpapers – Download HD wallpapers for your PC and mobile

19. Web Capture – Download a screenshot of any website for free

20. Download a Copy of Facebook Data – Download all of your Facebook data, including your Facebook messages