Posted in Information Technology

Managed File Transfer (MFT) vs Secure File Transfer (SFT) vs Standard File Transfer Protocol (FTP): What’s the Difference?

For enterprise businesses, file transfer protocols get things done. Literally! They help businesses get information from one party to another, both internally and externally. File transfer protocols dictate how we process and send files of all sizes and types.

There are three commonly used protocols: standard file transfer protocol (FTP), Managed File Transfer (MFT) and Secure File Transfer (SFT or SFTP). In the age of big data and data compliance, businesses are faced with the challenge of moving files from point A to point B faster, more efficiently, and safer than ever.

Alas, selecting the right protocols for your business means understanding the nuanced differences between the three options. That’s what we intend to cover in the following article.

What’s the difference between MFT versus SFT versus FTP? Stay tuned to find out!

Managed File Transfer

MFTs are the platforms of the transfer protocols. They offer a foundational, automated, and secure way to approach the movement of data. Here’s what you need to know:

What Is It?

As mentioned above, MFT is a platform. This may make it seem more advanced than other protocols, and arguably it is. It offers administration capabilities coupled with automation and popular security protocols like HTTPS, SFTP, and FTPS. Often, the interface of MFT is designed for transparency and visibility. Generally, it’s a more secure transfer protocol than most others.

MFT offers options for accomplishing company data objectives while remaining compliant. With MFT your company can manage strategies around:

  • Data compliance
  • Operational efficiency
  • Secure management of data

Deploying MFT is considered a proactive data strategy, and as such, it has a number of benefits that will be fleshed out below.

MFT: Benefits and Drawbacks

MFT has far fewer drawbacks than standard FTP. With heightened security measures, oversight capabilities, and a focus on compliance, the benefits of MFT line up with many business goals regarding the transfer of information. With MFT businesses get:

  • Visibility and increased control
  • Industry standardization of protocols
  • Replacement for file transfer processes that are less secure and effective

All in all, MFT is a top pick for businesses who want their big data to be secure and compliant.

How Does it Compare?

MFT is top of the line. It beats secure file transfers in complexity and nuance and crushes the competition when it comes to security. If we had to find some drawbacks to implementing an MFT strategy, its complexity may mean a learning curve for some users. Also, managed file transfer implies that management is required. The visibility and transparency the platform introduces offer no benefit if the processes aren't actually being monitored.

Secure File Transfer

In this section, we’ll examine the network protocol that uses encryption to offer a more secure approach to “put” and “get” functions than standard FTP. Keep reading to learn more about SFT, also sometimes called SFTP:

What Is It?

FTP and SFT are both network protocols for "put" and "get" functions. When it comes to billing data, data recovery files, and other sensitive information that enterprise businesses need to hold and share, SFT offers encryption, whereas FTP does not. SFT was designed for the purpose of securely transmitting data.

To that point, SFT uses the Secure Shell (SSH) network protocol to transfer data across a channel. Data is protected as long as it's moving across the channel. Once it hits a secured server, it's no longer protected. For additional protection at rest, senders would need to ensure encryption occurs in advance of sending.
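As a rough sketch of what an SFTP "put" over SSH looks like in practice, here is a minimal Python function. The host, credentials, and paths are hypothetical placeholders, and it relies on the third-party paramiko library (imported lazily inside the function), which is one common SFTP client but is not mentioned in the article itself:

```python
def sftp_upload(host, username, key_path, local_path, remote_path):
    """Upload a file over SFTP (SSH). All arguments are hypothetical
    placeholders; requires the third-party 'paramiko' library."""
    import paramiko  # imported lazily so the sketch can be defined without the dependency
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username=username, key_filename=key_path)
    try:
        sftp = client.open_sftp()
        sftp.put(local_path, remote_path)  # the SFTP "put" function
        sftp.close()
    finally:
        client.close()
```

Note that SFTP only protects the data in transit; as the paragraph above says, protecting the file at rest on the destination server would require encrypting it before calling `put`.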

SFT: Benefits and Drawbacks

The main benefit of Secure File Transfer is that transfers are encrypted during the sending process, whereas regular FTP has no such protection. It's still second to an MFT platform, but SFT could be a less expensive alternative, depending on how much impact data transfer has on your business.

How Does it Compare?

Think of SFT as an evolution of FTP. FTP is generally unsecured communication between two parties, and therefore, it's easily hackable. While transferring files over the internet was a novel concept for its time, what FTP lacked was security. As it evolved, FTP became secondary to FTPS and the most recent iteration, SFT, which differs from FTPS in that it runs over SSH, meaning that the client must be authenticated by the server.

So, how does it all compare? At a very basic level, you can think of SFT as a more secure, advanced version of FTP. Any business moving secure files should at least use SFT; however, MFT offers a high visibility platform with more data options, so, overall, it’s the best choice.

File Transfer Protocol

If you’ve made it this far, you probably have a general understanding of FTP. FTP was developed in the 1970s as an efficient way to transfer data between parties across a server channel. Ultimately, it became a process in which the internet is used to move data from point A to point B.

An average internet user may never have a need for FTP, but it’s essential in the world of software and web development. Development is at the core of many agile business models, and therefore FTP has become an important tool for enterprise businesses.

What Is It?

In short, FTP is a nuts and bolts, basic network protocol that offers “put” and “get” functions across an FTP server. Because it’s easy to set up and use, it has been an important tool for the enterprise business. However, when it comes to securing data, businesses must tread lightly as FTP is not known as a secure process, although some exceptions exist.
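To make the "put" and "get" functions concrete, here is a minimal sketch using Python's standard-library ftplib. The server host, credentials, and file names are hypothetical; note that, as the article stresses, everything here (including the password) travels unencrypted:

```python
from ftplib import FTP

def ftp_put(host, user, password, local_path, remote_name):
    """Upload ("put") a file to a plain FTP server. Credentials are
    hypothetical and travel unencrypted over the wire."""
    with FTP(host) as ftp:
        ftp.login(user, password)
        with open(local_path, "rb") as fh:
            ftp.storbinary(f"STOR {remote_name}", fh)

def ftp_get(host, user, password, remote_name, local_path):
    """Download ("get") a file from a plain FTP server."""
    with FTP(host) as ftp:
        ftp.login(user, password)
        with open(local_path, "wb") as fh:
            ftp.retrbinary(f"RETR {remote_name}", fh.write)
```

The simplicity of this API is exactly why FTP became so widespread, and the plaintext login is exactly why it falls short on security.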

FTP: Benefits and Drawbacks

The benefit of FTP is that it has been a significant protocol for enterprise businesses. And on a very basic level, it's a fast and efficient way to move data. But in a business community driven by big data, which often includes customer information, passwords, and other sensitive items, FTP doesn't cut it from a security perspective when there's so much at stake and there are much better options for security and encryption.

Drawbacks of using FTP include:

  • Less secure than other types of network protocols: vulnerable to bounce attacks, brute force attacks, and packet capture
  • No easy way to ensure compliance: unlike other protocols discussed, the lack of security could make it more difficult for FTP users with big data to be compliant.
  • Lacks visibility: FTP and SFT both lack the visibility of MFT.

At the end of the day, FTP is an important tool but not exceptional as it lacks the security of other options.

How Does it Compare?

FTP doesn't have as much to offer in terms of security and compliance as SFT or MFT. However, it's still used frequently in the enterprise world. It's a common and inexpensive option for transferring non-secure information. For instance, marketing agencies will frequently use it to exchange proofs with clients. There will likely always be a reason to use FTP, but for businesses storing and exchanging big data, a more secure option is necessary.




Blockchain Battle: Ethereum vs Cosmos vs Cardano vs EOS vs Hyperledger

So, why have we decided to focus on these 5? We feel that this group gives a healthy mixture of usability and functionality. Yes, we know that some of these projects are not exactly live, but we still feel that the potential of the projects is enough to warrant a place on our list. We are going to go through each and every platform and then compare them at the end.


Ethereum

Token: ETH

Ethereum is, without a doubt, the big daddy of smart contract platforms. The main man behind Ethereum is Vitalik Buterin. Buterin was fascinated with Bitcoin, but he realized that the blockchain technology had far more use than being a mere facilitator of a payment protocol. He realized that one can use the blockchain technology to create decentralized applications. That was when he was inspired to create Ethereum.

Ethereum, like Bitcoin, is a cryptocurrency; however, that's where the similarity ends. While Bitcoin is a "first-generation" blockchain, Ethereum broke the mold by becoming the first ever second-generation blockchain. Ethereum revolutionized the crypto-space by bringing smart contracts onto the blockchain.

Smart contracts were first conceptualized by Nick Szabo. The idea is simple: have a set of self-executing instructions between two parties which don't need to be supervised or enforced by a third party. The idea seems pretty straightforward, right? However, smart contracts enabled Ethereum to create an environment wherein developers from around the world could create their own decentralized applications, aka Dapps.

Dapps and Smart Contracts

Dapp creation is one of the most important features of Ethereum.  Along with being decentralized, there are certain other features that a Dapp must have:

  • The source code of the Dapp should be open to all
  • The application must have some sort of tokens to fuel itself
  • The App must be able to generate its own tokens and have an inbuilt consensus mechanism

Sounds pretty awesome, right? So, how exactly can you build them? You need to code smart contracts using Solidity.

Developers use a programming language called Solidity which is a purposefully slimmed down, loosely-typed language with a syntax very similar to ECMAScript (Javascript).

Along with creating the smart contract, you must have an environment where you can execute it. However, there are some properties that this execution environment must have. These properties are:

  • Deterministic.
  • Terminable.
  • Isolated.

Property #1: Deterministic

A program is deterministic if it gives the same output to a given input every single time. Eg. If 3+1 = 4 then 3+1 will ALWAYS be 4 (assuming the same base). So when a program gives the same output to the same set of inputs in different computers, the program is called deterministic. The environment must make sure that execution of the smart contract is always deterministic.

Property #2: Terminable

In mathematical logic, we have a problem called the "halting problem". Basically, it states that there is no way to know whether or not a given program can finish executing within a time limit. In 1936, Alan Turing deduced, using a diagonalization argument, that there is no general way to know whether a given program can finish in a time limit or not.

This is obviously a problem with smart contracts because, contracts by definition, must be capable of termination in a given time limit. So the environment must be able to halt the operation of the smart contract.

Property #3: Isolated

In a blockchain, anyone and everyone can upload a smart contract. However, because of this, the contracts may, knowingly or unknowingly, contain viruses and bugs.


If the contract is not isolated, this may hamper the whole system. Hence, it is critical for a contract to be kept isolated in a sandbox to save the entire environment from any negative effects.

Ethereum executes its smart contracts using a virtual machine called Ethereum Virtual Machine (EVM).

The next core Ethereum concept that one must understand is gas.

What is Ethereum Gas?

Remember the "Terminable" property of smart contract environments? Well, Ethereum smart contracts achieve this property by utilizing gas. Each and every line that is coded in the smart contract requires a certain amount of gas to execute. So, when a developer submits a smart contract for execution, they also specify the maximum gas limit.

Think of the gas limit as the fuel you fill up in your car before going for a drive, the moment the fuel runs out, the car stops working. Each and every line in the smart contract requires a certain amount of gas to execute. Once the gas runs out, the smart contract stops executing.
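The metering described above can be illustrated with a toy simulation. The operation names and gas costs below are made up for illustration, not real EVM opcodes or prices:

```python
def run_with_gas(ops, gas_limit):
    """Execute a list of (operation, gas_cost) pairs until gas runs out.
    Returns (executed_ops, gas_used, halted_early). Costs are hypothetical."""
    gas_used, executed = 0, []
    for name, cost in ops:
        if gas_used + cost > gas_limit:
            return executed, gas_used, True  # out of gas: execution halts here
        gas_used += cost
        executed.append(name)
    return executed, gas_used, False

ops = [("load", 3), ("add", 3), ("store", 5), ("loop", 8)]
print(run_with_gas(ops, 12))  # halts before "loop": (['load', 'add', 'store'], 11, True)
```

This is exactly how gas makes contracts terminable: even an infinite loop must eventually exhaust the gas limit and stop.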

Ethereum and ICOs

We have covered this topic at length before, so we will just go over this very briefly. One of the most alluring features of Ethereum is the initial coin offering, or ICO. Developers around the world can use Ethereum's virtual machine to power their smart contracts and use the platform to raise lots of money in a crowdsale with relative ease. Because of this very feature, Ethereum's adoption has gone through the roof.

Ethereum Mining

Ethereum as of right now is using the Proof-of-Work mining, i.e. the same mining process used by Bitcoin. Basically, miners compete to find the next block in the chain by using their processing power to solve complex cryptographic puzzles.

Ethereum is eventually going to move on to Proof-of-Stake by utilizing the Casper protocol. POS is far more environmentally friendly than POW and is a lot more scalable.

Main Problems

There is no doubt of the impact that Ethereum has had on the crypto-space; however, there are some major problems surrounding its performance. As of right now, Ethereum fails when it comes to scalability. It can only manage about 25 transactions per second, which is not ideal for Dapps that want mainstream adoption. On top of that, Ethereum can be expensive for developers. The gas prices for the execution of Dapps can go through the roof.

Along with these, there is one more problem that affects Ethereum and other cryptocurrencies. This problem is interoperability. As of right now, if Alice owns Bitcoin and Bob owns Ethereum, then there is no easy and direct way for the two to interact with each other. This is a really big issue because in the future, there may be thousands of blockchains running in parallel and there should be a way for them to interact seamlessly with each other.


One project that is aiming to solve this interoperability problem is Cosmos.


Cosmos

Token: ATOM

Cosmos aims to become an “internet of blockchains” which is going to solve these problems once and for all. Cosmos’s architecture consists of several independent blockchains called “Zones” attached to a central blockchain called “Hub”.

[Diagram: Cosmos Hub and Zones. Image credit: Cosmos video]

According to the Cosmos whitepaper, “The zones are powered by Tendermint Core, which provides a high-performance, consistent, secure PBFT-like consensus engine, where strict fork-accountability guarantees hold over the behavior of malicious actors. Tendermint Core’s BFT consensus algorithm is well suited for scaling public proof-of-stake blockchains.”

The brains behind this project are CEO Jae Kwon and CTO Ethan Buchman and the Interchain Foundation team.

What is Tendermint?

Tendermint is a variant of PBFT, i.e. Practical Byzantine Fault Tolerance. A Byzantine Fault Tolerant (BFT) system is a system which has successfully answered the Byzantine Generals Problem. We have covered the Byzantine Generals Problem in detail here. To keep things short: for a decentralized peer-to-peer system to function in a trustless manner, it is imperative that it solve the Byzantine Generals Problem.

As the Cosmos whitepaper states:

"Tendermint provides exceptional performance. In benchmarks of 64 nodes distributed across 7 data centers on 5 continents, on commodity cloud instances, Tendermint consensus can process thousands of transactions per second, with commit latencies on the order of one to two seconds. Notably, the performance of well over a thousand transactions per second is maintained even in harsh adversarial conditions, with validators crashing or broadcasting maliciously crafted votes."

The graph below supports the claim made above:

[Graph: Tendermint benchmark throughput. Image credit: Cosmos whitepaper]

Benefits of Tendermint

  • Tendermint can handle transaction volume at the rate of 10,000 transactions per second for 250-byte transactions.


  • Better and simpler light-client security, which makes it ideal for mobile and IoT use cases. In contrast, Bitcoin light clients require a lot more work and have many more demands, which makes them impractical for certain use cases.


  • Tendermint has fork-accountability, which stops attacks such as long-range nothing-at-stake double spends and censorship.


  • Tendermint is implemented via Tendermint core which is an “application-agnostic consensus engine.” It can basically turn any deterministic blackbox application into a distributedly replicated blockchain.


  • Tendermint Core connects to blockchain applications via the Application Blockchain Interface (ABCI).


Inter-Blockchain Communication

As we have mentioned before, Cosmos’s architecture will follow the Hub and Zones method. There will be multiple parallel blockchains connected to one central Hub blockchain. Think of the Sun and the solar system.

The Cosmos hub is a distributed ledger where individual users or the Zones themselves can hold their tokens. The zones can interact with each other through the Hub using IBC or Inter Blockchain Communication.

[Diagram: two Zones communicating via IBC through the Cosmos Hub]

See the diagram above?

This is a very simplified version of how two Zones communicate with each other via IBC.

Cosmos Use Cases

The interoperability achieved by Cosmos has some extremely interesting use-cases:

  • DEX: Since Cosmos is linking so many blockchains with each other, it goes without saying that it can easily enable different ecosystems to interact with one another. This is a perfect setting for a decentralized exchange.


  • Cross chain transactions: Similarly, one zone can avail the services of another zone through the Cosmos hub.


  • Ethereum Scaling: This is one of the more interesting use cases. Any EVM-based zone which is connected to the Cosmos hub will, as per the architecture, be powered by the Tendermint consensus system as well. This will enable these zones to scale up faster.


Cardano

Token: ADA

The brainchild of Ethereum co-founder Charles Hoskinson, Cardano is a smart contract platform; however, Cardano offers scalability and security through a layered architecture. Cardano's approach is unique in the space, since it is built on scientific philosophy and peer-reviewed academic research.

Cardano is a third-generation blockchain which is focused on bringing scalability and interoperability to the blockchain space. There are three organizations which work full time to develop and take care of Cardano:

  • The Cardano Foundation.
  • IOHK.
  • Emurgo.

These three organizations work in synergy to make sure that Cardano development is going on at a good pace.

Functional Programming

There is one really interesting quality that makes Cardano unique compared to the other smart contract platforms. The majority of other smart contract platforms are coded in imperative programming languages. Cardano uses Haskell for its source code, which is a functional programming language. For its smart contracts, Cardano uses Plutus, which is also a functional language.

Let us explain the difference between the two types of languages in a straightforward way.

In imperative languages, addition works like this:

int a = 5;

int b = 3;

int c;

c = a + b;

As you can see, it takes a lot of steps. Now, how will that work in a functional language?

Suppose there is a function f(x) that we want to use to calculate a function g(x), and then we want to use that to work with a function h(x). Instead of solving all of those in a sequence, we can simply club all of them together in a single function like this: h(g(f(x))).

This makes the functional approach easier to reason about mathematically.

Functional languages help with scalability, and they also help in making programs far more precise.
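The composition idea described above — feeding f's output into g and then into h — can be sketched in Python (used here purely for illustration; Cardano itself uses Haskell and Plutus):

```python
from functools import reduce

def compose(*funcs):
    """Compose functions right-to-left: compose(h, g, f)(x) == h(g(f(x)))."""
    return lambda x: reduce(lambda acc, fn: fn(acc), reversed(funcs), x)

# Toy functions standing in for f, g, and h
f = lambda x: x + 1
g = lambda x: x * 2
h = lambda x: x - 3

combined = compose(h, g, f)
print(combined(5))  # h(g(f(5))) = ((5 + 1) * 2) - 3 = 9
```

Because `combined` is just a pure mapping from input to output, its behavior can be reasoned about (and tested) as a single mathematical function, which is the property the paragraph above is pointing at.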


Ouroboros

Cardano uses a new proof-of-stake algorithm called Ouroboros, which determines how individual nodes reach consensus about the network. The protocol has been designed by a team led by IOHK Chief Scientist, Professor Aggelos Kiayias.

Ouroboros is the first proof of stake protocol that has mathematically been shown to be provably secure, and the first to have gone through peer review through its acceptance to Crypto 2017, the leading cryptography conference.


Sidechains

The way Cardano plans to execute interoperability is by implementing sidechains.

Sidechain as a concept has been in the crypto circles for quite some time now. The idea is very straightforward; you have a parallel chain which runs along with the main chain. The side chain will be attached to the main chain via a two-way peg.

Cardano will support sidechains based on the research by Kiayias, Miller, and Zindros (KMZ) involving “non-interactive proofs of proofs of work”.

According to Hoskinson, the idea of sidechains comes from two things:

  • Getting a compressed version of a blockchain.
  • Creating interoperability between chains.


EOS

Token: EOS

EOS is aiming to become a decentralized operating system which can support industrial-scale decentralized applications. The driving force behind EOS is Dan Larimer (the creator of BitShares and Steemit) and Block.One. EOS recently came into the spotlight for their year-long ICO which raised a record-breaking $4 billion.

That sounds pretty amazing but what has really captured the public’s imagination is the following two claims:

  • They are claiming to have the ability to conduct millions of transactions per second.
  • They are planning to completely remove transaction fees.

Scalability Through DPOS

EOS achieves its scalability via the utilization of the delegated proof-of-stake (DPOS) consensus mechanism, which is a variation of the traditional proof-of-stake. It can theoretically do millions of transactions per second.

So, how is DPOS different from traditional POS? While in POS the entire network will have to take care of the consensus, in DPOS all the EOS holders will elect 21 block producers who will be in charge of taking care of the consensus and general network health. Anyone can participate in the block producer election and they will be given an opportunity to produce blocks proportional to the total votes they receive relative to all other producers.

The DPOS system rarely experiences forks because, instead of competing to find blocks, the producers co-operate. In the event of a fork, the consensus switches automatically to the longest chain.

As you can imagine, the importance of these block producers can't be overstated. Not only do they take care of consensus, but they take care of overall network health as well. This is why it is extremely important that each and every vote cast carries proper weightage.

This is why Larimer introduced the idea of Voter Decay, which reduces the weightage of old votes over time. The only way to maintain the strength of a vote is by voting regularly.

The Voter Decay mechanism leads to two great advantages:

  • Firstly, as we have seen time and again, elected officials may become corrupt and change their tune after getting elected. The vote decay system gives the voters a chance to reconsider their vote every week. This keeps the block producers accountable and on their toes.


  • Secondly, people simply change over time. Maybe the political beliefs and ideologies that someone has today are completely different from what they had a year ago. The vote decay system will allow people to vote for someone who is more congruent with their newly evolved ideologies.

This has the potential to be a truly revolutionary concept and can change decentralized voting (maybe even voting) forever.
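As a toy illustration of vote decay, here is one possible weighting curve. The exponential form and the 10%-per-week rate are my own assumptions for the sketch, not EOS's actual formula:

```python
def vote_weight(weeks_since_vote, weekly_decay=0.9):
    """Hypothetical voter-decay curve: a vote keeps 90% of its remaining
    weight each week unless it is re-cast (the rate is an assumption)."""
    return weekly_decay ** weeks_since_vote

# A fresh vote counts fully; a stale one fades toward zero
print(vote_weight(0), vote_weight(1), vote_weight(10))
```

The key property is simply that weight is monotonically decreasing in age, so only voters who keep re-casting their votes retain full influence over the 21 block producers.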

Removal of Transaction Fees

EOS works on an ownership model where users own and are entitled to use resources proportional to their stake, rather than having to pay for every transaction. So, in essence, if you hold N tokens of EOS then you are entitled to N*k transactions. This, in essence, eliminates transaction fees.

On staking EOS tokens you get certain computational resources in exchange. You will get:

  • RAM
  • Network Bandwidth
  • Computational Bandwidth.

EOS tokens, along with payment coins, can also be used as a toll to get all these resources.
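The "hold N tokens, get N*k transactions" ownership model sketched above amounts to a simple proportionality rule. All the numbers below are hypothetical:

```python
def entitled_transactions(stake, total_supply, network_capacity):
    """If you hold `stake` out of `total_supply` tokens, you are entitled to
    the same fraction of the network's transaction capacity per period.
    All quantities here are hypothetical illustrations."""
    return stake / total_supply * network_capacity

# Holding 1% of the supply entitles you to 1% of capacity
print(entitled_transactions(10_000, 1_000_000, 500_000))  # 5000.0
```

This is why staking replaces per-transaction fees: your cost is the opportunity cost of locking up tokens, not a payment per transfer.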



Hyperledger

Finally, we have Hyperledger.

Hyperledger, to be very frank, is extremely different from all the platforms that we have talked about so far. While Ethereum, Cardano, and EOS are proper cryptocurrencies and have their own blockchains, Hyperledger is not a cryptocurrency, nor does it have its own blockchain. Hyperledger is an open-source project by the Linux Foundation. On their website, Hyperledger describes itself as

“an open source collaborative effort created to advance cross-industry blockchain technologies. It is a global collaboration, hosted by The Linux Foundation, including leaders in finance, banking, Internet of Things, supply chains, manufacturing, and Technology.”


The Need For Permissioned Blockchain

Platforms like Ethereum, EOS etc. are all public blockchains, meaning, anyone can choose to join the network. However, for big enterprises who need their own blockchain infrastructure, this is highly undesirable.

Think of a blockchain conglomerate of banks.

Banks need to deal with sensitive data every single day. From their internal transactional records to KYC data, there are lots of items which they simply can’t reveal to the public. Plus, only banks that have been vetted by the other banks present in the network should be allowed inside the network.

Also, as we have already covered before, public blockchains are slow and have performance issues, which is again a big no-no for large-scale companies.

Hyperledger allows these companies to create their own high-performance permissioned blockchain (aka blockchains where each and every node must be vetted properly before entering).

Interesting Projects Under Hyperledger

Maybe the most interesting project in the Hyperledger family is IBM's Fabric. Rather than a single blockchain, Fabric is a base for the development of blockchain-based solutions with a modular architecture.

With Fabric different components of Blockchains, like consensus and membership services can become plug-and-play. Fabric is designed to provide a framework with which enterprises can put together their own, individual blockchain network that can quickly scale to more than 1,000 transactions per second.

Along with Fabric you also have:

  • Sawtooth: Developed by Intel; uses the Proof-of-Elapsed-Time consensus mechanism.
  • Iroha: An easy-to-use blockchain framework developed by a couple of Japanese companies.
  • Burrow: A permissioned smart contract machine built along the specification of Ethereum.

Different Blockchains: Comparing all the Platforms

Alright, so now that we have somewhat familiarized ourselves with these platforms, let’s compare all of them.

[Comparison table: Ethereum vs Cosmos vs Cardano vs EOS vs Hyperledger]


Using Machine Learning to Predict Car Accident Risk

This article reflects some of the work we are doing at Esri to define spatially focused artificial intelligence and machine learning; however, the opinions in this article are my own, and not necessarily the opinions of my employer. This article is intended to be a simple technical introduction to one application of geo-spatial machine learning, not a fully mature solution. There's a lot of excitement and some great, truly innovative work going on at Esri. I'm very excited to be a part of it!

Car accidents are a huge problem in our world. Nearly 1.3 million people die internationally every year from car accidents and in addition up to 50 million people are injured (ASIRT). Can machine learning help save lives? I believe the answer is yes, and this article details one possible approach.

End result: an accident risk heat map

Many governments collect accident records and make these data publicly available. In addition, numerous road infrastructure data sets are available. We will make use of publicly available road infrastructure data and weather feeds to attempt to predict accident risk per road segment, per hour, within the state of Utah using supervised machine learning.

(Disclaimer: I’m using a slightly different accident data set that goes back to 2010, but isn’t available online)

Our Approach

We pose car accident risk prediction as a classification problem with two labels (accident and no accident). It could equally be posed as a regression problem (number of accidents), but on our timescale (one hour) we don't expect to see more than one accident per road segment, so this simplifies the problem a bit. There are of course other approaches, but this is the one we take here. Commonly, traffic is modeled by a Poisson or negative binomial model. Choosing a small road segment and time interval allows us to treat each observation as a Bernoulli random variable (and thus use a cross-entropy loss function as the objective).
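The objective implied by the Bernoulli framing is the average binary cross-entropy. A minimal sketch (the example labels and predicted probabilities are made up):

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average Bernoulli log-loss for labels 1 = accident, 0 = no accident.
    p_pred holds the model's predicted accident probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

# A confident wrong prediction is punished much harder than an unsure one
print(binary_cross_entropy([1], [0.5]), binary_cross_entropy([1], [0.01]))
```

Minimizing this loss pushes the model's output toward a calibrated per-segment, per-hour accident probability, which is exactly the "risk" we want to map.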

We can use the seven years and roughly half a million car accident records as our positive examples. You might then ask, you have positive labels, where are your negative labels? Great question! Every single road segment/hour combination is a possible negative example. Over 7 years and 400,000 discrete road segments, this amounts to roughly 24.5 BILLION potential negative examples.

Machine learning practitioners will notice an issue here, namely, class imbalance. SEVERE class imbalance. Essentially, if we were to use all of this data to train a model, our model would be heavily biased towards no accidents. This is a problem if we want to estimate accident risk.

To counter this problem, we don't use all 24.5B negative examples; instead, we take a sampling approach that will be detailed later in this article.
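A minimal sketch of such down-sampling is below. The 3:1 negative-to-positive ratio, the fixed seed, and the (segment, hour) key format are illustrative assumptions, not the article's actual choices:

```python
import random

def sample_negatives(negative_keys, n_positives, ratio=3, seed=42):
    """Down-sample a huge pool of negative (segment, hour) keys to
    `ratio` negatives per positive example. Ratio and seed are
    illustrative assumptions."""
    rng = random.Random(seed)
    k = min(len(negative_keys), n_positives * ratio)
    return rng.sample(negative_keys, k)

# e.g. billions of potential (segment, hour) negatives, cut down to 3x the positives
negatives = [(segment, hour) for segment in range(100) for hour in range(24)]
balanced = sample_negatives(negatives, n_positives=50)
print(len(balanced))  # 150
```

One caveat worth noting: after training on a sample like this, predicted probabilities are biased relative to the true base rate and need recalibration if you want absolute (rather than relative) risk.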

Data Exploration

What do nearly half a million accidents look like?

That’s not a picture of the roads: this is just a heatmap of car accidents

Anyone who commutes definitely understands the impact of time on car accidents. We can visualize what this looks like for 7 years of accidents on the following chart:

This follows our intuition: accidents are occurring mostly on weekday afternoons during peak rush hour. Another observation by looking at the vertical cross section is that accidents tend to peak in the December/January time-frame. Utah has frequent heavy snow and ice during this time so this is certainly not unexpected. This highlights the importance of good weather data as an input to this model. Utah sees about 15 accidents on an average day at the peak of rush hour.
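The weekday/hour aggregation behind a chart like that can be sketched with the standard library. The timestamps below are made-up examples, not the real Utah data:

```python
from collections import Counter
from datetime import datetime

def hour_day_histogram(timestamps):
    """Count accidents per (weekday, hour-of-day) bucket from
    ISO-8601 timestamp strings (examples are hypothetical)."""
    counts = Counter()
    for ts in timestamps:
        dt = datetime.fromisoformat(ts)
        counts[(dt.strftime("%a"), dt.hour)] += 1
    return counts

accidents = ["2016-01-04T17:30:00", "2016-01-04T17:55:00", "2016-01-09T02:10:00"]
print(hour_day_histogram(accidents))
```

Plotting these bucket counts as a weekday-by-hour grid yields exactly the rush-hour hot spots described above.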

Exploration of some of the data using ArcGIS Pro

The inputs?

Now that we know what we want to predict, what are the inputs? What could cause a car accident? The answer, of course, is numerous factors, some of which we include in this analysis.

Left: Accidents typically cluster around intersections, particularly signalized ones. Right: Accidents happening more frequently in curvy road segments
  • Weather (temperature, wind speed, visibility, rainy/snowy/icy, snow depth, rain depth, etc)
  • Time features: hour of day, day of the week, month of the year, solar azimuth/elevation
  • Static features such as speed limit, road curvature, average traffic volumes, proximity to intersections, road north/south/east/west alignment, road width, road surface type, etc
  • Human factors such as population density and distractions such as billboards
  • Graph derived features of the road network such as centrality and flow
  • Countless others

This is where the geospatial portion of this analysis becomes important. This is inherently a spatial problem and our machine learning model needs to take into account many different geospatial data sources and their relationship with each other. This will include many geoprocessing operations, which can be computationally expensive. For this purpose, I’m using the ArcGIS platform.

(Disclaimer: I’m an Esri employee, but I came from an open source geo background. You can definitely perform most of this analysis without using ArcGIS, but it will be more challenging. If you aren’t an ArcGIS user, I still highly recommend taking a look at the ArcGIS API for Python, if only for the data wrangling capabilities as much of this data is available from various ArcGIS based services. The developers of this API have taken care to fallback on many open source utilities, such as shapely when ArcGIS isn’t available. A spatial database such as PostgreSQL with PostGIS will go a long way.)

There are really two distinct parts of the inputs: the static features and the dynamic features.

The static features are the parts of the input data that, for the most part, do not change with time. This includes features derived from the road geometry, such as curvature, as well as properties like speed limit or population density. Of course, these aren’t truly static per se, but they change slowly enough that we can treat them as constant for all intents and purposes.

The dynamic features change depending on when we are making the prediction. These are the weather feeds, solar geometry, and time variables (hour, month, day, etc).

We need to compute all of these features for each road segment, of which we have around 400,000. We scripted this process using the Arcpy Python library included with ArcGIS Pro. Let’s take a quick look at an example:

billboards_url = ''
# Calc proximity to billboard
_ = arcpy.analysis.Near('centerlines_merged', billboards_url)
_ = arcpy.management.CalculateField('centerlines_merged', 'proximity_to_billboard', '!NEAR_DIST!')
_ = arcpy.management.DeleteField('centerlines_merged', ['NEAR_DIST', 'NEAR_FID'])
The above code snippet used the “Near” tool to find the proximity to the nearest billboard for every road in our centerlines data. We computed numerous proximity based features to form our static feature dataset.

Proximity to billboards. This feature captures some possible driver distractions, but also represents heavily traveled areas because advertisements are intended to be seen by a large audience.

We also had to derive features from the road geometries themselves. For example, to estimate road curvature, we used “sinuosity” as the metric: the ratio between the path length and the shortest distance between the endpoints. Again, using Arcpy, we calculated this:

# Calc Sinuosity
code_block = """
import math
def getSinuosity(shp):
    x0 = shp.firstPoint.x
    y0 = shp.firstPoint.y
    x1 = shp.lastPoint.x
    y1 = shp.lastPoint.y
    euclid = math.sqrt((x0-x1)**2 + (y0-y1)**2)
    length = shp.length
    if euclid > 0:
        return length/euclid
    return 1.0
"""
_ = arcpy.management.CalculateField('centerlines_merged', 'sinuosity', 'getSinuosity(!Shape!)', code_block=code_block)
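For readers outside of ArcGIS, the same metric is easy to sanity-check with plain Python. Here's a minimal sketch for a polyline given as a list of coordinate pairs (the function name and input format are illustrative, not part of the original pipeline):

```python
import math

def sinuosity(points):
    """Ratio of path length to straight-line distance between endpoints."""
    length = sum(math.dist(points[i], points[i + 1])
                 for i in range(len(points) - 1))
    euclid = math.dist(points[0], points[-1])
    # A closed or degenerate segment gets a sinuosity of 1.0, as above
    return length / euclid if euclid > 0 else 1.0
```

A perfectly straight segment has sinuosity 1.0, and the value grows as the road winds: an L-shaped segment of two unit-length legs has sinuosity 2/√2 ≈ 1.414.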

Let’s talk about weather now. There are many different weather feeds but we chose to use a reliable hourly weather source from NOAA. We have a handful of weather stations, but we need to know the weather at each road segment. One approach is to interpolate the weather at the surface stations to the individual road segments. To do this, we can use a technique known as “kriging” in the geostatistics community or Gaussian process regression in the machine learning community. ArcGIS has a built in “empirical Bayesian kriging” tool in the geostatistics toolbox that contains a robust implementation of this technique that uses an empirical prior distribution based on the data and removes a lot of the parameter twiddling. If this isn’t an option for you, there are other techniques such as inverse distance weighting or simple spatial joins (I did this initially for simplicity’s sake). If you have other data to estimate more precisely how geography affects weather features (such as elevation or more sophisticated climate models), you could load those into a geographically weighted regression model to gain even more accuracy.
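If the ArcGIS kriging tool isn't an option, scikit-learn's GaussianProcessRegressor can serve as a stand-in, since kriging and Gaussian process regression are essentially the same technique. A minimal sketch with hypothetical station coordinates and temperatures (all values are made up for illustration):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# Hypothetical weather stations: (x, y) in km, with observed temperature in C
stations = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
temps = np.array([-1.0, 0.5, -2.0, 1.0])

# RBF kernel: nearby stations are assumed to have correlated temperatures
kernel = ConstantKernel(1.0) * RBF(length_scale=5.0)
gp = GaussianProcessRegressor(kernel=kernel, alpha=0.1, normalize_y=True)
gp.fit(stations, temps)

# Interpolate temperature (and its uncertainty) at road-segment midpoints
midpoints = np.array([[5.0, 5.0], [2.0, 8.0]])
pred, std = gp.predict(midpoints, return_std=True)
```

Unlike inverse distance weighting, this also yields a standard deviation per prediction, which is useful for knowing where the interpolation is least trustworthy.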

Temperature interpolated between weather stations

In summary, there are numerous spatial operations performed to construct a useful feature set for accident prediction. These features will then be used to create a training set for the supervised machine learning model.

Building the training set

With the geospatial processing completed, we can turn our attention to actually building the training set. For this we use the ArcGIS Python API, pandas, and a few other miscellaneous Python libraries. Most of this is standard data wrangling, but a critical piece of any of this working is the creation of negative samples; that is, what are counter examples to when accidents occurred.

One approach is to randomly sample roads/times when accidents didn’t occur, but this has some shortcomings. There are many roads and times where accidents simply don’t occur often; the much more important problem is differentiating accident vs. no accident on roads where accidents happen frequently. What is it that causes accidents?

We chose to use a sampling approach that builds up a set of negative examples that are very similar to our positive examples so that the machine learning model can learn to find the fine differences between when there is and isn’t an accident. Of course, there’s an element of randomness to it as well so we also sample situations that are very different. The approach is as follows:

  1. Randomly select an accident record from positive examples
  2. Randomly alter: the road segment, the hour of the day, or the day of the year.
  3. If the new sample isn’t among the accident records, add it to the list of negative samples
  4. Repeat until we have a large number of negative samples (a few times the number of positive samples)
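The steps above can be sketched in code, representing each example as a (road_id, day_of_year, hour) tuple (the function name and field choices are illustrative, not the original implementation):

```python
import random

def sample_negatives(accidents, roads, n_ratio=3, seed=42):
    """Build negatives by perturbing one attribute of a real accident."""
    rng = random.Random(seed)
    accident_set = set(accidents)
    negatives = set()
    while len(negatives) < n_ratio * len(accidents):
        road, day, hour = rng.choice(accidents)      # 1. pick a positive
        attr = rng.randrange(3)                      # 2. alter one attribute
        if attr == 0:
            road = rng.choice(roads)
        elif attr == 1:
            hour = rng.randrange(24)
        else:
            day = rng.randrange(1, 366)
        candidate = (road, day, hour)
        if candidate not in accident_set:            # 3. keep only non-accidents
            negatives.add(candidate)                 # 4. repeat until enough
    return sorted(negatives)
```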

This gives us a training set that is challenging to work with because it’s very hard to tell the positive and negative examples apart. That’s fine: this is a difficult problem, and we’re not concerned with making our numbers look good; we care about practical results. This is a commonly used approach for situations such as this.

Categorical variables such as hour, weekday, and month are one-hot encoded. All continuous variables are transformed to z-scores using the scikit-learn StandardScaler. We also apply a logarithmic transform to sinuosity, since most values are near 1 and we want to pick up on the small differences more than the large ones (long, winding roads).
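That preprocessing can be sketched with pandas and scikit-learn. The columns here are toy stand-ins, not the real feature set:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy rows standing in for the real training set
df = pd.DataFrame({
    'hour': [7, 17, 23],
    'weekday': [0, 4, 6],
    'temperature': [-2.0, 10.0, 0.5],
    'sinuosity': [1.001, 1.2, 3.5],
})

# One-hot encode the categorical time features
df = pd.get_dummies(df, columns=['hour', 'weekday'])

# Log-transform sinuosity: most values sit near 1, so stretch that region
df['sinuosity'] = np.log(df['sinuosity'])

# Z-score the continuous variables
continuous = ['temperature', 'sinuosity']
df[continuous] = StandardScaler().fit_transform(df[continuous])
```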

The Model

My go-to machine learning approach is gradient boosting, particularly with the XGBoost library. The approach builds on a very intuitive machine learning concept called a decision tree. Decision trees work by finding splits on different features that separate the labels. Unfortunately, decision trees tend to over-fit to a training set, meaning they don’t generalize to new data, which is crucial for a predictive model. Gradient boosting works by combining the results from many different decision trees and is extremely fast and powerful. It often outperforms other approaches across many different problems and should be in any data scientist’s toolbox, and XGBoost is a particularly good implementation. We also trained other models, including a deep neural network, but found that not only did gradient boosting give the best overall performance (ROC AUC), it gave us more insight into why decisions were made. It should be noted that the deep neural network we built achieved higher recall at equal precision, but the curve was steep and its ROC AUC was slightly smaller.

I won’t go into detail on the hyperparameter optimization and training, but here are the final model choices:

params = {
    'min_child_weight': 5.0,
    'reg_lambda': 1.0,
    # (remaining hyperparameters omitted)
}

We trained until convergence with 10 rounds of early stopping, achieving a final ROC AUC of around 0.828 on a holdout set. The final model had 80 trees.

ROC Curve and Precision-Recall Curve

With a threshold of 0.19, we get the following performance characteristics:

Test Accuracy: 0.685907494583
Test F1: 0.461524665975
Test Precision: 0.311445366528
Test Recall: 0.890767937497
Test AUC: 0.828257459986
Test AP: 0.388845428164
Train Accuracy: 0.68895115694
Train F1: 0.466528546103
Train Precision: 0.314947399182
Train Recall: 0.899402111551
Train AUC: 0.836489144112
Train AP: 0.410456610829

As I said earlier, we intentionally made the training set difficult to separate because we wanted the results to reflect the true performance of the model. The recall of 0.89 means we are able to predict nearly 90% of car accidents, and the precision of 0.31 means those predictions are correct about 30% of the time. It’s not perfect, but it’s a great start and definitely tells us something about our ability to predict accidents. There are numerous things we can do to improve the performance of this model, and perhaps I’ll revisit this in a future post.
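To make the threshold's role concrete, here's how precision and recall fall out of a chosen cutoff, computed with scikit-learn on toy scores (the numbers are illustrative, not the model's):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
scores = np.array([0.8, 0.3, 0.25, 0.6, 0.1, 0.4, 0.9, 0.15])

# Apply the decision threshold to the model's probability output
y_pred = (scores >= 0.19).astype(int)

precision = precision_score(y_true, y_pred)   # TP / (TP + FP)
recall = recall_score(y_true, y_pred)         # TP / (TP + FN)
auc = roc_auc_score(y_true, scores)           # threshold-independent
```

Lowering the threshold raises recall at the expense of precision, which is the trade made deliberately here: missing a high-risk segment is worse than flagging a safe one.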

Some Results

Now that we have a model with decent performance, what can we learn about how it makes predictions? What features are used? What values are important to consider?

XGBoost Feature Importance (How often do they appear in a tree to make decisions?)

You can see from the above plot that solar position and temperature are the most important features. There are several reasons for this. Temperature acts as a proxy for things that are difficult to measure: for instance, it relates to the season and the time of day within the season, and it tells us how likely the roads are to be icy.

Notice a large peak around 0 Celsius

Obviously solar position tells us the time of day and season, but it also lets us model an interesting factor: is the sun in drivers’ eyes? Let’s take a look at the split histograms for both solar elevation and road orientation.

Left: Notice the peak around 0 degrees elevation? This is near sunrise. Right: There are peaks at both 90 degrees and 0/180 degrees, but the peak at 0/180 degrees is higher. This indicates that east/west roads are more often used by the model to make a determination.

Going down the list, we notice that population density, accident_counts, proximity_to_billboard, and proximity_to_signal are important features, which certainly makes sense from our analysis. You may notice that hour, weekday, and month are broken into multiple features through one-hot encoding; if we aggregate the importance from each hour, they actually make up a large contribution to the model. Less common weather features, such as icy (not often reported in our weather feed) and hailing, appear in the model but are not as useful for making predictions.
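Aggregating one-hot importances back to their parent feature is a small bookkeeping step; a sketch with made-up importance scores:

```python
# Made-up importance scores from a one-hot encoded model
importances = {'hour_7': 0.02, 'hour_8': 0.03, 'hour_17': 0.04,
               'temperature': 0.12, 'solar_elevation': 0.15}

# Sum scores whose names share a one-hot prefix (e.g. hour_0 .. hour_23)
aggregated = {}
for name, score in importances.items():
    base, _, suffix = name.rpartition('_')
    key = base if suffix.isdigit() else name
    aggregated[key] = aggregated.get(key, 0.0) + score
```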

What about the temporal performance of our model? Can it make predictions over time?

It’s not perfect, but for the most part we’re doing a good job. We were fairly far off around Feb 25th, but that was a particularly icy/snowy day and there wasn’t enough weather information to make that determination.

Keep in mind that this is not a time series model; we are simply estimating the expected counts for all road segments and aggregating them. We perform a mean/variance adjustment on the resulting time series since our training set was biased towards accidents.
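That adjustment is just a rescale of the aggregated series to match the observed statistics; a sketch with made-up numbers:

```python
import numpy as np

# Raw aggregated risk scores per day (inflated by the accident-biased training set)
raw = np.array([120.0, 95.0, 140.0, 110.0])

# Observed statistics of real daily accident counts (illustrative values)
obs_mean, obs_std = 15.0, 4.0

# Standardize the raw series, then rescale to the observed mean/variance
adjusted = (raw - raw.mean()) / raw.std() * obs_std + obs_mean
```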

Finally, let’s take a look at the resulting model spatially.

Generated Risk Map vs Real Accidents: A particularly icy hour/day in December. Only one of these accidents occurred on a low risk segment.
Same hour, different area. This time all accidents occurred on high risk segments.

You’ll notice from the above graphics that there are plenty of places we predict as high risk where no accidents occurred. We are attempting to quantify risk with this model, but accidents are still random occurrences. We believe that with more features (especially real-time traffic information, construction, important events, and higher-resolution weather) we can improve this model significantly, but it seems like a good first pass.

Why do we care?

We built a decent model to predict accident risk, but what can we do with it? There are numerous possible applications, including the following that we have considered:

  • Safe route planning
  • Emergency vehicle allocation
  • Roadway design
  • Where to place additional signage (e.g. to warn for curves)

Some engineers at Esri took this example and built a prototype safe routing app! It’s a pretty cool concept, take a look at it here.

That’s all I have today, thank you for reading! Have a better approach? Are we doing something poorly? Post in the comments below!