17 analytics to transform your organisation, Part 3

12 December 2019
In the first and second parts of this article, we saw how analytics have succeeded in encompassing all aspects of the data generated worldwide. Using quantitative data, analytics have managed to penetrate the various forms of media available: text, speech and voice, sound, image and video. Using these different forms in combination has even led to the development of sophisticated systems capable of interpreting human emotions. But this didn't go far enough, and for certain types of problem, there was a further need to develop highly specific analytics.

(Feel free to have a look at the previous part of our series here.)

 

9 - Business Experiments

On 27 February 2000, in one of its many labs, Google performed an experiment that went on to revolutionise the Internet. It was in fact a very basic experiment, one that might easily have been dismissed as insignificant. The test looked at how search results were displayed, using two user groups: the control group saw the usual ten links, while the test group saw a modified page listing more than twenty links.

The conclusions and lessons learned from this test proved to be a real turning point. First and foremost, it was observed that making this minor modification caused significant differences in behaviour between the two user groups. What also caught Google’s attention was that it was possible to adjust the number of links displayed to obtain the highest number of positive reactions, and that this level of adjustment could not be determined theoretically, only by repeated trial and error comparing the results each time.

This paved the way for the development of an optimisation method better known as A/B Testing, which involves running experiments on randomly selected groups of users.
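
To make the mechanics concrete, here is a minimal sketch in Python (with entirely made-up traffic and click-through figures) of how the two randomly assigned groups of such an experiment might be compared, using a standard two-proportion z-test:

```python
import math
import random

def ab_test(conversions_a, users_a, conversions_b, users_b):
    """Two-proportion z-test comparing the conversion rates of variants A and B."""
    p_a = conversions_a / users_a
    p_b = conversions_b / users_b
    p_pool = (conversions_a + conversions_b) / (users_a + users_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / users_a + 1 / users_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_a, p_b, z, p_value

# Hypothetical experiment: 10-link layout (A) vs 20-link layout (B)
random.seed(42)
users_a = users_b = 5000
conv_a = sum(random.random() < 0.110 for _ in range(users_a))  # ~11.0% click-through (made up)
conv_b = sum(random.random() < 0.123 for _ in range(users_b))  # ~12.3% click-through (made up)

p_a, p_b, z, p = ab_test(conv_a, users_a, conv_b, users_b)
print(f"A: {p_a:.3f}  B: {p_b:.3f}  z = {z:.2f}  p-value = {p:.4f}")
```

With these hypothetical figures, a p-value below 0.05 would suggest the difference between the two layouts is unlikely to be due to chance alone.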

From the very outset, this method was widely embraced across the Web, and its use has only intensified since, perhaps to the point of excess. In 2009, one Google designer publicly threw in the towel after the company tested 41 barely distinguishable shades of blue. In 2011, Google – again – performed over 7000 A/B tests. In 2018, Facebook was running over 1000 A/B tests per day.

The biggest landmark case is probably Barack Obama’s 2012 election campaign. A/B testing was used methodically to optimise the performance of the campaign website, testing alternative photos, banners, slogans and more against one another. In the end, the campaign team did indeed manage to maximise the numbers of party members, volunteers and donations... and the rest, as we know, is history (see illustration).

 

barack-obama-campaign

Two tested versions (among others) of the homepage of Barack Obama’s campaign website. The version on the left attracted more party memberships. Shown on the right is the menu for managing options and ratings.

 

More discreetly, but perhaps in a more established and deeply rooted way, most online newspapers and dailies regularly test alternative headlines for published articles, with the obvious aim of prompting more readers to click the link.

Nowadays, A/B Testing is also used increasingly widely in the gaming industry, particularly in online games and network play. In World of Warcraft, for example, tests are systematically implemented for different versions of the game. One of the main methods consists of assigning different missions to several groups of players, the aim being that each player stays online for the longest time possible without signing out. This is how Blizzard Entertainment managed to ascertain that the game time per session is 30% longer when the mission involves rescuing a character.    

In the connected digital world, random experiments are relatively quick and cheap to implement. There’s no need to recruit volunteers when all that’s needed is to modify a line of code to divide the users of a service or an application into several groups, unbeknown to them.      

And minor changes really can be seen to have major impacts. In 2012, Google (again) added right arrows to its adverts. The arrows didn't actually point to anything except the edge of the screen, which seems totally counter-intuitive – but there was no disputing the results: the conversion rate increased significantly.

 

10 - Cohort Analysis

Behind this strange name, straight out of Ancient Rome, hides an extremely effective analytics technique for optimising website traffic.  

A cohort is a group of users who share a common characteristic and, in particular, have shown the same behaviour over a fixed period of time. The term was initially used in medical studies comparing groups of smokers and non-smokers. It has since been widely adopted in the context of digital services and web platform development.

Cohort Analysis is a tool used for in-depth monitoring and to provide insights into the behaviour of users of online applications: emergency services or public services, ecommerce platforms, business-specific web applications, online games, etc.

It is obviously a key asset in digital marketing, used to ascertain how a service is actually being used and whether people carry on using it.

Cohorts are formed by slicing time, grouping users together in order of arrival (like the cohorts of a Roman legion). For example, grouping all customers who placed their first order on a given day creates one cohort per day.

In practice, cohorts are based either on the moment when the purchase of a product or service is completed for the first time, or on what the users are doing in an application during a given period (browsing, transaction, downloading, contact form, etc.).  

Time slicing is relevant for identifying recurring temporal patterns in the customer life cycle. For example, for an ecommerce site, we start off by measuring the time between the first visit and the first purchase for a given cohort. Following this same logic, we can find out how much time elapses between two purchases, how long it is before the customer stops making purchases, what percentage of customers are still visiting after two weeks, after one month, two months, etc.  
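
As an illustration of how such cohorts can be built in practice, here is a minimal sketch in Python with pandas, using a made-up order log; the column names and figures are purely hypothetical:

```python
import pandas as pd

# Hypothetical order log: one row per purchase (customer_id, order_date)
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3, 4, 5, 5],
    "order_date": pd.to_datetime([
        "2019-01-05", "2019-02-10", "2019-01-20", "2019-02-02",
        "2019-03-01", "2019-04-15", "2019-02-25", "2019-03-03", "2019-03-28",
    ]),
})

# Each customer's cohort is defined by the month of their first purchase
first_purchase = orders.groupby("customer_id")["order_date"].transform("min")
orders["cohort"] = first_purchase.dt.to_period("M")

# Age of each order, in whole months since the customer's cohort started
orders["age_months"] = (
    (orders["order_date"].dt.year - first_purchase.dt.year) * 12
    + (orders["order_date"].dt.month - first_purchase.dt.month)
)

# Retention table: distinct customers of each cohort still buying N months later
retention = (
    orders.groupby(["cohort", "age_months"])["customer_id"]
    .nunique()
    .unstack(fill_value=0)
)
print(retention)
```

Each row of the resulting table is a cohort, and each column shows how many of its customers were still ordering N months after their first purchase.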

The significance of using cohorts, as we know, is the ability to track micro-trends by comparing changes in measurements between one cohort and another. This allows us not only to be more proactive when we see that results are not going in the right direction, but also to rapidly evaluate the impact of initiatives implemented to keep targets on track. 

Overall, an analysis across the entire set of cohorts provides finely detailed insights into the quality of usage. It can be used to ascertain the causes of retention or disengagement, to see how quickly customers are lost, and to determine at what pace we need to acquire new customers in order to maintain site traffic.

Unsurprisingly, the leading companies in this field are Google and Adobe, which both provide off-the-shelf and customisable cohort analysis frameworks.

These features are seamlessly integrated within the new version of Adobe Analytics (January 2019). As standard, there are three cohort charts used to analyse customer retention, customer attrition (or churn) and latency time.

By customising these charts, it is possible to fine-tune data in order to improve the analyses carried out.

For example, let’s try to determine the best time to re-engage customers who are visiting the website or using an online service less frequently. 

In Google Analytics, there is also a cohort analysis option in View / Audience / Cohort Analysis. By default, Google presents evaluations using five values, represented by colour gradients (the higher the value, the darker the colour).

 

cohort-analysis-google-analytics

Google Analytics page for retrieving and analysing website visits by cohort

 

To sum up, the toolkit is both easy to access and easy to use, with the potential for huge benefits. So, if you're tempted to try cohort analysis for yourself, then go for it...

 

11 - Forecasting / Time Series Analysis

Forecasting is relevant to one specific type of data: time-series data.

A time series, also known as a chronological series, is a sequence of numerical values measured at regular intervals, representing changes to a specific quantity over time. Typically, this is the data structure obtained when we connect a sensor to a system or machine in order to record its operation (e.g. a jet engine or a machine tool on a production line).

Time-series analysis is particularly relevant for cyclical or periodic trends, either for monitoring or for making predictions. The principle is to understand previously observed changes in order to predict future behaviours.  

The initial aim is to determine trends or to assess the stability of values and their variation over time. Predictions, on the other hand, are based on statistical calculations.
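
By way of a simple, hedged illustration of this statistical side, the sketch below applies simple exponential smoothing, one of the most basic forecasting techniques, to a made-up series of sensor readings; the data and smoothing factor are purely illustrative:

```python
def exponential_smoothing(series, alpha=0.3):
    """Simple exponential smoothing: each smoothed value is a weighted
    average of the current observation and the previous smoothed value."""
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

# Made-up temperature readings from a sensor, sampled at regular intervals
readings = [612.1, 613.4, 611.8, 615.2, 618.9, 617.3, 621.0, 624.7]

smoothed = exponential_smoothing(readings, alpha=0.3)

# A naive one-step-ahead forecast: the next value is the last smoothed level
forecast = smoothed[-1]
print(f"Smoothed series: {[round(v, 1) for v in smoothed]}")
print(f"One-step-ahead forecast: {forecast:.1f}")
```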

Time-series analytics is widely used for predictive maintenance. Firstly, it provides a means of rapidly detecting malfunctions and identifying the cause. It is also used to model the nominal operation of equipment or hardware, to simulate it over time, and to predict wear and tear or breakdowns. 

Its application is relevant to every area of activity where automated machines are used. In addition, the current rise in sensors and connected objects will undoubtedly lead to a sharp increase in the number of time series produced throughout the world. The aeronautics sector is a special case in its own right, with huge investments and worldwide predictive maintenance services that are being developed and improved from one day to the next.

In technical terms, forecasting prompted a fresh wave of innovation with the emergence of Time Series Databases a few years ago.

These are data management software applications with a specific design and implementation. They are characterised by fast, high-availability, specialist functions for the storage and retrieval of time-series data.

The most well-known and widely used of these solutions is InfluxDB, developed by InfluxData and written in Go. It has been available under an open-source (MIT) licence since 2013.

 

influxdb-time-series-databases-developed-by-MIT

InfluxDB – Main Menu 

 

Each InfluxDB data point is made up of a collection of key-value fields associated with a timestamp. Timestamps have nanosecond precision, so measurements can be handled quasi-continuously (in near real time).

Its functions are natively time-oriented, designed to query data structures made up of measurements, series and data points (a short usage sketch follows the list):

  • storage and compression of timestamped data (data with similar timestamps is stored in the same physical space),
  • data lifecycle management,
  • aggregation (temporal),
  • large-scale data scanning,
  • time queries, range queries,
  • high write performance,
  • huge scalability.
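
As a rough usage sketch of this data model (here with the Python client for InfluxDB 1.x; the database, measurement and tag names are hypothetical), a point combines a measurement, tags, key-value fields and a timestamp, and queries aggregate over time windows:

```python
from influxdb import InfluxDBClient  # Python client for InfluxDB 1.x

client = InfluxDBClient(host="localhost", port=8086, database="telemetry")

# One data point: a measurement name, tags, key-value fields and a timestamp
point = {
    "measurement": "engine_temperature",   # hypothetical measurement
    "tags": {"engine_id": "A42"},          # hypothetical tag
    "fields": {"value": 612.3},
    "time": "2019-12-01T10:00:00Z",
}
client.write_points([point])

# Time-oriented query: 10-minute averages over the last hour (InfluxQL)
result = client.query(
    'SELECT MEAN("value") FROM "engine_temperature" '
    "WHERE time > now() - 1h GROUP BY time(10m)"
)
print(list(result.get_points()))
```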

And it’s all about performance: by way of comparison, a Time Series Database such as InfluxDB can be up to one hundred times faster than Elasticsearch in querying time series data.

It has many applications and is currently booming. At present, monitoring is the most prevalent use case (infrastructure and application monitoring, IoT monitoring and analytics, network monitoring). In the near future, though, we only need to look to the aeronautics sector to see that Time Series Analytics systems are set to be used increasingly as part of complex architectures, producing modelling, simulations and predictions on subsystems and full systems such as aircraft, railways or roads, energy distribution networks, etc. The Time Series era has only just begun.

 

12 - Horizon Analysis

Horizon Analysis is well known in the finance sector. A horizon is quite simply a deadline: a fixed, pre-determined date. Horizon analysis therefore consists of anticipating all the scenarios that are likely to unfold before that date is reached.

If you talk about horizons in banking terms, it immediately calls to mind the expected income from an investment within a given timeframe. If we then need to reason about a set of shares or securities – a portfolio – everything gets rather more complicated, and it becomes difficult to take a linear or deterministic view of the return on investment. And that’s where horizon analysis comes in.

The first stage involves breaking down the expected results into scenarios projected over several time periods. The simulations follow the What If principle. We then compare various likely scenarios against a worst-case scenario. In our financial portfolio example, the outcome is a more realistic assessment of overall performance, which is what generates the total return analysis.
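
As a purely illustrative sketch of this What If principle, the code below projects a hypothetical portfolio over a five-year horizon under a few annual-return scenarios, including a worst case, and compares the outcomes; every figure is made up:

```python
# Hypothetical annual-return scenarios for a portfolio over a 5-year horizon
scenarios = {
    "optimistic": [0.08, 0.07, 0.09, 0.06, 0.08],
    "baseline":   [0.04, 0.04, 0.05, 0.03, 0.04],
    "worst_case": [-0.10, -0.05, 0.00, 0.01, 0.02],
}

initial_value = 100_000  # starting portfolio value (arbitrary currency)

for name, yearly_returns in scenarios.items():
    value = initial_value
    for r in yearly_returns:
        value *= 1 + r  # compound the return for each period
    total_return = value / initial_value - 1
    print(f"{name:>10}: final value = {value:,.0f}  (total return {total_return:+.1%})")
```

Comparing the final values side by side is what makes it possible to choose, or negotiate, one scenario among the others.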

Transposed into other domains, the main aim and advantage of horizon analysis is the ability to make quantitative comparisons between different scenarios in order to choose or negotiate one of them. When the decision-making process involves a number of stakeholders, it also proves to be a very effective instrument for assessing the opportunities of each stakeholder and their corresponding scope of responsibility.

The most obvious example, in my opinion, is the issue of how to control greenhouse gas emissions – a real hot topic. The report issued by the IPCC every five to six years is a prime example of horizon analysis. 

The IPCC is the Intergovernmental Panel on Climate Change, created in 1988. The report sets out climate change mitigation scenarios.

The fifth report, issued in 2014 and called AR5 (IPCC Fifth Assessment Report), sets out four scenarios with horizons ranging between 2050 and 2100. It describes a temperature increase of between 1°C and 4°C in the best case and of 4°C to 11°C in the worst case. It attempts to answer the question: how can we minimise the rise in temperatures by the end of the 21st century?

The scenarios are based on ACEGES models (Agent-based Computational Economics of the Global Energy System). These were developed using around fifty climate models: general circulation, simulation of the movement and temperature of air and ocean masses, the carbon cycle, the water cycle (vapour, clouds).

The starting point is to assess the level to which human activity impacts increasing temperatures, called anthropogenic forcing. This is now considered to be extremely likely, having shifted from a possibility in 1979, to very likely in the fourth report in 2007. 

In addition, the scenarios explore the options for controlling the different factors that accelerate warming. Among the various options, there is a focus on rising concentrations of CO2, and also on other potent greenhouse gases such as methane and nitrous oxide (emitted, for example, by ruminants, rice fields and thawing permafrost).

 

scenarios-greenhouse-gas-emission-by-2100-ipcc

The four scenarios for greenhouse gas emissions by 2100, modelled by the IPCC in its fifth report in 2014 (AR5).  

 

Emissions are broken down by major types of economic activity in order to determine the means of action for minimising accelerated warming (power, industry, forestry (tropical deforestation), agriculture, transport, housing, waste and waste water).

This type of analysis can prove extremely complex, both in formulating hypotheses and in conducting the studies, which require huge amounts of data and immense computational power. To this end, the availability of supercomputers is a precondition for producing robust and credible results.

One final point of interest is that results are broken down into scenarios, each containing targeted hypotheses, in order to distribute the task of analysis between several research teams and to run studies in parallel. In this respect, we can say that horizon analysis, conducted at a certain scale, is an immensely collaborative process. It is surely the most accomplished method to date of working towards a common good.  

 

13 - Monte Carlo Simulation

Monte Carlo is a probabilistic analysis method that involves repeatedly simulating a model whose variables are drawn at random, in order to determine the range of possible outcomes.

This method was invented and popularised, so to speak, in the field of nuclear physics, at the Los Alamos Laboratory shortly after the Second World War. This is where the MCNP (Monte Carlo N-Particle Transport Code) calculation code was developed, a specialist piece of software used to numerically simulate interactions between particles: photons, electrons or neutrons. For each particle tracked, the trajectory and the type of nucleus with which it interacts are drawn at random, exactly as in slot machines or casino roulette. The Los Alamos scientists did indeed have the Monte Carlo Casino in mind when they named the method.

Today, MCNP has become the most widely-used simulation tool in the world, in applications such as radiation protection, nuclear criticality, instrumentation, dosimetry, medical imaging, and calculations for engines and other installations.

As for the Monte Carlo method itself, its application has been extended to all types of decision where there is a significant degree of uncertainty. Using this method, a decision maker is able to ascertain the entire range of possible outcomes and the probability of occurrence for each choice of action.

The first stage consists of building a mathematical model of the decision in question. The simulation is then run repeatedly, drawing random values for each of the model’s uncertain variables, until there are enough results to plot a probability distribution curve.
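
In the spirit of the illustration below, here is a minimal Monte Carlo sketch in Python with NumPy: it simulates one year of random daily changes in a hypothetical asset price many times over and summarises the resulting distribution; the drift and volatility figures are assumptions chosen purely for the example:

```python
import numpy as np

rng = np.random.default_rng(seed=7)

n_runs = 10_000         # number of simulated price paths
n_days = 252            # one trading year
s0 = 100.0              # hypothetical starting price
mu, sigma = 0.05, 0.20  # made-up annual drift and volatility

# Geometric Brownian motion: daily log-returns drawn at random for every path
daily_returns = rng.normal(
    loc=(mu - 0.5 * sigma**2) / n_days,
    scale=sigma / np.sqrt(n_days),
    size=(n_runs, n_days),
)
final_prices = s0 * np.exp(daily_returns.sum(axis=1))

# Summarise the probability distribution of outcomes
print(f"mean final price : {final_prices.mean():.2f}")
print(f"5th percentile   : {np.percentile(final_prices, 5):.2f}")
print(f"95th percentile  : {np.percentile(final_prices, 95):.2f}")
print(f"P(loss)          : {(final_prices < s0).mean():.1%}")
```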

 

monte-carlo-simulation

Re-running the simulation thousands of times will plot a graph representing all possible outcomes, in this example changes to the price of an asset.  

 

The plotted curve gives a visual representation of the probability distribution, making it easy to see where a given decision falls among all the possible outcomes.

It is then possible to make choices based on the level of risk the decision maker is willing to take in order to obtain the desired or required result.   

Because the calculation is based on probability, a high number of random draws is required to lower the statistical uncertainty. Sometimes the simulation needs to be run tens of thousands of times or more in order to cover all possible outcomes, which is why a powerful computer is always a necessity. Certain calculations can take several months to complete and require significant resources, hence the increasingly popular use of supercomputers.
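
One way to see why so many runs are needed: for an estimate built from N independent random draws, the statistical error typically shrinks only in proportion to σ/√N. Halving the error therefore requires four times as many simulations, and gaining a single extra decimal place of precision requires roughly a hundred times more.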

 

Conclusion

This overview of analytics has now covered a lot of ground. And although we have talked about all types of data and media, and have touched on some highly specialised areas of analytics, we have not yet reached the end of our study. It still remains for us to look at certain methods that almost qualify as pure mathematics. Some are older methods that are still in popular use today, but those that have undergone more recent developments are certain to represent the future of analytics...     

(Feel free to have a look at the final part of our series here.)

And you may be interested in our Big Data offering.

Let’s have a chat about your projects.
