Accounting for bias is vital in business-related data reporting because reports feed decision-making and therefore carry significant weight. It is rare, however, for bias analysis to become the central aspect of a dataset, one from which meaningful information can be extracted.

We realised this when planning a survey of web services that would reflect our opinions about the state of online systems. Our individual biases came into play and made it harder to settle on a standard scoring system that everyone would agree with. A proposal to gather qualitative data about these opinions eventually snowballed into a statistical model that reflects our standing as a company.

Here we present a fully-fledged model that gathers scoring data about online services and obtains the distribution function of these scores in various categories. This data serves two purposes:

  1. It informs us, as well as potential partners, about our feelings regarding multiple aspects of web development, such as the impact on society and the environment.
  2. It enables us to obtain similar datasets from our clients through a simple form application, and objectively compare their values with our own.

The process, once the model is defined, is quite simple. First, we devise a standard scoring system for internal use, and we apply it to rank a list of web services in each category. Then we give a similar list to a potential partner and ask them to score the same services. Once all the data is gathered, we apply the same mathematical transformations to both datasets to obtain the histograms, and finally compare our results with the client’s.

Let’s go deeper into the actual definition of the model.

Model definition: the scoring system

Five categories were defined to reflect the different aspects of a web service:

  1. Main service: main functionality of the web service
  2. Secondary aspects: other services offered and added functionality
  3. Client-facing: user experience
  4. Society: social impact
  5. Environment: environmental implications

The scoring system could be defined either as a qualitative assessment or as a numerical assignment, depending on whether the allowed scores are ranks, non-negative integers or real numbers. For the sake of simplicity, and because integer scores become sufficiently accurate for large enough sample sizes, we decided on the following scale:

Rank   Very Bad  Bad  Average  Good  Very Good
Score  0         1    2        3     4

In scoring forms, the user would then assign a rank to each category of the service.
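As a minimal sketch, the scale can be expressed as a simple mapping from ranks to integer scores; the category names in the usage example are illustrative, not part of the model definition.

```python
# Rank-to-score mapping from the scale above.
RANK_SCORES = {
    "Very Bad": 0,
    "Bad": 1,
    "Average": 2,
    "Good": 3,
    "Very Good": 4,
}

def score_form(ranks):
    """Convert a form's per-category ranks into integer scores."""
    return {category: RANK_SCORES[rank] for category, rank in ranks.items()}

# Illustrative usage with two of the five categories:
print(score_form({"Main service": "Good", "Environment": "Bad"}))
# → {'Main service': 3, 'Environment': 1}
```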

Qualitative data: the internal struggle

Naturally, the scoring system doesn’t account for individual bias. This is partly the point of such rankings, as they reflect the opinions of the users. It is also our intended use: we want this data to compare ourselves with potential clients and to measure the goodness of fit between us.

However, this is also an issue to overcome if we wish to obtain a dataset that reflects our opinions and beliefs as a company. We have followed a crude normalisation method to find the consensus definition of “average” in each category, thus defining the standard for internal scoring. The technique essentially consisted of a qualitative questionnaire that was anonymously filled by all our staff, by which they were questioned about their views and opinions regarding the five categories described above. For example, when posed this question about the non-essential features of a service, two respondents gave the following answers:

Secondary aspects

Secondary aspects include the functionalities that add to the service provided, regardless of how non-essential they are. For example, Netflix’s algorithmic recommendations are non-essential but provide additional functionality.

How important are the non-essential functionalities added to a service?

“I outright don’t care for recommendation engines and would prefer that they didn’t exist. Likewise with most secondary features in a web service, I prefer a wider breadth of tools with less scope as most of my web usage is done with a specific goal in mind, I rarely mosey about and try things for fun as it seems like a waste of time.”

“Secondary aspects are important if they add to the ease with which a user can access the services, or positively grow the uses they have of the service. They should add services only if they benefit the user.”

These questionnaires are only partly usable: being descriptive opinions given in answer to broad questions, they cannot be fully parsed and incorporated into the model. Nevertheless, they are essential for detecting internal conflicts or disagreements about particular aspects. For instance, conflicts were most prominent in issues regarding the environment:

Environment

As with social impact, environmental impact is very relevant nowadays. Many web services claim to be carbon-neutral or environmentally friendly, but it remains an often-ignored topic.

What is the baseline effort that any online service should make?

“To be carbon neutral and reinvest into the environment, by either being powered entirely by renewable energy sources or putting money back into sustainability.”

“I’m unsure if a service should make any effort. I would prefer services strive to use data centres in countries where the majority of the power is generated via renewable energy sources, but this can cause problems with regards to response times.”

With the aid of these questionnaires, the baseline definitions for the scoring system are defined, and we can start scoring services.

Quantitative data: surveying the web

At this point, the next steps in the process would involve surveying a list of web services according to our ranking system. In this step, each member of staff would be given a list of online systems to score. The data would be collated and the average score in each category across all services would be calculated, as well as the distribution around the average.

On the other hand, similar lists of services would be given to potential partners in a form, so that they could score them according to their own opinions. Because they wouldn’t have gone through the normalisation process, their bias would likely differ from ours, which would be reflected in the average scores and the distributions. To reduce the number of samples required, the survey would allow finer granularity in the score (i.e. one decimal place rather than strict integers).

For example, a client might produce the following data:

          Main  Secondary  Client-facing  Society  Environment
Facebook  3.1   2.4        1.8            3.0      1.4
Netflix   3.6   3.0        2.9            2.8      1.8
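Hypothetically, the per-category averages of such a table could be computed as below; the figures are taken from the sample table, and the dictionary layout is just one possible representation of the form data.

```python
# Sample client scores, as in the table above (a sketch, not our real data).
scores = {
    "Facebook": {"Main": 3.1, "Secondary": 2.4, "Client-facing": 1.8,
                 "Society": 3.0, "Environment": 1.4},
    "Netflix":  {"Main": 3.6, "Secondary": 3.0, "Client-facing": 2.9,
                 "Society": 2.8, "Environment": 1.8},
}

categories = ["Main", "Secondary", "Client-facing", "Society", "Environment"]

# Average score per category across all services.
averages = {
    cat: sum(service[cat] for service in scores.values()) / len(scores)
    for cat in categories
}
print(averages)
```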

Data analysis: statistical magic

From this data, it is necessary to obtain the average scores and distributions for each category, which can then be compared with our own. This is equivalent to receiving the empirical distribution functions for each dataset and analysing them using a bit of statistical magic.

The first part is arguably the simplest: define a set of equal-width bins along the range of possible values, and count how many instances fall in each bin. In other words, count how many services were given a score between 3 and 3.5, how many between 2.5 and 3, and so on; finally, normalise the dataset by dividing each count by the total number of instances. That produces a histogram (an empirical probability mass function).
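A minimal sketch of this step with NumPy, assuming half-point bins over the 0–4 range and an illustrative set of scores:

```python
import numpy as np

# Illustrative scores for one category across several services.
scores = np.array([3.1, 2.4, 1.8, 3.0, 1.4, 3.6, 3.0, 2.9, 2.8, 1.8])

# Equal-width half-point bins: edges at 0, 0.5, 1.0, ..., 4.0.
bins = np.arange(0, 4.5, 0.5)
counts, edges = np.histogram(scores, bins=bins)

# Normalise so the bin masses sum to 1.
hist = counts / counts.sum()
```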

From the histogram, the cumulative distribution function can also be calculated easily: it’s a similar concept, but it counts how many instances were given a score less than each bin’s upper edge, normalised so that the final value is 1.
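Continuing the sketch, the cumulative distribution is just a running sum over the normalised histogram; the bin masses below are illustrative.

```python
import numpy as np

# Illustrative normalised histogram over eight half-point bins.
hist = np.array([0.0, 0.0, 0.1, 0.1, 0.2, 0.2, 0.3, 0.1])

# cdf[i] is the fraction of services scored below bin i's upper edge;
# the final value is 1 by construction.
cdf = np.cumsum(hist)
```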

The last bit is the most challenging but also the most rewarding. Enter: the Kolmogorov-Smirnov test and the null hypothesis. This test is a statistical tool used to check whether two distributions of data are similar enough that they could have been produced by the same method or drawn from the same source. This assertion is called the null hypothesis, and it rests on the assumption that, if two datasets were obtained in the same way, they would follow the same distribution. We use it to compare the cumulative distribution function derived from our data with the one derived from the client’s data, calculating the Kolmogorov-Smirnov statistic and its p-value. The p-value is an interesting concept in itself: it reflects how likely it is to observe a difference at least as large as the one measured, assuming the null hypothesis holds. Thus, smaller p-values indicate that the observed difference is increasingly unlikely to arise by chance. By computing the p-value of the Kolmogorov-Smirnov test, we are essentially asking ‘How likely is it that these two datasets have the same source?’, so smaller p-values indicate more significant differences between the potential partner and us.
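With SciPy, the two-sample test is a one-liner via `scipy.stats.ks_2samp`, which builds the empirical CDFs from the raw samples internally; the score arrays and the 0.01 threshold below are illustrative.

```python
from scipy.stats import ks_2samp

# Illustrative score samples for one category: ours and a client's.
ours = [2.0, 3.0, 3.0, 2.0, 4.0, 3.0, 2.0, 3.0]
theirs = [1.4, 1.8, 2.4, 2.8, 2.9, 3.0, 3.1, 3.6]

# Two-sample Kolmogorov-Smirnov test against the null hypothesis
# that both samples come from the same distribution.
stat, p_value = ks_2samp(ours, theirs)

if p_value < 0.01:
    print("Reject the null hypothesis: opinions differ significantly.")
else:
    print("No significant difference detected: likely a good match.")
```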

At this point, it’s possibly easiest to visualise the process as a histogram:

The chart above represents the distribution of scores from a hypothetical client. Clicking on the bars of a category brings up the comparison of the client’s cumulative distribution function with our own, and it also calculates a two-sample Kolmogorov-Smirnov statistic to test for similarity, with a p-value threshold of 0.01 (or 1%). If the p-value is smaller than the threshold, the null hypothesis is rejected, which signals to us that we should probably address this difference of opinion with the client. If it is not rejected, we can rest assured that we are a good match for them!

Some remarks and disclaimers are in order. First and foremost, our crude normalisation method is by no means a rigorous or efficient approach to data normalisation. Done properly, the process would involve a large number of samples from every member of staff, and the comparative analysis we apply to clients would be applied to internal comparisons as well. However, that process is time-consuming and, as we are a small team, it would require each of us to score vast numbers of web services before the sample size was large enough. Instead, by finding common ground and defining a consensus scoring system, we eliminate the most significant conflicts and minimise the effect of the bias.

Another aspect of the data that is ignored in this analysis is the actual ranking of particular web services. We are not considering which services score higher; we are just using their scores to obtain a data fingerprint of our own bias. This data would serve as guidance for deciding what we aspire to, but it’s a fragile concept because it assumes that the internal bias is entirely resolved by the normalisation (which, as mentioned above, is not true).

Conclusions

After a lengthy process of data gathering and some struggles with data-reporting libraries and methods, we have produced a technique that enables us to compare ourselves to potential clients. The end use is to learn more about our partners in advance, by working out which aspects of web service development we would need to address more directly in order to meet the client’s expectations. Furthermore, this reduces the assessment to a data-gathering problem, which can be approached with automation tools such as form apps.