Salary benchmarks aren't as useful as you might think

September 6, 2022. 2,435 words and a 13 min read. Post version 1.1

If you've hired people, you've likely seen or used some form of salary benchmark. They're a set of data, ostensibly curated by an objective third party, that lists compensation for normalised roles in your industry and location. They say things like "Senior Foo Technicians make $50,000 per year in New York". If you're hiring Foo Technicians in New York, you can look at those numbers and make a judgement about your pay versus what the benchmark tells you.

It's easy to become myopic about compensation, so it can be valuable to lift your head up and look at the wider market. This data can be very hard to get and normalise at scale; your competitors, for example, aren't likely to answer the email asking them how much they pay people. These complexities are often why you need to buy this data, and it can be expensive.

The trouble arises when people use this data without understanding its limitations. If salary benchmarks are the only data you use to decide how much to pay people, you're unlikely to get the outcomes you intended.

How is salary benchmark data collected?

Before we jump into those limitations, it's useful to understand how this data is collected, as those methodologies are the main driver of the normalised data's flaws. Broadly, there are three types of input into salary benchmarking:

  1. Company reported data. This is where a company will contribute their internal compensation data to a third-party benchmark, usually to get some level of access to the subsequent aggregated data. This is the bread and butter of commercial benchmark data, and is seen as the most valuable.
  2. Individual reported data. This is where an individual will share their salary, typically also getting access to aggregate data as a result. This data category is relatively new in the benchmarking space, but is increasingly impactful. For example, tools like levels.fyi are having a big impact on benchmarking for technology roles.
  3. Public data. This can be anything from scraping job listings, to using government data, to looking at public company accounts. This data is often used as a secondary source for correlation and validation rather than as "primary" data like (1) and (2). I won't touch on it in this article.

The best benchmarks use all three of these sources, correlating and normalising them to come up with more accurate answers.

The benchmarking industry is based on relationships, particularly for those organisations that rely heavily on (1). How many organisations do they get data from? How frequently can they get updates to that data? Do they have good coverage in certain industries and roles? Getting to scale with this data depends on how well you leverage those relationships, which is why large consulting organisations like McKinsey & Company can do an impressive job of collecting it: they're built to optimise for trusting relationships.

Normalisation and validation is hard

Validation and normalisation are really hard problems to solve with salary benchmarking data. For example, if you were provided with the following data, how would you normalise it?

| Company | Role | Location | Salary | Source |
|---------|------|----------|--------|--------|
| Foo Ltd | Senior Software Engineer | London, UK | £80,000 | Company reported |
| Bar PLC | Software Engineer 3 | South East, UK | £100,000 | Individual reported |
| Baz AB | Lead React Engineer | Soho, London | £60,000 | Company reported |

There are a large number of questions that arise from the data:

  1. Are those roles the same seniority? Should the data be normalised across them?
  2. Is that the same location from a hiring POV?
  3. Should you normalise across private and public companies?
  4. How do you normalise across companies that pay a lot in non-salary compensation?
  5. The government reports that "Software Engineers" make on average £xx,xxx, so should you weight more towards that figure?

There are lots of complex, context-specific questions to answer.

These questions are usually answered manually, as judgement calls, by humans who know a lot about a type of role or market. For example, an organisation looking at the above data may have an expert analyst who would answer: (1) yes, they can be normalised as the same seniority; (2) no, the locations are different; (3) yes, it should be normalised across public and private; and so on. These human judgements permeate benchmarking data.
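To make that concrete, here's a minimal Python sketch (with entirely made-up mappings) of how those judgement calls might be encoded as lookup tables so they can be applied consistently across rows; the proprietary weights and formulas mentioned below are far more sophisticated versions of the same basic idea.

```python
# Illustrative only: each mapping below encodes an analyst's judgement call
# about which raw titles and locations should be treated as equivalent.

ROLE_TO_LEVEL = {
    "Senior Software Engineer": "senior",
    "Software Engineer 3": "senior",   # judgement call: treated as the same seniority
    "Lead React Engineer": "senior",   # judgement call: "Lead" here maps to senior
}

LOCATION_TO_MARKET = {
    "London, UK": "london",
    "Soho, London": "london",          # judgement call: same hiring market
    "South East, UK": "south-east-uk", # judgement call: a different market
}

def normalise(row: dict) -> dict:
    """Map a raw reported row onto normalised level and market keys."""
    return {
        "company": row["company"],
        "level": ROLE_TO_LEVEL.get(row["role"], "unknown"),
        "market": LOCATION_TO_MARKET.get(row["location"], "unknown"),
        "salary": row["salary"],
    }

raw = [
    {"company": "Foo Ltd", "role": "Senior Software Engineer", "location": "London, UK", "salary": 80_000},
    {"company": "Bar PLC", "role": "Software Engineer 3", "location": "South East, UK", "salary": 100_000},
    {"company": "Baz AB", "role": "Lead React Engineer", "location": "Soho, London", "salary": 60_000},
]

for row in raw:
    print(normalise(row))
```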

The net result of these problems is that different organisations have their own proprietary weights and formulas that they believe deliver the best results, and this forms part of the IP these organisations use to differentiate themselves.

Problem: benchmark data is usually 6-12 months out of date

One of the biggest mistakes people make when using benchmark data is to assume it's a snapshot of today's market, when it's actually a snapshot of the market at some point in the past, usually 6-12 months ago in my experience. This might not be a concern: most job markets don't move significantly in those periods, but some really do.

There are two reasons why this data problem occurs:

  1. Data collection, normalisation and distribution is complex. As mentioned above, complex questions need to be answered about this data, often at the row level, and doing that is expensive, so most organisations do it infrequently.
  2. Companies and individuals report historic data. Reported data usually covers people who have been in the job for more than 12 months, so the salary they agreed reflects the market of more than 12 months ago.

Number (2) is a real problem with salary benchmark data. For example, consider the following three companies as the source of benchmark data for a specific job:

| Company | No. of roles | Average tenure (years) |
|---------|--------------|------------------------|
| Foo Ltd | 12 | 2.5 |
| Bar PLC | 12 | 3.3 |
| Baz AB | 6 | 0.7 |

In this case, the data you built your naive benchmark on would be well over two years old on average. It's of course more complicated than this (people receive pay increases during their tenure, for example), but it highlights one of the significant causes of inaccuracy in benchmarking data.
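As a rough sketch of the arithmetic (in Python, using the illustrative figures from the table above), weighting each company's average tenure by the number of roles it reports gives the approximate age of the data underpinning the benchmark:

```python
# Illustrative figures: (company, no. of reported roles, average tenure in years).
# Weighting tenure by role count approximates how old the salary data
# behind the benchmark really is.
reported = [
    ("Foo Ltd", 12, 2.5),
    ("Bar PLC", 12, 3.3),
    ("Baz AB", 6, 0.7),
]

total_roles = sum(roles for _, roles, _ in reported)
weighted_age = sum(roles * tenure for _, roles, tenure in reported) / total_roles

print(f"Approximate age of the underlying salary data: {weighted_age:.1f} years")
# With these figures, roughly 2.5 years
```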

Benchmarking organisations have all kinds of tricks to reduce this issue, like weighting benchmarks towards more recent hires, but that only reduces the challenges with the data; it doesn't remove them.

Individual reported data also has this difficulty, but for more human reasons. People who report their salaries have a strong bias towards being dissatisfied with their compensation (e.g. you think you're not paid enough, so you want to find out how much other people make). There's a strong correlation between people thinking they're badly paid and people actually being badly paid.

However, salary dissatisfaction doesn't usually occur right after you take a job, as (in theory) you've been in the market recently and took a competitive offer; rather, it happens after a period of time (e.g. "I've been here two years and haven't been promoted or given a raise"). This means reported data from individuals is also over-representative of compensation agreed historically.

This can be a real problem in practice, for example, salary benchmarks for hot job markets may lead organisations to believe they're more competitive with their compensation than they really are.

Problem: benchmarks under-represent top-end salaries

Organisations that collect benchmark data have a challenge collecting compensation data at the top end. The reason for this is pretty intuitive: if you don't see an issue with how much you pay people or are paid, you're less likely to contribute your data to see how you compare to your peers.

This happens because there's a strong correlation between organisations that pay at the top end of a hiring market and organisations that already have significant knowledge of the competitive dynamics of that market. They don't need third parties to tell them they're doing a good job; they already know. This is true for individual reporting too: those who earn at the top end are less likely to need to know what market rates are, as they already know, so they're less likely to contribute their salary to a benchmark.

This has the effect of significantly under-representing compensation at the top-end of salary benchmarks. In the industry, this is known as the "fat middle" problem, where the majority of benchmark data comes from the low-to-mid segments of a job market.

This can have significant implications for organisations that use benchmark data. For example, if you want to pay in the top quartile, it's unlikely that any benchmark data you can get your hands on is an accurate representation of competitive pay in the top quartile. Say you were basing your benchmark on the following, very naive, data:

| Company | Role | Salary | Tracked? |
|---------|------|--------|----------|
| Foo Ltd | Senior Engineer | £60,000 | Yes |
| Foo Ltd | Senior Engineer | £62,000 | Yes |
| Foo Ltd | Senior Engineer | £58,000 | Yes |
| Bar PLC | Senior Engineer | £100,000 | No |
| Bar PLC | Senior Engineer | £110,000 | No |
| Bar PLC | Senior Engineer | £90,000 | No |
| Baz AB | Senior Engineer | £70,000 | Yes |
| Baz AB | Senior Engineer | £73,000 | Yes |
| Baz AB | Senior Engineer | £69,000 | Yes |

If you took the average salary from those who did report to the benchmark, it would be around £65k; taking the unreported salaries into account, it would be closer to £77k, meaning the benchmark understates the market average by roughly 15%. This usually prompts someone to say "well duh, that's how averages work?", but it illustrates a real difficulty in salary benchmarking: the companies that pay the most report their data much less often, and so do the individuals who are paid the most.
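Here's the same arithmetic as a short Python sketch, using the illustrative figures from the table above ("tracked" marks whether a salary actually made it into the benchmark):

```python
# Salaries from the naive example table; the boolean marks whether the
# salary was reported to (tracked by) the benchmark.
salaries = [
    (60_000, True), (62_000, True), (58_000, True),       # Foo Ltd (reported)
    (100_000, False), (110_000, False), (90_000, False),  # Bar PLC (not reported)
    (70_000, True), (73_000, True), (69_000, True),       # Baz AB (reported)
]

tracked = [s for s, reported in salaries if reported]
benchmark_mean = sum(tracked) / len(tracked)
true_mean = sum(s for s, _ in salaries) / len(salaries)

print(f"Benchmark average: £{benchmark_mean:,.0f}")    # ~£65,300
print(f"Whole-market average: £{true_mean:,.0f}")      # ~£76,900
print(f"Understatement: {(true_mean - benchmark_mean) / true_mean:.0%}")  # ~15%
```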

Without understanding this, using benchmarking data naively can mean grossly underestimating the compensation for a particular role in a particular market. As you'd imagine, this is particularly acute where there's a wide distribution of salaries.

Problem: self-reporting leads to bad data

There are numerous scholarly articles on why self-reporting leads to bad data, and the problem exists in salary benchmarking too. Companies contribute only selective salary data, individuals exaggerate their salaries; there are plenty of issues, but you broadly end up with data that (at a row level) isn't terribly trustworthy.

One of the ways to solve that problem is looking at aggregates across large data sets, to smooth out the individual untrustworthy contributions, but this leads to the next issue: most benchmarks are based on tiny amounts of data.

Problem: no statistical significance

If you enjoy making people uncomfortable, ask the sales person from the salary benchmarking organisation you're working with about the p-value of a particular role. Don't be surprised if they skilfully try to avoid the question. Furthermore, don't be surprised when they eventually tell you how small the sample size is for their benchmarks.

Frankly, these tiny sample sizes are one of the industry's dirty secrets, alongside how much salary benchmarking can reinforce structural inequality, but that's for a different blog post.

The tiny sample sizes of most salary benchmarks mean they have a huge general bias problem (not least for the specific reasons mentioned above). This small-sample issue is particularly acute at the top end (as discussed above). I've seen some of the most trusted benchmarking organisations talk confidently about roles and locations based on single-digit sample sizes with a variance of more than 50%. It's wild.
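To see why this matters, here's a quick Python sketch with a made-up sample of six reported salaries: even a rough confidence interval around the mean is enormous at this sample size.

```python
# A hypothetical benchmark "cell": six reported salaries for one role in one location.
import statistics
from math import sqrt

sample = [52_000, 61_000, 58_000, 95_000, 70_000, 64_000]

n = len(sample)
mean = statistics.mean(sample)
standard_error = statistics.stdev(sample) / sqrt(n)

# Rough 95% interval using t ≈ 2.57 for 5 degrees of freedom
t_value = 2.571
low, high = mean - t_value * standard_error, mean + t_value * standard_error

print(f"Sample mean: £{mean:,.0f} (n={n})")
print(f"Rough 95% confidence interval: £{low:,.0f} to £{high:,.0f}")
# With these figures, the interval spans tens of thousands of pounds
```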

This is one of the areas where individual contribution has real advantages over company contributions, as you can (in theory) get much larger sample sizes to smooth out any noise.

Problem: benchmarks haven't caught up to global remote working

In salary benchmarks, there are usually three filters you can apply: normalised job title, normalised job seniority, and normalised location. In a world where everyone worked in an office in a single location, this made sense: salaries were very location-specific (naively, you'd expect to earn a lot more as a Foo Technician in New York than in Nebraska) and broadly depended on people being within commuting distance of that office.

As with much else in the world, global remote working is having a massive impact on salary benchmarking. No longer can Foo Ltd only hire Software Engineers who live within 90 minutes of central London; they can hire someone who lives anywhere on Earth.

Salary benchmarks, and the organisations that collect them, tend to be very location-specific (e.g. "we have great data on the US, or Argentina, or France"). This is usually because that's where they have the best relationships with companies who report their data to them ("we know plenty of people in Paris, so we have great data in Paris"). This makes it difficult to get salary benchmarks that are helpful for hiring remote workers globally: you might have great data in Paris, but what is a competitive salary in Perth?

This is once again a collection and normalisation problem: normalising in a city like New York is hard enough, so how do you do it across every city on Earth? How do you build enough relationships in all of those cities to get a statistically significant sample size?
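One hypothetical way this gap shows up in practice (a sketch only, not how any particular vendor works) is falling back to broader geographies when a city has no data, which quietly widens the error bars:

```python
# Made-up benchmark data and fallback rules, purely to illustrate the gap:
# when a city isn't covered, a broader (and less accurate) geography is used.
BENCHMARKS = {
    # (normalised role, location) -> illustrative median salary
    ("senior-software-engineer", "paris"): 75_000,
    ("senior-software-engineer", "france"): 62_000,
    ("senior-software-engineer", "western-europe"): 68_000,
}

FALLBACKS = {
    "perth": ["australia", "oceania"],
    "lyon": ["france", "western-europe"],
}

def lookup(role: str, location: str):
    """Return the most specific benchmark available, plus the geography actually used."""
    for candidate in [location, *FALLBACKS.get(location, [])]:
        if (role, candidate) in BENCHMARKS:
            return BENCHMARKS[(role, candidate)], candidate
    return None, None

print(lookup("senior-software-engineer", "lyon"))   # falls back to country-level data
print(lookup("senior-software-engineer", "perth"))  # no data at all -> (None, None)
```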

Some larger benchmarking organisations already have this data at scale; McKinsey does, for example, after spending decades collecting it. But even those with a lead in this area have a further bias problem: most of this global data comes from global companies. You can find out how much Accenture pays Software Engineers in 20 countries, but is that indicative of the actual local market, or more a product of some outlier policy of levelling people globally?

So is salary benchmarking useless?

No. When understood and applied appropriately, it can be a useful tool for understanding your relative position in a hiring market, but without that understanding and nuance, salary benchmarking can lead to Very Bad™ outcomes.

Benchmarking data should be used as one data source among several, a tool that contributes to understanding, not gospel to be read from.