When you read transfer pricing guidance or court decisions, benchmarking often looks neat and deterministic: pick a database, run a search, calculate IQR, job done.
But if you talk to practitioners who live in Excel and database interfaces every day, the reality is much more nuanced. There are rules, yes - but there is also judgment about sample size, data quality, loss-makers, capital structure, and how far a “range” can stretch before it stops being meaningful.
In late 2025 we ran the TP Benchmarking Practices Survey to capture that reality. A group of experienced practitioners from around the world - mostly mid‑senior and senior - shared how they actually approach ranges, multi‑year data, loss‑makers, and capital adjustments in day‑to‑day work. This article walks through the most interesting patterns we saw and what they mean for policy, documentation, and tooling.
1. Who answered the survey
Before we talk about the results, it’s worth understanding the lens.
Most respondents were advisors or consultants (around four out of five), with in‑house practitioners and a small number of tax authority officials rounding out the sample. Roughly 60% reported at least eight years of TP experience, so the responses skew towards people who have run dozens, if not hundreds, of benchmarks.
Geographically, the survey is anchored in Europe but globally distributed. Europe accounts for roughly two‑thirds of responses, with strong representation from Spain, Poland, Ukraine, the Nordics, and smaller markets like Cyprus and Latvia. Asia‑Pacific (including India and Southeast Asia), the Middle East (heavily UAE), North America, and Latin America make up the remainder.
This is not a random sample of the profession. It’s a set of experienced practitioners who chose to respond - people who are often opinionated about methodology and deeply involved in the “how” of benchmarking, not just the final charts.
Survey reach by region
Europe: 63.6% · Asia-Pacific: 18.2% · Middle East: 9.1% · North America: 7.3% · Latin America: 3.6%
Europe accounts for roughly two-thirds of responses, with strong secondary clusters in Asia-Pacific and the Middle East, and smaller but meaningful participation from North and Latin America.
These results reflect how experienced practitioners work when they can design their own approach - not a universal picture of every TP study.
2. A quiet consensus: the core benchmarking toolkit
Despite regional differences, there is a remarkably stable “default toolkit” that most respondents converge on.
2.1 IQR, three years, and weighted averages
Across the sample, three practices show up again and again:
Interquartile range as the main range
Three years of financial data in the set
Weighted average PLIs across years
In other words, if you shadow a typical practitioner on a typical benchmark, the most likely pattern is:
Build a three‑year set of comparables.
Require at least two of those three years of data to keep a company in.
Compute weighted PLIs across the available years.
Use the IQR (25th–75th percentile) as the primary range for narrative and testing.
This is broadly consistent with how many firms teach benchmarking internally and with the expectations embedded in a lot of audit practice - and it matches the technical recommendations in our benchmarking methodology content.
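To make that default pattern concrete, here is a minimal Python sketch of the weighted-average PLI and IQR steps. The company names and figures are invented for illustration, and numpy’s default percentile interpolation is used as a stand-in for Excel’s PERCENTILE.INC; it is a sketch of the common workflow, not a prescribed implementation.

```python
# Minimal sketch of the default toolkit: weighted-average PLIs per comparable,
# then an interquartile range. Company names and figures are illustrative only.
import numpy as np

# Three years of operating profit and sales per comparable (illustrative data).
comparables = {
    "Comp A": {"op_profit": [120, 90, 110], "sales": [2400, 2100, 2300]},
    "Comp B": {"op_profit": [45, 60, None], "sales": [900, 1000, None]},   # one missing year
    "Comp C": {"op_profit": [300, 280, 310], "sales": [5100, 5000, 5200]},
    "Comp D": {"op_profit": [15, 22, 18],    "sales": [700, 750, 720]},
}

def weighted_pli(op_profit, sales):
    """Weighted-average operating margin: total profit over total sales,
    using only the years where both figures are available."""
    pairs = [(p, s) for p, s in zip(op_profit, sales) if p is not None and s is not None]
    return sum(p for p, _ in pairs) / sum(s for _, s in pairs)

plis = np.array([weighted_pli(c["op_profit"], c["sales"]) for c in comparables.values()])

# IQR as the primary range (numpy's default interpolation behaves like PERCENTILE.INC).
q1, median, q3 = np.percentile(plis, [25, 50, 75])
print(f"IQR: {q1:.2%} to {q3:.2%}, median {median:.2%}")
```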
Multi-year benchmarking practices
Financial years used: 3 years 92.7% · 1 year 5.5% · 5 years 1.8%
PLI computation method: weighted average 81.8% · simple average 18.2%
The vast majority of practitioners use 3-year datasets with weighted average PLIs, forming a de facto standard for multi-year benchmarking.
2.2 The "2 out of 3 years" middle ground
We also see convergence around minimum years of data. The majority of respondents require 2 out of 3 years of data to keep a company in the set. A smaller group insists on 3 out of 3, and a minority are comfortable with 1 out of 3.
That middle position is interesting. It recognises that:
Requiring 3/3 years is often too strict in markets where financial reporting is patchy or histories are short.
Accepting 1/3 years as a rule risks building ranges on extremely thin evidence.
Most practitioners seem to settle on: “I’ll tolerate some missing years, but not at the expense of plausibility.” That’s a useful reference point if you’re drafting internal policy or guidance.
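As a rough illustration of how that middle ground translates into a screen, here is a hedged sketch. The comparables and years are hypothetical; the threshold is the policy lever.

```python
# A hedged sketch of the "2 out of 3 years" completeness screen.
# `year_data` maps each illustrative comparable to the fiscal years
# for which usable financials exist.
year_data = {
    "Comp A": [2021, 2022, 2023],
    "Comp B": [2022, 2023],        # one missing year - kept under a 2/3 rule
    "Comp C": [2023],              # dropped unless the policy accepts 1/3
}

MIN_YEARS = 2   # policy choice: 2/3 is the survey's most common answer

retained = [name for name, years in year_data.items() if len(years) >= MIN_YEARS]
print(retained)  # ['Comp A', 'Comp B']
```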
3. Where practitioners diverge: the interesting edges
The more interesting story is not what everyone agrees on, but where opinions (and practices) split. Three areas stood out:
Definition of the range itself.
How percentiles are computed.
How loss‑makers and incomplete data sets are handled in practice.
3.1 Range definitions: IQR plus a “grey zone”
If you look at the headline numbers, the picture is clean: roughly three quarters of respondents use IQR as their primary range. But when you dig deeper, a meaningful minority report using:
Range 35-65, and
Full ranges (min–max) in at least some contexts.
In our dataset, Range 35-65 shows up almost exclusively in responses from India, with a single “other” jurisdiction also reporting its use.
In practice, many teams:
Build internal policies around IQR for global consistency,
Use Range 35-65 in India (where this convention is particularly common) or in other markets where narrower ranges feel more intuitive or conservative, and
Sometimes present full ranges alongside IQR when they want to tell a story about outliers or dispersion.
Range types used in practice
Share of respondents using each range definition.
The underlying message: IQR is the shared language, but it’s not the only dialect. If you run global TP policy, it’s worth acknowledging the “grey zone” of ranges that experienced practitioners use when local practice or the fact pattern demands it.
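To see how much the choice of convention matters, the short sketch below applies all three range definitions to the same invented set of margins; only the percentile bounds change.

```python
# Illustrative comparison of three range conventions applied to the same
# (made-up) set of comparable margins. Only the percentile bounds differ.
import numpy as np

margins = np.array([0.018, 0.025, 0.031, 0.034, 0.041, 0.047, 0.052, 0.068, 0.091])

iqr    = np.percentile(margins, [25, 75])   # interquartile range (25th-75th)
p35_65 = np.percentile(margins, [35, 65])   # narrower "Range 35-65" convention
full   = margins.min(), margins.max()       # full range (min-max)

print(f"IQR:         {iqr[0]:.1%} - {iqr[1]:.1%}")
print(f"Range 35-65: {p35_65[0]:.1%} - {p35_65[1]:.1%}")
print(f"Full range:  {full[0]:.1%} - {full[1]:.1%}")
```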
3.2 Percentiles: INC, EXC, and jurisdiction‑specific rules
Percentile computation is where consensus really breaks down.
At a global level, Excel’s PERCENTILE.INC leads, but not by enough to be called a true standard. Respondents report using:
PERCENTILE.INC
PERCENTILE.EXC
US / jurisdiction‑specific methods, and
An “other” bucket that includes custom formulas or combinations.
From the comments, two patterns appear:
Practitioners are very aware that INC vs EXC can move quartiles by real amounts, especially in small samples.
Many explicitly choose percentile methods based on sample size and regulatory expectations, not just personal preference.
If you’re building benchmarking tools or templates, you probably want configurable percentile methods, with clear explanations baked into the output. The survey results suggest that “hard‑coding INC” will be misaligned with how a large chunk of the market actually works.
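For reference, here is a hedged sketch of the two Excel conventions implemented directly, so the difference is visible on a small sample. It follows the published PERCENTILE.INC and PERCENTILE.EXC ranking formulas rather than any particular library’s API; the sample margins are invented.

```python
def percentile_inc(values, p):
    """Excel-style PERCENTILE.INC: rank = p * (n - 1), with linear interpolation."""
    x = sorted(values)
    n = len(x)
    r = p * (n - 1)
    k, d = int(r), r - int(r)
    return x[k] if d == 0 else x[k] + d * (x[k + 1] - x[k])

def percentile_exc(values, p):
    """Excel-style PERCENTILE.EXC: rank = p * (n + 1); undefined near the tails."""
    x = sorted(values)
    n = len(x)
    r = p * (n + 1)
    if r < 1 or r > n:
        raise ValueError("p is outside the range PERCENTILE.EXC can interpolate")
    k, d = int(r), r - int(r)
    return x[k - 1] if d == 0 else x[k - 1] + d * (x[k] - x[k - 1])

sample = [0.021, 0.034, 0.042, 0.055, 0.078]   # illustrative margins, n = 5

print(percentile_inc(sample, 0.25))  # 0.034  (INC lower quartile)
print(percentile_exc(sample, 0.25))  # 0.0275 (EXC lower quartile - noticeably lower)
```

On a five-company sample the two lower quartiles already differ by more than half a margin point, which is exactly the sensitivity respondents flag for small samples.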
Percentile methods in use
Overall share of INC, EXC, US / jurisdiction rules, and other approaches.
3.3 Loss‑makers: far more nuanced than “exclude by default”
Loss‑making comparables are a classic flashpoint between taxpayers and authorities. What we see in the data is a profession that has moved well beyond blunt rules.
At a high level, around 60% of respondents describe their policy as “case‑by‑case”, with smaller groups in the “exclude” and “include” camps. But when you look at the actual rules people use, a clearer structure appears:
Many respondents exclude comparables that are loss‑making in “most years”,
A smaller group exclude only if “all years” are loss‑making, and
Several mention looking at multi‑year average PLI rather than individual years.
The comments add more colour:
Some teams are stricter when the study is used for price setting (planning), and more open to including loss‑makers when testing actual outcomes.
Others keep loss‑makers that are qualitatively strong comparables in terms of functions, assets, and risks, even if their numbers are messy.
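As an illustration of how those guardrails can be written down, here is a minimal sketch of the screens respondents describe. The margins are invented, and in practice these rules are a starting point for judgment, not a substitute for it.

```python
# Hedged sketch of common loss-maker screens reported in the survey: drop
# comparables that are loss-making in most years, only when every year is
# loss-making, or when the multi-year average PLI is negative.
def loss_years(margins):
    """Count the years with a negative margin."""
    return sum(1 for m in margins if m < 0)

def exclude_most_years(margins):
    """Exclude if losses occur in more than half of the available years."""
    return loss_years(margins) > len(margins) / 2

def exclude_all_years(margins):
    """Exclude only if every available year is loss-making."""
    return loss_years(margins) == len(margins)

def exclude_average_negative(margins):
    """Alternative screen some respondents mention: look at the multi-year average."""
    return sum(margins) / len(margins) < 0

candidate = [-0.02, 0.01, -0.04]          # losses in 2 of 3 years
print(exclude_most_years(candidate))      # True  - dropped under a "most years" rule
print(exclude_all_years(candidate))       # False - kept under an "all years" rule
```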
How practitioners treat loss-making comparables
A. Overall stance on loss-makers: case-by-case 60.0% · exclude 27.3% · include 7.3% · other 5.5%
B. Exclusion rules when loss-makers are screened out: loss-making in most years 72.9% · loss-making in all years 27.1%
Panel B is based on respondents who reported using a specific rule for excluding loss-makers (35 with "most years" vs 13 with "all years").
The picture that emerges is not “loss‑makers are fine” or “loss‑makers are banned”, but “we use guardrails, then exercise judgment inside those guardrails.” That is very different from how loss‑makers are sometimes described in audits or litigation, where both sides often simplify their own position for rhetorical reasons.
4. Capital adjustments: important, but not default
If you read technical literature, capital adjustments can look like a standard part of benchmarking. In actual practice, they appear to be a targeted exception.
Across the sample:
Roughly four in ten respondents say they rarely apply capital adjustments,
A similar share say they sometimes apply them, and
Only a very small group say they regularly do so.
This pattern holds across different types of respondents.
From the comments, you can see why:
Practitioners are wary of over‑engineering adjustments that make the numbers look tidy but break economic intuition.
There is a strong emphasis on “math that makes sense”: if an adjustment takes you from a messy but realistic picture to something neat but implausible, most respondents would rather keep the mess and explain it.
If you are designing playbooks or internal templates, the practical implication is simple: treat capital adjustments as a specialised tool, not a checkbox that must be ticked in every study to look “sophisticated.”
5. How this fits into the broader benchmarking story
One way to read these results is: “nothing surprising - everyone uses IQR and three‑year sets.” But that misses the more interesting point.
What the survey really shows is a profession that:
Has coalesced around a shared set of tools (IQR, three‑year data, weighted PLIs, minimum years), and
Uses those tools with context‑sensitive judgment around the edges (ranges, percentiles, loss‑makers, capital structure).
If you compare this to “clean” benchmarking examples in guidance, you see a gap. Textbooks often focus on the core toolkit, but stay vague on the judgment calls practitioners must make every day: “Is this sample big enough for a range?”, “Do we drop this loss‑maker?”, “Does this adjustment actually improve comparability?”
Those are exactly the decisions we wanted to surface.
These results align with what we've seen building ArmsLength AI: any serious benchmarking product has to support not just the clean core workflow, but also the messy decision points around percentiles, ranges, data completeness, and loss‑makers.
How a benchmark becomes a narrative
From messy raw data to defendable ranges and explanations.
1. Start with the data you actually have
Every benchmark begins with a messy reality: how many comparables you found, how clean their financials are, and which jurisdictions and years you need to satisfy. Those constraints quietly shape what is even possible.
2. Choose how to turn the data into a range
Once you understand the data, you choose how to turn it into a range: which percentile method, how many years, and what to do with loss-makers or extreme outliers. This is where policy and judgment come in. (Range & percentile method · multi-year & weighted averages · loss-makers & adjustments)
3. Turn the numbers into a story you can defend
Finally, you decide what to show: the range, any points you rely on, and the explanation that ties it back to the facts. Good documentation makes these choices explicit and repeatable across years. (Final ranges & points · narrative & rationale · documentation & consistency)
In practice, the "methodology" is the combination of these three layers: the data you have, the policy choices you make, and the story you're prepared to stand behind in front of an auditor.
If you’re rethinking your own benchmarking standards or tool stack, this is where we’d start:
Document your core defaults explicitly.
For example: “We use 3‑year weighted PLIs with IQR and 2/3‑year completeness.”
Decide where you allow flexibility - and why.
When do you accept 1/3 years?
When does full range make more sense than 25–75?
How do you treat loss‑makers in planning vs testing?
Make sure your tools and templates can express those choices clearly.
Both internally (for consistency) and in documentation (for audit defence).
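One lightweight way to make those defaults and exceptions expressible in tools and templates is to encode them as a small, versioned policy object. The sketch below is purely illustrative: the field names and values are assumptions for the example, not a prescribed schema or the ArmsLength AI configuration format.

```python
# Hedged sketch of benchmarking defaults written down as an explicit policy
# object, so the methodology is visible and reusable across studies.
from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkingPolicy:
    range_type: str = "IQR"               # e.g. "IQR", "35-65", "full"
    percentile_method: str = "INC"        # e.g. "INC", "EXC", or a jurisdiction rule
    years: int = 3
    min_years_required: int = 2           # the "2 out of 3" completeness rule
    pli_averaging: str = "weighted"       # "weighted" or "simple"
    loss_maker_rule: str = "most_years"   # "most_years", "all_years", "case_by_case"

GLOBAL_DEFAULTS = BenchmarkingPolicy()
INDIA_OVERRIDE = BenchmarkingPolicy(range_type="35-65")   # a local-convention exception
```

Keeping the global defaults and the documented exceptions side by side makes the flexibility deliberate rather than accidental.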
6. Where we’re taking this next
We’re already using the survey internally to pressure‑test how ArmsLength AI handles:
Range configuration (IQR vs alternative ranges, including Range 35-65 and full ranges).
Percentile methods (INC, EXC, and jurisdiction‑specific rules).
Multi‑year completeness (2/3 vs 3/3 and beyond).
Loss‑maker policies, including persistent‑loss rules and planning vs testing contexts.
Capital adjustments, with clear decision support rather than one‑click “fixes.”
Most of these choices are already supported in ArmsLength AI today - for example, you can choose between Excel-style percentile functions like PERCENTILE.INC and PERCENTILE.EXC, toggle between weighted and simple average PLIs, and configure multi-year completeness thresholds and loss-maker rules per study.
ArmsLength AI benchmarking configuration panel showing range type, percentile method, multi-year, and loss-maker options
Our goal is to make it easy for teams to:
Encode their policy choices in the platform, and
Generate outputs that are transparent enough to show to authorities, not just technically correct.
7. Frequently Asked Questions
How representative is this survey of the broader TP profession?
The sample is relatively small but skewed towards experienced practitioners: around 60% have at least eight years of TP experience, and most are working in advisory roles. Europe is over‑represented, but there is meaningful participation from Asia‑Pacific, the Middle East, North America, and Latin America.
That means the results are best read as “how a globally active, senior‑leaning slice of the market works”, rather than a statistically perfect cross‑section of every practitioner in every jurisdiction.
Does this survey suggest that IQR is “mandatory” in benchmarking?
No - but it does show that IQR is the de facto standard for many practitioners. It remains the most common way to define an arm’s length range, especially when combined with three‑year, weighted‑average PLIs.
At the same time, the non‑trivial use of Range 35-65 and full ranges suggests that practitioners adapt to local expectations and fact patterns, and that there is still room for judgment around how “tight” or “wide” a defensible range should be.
How should firms think about INC vs EXC vs jurisdiction‑specific percentile methods?
The survey reinforces something most practitioners already feel: there is no universal standard for percentile functions. Respondents juggle INC, EXC, US‑specific quartiles, and other formulas depending on the jurisdiction and sample.
Practically, that means:
Your internal policies should be explicit about when each method is acceptable, and
Your tools should make percentile methods configurable and transparent, not hard‑coded.
What do these findings imply for handling loss‑making comparables?
The clearest message is that purely mechanical rules are rare. Practitioners gravitate towards:
Case‑by‑case decisions, bounded by rules like “most years loss‑making” or “average PLI negative,” and
Different thresholds for planning vs testing contexts.
If your current policy is simply “exclude any loss‑maker,” this survey suggests you may be too strict compared to peer practice, especially in markets or industries where loss‑making is cyclical rather than structural.
How can smaller teams apply these insights without over‑complicating their process?
You don’t need a complex framework to benefit from the survey.
For most teams, three practical steps go a long way:
Write down your default choices (IQR, years, PLIs, loss‑maker rules) in a short internal note.
Pick one or two areas where you allow exceptions - for example, loss‑makers or sample completeness - and describe the conditions clearly.
Align your templates and tools (including any use of ArmsLength AI) with those choices, so the methodology is visible in every deliverable.
That gets you most of the benefit of what the survey respondents are doing, without adding unnecessary bureaucracy.
How does this tie into ArmsLength AI’s product roadmap?
Directly. We’re using the survey as one of the inputs to:
Prioritise which configuration levers we expose in the benchmarking workflow.
Design explanations and audit‑ready narratives around ranges, percentiles, and loss‑makers.
Benchmark our defaults against real‑world practices, rather than purely theoretical guidance.
The aim is that, when you run a benchmark through ArmsLength AI, the methodology feels familiar - and the platform supports the same kinds of judgment calls that experienced practitioners are already making manually, with most of these levers already available in the product today.