Borys Ulanenko
CEO of ArmsLength AI

When you read transfer pricing guidance or court decisions, benchmarking often looks neat and deterministic: pick a database, run a search, calculate IQR, job done.
But if you talk to practitioners who live in Excel and database interfaces every day, the reality is much more nuanced. There are rules, yes - but there is also judgment about sample size, data quality, loss-makers, capital structure, and how far a “range” can stretch before it stops being meaningful.
In late 2025 we ran the TP Benchmarking Practices Survey to capture that reality. A group of experienced practitioners from around the world - mostly mid‑senior and senior - shared how they actually approach ranges, multi‑year data, loss‑makers, and capital adjustments in day‑to‑day work. This article walks through the most interesting patterns we saw and what they mean for policy, documentation, and tooling.
If you want the underlying methodology for benchmarking itself, you may find it helpful to read this alongside the more formal Benchmarking Study Guide and our piece on IQR calculation in transfer pricing.
Before we talk about the results, it’s worth understanding the lens.
Most respondents were advisors or consultants (around four out of five), with in‑house practitioners and a small number of tax authority officials rounding out the sample. Roughly 60% reported at least eight years of TP experience, so the responses skew towards people who have run dozens, if not hundreds, of benchmarks.
Geographically, the survey is anchored in Europe but globally distributed. Europe accounts for roughly two‑thirds of responses, with strong representation from Spain, Poland, Ukraine, the Nordics, and smaller markets like Cyprus and Latvia. Asia‑Pacific (including India and Southeast Asia), the Middle East (heavily UAE), North America, and Latin America make up the remainder.
This is not a random sample of the profession. It’s a set of experienced practitioners who chose to respond - people who are often opinionated about methodology and deeply involved in the “how” of benchmarking, not just the final charts.
Visualising where benchmarking practices are coming from.
These results reflect how experienced practitioners work when they can design their own approach - not a universal picture of every TP study.
Despite regional differences, there is a remarkably stable “default toolkit” that most respondents converge on.
Across the sample, three practices show up again and again: the interquartile range as the default arm's length range, multi-year (typically three-year) datasets with weighted-average profit level indicators, and a minimum-completeness rule of two out of three years.
In other words, if you shadow a typical practitioner on a typical benchmark, the most likely pattern is: pull three years of data, keep companies with at least two reported years, compute weighted-average PLIs, and present the interquartile range.
This is broadly consistent with how many firms teach benchmarking internally and with the expectations embedded in a lot of audit practice - and it matches the technical recommendations in our benchmarking methodology content.
Charts: dataset timeframes and profit level indicator calculation.
We also see convergence around minimum years of data. The majority of respondents require 2 out of 3 years to keep a company. A smaller group insists on 3 out of 3, and a minority are comfortable with 1 out of 3.
That middle position is interesting. It recognises that real-world data is often incomplete, so demanding all three years can shrink the sample dramatically, while a company with only one reported year says very little about its typical performance.
Most practitioners seem to settle on: “I’ll tolerate some missing years, but not at the expense of plausibility.” That’s a useful reference point if you’re drafting internal policy or guidance.
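As a concrete illustration of that default toolkit, here is a minimal Python/pandas sketch, with hypothetical companies and column names, of a 2-out-of-3 completeness filter followed by a weighted-average PLI (pooled EBIT over pooled sales, rather than a simple average of yearly margins):

```python
import pandas as pd

# Hypothetical comparables: one row per company-year.
data = pd.DataFrame({
    "company": ["A", "A", "A", "B", "B", "C"],
    "year":    [2021, 2022, 2023, 2022, 2023, 2023],
    "ebit":    [12.0, 15.0, 9.0, 4.0, 6.0, 2.0],
    "sales":   [200.0, 210.0, 190.0, 80.0, 90.0, 50.0],
})

MIN_YEARS = 2  # the "2 out of 3 years" completeness rule

# Keep only companies with enough reported years (drops company C).
year_counts = data.groupby("company")["year"].nunique()
keep = year_counts[year_counts >= MIN_YEARS].index
filtered = data[data["company"].isin(keep)]

# Weighted-average operating margin: pooled EBIT over pooled sales.
sums = filtered.groupby("company")[["ebit", "sales"]].sum()
weighted_pli = sums["ebit"] / sums["sales"]

# Simple average of yearly margins, for comparison.
simple_pli = (filtered["ebit"] / filtered["sales"]).groupby(filtered["company"]).mean()

print(weighted_pli)
print(simple_pli)
```

The weighted version lets bigger years dominate, which is usually the intent; the simple average treats every year equally. Both show up in practice, which is why the toggle matters.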
The more interesting story is not what everyone agrees on, but where opinions (and practices) split. Three areas stood out:
If you look at the headline numbers, the picture is clean: roughly three quarters of respondents use IQR as their primary range. But when you dig deeper, a meaningful minority report using narrower ranges such as the 35th-65th percentile range ("Range 35-65") or, in some cases, the full min-max range.
In our dataset, Range 35-65 shows up almost exclusively in responses from India, with a single "other" jurisdiction also reporting its use.
In practice, many teams:
- build internal policies around IQR for global consistency,
- use Range 35-65 in India (where this convention is particularly common) or other markets where narrower ranges feel more intuitive or conservative, and
- sometimes present full ranges alongside IQR when they want to tell a story about outliers or dispersion.
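To make the different definitions concrete, here is a short numpy sketch, using hypothetical margin data, computing the three ranges side by side:

```python
import numpy as np

# Hypothetical weighted-average operating margins, one per comparable.
margins = np.array([0.021, 0.034, 0.041, 0.048, 0.055, 0.063, 0.072, 0.090])

# Interquartile range (Excel PERCENTILE.INC behaviour = numpy's default "linear").
iqr = np.percentile(margins, [25, 75])

# Narrower "Range 35-65" seen mainly in India in our data.
p35_65 = np.percentile(margins, [35, 65])

# Full range: simply the minimum and maximum of the set.
full = margins.min(), margins.max()

print(f"IQR:         {iqr[0]:.3f} - {iqr[1]:.3f}")
print(f"Range 35-65: {p35_65[0]:.3f} - {p35_65[1]:.3f}")
print(f"Full range:  {full[0]:.3f} - {full[1]:.3f}")
```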
Share of respondents using each range definition.
The underlying message: IQR is the shared language, but it’s not the only dialect. If you run global TP policy, it’s worth acknowledging the “grey zone” of ranges that experienced practitioners use when local practice or the fact pattern demands it.
Percentile computation is where consensus really breaks down.
At a global level, Excel's PERCENTILE.INC leads, but not by enough to be called a true standard. Respondents report using:
- PERCENTILE.INC,
- PERCENTILE.EXC,
- US or other jurisdiction-specific quartile rules, and
- other approaches.
From the comments, two patterns appear:
If you’re building benchmarking tools or templates, you probably want configurable percentile methods, with clear explanations baked into the output. The survey results suggest that “hard‑coding INC” will be misaligned with how a large chunk of the market actually works.
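For teams that want to replicate both Excel conventions outside Excel, here is a minimal numpy sketch. The mapping below (PERCENTILE.INC matching numpy's default "linear" method, PERCENTILE.EXC matching the "weibull" method) is a commonly cited equivalence and assumes numpy 1.22 or later, where these method names exist:

```python
import numpy as np

margins = np.array([0.021, 0.034, 0.041, 0.048, 0.055, 0.063, 0.072, 0.090])

# Excel PERCENTILE.INC: linear interpolation between order statistics
# (numpy's default, method="linear", a.k.a. Hyndman-Fan type 7).
q1_inc, q3_inc = np.percentile(margins, [25, 75], method="linear")

# Excel PERCENTILE.EXC: Hyndman-Fan type 6, numpy's method="weibull".
q1_exc, q3_exc = np.percentile(margins, [25, 75], method="weibull")

print(f"INC quartiles: {q1_inc:.4f} - {q3_inc:.4f}")
print(f"EXC quartiles: {q1_exc:.4f} - {q3_exc:.4f}")
```

On the same data, EXC produces a slightly wider interquartile range than INC, which is exactly the kind of difference that surprises teams when two offices benchmark the same comparables.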
Overall share of INC, EXC, US / jurisdiction rules, and other approaches.
Loss‑making comparables are a classic flashpoint between taxpayers and authorities. What we see in the data is a profession that has moved well beyond blunt rules.
At a high level, around 60% of respondents describe their policy as "case-by-case", with smaller groups in the "exclude" and "include" camps. But when you look at the actual rules people use, a clearer structure appears: among those who apply a specific exclusion rule, most drop companies that are loss-making in most of the tested years, while a smaller group only excludes companies that are loss-making in every year.
The comments add more colour:
Overall stance and the rules used when exclusion is on the table.
Panel B is based on respondents who reported using a specific rule for excluding loss-makers (35 with "most years" vs 13 with "all years").
The picture that emerges is not “loss‑makers are fine” or “loss‑makers are banned”, but “we use guardrails, then exercise judgment inside those guardrails.” That is very different from how loss‑makers are sometimes described in audits or litigation, where both sides often simplify their own position for rhetorical reasons.
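To make the two guardrails concrete, here is a small pandas sketch, with hypothetical EBIT figures, contrasting the "most years" exclusion rule with the more permissive "all years" rule:

```python
import pandas as pd

# Hypothetical yearly EBIT for four comparables over three years.
ebit = pd.DataFrame({
    "A": [5.0, 7.0, 6.0],     # never loss-making
    "B": [-2.0, 3.0, 4.0],    # one loss year
    "C": [-1.0, -3.0, 2.0],   # losses in most years
    "D": [-4.0, -2.0, -1.0],  # losses in every year
}, index=[2021, 2022, 2023])

loss_years = (ebit < 0).sum()   # number of loss-making years per company
n_years = len(ebit)

# Rule 1: exclude companies loss-making in MOST years (more than half).
keep_most = loss_years[loss_years <= n_years / 2].index.tolist()

# Rule 2: exclude only companies loss-making in ALL years.
keep_all = loss_years[loss_years < n_years].index.tolist()

print("'Most years' rule keeps:", keep_most)   # ['A', 'B']
print("'All years' rule keeps:", keep_all)     # ['A', 'B', 'C']
```

Neither rule replaces judgment; in practice, they define the set of borderline cases that still get a case-by-case look.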
If you read the technical literature, capital adjustments can look like a standard part of benchmarking. In actual practice, across the sample they appear to be a targeted exception rather than a routine step.
This pattern holds across different types of respondents.
From the comments, you can see why:
If you are designing playbooks or internal templates, the practical implication is simple: treat capital adjustments as a specialised tool, not a checkbox that must be ticked in every study to look “sophisticated.”
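For readers who want to see the mechanics, here is a deliberately simplified Python sketch of one common variant, the working capital adjustment. The working-capital definition (receivables plus inventory minus payables), the single flat interest rate, and all figures are illustrative assumptions; real studies differ on each of these:

```python
# A simplified working capital adjustment, for illustration only.
# Adjusts the comparable's margin towards the tested party's
# working-capital intensity, using an assumed short-term rate.

INTEREST_RATE = 0.04  # assumed short-term interest rate

def wc_intensity(receivables: float, inventory: float,
                 payables: float, sales: float) -> float:
    """Working capital (receivables + inventory - payables) as a share of sales."""
    return (receivables + inventory - payables) / sales

tested_wc = wc_intensity(receivables=30, inventory=20, payables=15, sales=200)
comp_wc = wc_intensity(receivables=10, inventory=5, payables=8, sales=150)

comp_margin = 0.050  # comparable's unadjusted operating margin

# Compensate the comparable for the difference in WC intensity.
adjusted_margin = comp_margin + INTEREST_RATE * (tested_wc - comp_wc)
print(f"Adjusted margin: {adjusted_margin:.4f}")
```

The point of the sketch is the shape of the calculation, not the parameters: every input is a judgment call, which is part of why many practitioners reserve the adjustment for cases where it clearly improves comparability.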
One way to read these results is: “nothing surprising - everyone uses IQR and three‑year sets.” But that misses the more interesting point.
What the survey really shows is a profession that shares a stable core toolkit, but exercises real judgment around sample size, data quality, loss-makers, and how ranges are built.
If you compare this to “clean” benchmarking examples in guidance, you see a gap. Textbooks often focus on the core toolkit, but stay vague on the judgment calls practitioners must make every day: “Is this sample big enough for a range?”, “Do we drop this loss‑maker?”, “Does this adjustment actually improve comparability?”
Those are exactly the decisions we wanted to surface.
These results align with what we've seen building ArmsLength AI: any serious benchmarking product has to support not just the clean core workflow, but also the messy decision points around percentiles, ranges, data completeness, and loss‑makers.
From messy raw data to defensible ranges and explanations.
Every benchmark begins with a messy reality: how many comparables you found, how clean their financials are, and which jurisdictions and years you need to satisfy. Those constraints quietly shape what is even possible.
Once you understand the data, you choose how to turn it into a range: which percentile method, how many years, and what to do with loss-makers or extreme outliers. This is where policy and judgment come in.
Finally, you decide what to show: the range, any points you rely on, and the explanation that ties it back to the facts. Good documentation makes these choices explicit and repeatable across years.
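Put together, the three steps are small enough to sketch end to end. This is a minimal Python illustration with hypothetical data, not the ArmsLength AI implementation; every cut-off below is a policy choice:

```python
import numpy as np
import pandas as pd

# Step 1: understand the data - one row per company-year (hypothetical).
raw = pd.DataFrame({
    "company": ["A","A","A","B","B","C","C","C","D","D","D"],
    "year":    [2021,2022,2023,2022,2023,2021,2022,2023,2021,2022,2023],
    "ebit":    [12.0,15.0,9.0,4.0,6.0,-1.0,-3.0,2.0,8.0,7.0,9.0],
    "sales":   [200,210,190,80,90,60,55,65,150,140,160],
})

# Step 2: turn it into a range - completeness, loss-makers, PLI, percentiles.
counts = raw.groupby("company")["year"].nunique()
raw = raw[raw["company"].isin(counts[counts >= 2].index)]   # 2-of-3 rule

loss_years = raw.assign(loss=raw["ebit"] < 0).groupby("company")["loss"].sum()
total_years = raw.groupby("company")["year"].nunique()
keep = loss_years[loss_years <= total_years / 2].index      # "most years" rule
raw = raw[raw["company"].isin(keep)]

sums = raw.groupby("company")[["ebit", "sales"]].sum()
pli = sums["ebit"] / sums["sales"]                          # weighted-average PLI

q1, median, q3 = np.percentile(pli, [25, 50, 75])           # IQR, INC-style

# Step 3: decide what to show, and explain it.
print(f"n = {len(pli)}; median = {median:.2%}; IQR = {q1:.2%} to {q3:.2%}")
```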
If you’re rethinking your own benchmarking standards or tool stack, we’d start with the same choices the survey surfaced: percentile methods, range definitions, multi-year completeness, and loss-maker treatment.
We’re already using the survey internally to pressure-test how ArmsLength AI handles each of these decision points.
Most of these choices are already supported in ArmsLength AI today - for example, you can choose between Excel-style percentile functions like PERCENTILE.INC and PERCENTILE.EXC, toggle between weighted and simple average PLIs, and configure multi-year completeness thresholds and loss-maker rules per study.
ArmsLength AI benchmarking configuration panel showing range type, percentile method, multi-year, and loss-maker options
Our goal is to make it easy for teams to apply their preferred methodology consistently and to make the judgment calls behind each range explicit and repeatable.
The sample is relatively small but skewed towards experienced practitioners: around 60% have at least eight years of TP experience, and most are working in advisory roles. Europe is over‑represented, but there is meaningful participation from Asia‑Pacific, the Middle East, North America, and Latin America.
That means the results are best read as “how a globally active, senior‑leaning slice of the market works”, rather than a statistically perfect cross‑section of every practitioner in every jurisdiction.
No - but it does show that IQR is the de facto standard for many practitioners. It remains the most common way to define an arm’s length range, especially when combined with three‑year, weighted‑average PLIs.
At the same time, the non‑trivial use of Range 35-65 and full ranges suggests that practitioners adapt to local expectations and fact patterns, and that there is still room for judgment around how “tight” or “wide” a defensible range should be.
The survey reinforces something most practitioners already feel: there is no universal standard for percentile functions. Respondents juggle INC, EXC, US‑specific quartiles, and other formulas depending on the jurisdiction and sample.
Practically, that means documenting which percentile function you use and why, and keeping your templates and tools configurable rather than hard-coded to a single method.
If you want a deeper dive into the mechanics, our article on IQR calculation in transfer pricing walks through the technical options.
The clearest message is that purely mechanical rules are rare. Practitioners gravitate towards guardrail-style rules - such as excluding companies that are loss-making in most or all of the tested years - combined with case-by-case judgment inside those guardrails.
If your current policy is simply “exclude any loss‑maker,” this survey suggests you may be too strict compared to peer practice, especially in markets or industries where loss‑making is cyclical rather than structural.
You don’t need a complex framework to benefit from the survey.
For most teams, three practical steps go a long way: write down your default range definition and percentile method (and when you deviate from them), make your multi-year completeness and loss-maker rules explicit, and treat capital adjustments as a documented exception rather than a default step.
That gets you most of the benefit of what the survey respondents are doing, without adding unnecessary bureaucracy.
Directly. We’re using the survey as one of the inputs to our methodology defaults and product roadmap.
The aim is that, when you run a benchmark through ArmsLength AI, the methodology feels familiar - and the platform supports the same kinds of judgment calls that experienced practitioners are already making manually, with most of these levers already available in the product today.