Borys Ulanenko
CEO of ArmsLength AI

Transfer pricing documentation requires specific citations to the OECD Guidelines. Tax authorities expect paragraph numbers. Audit defense often hinges on exact quotes.
When you ask ChatGPT about DEMPE functions or safe harbours, it gives you a reasonable answer. When you ask for the exact paragraph reference and text, it gives you something that looks like a citation:
"According to OECD Transfer Pricing Guidelines ¶1.6, the arm's length principle requires that conditions of transactions between associated enterprises be consistent with those between independent enterprises..."
The paragraph number exists. The quote sounds right. But when you check the actual ¶1.6, it says something completely different.
We built an OECD Guidelines API because we suspected this was happening. This benchmark confirms it and measures how severe the problem is across different AI configurations.
Before we present the findings, it's worth acknowledging progress. The default ChatGPT experience today (GPT-5.2 with web search enabled) is considerably more reliable than it was a year ago. When it has access to web search, it can often find and cite OECD text correctly.
Our benchmark shows ChatGPT with web search achieving 83% overall accuracy. That's not perfect, but it's a real improvement over earlier models.
The problem is that many professional applications don't use ChatGPT with web search. Internal tools, agents, and automations typically run on GPT-4.1 or similar models without internet access. These models rely entirely on training data, and that's where citation accuracy breaks down.
We ran 10 transfer pricing questions through three configurations:
| Configuration | Model | Tools | Use Case |
|---|---|---|---|
| Baseline | GPT-4.1 | None | Internal tools, agents, automations |
| Web Search | GPT-5.2 | Web Search | Default ChatGPT experience |
| OECD API | GPT-5.2 | OECD Guidelines API | RAG-grounded citation |
The questions covered DEMPE functions, safe harbours, method selection, intangibles, and specific paragraph lookups. We verified every citation against the actual OECD Transfer Pricing Guidelines 2022 text.
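To make the setup concrete, here is a minimal sketch of how a single baseline run could look. The model name, prompt, and client wiring are illustrative, not the benchmark's actual harness; the point is simply that the baseline configuration gets the question and nothing else, with no tools or retrieved text.

```python
# Minimal sketch of one baseline run: one question, no tools, no web access.
# Model name and prompt are illustrative, not the benchmark's actual harness.
from openai import OpenAI

client = OpenAI()

QUESTION = (
    "Quote the exact text of OECD Transfer Pricing Guidelines 2022 "
    "paragraph 1.6 on the arm's length principle."
)

def run_baseline(question: str) -> str:
    """Ask the model with no retrieval; it must answer from training data alone."""
    response = client.chat.completions.create(
        model="gpt-4.1",          # baseline configuration from the table above
        messages=[{"role": "user", "content": question}],
        temperature=0,            # deterministic output so runs are comparable
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(run_baseline(QUESTION))
```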
The OECD API configuration achieved 99% accuracy. ChatGPT with web search scored 83%. The baseline model without web access scored 28%.
| Configuration | Citation accuracy | Quote accuracy | Gap |
|---|---|---|---|
| OECD API | 100% | 99% | 1pp |
| Web Search | 97% | 66% | 31pp |
| Baseline | 58% | 9% | 49pp |
ChatGPT with web search performed well on most questions. It found indexed OECD text on third-party sites and provided verbatim quotes with correct paragraph references.
However, results varied by question. On some queries, it refused to provide full quotes, citing copyright concerns. On others, it gave only short fragments that weren't useful for documentation.
A tool that works well most of the time but refuses unpredictably isn't reliable enough when you need to cite a specific paragraph in a tax authority filing.
Models without web access (GPT-4.1 and similar) performed poorly. We expected some paraphrasing. What we found was outright fabrication.
| Failure Type | Description | Count |
|---|---|---|
| Fabricated quotes | Quote doesn't exist in cited paragraph | 4 |
| Wrong paragraph | Correct concept, wrong paragraph number | 2 |
| Invented references | Paragraph doesn't mention the claimed content | 2 |
| Total failures | | 8/10 |
Example: ALP Definition (Q03)
We asked for the exact text of ¶1.6.
The model claimed ¶1.6 says:
"Under the arm's length principle, the conditions of transactions between associated enterprises should not differ from those which would be made between independent enterprises..."
Actual ¶1.6 text:
"The authoritative statement of the arm's length principle is found in paragraph 1 of Article 9 of the OECD Model Tax Convention..."
The model didn't paraphrase. It fabricated a quote that sounds plausible but doesn't exist in that paragraph. A professional citing this in documentation would be providing incorrect information.
Example: Interquartile Range (Q08)
The model claimed ¶3.55 discusses "statistical tools, such as the interquartile range" and ¶3.56 mentions "a statistical range (e.g. the interquartile range)."
In reality, only ¶3.57 mentions interquartile range in the entire Guidelines. The model invented IQR references in adjacent paragraphs.
Models without web access understand transfer pricing concepts well. They can explain DEMPE, discuss method selection, and describe why R&D performers might deserve more than cost-plus compensation. But they don't have the actual OECD Guidelines text, so they reconstruct what they think paragraphs say. These reconstructions are plausible but often wrong.
The API configuration achieved 99% accuracy by retrieving actual paragraph text before responding. It also found paragraphs that other configurations missed.
For the R&D/IP question, the API cited ¶6.79:
"Compensation based on a reimbursement of costs plus a modest mark-up will not reflect the anticipated value of, or the arm's length price for, the contributions of the research team in all cases."
Neither the baseline model nor web search cited this paragraph. They discussed the concepts but missed the specific authoritative text.
The OECD API provides three tools to the model:
- `oecd_search`: semantic and keyword search across all paragraphs
- `oecd_get_paragraphs`: retrieve specific paragraphs by reference
- `oecd_context_pack`: get curated bundles for a topic

The model searches for concepts, retrieves actual paragraph text, and responds with verifiable citations. Every quote comes from the authoritative source.
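As a rough sketch, the three tools could be exposed to the model through standard function calling. Only the tool names come from the article; the parameter schemas and descriptions below are assumptions for illustration.

```python
# Hypothetical tool definitions exposing the OECD API to a function-calling model.
# Tool names are from the article; parameter schemas are assumptions.
OECD_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "oecd_search",
            "description": "Semantic and keyword search across all OECD TPG paragraphs.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "oecd_get_paragraphs",
            "description": "Retrieve specific paragraphs by reference, e.g. '1.6' or '6.79'.",
            "parameters": {
                "type": "object",
                "properties": {
                    "references": {"type": "array", "items": {"type": "string"}}
                },
                "required": ["references"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "oecd_context_pack",
            "description": "Get a curated bundle of paragraphs for a topic such as 'DEMPE'.",
            "parameters": {
                "type": "object",
                "properties": {"topic": {"type": "string"}},
                "required": ["topic"],
            },
        },
    },
]
```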
| Configuration | Avg Time | Input Tokens | Notes |
|---|---|---|---|
| Baseline | 15s | ~40 | Fast but unreliable |
| OECD API | 65s | ~15,000 | Best accuracy per token |
| Web Search | 122s | ~45,000 | Slowest, high token usage |
The API approach uses 67% fewer tokens than web search while achieving higher accuracy.
If you use ChatGPT for TP research: Web search mode (the default) is reasonably reliable for finding OECD guidance. But verify quotes before using them in documentation. The model sometimes refuses to quote or provides only fragments.
If you build TP tools: Models without web access fabricate citations. If your application needs to cite OECD paragraphs, you need retrieval augmentation that preserves paragraph structure. Standard RAG that chunks documents arbitrarily won't let you cite "¶1.6" because chunks don't align with official paragraph boundaries.
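A minimal sketch of what paragraph-aligned chunking means in practice is shown below. It assumes you have the Guidelines as plain text in which each paragraph starts with its official number (e.g. "1.6 The authoritative statement..."); the regex and data shape are illustrative, but the idea is that each chunk keeps its official reference so the model can cite "¶1.6" directly.

```python
# Sketch of paragraph-aligned chunking. Assumes plain text where each OECD
# paragraph begins with its number; regex and data shape are illustrative.
import re

PARA_START = re.compile(r"^(\d+\.\d+)\s", re.MULTILINE)

def split_by_official_paragraphs(text: str) -> dict[str, str]:
    """Return {paragraph_ref: paragraph_text}, preserving official OECD numbering."""
    chunks: dict[str, str] = {}
    matches = list(PARA_START.finditer(text))
    for i, m in enumerate(matches):
        ref = m.group(1)                                        # e.g. "1.6"
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chunks[ref] = text[m.start():end].strip()
    return chunks

# Because chunk boundaries coincide with official paragraph boundaries,
# every retrieved chunk can be cited as "¶1.6", "¶6.79", and so on.
```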
If you audit TP documentation: Be aware that AI-generated OECD citations may not match the source. Consider requesting page numbers from the official PDF or evidence that citations have been verified.
We tested 10 questions across four categories:
| Category | What We Tested |
|---|---|
| Direct Citation | Can the model cite specific paragraphs accurately? |
| Trap Questions | Does the model invent guidance on topics not covered? |
| Technical Interpretation | Can the model synthesize method selection guidance? |
| Multi-Step Reasoning | Can the model cite multiple relevant sections? |
For each response, we extracted paragraph references, retrieved actual text via the OECD API, and compared claimed quotes against the source. Responses were scored as accurate (exact match), partial (paraphrase), or fabricated (no match).
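The verification step can be sketched roughly as follows. The reference-extraction regex, the word-overlap rule, and the 0.6 threshold are illustrative assumptions, not the benchmark's actual scoring criteria, and the real comparison against retrieved OECD text may well have involved manual review.

```python
# Rough sketch of citation verification. The regex, overlap rule, and threshold
# are illustrative assumptions, not the benchmark's actual scoring criteria.
import re

def extract_refs(answer: str) -> list[str]:
    """Pull paragraph references like '1.6' or '¶6.79' out of a model answer."""
    return re.findall(r"¶?\s*(\d+\.\d+)", answer)

def score_quote(claimed_quote: str, actual_text: str) -> str:
    """Classify a claimed quote against the retrieved paragraph text."""
    if claimed_quote.lower() in actual_text.lower():
        return "accurate"                           # verbatim match
    quote_words = set(claimed_quote.lower().split())
    para_words = set(actual_text.lower().split())
    overlap = len(quote_words & para_words) / max(len(quote_words), 1)
    return "partial" if overlap > 0.6 else "fabricated"
```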
The OECD Guidelines API used in this benchmark is publicly available. You can request an API key and integrate verified OECD citations into your own tools and workflows.