XXE: Why a 2003 Bug Class Still Sits in Modern XML Parsers
In the autumn of 2013, a Brazilian researcher named Reginaldo Silva was poking at Facebook's password-reset flow when he discovered an OpenID endpoint that accepted XML. He fed it a payload containing an external entity reference pointing at a local file, and the response came back containing the contents of /etc/passwd. From there he escalated to a full out-of-band entity that exfiltrated arbitrary files over HTTP, then to an SSRF primitive that let him reach Facebook's internal services. He stopped short of pursuing remote code execution and reported the finding through the bug bounty program. Facebook paid him 33,500 USD, the largest single bounty the program had awarded at the time, and patched the parser within hours of triage. The 2014 disclosure became the textbook example of XML External Entity injection because every link in the chain was a default behavior of a respected XML library, applied to user-controlled input, in a feature nobody had marked as risky.
The vulnerability class is older than most production engineers. It was formally described in 2003, made the OWASP Top 10 in 2017 as A4, and quietly persists today across SOAP web services, SAML SSO endpoints, document upload pipelines, and the underbelly of every office-document parser on the planet. This article walks through how XXE actually works at the parser level, what vulnerable and fixed code looks like in Python and Java, why the bug class refuses to die, and what to do about it before the next office-document upload turns into an exfiltration channel.
What XXE Actually Is
XML defines a feature called entities: short reusable references that the parser expands into longer values during parsing. Most are harmless, such as &amp; for an ampersand and &lt; for a less-than sign. The XML 1.0 specification also permits external entities, declared inside a <!DOCTYPE> block, that point at an arbitrary URI. When the parser encounters a reference to such an entity, it fetches the URI and substitutes the response into the document. The supported schemes vary by parser but commonly include file://, http://, https://, ftp://, and on some Java distributions jar:, netdoc:, and gopher://. The attacker writes a DOCTYPE that declares an entity pointing at a file on disk or a URL on the internal network, then references that entity somewhere in the document body where the parsed value flows back into the response.
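To make the expansion mechanic concrete, here is a minimal sketch using only Python's standard-library ElementTree. It declares a harmless internal entity (the names `company` and `Example Corp` are made up for illustration) and shows the parser substituting its value during parsing. An external entity uses the same reference syntax; only the declaration differs.

```python
import xml.etree.ElementTree as ET

# An internal entity: the replacement text lives in the document itself.
# The entity name and its value are illustrative placeholders.
DOC = (
    '<!DOCTYPE note [<!ENTITY company "Example Corp">]>'
    "<note>Invoice from &company;</note>"
)

root = ET.fromstring(DOC)
# The parser replaced the &company; reference before the application saw the text.
expanded_text = root.text
print(expanded_text)
```

The application code never sees the reference, only the substituted value, which is why an attacker-controlled declaration is enough to steer what ends up in the parsed tree.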
That single feature gives the attacker arbitrary file read on the server, SSRF against internal endpoints unreachable from the public internet, and — through parameter entities and out-of-band exfiltration — a way to lift files even when the parsed value never appears in the HTTP response. The CVE-2021-26277 advisory against libexpat is one recent reminder that even hardened C parsers continue to ship XXE-adjacent regressions; many SOAP stacks built on top of older Xerces or .NET XML libraries inherited unsafe defaults that linger in maintenance-mode applications years after the original developers moved on. The SAML ecosystem deserves a special mention: a long sequence of XXE issues in SAML SSO providers and consumers over the past decade traces back to the same root cause — XML parsing with default settings on an attacker-controlled assertion.
The Vulnerable Python Parser You Have Probably Shipped
Python's lxml library is fast, widely deployed, and has changed its defaults more than once over the years. Older versions resolved external entities by default; newer versions disable network fetches but still permit local-file resolution unless you pass the right flags. The pattern below appears in countless internal tools that import XML documents from users:
# VULNERABLE
from lxml import etree

def parse_user_document(file_path: str):
    parser = etree.XMLParser()  # default flags vary across versions
    tree = etree.parse(file_path, parser=parser)
    return tree.getroot()

On a vulnerable build, feeding this function a document that declares <!ENTITY xxe SYSTEM "file:///etc/passwd"> and references &xxe; in any text node returns the file contents through the parsed tree. The fix is not to upgrade and hope; it is to set the parser flags explicitly and never trust defaults to remain safe across library upgrades, language packagers, or the next CVE that re-enables a feature for compatibility:
# FIXED
from lxml import etree

def parse_user_document(file_path: str):
    parser = etree.XMLParser(
        resolve_entities=False,  # do not expand entity references
        no_network=True,         # no network access while parsing
        load_dtd=False,          # do not load external DTDs
        huge_tree=False,         # keep libxml2's built-in size limits
    )
    tree = etree.parse(file_path, parser=parser)
    return tree.getroot()

resolve_entities=False stops the parser from expanding entity references at all, which closes the file-read and SSRF primitives directly. no_network=True blocks any URL scheme that would issue a network call, even for legitimate-looking schema fetches. load_dtd=False refuses to load external DTDs, which is the path most out-of-band exfiltration techniques use. huge_tree=False keeps libxml2's built-in limits on tree depth and text size in place, which is the cheap defense against the entity-expansion bombs we discuss below. For applications that need to parse XML but require none of these features, prefer defusedxml, which wraps the standard parsers with safe defaults and refuses to silently re-enable risky behavior on upgrade.
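Where neither lxml nor defusedxml is available, a reject-on-DOCTYPE policy can be approximated with the standard library alone. The sketch below is an assumption-laden illustration, not how any particular library implements it: `safe_fromstring` and `DoctypeForbidden` are hypothetical names, and the document is parsed twice (once for the policy check, once for the tree), which trades some speed for simplicity.

```python
import xml.etree.ElementTree as ET
import xml.parsers.expat


class DoctypeForbidden(ValueError):
    """Raised when an untrusted document declares a DOCTYPE (hypothetical name)."""


def safe_fromstring(data: bytes) -> ET.Element:
    """Parse untrusted XML, refusing any document that carries a DOCTYPE."""
    guard = xml.parsers.expat.ParserCreate()

    def reject(name, system_id, public_id, has_internal_subset):
        # Called by expat as soon as it sees <!DOCTYPE ...>.
        raise DoctypeForbidden(f"DOCTYPE {name!r} is not allowed")

    guard.StartDoctypeDeclHandler = reject
    guard.Parse(data, True)  # first pass: policy check only, result discarded
    return ET.fromstring(data)  # second pass: build the tree
```

Without a DOCTYPE there is nowhere to declare an entity, so this one check rules out external entities and expansion bombs together, at the cost of rejecting documents that legitimately ship a DTD.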
The Java DocumentBuilder That Reads Your Filesystem
Java's DocumentBuilderFactory is the most-deployed XML parser in enterprise software, sitting underneath SOAP stacks, SAML libraries, configuration loaders, XSLT transformers, and a thousand internal report generators. Its defaults across most JDK and runtime distributions still permit DOCTYPE declarations and external entity resolution, which means the textbook vulnerable shape continues to ship in fresh code:
// VULNERABLE
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public Document parseRequest(InputStream xmlInput) throws Exception {
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    DocumentBuilder db = dbf.newDocumentBuilder();
    return db.parse(xmlInput);
}

This four-line handler will resolve external entities, follow file:// URIs into the local disk, and reach http:// URLs into the internal network. The hardening is mechanical but verbose, and the verbosity is the reason it is so often skipped:
// FIXED
import java.io.InputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public Document parseRequest(InputStream xmlInput) throws Exception {
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    // Reject any document that declares a DOCTYPE.
    dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
    // Defense in depth: turn off external entities and external DTD loading.
    dbf.setFeature("http://xml.org/sax/features/external-general-entities", false);
    dbf.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
    dbf.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
    dbf.setXIncludeAware(false);
    dbf.setExpandEntityReferences(false);
    // An empty protocol list forbids all external DTD and schema access.
    dbf.setAttribute(XMLConstants.ACCESS_EXTERNAL_DTD, "");
    dbf.setAttribute(XMLConstants.ACCESS_EXTERNAL_SCHEMA, "");
    DocumentBuilder db = dbf.newDocumentBuilder();
    return db.parse(xmlInput);
}

The single most important line is disallow-doctype-decl, which throws a parser exception the moment the document declares a DOCTYPE. If your application has no legitimate reason to accept DTDs (and most do not), this one feature flag closes the entire bug class. The remaining flags are belt-and-suspenders for parsers that interpret disallow-doctype-decl inconsistently or for code paths that switch implementations behind your back. Wrap the configuration in a factory method, share it across the codebase, and write a unit test that asserts the protections still hold after the next dependency upgrade.
The Payload That Walks Out With /etc/passwd
The mechanic is easier to reason about with the actual XML in front of you. The classic file-read payload looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
  <!ENTITY xxe SYSTEM "file:///etc/passwd">
]>
<invoice>
  <customer>&xxe;</customer>
</invoice>

A vulnerable parser sees the DOCTYPE, registers an external entity named xxe bound to the URI file:///etc/passwd, then expands every &xxe; reference by reading that file. If the application reflects the parsed customer field back to the user (in an error message, a confirmation page, or an API response), the file contents arrive on the wire. When the parsed value never reflects, the attacker pivots to out-of-band exfiltration: a parameter entity loads an attacker-hosted DTD, the DTD constructs a second entity that concatenates the file contents into a URL, and the parser fetches that URL, leaking the file in the request path of an HTTP log on the attacker's server. The same primitive substitutes http://internal.metadata.service/credentials for the file:// URI and turns into SSRF without any further work.
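The "registers an external entity" step can be observed without resolving anything. The following sketch uses the standard-library expat bindings to list the external entities a document declares. The function name `list_external_entities` is hypothetical, and the code relies on expat not fetching external entities when no external-entity handler is installed.

```python
import xml.parsers.expat


def list_external_entities(document: bytes) -> list:
    """Return (entity_name, system_id) pairs for every external entity declared."""
    found = []
    parser = xml.parsers.expat.ParserCreate()

    def on_entity_decl(name, is_parameter, value, base,
                       system_id, public_id, notation):
        # Internal entities carry a value; external ones carry a system identifier.
        if system_id is not None:
            found.append((name, system_id))

    parser.EntityDeclHandler = on_entity_decl
    # No ExternalEntityRefHandler is set, so nothing is read from disk or network.
    parser.Parse(document, True)
    return found


PAYLOAD = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
  <!ENTITY xxe SYSTEM "file:///etc/passwd">
]>
<invoice>
  <customer>&xxe;</customer>
</invoice>"""

print(list_external_entities(PAYLOAD))
```

A check like this is useful for logging or triage, since it shows exactly which URI the document asked the parser to dereference; it is not a substitute for disabling the feature in the parser that does the real work.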
Billion Laughs and the Quadratic Cousin
The same XML feature that enables XXE also enables a denial-of-service variant called the billion laughs attack. The payload defines an entity that references another entity, which references another, with each layer multiplying the expansion count. Ten layers of ten-fold expansion turn a one-kilobyte document into ten gigabytes of in-memory string that the parser dutifully tries to materialize, exhausting heap and pinning CPU. A subtler quadratic-blowup variant repeats a single very long entity reference thousands of times, defeating the depth limits some parsers added after the original disclosure while still consuming gigabytes. Modern parsers ship caps for entity expansion count and total expanded size, but the caps are off by default in some configurations and the limits are sometimes set absurdly high. The fix is the same as for XXE: disable DOCTYPE entirely if the application does not need it, and if it does, set explicit and conservative limits.
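The arithmetic behind both variants is short enough to work through. The sketch below assumes an idealized parser that fully materializes every expansion; the helper names are illustrative, and real parsers will usually hit a limit or run out of memory before reaching these totals.

```python
def billion_laughs_size(base_len: int, fanout: int, layers: int) -> int:
    """Bytes produced when each layer references the previous one `fanout` times."""
    return base_len * fanout ** layers


def quadratic_blowup_size(entity_len: int, references: int) -> int:
    """Bytes produced when one long entity is referenced many times, with no nesting."""
    return entity_len * references


# Ten layers of ten-fold expansion over a one-byte innermost string.
ten_by_ten = billion_laughs_size(base_len=1, fanout=10, layers=10)

# The classic shape: a three-byte string ("lol") under nine ten-fold layers.
classic = billion_laughs_size(base_len=3, fanout=10, layers=9)

# Quadratic variant: a 100,000-byte entity referenced 30,000 times.
# The document itself stays in the low hundreds of kilobytes.
quadratic = quadratic_blowup_size(entity_len=100_000, references=30_000)

print(ten_by_ten, classic, quadratic)
```

The nested form grows exponentially in the number of layers, which is why depth limits help against it, while the flat form grows only with the product of entity length and reference count and therefore slips past a depth limit entirely.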
Why XXE Refuses to Die
Three structural reasons keep XXE alive in 2026 despite a decade of OWASP coverage and library hardening. First, legacy SOAP web services and SAML SSO endpoints continue to live behind enterprise authentication portals that nobody has the budget to retire. The XML parsing in those stacks was configured before the safe defaults conversation began, the configuration is buried in framework code several layers below the application, and a developer adding a new endpoint inherits whatever the framework chose in 2011. Second, file upload features that accept Office documents are XML parsers in disguise: a .docx or .xlsx file is a ZIP archive containing XML parts, and any library that opens those parts with an unconfigured parser inherits the same XXE primitive at the moment a user uploads a malicious spreadsheet to a perfectly modern web application.
Third, the safe configuration is verbose and library-specific. Compare the four lines of vulnerable Java to the ten lines of fixed Java; the fixed version requires the developer to know which combination of feature flags actually closes the class on their specific parser, JDK version, and downstream Xerces shim. The path of least resistance is to copy the four-line pattern from a tutorial. Cross-cutting library wrappers like defusedxml for Python and the OWASP XML External Entity Prevention Cheat Sheet help, but they only help in projects where someone has already noticed the problem; the next greenfield service started by an engineer who has never read those documents starts with the same defaults the 2014 Facebook bug ran on.
Detection: Where Each Layer Earns Its Keep
XXE has a fingerprint that static analysis catches cleanly because the dangerous configuration is a finite, enumerable set of API calls: DocumentBuilderFactory.newInstance() without subsequent feature flags, etree.XMLParser() without resolve_entities=False, SAXParserFactory without the disallow-doctype feature, XmlReaderSettings with DtdProcessing.Parse enabled. SAST traces XML parser construction across method boundaries and flags the cases where no hardening flags are set between the factory call and the parse sink — see why data flow analysis matters for the inter-procedural mechanics that catch a parser configured in one method and used in another.
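As a rough illustration of why this class suits static analysis, here is a toy single-file check written with Python's `ast` module. It is a deliberately simplified sketch, not a description of how any SAST product works: it only looks for calls to something named `XMLParser` that lack a literal `resolve_entities=False`, and it does no inter-procedural tracing.

```python
import ast


def find_unhardened_parsers(source: str) -> list:
    """Return line numbers of XMLParser(...) calls lacking resolve_entities=False."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Match both etree.XMLParser(...) and a bare XMLParser(...).
        if isinstance(func, ast.Attribute):
            name = func.attr
        elif isinstance(func, ast.Name):
            name = func.id
        else:
            continue
        if name != "XMLParser":
            continue
        hardened = any(
            kw.arg == "resolve_entities"
            and isinstance(kw.value, ast.Constant)
            and kw.value.value is False
            for kw in node.keywords
        )
        if not hardened:
            findings.append(node.lineno)
    return sorted(findings)


SAMPLE = '''from lxml import etree
bad = etree.XMLParser()
good = etree.XMLParser(resolve_entities=False, no_network=True)
also_bad = etree.XMLParser(no_network=True)
'''

print(find_unhardened_parsers(SAMPLE))  # [2, 4]
```

A real analyzer also has to follow the parser object through assignments, helper functions, and wrapper classes, which is where the data flow analysis mentioned above comes in.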
DAST attacks the running application by submitting XML payloads with out-of-band entity references — pointing at a Burp Collaborator, interactsh, or a self-hosted DNS canary — and watching for callbacks. Fuzzing XML inputs with mutation-aware grammars catches deeply buried parsers in document upload pipelines that the crawler would never reach. SCA flags vulnerable XML libraries in the dependency tree, which catches the cases where the application code is fine but a transitive parser ships a known regression. The cheapest of the four layers is SAST because it runs on the diff that introduced the unsafe parser configuration, before the build is reachable to a tester at all.
Prevention Checklist
Six rules close the overwhelming majority of real-world XXE. Apply them in order; each later rule assumes the earlier ones are in place.
- Disable DOCTYPE entirely if the application does not need it. The single feature flag http://apache.org/xml/features/disallow-doctype-decl on Java parsers closes the whole class for the common case; on lxml, combine load_dtd=False with resolve_entities=False.
- Disable external entity resolution at parser construction. Set resolve_entities=False, external-general-entities false, external-parameter-entities false, and ACCESS_EXTERNAL_DTD empty on every code path that constructs a parser.
- Use libraries with safe defaults instead of fighting unsafe ones. defusedxml in Python and modern System.Xml APIs in .NET ship hardened defaults; prefer them for new code.
- Prefer JSON over XML for new APIs. The XXE bug class does not exist in JSON parsing. If you have a choice for a new endpoint, take the format that does not have a thirty-year history of parser CVEs.
- Audit file upload features that parse XML, DOCX, or XLSX. Office documents are ZIP archives of XML; any document parser is a transitive XML parser. Configure those parsers with the same hardening flags as your application's direct XML code.
- Gate unsafe parser usage in CI with SAST. A pull request that introduces DocumentBuilderFactory.newInstance() without the hardening features should fail the build, not generate a Jira ticket six months later.
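For the upload-audit rule, one possible pre-screen is to open the archive and flag any XML part that declares a DOCTYPE before handing the file to a document library. This is a sketch under stated assumptions: `parts_with_doctype` is a hypothetical helper, the extension filter is a guess at which parts matter, and a production version would also need to bound decompressed size to avoid ZIP bombs.

```python
import io
import xml.parsers.expat
import zipfile


class _DoctypeSeen(Exception):
    """Internal signal used to stop parsing at the first DOCTYPE."""


def _has_doctype(data: bytes) -> bool:
    parser = xml.parsers.expat.ParserCreate()

    def on_doctype(name, system_id, public_id, has_internal_subset):
        raise _DoctypeSeen

    parser.StartDoctypeDeclHandler = on_doctype
    try:
        parser.Parse(data, True)
    except _DoctypeSeen:
        return True
    except xml.parsers.expat.ExpatError:
        # Malformed part; a real pipeline would likely reject the upload outright.
        return False
    return False


def parts_with_doctype(archive_bytes: bytes) -> list:
    """Return names of XML parts inside a ZIP-based document that declare a DOCTYPE."""
    flagged = []
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for name in archive.namelist():
            if name.lower().endswith((".xml", ".rels")) and _has_doctype(archive.read(name)):
                flagged.append(name)
    return flagged
```

Rejecting any upload for which this returns a non-empty list is a coarse policy, but well-formed Office documents have no need for a DOCTYPE in their parts, so false positives should be rare.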
Where GraphNode SAST Fits
GraphNode SAST traces XML parser construction and configuration across 13+ languages, flagging the specific combinations of factory calls and missing feature flags that leave the entity-expansion and external-resolution primitives reachable. Findings surface on the diff that introduced the unsafe parser, with the engineer who wrote it. XXE is most often classified under A05 Security Misconfiguration because the bug is a default the developer did not override; SAST catches the missed override at the only point in the lifecycle where the fix is still a one-line change.
Closing
XXE has a documented fix that fits in a single feature flag. Despite that, it shipped in a Facebook OpenID endpoint in 2013, in libexpat in 2021, in countless SAML SSO integrations across the past decade, and in the next document-upload feature near you. The pattern persists because the fix is library-specific, the defaults are still unsafe in too many parsers, and the seam where XML enters the application is rarely owned by anyone in particular. The teams that stop shipping XXE are the ones that move detection upstream into the diff, gate the parser configuration at code review, and prefer JSON wherever the choice exists. Twenty-three years after the bug class was named, that is still the only place the economics work in the defender's favor.
GraphNode SAST flags unsafe XML parser configurations across 13+ languages — request a demo.