Recently, our analyst team shared their research into a zero-day attack involving the use of corrupted malicious files to bypass static detection systems. Now, we present a technical analysis of this method and its mechanics.
In this article, we will:
- Demonstrate how attackers corrupt archives, office documents, and other files
- Explain how this method successfully evades detection by security systems
- Show how corrupted files get recovered by their native applications
Let’s get started.
Sandbox Analysis of a Corrupted File Attack
To see how such attacks unfold, we can upload one of the corrupted files used by attackers to ANY.RUN’s sandbox.
View analysis session.
Thanks to its interactivity, the sandbox lets us simulate a real scenario of a user opening the broken malicious file inside the file’s corresponding application.
In our case, it’s a docx file. When we open it with Word, the program immediately offers to recover the content of the file and does so successfully.
Inside, we find a QR code with a phishing link. The sandbox also automatically detects malicious activity and notifies us about this.
How Corrupted Files Bypass Antivirus Software and Other Automated Solutions
Analysis inside the ANY.RUN sandbox showed how a corrupted file gets restored thanks to Word’s built-in recovery mechanisms, which allows us to identify its malicious nature.
Yet, if we submit the same corrupted file to VirusTotal, which provides verdicts from numerous security solutions, we will see zero threat detections. The question is why?
The answer is simple: most antivirus software and automated tools lack the recovery functionality built into applications such as Word. As a result, they cannot accurately identify the type of the corrupted file and fail to detect and mitigate the threat.
Docx is not the only file format used by attackers. There are also corrupted archives with malicious files inside, which easily bypass spam filters because security systems cannot view their contents due to corruption.
Once the damaged archive is downloaded onto a system, tools like WinRAR easily restore it, making its contents available to the victim.
Now, let’s see how exactly it works on a technical level.
Technical Analysis of a Corrupted Word Document
The Structure of a Word Document
Since the mid-2000s, office documents have been structured as archives containing the document’s content (OpenOffice.org 2.0, released in 2005, is one example).
In the image below, you can see the structure of a Word document.
As we can see, all structures within this archive are interconnected, and this relationship begins from the end.
At the end of the archive, there is a structure called the End of Central Directory Record (EOCD). This structure contains information about the size of the Central Directory File Header (CDFH), its offset, and the total number of entries in the archive. This structure helps locate the CDFH.
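To make the role of the EOCD more concrete, here is a minimal Python sketch that searches a byte buffer backward for the EOCD signature and unpacks the fields mentioned above. The helper names (`find_eocd`, `build_sample_zip`) and the harmless sample archive are our own assumptions for illustration. The sketch ignores ZIP64 and assumes the last occurrence of the signature is the real record.

```python
import io
import struct
import zipfile
from typing import Optional

EOCD_SIGNATURE = b"PK\x05\x06"
EOCD_FORMAT = "<4sHHHHIIH"  # fixed 22-byte part of the record (little-endian)
EOCD_SIZE = struct.calcsize(EOCD_FORMAT)


def find_eocd(data: bytes) -> Optional[dict]:
    """Search backward for the EOCD record and return its key fields."""
    pos = data.rfind(EOCD_SIGNATURE)
    if pos == -1 or pos + EOCD_SIZE > len(data):
        return None  # no EOCD: a strict parser would give up here
    (_sig, _disk, _cd_disk, _entries_on_disk, total_entries,
     cd_size, cd_offset, _comment_len) = struct.unpack_from(EOCD_FORMAT, data, pos)
    return {
        "eocd_offset": pos,
        "total_entries": total_entries,
        "cd_size": cd_size,      # size of the central directory in bytes
        "cd_offset": cd_offset,  # where the first CDFH starts
    }


def build_sample_zip() -> bytes:
    """Create a small, harmless in-memory archive for demonstration."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("word/document.xml", "<document>hello</document>")
        zf.writestr("[Content_Types].xml", "<Types/>")
    return buf.getvalue()


sample = build_sample_zip()
eocd = find_eocd(sample)
print(eocd)
```

In an intact archive, the bytes at `cd_offset` start with the CDFH signature `PK\x01\x02`, which is how a reader moves from the end of the file to the central directory.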
The CDFH duplicates the metadata stored in each Local File Header (LFH) and records the offset to it. This structure does not contain the compressed data itself but rather represents the hierarchy of files within the archive. It allows you to find the LFH of each file in the archive.
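The link from the central directory to each LFH can be sketched in a few lines of Python. This is an illustrative reader under simplifying assumptions (no ZIP64, single-disk archive), and `list_central_directory` and `build_sample_zip` are hypothetical helper names, not part of any tool mentioned in this article.

```python
import io
import struct
import zipfile

CDFH_SIGNATURE = b"PK\x01\x02"
CDFH_FORMAT = "<4sHHHHHHIIIHHHHHII"  # fixed 46-byte part of each entry
CDFH_SIZE = struct.calcsize(CDFH_FORMAT)


def list_central_directory(data: bytes) -> list:
    """Return (file name, LFH offset) pairs read from the central directory."""
    eocd_pos = data.rfind(b"PK\x05\x06")
    if eocd_pos == -1 or eocd_pos + 22 > len(data):
        return []
    # Entry count, directory size and directory offset sit 10 bytes into the EOCD.
    total_entries, _cd_size, cd_offset = struct.unpack_from("<HII", data, eocd_pos + 10)

    entries = []
    pos = cd_offset
    for _ in range(total_entries):
        if pos + CDFH_SIZE > len(data) or data[pos:pos + 4] != CDFH_SIGNATURE:
            break  # broken link between EOCD and central directory
        fields = struct.unpack_from(CDFH_FORMAT, data, pos)
        name_len, extra_len, comment_len = fields[10], fields[11], fields[12]
        lfh_offset = fields[16]
        name_start = pos + CDFH_SIZE
        name = data[name_start:name_start + name_len].decode("utf-8", "replace")
        entries.append((name, lfh_offset))
        pos = name_start + name_len + extra_len + comment_len
    return entries


def build_sample_zip() -> bytes:
    """Create a small, harmless in-memory archive for demonstration."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("word/document.xml", "<document>hello</document>")
        zf.writestr("[Content_Types].xml", "<Types/>")
    return buf.getvalue()


sample = build_sample_zip()
entries = list_central_directory(sample)
for name, lfh_offset in entries:
    print(name, lfh_offset)
```

Each offset returned here points at a `PK\x03\x04` signature, so a reader that trusts the central directory never has to scan the file body.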
The LFH is considered the header for each file in the archive. It contains important data such as the file name, compressed and uncompressed sizes, CRC32 checksum, and other parameters.
The compressed data is located after the header.
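As a rough sketch of how the LFH and the data that follows it fit together, the Python snippet below parses one header, extracts the compressed bytes, and checks the CRC32. It assumes the sizes are stored in the header itself (no data descriptor) and handles only the "stored" and "deflate" methods. `read_local_entry` and `build_sample_zip` are names we made up for this example.

```python
import io
import struct
import zipfile
import zlib

LFH_SIGNATURE = b"PK\x03\x04"
LFH_FORMAT = "<4sHHHHHIIIHH"  # fixed 30-byte part of the header
LFH_SIZE = struct.calcsize(LFH_FORMAT)


def read_local_entry(data: bytes, offset: int):
    """Parse the LFH at `offset`; return (name, content, crc_matches)."""
    (sig, _version, _flags, method, _mtime, _mdate, crc32,
     comp_size, _uncomp_size, name_len, extra_len) = struct.unpack_from(
        LFH_FORMAT, data, offset)
    if sig != LFH_SIGNATURE:
        raise ValueError("no Local File Header at this offset")

    name_start = offset + LFH_SIZE
    name = data[name_start:name_start + name_len].decode("utf-8", "replace")

    # The compressed data begins right after the name and the extra field.
    data_start = name_start + name_len + extra_len
    raw = data[data_start:data_start + comp_size]

    if method == 8:        # deflate, stored as a raw stream without zlib header
        content = zlib.decompress(raw, -15)
    elif method == 0:      # stored without compression
        content = raw
    else:
        raise ValueError(f"unsupported compression method {method}")
    return name, content, zlib.crc32(content) == crc32


def build_sample_zip() -> bytes:
    """Create a small, harmless in-memory archive for demonstration."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("word/document.xml", "<document>hello</document>")
    return buf.getvalue()


sample = build_sample_zip()
name, content, crc_ok = read_local_entry(sample, 0)
print(name, content, crc_ok)
```

Note that nothing in this function needs the CDFH or the EOCD: given an offset, an LFH is self-describing enough to recover its file.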
How the File Structure Can Be Manipulated by Attackers
As shown in the image above (Figure 1), the archive is read backward, starting from the end, and all of its parts are linked together.
This has led us to test three different hypotheses (Figure 2):
1. Can Word or an archiving program recover and successfully open a file if additional data is added to the beginning of the archive?
2. Can Word or an archiving program recover and successfully open a file if we corrupt the linking between the parts and delete the CDFH, which does not contain the file data itself?
3. Can Word or an archiving program recover and successfully open a file if we corrupt the linking between the parts and erase the EOCD, which is a crucial part of the recovery process?
You can see the results of our hypothesis testing in the table below.
| | Word | ZIP |
|---|---|---|
| Hypothesis 1 | Success | Fail (the file is no longer an archive) |
| Hypothesis 2 | Success | Success |
| Hypothesis 3 | Success (thanks to undamaged Local File Headers) | Success (thanks to undamaged Local File Headers) |
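The third hypothesis can also be looked at from the side of a parser that has no recovery logic. The sketch below uses Python's standard `zipfile` module purely as a stand-in for such a parser; it is not one of the tools we tested, and the helper names and the harmless sample archive are assumptions made for this example.

```python
import io
import zipfile


def build_sample_zip() -> bytes:
    """Create a small, harmless in-memory archive for demonstration."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("word/document.xml", "<document>hello</document>")
        zf.writestr("[Content_Types].xml", "<Types/>")
    return buf.getvalue()


def strip_eocd(data: bytes) -> bytes:
    """Cut the archive right before the End of Central Directory record."""
    pos = data.rfind(b"PK\x05\x06")
    return data if pos == -1 else data[:pos]


def strict_open(data: bytes) -> str:
    """Try to list the archive the way a parser without recovery would."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            return f"opened, {len(zf.namelist())} entries"
    except zipfile.BadZipFile as exc:
        return f"rejected: {exc}"


intact = build_sample_zip()
damaged = strip_eocd(intact)

print("intact :", strict_open(intact))
print("damaged:", strict_open(damaged))
# The Local File Headers are still physically present in the damaged file.
print("LFH signatures left:", damaged.count(b"PK\x03\x04"))
```

The strict parser rejects the damaged file outright even though every Local File Header, and therefore every file, is still present in it.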
During our hypothesis testing, we’ve made several noteworthy observations:
1. For minimal recovery of a Word document, the following files are essential:
- [Content_Types].xml
- word/document.xml
- word/_rels/document.xml.rels
- _rels/.rels
These contain crucial information regarding the relationships between elements and form the standard file hierarchy required for Word to interpret the document.
2. A ZIP archive with corrupted Local File Headers will only show the file structure. The actual file content will be empty.
3. If the end part of the ZIP file is damaged, the archiving software and Word will attempt an alternative recovery method that relies on the intact Local File Headers.
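The observations above suggest what a recovery-aware scanner could do. The following Python sketch is our own approximation, not the actual algorithm used by Word or any archiver: it ignores the EOCD and CDFH entirely, scans for LFH signatures, inflates whatever follows each one, and then checks whether the four essential parts are present. All names (`recover_entries`, `REQUIRED_DOCX_PARTS`, `build_damaged_sample`) are hypothetical, and the sample is a harmless archive damaged in memory.

```python
import io
import struct
import zipfile
import zlib

LFH_SIGNATURE = b"PK\x03\x04"
LFH_FORMAT = "<4sHHHHHIIIHH"
LFH_SIZE = struct.calcsize(LFH_FORMAT)

REQUIRED_DOCX_PARTS = {
    "[Content_Types].xml",
    "word/document.xml",
    "word/_rels/document.xml.rels",
    "_rels/.rels",
}


def _read_entry(data: bytes, offset: int):
    """Parse one LFH and return (name, content); raise on anything odd."""
    if offset + LFH_SIZE > len(data):
        raise ValueError("truncated header")
    (_sig, _ver, _flags, method, _mtime, _mdate, _crc,
     comp_size, _usize, name_len, extra_len) = struct.unpack_from(
        LFH_FORMAT, data, offset)
    name_start = offset + LFH_SIZE
    name = data[name_start:name_start + name_len].decode("utf-8", "replace")
    data_start = name_start + name_len + extra_len
    if method == 8:
        # A deflate stream marks its own end, so the size field is not needed.
        inflater = zlib.decompressobj(-15)
        content = inflater.decompress(data[data_start:])
        if not inflater.eof:
            raise ValueError("truncated deflate stream")
    elif method == 0:
        content = data[data_start:data_start + comp_size]
    else:
        raise ValueError("unsupported compression method")
    return name, content


def recover_entries(data: bytes) -> dict:
    """Scan for LFH signatures anywhere in the buffer and extract each file."""
    recovered = {}
    pos = data.find(LFH_SIGNATURE)
    while pos != -1:
        try:
            name, content = _read_entry(data, pos)
            recovered[name] = content
        except (struct.error, zlib.error, ValueError):
            pass  # false-positive signature or damaged entry: skip it
        pos = data.find(LFH_SIGNATURE, pos + 4)
    return recovered


def build_damaged_sample() -> bytes:
    """Harmless archive with junk prepended and the CDFH and EOCD cut off."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for part in sorted(REQUIRED_DOCX_PARTS):
            zf.writestr(part, f"<part name='{part}'/>")
    intact = buf.getvalue()
    eocd_pos = intact.rfind(b"PK\x05\x06")
    cd_offset = struct.unpack_from("<I", intact, eocd_pos + 16)[0]
    return b"\x00" * 32 + intact[:cd_offset]


damaged = build_damaged_sample()
recovered = recover_entries(damaged)
print(sorted(recovered))
print("minimal Word parts present:", REQUIRED_DOCX_PARTS <= set(recovered))
```

Because the scan works on absolute positions of the signatures, neither the prepended bytes nor the missing end-of-file structures prevent extraction, which mirrors the behavior we observed in Word.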
Our findings demonstrate that Word is more resilient to file corruption than ZIP archiving software. Word successfully recovered files with a corrupted CDFH, a corrupted EOCD, and even with random bytes added to create a non-existent LFH structure. The archiving software failed only in the first hypothesis, where random bytes were added to the beginning of the file.
Why Security Systems Fail to Read Corrupted Files
Security systems identify file types in several ways, including by checking the magic bytes in file headers. Because office documents and ZIP archives are effectively read from the end, it is possible to corrupt the archive structure and the magic bytes while leaving the file recoverable, which makes it difficult for detection systems to identify the file type.
This leads to the inability to unpack and inspect the contents.
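A simplified Python sketch of the problem: a type check that only looks at the first bytes stops recognizing the file as soon as anything is placed in front of it, while a scan of the whole buffer still finds ZIP structures. Both functions are hypothetical illustrations and do not describe how any particular security product works. Signature hits inside arbitrary data can also occur by chance, so such counts are only a hint.

```python
import io
import zipfile


def sniff_by_magic(data: bytes) -> str:
    """Naive type detection based only on the leading magic bytes."""
    if data.startswith(b"PK\x03\x04"):
        return "zip"
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"):
        return "ole2"
    return "unknown"


def count_zip_structures(data: bytes) -> dict:
    """Count ZIP record signatures anywhere in the buffer."""
    return {
        "local_file_headers": data.count(b"PK\x03\x04"),
        "central_directory_headers": data.count(b"PK\x01\x02"),
        "eocd_records": data.count(b"PK\x05\x06"),
    }


def build_sample_zip() -> bytes:
    """Create a small, harmless in-memory archive for demonstration."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("word/document.xml", "<document>hello</document>")
    return buf.getvalue()


intact = build_sample_zip()
shifted = b"\x00" * 16 + intact  # extra bytes in front hide the magic bytes

print(sniff_by_magic(intact), count_zip_structures(intact))
print(sniff_by_magic(shifted), count_zip_structures(shifted))
```

A file whose type is reported as unknown but that still contains Local File Header signatures is a reasonable candidate for a recovery attempt before it is dismissed as unreadable.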
Consider this email with a corrupted Word document.
The sandbox once again has no problem detecting the threat, returning a “malicious activity” verdict.
Yet when the same file is submitted to VirusTotal, it returns almost no threat detections.
Conclusion
Our study revealed a vulnerability in document and archive structures. By manipulating specific components like the CDFH and EOCD, attackers can create corrupted files that are successfully repaired by applications but remain undetected by security software. As a result, security systems do not yet have clear logic for detecting such attacks, which leaves their users exposed.
About ANY.RUN
ANY.RUN helps more than 500,000 cybersecurity professionals worldwide. Our interactive sandbox simplifies malware analysis of threats that target both Windows and Linux systems. Our threat intelligence products, TI Lookup, YARA Search and Feeds, help you find IOCs or files to learn more about the threats and respond to incidents faster.
With ANY.RUN you can:
- Detect malware in seconds
- Interact with samples in real time
- Save time and money on sandbox setup and maintenance
- Record and study all aspects of malware behavior
- Collaborate with your team
- Scale as you need