document sanitization
What is document sanitization?
Document sanitization is the process of cleaning a document to ensure that only the intended information can be accessed from it. In addition to making sure the document text doesn't openly divulge anything it shouldn't, sanitization includes removing metadata that could pose a privacy or security risk.
Document sanitization, sometimes called file sanitization, works by removing hidden content from a document. Besides metadata, the removed content may include document properties, hazardous code such as malicious scripts or backdoors, and previously undetected malware.
 
Sanitization is not the same as redaction. Where sanitization is about removing hidden data and metadata, redaction is about removing private or sensitive information that should only be available to a specific group of people. With sanitization, all hidden information is permanently removed so that the file can be safely passed on. With redaction, certain text gets permanently removed so it is no longer viewable or editable.
Despite these differences, the two activities work well together. Combining sanitization with redaction helps to better protect documents and their information from leaks and breaches.
The importance of document sanitization
Documents often include hidden content that may not have been detected previously. If hackers or cybercriminals are able to access this information, they may be able to use it to steal other types of sensitive data, such as passwords, personally identifiable information or financial information. They might also use the information to embarrass a firm or damage its reputation.
To minimize the risk of such incidents, it's important to find and remove all hidden content in documents, meaning any information that is not intended for distribution. Proper and thorough sanitization helps ensure that potentially sensitive information is not exposed, inadvertently or maliciously, when the document is published or shared. It thus helps protect the organization from data breaches.
Document sanitization and metadata removal
Metadata is often described as "data about data." Different types of metadata provide additional information about a document, such as its author, history or version. Metadata can also contain the names of a document's modifiers, the dates of creation and changes, file size, digital signatures, revision histories, tracked changes, watermarks, headers, footers and comment exchanges among various authors and editors.
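To make the idea concrete, the following is a minimal sketch of how the core properties of a Word, Excel or PowerPoint file could be listed. It assumes the file uses the Office Open XML format, which is a ZIP package that conventionally stores author and date properties in a part named docProps/core.xml. The function name read_core_properties is hypothetical and chosen for illustration only.

```python
import zipfile
import xml.etree.ElementTree as ET

CORE_PART = "docProps/core.xml"  # conventional location of core properties


def read_core_properties(path):
    """Return {property_name: value} for the core properties of an OOXML file."""
    with zipfile.ZipFile(path) as package:
        if CORE_PART not in package.namelist():
            return {}
        root = ET.fromstring(package.read(CORE_PART))
    props = {}
    for element in root:
        # Tags look like "{namespace}creator"; keep only the local name.
        local_name = element.tag.rsplit("}", 1)[-1]
        props[local_name] = (element.text or "").strip()
    return props
```

Calling read_core_properties("report.docx") on a typical file would be expected to return entries such as creator, lastModifiedBy, created and modified. Comments, tracked changes and application properties live in other parts of the package and are not covered by this sketch.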
Metadata, which is usually not obviously visible to the document's authors, viewers and editors, is important for document tracking, classification, analysis and management. It also improves collaboration among users sharing a file or collection of files. But it could also contain sensitive information that might embarrass or damage an organization, so it's important to safeguard metadata from unauthorized access. This requires removing it from the document before it is published, shared or circulated.
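As a rough illustration of metadata removal, the sketch below copies an Office Open XML file and empties its core properties part along the way. It is a simplified example under stated assumptions, not a complete sanitizer: it assumes the properties live in docProps/core.xml, it leaves application and custom properties untouched, and the function name strip_core_properties is hypothetical. The original file is never modified.

```python
import zipfile
import xml.etree.ElementTree as ET

CORE_PART = "docProps/core.xml"
CP_NS = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties"

# Keep the conventional "cp" prefix when the part is written back out.
ET.register_namespace("cp", CP_NS)


def strip_core_properties(src_path, dst_path):
    """Write a copy of src_path to dst_path with all core properties removed."""
    with zipfile.ZipFile(src_path) as src, \
         zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == CORE_PART:
                root = ET.fromstring(data)
                # Drop every property element (author, dates, revision, ...).
                for element in list(root):
                    root.remove(element)
                data = ET.tostring(root, encoding="utf-8", xml_declaration=True)
            dst.writestr(item.filename, data)
```

Writing to a new file rather than editing in place keeps the original available as a backup. The result should still be opened and checked in the authoring application before it is shared.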
 
How to sanitize a document
A common way to remove metadata and other hidden information from a document is to convert it to PDF format and then follow the sanitization procedure provided by the specific PDF application used.
For example, the procedure for sanitizing PDF documents in Adobe Acrobat is as follows:
- Access the Redact top menu and select Sanitize Document.
- To remove all hidden information, select OK.
- To remove specific pieces of hidden information, select Click Here.
Non-PDF documents can also be sanitized. Microsoft Word and Excel, for example, both have built-in features to discover and remove hidden content such as metadata.
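The kind of check such features perform can be approximated in a short script. The sketch below is not how Word or Excel implement their inspection; it is a simplified illustration that assumes a .docx package with the conventional part names (word/document.xml, word/comments.xml, docProps/custom.xml) and the conventional "w" namespace prefix. The function name find_hidden_content is hypothetical.

```python
import zipfile

# Byte markers for tracked insertions and deletions in the document body.
# The trailing space avoids matching unrelated tags such as <w:delText>.
TRACKED_CHANGE_TAGS = (b"<w:ins ", b"<w:del ")


def find_hidden_content(path):
    """Return a list of findings describing hidden content in a .docx file."""
    findings = []
    with zipfile.ZipFile(path) as package:
        names = set(package.namelist())
        if "word/comments.xml" in names:
            findings.append("comments part present")
        if "docProps/custom.xml" in names:
            findings.append("custom properties present")
        body = b""
        if "word/document.xml" in names:
            body = package.read("word/document.xml")
    if any(tag in body for tag in TRACKED_CHANGE_TAGS):
        findings.append("tracked changes present")
    if b"<w:vanish" in body:
        findings.append("hidden text formatting present")
    return findings
```

An empty list means only that none of these particular markers were found. It does not prove the document is free of hidden content.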
The National Security Agency provides recommendations for sanitizing Word documents. In 2005, the Agency published Redacting with Confidence: How to Safely Publish Sanitized Reports Converted from Word to PDF, highlighting a seven-step process to safely sanitize Word documents:
- Create a copy of the original document. All edits should be made to the copied version only. The original should be retained as-is as a backup.
- Turn off track changes, comments and other visible markups on the copy. Review and remove all sensitive content.
- Rename the document.
- Review the document to ensure that all material to be redacted has been deleted and, wherever necessary, replaced with innocuous filler (e.g., empty shapes to replace sensitive images).
- Open a new blank document and copy data from the document copy into the new document.
- Convert the new Word document to PDF format.
- Review the PDF for any missed redactions.
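Parts of this procedure lend themselves to small helper scripts. The sketch below illustrates only the copy-and-rename steps and the final review, and it assumes the text of the finished PDF has already been extracted by some other tool. Both function names are hypothetical, and the term check is a plain case-insensitive substring search, which supplements a human review rather than replacing it.

```python
import shutil
from pathlib import Path


def make_working_copy(original, new_name):
    """Copy the original to a new file name in the same folder; return the new path."""
    original = Path(original)
    working = original.with_name(new_name)
    if working.exists():
        # Refuse to overwrite so an earlier working copy is never lost silently.
        raise FileExistsError(f"{working} already exists")
    shutil.copyfile(original, working)
    return working


def find_missed_terms(text, sensitive_terms):
    """Return the sensitive terms that still occur in the text (case-insensitive)."""
    lowered = text.lower()
    return [term for term in sensitive_terms if term.lower() in lowered]
```

For example, find_missed_terms(pdf_text, ["Project Falcon"]) returns a non-empty list if that phrase survived the redaction, which signals that the document needs another pass before release.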
Automated document sanitization
Manual document sanitization can be a complex process in which hidden content is easily missed. Automation can reduce such errors and make sanitization more thorough, which better protects both the document and the organization.
Automated sanitization software uses algorithms to detect terms and term combinations in a document that might disclose sensitive or confidential information. Users of the sanitization application define which topics are deemed sensitive, and the matching terms are then redacted from the document. If the user predefines privacy requirements, the software can also generalize the risky terms in line with those requirements.
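A heavily simplified sketch of this approach follows. Real products use more sophisticated detection; this example assumes the sensitive terms are supplied as plain strings, matches them on word boundaries without regard to case, and either masks them or replaces them with a more general phrase. The function name sanitize_terms, its parameters and the "[REDACTED]" mask are illustrative assumptions.

```python
import re


def sanitize_terms(text, redact_terms, generalizations=None, mask="[REDACTED]"):
    """Mask redact_terms and replace generalization keys with broader phrases."""
    generalizations = generalizations or {}
    # Map each lowercase term to the text that should replace it.
    replacements = {term.lower(): mask for term in redact_terms}
    replacements.update(
        {term.lower(): general for term, general in generalizations.items()}
    )
    if not replacements:
        return text
    # Try longer terms first so multi-word terms win over their substrings.
    ordered = sorted(replacements, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(term) for term in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda match: replacements[match.group(0).lower()], text)


print(sanitize_terms(
    "Alice was treated for leukemia in Boston.",
    redact_terms=["Alice"],
    generalizations={"leukemia": "a serious illness"},
))
# prints: [REDACTED] was treated for a serious illness in Boston.
```

Generalizing a term instead of masking it keeps the sentence readable while lowering the level of detail that is disclosed.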
Effective sanitization applications can sanitize documents of different formats, including Word, PDF, Excel and PowerPoint. By removing hidden data, these products help to safeguard information from leaking outside the organization. In this sense, they can be considered important for data loss prevention.
An effective data sanitization process lessens the chance that your organization's valuable data could be stolen or compromised and enhances compliance.
