document sanitization

What is document sanitization?

Document sanitization is the process of cleaning a document to ensure that only the intended information can be accessed from it. In addition to making sure the document text doesn't openly divulge anything it shouldn't, sanitization includes removing metadata that could pose a privacy or security risk.

Document sanitization, sometimes called file sanitization, refers to the cleansing of documents by removing hidden content from them. In addition to metadata, the content that is removed may include document properties, hazardous code such as malicious scripts or backdoors, or malware that hasn't previously been detected.

Document sanitization removes hidden and sensitive content, such as metadata and embedded code, from documents.

Sanitization is not the same as redaction. Sanitization removes hidden data and metadata, whereas redaction removes visible private or sensitive information that should be available only to a specific group of people. With sanitization, all hidden information is permanently removed so that the file can be safely passed on. With redaction, selected text is permanently removed so that it can no longer be viewed or edited.

Despite these differences, the two activities work well together. Combining sanitization with redaction helps to better protect documents and their information from leaks and breaches.

The importance of document sanitization

Documents often include hidden content that may not have been detected previously. If hackers or cybercriminals are able to access this information, they may be able to use it to steal other types of sensitive data, such as passwords, personally identifiable information or financial information. They might also use the information to embarrass a firm or damage its reputation.

To minimize the risk of such incidents, it's important to find and remove all hidden content in documents, meaning any information that is not intended for distribution. Proper and thorough sanitization helps ensure that potentially sensitive information is not seen, whether inadvertently or through malicious intent, when the document is published or shared. It thus helps protect the organization from data breaches.

Document sanitization and metadata removal

Metadata is often described as "data about data." Different types of metadata provide additional information about a document, such as its author, history or version. Metadata can also contain the names of a document's modifiers, the dates of creation and changes, file size, digital signatures, revision histories, tracked changes, watermarks, headers, footers and comment exchanges among various authors and editors.

Metadata, which is usually not readily visible to a document's authors, viewers and editors, is important for document tracking, classification, analysis and management. It also improves collaboration among users sharing a file or collection of files. However, it can contain sensitive information that might embarrass or damage an organization, so it's important to safeguard metadata from unauthorized access. In practice, that means removing it from the document before the document is published, shared or circulated.
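
As a rough illustration of where such metadata can live, the following Python sketch reads and then empties the core properties of a Word .docx file. It assumes the standard Office Open XML layout, in which a .docx file is a ZIP package that stores author, date and revision fields in docProps/core.xml. The function names are hypothetical, only the standard library is used, and other metadata locations (for example docProps/app.xml, comments and tracked changes) are ignored, so this is not a complete sanitizer. The output has not been validated against Word itself.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Standard OOXML part that holds author, dates and revision number.
CORE_PROPS = "docProps/core.xml"


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def read_core_properties(docx_bytes: bytes) -> dict:
    """Return the core properties stored in the package as a dict."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        if CORE_PROPS not in package.namelist():
            return {}
        root = ET.fromstring(package.read(CORE_PROPS))
    return {local_name(child.tag): (child.text or "") for child in root}


def strip_core_properties(docx_bytes: bytes) -> bytes:
    """Copy the package, emptying the core-properties part along the way."""
    source = zipfile.ZipFile(io.BytesIO(docx_bytes))
    output = io.BytesIO()
    with source, zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as target:
        for item in source.infolist():
            data = source.read(item.filename)
            if item.filename == CORE_PROPS:
                root = ET.fromstring(data)
                for child in list(root):
                    root.remove(child)  # drop every property element
                data = ET.tostring(root, encoding="utf-8", xml_declaration=True)
            target.writestr(item.filename, data)
    return output.getvalue()
```

Calling read_core_properties before and after strip_core_properties on the same file shows which fields were present and confirms that they are gone from the copy. The original bytes are left untouched.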

Metadata standards establish a shared language, format, spelling and other conventions for describing data. Each standard is based on a specific schema that provides an overarching structure for all its metadata.

How to sanitize a document

A common way to remove metadata and other hidden information from a document is to convert it to PDF format and then follow the sanitization procedure provided by the specific PDF application used.

For example, the procedure for sanitizing PDF documents in Adobe Acrobat is as follows:

  • Access the Redact top menu and select Sanitize Document.
  • To remove all hidden information, select OK.
  • To remove specific pieces of hidden information, select Click Here.

Non-PDF documents can also be sanitized. Applications such as Microsoft Excel and Word have built-in features to find and remove hidden content such as metadata.
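
The kinds of hidden content that such inspection features look for can be illustrated outside the Office interface. The Python sketch below is not how Word or Excel implement their checks. It is a minimal example that assumes the standard Office Open XML layout of a .docx file and counts reviewer comments, tracked insertions and deletions, and runs marked as hidden text. The function name inspect_docx is hypothetical, and only the standard library is used.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document and comments parts.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def inspect_docx(docx_bytes: bytes) -> dict:
    """Count a few common kinds of hidden content in a .docx package."""
    findings = {"comments": 0, "insertions": 0, "deletions": 0, "hidden_runs": 0}
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        names = package.namelist()
        if "word/comments.xml" in names:
            comments = ET.fromstring(package.read("word/comments.xml"))
            findings["comments"] = len(comments.findall(f"{W}comment"))
        if "word/document.xml" in names:
            body = ET.fromstring(package.read("word/document.xml"))
            # <w:ins> and <w:del> wrap tracked insertions and deletions.
            findings["insertions"] = sum(1 for _ in body.iter(f"{W}ins"))
            findings["deletions"] = sum(1 for _ in body.iter(f"{W}del"))
            # <w:vanish> in run properties marks hidden text. This simple
            # count does not check for an explicit w:val="false".
            findings["hidden_runs"] = sum(1 for _ in body.iter(f"{W}vanish"))
    return findings
```

A nonzero count in any field suggests the file should be cleaned before it is shared. Headers, footers, embedded objects and custom XML parts are not examined here.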

The National Security Agency provides recommendations for sanitizing Word documents. In 2005, the agency published Redacting with Confidence: How to Safely Publish Sanitized Reports Converted from Word to PDF, which outlines a seven-step process to safely sanitize Word documents:

  1. Create a copy of the original document. All edits should be made to the copied version only. The original should be retained as-is as a backup.
  2. Turn off track changes, comments and other visible markups on the copy. Review and remove all sensitive content.
  3. Rename the document.
  4. Review the document to ensure that all material to be redacted has been deleted and, wherever necessary, replaced with innocuous filler (e.g., empty shapes to replace sensitive images).
  5. Open a new blank document and copy data from the document copy into the new document.
  6. Convert the new Word document to PDF format.
  7. Review the PDF for any missed redactions.

Automated document sanitization

Manual document sanitization can be a complex process, with room for omissions and errors. Automation can reduce errors and make sanitization more thorough and consistent, which better protects both the document and the organization.

Automated sanitization software uses algorithms to detect terms and term combinations in a document that might disclose sensitive or confidential information. Users of the sanitization application define which topics are deemed sensitive, and the matching terms are then redacted from the document. If the user predefines privacy requirements, the software can also generalize the risky terms in line with those requirements, replacing them with less specific wording.
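
The difference between redacting and generalizing a term can be sketched in a few lines of Python. This is a simplified illustration and not the algorithm of any particular product: it assumes the sensitive terms are supplied as a plain list and matched literally, whereas real tools may use more sophisticated detection. The function name sanitize_text, its parameters and the sample terms are all hypothetical.

```python
import re


def sanitize_text(text, sensitive_terms, generalizations=None, mask="[REDACTED]"):
    """Replace each sensitive term with its generalization, or with a mask."""
    generalizations = generalizations or {}
    # Map each term (lowercased) to its replacement text.
    lookup = {
        term.lower(): generalizations.get(term, mask) for term in sensitive_terms
    }
    if not lookup:
        return text
    # Try longer terms first so a multiword term wins over a shorter one
    # that it contains.
    ordered = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda match: lookup[match.group(0).lower()], text)


sample = "Patient John Smith was diagnosed with HIV in Boston."
terms = ["John Smith", "HIV", "Boston"]
softer = {"HIV": "a chronic condition", "Boston": "a U.S. city"}
print(sanitize_text(sample, terms, softer))
```

In this example the name has no generalization and is masked outright, while the diagnosis and the city are replaced with broader descriptions. Matching relies on word boundaries, so terms that begin or end with punctuation would need different handling.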

Effective sanitization applications can sanitize documents of different formats, including Word, PDF, Excel and PowerPoint. By removing hidden data, these products help to safeguard information from leaking outside the organization. In this sense, they can be considered important for data loss prevention.

This was last updated in May 2024
