You’ve probably seen one or more news stories like this in the last year: a government or corporation releases a document with sensitive text covered by black rectangles, but the text is still technically present in the file, so private information is accidentally released.
Timothy Lee of Princeton’s Center for Information Technology Policy has published a fascinating article titled “Studying the Frequency of Redaction Failures in PACER,” in which he describes an online database of 1.8 million PDF court documents. He used my CAM::PDF Perl library to automatically search all of those documents for solid rectangles drawn over text. He then manually examined a subset of the matches and found that many of them are cases of accidentally leaked personal information. He proposes using his technique as a pre-processing filter to flag problematic documents before they are added to the database.
Technically, his approach takes advantage of some interesting details of Adobe’s Portable Document Format (PDF). In a PDF, the visual elements are all layered on top of each other using document coordinates. My library can list all of the elements on a page along with their positions. His code tries to identify black rectangles and then looks for text elements whose coordinates overlap those rectangles. His code is not perfect: it also seems to flag text that is drawn intentionally on top of background rectangles. But better safe than sorry with sensitive documents.
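
To make the overlap idea concrete, here is a minimal sketch in Python. It is not Lee’s code and it does not use CAM::PDF; it assumes the filled rectangles and text runs have already been extracted from a page by some other tool. The `Box`, `FilledRect`, and `TextRun` types, the `find_suspect_text` function, and the sample coordinates are all hypothetical names and values invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in page coordinates (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def overlaps(self, other: "Box") -> bool:
        # Two boxes overlap when they intersect on both the x and y axes.
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)


@dataclass(frozen=True)
class FilledRect:
    box: Box
    rgb: Tuple[float, float, float]  # fill color, each channel in 0.0..1.0


@dataclass(frozen=True)
class TextRun:
    box: Box
    text: str


def is_black(rgb: Tuple[float, float, float], tolerance: float = 0.05) -> bool:
    """Treat a fill as black when every channel is at or near zero."""
    return all(channel <= tolerance for channel in rgb)


def find_suspect_text(rects: List[FilledRect],
                      texts: List[TextRun]) -> List[TextRun]:
    """Return text runs whose boxes overlap any black filled rectangle.

    Drawing order is ignored in this sketch, so text drawn deliberately
    on top of a black background is flagged as well.
    """
    black = [r for r in rects if is_black(r.rgb)]
    return [t for t in texts if any(t.box.overlaps(r.box) for r in black)]


# Made-up page contents: one black bar, one light gray background band.
rects = [
    FilledRect(Box(100, 700, 300, 715), (0.0, 0.0, 0.0)),
    FilledRect(Box(50, 600, 500, 620), (0.9, 0.9, 0.9)),
]
texts = [
    TextRun(Box(110, 702, 250, 713), "123-45-6789"),
    TextRun(Box(60, 603, 200, 617), "Table heading"),
    TextRun(Box(100, 500, 300, 512), "Ordinary paragraph"),
]

suspects = find_suspect_text(rects, texts)
for run in suspects:
    print("possible failed redaction:", run.text)
```

Because this sketch compares only coordinates and fill color, it has the same weakness described above: it cannot tell a redaction bar from a dark background that the text was meant to sit on, so anything it flags still needs a human look.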