The Problem
Even when you invest in the best document comparison tools available, there can be times when you need to contact technical support for assistance with a problem comparison. When this happens, it's vital that you can provide the problem documents to technical support to allow the issue to be reproduced and investigated. The problem comes when those documents are confidential...
Of course, highly respected software companies who produce top quality document comparison tools (who, us?!) will guard the confidentiality of your documents and enter into NDA arrangements as appropriate. Just sometimes, however, the document is so confidential that this isn't enough. What to do then?
The Tool
Well, if you've already got a top-class document comparison tool installed (Workshare Compare 5.21 or later), all you need is this little utility:
The Workshare Document Content Hider
The utility works on a copy of your document saved in either DOC or RTF format, so your first step should be to use Word to save a copy in one of these formats somewhere convenient. Then run the tool (it shouldn't need any installation). Use the top 'Browse' button to select your problem document, check that you're happy with the output filename chosen (note that the output is *always* RTF format) and press the 'Hide Content' button. Rinse and repeat for the other document.
What does it do?
The tool opens the document, then mangles all the text and numbers in the document before saving the new version. The text mangling procedure involves replacing all consonants with the letter 'c' and all vowels with the letter 'v'. Even and odd digits are replaced with '1' and '0' respectively. Other characters (punctuation, etc.) remain unchanged. It's obviously going to be almost impossible to recover the original content of the document once this scrambling has taken place - for example the words 'art', 'and', 'ant' and many others are all transformed into 'vcc'.
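As a rough illustration of the mangling rule described above, here is a minimal Python sketch. It is not the tool's actual code: the function name hide_content is made up for this example, and it assumes that replacements are always lowercase, that 'y' counts as a consonant, and that only ASCII letters and digits are touched.

```python
# Illustrative sketch only; not the Workshare implementation.
# Assumptions: output letters are always lowercase 'c'/'v', 'y' is treated
# as a consonant, and non-ASCII characters are left unchanged.
VOWELS = set("aeiouAEIOU")


def hide_content(text: str) -> str:
    """Replace consonants with 'c', vowels with 'v', even digits with '1'
    and odd digits with '0'. Everything else is passed through untouched."""
    out = []
    for ch in text:
        if not ch.isascii():
            out.append(ch)  # assumption: non-ASCII text is not scrambled
        elif ch.isdigit():
            out.append("1" if int(ch) % 2 == 0 else "0")
        elif ch.isalpha():
            out.append("v" if ch in VOWELS else "c")
        else:
            out.append(ch)  # punctuation, whitespace, etc.
    return "".join(out)


if __name__ == "__main__":
    for word in ("art", "and", "ant"):
        print(word, "->", hide_content(word))  # each prints "vcc"
```

Because the mapping depends only on the character itself, the same word always comes out the same way wherever it appears, which is the property that keeps the scrambled documents useful for reproducing a comparison problem.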
What limitations does it have?
Under some circumstances, the comparison problem being investigated may be related to a problem reading the content of the original source files. Since the tool reads the document using the same code path as the comparison, in these cases the tool would not be able to process the document either. Such problems can arise from corrupted documents or pre-Word 97 .doc files. If the tool fails to work, try loading the document into Word and saving it in a different format, then try again.
Graphics and other non-text items within the document are unchanged - if these contain sensitive information then they would need to be removed manually.
Documents in Japanese, Chinese and other ideographic languages may not be scrambled properly - it's always worth checking the scrambled version of the document before sending it to us.
You should be aware that it may be possible to deduce the identity of some longer dictionary words from the scrambled document - for instance, if there is only one dictionary word that is converted to 'vccvccvcvvc', then that pattern in the scrambled document could be identified as the matching dictionary word ('information' in this example). However, numbers and proper nouns (company names, individual names, place names, etc.) are almost impossible to recover. The level at which the content is hidden is a compromise, since it's important to ensure that a particular word is always converted in the same way regardless of its location in a document. Traditional strong encryption wouldn't obey this rule, which would mean that the resulting documents would not be at all useful in reproducing the comparison problem.
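To make the dictionary-word caveat concrete, the sketch below groups a word list by scrambled pattern and picks out the patterns that match only one word. It is a hedged illustration, not part of the tool: the helper names and the five-word SAMPLE_WORDS list are invented for the example, and a pattern that is unique in this tiny list may well be shared by other words in a real dictionary.

```python
# Illustrative sketch only. The word list is a made-up sample, so
# "unique" here means unique within this list, not within English.
from collections import defaultdict


def pattern(word: str) -> str:
    """Scrambled form of a purely alphabetic word: vowels -> 'v',
    every other letter -> 'c' ('y' is assumed to be a consonant)."""
    return "".join("v" if ch in "aeiou" else "c" for ch in word.lower())


def group_by_pattern(words):
    """Map each scrambled pattern to the list of words that produce it."""
    groups = defaultdict(list)
    for word in words:
        groups[pattern(word)].append(word)
    return dict(groups)


SAMPLE_WORDS = ["art", "and", "ant", "information", "contract"]

groups = group_by_pattern(SAMPLE_WORDS)

# Patterns matched by exactly one word are the ones a reader could guess.
unique = {p: ws[0] for p, ws in groups.items() if len(ws) == 1}

if __name__ == "__main__":
    for p, ws in groups.items():
        print(p, ws)
```

Short patterns such as 'vcc' collect several candidates, while long patterns tend to have few, which is why longer dictionary words are the ones most at risk of being guessed.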
One last piece of advice - once you've processed your documents, try comparing the processed versions to make sure that the bug you're about to report can still be reproduced!