Top 5 ways to use textual analytics in electronic document review


For many legal counsel, this is a familiar scenario: a production deadline is looming and you still have thousands, possibly hundreds of thousands, of documents to review.

But many don’t realize that you can use textual analytics to review documents and save time and money. More specifically, analytics tools can help you:
• reduce document review time
• find documents you wouldn’t have discovered using the traditional keyword approach
• meet what may otherwise be unrealistic document production deadlines

Here we outline what we’ve found are the top five ways to use textual analytics in electronic document review.

1. Email threading

Email threading should be the first stop on your analytics road trip. If used correctly, it will likely pay for the cost of analytics on its own.

When you’re dealing with a large email collection, it’s almost always necessary to thread it. What does this mean? First you eliminate duplicative emails (ones with minor differences that aren’t necessarily caught as duplicates during traditional eDiscovery processing) and next you identify what is referred to as “non-inclusive content.”

Non-inclusive content refers to emails in a conversation thread whose content is repeated in other replies, forwards, etc., so the actual body is duplicative. If you hive off this content, you could reduce the volume of email for review by at least 10% and by as much as 40%.

Email threading also allows you to organize your threads in chronological order, making the review process that much more efficient for the reviewers.
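To make the idea of non-inclusive content concrete, here is a minimal sketch in Python. It assumes each reply quotes earlier messages verbatim with ">" markers and that exact duplicates have already been removed during processing. The `Email` class, the `split_thread` function and the sample messages are hypothetical, and commercial threading engines handle far messier data (changed recipients, attachments, edited quotes) than this does.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Email:
    doc_id: str      # hypothetical document identifier
    sent: datetime
    body: str


def normalize(text):
    """Lowercase, strip leading '>' quote markers, and collapse whitespace."""
    lines = [line.lstrip("> ") for line in text.splitlines()]
    return " ".join(" ".join(lines).lower().split())


def split_thread(emails):
    """Return (inclusive, non_inclusive), each in chronological order."""
    ordered = sorted(emails, key=lambda e: e.sent)
    bodies = {e.doc_id: normalize(e.body) for e in ordered}
    inclusive, non_inclusive = [], []
    for email in ordered:
        text = bodies[email.doc_id]
        # Non-inclusive: the whole body reappears inside a longer email.
        repeated = any(
            text != other and text in other
            for other_id, other in bodies.items()
            if other_id != email.doc_id
        )
        (non_inclusive if repeated else inclusive).append(email)
    return inclusive, non_inclusive


thread = [
    Email("E1", datetime(2020, 1, 6, 9, 0), "Can you send the Q3 contract?"),
    Email("E2", datetime(2020, 1, 6, 10, 0),
          "Attached.\n> Can you send the Q3 contract?"),
    Email("E3", datetime(2020, 1, 6, 11, 0),
          "Thanks!\n> Attached.\n> > Can you send the Q3 contract?"),
    Email("E4", datetime(2020, 1, 6, 9, 30),
          "Forwarding to legal.\n> Can you send the Q3 contract?"),
]
inclusive, non_inclusive = split_thread(thread)
print([e.doc_id for e in inclusive])      # ['E4', 'E3']
print([e.doc_id for e in non_inclusive])  # ['E1', 'E2']
```

In this example, reviewers would only need to read E4 and E3, because the content of E1 and E2 is fully repeated in them.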

2. Categorization

Using the technology that supports predictive coding or technology-assisted review (TAR), categorization allows you to identify a small number of example documents and ingest them into the analytics engine. The engine will return documents that are conceptually similar to the examples provided, based on the specified coherence level (or percentage similarity).

This technology is ideal for identifying languages or a group of documents to be used as a seed set. It can dramatically narrow the review population when you are looking for documents with a particular type of content. It can also be run across several different concepts (or issues) at the same time, with the results categorized by the issue each example represents.
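The sketch below illustrates the general shape of categorization by example. It is an assumption-laden stand-in: it scores similarity with a simple word-count cosine measure, whereas a commercial analytics engine would use its own concept model. The `categorize` function, the `threshold` parameter (playing the role of the coherence level) and the sample texts are all hypothetical.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn text into a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two count vectors (0.0 to 1.0)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def categorize(documents, examples, threshold=0.5):
    """Map each document id to its best-matching issue, or None."""
    example_vectors = {
        issue: [vectorize(t) for t in texts]
        for issue, texts in examples.items()
    }
    results = {}
    for doc_id, text in documents.items():
        vec = vectorize(text)
        best_issue, best_score = None, 0.0
        for issue, vectors in example_vectors.items():
            score = max((cosine(vec, v) for v in vectors), default=0.0)
            if score > best_score:
                best_issue, best_score = issue, score
        # Only categorize documents that meet the similarity threshold.
        results[doc_id] = best_issue if best_score >= threshold else None
    return results


examples = {
    "pricing": ["price increase for the supply contract"],
    "hr": ["employee termination and severance"],
}
documents = {
    "D1": "the supply contract price increase was approved",
    "D2": "severance for the employee termination",
    "D3": "lunch on friday",
}
results = categorize(documents, examples)
print(results)  # {'D1': 'pricing', 'D2': 'hr', 'D3': None}
```

Raising the threshold returns fewer, more closely matching documents per issue; lowering it casts a wider net.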

3. Concept searching / Find similar

Concept searching and “find similar” are much like categorization but on a smaller scale. If a highly relevant example document is identified, the content can be used to find conceptually similar documents.

A sentence or paragraph within a document can be copied and pasted into the concept search box to quickly find documents that deal with the same concepts. This can help find additional sample documents that can then be ingested through other analytics tools such as categorization.

These tools can also be highly effective at finding additional documents that are conceptually similar to those that may be key to the case. Just think: the documents that are conceptually similar to your hot or key documents can be found with a few mouse clicks.

4. Clustering / Cluster visualization

But what do you do if you’re really not sure what content your data holds? You could try clustering.

Clustering allows you to place your documents (all of them or a subset determined by other means such as keywords, categorization, etc.) into buckets based on concept. These conceptual buckets can then be batched for review or visualized to see how they relate to one another. Cluster visualization can be used to see which clusters have had coding applied to them by overlaying a heat map on the clusters.
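As a rough illustration of how documents can be placed into buckets without any examples or keywords, the sketch below uses a simple greedy approach: each document joins the first bucket whose seed document shares enough vocabulary with it, and otherwise starts a new bucket. This is an assumption for illustration only; the `cluster` function, the word-overlap measure and the `threshold` value are hypothetical and much cruder than the concept-based clustering an analytics engine performs.

```python
import re


def tokens(text):
    """Return the set of distinct lowercase words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Share of words two documents have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(documents, threshold=0.3):
    """Greedy clustering: join the first similar bucket, else open a new one."""
    clusters = []  # list of (seed_tokens, [doc_ids])
    for doc_id, text in documents.items():
        doc_tokens = tokens(text)
        for seed, members in clusters:
            if jaccard(doc_tokens, seed) >= threshold:
                members.append(doc_id)
                break
        else:
            # No existing bucket was similar enough, so start a new one.
            clusters.append((doc_tokens, [doc_id]))
    return [members for _, members in clusters]


documents = {
    "A": "quarterly revenue forecast and budget",
    "B": "revised revenue forecast and budget numbers",
    "C": "office holiday party invitation",
    "D": "holiday party catering invitation",
}
buckets = cluster(documents)
print(buckets)  # [['A', 'B'], ['C', 'D']]
```

Each resulting bucket could then be batched to a reviewer as a group.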

5. Near duplicate detection

Near duplicate detection is probably one of the most misunderstood features of textual analytics. Clients often ask whether the collection can be run through near duplicate detection to remove the duplicates, similar to how hash (or exact) duplicates are removed during eDiscovery processing. Unfortunately, it’s not that simple.

Two documents that are “near duplicates” are just that: nearly, but not exactly, the same. Subtle differences may, in some cases, be acceptable, such as with scanned paper documents that have been run through optical character recognition (OCR). These documents may be the same in substance and differ only because of inaccuracies in the scanning/OCR process. But with electronic documents, subtle differences might be extremely important. That’s why this tool should be used with caution if near duplicates are being considered for removal from review.
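A small sketch shows why a similarity score alone cannot tell a harmless difference from an important one. It assumes near duplicates are scored by the overlap of three-word sequences ("shingles"), which is one common technique and not necessarily what any particular tool uses. The `similarity` function and the sample sentences are hypothetical.

```python
def shingles(text, size=3):
    """Return the set of overlapping word sequences of the given size."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(a, b, size=3):
    """Jaccard similarity of the two texts' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a, size), shingles(b, size)
    return len(sa & sb) / len(sa | sb)


original = "the supplier shall deliver the goods within 30 days of the order date"
# Same document after a scanning/OCR error in one word.
ocr_copy = "the supplier shall deliver the g00ds within 30 days of the order date"
# A different version with a substantive change to the delivery term.
revised = "the supplier shall deliver the goods within 90 days of the order date"

ocr_score = similarity(original, ocr_copy)
revised_score = similarity(original, revised)
print(ocr_score == revised_score)  # True
```

Both copies differ from the original by a single word and receive exactly the same score, yet only one of them changes the meaning of the clause.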

Based on our experience, the best use cases for near duplicate detection are identifying alternate versions of documents and performing quality control on particularly sensitive sets of documents (such as those marked as privileged). Bringing in near duplicates of these documents may allow you to find documents that might otherwise have been produced inadvertently.


Contact us

William Platt

Partner, Forensic Services, PwC Canada

Tel: (416) 814-5710, (403) 509-7400
