As the use of smartphones and digital cameras becomes more widespread, the number of photographic images used in daily business is also on the rise. We estimate that images represent about 5–20% of the total data processed during an eDiscovery investigation. Many individuals, for example, will take a picture or screenshot of a specific section of a document, such as a contract, because they believe the information is important. These images may therefore contain important, or even critical, information relevant to the investigation.
Unfortunately, there has traditionally been no practical way to identify which files contain text among large volumes of image data. Reviewing such volumes manually is neither realistic nor cost-efficient, and is prone to human error. Optical character recognition (OCR) can extract text from an image file, but running OCR on large volumes of data requires expensive resources and days of processing time.
Due to the lack of an appropriate way to process images during eDiscovery projects, the value of image files is often overlooked. This is one challenge the eDiscovery industry is currently facing, and that PwC is working to solve.
PwC’s forensic team has recently made breakthroughs in tackling this challenge. Using AI technology and an image classification algorithm, we have developed a solution that automatically classifies large volumes of image files.
This solution is designed to identify image files that might contain text, either printed or handwritten. The identified image files are then processed in different ways to extract the text, depending on the type of image.
Our solution can automatically highlight three types of relevant image files: screenshots of chat messages, images of printed documents and images of handwritten documents.
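The routing idea described above, where each identified image is handled according to its type, could be sketched roughly as follows. This is only an illustration under stated assumptions: the category names mirror the three types listed above, but the `OTHER` category, the extraction step names and the `route_images` function are hypothetical and do not describe the actual implementation.

```python
from enum import Enum


class ImageCategory(Enum):
    # The three categories named in the text above.
    CHAT_SCREENSHOT = "chat_screenshot"
    PRINTED_DOCUMENT = "printed_document"
    HANDWRITTEN_DOCUMENT = "handwritten_document"
    # Assumed catch-all for images unlikely to contain text.
    OTHER = "other"


# Hypothetical mapping from category to a follow-up extraction step.
# The step names are placeholders, not real tools or services.
EXTRACTION_STEP = {
    ImageCategory.CHAT_SCREENSHOT: "chat_text_extraction",
    ImageCategory.PRINTED_DOCUMENT: "printed_text_ocr",
    ImageCategory.HANDWRITTEN_DOCUMENT: "handwriting_recognition",
}


def route_images(classified):
    """Group classified files by extraction step.

    `classified` is an iterable of (file_name, ImageCategory) pairs.
    Files without a mapped step are returned separately as skipped,
    so they never reach the costly text extraction stage.
    """
    queues = {step: [] for step in EXTRACTION_STEP.values()}
    skipped = []
    for file_name, category in classified:
        step = EXTRACTION_STEP.get(category)
        if step is None:
            skipped.append(file_name)
        else:
            queues[step].append(file_name)
    return queues, skipped
```

In a sketch like this, only the files placed in a queue would go on to text extraction, which is where the saving over running OCR on every image would come from.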
We have incorporated this newly developed solution into our eDiscovery methodology. As a result, our current methodology is now able to cover all digital image files by default, enabling a wider scope of eDiscovery analysis and providing new value through the investigation process.
Combining AI algorithms with suitable hardware gives the solution a high processing speed. It can currently process 20,000 images per hour, and our target is to process 100,000 image files per hour in the future.
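To put the throughput figures above into perspective, the short calculation below estimates processing time for a given number of images. The two rates come from the text. The collection size of 1,000,000 images is a hypothetical example, and the calculation assumes a constant rate with no setup time.

```python
CURRENT_RATE = 20_000   # images per hour, as stated in the text
TARGET_RATE = 100_000   # images per hour, the stated future target


def hours_to_process(image_count, images_per_hour):
    """Return the estimated processing time in hours at a constant rate."""
    if images_per_hour <= 0:
        raise ValueError("images_per_hour must be positive")
    return image_count / images_per_hour


# Hypothetical collection size, chosen only for illustration.
example_count = 1_000_000

current_hours = hours_to_process(example_count, CURRENT_RATE)  # 50.0
target_hours = hours_to_process(example_count, TARGET_RATE)    # 10.0
```

At the current rate, such a collection would take roughly two days of continuous processing. At the target rate, it would take about ten hours.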
We have achieved an accuracy of over 95% for the classification results in projects conducted from June to November 2020. We are continuously working to improve the solution, and expect to achieve an accuracy of over 99% in the future.
This solution is currently incorporated into the standard eDiscovery workflow, and we are working to automate it fully.
This solution classifies image files that might contain text, either printed or handwritten. This enables forensic professionals to discover more relevant or critical information in image files, while avoiding the high cost of large-scale OCR processing or manual review.
We have been testing this solution and using it in engagements since its development, and we have received positive feedback from our clients and the law firms leading the investigations. The solution has made it possible to identify highly relevant documents among image files that would not otherwise have been identified.
We believe this solution will bring about a revolution for the eDiscovery industry, and that analysing photographic images during the eDiscovery process will become the new standard in the near future.