Extracting complex tables with machine learning

Using 'machine learning lite' to help users extract data from complex tables, so that our customers can process this data downstream.

To quote one of our customers, "It's like magic". 

Team: Product Designer (me) • Tech Lead • Product Manager • Developers (6) • QA (2)

The problem 😬

Our customers' documents often contained tables they had to extract data from. We had a machine learning (ML) model that identified 'griddable' tables, but many of their tables were 'non-griddable', which made data extraction difficult for them.

[Screenshots: a griddable table and a non-griddable table]

Griddable table: A table where every piece of information is always in the format of column : row.

Non-griddable table: A table where the information is not laid out as neatly and can have rows within rows or overlapping columns. 

As you can see, griddable tables were much easier for our users: their simpler structure let our models identify the different rows and columns. Our users, the keyers, then just had to check that everything was in the right spot. Here is an example of what our experience looked like with griddable tables:

[Screenshot: our experience with griddable tables]

Griddable tables: This is an example of what the machine would identify for griddable tables.

The clear row:column relationship made it very easy for the machine to identify and extract the data the user needed.

We needed to find a way to help our customers extract data from their non-griddable tables in a less manual way.

Limitations 🚧

The biggest limitations were that machine learning models take time to develop and that our machine learning team already had a ton of other projects in development. Even after a model was developed, it was always an uphill battle to get our customers to install it (our product was on-prem, i.e. not in the cloud) and then continuously train it.

We wanted to make this a v1: something that we could get out the door and then iterate on later.

Solution ⚡️

Our solution was to use 'machine learning lite'. We wouldn't have an entire model, meaning the customer wouldn't be able to improve on it, but they would be able to extract data less manually. I realized that while non-griddable tables didn't follow the format of other tables, each one did follow a pattern within itself. So if the user identified one part of the pattern, we could help apply that pattern to the rest of the document.
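To make the idea concrete, here is a minimal sketch of what "label one part of the pattern, apply it everywhere" could look like. It is an illustration only and not how our product was built: the `FieldSpec` and `apply_pattern` names, the fixed block height, and the sample invoice lines are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """Hypothetical description of one field the user labelled in the first block."""
    name: str
    line_offset: int  # which line inside the repeating block holds the value
    token_index: int  # which whitespace-separated token on that line


def apply_pattern(lines, block_height, fields):
    """Repeat the layout labelled on the first block over every later block.

    Assumes (for illustration) that the table repeats every `block_height` lines.
    """
    records = []
    for start in range(0, len(lines) - block_height + 1, block_height):
        block = lines[start:start + block_height]
        record = {}
        for field in fields:
            tokens = block[field.line_offset].split()
            # Leave the value empty rather than guess when a token is missing.
            record[field.name] = (
                tokens[field.token_index] if field.token_index < len(tokens) else None
            )
        records.append(record)
    return records


# Made-up "rows within rows" table: each invoice line is followed by an item line.
lines = [
    "INV-001 2023-07-01",
    "  Widget 4",
    "INV-002 2023-07-02",
    "  Gadget 9",
]

# The user labels the first block once...
fields = [
    FieldSpec("invoice", line_offset=0, token_index=0),
    FieldSpec("date", line_offset=0, token_index=1),
    FieldSpec("item", line_offset=1, token_index=0),
    FieldSpec("quantity", line_offset=1, token_index=1),
]

# ...and the same layout is applied to the rest of the document.
records = apply_pattern(lines, block_height=2, fields=fields)
for record in records:
    print(record)
```

The point of the sketch is the trade-off described above: nothing is trained or improved over time, but one labelled example is enough to fill in the rest of the table for the keyer to review.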

Here was our final solution:

Nongriddable table
