Document miners close the gap between unstructured and structured content

In one of a series of essays on the impact of AI on the insights profession, Martin Rückert, our Chief AI Officer, identifies a key technology for efficiently transforming unstructured information into content a machine can understand: document miners.

In theory, a lot of data is a good thing in modern data science. It gives a very detailed description of a situation, so that high-fidelity models can be built to support prediction. A typical example would be sensors tracking the health of a machine, like a vehicle motor. By analyzing many of these sensor logs, you can derive patterns that predict a machine failure before it actually happens. The beauty is that all the data in the sensor logs is highly structured, and therefore easy to understand, format and analyze.

Admittedly, a machine’s behavior is a highly noisy process (in data science terms), but we know what to measure and what features the output has. All that’s needed is a lot of computational power to sift through the data and derive the pattern for the predictor.

The ‘noisy’ world of marketing

Marketing processes, where the behavior of people is measured, are even noisier because behavior is influenced by so many factors. (Sentiment extraction alone, for example, can’t give you a complete picture.) To make matters worse, there’s simply no sensor that produces ready-to-use, structured information that’s as easy to understand as the logs from the vehicle motor I described a moment ago. That’s why it’s often more precise to write a text in plain English (an encoding format matured over thousands of years) to express what needs to be transmitted between the analyst (“the sensor”) and the market researcher (“the recipient”).

The major problem is that processing natural language data further, for example by summarizing it in graphs or by counting aspects so you can react at a defined threshold, requires the data to be harmonized into a format that’s easy to work with. For example, marketers might like to review a table in which consumers’ attitudes to benefits are laid out in rows and columns.

Cutting through the noise to create structured output

It’s clear that there needs to be a conversion step between unstructured, noisy human language (which is hard for a computer to “understand”) and highly structured formats that can be used for analytics, such as trend predictions or frequency distributions over a dimension, like consumer problems in a product category.
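
To make the conversion step concrete, here is a minimal sketch in Python. It is not how document miners work internally: a hand-written keyword lookup stands in for the machine-trained models, and the lexicon, the comments and the function names are all hypothetical. It only illustrates the shape of the step, from free text to a frequency distribution over one dimension (consumer problems), plus a simple threshold check.

```python
from collections import Counter

# Hypothetical lexicon: each "consumer problem" label and its trigger keywords.
# A real system would use trained models instead of this lookup.
PROBLEM_KEYWORDS = {
    "price": ("expensive", "pricey", "cost"),
    "packaging": ("leak", "packaging", "lid"),
    "taste": ("bland", "bitter", "taste"),
}


def extract_problems(text):
    """Return the problem labels whose keywords occur in the text."""
    lowered = text.lower()
    return [
        label
        for label, keywords in PROBLEM_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    ]


def problem_frequencies(comments):
    """Count how many comments mention each problem label."""
    counts = Counter()
    for comment in comments:
        counts.update(extract_problems(comment))
    return counts


def over_threshold(counts, threshold):
    """Return the labels mentioned at least `threshold` times, sorted by name."""
    return sorted(label for label, n in counts.items() if n >= threshold)


# Made-up example comments, for illustration only.
COMMENTS = [
    "Too expensive for what you get, and the lid keeps leaking.",
    "I like the brand, but the flavour is bland.",
    "The packaging broke in my bag.",
]

FREQ = problem_frequencies(COMMENTS)
ALERTS = over_threshold(FREQ, threshold=2)

if __name__ == "__main__":
    for label, n in FREQ.most_common():
        print(f"{label}: {n}")
    print("over threshold:", ALERTS)
```

Plain substring matching is crude (the keyword "lid" would also match "solid"), which is one reason a trained model is the better tool once the language gets noisy.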

At Market Logic, we’ve encapsulated that normalization process from unstructured to easy-to-compute structured form in a module type called “document miners”. Document miners process human language in text reports and visual information in videos to detect the important dimensions we can bridge into structured form. (These are the same golden nuggets our customers spend tons of money to find in primary and syndicated research reports.)

Document miners decode human language and other noisy encoding formats, using machine-trained models to extract the dimensions that matter to market researchers and marketers. They output the extracted information in a highly structured form, which we call the Market Logic knowledge graph.

Ultimately, this is where insights are married with data from already structured sources, such as survey respondent data over the same dimensions.
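
As a rough illustration of that last step, the sketch below lines up mined facts and survey figures over a shared dimension. The article does not describe the knowledge graph’s actual schema, so the (subject, relation, object) triples, the relation name, the survey numbers and the combine function are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical mined facts, stored as (subject, relation, object) triples.
MINED_TRIPLES = [
    ("report_a", "mentions_problem", "packaging"),
    ("report_b", "mentions_problem", "packaging"),
    ("report_b", "mentions_problem", "price"),
]

# Hypothetical survey data: share of respondents naming each problem.
SURVEY_SHARE = {"packaging": 0.31, "price": 0.12, "taste": 0.08}


def combine(triples, survey_share):
    """Line up report mentions and survey shares per consumer problem."""
    sources = defaultdict(set)
    for subject, relation, obj in triples:
        if relation == "mentions_problem":
            sources[obj].add(subject)

    combined = {}
    # Cover every problem seen in either source.
    for problem in sorted(set(sources) | set(survey_share)):
        combined[problem] = {
            "report_mentions": len(sources.get(problem, ())),
            "survey_share": survey_share.get(problem),  # None if not surveyed
        }
    return combined


COMBINED = combine(MINED_TRIPLES, SURVEY_SHARE)

if __name__ == "__main__":
    for problem, row in COMBINED.items():
        print(problem, row)
```

Because both sources are keyed by the same dimension, a problem that shows up in the survey but in no report (here, "taste") still appears in the combined view, with zero report mentions.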