Making Sense: Extracting Meaning from Text

August 30, 2015 AJMaren

Making Sense: Extracting Meaning from Text by Matching Terms and Entities to Ontologies and Concepts

Text analytics is the means by which computer algorithms can extract meaning and useful insights from raw text sources. This can have enormous impact in realms such as marketing, business intelligence, and political campaigns.

However, text analytics is one of the toughest challenges in predictive analytics.

The reason why this is so hard?

It’s because – when done right – text analytics must effectively connect two very different methodologies.

Not many people know how to do this.

In fact, within the text analytics arena, there are many who know one methodology or another, but very, VERY few who know both – and fewer yet who can effectively connect these two approaches.

Why It Is Essential to Connect Statistics with Ontologies

Imagine that you’re building a house. On the one hand, you have a blueprint. On the other hand, you have a pile of lumber, with assorted tools, nails and screws, and other things that you need to get started.

Connecting document statistics to ontologies is like building your house according to your blueprint.

Ontologies Are Like Blueprints

Ontologies represent your world view.

Suppose that you are building a house. Your ontology is like the blueprint.

The main exterior walls of the house are like the main divisions of your ontology. So, just as you’d have the primary north, south, east and west exterior walls, you’d have, within your ontology, the highest-level types such as people, people-groups, places, things, and events.

Just as you’d have essential supporting walls inside a house, you’d have the next tier of crucial ontological subgroups. For example, in our Case 1 Study focusing on the presidential election campaign of 2016, we have major people-groups such as Democrats and Republicans, Political Action Committees (PACs), government organizations (ranging from the Executive Branch to intelligence organizations), and others.

Within the high-level ontology subgroups, we have more specific subgroups, such as Democratic PACs, etc.

Within those, we have further specifics, leading often to specific instances. An example might be the “Ready for Hillary PAC.”

Specific instances are like walls or other features that can be moved without too much impact on the overall blueprint. For example, one candidate can concede the campaign to another, and even become a running mate with a former rival. This is like moving two interior walls, or connecting two rooms with an arch instead of keeping them separate. If you have a well-designed ontology, then instances can change their attributes, and their relationships with other instances, without too much impact on the rest of the structure.

Good ontology design, like any kind of sound architecture, gives you a robust structure with some flexible options.
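To make the blueprint idea concrete, here is a minimal sketch of how such a layered ontology might be held in code. The `Node` class, the root name "world", the relation labels, and the placeholder candidates are all assumptions made for illustration; real projects often use a dedicated ontology language or tool instead.

```python
class Node:
    """One entry in the ontology: a class (e.g. 'people-groups') or an instance."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []

    def add(self, name):
        """Create a child node under this one and return it."""
        child = Node(name, parent=self)
        self.children.append(child)
        return child

    def path(self):
        """Return the names from the root down to this node."""
        names = []
        node = self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names[::-1]


# Highest-level types: the "exterior walls". The root name is hypothetical.
root = Node("world")
top = {name: root.add(name)
       for name in ["people", "people-groups", "places", "things", "events"]}

# Next tiers: the "supporting walls", narrowing down to a specific instance.
democrats = top["people-groups"].add("Democrats")
democratic_pacs = democrats.add("Democratic PACs")
ready_pac = democratic_pacs.add("Ready for Hillary PAC")

# Relationships between instances are kept outside the hierarchy, so they
# can change without touching the blueprint. Candidates are placeholders.
candidate_a = top["people"].add("Candidate A")
candidate_b = top["people"].add("Candidate B")
relations = {("Candidate A", "rival_of", "Candidate B")}

# Candidate A concedes and joins Candidate B's ticket: only the relation changes.
relations.discard(("Candidate A", "rival_of", "Candidate B"))
relations.add(("Candidate A", "running_mate_of", "Candidate B"))

print(" > ".join(ready_pac.path()))
```

The point of the design is that the concession only edits the `relations` set; the class hierarchy, and every instance's place in it, stays exactly as it was.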

Statistics Are Like Organizing Raw Materials

Statistics are a way of organizing data. Image courtesy http://www.enonhall.com/html/journal0805.html.

Think of every document that is collected as having a delivery truck pull up and drop off a pile of building supplies; your raw materials for house-building.

Statistics are a means of organizing and comparing these raw materials.

So for example, you could have a list of possible supplies; everything from lumber (all different types), roofing materials (from sheeting to shingles), insulation, plumbing, electrical … the list goes on. And of course there are nails, screws, and all sorts of tools needed.

Every time you get a delivery, you fill in a column on your master list of what was delivered. This is like filling in a Term Feature Vector for each document: one column of term counts per delivery. Put the columns side by side and you have an overall Term Feature Matrix.
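As a rough illustration of the master list, the sketch below counts terms in two made-up documents and lays the counts out with one row per term and one column per document. The sample texts, the simple `tokenize` helper, and the variable names are assumptions for the example; a real pipeline would add steps such as stop-word removal and stemming.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


# Two toy "deliveries" (invented sample text, not real data).
documents = {
    "doc1": "Clinton spoke to voters. Voters asked Clinton about the campaign.",
    "doc2": "The Republican candidate opened a campaign office.",
}

# One Term Feature Vector (a column of counts) per document.
term_vectors = {doc_id: Counter(tokenize(text))
                for doc_id, text in documents.items()}

# The master list of supplies: every term seen in any document.
vocabulary = sorted(set().union(*term_vectors.values()))

# Term Feature Matrix: rows are terms, columns are documents.
doc_ids = list(documents)
matrix = {term: [term_vectors[d][term] for d in doc_ids] for term in vocabulary}

for term in vocabulary:
    print(f"{term:12s}", matrix[term])
```

A `Counter` returns zero for a term it has never seen, which is what lets every document share the same master list even when most entries are empty.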

Statistics help keep things organized (how much of what), and they help you track similarity of deliveries.

Just as you’d expect lots of lumber deliveries, with various nails, screws, and other supplies thrown in, you would expect to see similar terms in your documents if you are collecting those with keywords on a specific subject. (Documents collected using the keyphrase “Hillary Clinton” or “Republican candidate” will naturally have lots of references to Ms. Clinton, or to one of the Republican candidates.)
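One common way to put a number on "similar deliveries" is cosine similarity between two term-count vectors. This post does not commit to a particular measure, so treat the choice, along with the function name and toy counts below, as an assumption made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (1.0 = same mix of terms)."""
    shared = a.keys() & b.keys()
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is not similar to anything
    return dot / (norm_a * norm_b)


# Toy term counts (invented for the example).
doc_a = Counter({"clinton": 3, "campaign": 2, "voters": 1})
doc_b = Counter({"clinton": 2, "campaign": 1, "email": 1})
doc_c = Counter({"lumber": 4, "nails": 2})

print(cosine_similarity(doc_a, doc_b))  # high: many shared terms
print(cosine_similarity(doc_a, doc_c))  # zero: no shared terms
```

Because the measure looks only at the mix of terms and not at document length, a short piece and a long one on the same subject can still come out as similar.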

Statistics let you know how much of what.

For example, statistics on lumber deliveries let you know how many 2×4’s, how many sheets of plywood, how many sheets of drywall.

Similarly, statistics applied to a document’s content will tell you how often a given term appears. It can tell you if a given term is relatively new or unusual, or if it is appearing with the same frequency that it has in previous documents.
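Here is a small sketch of that kind of check: compare each term's share of a new document against its share of the documents seen so far, and flag terms that are new or noticeably more frequent. The function names, the `ratio` threshold of 2.0, and the toy token lists are assumptions for illustration, not a recommended setting.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Map each term to its share of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def unusual_terms(new_tokens, previous_docs, ratio=2.0):
    """Flag terms that are new, or at least `ratio` times more frequent than before."""
    background = relative_frequencies(
        [token for doc in previous_docs for token in doc])
    current = relative_frequencies(new_tokens)
    flagged = {}
    for term, freq in current.items():
        base = background.get(term)
        if base is None:
            flagged[term] = "new"
        elif freq >= ratio * base:
            flagged[term] = "more frequent"
    return flagged


# Toy token lists (invented for the example).
previous_docs = [
    ["campaign", "rally", "campaign", "voters"],
    ["campaign", "voters", "debate", "poll"],
]
new_doc = ["campaign", "email", "email", "voters"]

print(unusual_terms(new_doc, previous_docs))
```

Terms that show up at roughly their usual rate, like "campaign" and "voters" here, pass through without comment; only the unfamiliar term is flagged.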

None of this really helps you build a house.

All that statistics will help you do is to organize your piles of stuff. Statistical measures, by themselves, will not give you guidance.

So if you start building a house using just statistics (piles of lumber), without a blueprint, you’ll get a mess, not a house.

It would be like saying “Oh, these boards are the same size and length, let’s start putting them together.”

A total mess.

Concepts Are Like the Roof of a House

It helps us now to think of the document’s concepts as the roof of a house.

Concepts should map very closely to the ontology; they are like carrying the blueprint into the real world.

To get what we want out of text analytics, we need connections between raw materials (the 2×4’s, etc.) and the concepts (the roof). We have to actually build these connections.

We need a framework connecting the raw material (extracted terms) to our roof (concepts) according to our blueprint (ontology). Image courtesy http://www.safewise.com/blog/home-building-timeline-keep-sane/.

To make these connections – raw materials to roof, or terms to concepts – we need a framework.

Building the set of connections is like framing a house. We are using the raw materials that we have to build a structure; a framework for getting the roof (concepts) over our heads, according to our well-considered design (ontology).

The way in which we do this is to build a set of Concept Feature Vectors. These are actually term-to-concept feature vectors; their goal is to map raw document content (terms) into the concepts. Once we have the concepts in a document, we have the equivalent of a roof over our heads. We have a workable structure.
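A minimal sketch of the idea, assuming each Concept Feature Vector is simply a set of term weights and a document's score for a concept is the weighted sum of its term counts. The concept names, terms, weights, and the `concept_scores` helper are all hypothetical; how the weights are actually chosen is the hard part, and this example does not address it.

```python
from collections import Counter

# One term-to-concept feature vector per concept: term -> weight.
# Weights are hand-picked for illustration only.
concept_vectors = {
    "Democrats": {"clinton": 1.0, "democratic": 0.8, "pac": 0.3},
    "Republicans": {"republican": 1.0, "gop": 0.8, "pac": 0.3},
}


def concept_scores(term_counts, concept_vectors):
    """Score each concept as the weighted sum of the document's term counts."""
    return {
        concept: sum(weight * term_counts.get(term, 0)
                     for term, weight in weights.items())
        for concept, weights in concept_vectors.items()
    }


# Term counts extracted from one toy document (invented for the example).
doc_terms = Counter(["clinton", "clinton", "pac", "voters"])

scores = concept_scores(doc_terms, concept_vectors)
best = max(scores, key=scores.get)
print(scores, best)
```

Note that a shared term such as "pac" contributes to both concepts; it is the terms specific to one concept that tip the balance.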

Our Job is to Be the General Contractors

Our job is to be general contractors. Image courtesy http://www.news-researchers.com/2013/08/people-expect-constructing-house/.

Taking on a big text analytics project is like becoming a general contractor for building a house. Before you get a livable house (useful results for your boss or clients), you’ll need to build a lot of structure. That means connecting the terms to the concepts.

And before you can build concepts, you need a general plan (an ontology).

And before you can build the ontology, even if you have a good sense of what needs to be in it, you need to look at the raw material available – the kinds of terms in the documents that you’ll be using, and the overall nature of these documents. (Modeling a set of Tweets is different from modeling news where the core pieces are in-depth profiles and analyses.)

Doing this is as much an art as it is a science.

There are lots of text analytics software tools available; this is much like saying that you can bring in a team of skilled workers and provide tools on the job.

There are lots of ontologies available; this is like saying that there are lots of off-the-shelf blueprints.

However, if you’re designing a customized text analytics system, then much like designing a custom home, it’s important to analyze the working materials (kinds of data) and the client’s particular desires and intentions.

For example, if your client wants to know about breaking news regarding a political candidate, then you’ll need to figure out a way to define “breaking news” so that the system will recognize it – and note that it is different from just another opinion piece.

This is like customizing a blueprint so that the client can watch the sun come up while they sip their morning coffee. You need to think through the design in terms of the client’s specific world view.

Resources for Doing the Job

Bluntly put, there are not many good resources that will tell you how to connect terms to concepts (and thus to a specific ontology).

The problem with most text analytics books (and courses, programming suites, and other offerings) is that they come at the problem from only one of two important directions:

Statistical (bottom-up) processing, or
Ontology (top-down) design.

And in the world of most text analytics methods, never the twain shall meet.

However, because connecting statistics to ontologies is the most important task in text analytics, this is the focus of my Text Analytics course with Northwestern University.

It is also the focus of my book-in-progress, Making Sense.

I am also building a set of web pages on this site to support text analytics students and professionals; these will contain resource links. (Links to these pages coming soon.)

You can also download the crucial Chapter 6 from Making Sense: “Connecting Terms to Concepts.”