Readings – Text Analytics


Text Analytics Books, Articles, Blogs, and Websites


I started teaching text analytics at Northwestern University in Summer 2015, and I continue to teach it every quarter.

The backstory: at my former company – dealing with the DoD and Intel Community’s need to “find the needle in the haystack” after 9/11 – I developed one of my most crucial and innovative patents: a knowledge discovery architecture.

Cue the 2007/2008 financial meltdown, and the dissolution of almost every text-analytics venture of that era.

Ten years later, text analytics is now clearly a sub-discipline of artificial intelligence (AI), with massive results in speech-to-text, text translation, automated email response systems, and numerous other commercial applications.

The following readings are primarily for my Northwestern University students, who are always eager to get a jump on the next quarter – so this page lets them read up before I open the course. (These people are super-ambitious, super-determined, super-driven; all in the best way possible. Just gotta love ’em!)

Up through Summer 2017, I took a bottom-up approach: we started with various texts that the students brought in, all on a single topic theme (Case Study 1 ran from Summer 2015 through Fall 2016, on the U.S. Presidential Election of 2016; Case Study 2 started in 2017, on the Trump Administration). We did basic entity extraction, using a couple of easy-to-access online tools, then moved toward creating Equivalence Classes (ECs), and from there toward clustering related texts, together with some Latent Dirichlet Allocation (LDA) work. (A minimal sketch of this pipeline appears below.)
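Here’s a minimal sketch of that pipeline in Python, using scikit-learn. The toy corpus, the Equivalence Class mapping, and all parameter choices below are hypothetical stand-ins for what we actually did in class with online tools:

```python
# Minimal sketch of the bottom-up pipeline: normalize surface forms into
# Equivalence Classes (ECs), then fit LDA topics with scikit-learn.
# The corpus and the EC mapping are hypothetical illustrations.
import re

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The candidate spoke at the rally about trade policy.",
    "Voters at the town hall asked the nominee about trade.",
    "The administration announced a new policy on tariffs.",
]

# Hypothetical Equivalence Classes: map variant surface forms to one canonical term.
EQUIVALENCE_CLASSES = {"candidate": "nominee", "tariffs": "trade"}

def normalize(text):
    words = re.findall(r"[a-z]+", text.lower())
    return " ".join(EQUIVALENCE_CLASSES.get(w, w) for w in words)

vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(normalize(d) for d in docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(doc_term)

# Print the top few words for each discovered topic.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:4]]
    print(f"Topic {k}: {', '.join(top)}")
```

In class we used online extraction tools rather than code; this just shows the ECs-then-LDA sequence in one place.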

My intention was that our formation of both clusters and LDA topics be guided by ontologies, but the role of ontologies never really seemed to kick in – partly because we could only achieve so much in a 10-week quarter.

Ontologies are of increasing importance, though. Just witness Google’s Knowledge Graph.

So, I’m shifting my approach. I’m going to start top-down; begin with ontologies, and work towards the actual data. Let’s see what happens.


Good Starting Places


Let’s just start here, ok? These are a couple of light and easy ways to get rolling.


Recent and Relevant


In July, 2017, there was a brief news splash when Facebook pulled back an experiment involving two AIs (artificial intelligences) that had evolved their own language to negotiate with each other. While their conversations sounded like gibberish to the human reader, they made sense to the two AIs.

Why is this so important to us?

The science fiction movie Ex Machina pits a robot against her human developers.

It’s not just that we’ve got this little Frankenstein fear going on, particularly a fear of AIs that can operate independently of both human understanding and human control. Witness the evolution from the early SF movie Colossus: The Forbin Project (in which two mega-computers learn to talk with each other, and then take over the world) to the more recent movie Ex Machina, and of course Stanley Kubrick’s 2001.

(Historical digression: check out HAL 9000’s infamous line in 2001: A Space Odyssey: “I’m sorry, Dave. I’m afraid I can’t do that.”)

It’s that now, the idea of having independent bots conducting business on our behalf is not just SF-fantasy-someday, but pretty real. Check out some of the follow-ups to Facebook’s Chatbot scenario:

Why did their own invented language sound so odd to our ears?

The answer: it was not designed to pass the Turing test – that is, it was never designed so that a human could not tell whether the “speaker” was a human or a computer.

This is becoming more and more important in practical applications. Companies are interested in, for example, creating email response systems that emulate how a human would respond in a given conversation.


Deep Learning Architectures for Text Analytics, Speech Translation, and Other Linguistic Tasks


One way in which this technology is unfolding is through the use of Generative Adversarial Networks (GANs), a deep learning architecture in which two networks “compete”: a generator produces candidate outputs, and a discriminator tries to tell them apart from real data, with each improving against the other. Here’s an example of how these can show up in computer-based natural language systems (a minimal GAN sketch follows the two references below):

  • Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, Dan Jurafsky (2017), Adversarial Learning for Neural Dialogue Generation, arXiv:1701.06547v4 [cs.CL] pdf.

The following article also uses the GAN neural network, this time to identify language in e-commerce sites that is predictive of sales:

  • Reid Pryzant, Young-joo Chung, Dan Jurafsky (2017), Predicting Sales from the Language of Product Descriptions. SIGIR eCom pdf.
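To make the adversarial idea concrete before moving on: below is a minimal GAN sketch in PyTorch. It works on toy one-dimensional numeric data rather than dialogue – the papers above use far more elaborate sequence models – and every layer size and hyperparameter here is just an illustrative guess.

```python
# Minimal GAN sketch in PyTorch: the generator learns to mimic samples
# drawn from N(4, 1.25). All dimensions and learning rates are illustrative.
import torch
import torch.nn as nn

torch.manual_seed(0)

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))                 # generator
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())   # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    real = torch.randn(64, 1) * 1.25 + 4.0   # "real" data samples
    fake = G(torch.randn(64, 8))             # generated samples

    # Discriminator step: push real toward 1, fake toward 0.
    opt_d.zero_grad()
    loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    loss_d.backward()
    opt_d.step()

    # Generator step: try to make the discriminator say "real" (1) on fakes.
    opt_g.zero_grad()
    loss_g = bce(D(fake), torch.ones(64, 1))
    loss_g.backward()
    opt_g.step()

print(f"generated mean ~ {G(torch.randn(1000, 8)).mean().item():.2f} (target 4.0)")
```

The “competition” is exactly the push-pull above: the discriminator’s loss rewards telling real from fake, while the generator’s loss rewards fooling it.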

Posted on KDnuggets, a tutorial on the Google Translate algorithm:


Where We Are with Linguistic-based Artificial Intelligence


Important take-aways so far:

  1. Text analytics, and AI-generated text / speech, is now firmly an AI sub-discipline: The field uses basic deep learning architectures, such as Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks, to interpret the meaning of the texts being read, along with Bayesian logic and other methods to generate responses. Most recently, the AI text-analytics / text-generation community has been using Generative Adversarial Networks (GANs), as the preceding two articles indicate. (A bare-bones LSTM sketch appears after this list.)
  2. Text-and-speech analytics / text-and-speech generation tasks are now much more pervasive: Everyone knows that personal-assistant AIs, such as Siri and Alexa, can understand spoken requests. Among AI aficionados, the Google Neural Machine Translation system (GNMT) is well-known. (See a YouTube video on the GNMT.) Microsoft’s Rick Rashid has demonstrated Microsoft’s deep-learning-based English-to-Chinese speech translation system. Beyond the more well-known applications just identified, there are subtler applications, such as using text analytics to craft more persuasive messages for online marketing, as identified in the GAN example above.
  3. Text, speech, and search analytics are now much smarter than they’ve ever been: Google’s response to search queries shows a much greater understanding of the world than it did even a few years ago. It knows what sorts of topics might be relevant to you, based on its Knowledge Graph and its collected search histories. Other AI systems also connect what you speak and write to extensive ontologies, or world models, so they can interpret what you are saying more effectively than ever before.
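For take-away #1, here’s a bare-bones sketch of the LSTM side in PyTorch: a tiny classifier that reads a sequence of token IDs and emits class logits. The vocabulary size, the dimensions, and the random “sentences” are all hypothetical.

```python
# A tiny LSTM text classifier: embed token IDs, run an LSTM over the
# sequence, and classify from the final hidden state. Sizes are illustrative.
import torch
import torch.nn as nn

class TinyLSTMClassifier(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=32, hidden_dim=64, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_classes)

    def forward(self, token_ids):        # token_ids: (batch, seq_len)
        x = self.embed(token_ids)        # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(x)       # h_n: (1, batch, hidden_dim)
        return self.head(h_n[-1])        # class logits: (batch, n_classes)

model = TinyLSTMClassifier()
batch = torch.randint(0, 1000, (2, 12))  # two fake "sentences" of 12 token IDs
print(model(batch).shape)                # torch.Size([2, 2])
```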


A Good Text


Daniel Jurafsky is one of the “grand old men” in natural language; this text does look good, and I found the online Chapter 1 to be a very good read:

  • Daniel Jurafsky and James H. Martin, 2009. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (Second Edition). Chapter 1: Introduction – pdf.
    • Abstract: We introduce a vibrant interdisciplinary field with many names corresponding to its many facets, names like speech and language processing, human language technology, natural language processing, computational linguistics, and speech recognition and synthesis.
    • AJM’s Notes: This is a very good introduction, tracing the course of natural language processing over its history of several decades, and bringing in important lines of thinking that have contributed. Well worth reading as a kick-off to the natural language / text analytics field!

Knowledge Representation in Text Analytics and Natural Language Processing

One of the most important things about linguistic AI (I’m using that term in place of text analytics; it’s much more appropriate now) is that these AI systems work with multiple representation levels.

The Notion of Multiple Representation Levels

Here’s one good read; for others, please check my Readings – Artificial Intelligence page, which currently focuses on multiple levels of knowledge representation.

  • J. Rasmussen’s Skills, rules, and knowledge: signals, signs, and symbols, and other distinctions in human performance models (IEEE Trans. SMC, SMC-13(3), May 1983) is a classic and highly regarded paper. Think of signals and signs as extracted words or word-phrases in a document, along with their various statistical combinations, RDF triples, etc. They have no intrinsic meaning; meaning comes when we associate interpretations with the signals and signs. In the realm of text analytics, we create meaning at Level 3, and associate it with the various signs and signals from Levels 1–2. (A tiny RDF-triple sketch follows this item.)
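As a tiny illustration of signs without intrinsic meaning, here’s a sketch using the rdflib library. The triples below are perfectly well-formed, yet the graph attaches no interpretation to them; that association happens at the higher representation levels. The namespace and the triples themselves are made up.

```python
# "Signs" as RDF triples: the graph stores them, but any meaning is ours.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/")   # hypothetical namespace
g = Graph()

g.add((EX.HAL9000, RDF.type, EX.Computer))
g.add((EX.HAL9000, EX.says, Literal("I'm sorry, Dave.")))

# Iterate the stored triples; interpretation happens at a higher level.
for subj, pred, obj in g:
    print(subj, pred, obj)
```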

For more, I would go to readings on (multiple levels of) knowledge representation for AI.

Knowledge Graphs, Ontologies, and Taxonomies

The previous reading was not explicitly about text analytics or linguistic AI; it was for a general understanding of different knowledge representation levels. To be more specific to linguistic AI, I would:

Good reads once you’ve finished the above include:


Going from Statistical to Symbolic Knowledge

This is really the crux of the matter. It is what Google is doing with its move from “strings” to “things.”
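Here’s a toy illustration of that “strings to things” step: linking extracted surface strings to canonical entity IDs in a hypothetical mini-ontology. Real systems use rich context models and disambiguation; this only shows the shape of the mapping.

```python
# Toy entity linking: greedy longest-match from surface strings ("strings")
# to canonical entity IDs ("things"). The mini-ontology is entirely made up.
MINI_ONTOLOGY = {
    # surface form -> (canonical entity ID, type)
    "hal": ("ex:HAL9000", "Computer"),
    "hal 9000": ("ex:HAL9000", "Computer"),
    "kubrick": ("ex:StanleyKubrick", "Person"),
    "stanley kubrick": ("ex:StanleyKubrick", "Person"),
}

def link_entities(text):
    """Greedily link known surface forms (longest span first) to entity IDs."""
    tokens = text.lower().split()
    links, i = [], 0
    while i < len(tokens):
        for span in (2, 1):  # try the two-token span first, then one token
            phrase = " ".join(tokens[i:i + span])
            if phrase in MINI_ONTOLOGY:
                links.append((phrase, *MINI_ONTOLOGY[phrase]))
                i += span
                break
        else:
            i += 1
    return links

print(link_entities("Stanley Kubrick gave HAL 9000 its calm voice"))
# [('stanley kubrick', 'ex:StanleyKubrick', 'Person'), ('hal 9000', 'ex:HAL9000', 'Computer')]
```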


Sentiment Analysis: Still Challenging

Sentiment analysis continues to be one of the most challenging aspects of text analytics. One reason is that words can be ambiguous, or even used sarcastically; there is a great deal of context-dependence, and that’s one of the trickiest things to figure out. Another reason is that even basic, straightforward sentiment analysis usually requires at least some syntactic (grammar) processing, as well as a full dictionary of how certain sentiment terms show up in a given domain. For these reasons, the experiments that my students have done have shown limited success. I believe that sentiment analysis will become more tractable, but it will require more sophisticated systems than those in use today. (A quick lexicon-based sketch follows.)
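As a quick taste of why context-dependence is so hard, here’s a sketch using NLTK’s VADER lexicon-based analyzer. The sentences are made up; the sarcastic one will typically score as positive, because the lexicon sees “great” without the surrounding irony.

```python
# Lexicon-based sentiment with NLTK's VADER; note how sarcasm fools it.
import nltk
nltk.download("vader_lexicon", quiet=True)
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
for text in [
    "The support team was helpful and the product works great.",
    "Oh great, the product broke again. Just what I needed.",  # sarcasm
]:
    # compound is in [-1, 1]: negative to positive overall sentiment
    print(f"{sia.polarity_scores(text)['compound']:+.3f} | {text}")
```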

Here are a couple of industry reports on sentiment. Medallia expresses it as the VoC, or “Voice of the Customer.”

  • Discovering Business Insights from the Voice of the Customer: Medallia’s Approach to Text Analytics, by Ji Fang, Ph.D., from Medallia, is a good 9-page, easy-to-read overview of text analytics as applied to CEM (Customer Experience Management). Dr. Fang touches on three levels of text mining: concept/entity extraction, entity-nearest-neighbors, and syntactic relations (leading to sentiment analysis). She shows how machine learning supports text analysis, yet makes it clear that user-customized taxonomies are needed to get full value from sentiment analysis. With a few quick examples, she demonstrates both the subtlety and complexity of interpreting natural-language responses.
  • Text Analytics 2015, by Seth Grimes, a sentiment analysis pioneer, is a very useful next-read; he focuses here on the industry perspective – it’s always nice to “follow the money.”

Extra and Unsorted Little Bits