Ontologies, Knowledge Graphs, and AI: Getting from “Here” to “There” (Part 2)

A Principled Approach to AI: Representations and Transitions

 

In the last post, on “Moving Between Representation Levels: The Key to an AI System (Part 1),” I re-introduced one of the most important and fundamental AI topics: how we can effectively use multiple representation levels. If we’re going to build (or gauge the properties of) an AI system, we need a framework. The notion of representations, and of moving between representation levels, is as fundamental as we can get.

In this post, we’re going to focus on a single recent paper, by X.L. Chen and colleagues (including Fei-Fei Li), on improving a computer vision system by incorporating a high-level, symbolic reasoning component that acts via feedback and cooperation (as they say, iteratively) to refine scene understanding. This is an important problem, and this is a very useful paper. (See their Abstract at the end of this post, right after the citation of their work.)

The authors frame the issue as follows (my bold equates to their emphasis in the original text):

A key recipe to reasoning with relationships is to iteratively build up estimates. … In the case of top-down modules, high-level features which have class-based information can be used in conjunction with low-level features to improve recognition performance… Another alternative is to use explicit memory…
However, there are two problems with these approaches: a) both approaches use stack of convolutions to perform local pixel-level reasoning [see CITE], which can lack a global reasoning power that also allows regions farther away to directly communicate information; b) more importantly, both approaches assume enough examples of relationships in the training data – so that the model can learn them from scratch, but as the relationships grow exponentially with increasing number of classes, there is not always enough data.

Chen, X.L., Li, L.-J., Li, F.-F., and Gupta, A. (2018, Mar. 29). Iterative visual reasoning beyond convolutions, arXiv:1803.11189 [cs.CV]. online access, accessed April 1, 2018 by AJM.

They identify their core contribution as follows:

Our work takes an important next step beyond these [previous] approaches in that it also incorporates learning from structured visual knowledge bases directly to reason with spatial and semantic relationships.

Ibid.

Rather than continue with further summary, let me invite you to read their paper in full at Iterative visual reasoning beyond convolutions, and to dig into their references as well.

The essence of what Fei-Fei Li and colleagues are saying is: we’ve just about reached the limit of what can be achieved by stacking more and more ConvNet layers. Now we need to approach image understanding from the other end; we need to start thinking top-down as well as bottom-up.
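To make that top-down-plus-bottom-up idea a bit more concrete, here is a minimal numerical sketch (my own illustration, not the authors’ code) in which bottom-up per-region class scores, the sort of thing a ConvNet head produces, are refined by a single top-down pass through a small class-relationship matrix standing in for a knowledge graph. The array sizes, the random “knowledge” matrix, and the blending weight are all hypothetical.

```python
# Minimal sketch (not the Chen et al. code): refine bottom-up region/class
# scores with one top-down pass over a class-relationship ("knowledge") matrix.
# All sizes, values, and the blending weight below are hypothetical.
import numpy as np

num_regions, num_classes = 4, 5

# Bottom-up: per-region class scores, e.g. from a ConvNet detection head.
bottom_up = np.random.rand(num_regions, num_classes)

# Top-down prior: a row-normalized class-to-class relationship matrix,
# standing in for semantic edges ("is-a", "co-occurs-with", ...) in a knowledge graph.
knowledge = np.random.rand(num_classes, num_classes)
knowledge /= knowledge.sum(axis=1, keepdims=True)

# One top-down step: propagate each region's beliefs through the class graph,
# so evidence for one class can raise (or lower) scores of related classes.
top_down = bottom_up @ knowledge

# Blend the two estimates; in a real system this weight would be learned.
alpha = 0.5
refined = alpha * bottom_up + (1.0 - alpha) * top_down
print(refined.shape)  # (4, 5)
```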

 
 

Everything Old is New Again

The idea of using a knowledge level to guide image understanding is not at all new. At the dawn of the computer vision era, Rodney Brooks, R. Greiner, and Thomas Binford – all then at the Stanford AI Lab, precursor to the Stanford Computer Vision Lab – introduced a knowledge-based system called ACRONYM. (Perhaps not coincidentally, the current work by Fei-Fei Li and colleagues is also being conducted at the Stanford Computer Vision Lab, along with, of course, Google’s labs.)

ACRONYM used First Order Logic, and was written in LISP. It took approximately two to three hours to process a single image.

At the same time, a team at SRI International used a region-finding approach, and another team at Honeywell SRC, headquartered in Minneapolis, MN, used an edge-finding approach. These teams were able to do the low-level region and edge detection very rapidly. (That is, rapidly for that era.) However, their processes were not able to link the extracted edges and regions to an actual image interpretation.

Thus, at the dawn of the neural network era, the whole realm of computer vision was stalled.

There was still important progress going on. One of the most important contributions was from David Marr, at MIT, who invented the notion of a “2 1/2-D representation.” This was important because previous work either modeled 3-D objects and then generated how a 2-D image of such an object would appear from a given perspective, or worked with just the edges and regions extracted from the 2-D image. The two representations were simply not connecting.

Another important contribution came from David Lowe, who introduced the notion of perceptual grouping. This was a means of organizing regions so that they connected with each other in a way similar to how we would perceive them as being related. It was based on the much earlier (1940s) work of a group of researchers called the Gestalt psychologists. (Gestalt is a German word meaning “shape,” but more broadly, the Gestalt psychologists identified five principles governing edge / region organization in vision, collectively referred to as “perceptual organization principles.”)

David Marr was at MIT, as was Rodney Brooks. Rodney Brooks and David Lowe both got their Ph.D.’s at Stanford, under Tom Binford. (Brooks was also one of Lowe’s advisors.) Thus, we’re seeing some people-connections, leading to a ferment of ideas. And in this brief summary, I’m leaving out a number of very important people who were key players in developing our joint sense of how human (perceptual) vision worked, and of what we needed to do to create workable computer vision. (For some suggestions on other really important researchers in this timeframe, take a look at this blogpost, and at this Book Review.)

The ideas developed in the 1980s – 1990s timeframe were all well and good. The only problem was really the same problem that plagued all the other AI applications: not enough computing power/processing speed at the lower end, not enough training data, and not enough layers in the neural networks (the latter compounded by the difficulties in training complex networks). All of this set the stage for the well-known breakthroughs in computer vision systems, starting in the mid-2000s.

For a nice three-part summary of the ImageNet breakthroughs, see Adit Deshpande’s posts: Post 1, Post 2, and Post 3.

Where We Are Now

Dagwood Bumstead holding a Dagwood sandwich in a handscrew clamp, illustration by Chip Young. Reproduced under Fair Use agreement, for educational purposes only.

As recognized by Fei-Fei Li and colleagues (previous winners of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC)), we now need to go beyond stacking together ConvNet layers. Building a bigger-and-better Dagwood baloney sandwich will not solve the next round of really tough problems.

Practically speaking: we need to connect the knowledge level (oooh – shades of Allen Newell, circa 1980) with the low-level “statistical” or data level. Fei-Fei Li’s work, with colleagues, is an excellent indication of how we should proceed.

More to come. (This is not a trivial topic.)

In particular, we want to consider the notion of iterativity … that is, feedback from the higher to lower levels, and doing that more than once.
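As a rough sketch of what that iterativity might look like in code (my own schematic, not the Chen et al. implementation): a local estimate and a global, relationship-based estimate each read the other’s previous output for a few steps, and the final prediction is an attention-weighted blend of the two. The update rules, weights, and sizes below are placeholders; the point is simply the control flow of two modules rolled out together, cross-feeding, with an arbitration at the end.

```python
# Hedged sketch of iterative cross-feeding between a "local" and a "global"
# estimate, ending in an attention-weighted blend. All functions and numbers
# are hypothetical stand-ins, not the paper's spatial-memory / graph modules.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def update_local(prev_local, prev_global, features):
    # Hypothetical local update: mix raw features with the global feedback.
    return 0.6 * features + 0.2 * prev_local + 0.2 * prev_global

def update_global(prev_global, prev_local, relations):
    # Hypothetical global update: propagate local beliefs over class relations.
    return 0.5 * (prev_local @ relations) + 0.5 * prev_global

num_regions, num_classes, num_steps = 4, 5, 3
features = np.random.rand(num_regions, num_classes)
relations = softmax(np.random.rand(num_classes, num_classes), axis=1)

local_est = features.copy()
global_est = features.copy()
for _ in range(num_steps):                      # feedback, more than once
    new_local = update_local(local_est, global_est, features)
    new_global = update_global(global_est, local_est, relations)
    local_est, global_est = new_local, new_global

# Per-region attention weights decide how much to trust each module.
attn = softmax(np.stack([local_est.max(axis=1), global_est.max(axis=1)]), axis=0)
final = attn[0][:, None] * local_est + attn[1][:, None] * global_est
print(final.shape)  # (4, 5)
```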

 
 

Live free or die, my friend –

AJ Maren

Live free or die: Death is not the worst of evils.
Attr. to Gen. John Stark, American Revolutionary War

 
 

Most Crucial To-Reads (Journal and arXiv)

 

  • Chen, X.L., Li, L.-J., Li, F.-F., and Gupta, A. (2018, Mar. 29). Iterative visual reasoning beyond convolutions, arXiv:1803.11189 [cs.CV]. online access, accessed April 1, 2018 by AJM.

 

Abstract: We present a novel framework for iterative visual reasoning. Our framework goes beyond current recognition systems that lack the capability to reason beyond stack of convolutions. The framework consists of two core modules: a local module that uses spatial memory to store previous beliefs with parallel updates; and a global graph-reasoning module. Our graph module has three components: a) a knowledge graph where we represent classes as nodes and build edges to encode different types of semantic relationships between them; b) a region graph of the current image where regions in the image are nodes and spatial relationships between these regions are edges; c) an assignment graph that assigns regions to classes. Both the local module and the global module roll-out iteratively and cross-feed predictions to each other to refine estimates. The final predictions are made by combining the best of both modules with an attention mechanism. We show strong performance over plain ConvNets, e.g. achieving an 8.4% absolute improvement on ADE measured by per-class average precision. Analysis also shows that the framework is resilient to missing regions for reasoning.
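To keep the three graphs named in the abstract straight, here is a toy (purely assumed) encoding of them; none of the node names, relations, or scores below come from the paper.

```python
# Toy illustration (assumed, not from the paper) of the three graphs named
# in the abstract: classes and their semantic relations, image regions and
# their spatial relations, and candidate region-to-class assignments.
knowledge_graph = {
    "nodes": ["person", "horse", "saddle"],
    "edges": [("person", "rides", "horse"), ("saddle", "part_of", "horse")],
}

region_graph = {
    "nodes": ["region_0", "region_1"],             # detected image regions
    "edges": [("region_0", "above", "region_1")],  # spatial relationship
}

assignment_graph = {
    # (region, class) pairs with current confidence; refined iteratively
    ("region_0", "person"): 0.7,
    ("region_1", "horse"): 0.6,
}
```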

 

Academic Articles and Books on Representations for AI

  • Brooks, R.A., Greiner, R., and Binford, T.O. (1979). The ACRONYM model-based vision system. IJCAI’79 Proc. of the 6th Int’l Joint Conference on Artificial Intelligence – Volume 1 (Tokyo, Japan: August 20 – 23, 1979), 105-113. Publ: Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
  • Lowe, D. (1984). Perceptual Organization and Visual Recognition. Doctoral dissertation, Stanford University. pdf, accessed April 18, 2018 by AJM.
  • Marr D. (1982). Vision (San Francisco: W.H. Freeman).
  • McCarthy, J., et al. (1980, May). Final Report: Basic Research in Artificial Intelligence and Foundations of Programming. Stanford Artificial Intelligence Laboratory, Memo AIM 337; Computer Science Dept., Report No. STAN-CS-80-808. pdf, accessed April 18, 2018 by AJM.
  • McClamrock, R. (1991, May). Marr’s Three Levels: A Re-evaluation. Minds and Machines, 1 (2), 185–196. online access, accessed April 1, 2018 by AJM.
  • Newell, A. (1980, Aug 19). The knowledge level. Presidential Address, American Association for Artificial Intelligence. AAAI80, Stanford University. Later published in Artificial Intelligence and AI Magazine (1981, July). online access, accessed April 1, 2018 by AJM.
  • Warren, W.H. (2012). Does this computational theory solve the right problem? Marr, Gibson, and the goal of vision. Perception, 41(9): 1053–1060. doi: 10.1068/p7327. online access, accessed April 1, 2018 by AJM.

 

Useful Blogs

  • AI Business (2016, November 14). Dichotomy of Intelligence – a Thorny Journey Towards Human-Level Intelligence. blogpost, accessed April 18, 2018 by AJM.
  • Bergman, M. (2018, Feb. 21) Desiderata for Knowledge Graphs. online access, accessed April 18, 2018 by AJM.
  • Deshpande, A. (2016, August 24). The 9 Deep Learning Papers You Need To Know About (Understanding CNNs Part 3). Post 3, accessed April 18, 2018 by AJM.
  • Deshpande, A. (2016, July 29). A Beginner’s Guide To Understanding Convolutional Neural Networks Part 2. Post 1, accessed April 18, 2018 by AJM.
  • Matthen, M. (2014, July 28). Thomas Natsoulas: Consciousness and Perceptual Experience: An Ecological and Phenomenological Approach – A Review of the Book. Book Review Online, accessed April 18, 2018 by AJM. AJM’s Note: Look at this review for suggestions of researchers who are NOT Natsoulas, who have made very valuable contributions to computer vision and particularly to the perceptual underpinnings of human vision that carry over to good computer vision systems.
  • Singhal, A. (2012, May 16). Introducing the Knowledge Graph: things, not strings. Google. online access, accessed April 1, 2018 by AJM.

 
 
