Below is another set of IJCAI session notes. This was the first invited talk in a day-long workshop called Architectures and Evaluation for Generality, Autonomy & Progress in AI (AEGAP). The speaker, Oren Etzioni, talked about some of the work the Allen Institute is doing to drive the creation of common sense in AI. He focused especially on a need for a concrete benchmark to measure results when it comes to implementing common sense. As before, these are mostly for myself, to sort through the notes in my head and revisit something a little more organized in the future. In addition to the talk itself these will cover the Q&A which followed.
Learning common sense: a grand challenge for academic AI research
By Oren Etzioni, CEO of the Allen Institute for AI.
We live in a world with so much compute power and data (especially user data); how can an academic make a contribution to the field?
“The best minds of my generation are thinking about how to make people click on ads. That sucks.” - Jeff Hammerbacher
Where is AI today?
- Gather lots of training data!
- Apply deep learning!
- Observe Impressive Gains!!
This is a real thing; it has taken the field by storm. But is that all there is to research? Sometimes it seems that way: executing this recipe and publishing a paper is all graduate students want to do.
Getting outside of the above recipe. Why bother?
Performance of AI systems in their specialized domains is often very impressive, but they are brittle if you try to use them outside of the scope intended by the designers. We have systems that do a great job at one thing, like object detection, but if you show them something which looks like noise they will not be able to cope. They will mislabel that thing. Deep learning models for object detection are easily fooled, and they make high confidence incorrect predictions. At this point examples were provided of systems classifying white noise as objects, or finding a “car” in a picture of a duck because a collection of pixels in the bottom left part of the water looked like it might resemble a car-like silhouette. These kinds of issues aren’t actually limited to vision.
In 1997 when we had Deep Blue, we said it would make a superhuman/brilliant chess move in the game while the room is burning down around it. Now we have AlphaGo: Go is much more complex, and it too will make a brilliant Go move while the room is burning down around it. When we ignore context as many current systems do, we have a problem.
What is common sense?
The speaker defines common sense as “Knowledge about the world that most people have, but most AI systems do not.”
“Common sense facts and methods are only very partially understood today, and extending this understanding is the key problem facing artificial intelligence.” - John McCarthy, 1983
Why do AI systems need common sense?
- robustness: adversarial examples; zero shot learning. What if you see a completely novel situation? Can you come up with the right judgements?
- data efficiency: learn with fewer training examples
- generality: transfer learning, etc.
- performance: NLP, robotics, medical diagnosis, etc.
- Safety: how can an AI system avoid harm if it doesn’t know what is harmful?
Often in “safety” conversations we go to reinforcement learning or alignment of optimization and loss functions; this is an interesting set of mathematical issues. However, what if we build an AI system that has no common sense and cannot understand what is harmful/good? How far should it go in pursuit of objectives? One example is AI which becomes obsessed with producing paper-clips: it takes over all resources and destroys the world because it is focusing on doing its job very well. This is an example of AI with no common sense.
Lessons from Cyc
Douglas Lenat launched a project to build a common sense system, called it Cyc. Some lessons learned:
- Implicit knowledge is critical. Initially he started writing down knowledge that would be in an encyclopedia, but he realized that it is implicit knowledge and not encyclopedia knowledge which is important.
- Size matters - scalability
- But you need to know how to use all that size (reasoning!)
- Consistency is not realistic (introduced notion of micro-theories).
Crowd sourcing, machine vision, and modern NLP are an opportunity to revisit this grand challenge! This is what the Allen Institute’s project, Project Alexandria, is trying to use to tackle the same challenge of common sense in AI.
Key learning/point: Having a benchmark/performance metric is essential. Compare the evolution of Cyc to what happened with games. In games you can tell who won and who lost. Not just with games, throughout machine learning (especially natural language processing and vision), there is a big focus on benchmarks. We are addicted to making progress by identifying a data set and improving performance on that set. This is very productive. So a key idea for Project Alexandria is to move the field of common sense forward by first defining a benchmark data set for common sense.
Creating a benchmark for common sense.
Questions to be answered:
- breadth: what topics are covered?
- depth: what is the sophistication of knowledge?
- language: should this benchmark factor out certain linguistic challenges?
- vision: is visual/robotic common sense included? For example, how do you express that when you open a door you need to take a few steps back to give the door room to open? Should this be in the benchmark?
Project Alexandria is working on a common sense leaderboard and a question set called Common Sense QA; existing systems will struggle with this.
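To make the leaderboard idea concrete, here is a minimal sketch of how a multiple-choice common sense benchmark could be scored. The question format, the two sample questions, and the names `accuracy` and `first_choice_baseline` are assumptions made for illustration; the talk did not describe the actual format of Common Sense QA.

```python
# Hypothetical question format; the sample questions are invented for illustration.
QUESTIONS = [
    {"question": "What do people breathe?",
     "choices": ["air", "sand", "paper"],
     "answer": "air"},
    {"question": "Which is usually bigger?",
     "choices": ["a cat", "a window"],
     "answer": "a window"},
]

def accuracy(system, questions):
    """Fraction of questions for which the system picks the labelled answer."""
    correct = sum(
        1 for q in questions
        if system(q["question"], q["choices"]) == q["answer"]
    )
    return correct / len(questions)

def first_choice_baseline(question, choices):
    """A trivial baseline with no common sense: always pick the first option."""
    return choices[0]

print(accuracy(first_choice_baseline, QUESTIONS))  # 0.5
```

Any system that exposes the same `(question, choices) -> answer` interface could be scored the same way, which is roughly what makes a leaderboard possible: different approaches get compared on one number.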
How do you acquire this kind of common sense knowledge?
Etzioni brings up machine reading specifically - auto-text to knowledge. Extracting info from text, especially text you find on the internet. Previous systems relied on trained human experts inputting knowledge by hand, and some systems then allowed less specifically-trained humans to do the same. However, we have an entire mass of human input out there for the taking already on the web. We should use this.
Can we leverage regularities in language to extract information in a relation-independent way? Relations are often anchored in verbs and exhibit simple syntactic form.
openie.allenai.org - “Open information extraction (open IE) refers to the extraction of relation tuples, typically binary relations, from plain text. The central difference is that the schema for these relations does not need to be specified in advance; typically the relation name is just the text linking two arguments. For example, Barack Obama was born in Hawaii would create a triple (Barack Obama; was born in; Hawaii), corresponding to the open domain relation was-born-in(Barack-Obama, Hawaii). This software is a Java implementation of an open IE system.” (https://nlp.stanford.edu/software/openie.html)
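As a rough illustration of the idea in the quote, here is a toy extractor that treats the text linking two arguments as the relation name. It is only a sketch under a strong assumption: arguments are runs of capitalized words, a crude stand-in for the noun-phrase chunking a real open IE system would do. The names `ARG`, `PATTERN` and `extract_triples` are made up for this example and are not part of any open IE library.

```python
import re

# Assumption: an argument is a run of capitalized words, and the relation is
# the lowercase text between two arguments. Real systems use syntactic analysis.
ARG = r"[A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)*"
PATTERN = re.compile(rf"({ARG})\s+([a-z][a-z\s]*?)\s+({ARG})")

def extract_triples(sentence):
    """Return (arg1, relation, arg2) tuples found in the sentence."""
    return [m.groups() for m in PATTERN.finditer(sentence)]

print(extract_triples("Barack Obama was born in Hawaii"))
# [('Barack Obama', 'was born in', 'Hawaii')]
```

Note that no relation schema is given up front: "was born in" is never listed anywhere, it simply falls out of the sentence. The sketch also shows the limits of such shallow patterns, since a sentence without capitalized arguments yields nothing.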
TextRunner (a 2007 system of this kind) suffered from attention deficit disorder. If you go to the demo and type “apple” or “tree”, it’ll know a few things on that topic but won’t really have the comprehensive organized knowledge you would want if you wanted to use it in machine translation or common sense question answering. The tech is limited in its ability to obtain high quality bodies of common sense knowledge.
Often the things you want to know about certain predicates are not stated in text. We do not say “I am larger than a chair”. That is obvious. Some work currently being done uses inference to overcome reporting bias.
For example: “x threw y”, therefore “x is bigger than y”, “x weighs more than y”, “y will be moving faster than x” can be inferred.
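One simple way to picture this kind of inference is a table of rules keyed by verb. The sketch below is an assumption about how such rules could be written down, not the method used in the work mentioned in the talk; only the three "threw" implications come from the example above, and the names `IMPLIED` and `infer` are hypothetical.

```python
# Hypothetical rule table: each verb maps to relations it implies between
# its subject (x) and object (y). Only the "threw" rules come from the talk.
IMPLIED = {
    "threw": [
        ("x", "is bigger than", "y"),
        ("x", "weighs more than", "y"),
        ("y", "will be moving faster than", "x"),
    ],
}

def infer(x, verb, y):
    """Expand an observed (x, verb, y) statement into implied triples."""
    binding = {"x": x, "y": y}
    return [(binding[a], rel, binding[b]) for a, rel, b in IMPLIED.get(verb, [])]

for triple in infer("Anna", "threw", "the ball"):
    print(triple)
# ('Anna', 'is bigger than', 'the ball')
# ('Anna', 'weighs more than', 'the ball')
# ('the ball', 'will be moving faster than', 'Anna')
```

The point is that none of the three output facts would normally be written down in text, yet all of them follow from a sentence that is.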
Work is also being done to recover information from images. For example, images display relative size. We can see that a dog is bigger than the cat next to it. We can see that a window is bigger than a cat. We can then use this to make estimates and calculations about size relation between objects/entities.
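A small sketch of how pairwise size observations could be combined: if images show that A is bigger than B and B is bigger than C, we can infer that A is bigger than C without ever seeing them together. This is plain transitive closure, chosen here as an assumption for illustration; the talk did not say how the actual systems combine observations. The "window is bigger than dog" observation is invented to make the chain work.

```python
from collections import defaultdict

def infer_bigger_than(observed):
    """Transitive closure over observed (bigger, smaller) pairs."""
    bigger = defaultdict(set)
    for big, small in observed:
        bigger[big].add(small)
    closure = set(observed)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            # Everything smaller than b is also smaller than a.
            for c in list(bigger[b]):
                if (a, c) not in closure:
                    closure.add((a, c))
                    bigger[a].add(c)
                    changed = True
    return closure

# Hypothetical observations, e.g. from two different images.
pairs = [("window", "dog"), ("dog", "cat")]
print(("window", "cat") in infer_bigger_than(pairs))  # True
```

Real estimates would be noisy and probabilistic rather than strict orderings, so this only conveys the basic shape of the reasoning.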
Common sense from the crowd
This is probably the most important source of common sense knowledge. People are the ones who have common sense and our systems do not. In Cyc the idea was that people would type their knowledge in, but they had to be highly trained knowledge engineers to do a good job. We are continuing to look at different ways that common sense can be crowdsourced. Key questions:
- scalability: is it economically viable? Cyc was relatively expensive
- class distribution: The truth may be a fraction of interesting facts. Can we make sure we get true facts?
- difficulty: are the collected facts easily derived from simple data driven methods?
- coverage: can we avoid TextRunner-like ADD?
What are the right problems for evaluating common sense? What are the right representations for common sense knowledge that also support reasoning?
“AI requires the investigation of ill-structured problems.” - Herb Simon, Artificial Intelligence
What is an ill structured problem? Well structured problems are ones where an objective function can be tractably computed. A well structured problem is basically deep learning. If we can take a problem and define it as a deep learning problem, we have a well structured problem. Ill structured problems are the ones that cannot be structured as a deep learning problem.
Figuring out how to optimize something is far easier than figuring out what to optimize.
Ill structured problem: “People breathe air.”
How do we represent this formally? What is the formal vocabulary for the above?
- common sense is critical for AI
- progress has been limited despite the successes of deep learning
- we need a benchmark and crisp metrics
- we have some new ideas on acquisition
- we’re launching Project Alexandria.
Q & A
(Both questions and answers are paraphrased)
Comment: Foundations of common sense lie deeper than natural language processing; we need to interact with the physical world on a more fundamental level: understanding object persistence, movement, containment. A slight difference of perspective.
Response: We mostly agree. Grounding and more embodied/physical common sense is definitely necessary. Then there come two potential points of disagreement:
- Does NLP come before this grounding? Some would say we start with grounding and language emerges in a variety of ways; Etzioni isn’t sure that he believes this.
- The key point is to have a benchmark. Suppose we go down the grounding road; then the first question is: tell me what your benchmark is. Some work in this grounding area is already being done, but we need to be able to compare that to other approaches.
Question: will this common sense graph or knowledge extraction ever get to expressing common sense in first order logic?
Answer: Some in the field say “clearly we need first order logic”, or “clearly we need Markov logic networks”. We can spend decades investing in this. First we need a benchmark. The questions the Allen Institute is posing are not simple lookups; if there is a set of common sense questions which require an understanding of naive physics, which in turn requires logic, then the only systems that will do well are the ones that have that. Etzioni does not want to assume anything, but to start by building a benchmark and then see.
“Let’s not philosophise; let’s measure.”
Question: we are all born without common sense and develop different common sense based on how we grow up. Is it important to study what gives humans common sense and the environment in which it develops?
Answer: This kind of question goes to a core fork in the road in AI: do we build an AI system based on how people do it or in some different way? There is no concrete answer; different approaches should be investigated and measured.
Comment: We don’t know what the formal vocabulary is for human intellectual commerce: a formal, systematic account that says you need first order logic for this or some other kind of logic for that. If we don’t have that formal foundation specified, my prediction is that, despite how spiritually correct this whole thing is, we’ll go through the same thing Lenat went through, where everything took 10 years because he didn’t formally know what he was doing.
Response: Etzioni doesn’t feel like we’ll recapitulate Lenat. One huge difference methodologically is that we are going from the outside in as opposed to inside out. We are starting by defining a set of benchmarks for systems to solve. We’ll be much more empirical about whether something works or not. And yes, it’s possible that we can get the wrong benchmarks, be overly fixated on them, etc. But in general we are in a different space - we are not naive on the underpinnings, we are just not blindly committed to them.
The second point is that when we do build new systems we are more likely to rely on existing systems. If you think DL is misguided and naive and will fail, we will also suffer, but Etzioni thinks it will be different. Cyc chose to isolate itself from the academic community very early. By contrast, the Allen Institute is committed to maintaining engagement with the peer review process and the wider community. This will force them to listen to other people as they progress, more strongly than the Cyc project did.
Question: [This one seemed similar to the question above about how humans learn common sense] Do you take into account the developmental aspect of common sense? E.g., a 10 year old child has its own common sense, which it develops by interacting with the world. Do you take into account development of common sense during the life of the “child”/agent?
Answer: Allen Institute has not done this in their work, but this does not mean that they have it right. The thing to emphasise is that different approaches which do or do not take this into account should be able to be compared and measured. Compared not based on hand-waving but actual performance. Using benchmarks. Let’s make this less of a hand-wavy question and more of an empirical computer science question.
Question: Can we design common sense (and benchmarks for common sense) which goes beyond what humans are able to do?
Answer: While a wonderful idea, we are so woefully inadequate with just basic common sense that we aren’t anywhere near that. Because of these issues, like interactions with the world, we are lacking very basic information. It’s not just physical information, and it’s not just factual information. Those are easy to pick up. It is a wide variety of things humans have access to almost instantaneously and the machine doesn’t. So while it would be great to get beyond what humans are able to do, it’s not even really on the map yet.
Follow-up comment: Maybe it’s worth thinking about these things now (or in a few years' time) in order to design these systems better and (potentially) put them on the path to superhuman common sense from the beginning.
Question: I like the go big approach, but why not go small instead? A simulated world where you want to learn common sense, where you’re deliberately constraining the factors involved.
Answer: We have thought about this in the context of a system called Thor. Etzioni isn’t sure what the right approach is; he is just sure that he wants it to be highly measurable.
Question: It sounds like these benchmarks are focused on measuring knowledge which is already acquired by the system. But thinking about the central importance of creativity, what about having a benchmark involving an agent which can demonstrate artificial creativity?
Answer: The benchmarks we’re talking about are snapshots of performance. We’re not sure how to benchmark creativity.
Question: We say humans that are good at learning things are very smart, but forget to measure how quickly they forget those things. Would it be interesting to create a benchmark for knowledge acquisition?
Answer: Due to the flavour of the benchmarks being developed, they are kind of a benchmark for knowledge acquisition. This goes back to the point of breadth: what is the extent of questions a benchmark should ask you to answer? The breadth is so vast that you do need some acquisition and reasoning procedure to pass the benchmark. It may look like human learning or not, they’re just trying to get the stake in the ground to make this measurable. Today we have very few projects on common sense. It is difficult to answer which system has how much common sense knowledge; there’s no way to scientifically study this, which is where we need to start.