VCR, or Visual Commonsense Reasoning, is a dataset for cognition-level image understanding. A model must not only recognize which objects and people are present in an image; it must also infer people's actions, mental states, and goals.
A model is given an image, a set of detected objects, a question, and four answer choices, and it must pick the correct answer. It is then given four rationale choices and must pick the one that best justifies why that answer is right.
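To make the two-stage setup concrete, here is one hypothetical VCR-style example written as a Python dict. The field names, image path, and text are all illustrative, not the dataset's actual schema or contents.

```python
# A single hypothetical VCR-style example; field names and values are
# illustrative only, not the dataset's real JSON schema.
example = {
    "image": "movie_frame.jpg",  # placeholder path
    "objects": ["person1", "person2", "cup"],
    "question": "Why is [person1] pointing at [person2]?",
    "answers": [  # four answer choices; the model picks one
        "He is telling [person2] that the cup is hers.",
        "He is angry at [person2].",
        "He wants [person2] to leave.",
        "He is ordering food.",
    ],
    "rationales": [  # four rationale choices for the second stage
        "[person1] is holding the cup out toward [person2].",
        "[person1] is smiling.",
        "The table is empty.",
        "[person2] is looking away.",
    ],
}

# Stage 1 (question -> answer): choose among example["answers"].
# Stage 2 (question + chosen answer -> rationale): choose among
# example["rationales"] the one that best explains the answer.
```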
We'll use BERT and detection regions to ground the words in the query, then contextualize the query with the response. On top of that, we'll perform several steps of reasoning over a representation consisting of the response choice under consideration, the attended query, and the attended detection regions.
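A minimal sketch of the contextualization step described above, using NumPy in place of a trained model: each response token attends over the grounded query tokens and over the detection-region features via dot-product attention, and the results are concatenated with the response itself to form the input to the reasoning steps. The embedding dimension, sequence lengths, and random features are stand-in assumptions; a real model would use BERT embeddings and learned attention weights.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(response, context):
    """Dot-product attention of response tokens over context vectors.

    response: (n_resp, d), context: (n_ctx, d).
    Returns (n_resp, d): one weighted context summary per response token.
    """
    scores = response @ context.T          # (n_resp, n_ctx) similarities
    weights = softmax(scores, axis=-1)     # normalize over the context
    return weights @ context               # weighted sum of context vectors

rng = np.random.default_rng(0)
d = 8                                       # toy embedding size
query = rng.normal(size=(5, d))             # grounded query tokens
regions = rng.normal(size=(7, d))           # detection-region features
response = rng.normal(size=(6, d))          # one response (answer) choice

attended_query = attend(response, query)
attended_regions = attend(response, regions)

# Representation handed to the reasoning steps: the response choice,
# the attended query, and the attended detection regions.
rep = np.concatenate([response, attended_query, attended_regions], axis=-1)
```

In the real model, one such representation is built per answer choice and scored after the reasoning steps; here the concatenation just shows how the three ingredients come together.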