Material Type | Thesis/Dissertation |
---|---|
Title/Author | Embodied Visual Perception Models for Human Behavior Understanding. |
Personal Author | Bertasius, Gediminas. |
Corporate Author | University of Pennsylvania. Computer and Information Science. |
Publication | [S.l.]: University of Pennsylvania, 2019. |
Publication | Ann Arbor: ProQuest Dissertations & Theses, 2019. |
Physical Description | 258 p. |
Source Record | Dissertations Abstracts International 81-02B. |
ISBN | 9781085560221 |
Dissertation Note | Thesis (Ph.D.)--University of Pennsylvania, 2019. |
General Note | Source: Dissertations Abstracts International, Volume: 81-02, Section: B. Advisor: Shi, Jianbo. |
Access Restrictions | This item must not be sold to any third party vendors. |
Abstract | Many modern applications require extracting the core attributes of human behavior, such as a person's attention, intent, or skill level, from visual data. There are two main challenges related to this problem. First, we need models that can represent visual data in terms of object-level cues. Second, we need models that can infer the core behavioral attributes from the visual data. We refer to these two challenges as "learning to see" and "seeing to learn," respectively. In this PhD thesis, we have made progress towards addressing both challenges. We tackle the problem of "learning to see" by developing methods that extract object-level information directly from raw visual data. These include two top-down contour detectors, DeepEdge and HfL, which can be used to aid high-level vision tasks such as object detection. Furthermore, we also present two semantic object segmentation methods, Boundary Neural Fields (BNFs) and Convolutional Random Walk Networks (RWNs), which integrate low-level affinity cues into the object segmentation process. We then shift our focus to video-level understanding and present a Spatiotemporal Sampling Network (STSN), which can be used for video object detection and discriminative motion feature learning. Afterwards, we transition to the second subproblem of "seeing to learn," for which we leverage first-person GoPro cameras that record what people see during a particular activity. We aim to infer core behavioral attributes such as a person's attention, intention, and skill level from such first-person data. To do so, we first propose the concept of action-objects: the objects that capture a person's conscious visual (e.g., watching a TV) or tactile (e.g., taking a cup) interactions. We then introduce two models, EgoNet and Visual-Spatial Network (VSN), which detect action-objects in supervised and unsupervised settings, respectively. Afterwards, we focus on a behavior understanding task in a complex basketball activity. We present a method for evaluating players' skill level from their first-person basketball videos, and also a model that predicts a player's future motion trajectory from a single first-person image. |
General Subject | Computer science. Artificial intelligence. |
Language | English |