From where and how to what we see
Eye movement studies have confirmed that overt attention is highly biased towards faces and text regions in images. In this paper we explore a novel problem of predicting face and text regions in images using eye tracking data from multiple subjects. The problem is challenging as we aim to predict the semantics (face/text/background) from eye tracking data alone, without utilizing any image information. The proposed algorithm spatially clusters the eye tracking data obtained on an image into coherent groups and subsequently models the likelihood of each cluster containing faces or text using a fully connected Markov Random Field (MRF). Given the eye tracking data from a test image, it reliably predicts potential face/head (humans, dogs and cats) and text locations. Furthermore, the approach can be used to select regions of interest for further analysis by face and text detectors. This hybrid eye position/object detector approach achieves better detection performance and reduced computation time compared to running the object detection algorithm alone. We also present a new eye tracking dataset from 15 subjects on 300 images selected from the ICDAR, Street-view, Flickr and Oxford-IIIT Pet datasets.
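The clustering step can be illustrated with a minimal sketch. The abstract does not specify the clustering method, so the greedy proximity-based grouping below, the `radius` threshold, and all function names are assumptions chosen only to show the idea of grouping 2D fixation points into spatially coherent clusters:

```python
import math

def cluster_fixations(points, radius=80.0):
    """Greedy spatial clustering of 2D fixation points (illustrative only).

    Each point joins the nearest existing cluster whose centroid lies
    within `radius` pixels; otherwise it seeds a new cluster. The radius
    is a hypothetical value, not taken from the paper.
    """
    clusters = []  # each cluster: {"points": [...], "centroid": (x, y)}
    for p in points:
        best, best_d = None, radius
        for c in clusters:
            d = math.dist(p, c["centroid"])
            if d < best_d:
                best, best_d = c, d
        if best is None:
            clusters.append({"points": [p], "centroid": p})
        else:
            best["points"].append(p)
            pts = best["points"]
            # Recompute the centroid as the mean of member fixations.
            best["centroid"] = (
                sum(x for x, _ in pts) / len(pts),
                sum(y for _, y in pts) / len(pts),
            )
    return clusters
```

In the paper's pipeline, each resulting cluster would then be scored by the MRF for its likelihood of covering a face, text, or background region.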