RGC Newsletter - Research Frontiers

Professor Yang Qiang

In this information age, data are at the center of most of our daily activities. Imagine you start your day and head off to work. Your car’s movement data are captured via GPS devices and smartphones. When you visit the web and search for a specific product, your queries and clicks are recorded in a data store known as a query log. When you take a picture with your camera, one or more image files are created. We are truly in the age of so-called Big Data, where data flow in at ever larger volume, in wider variety and at faster speed. When we accumulate so much data, it is important to search for knowledge and useful patterns in the data using specialized software techniques. These data-analytic software technologies are collectively known as data mining, where the key issue is to uncover and make sense of hidden knowledge in the data.

This project aims to develop a new technology that finds interesting patterns and knowledge by considering two or more sources of different, but related, data together to discover something new. For example, suppose a person has read many books on butterflies. Does this help the person recognize a new species of butterfly by looking at its image? Humans seem to be able to do this smoothly, by transferring knowledge learned in one domain to another. Knowing chess makes it easier for us to learn checkers, and knowing how to program in C++ gives us an advantage in learning Java. These are everyday examples of heterogeneous transfer learning, the subject of this project, in which knowledge is transferred from one application to another. Similarly, when reading a book, we form images of faces and scenes in our minds, which could help us learn to recognize scenes and faces in photos. Such transfer could save a lot of effort and expense in building new models for recognizing new scenes and faces.

In data mining, a feature space is needed to represent data, and different areas of application have different feature spaces. For example, we use pixels, lines and areas to represent an image, and words and phrases to represent a text document. Heterogeneous transfer learning refers to learning in one domain (such as text) and then transferring the learned knowledge to another, newer or more difficult domain (such as images).
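To make the idea of different feature spaces concrete, here is a minimal sketch in Python. It is an illustration only, not the project's software: the function names, the three-word vocabulary and the tiny 2x2 "image" are all assumptions chosen for the example. It shows that a text document and an image end up as vectors of different lengths whose entries mean entirely different things.

```python
from collections import Counter


def text_features(document, vocabulary):
    """Represent a document as word counts over a fixed vocabulary."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]


def image_features(pixel_rows):
    """Represent a grayscale image as a flat list of pixel intensities."""
    return [value for row in pixel_rows for value in row]


# Hypothetical toy data, chosen only for illustration.
vocabulary = ["butterfly", "wing", "orange"]
text_vec = text_features("Orange butterfly with an orange wing", vocabulary)
image_vec = image_features([[0, 255], [128, 64]])

print(len(text_vec), len(image_vec))
```

The two vectors cannot be compared directly: one counts words, the other stores pixel intensities. That mismatch is the gap that heterogeneous transfer learning has to cross.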

A major limitation in heterogeneous transfer learning is the gap between the two feature spaces and the difference between the two data distributions. Data mining methods commonly assume that the data they learn from and the data they are applied to share the same feature space and the same distribution, and both assumptions are often broken in real-world practice. To overcome this limitation, we have developed a novel heterogeneous transfer learning method that builds a mapping between two related feature spaces, which enables the knowledge transfer. To do this, we search for a rich online resource such as Wikipedia or an image repository such as Flickr, and use the related images and tags to build a ‘bridge’ between the two feature spaces. To resolve the distribution differences, we add an information filter to ensure that the distance between the two learning domains is minimized.
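The ‘bridge’ idea can be illustrated with a deliberately simple sketch. This is not the method developed in the project, whose details are not given here. It swaps in plain co-occurrence averaging: each tag is mapped to the mean feature vector of the images that carry it, and a piece of text is then placed in the image feature space by averaging the vectors of its known words. The function names, the two-dimensional image features and the tagged examples are all hypothetical, and the information filter for distribution differences is not modeled.

```python
from collections import defaultdict


def build_bridge(tagged_images):
    """Map each tag to the mean feature vector of the images carrying it."""
    sums = {}
    counts = defaultdict(int)
    for tags, features in tagged_images:
        for tag in tags:
            if tag not in sums:
                sums[tag] = [0.0] * len(features)
            sums[tag] = [s + f for s, f in zip(sums[tag], features)]
            counts[tag] += 1
    return {tag: [s / counts[tag] for s in total] for tag, total in sums.items()}


def text_to_image_space(words, bridge):
    """Average the bridge vectors of the words that appear as tags."""
    known = [bridge[w] for w in words if w in bridge]
    if not known:
        return None  # no shared vocabulary, so nothing can be transferred
    return [sum(column) / len(known) for column in zip(*known)]


# Hypothetical tagged images: (tags, image feature vector).
tagged_images = [
    (["butterfly", "orange"], [0.9, 0.1]),
    (["butterfly", "blue"], [0.1, 0.9]),
    (["orange"], [0.7, 0.3]),
]
bridge = build_bridge(tagged_images)
projected = text_to_image_space(["orange", "butterfly", "garden"], bridge)
print(projected)
```

Words with no tagged images, such as "garden" in this toy data, are simply skipped, which hints at why a rich source of tagged images matters: the more tags the resource covers, the more text knowledge can cross the bridge.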