Research Fellow

Adelaide, Australia

Frederic Zhang | 张真

I'm currently a research fellow at the Centre for Augmented Reasoning, Australian Institute for Machine Learning (AIML), working with Dr. Ehsan Abbasnejad.

I did my PhD at the Australian National University, under the supervision of Prof. Stephen Gould and Dr. Dylan Campbell. My research focused on the visual and spatial understanding of human–object interactions, including their visual recognition and localisation.

Prior to my PhD, as part of an international partnership program, I received a Bachelor of Science in automation from the Beijing Institute of Technology and a Bachelor of Engineering in mechatronics (research and development) with first-class honours from the Australian National University, where I had the pleasure of working with Prof. Yuchao Dai and Prof. Richard Hartley.

I'm passionate about programming, so much so that I wrote a deep learning library called Pocket. It is a lightweight library built on top of PyTorch, featuring boilerplate learning engines and other utilities for visualisation and evaluation. I'm also a photographer; as an enthusiast of the great outdoors, I shoot mostly nature. Find out more in my gallery!
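To give a flavour of what a "boilerplate learning engine" means here, below is a minimal sketch written in plain PyTorch. It is not Pocket's actual API; the class and method names are illustrative assumptions, showing only the kind of train/eval loop such an engine wraps.

# Minimal sketch of a boilerplate learning engine (illustrative only,
# not Pocket's API): the experiment supplies a model, data and loss,
# and the engine owns the repetitive optimisation and evaluation loops.
import torch
from torch import nn
from torch.utils.data import DataLoader

class LearningEngine:
    def __init__(self, model: nn.Module, criterion, lr: float = 1e-3):
        self.model = model
        self.criterion = criterion
        self.optimizer = torch.optim.SGD(model.parameters(), lr=lr)

    def train(self, loader: DataLoader, num_epochs: int = 1):
        self.model.train()
        for _ in range(num_epochs):
            for inputs, targets in loader:
                self.optimizer.zero_grad()
                loss = self.criterion(self.model(inputs), targets)
                loss.backward()
                self.optimizer.step()

    @torch.no_grad()
    def evaluate(self, loader: DataLoader) -> float:
        # Simple top-1 accuracy for a classification model.
        self.model.eval()
        correct = total = 0
        for inputs, targets in loader:
            preds = self.model(inputs).argmax(dim=1)
            correct += (preds == targets).sum().item()
            total += targets.numel()
        return correct / max(total, 1)

The point of such a wrapper is that new experiments only swap out the model, dataset and loss, while logging, checkpointing and evaluation utilities stay shared.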



Research

Exploring Predicate Visual Context in Detecting Human–Object Interactions Frederic Z. Zhang, Yuhui Yuan, Dylan Campbell, Zhuoyao Zhong and Stephen Gould In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2023. [abstract] [paper] [preprint] [code] [video] [bibtex]
Recently, the DETR framework has emerged as the dominant approach for human–object interaction (HOI) research. In particular, two-stage transformer-based HOI detectors are amongst the most performant and training-efficient approaches. However, these often condition HOI classification on object features that lack fine-grained contextual information, eschewing pose and orientation information in favour of visual cues about object identity and box extremities. This naturally hinders the recognition of complex or ambiguous interactions. In this work, we study these issues through visualisations and carefully designed experiments. Accordingly, we investigate how best to re-introduce image features via cross-attention. With an improved query design, extensive exploration of keys and values, and box pair positional embeddings as spatial guidance, our model with enhanced predicate visual context (PViC) outperforms state-of-the-art methods on the HICO-DET and V-COCO benchmarks, while maintaining low training cost.
@inproceedings{zhang2023pvic,
  author = {Zhang, Frederic Z. and Yuan, Yuhui and Campbell, Dylan and Zhong, Zhuoyao and Gould, Stephen},
  title = {Exploring Predicate Visual Context in Detecting Human–Object Interactions},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month = {October},
  year = {2023},
  pages = {10411-10421},
}
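The core idea of this paper can be summarised in a few lines of PyTorch. The sketch below is not the released PViC implementation; all module names, the 8-dimensional box-pair encoding and the 117 verb classes (HICO-DET) are simplifying assumptions. It only illustrates pair queries re-attending to image features via cross-attention, with a box-pair embedding as spatial guidance.

# Illustrative sketch of cross-attention with box-pair spatial guidance
# (not the paper's code).
import torch
from torch import nn

class PairCrossAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8, num_verbs: int = 117):
        super().__init__()
        self.box_pair_embed = nn.Linear(8, dim)   # (x1, y1, x2, y2) for human and object boxes
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_verbs)

    def forward(self, pair_queries, box_pairs, image_features):
        # pair_queries:   (B, P, dim)  features of candidate human-object pairs
        # box_pairs:      (B, P, 8)    normalised box coordinates of each pair
        # image_features: (B, HW, dim) flattened backbone/encoder feature map
        spatial = self.box_pair_embed(box_pairs)            # spatial guidance
        q = pair_queries + spatial                          # condition queries on pair geometry
        context, _ = self.cross_attn(q, image_features, image_features)
        return self.classifier(pair_queries + context)      # verb logits per pair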
Efficient Two-Stage Detection of Human–Object Interactions with a Novel Unary–Pairwise Transformer Frederic Z. Zhang, Dylan Campbell and Stephen Gould In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022. [abstract] [paper] [preprint] [code] [video] [bibtex]
Recent developments in transformer models for visual data have led to significant improvements in recognition and detection tasks. In particular, using learnable queries in place of region proposals has given rise to a new class of one-stage detection models, spearheaded by the Detection Transformer (DETR). Variations on this one-stage approach have since dominated human–object interaction (HOI) detection. However, the success of such one-stage HOI detectors can largely be attributed to the representation power of transformers. We discovered that when equipped with the same transformer, their two-stage counterparts can be more performant and memory-efficient, while taking a fraction of the time to train. In this work, we propose the Unary–Pairwise Transformer, a two-stage detector that exploits unary and pairwise representations for HOIs. We observe that the unary and pairwise parts of our transformer network specialize, with the former preferentially increasing the scores of positive examples and the latter decreasing the scores of negative examples. We evaluate our method on the HICO-DET and V-COCO datasets, and significantly outperform state-of-the-art approaches. At inference time, our model with ResNet50 approaches real-time performance on a single GPU.
@inproceedings{zhang2022upt,
  author = {Zhang, Frederic Z. and Campbell, Dylan and Gould, Stephen},
  title = {Efficient Two-Stage Detection of Human–Object Interactions with a Novel Unary–Pairwise Transformer},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2022},
  pages = {20104-20112}
}
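As a rough illustration of the two-stage unary-pairwise design, consider the sketch below. It is not the UPT codebase; the layer sizes, indices and names are assumptions. It only shows the structure: detections from a frozen detector become unary tokens, candidate human-object pairs become pairwise tokens, and each stream is refined by its own transformer layer before interaction scoring.

# Illustrative unary-pairwise head over pre-computed detections
# (not the released UPT implementation).
import torch
from torch import nn

class UnaryPairwiseHead(nn.Module):
    def __init__(self, dim: int = 256, num_verbs: int = 117):
        super().__init__()
        self.unary_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.pair_proj = nn.Linear(2 * dim, dim)
        self.pairwise_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.score = nn.Linear(dim, num_verbs)

    def forward(self, det_feats, human_idx, object_idx):
        # det_feats:  (B, N, dim)  features of detections from a frozen detector
        # human_idx:  (P,) long tensor, index of the human box in each candidate pair
        # object_idx: (P,) long tensor, index of the object box in each candidate pair
        unary = self.unary_layer(det_feats)                            # refine detections jointly
        pairs = torch.cat([unary[:, human_idx], unary[:, object_idx]], dim=-1)
        pairs = self.pairwise_layer(self.pair_proj(pairs))             # refine pair representations
        return self.score(pairs)                                       # interaction logits per pair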
Spatially Conditioned Graphs for Detecting Human–Object Interactions Frederic Z. Zhang, Dylan Campbell and Stephen Gould In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2021. [abstract] [paper] [preprint] [code] [video] [bibtex]
We address the problem of detecting human–object interactions in images using graph neural networks. Unlike conventional methods, where nodes send scaled but otherwise identical messages to each of their neighbours, we propose to condition messages between pairs of nodes on their spatial relationships, resulting in different messages going to neighbours of the same node. To this end, we explore various ways of applying spatial conditioning under a multi-branch structure. Through extensive experimentation we demonstrate the advantages of spatial conditioning for the computation of the adjacency structure, messages and the refined graph features. In particular, we empirically show that as the quality of the bounding boxes increases, their coarse appearance features contribute relatively less to the disambiguation of interactions compared to the spatial information. Our method achieves an mAP of 31.33% on HICO-DET and 54.2% on V-COCO, significantly outperforming the state of the art on fine-tuned detections.
@inproceedings{zhang2021scg,
  author = {Zhang, Frederic Z. and Campbell, Dylan and Gould, Stephen},
  title = {Spatially Conditioned Graphs for Detecting Human–Object Interactions},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month = {October},
  year = {2021},
  pages = {13319-13327}
}
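The sketch below illustrates the spatially conditioned message-passing idea in a simplified form. It is not the released SCG code; the feature dimensions, the gating mechanism and all names are assumptions. It only shows how an encoding of the pairwise spatial relationship can modulate the message a node sends, so that different neighbours of the same node receive different messages.

# Illustrative spatially conditioned message passing over detected boxes
# (not the SCG implementation).
import torch
from torch import nn

class SpatiallyConditionedMessage(nn.Module):
    def __init__(self, node_dim: int = 256, spatial_dim: int = 36):
        super().__init__()
        self.spatial_encoder = nn.Sequential(
            nn.Linear(spatial_dim, node_dim), nn.ReLU(),
            nn.Linear(node_dim, node_dim),
        )
        self.message_fn = nn.Linear(node_dim, node_dim)
        self.update_fn = nn.Linear(2 * node_dim, node_dim)

    def forward(self, node_feats, spatial_feats, adjacency):
        # node_feats:    (N, node_dim)        appearance features per box
        # spatial_feats: (N, N, spatial_dim)  pairwise spatial encodings
        # adjacency:     (N, N)               soft adjacency weights in [0, 1]
        gate = torch.sigmoid(self.spatial_encoder(spatial_feats))      # per-edge modulation
        messages = gate * self.message_fn(node_feats).unsqueeze(0)     # sender features, gated per receiver
        aggregated = (adjacency.unsqueeze(-1) * messages).sum(dim=1)   # weighted sum over neighbours
        return self.update_fn(torch.cat([node_feats, aggregated], dim=-1))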