Phuc Nguyen Duc Anh

Ph.D. Student @ University of Maryland, College Park

I’m a PhD student at the University of Maryland, College Park, advised by Prof. Ming C. Lin!

I work at StackAV (previously ArgoAI), in close collaboration with Dr. Sudipta N. Sinha. Previously, I worked as a Research Engineer at SpreeAI, where my work focused on multimodal representations in collaboration with Dr. Aayush Bansal and Dr. Minh Vo. Before that, I was an AI Research Resident at VinAI Research (acquired by Qualcomm), where I worked on 3D scene understanding under the supervision of Dr. Anh Tran and Prof. Cuong Pham.

My research focuses on 3D/4D reconstruction and understanding, particularly on developing scalable methods that jointly recover geometry, motion, and semantics from multi-view images and videos.

News

Jun 2026 🏆 OpenVO received the Compute Champion Award at CVPR 2026!
May 2026 🏠 We release Scale3D, a scalable approach to 3D reconstruction and scene understanding.
Feb 2026 🚗 OpenVO has been accepted to CVPR 2026. We have also released the code.
Nov 2025 🚀 We release OpenVO, an open-world visual odometry method that works on any video captured by any camera!
Aug 2025 🎓 I started my PhD in Computer Science at the University of Maryland (UMD)!
Jul 2025 📢 HA-RDet and OE-3DIS have been accepted to ICCV 2025.
Feb 2025 📌 Any3DIS has been accepted to CVPR 2025. Cheerz!
Nov 2024 💼 I joined SpreeAI as a Research Engineer.
Jul 2024 🚀 VinMap has been accepted to MAPR 2024. We have also released the dataset.
Jun 2024 🏆 The VinAI-3DIS team ranked first in the OpenSUN3D challenge at CVPR 2024.
Feb 2024 💡 Open3DIS has been accepted to CVPR 2024. We have also released the code.
Jun 2024 🏆 The VinAI-3DIS team ranked top 2 in the OpenSUN3D challenge at ICCV 2023. Technical report.
Nov 2023 🎓 I earned my B.S. in Computer Science at VNUHCM-UIT.
Feb 2023 🔬 I joined VinAI Research as a Research Resident.

Experience

Research Intern  •  Stack AV
Working on 4D Reconstruction and Motion-controllable 4D novel view synthesis for autonomous driving.
May 2026 – Present
Research Engineer  •  SpreeAI
Worked on learning representations for physically grounded multimodal LLMs.
Nov 2024 – Sep 2025
AI Research Resident  •  VinAI Research (Acquired by Qualcomm)
Focused on 3D scene reconstruction and understanding, and on vision-language models.
Feb 2023 – Nov 2024

Highlighted Research · See Full List At Scholar

Phuc Nguyen, Xiyi Chen, Dongki Jung, Anshul Rai, Guan-Ming Su, Dinesh Manocha, Ming C. Lin
Under Review

We introduce Scale3D, a novel framework for Scalable 3D reconstruction and understanding.

We present Scale3D, a unified framework for scalable 3D reconstruction and scene understanding from long, complex image sequences. Existing methods typically emphasize either geometric reconstruction or object-level understanding, but struggle to maintain both global geometric consistency and coherent instance identities over hundreds to thousands of views. Our key insight is to exploit their mutual synergy: geometry provides a robust basis for cross-view object association, while perception regularizes and refines geometry. Scale3D decomposes a long video into overlapping clusters, reconstructs cluster-wise geometry and 2D segmentation masks, and introduces a 3D-Aware Alignment module to align local predictions into a global proxy geometry while recovering temporally coherent, globally ID-consistent video object segmentation. We further propose Instance-Aware Bundle Adjustment, which leverages dense instance-consistent correspondences to refine the camera poses and geometry. We evaluate Scale3D on ScanNet200 and ScanNet++v2 across three benchmarking tasks: 3D reconstruction, class-agnostic 3D instance segmentation, and panoptic lifting for novel-view rendering. It achieves state-of-the-art results, with improvements of 5% on AUC@30, 11% on AP, and 10% on Panoptic Quality. Overall, our results highlight the importance of jointly modeling geometry and perception for scalable scene reconstruction and understanding over long image sequences with hundreds to thousands of views.
OpenVO teaser
Phuc Nguyen*, Anh N. Nhu*, Ming C. Lin
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2026) Compute Champion Award

We introduce OpenVO, a novel framework for Open-world Visual Odometry (VO) with temporal awareness under limited input conditions.

OpenVO effectively estimates real-world–scale ego-motion from monocular dashcam footage with varying observation rates and uncalibrated cameras, enabling robust trajectory dataset construction from rare driving events recorded by dashcams. Existing VO methods are trained on a fixed observation frequency (e.g., 10Hz or 12Hz), completely overlooking temporal dynamics information. Many prior methods also require calibrated cameras with known intrinsic parameters. Consequently, their performance degrades when (1) deployed under unseen observation frequencies or (2) applied to uncalibrated cameras. These limitations significantly restrict their generalizability to many downstream tasks, such as extracting trajectories from dashcam footage. To address these challenges, OpenVO (1) explicitly encodes temporal dynamics information within a two-frame pose regression framework and (2) leverages 3D geometric priors derived from foundation models. We validate our method on three major autonomous-driving benchmarks (KITTI, nuScenes, and Argoverse 2), achieving more than 20% performance improvement over state-of-the-art approaches. Under varying observation rate settings, our method is significantly more robust, achieving 46%–92% lower errors across all metrics. These results demonstrate the versatility of OpenVO for real-world 3D reconstruction and diverse downstream applications.
Any3DIS teaser
Phuc Nguyen, Minh Luu, Anh Tran, Cuong Pham, Khoi Nguyen
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025)

A novel class-agnostic approach for 3D instance segmentation that leverages 2D mask tracking to segment 3D objects in point cloud scenes.

Existing 3D instance segmentation methods frequently encounter issues with over-segmentation, leading to redundant and inaccurate 3D proposals that complicate downstream tasks. This challenge arises from their unsupervised merging approach, where dense 2D instance masks are lifted across frames into point clouds to form 3D candidate proposals without direct supervision. These candidates are then hierarchically merged based on heuristic criteria, often resulting in numerous redundant segments that fail to combine into precise 3D proposals. To overcome these limitations, we propose a 3D-Aware 2D Mask Tracking module that uses robust 3D priors from a 2D mask segmentation and tracking foundation model (SAM-2) to ensure consistent object masks across video frames. Rather than merging all visible superpoints across views to create a 3D mask, our 3D Mask Optimization module leverages a dynamic programming algorithm to select an optimal set of views, refining the superpoints to produce a final 3D proposal for each object. Our approach achieves comprehensive object coverage within the scene while reducing unnecessary proposals, which could otherwise impair downstream applications. Evaluations on ScanNet200 and ScanNet++ confirm the effectiveness of our method, with improvements across Class-Agnostic, Open-Vocabulary, and Open-Ended 3D Instance Segmentation tasks.
Open-Ended teaser
Phuc Nguyen*, Minh Luu*, Anh Tran, Cuong Pham, Khoi Nguyen
IEEE/CVF International Conference on Computer Vision (ICCV 2025)

Introducing Vocabulary-Free 3D point cloud instance segmentation, with several solid baselines and a novel pointwise method using a multimodal LLM.

Open-vocabulary 3D Instance Segmentation methods (OV-3DIS) have recently demonstrated their generalization ability to unseen objects. However, these methods still depend on predefined class names during inference, restricting agents' autonomy. To mitigate this constraint, we propose a novel problem termed Open-Ended 3D Instance Segmentation (OE-3DIS), which eliminates the necessity for predefined class names during testing. We present a comprehensive set of strong baselines inspired by OV-3DIS methodologies, utilizing 2D Multimodal Large Language Models. In addition, we introduce a novel token aggregation strategy that effectively fuses information from multiview images. To evaluate the performance of our OE-3DIS system, we benchmark both the proposed baselines and our method on two widely used indoor datasets: ScanNet200 and ScanNet++. Our approach achieves substantial performance gains over the baselines on both datasets. Notably, even without access to ground-truth object class names during inference, our method outperforms Open3DIS, the current state-of-the-art in OV-3DIS.
HA-RDet teaser
Phuc Nguyen
IEEE/CVF International Conference on Computer Vision (ICCV 2025)
Bachelor's Thesis

Hybrid-Anchor Rotation Detector (HA-RDet), which combines the advantages of both anchor-based and anchor-free schemes for oriented object detection.

Oriented object detection in aerial images poses a significant challenge because objects vary widely in size and orientation. Current state-of-the-art detectors typically rely on either two-stage or one-stage approaches, often employing anchor-based strategies, which can be computationally expensive because of the large number of redundant anchors generated during training. In contrast, anchor-free mechanisms offer faster processing but suffer from a reduction in the number of training samples, potentially impacting detection accuracy. To address these limitations, we propose the Hybrid-Anchor Rotation Detector (HA-RDet), which combines the advantages of both anchor-based and anchor-free schemes for oriented object detection. By utilizing only one preset anchor for each location on the feature maps and refining these anchors with our Orientation-Aware Convolution technique, HA-RDet achieves competitive accuracies, including 75.41 mAP on DOTA-v1, 65.3 mAP on DIOR-R, and 90.2 mAP on HRSC2016, against current anchor-based state-of-the-art methods, while significantly reducing computational resources.
Open3DIS teaser
Phuc Nguyen*, Tuan Ngo*, Chuang Gan, Evangelos Kalogeraki, Anh Tran, Cuong Pham, Khoi Nguyen
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024)

Tackling open-vocabulary 3D point cloud instance segmentation using 2D priors.

We introduce Open3DIS, a novel solution designed to tackle the problem of Open-Vocabulary Instance Segmentation within 3D scenes. Objects within 3D environments exhibit diverse shapes, scales, and colors, making precise instance-level identification a challenging task. Recent advancements in Open-Vocabulary scene understanding have made significant strides in this area by employing class-agnostic 3D instance proposal networks for object localization and learning queryable features for each 3D mask. While these methods produce high-quality instance proposals, they struggle with identifying small-scale and geometrically ambiguous objects. The key idea of our method is a new module that aggregates 2D instance masks across frames and maps them to geometrically coherent point cloud regions as high-quality object proposals, addressing the above limitations. These are then combined with 3D class-agnostic instance proposals to include a wide range of objects in the real world. To validate our approach, we conducted experiments on three prominent datasets, including ScanNet200, S3DIS, and Replica, demonstrating significant performance gains in segmenting objects with diverse categories over the state-of-the-art approaches.

Awards and Achievements

Service

Reviewer:
  • IEEE/CVF Computer Vision and Pattern Recognition (CVPR’24,26)
  • IEEE/CVF International Conference on Computer Vision (ICCV’25)
  • European Conference on Computer Vision (ECCV’24)
  • International Conference on Learning Representations (ICLR’25)
  • Neural Information Processing Systems (NeurIPS’24,26)
  • British Machine Vision Conference (BMVC’25)
  • Transactions on Machine Learning Research (TMLR’25,26)
Teaching Assistant UMD:
  • CMSC420 (Fall’25): Advanced Data Structures
  • CMSC421 (Spring’26): Introduction to AI