TY - GEN
T1 - Pix4Point: Image Pretrained Standard Transformers for 3D Point Cloud Understanding
T2 - 11th International Conference on 3D Vision, 3DV 2024
AU - Qian, Guocheng
AU - Hamdi, Abdullah
AU - Zhang, Xingdi
AU - Ghanem, Bernard
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
AB - While Transformers have achieved impressive success in natural language processing and computer vision, their performance on 3D point clouds is relatively poor. This is mainly due to a key limitation of Transformers: their demanding need for extensive training data. Unfortunately, in the realm of 3D point clouds, large datasets are scarce, exacerbating the difficulty of training Transformers for 3D tasks. In this work, we address the data issue of point cloud Transformers from two perspectives: (i) introducing more inductive bias to reduce Transformers' dependency on data, and (ii) relying on cross-modality pretraining. More specifically, we first introduce Progressive Point Patch Embedding and present a new point cloud Transformer model, PViT. PViT shares the same backbone as the standard Transformer but is shown to be less hungry for data, enabling the Transformer to achieve performance comparable to the state of the art. Second, we formulate a simple yet effective pipeline dubbed Pix4Point that allows harnessing Transformers pretrained in the image domain to enhance downstream point cloud understanding. This is achieved through a modality-agnostic Transformer backbone with the help of a tokenizer and decoder specialized for the different domains. Pretrained on a large number of widely available images, PViT achieves significant gains in 3D point cloud classification, part segmentation, and semantic segmentation on ScanObjectNN, ShapeNetPart, and S3DIS, respectively. Our code and models are available at https://github.com/guochengqian/Pix4Point.
KW - 3D Point Cloud Understanding
KW - Point Cloud Classification
KW - Point Cloud Segmentation
KW - Transfer Learning
KW - Transformers
UR - http://www.scopus.com/inward/record.url?scp=85194009796&partnerID=8YFLogxK
U2 - 10.1109/3DV62453.2024.00113
DO - 10.1109/3DV62453.2024.00113
M3 - Conference contribution
AN - SCOPUS:85194009796
T3 - Proceedings - 2024 International Conference on 3D Vision, 3DV 2024
SP - 1280
EP - 1290
BT - Proceedings - 2024 International Conference on 3D Vision, 3DV 2024
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 18 March 2024 through 21 March 2024
ER -