Full Paper

CLIP feature-based randomized control using images and text for multiple tasks and robots

Received 19 Jan 2024, Accepted 14 Jun 2024, Published online: 01 Aug 2024