Full Paper

CLIP feature-based randomized control using images and text for multiple tasks and robots

Received 19 Jan 2024, Accepted 14 Jun 2024, Published online: 01 Aug 2024

Figures & data

Figure 1. Chair rearrangement task using CLIP feature-based randomized control. The text instruction is ‘place a green chair under the table.’

Figure 2. Overview of our control framework.

Figure 3. Simulation environment (the green dot indicates the target position of the handle). (a) drawer-close. (b) drawer-open. (c) door-close. (d) door-open. (e) window-close. (f) window-open.

Table 1. Hyperparameters used in the PPO algorithm.

Table 2. Texts used in a multitask simulation.

Table 3. Comparison of success rates for the multitask simulation (the target position of the handle is unknown except for †).

Table 4. Comparison of text accuracy rates for the simulator images.

Table 5. Texts used in more complex simulation tasks.

Figure 4. Control results for more complex tasks when applying our method. (a) button-press-wall. (b) assembly. (c) box-close. (d) shelf-place.

Table 6. Success rates for more complex tasks when applying our method with ViT-L/14.

Figure 5. Experimental configuration.

Table 7. Texts used in the experiment.

Table 8. Comparison of success rates for the real robot experiment (the target position of the object is unknown except for †).

Table 9. Comparison of text accuracy rates for the real images.

Figure 6. An example of control results when applying each method. (a) ViT-B/32. (b) ViT-L/14. (c) ViT-B/32 (finetune).

Supplemental material

MP4 video (9 MB)