GazeSwitch: Automatic Eye-Head Mode Switching for Optimised Hands-Free Pointing

This paper contributes GazeSwitch, an ML-based technique that optimises the real-time switching between eye and head modes for fast and precise hands-free pointing. GazeSwitch reduces false positives from natural head movements and efficiently detects head gestures for input, resulting in an effective hands-free and adaptive technique for interaction. We conducted two user studies to evaluate its performance and user experience. Comparative analyses with two baseline switching techniques, Eye+Head Pinpointing (manual) and BimodalGaze (threshold-based), revealed several trade-offs. We found that GazeSwitch provides a natural and effortless experience but trades off control and stability compared to manual mode switching, and requires less head movement than BimodalGaze. This work demonstrates the effectiveness of a machine learning approach that learns and adapts to patterns in head movement, allowing us to better leverage the synergistic relation between eye and head input modalities for interaction in mixed and extended reality.


INTRODUCTION
The synergistic relationship between eye and head input modalities offers a promising approach for achieving hands-free pointing [22,35,37,48]. The proposed BimodalGaze technique [37], for instance, allows for greater pointer control by automatically switching between 'Gaze Mode' for coarse positioning and 'Head Mode' for refinement. This seamless switch leverages eye-head coordination insights that allow the separation of natural from gestural head movement [34]. Natural head movement occurs when the head moves to support our visual system during a gaze shift so that we can see objects that are not right in front of us while keeping the eyes within a

RELATED WORK
Gaze has been widely explored as a hands-free alternative to manual input, as it functions as a fast and natural pointer for selection: people naturally look at objects before selecting them. However, using the eyes for input has limitations [28]. First, even during fixation, the eye is never completely still, which makes precise eye-based pointing challenging, especially for selecting small targets [52]. Second, although eye tracking has come a long way, its accuracy and precision are influenced by various factors, including calibration, lighting conditions, and the potential for drift over time.
To address these inherent limitations, researchers have proposed a multitude of techniques, such as algorithms to smooth eye tracking data [e.g., 12,47], zooming techniques for accurate target selection [e.g., 1,13,42], and selection and disambiguation techniques that do not rely on calibration [e.g., 26,32,33,45].
A promising approach involves harnessing the rapid pointing and hands-free capabilities of gaze for initial coarse positioning, and employing a complementary modality that affords more precise control for further positioning. A fundamental work that demonstrates this combination is MAGIC pointing [11,51], where the cursor is "warped" to the gaze location and adjusted with manual mouse input, resulting in a substantial enhancement in pointing speed. In AR and VR, this principle has also been applied to controller movements [19]. Gaze-Shifting by Pfeuffer et al. [30] demonstrates the same principle with direct touch and pen input, where either input can be directly or indirectly mapped to the gaze area. The integration of gaze input with other modalities not only reduces physical movement and user fatigue but also enhances efficiency, fine control, and precision while capitalising on the natural speed and convenience of gaze pointing [2].
Besides hand-based input, head input has been shown to be a promising modality for disambiguation and refinement in target selection, as the head affords hands-free fine control. In our previous work, we found that users have fine-grained control over their head movement (∼0.3 degrees) [17]. The eye-head combination capitalises on the strengths of both modalities, with the eyes providing fast and precise input while head movements enable finer adjustments. Moreover, eye-head techniques for pointing have been found to achieve faster speeds than head-only techniques [19,21,40].
Early works on desktop-based interaction combined head movement with gaze to refine gaze movements with leaning [48] or rotating head movements [27]. However, a key assumption for these works is that head movement is only used for interaction, not for controlling the viewport, as in VR. As the head position can easily be tracked in 3D interfaces, several techniques have been proposed that leverage head input, with many leveraging eye-head coordination insights for selection and manipulation. Examples include using the head with estimation of gaze depth for target disambiguation [24], or for menu control [39].
In a study that compared variations of eyes for selection and other inputs for refinement, head correction of gaze was found to be preferable even if manual input is available, as it requires less physical effort [22]. In this eye+head variation, 'Eye+Head Pinpointing', the cursor is initially controlled with gaze and switches over to refinement mode when the user holds down a controller button to invoke head input. Head movements are then used to make precise adjustments to the cursor position, effectively "pinpointing" the target. When the user releases the button, the cursor returns to gaze pointing mode. In head-refinement mode, the CD-gain is adjusted to 0.5, allowing the technique to select targets as small as 0.5 degrees. While a manual switching technique affords users control over when to enter refinement mode, this switching process can be seamless, as shown with BimodalGaze [37].
The BimodalGaze technique seamlessly integrates eye and head movements, enabling automatic mode switching based on a threshold-based algorithm. The algorithm classifies and seamlessly transitions between gaze mode (gaze-driven head movements) and head mode (gestural head movements) using a set of thresholds. BimodalGaze enters 'head mode' when a head movement is detected (head velocity >15°/s) that started at least 150 ms after the previous gaze shift, and the angular difference between the trajectories of the eyes and head is at least 20 degrees. It enters 'gaze mode' when either a gaze shift is detected (gaze velocity >160°/s) or the distance between gaze and cursor is more than 10 degrees. Hence, by classifying when the head supports gaze (natural) and when the head is used for interaction (gestural), the technique allows a seamless transition in which the eyes are used for fast coarse pointing and head movements for refinement.
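The threshold rules quoted above can be sketched as a small state update. This is a minimal illustration built only from the thresholds stated in this section, not the BimodalGaze implementation: the `Sample` fields, the function name `next_mode`, and the decision to test the gaze-mode conditions first are all assumptions made for the example.

```python
from dataclasses import dataclass

# Thresholds as quoted in the text above.
HEAD_VELOCITY_DEG_S = 15.0
MIN_DELAY_AFTER_GAZE_SHIFT_S = 0.150
MIN_EYE_HEAD_ANGLE_DIFF_DEG = 20.0
GAZE_SHIFT_VELOCITY_DEG_S = 160.0
MAX_GAZE_CURSOR_DIST_DEG = 10.0


@dataclass
class Sample:
    """Hypothetical per-frame measurements (names are illustrative)."""
    head_velocity: float          # deg/s
    gaze_velocity: float          # deg/s
    time_since_gaze_shift: float  # s, at the onset of the head movement
    eye_head_angle_diff: float    # deg, between eye and head trajectories
    gaze_cursor_distance: float   # deg


def next_mode(mode: str, s: Sample) -> str:
    """Return the pointing mode ("gaze" or "head") after seeing sample s."""
    # Gaze mode: a gaze shift, or gaze has moved far away from the cursor.
    # Checking this first is an assumption of this sketch.
    if (s.gaze_velocity > GAZE_SHIFT_VELOCITY_DEG_S
            or s.gaze_cursor_distance > MAX_GAZE_CURSOR_DIST_DEG):
        return "gaze"
    # Head mode: a head movement that starts late enough after the last
    # gaze shift and diverges enough from the eye trajectory.
    if (s.head_velocity > HEAD_VELOCITY_DEG_S
            and s.time_since_gaze_shift >= MIN_DELAY_AFTER_GAZE_SHIFT_S
            and s.eye_head_angle_diff >= MIN_EYE_HEAD_ANGLE_DIFF_DEG):
        return "head"
    # Otherwise keep the current mode.
    return mode
```

The sketch makes the limitation discussed next concrete: a slow, small head gesture never exceeds the fixed 15°/s velocity threshold, so the mode does not change.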
However, despite participants in their user study describing BimodalGaze's automatic mode switching as smooth and effortless compared to Eye+Head Pinpointing, it displayed a greater frequency of initial selection errors. This impacted both the total selection time and the overall performance despite its shorter refinement time. BimodalGaze employed a high head velocity threshold as a heuristic to minimise consistent mode switching. This threshold, however, introduced challenges when only small movements were required, often leading to overshooting as users resorted to exaggerated head motions to trigger the algorithm to enter head mode. These issues primarily stemmed from the limitations inherent in a threshold-based approach, impacting mode switching performance.
In our work, we build on the insights from BimodalGaze for automatic eye-head mode switching, and the potential of head movement classification from our previous work, HeadBoost [17]. The HeadBoost classifier addresses the challenge of correctly classifying between two fundamental types of head movements: gaze-driven head movement (Head-Gaze) and gestural head movement (Head Gesture). The classifier, built using XGBoost [3], takes as input position and direction 3D vectors of both eye and head movements. It incorporates a comprehensive set of over 600 eye and head-related features sourced from eye and head movement classification literature, along with feature vectors from prior timestamps to capture and analyse user behaviour. These features encompass shape, noise, spectral, temporal, and correlation characteristics of the eye and head vectors, and in combination, facilitate the classification of head movements. This approach achieved an offline F1-score of 0.89 for discriminating between the two types of head movement.
In comparison with BimodalGaze, the HeadBoost classifier demonstrated better classification performance (F1-score: 0.89 vs 0.62), indicating a substantial improvement in overcoming the limitations of a threshold-based approach. Moreover, HeadBoost results showed that it predicted the onset of Head Gesture much earlier than BimodalGaze (119 ms earlier on average for all trials), an area that required improvement. In further analysis, the HeadBoost classifier accurately classified small head movements (<15°/s), compared to BimodalGaze. This performance enhancement can be attributed to the capacity of a machine learning approach to learn and adapt to patterns in head movement, effectively overcoming classification challenges, which makes it a viable approach in light of the natural eye-head coordination behavioural complexities discussed in the Introduction.

GAZESWITCH
To develop GazeSwitch, we first obtained a labelled dataset of eye and head movement data of participants as they performed cursor refinement tasks in a controlled study (detailed in Section 3.1). We closely followed the pipeline steps used to develop HeadBoost [17], including preprocessing and the initial steps for feature engineering (Section 3.2). With recursive feature selection, we obtained a classification rate above 120 Hz with a high classification performance of 0.91 F1-score (Section 3.3). We then applied the ability to classify head movement types with a simple logic to robustly define the mode switch between gaze pointing and head refinement (Section 3.4).
[Figure 1 caption, partially recovered: (c-d) 50% of the time, the target centre turns red, and a black dot appears in the centre to prompt a refinement to place the cursor as close to the target as possible using head mode (thumbpad press) before selecting the target (thumbpad release). A new target appears, and the sequence repeats. Right: Data collection setup.]

Data Collection
We designed a target acquisition task and corresponding study procedure to collect eye and head movement data typical of gaze pointing and head refinement. We developed the apparatus using Unity 2020.3.32f1. Figure 1 illustrates the task sequence, designed to collect a wide variety of labelled eye and head movements for training. To collect gaze shifts of various directions and amplitudes, targets appeared at randomised positions of diverse patterns, some requiring only a gaze shift towards them and others demanding cursor refinement. Trials involving head mode selection occurred with a 50% probability, and the sequence of pointing modes was randomised.
In cases requiring refinement, participants employed a technique akin to Eye+Head Pinpointing [22], toggling mode switching by pressing and releasing the thumbpad of a controller to place the cursor as close as possible to the target centre. The period with the controller button held down is labelled as 'Head Gesture', while the remaining samples are labelled as 'Head-Gaze'.
We followed the target design used in HeadBoost [17], featuring a diameter of 5.72 degrees, a transparent centre of 2.56 degrees (Figure 1a), and a black dot of 0.8 degrees in diameter (Figure 1c). The small size of the black dot was chosen to challenge gaze pointing, thereby encouraging participants to invoke head mode for refinement. The transparent centre provided feedback, transitioning from green when the user fixated on the target to red to indicate the need for closer placement to the centre. The target size was chosen to ensure that the target was visible in the VR scene, helping participants to easily locate the subsequent target. For all trials, we collected the eye-in-world directional 3D vector, eye-in-head directional 3D vector, head position 3D vector, and head directional 3D vector.
We recruited 5 participants from our local university, aged 22-30 (M=26.8, SD=3.54, 1 female, 4 male). No prior VR or eye tracking experience was required, but participants needed to have normal or corrected-to-normal vision. Upon arrival, participants were comfortably seated, briefed on the study procedure, and asked to sign a consent form before completing a demographic survey. They were then instructed to wear the HTC Vive Pro Eye VR HMD with integrated 120 Hz Tobii eye tracker, with assistance provided if needed, and underwent a five-point eye-tracking calibration. Following this, participants were asked to complete one sequence (30 trials) to familiarise themselves with the task before the data collection phase. Each participant completed 300 trials (10 sequences × 30 trials). Breaks were permitted between the sequences, and participants recalibrated each time they removed the HMD. Each session took approximately 40 minutes. The study procedure was approved by Lancaster University's research ethics committee.

Dataset Preprocessing and Feature Engineering
The data collection session resulted in 246,898 timestamps from 1500 trials (300 trials per participant), with 65.3% of samples labelled as Head-Gaze, and 34.7% as Head Gesture. We preprocessed the raw data following best practices [4,6,8], filtering out samples with an inter-sample velocity exceeding 800°/s. Following this, we applied cubic spline interpolation to standardise the sampling rate to 120 Hz (the sampling frequency of the eye tracker). Lastly, we converted the 3D directional gaze and head vectors into 2D Fick angles using the Fick-gimbal method [15], mirroring the approach in our HeadBoost paper for consistency in feature generation. Furthermore, we adopted the hyperparameter choices used in HeadBoost, for both feature calculation and classifier training, determined through cross-validation.
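Two of the preprocessing steps above, velocity-based sample rejection and conversion to Fick angles, can be sketched in a few lines. This is an illustration under stated assumptions rather than the pipeline used in the paper: it assumes a Unity-style axis convention (x right, y up, z forward), measures velocity against the last retained sample, and omits the cubic spline resampling step. All function names are hypothetical.

```python
import math

MAX_VELOCITY_DEG_S = 800.0  # rejection threshold quoted in the text


def fick_angles(v):
    """Convert a 3D direction (x right, y up, z forward; assumed convention)
    to Fick angles: horizontal rotation first, then vertical rotation."""
    x, y, z = v
    yaw = math.degrees(math.atan2(x, z))
    pitch = math.degrees(math.atan2(y, math.hypot(x, z)))
    return yaw, pitch


def angle_between(a, b):
    """Angle in degrees between two 3D direction vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cosine))


def drop_fast_samples(times, directions, max_velocity=MAX_VELOCITY_DEG_S):
    """Discard samples whose angular velocity relative to the last kept
    sample exceeds max_velocity (deg/s). Times are in seconds."""
    kept = [(times[0], directions[0])]
    for t, d in zip(times[1:], directions[1:]):
        prev_t, prev_d = kept[-1]
        velocity = angle_between(prev_d, d) / (t - prev_t)
        if velocity <= max_velocity:
            kept.append((t, d))
    return kept
```

A 10° jump between two consecutive 120 Hz samples corresponds to 1200°/s and would be dropped, while the following sample is kept if it returns to a plausible direction.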
We extracted shape-, noise-, spectral-, correlation-, and timing-based features (see Appendix B), computed over a window length of 512 ms. To address issues related to multicollinearity during classification, we refined the features using correlation distance and hierarchical clustering [25], resulting in a streamlined set of 80 representative features. We then included the features from the last 1024 ms for each labelled timestamp, sampled at 6.25 Hz, to capture the temporal context of users' behaviour. This resulted in a set of 600 features for each labelled timestamp. To reduce computational costs and the risk of overfitting [9], we applied Recursive Feature Addition (RFA). RFA involved incrementally adding features and assessing model performance on testing folds, retaining only features that improved performance. This process yielded a final set of 28 features for each labelled timestamp for classification. Fifteen of the final features are based on eye movement, and 13 are based on head movement (Appendix B.2).
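The feature addition loop described above can be sketched as a greedy procedure. This is a minimal illustration, not the selection code used for GazeSwitch: the candidate ordering, the `score` callback (which in practice would train a model and return its performance on held-out folds), and the `min_gain` parameter are assumptions introduced for the example.

```python
from typing import Callable, List, Sequence


def recursive_feature_addition(
    ranked_features: Sequence[str],
    score: Callable[[List[str]], float],
    min_gain: float = 0.0,
) -> List[str]:
    """Greedily add features in the given order, keeping a feature only
    if it improves the score by more than min_gain."""
    selected: List[str] = []
    best = float("-inf")
    for feature in ranked_features:
        candidate = selected + [feature]
        candidate_score = score(candidate)
        if candidate_score > best + min_gain:
            selected, best = candidate, candidate_score
    return selected


# Toy scorer standing in for cross-validated model performance:
# features "a" and "c" help, and every extra feature costs a little.
def toy_score(features: List[str]) -> float:
    useful = {"a": 0.3, "c": 0.2}
    return 0.5 + sum(useful.get(f, 0.0) for f in features) - 0.01 * len(features)


chosen = recursive_feature_addition(["a", "b", "c", "d"], toy_score)
```

With the toy scorer, only the two informative features survive, which mirrors how a 600-feature set can shrink to a much smaller one.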

Model Classification and Evaluation
We use the final set of 28-dimensional features to train an XGBoost model (20 trees, max. depth 6) to classify between Head-Gaze and Head Gesture. As demonstrated in HeadBoost [17], XGBoost was superior in performance across the testing folds compared to other models. We then evaluated the classifier using leave-one-participant-out cross-validation, training the classifier five times, each time on the data of four participants and evaluating it on the trials of the remaining participant. Model performance was evaluated using two metrics: the F1-score and the Area Under the Receiver Operating Characteristics Curve (AUC). The F1-score combines precision and recall in a single metric, while AUC measures the classifier's ability to differentiate between classes. Both metrics range from 0 to 1, with 1 indicating perfect performance. The performance results of the user-independent model indicate that the built classifier can reliably classify head movements, achieving a high average F1-score of 0.91 (SD=0.01) and a high AUC score of 0.93 (SD=0.01), as well as high Precision and Recall scores, 0.92 (SD=0.01) and 0.90 (SD=0.02), respectively.
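The evaluation protocol above can be illustrated with a short, dependency-free sketch of the fold construction and the F1-score. The function names and the integer class labels are assumptions for the example, and the classifier itself is left out.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def leave_one_participant_out(participant_ids):
    """Yield (held_out_id, train_indices, test_indices), one fold per
    participant, so no participant appears in both train and test."""
    for held_out in sorted(set(participant_ids)):
        train = [i for i, p in enumerate(participant_ids) if p != held_out]
        test = [i for i, p in enumerate(participant_ids) if p == held_out]
        yield held_out, train, test


# Example: five samples from three participants produce three folds.
folds = list(leave_one_participant_out([1, 1, 2, 3, 3]))
example_f1 = f1_score([1, 1, 0, 0], [1, 0, 1, 0])
```

With the xgboost Python package, the model configuration described above would correspond roughly to `XGBClassifier(n_estimators=20, max_depth=6)`; any further hyperparameters are not stated here.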

Mode Switching Logic
To switch into gaze mode, two conditions must be satisfied: (1) the trained ML classifier predicts gaze mode, and (2) the dispersion of the eye-in-head angles from the last 50 ms is greater than 3.6°. The second condition overrides the ML prediction if the user is still fixating, to maintain a steady head mode period. The 3.6° threshold is twice the eye tracking precision of the HTC Vive Pro Eye during static head phases (2 × 1.8° mean intersample RMS) [41], and the 50 ms duration is a trade-off between window size and real-time classification responsiveness. The dispersion threshold serves to counter eye tracking imprecision and prevent unintended gaze mode activation due to minor jitters. Furthermore, requiring confirmation from both the ML model and the dispersion threshold mitigates accidental mode-switching resulting from single-frame false predictions of the ML model. This approach differs from BimodalGaze, which activates gaze mode by detecting larger gaze shifts with a velocity threshold. Through a pilot study, we found that these thresholds worked well for the participants, giving confidence in our chosen parameters. The algorithm for the GazeSwitch mode switching logic can be found in Appendix A.
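The two-condition rule above can be sketched as a small stateful switcher. This is an illustration under assumptions, not the algorithm from Appendix A: the text specifies only the conditions for entering gaze mode, so entering head mode whenever the classifier predicts Head Gesture is assumed here, as is the dispersion measure (horizontal range plus vertical range of the windowed angles). The class and method names are hypothetical.

```python
from collections import deque

DISPERSION_THRESHOLD_DEG = 3.6  # 2 x 1.8 deg precision, as stated above
WINDOW_S = 0.050                # 50 ms dispersion window


def dispersion(points):
    """Assumed dispersion measure: horizontal range + vertical range (deg)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


class ModeSwitcher:
    """Combines a per-frame classifier prediction with an eye dispersion check."""

    def __init__(self):
        self.mode = "gaze"
        self._window = deque()  # (timestamp, yaw, pitch) of eye-in-head angles

    def update(self, t, eye_in_head_angles, predicted):
        # Maintain the sliding window of the last WINDOW_S seconds.
        self._window.append((t, *eye_in_head_angles))
        while self._window and t - self._window[0][0] > WINDOW_S:
            self._window.popleft()

        if predicted == "Head Gesture":
            # Assumption: the classifier alone triggers head mode.
            self.mode = "head"
        elif dispersion([(w[1], w[2]) for w in self._window]) > DISPERSION_THRESHOLD_DEG:
            # Both conditions hold: Head-Gaze predicted and eyes are moving.
            self.mode = "gaze"
        # Otherwise the eyes are still fixating, so the current mode is kept.
        return self.mode
```

In this form, an isolated Head-Gaze prediction during a steady fixation leaves head mode untouched, which is the stabilising effect the dispersion check is meant to provide.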

STUDY 1: PERFORMANCE EVALUATION
We evaluated the performance of GazeSwitch against two existing eye-head mode switching techniques, Eye+Head Pinpointing (manual) and BimodalGaze (threshold-based). We used a 3 × 2 × 3 within-subject design with the three techniques, two target widths (0.8°, 1.5°) and three amplitudes (10°, 25°, 40°). We recruited 12 participants, aged 21 to 50 (M=29, SD=7.07, 6 female), through the university's mailing lists for this study. No prior VR or eye tracking experience was required, but participants needed to have normal or corrected-to-normal vision. Eleven participants had either occasional or no VR experience, while one reported daily VR headset use. Six participants had no prior experience with eye tracking, whereas six reported occasional use. The study environment and tasks were developed in Unity version 2020.3.32f1. We collected the eye-in-world directional 3D vector, eye-in-head directional 3D vector, head position 3D vector, and head directional 3D vector using an HTC Vive Pro Eye VR HMD (90 Hz). The HMD has a field of view (FOV) of 100° in the horizontal plane and 110° in the vertical plane, and a built-in eye tracker (120 Hz).

Task
We adopted a pinpointing task for this study, similar to the task used in BimodalGaze [37], which required participants to perform precise pointing for target selection using eye-head mode switching. Hence, targets can be selected in either eye or head mode, while confirmation is triggered using the controller. Given that the 0.8° target might be challenging to discern at larger amplitudes, we enhanced its visibility by introducing a white crosshair surrounding the target, with a 3° transparent space at its centre and a thickness of 1°. Figure 2 illustrates the trial sequence.
At the onset of each trial, participants are guided visually to align their eyes and head: we enforce that the eye and head positions are within 5 and 2 degrees, respectively, of a centred neutral position, with the head velocity limited to less than 2°/s. Once the alignment is completed, a black circular target appears, signalling the participant to look towards it. The cursor is visible throughout the trial and is initially attached to the filtered gaze point. We applied a 1€ filter with a minimum cutoff frequency of 1 Hz, a slope (beta) value of 10, and the default derivative cutoff frequency of 1 Hz to smooth the cursor for visualisation, but the raw data streams were used as input to GazeSwitch.
Participants were required to place the cursor as close as possible to the target centre, with the option to switch to head mode to fine-tune the cursor position. In gaze mode, the cursor appears as a white ring (Figure 2b), while in head mode, it changes to a white cross (Figure 2d). When the cursor enters the target area, the target turns green as hover feedback. The participant then completes the selection with a button-up event of the thumbpad of the controller. If the cursor is off-target at selection, or if no selection is made within 5 seconds, an error audio cue is played, and the trial is marked as a failure. The target position is then re-queued at the end of the block for a maximum of two additional attempts. If the target is selected within the 5-second window with the cursor inside the target area, the trial is marked as a success and is not re-queued. The next trial begins after realigning the eyes and head back to the centre. The block concludes either upon selection of all targets or when the maximum attempts are exceeded (3 per target position).
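For readers unfamiliar with the 1€ filter, the following is a compact single-axis sketch using the parameter values quoted above. It follows the published description of the filter (an exponential low-pass whose cutoff grows with the estimated speed) but is an illustrative re-implementation, not the code used in the study; it differentiates against the previous filtered value, and it would be applied independently to each cursor coordinate.

```python
import math


class OneEuroFilter:
    """Speed-adaptive low-pass filter for one scalar signal."""

    def __init__(self, min_cutoff=1.0, beta=10.0, d_cutoff=1.0):
        self.min_cutoff = min_cutoff  # Hz, cutoff at zero speed
        self.beta = beta              # slope: how fast the cutoff rises with speed
        self.d_cutoff = d_cutoff      # Hz, cutoff for the derivative estimate
        self._t_prev = None
        self._x_prev = 0.0
        self._dx_prev = 0.0

    @staticmethod
    def _alpha(cutoff, dt):
        # Smoothing factor of a first-order low-pass at the given cutoff.
        tau = 1.0 / (2.0 * math.pi * cutoff)
        return 1.0 / (1.0 + tau / dt)

    def __call__(self, x, t):
        if self._t_prev is None:
            # The first sample passes through unchanged.
            self._t_prev, self._x_prev = t, x
            return x
        dt = t - self._t_prev
        # Low-pass filtered estimate of the signal's rate of change.
        dx = (x - self._x_prev) / dt
        a_d = self._alpha(self.d_cutoff, dt)
        dx_hat = a_d * dx + (1.0 - a_d) * self._dx_prev
        # Faster movement raises the cutoff, which reduces lag.
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)
        a = self._alpha(cutoff, dt)
        x_hat = a * x + (1.0 - a) * self._x_prev
        self._t_prev, self._x_prev, self._dx_prev = t, x_hat, dx_hat
        return x_hat
```

At low speeds the filter smooths jitter heavily (1 Hz cutoff), while a large beta lets the cursor follow saccade-like jumps with little lag.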
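The re-queueing rule above amounts to a simple queue with an attempt counter. The sketch below is a hypothetical illustration of that bookkeeping only; `run_block` and the `attempt_trial` callback are invented names, and the callback stands in for presenting a target and returning whether it was selected correctly within the time limit.

```python
from collections import deque

MAX_ATTEMPTS = 3  # one initial attempt plus two re-queued attempts


def run_block(targets, attempt_trial):
    """Run every target, re-queueing failures at the end of the block
    until each target succeeds or has used MAX_ATTEMPTS attempts."""
    queue = deque((target, 1) for target in targets)
    log = []
    while queue:
        target, attempt = queue.popleft()
        success = attempt_trial(target, attempt)
        log.append((target, attempt, success))
        if not success and attempt < MAX_ATTEMPTS:
            queue.append((target, attempt + 1))
    return log


# Example: target "A" always fails, target "B" succeeds immediately.
example_log = run_block(["A", "B"], lambda target, attempt: target == "B")
```

Error rate as defined later in the results would then be computed from first attempts only (entries with attempt number 1).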

Procedure
Upon arrival, participants were seated comfortably and provided with a briefing on the study. They were then given a consent form to sign and a demographic questionnaire to fill out. They were then instructed to put on the HMD, with assistance provided if required, and to undergo the five-point eye tracking calibration. For each technique block, participants completed six sequences (2 Target Sizes × 3 Repetitions) of 24 trials (8 Directions × 3 Amplitudes) each, with the two target size levels randomly and evenly ordered. The techniques were counterbalanced with a Latin Square. At the beginning of each block, participants were offered the opportunity to practise the current technique, involving one sequence of 24 trials with 0.8-degree targets.
At the end of each technique block, participants were asked to remove the HMD, fill out a NASA TLX questionnaire [14] and provide verbal feedback about the technique they just used. Participants continued to the next block when ready.
In total, each participant performed 144 trials (3 Techniques × 2 Target sizes × 24 Trials). The study took 60 minutes to complete, after which we progressed to a second study that took a maximum of 30 minutes, which we report in Section 5. Participants were compensated with a £10 Amazon gift card for their time. The study procedures were approved by Lancaster University's research ethics committee.

Results
We performed a three-way repeated-measures ANOVA with interaction technique, target size, and target amplitude as independent variables, using a significance level of α = 0.05. In cases where the data was ordinal or conventional transformations did not address normality, we applied the Aligned Rank Transform (ART) technique [49] and confirmed that the aligned responses approximately summed to zero. When the assumption of sphericity was violated, as indicated by Mauchly's test, we employed the Greenhouse-Geisser correction. Post hoc tests were carried out using pairwise t-tests with Bonferroni corrections or the ART procedure for multifactor contrast tests [10]. We analysed usability Likert-scale data using Friedman tests with Bonferroni-corrected Wilcoxon tests for post hoc analysis. Table 1 shows the mean and standard deviation for each performance evaluation metric.

Selection time.
Selection time, measured from the onset of a trial to a successful selection, serves as an indicator of the overall technique speed. We found a significant main effect for Technique (F(1.98, 21.74) = 11.27, p < 0.001), Target Size (F(1, 11) = 27.2, p < 0.001), and Target Amplitude (F(1.37, 15.12) = 56.1, p < 0.001). Post hoc examination demonstrated that Eye+Head Pinpointing exhibited a significantly shorter selection time compared to both BimodalGaze and GazeSwitch (p < 0.001) (see Figure 3a). Selection times were significantly longer at the 40° amplitude compared to all other amplitudes (p < 0.001) and at the smaller (0.8°) target size (p < 0.001). No significant interactions were observed.

Error rate.
We define an error as missing the target due to the trial timing out or having an inaccurate cursor position at the time of selection, measured as error rate (percentage of unsuccessful initial attempts). We found significant main effects for Technique (F(2, 22) = 12.88, p < 0.001) and Target Amplitude (F(2, 22) = 6.88, p < 0.01). Post hoc analysis showed that the error rate of Eye+Head Pinpointing is significantly lower than that of BimodalGaze (p < 0.001) and GazeSwitch (p < 0.05) (see Figure 3b). Further, the error rate of all techniques was significantly lower at 10° amplitude than at 40° amplitude (p < 0.05). No significant interactions were observed.

Time to first Head Mode.
This metric measures the time from the onset of the trial to when the participant first entered the head mode, reflecting how quickly each technique facilitates an intended switch to head mode. We observed that some participants never entered head mode under certain conditions, causing the data to deviate from normal distribution even after standard transformations were applied. Thus, for this metric, we only considered Technique and Target Amplitude as factors for the ANOVA analysis. We found significant main effects for Technique (F(2, 22) = 32.03, p < 0.001) and Target Amplitude (F(2, 23) = 28.14, p < 0.001). Moreover, we observed a significant interaction effect (F(4, 44) = 2.80, p < 0.05) for time to first head mode. Post hoc analysis showed that BimodalGaze has a significantly shorter time to enter the first head mode than both Eye+Head Pinpointing (p < 0.05) and GazeSwitch (p < 0.001). Further, BimodalGaze showed a significantly earlier transition to head mode compared to GazeSwitch at 25° and 40° amplitudes (p < 0.001) (see Figure 3c). Lastly, we found that GazeSwitch transitioned to head mode significantly earlier at 10° compared to 25° [...]
Furthermore, we observed significant two-way interaction effects (p < 0.05). Post hoc examination revealed that for every identical amplitude and target size, BimodalGaze exhibited a significantly lower number of mode switches compared to GazeSwitch (p < 0.01) (see Figure 3d).

Total head movement.
This metric is derived from the sum of the inter-sample Euclidean distance of head movement throughout the entire trial. It provides an assessment of the overall head movement performed by the participant and is useful for determining if the differences in head movement during head mode are meaningful when considering the demands of the entire task. Our analysis revealed significant main effects for Technique (F(2, 22) = 4.91, p < 0.05), Target Size (F(1, 11) = 9.13, p < 0.05), and Target Amplitude (F(2, 22) = 183.03, p < 0.001), along with interaction effects between Technique × Target Size (F(2, 22) = 3.58, p < 0.05) and Target Size × Target Amplitude (F(2, 22) = 0.027, p < 0.05). For the larger target (1.5°), BimodalGaze required significantly greater overall head movement compared to only Eye+Head Pinpointing (p < 0.05) (refer to Figure 3e). However, for smaller targets (0.8°), we observed that BimodalGaze exhibited significantly more head movement than both Eye+Head Pinpointing and GazeSwitch (p < 0.05). We further calculated total head movement in head mode only, which measures the overall effort of the selection technique, as more head movement during refinement may suggest more action from the users. We found a significant main effect for Technique (F(2, 22) = 12.93, p < 0.001), Target Amplitude (F(2, 22) = 26.62, p < 0.001), and the interaction between Target Size and Target Amplitude (F(2, 22) = 26.62, p < 0.05). Subsequent post hoc examination revealed that BimodalGaze necessitated significantly more head movement during refinement in comparison to the other techniques (p < 0.001) (see Figure 3f). However, we observed no significant difference in head movement between Eye+Head Pinpointing and GazeSwitch. As expected, the analysis indicated that targets with a 40° amplitude required significantly more head movement during refinement compared to all the smaller amplitudes (p < 0.05).
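The metric defined at the start of this subsection is a path length, and a short sketch makes the two variants (whole trial versus head mode only) explicit. The representation of head orientation as 2D (yaw, pitch) angles in degrees and the per-sample head-mode flags are assumptions of this example; the function name is hypothetical.

```python
import math


def total_head_movement(head_angles, in_head_mode=None):
    """Sum of inter-sample Euclidean distances along the head trajectory.

    head_angles: sequence of (yaw, pitch) samples in degrees (assumed format).
    in_head_mode: optional per-sample booleans; if given, only segments whose
    two end samples are both in head mode are counted.
    """
    total = 0.0
    for i in range(1, len(head_angles)):
        if in_head_mode is not None and not (in_head_mode[i - 1] and in_head_mode[i]):
            continue
        total += math.dist(head_angles[i - 1], head_angles[i])
    return total


# Example trajectory: two 5-degree segments separated by a stationary sample.
trajectory = [(0.0, 0.0), (3.0, 4.0), (3.0, 4.0), (6.0, 8.0)]
whole_trial = total_head_movement(trajectory)
head_mode_only = total_head_movement(trajectory, [False, True, True, True])
```

Because the sum accumulates every small displacement, exaggerated or corrective head motions inflate the metric even when the net displacement is small.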

Subjective feedback.
We observed a statistically significant difference in the NASA-TLX workload for performance, with participants rating BimodalGaze significantly lower than Eye+Head Pinpointing (p < 0.05). No other significant differences were found. We further analysed participants' verbal feedback and found that all three techniques were generally well-received. However, it was evident that each technique had its own set of limitations.
For Eye+Head Pinpointing, participants reported feeling "more in control" (P2) due to the ability to manually mode switch, resulting in the selection task being perceived as "convenient" (P4) and "fast" (P1, P3, P4).However, some participants (e.g., P3, P12) found it challenging when timing button presses and remembering to release the button, leading to more head movement required if the button was pressed too early.
For BimodalGaze, several participants found target selection to require "effort" (P2, P5, P9, P10), mainly because they perceived it as "less like an automatic switch" (P9), requiring more head movement and being "inconsistent" (P12, P10, P11) for mode switching. P7 noted, "It is more inconsistent, sometimes the cross appears when I didn't need it, other times it didn't appear when I wanted. I have to learn the head movement to turn on the cross." However, some participants acknowledged that when mode switching was accurate, BimodalGaze could make the task feel "easier" (P4, P11, P12) and "smooth" (P3, P5).

STUDY 2: USER EXPERIENCE EVALUATION
We developed two applications in Unity 2020.3.32f1 to compare the user experience of automatic and manual mode switching: (1) tracing the outline of an object with precise marker placement and (2) colouring objects in the scene. In the tracing task, participants have the flexibility to switch between gaze and head mode, allowing them to use both long sweeping lines and short successive selections. The colouring task was designed to highlight the affordance of gaze mode to quickly move across the sides of the screen, while using head mode to select small targets precisely.
In this second study, we exclusively compared GazeSwitch and Eye+Head Pinpointing techniques, as BimodalGaze operates on the same automatic mode switching principle but received lower perceived performance in our prior evaluation (see Section 4.3.6).This study followed immediately after the performance evaluation study (Section 4); hence, the same participants and procedures when taking breaks were used.At the start of this study, we briefed participants on both tasks and the operation of both techniques.We then asked the participants to wear the HMD and perform a five-point eye-tracking calibration.Participants first performed the tracing application using both techniques, but the order of the techniques was counterbalanced.After each task, participants were invited to comment on their overall experience of using each technique.Participants were also free to report their preferred technique for performing the task.

5.1.1 Tracing. This task is inspired by Gaze-Shifting [30] and demonstrates that users can precisely control the cursor to follow the contours of a car using mode switching. Hence, participants are tasked to trace the outline of the car by placing markers on it. Figure 4-Left shows the application scene, where a car is positioned at the centre, spanning 80° horizontally and 51° vertically. This is achieved by utilising the gaze mode to cover longer distances and the head mode to trace around smaller features. The tracing line is produced by extending it from a previous marker to the current cursor position. An outline marker is placed by releasing the thumbpad on the controller.
[Figure 4 caption, partially recovered: The user can leverage the eyes' saccadic movement to cover a large distance before switching to head mode to place a marker (b). The user can activate and stay in head mode to place multiple markers in close proximity to trace out details, e.g., the curvature of the wheel (c). Right: The user selects the desired colour, typically in head mode (b). The user can leverage gaze mode to saccade quickly to a target (c). If it is a big target, e.g., a wide leaf, they can select without refinement. If the target is small, e.g., the thin tree trunk, the user can activate head mode to refine the cursor position (d).]

5.1.2 Colouring. This task evaluates participants' experience of using the techniques in a setting where gaze mode suffices to select and interact with larger targets and head refinement is needed only for small targets. As shown in Figure 4-Right, the application scene displays three palm trees positioned at increasing distances from the user. Participants are assigned the task of applying colour to various parts of the palm trees, with different sections available for colouring. The process involves selecting a colour from a palette situated 25° to the left of the beach scene and then choosing the specific part of the palm tree to be coloured. Hover feedback is provided as the cursor lands on colourable parts of the tree. The tree closest to the user appears larger in visual angle than the farthest tree.
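The relation between distance and apparent target size in the colouring scene can be illustrated with the standard visual angle formula, θ = 2·atan(s / 2d). The following is a minimal sketch only; the trunk width and the three distances are hypothetical values chosen for illustration and are not taken from the study.

```python
import math


def visual_angle_deg(size_m: float, distance_m: float) -> float:
    """Visual angle in degrees subtended by an object of width size_m
    viewed head-on from distance_m (standard formula 2*atan(s / 2d))."""
    return math.degrees(2.0 * math.atan(size_m / (2.0 * distance_m)))


# Hypothetical scene: identical 0.3 m wide trunks at increasing distances.
for distance in (2.0, 4.0, 8.0):
    angle = visual_angle_deg(0.3, distance)
    print(f"trunk at {distance:.0f} m subtends {angle:.2f} degrees")
```

Under these assumed values, the same physical part shrinks in visual angle as distance grows, which is why the farthest tree is the most likely to call for head refinement.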

Results
Following the general inductive approach [44], two researchers independently coded interview transcripts, focusing on participant experiences with the techniques. Initially, the first coder proposed six themes, which the second coder refined by removing two. A final consistency check, in which both coders independently re-applied the themes, yielded 90% agreement, with disagreements resolved through discussion. Both GazeSwitch and Eye+Head Pinpointing were generally well-received, with four out of the twelve participants preferring GazeSwitch for tracing and eight for colouring. Thematic analysis revealed key findings centred around the effectiveness and consistency of mode switching:

5.2.1 Effort. Seven participants stated that GazeSwitch requires less effort than Eye+Head Pinpointing, mainly because it offers automatic mode switching. Participants further commented that GazeSwitch makes the experience more "fluent" (P5) than Eye+Head Pinpointing, as it allows them to focus on the task without the need to press a button to switch modes: "GazeSwitch saves clicking...I can zone out on how to do it and just focus on what you are doing." (P1).

5.2.2 Speed. Six participants mentioned that Eye+Head Pinpointing was faster than GazeSwitch, primarily because they found pressing a button to switch into head mode an easy action: "Eye+Head Pinpointing is quicker as it is straightforward and intuitive." (P3). Participants further noted that GazeSwitch could sometimes be time-consuming, particularly when the cursor got stuck in head mode due to their unfamiliarity with the technique. In contrast, five participants reported that GazeSwitch enabled them to complete tasks quicker than Eye+Head Pinpointing. This was attributed to the "accurate" (P1, P4) automatic mode switching offered by GazeSwitch: "Automatic is quicker, more efficient, especially when you get the hang of it." (P6).

5.2.3 Control over Mode Switching. Nine participants noted that Eye+Head Pinpointing offers greater control over mode switching compared to GazeSwitch. This enhanced control makes the mode switching more "stable" (P1, P3) and allows the participants to explore the visual scene with their eyes and head freely: "Eye+Head Pinpointing allows more manual control as I can move the head around without thinking about switching to head mode. I like the extra power...I can manually and precisely enter head mode when I want." (P8).

5.2.4 Stability of Mode Switching. Eight participants agreed that GazeSwitch is less stable than Eye+Head Pinpointing, resulting in a cursor that is "shaky" (P4, P10) and "jittery" (P3, P11). Participants noticed that the instability was worse at the edges of the field of view (FOV), possibly due to eye tracking loss or when the "eyes move away at the last minute before selection, presumably already moving on to the next outline point, causing the cursor to jump, which led to mistakes." (P2).

DISCUSSION
In this paper, we extended the insights from prior research to overcome limitations in existing eye-head mode switching techniques. Our contribution, GazeSwitch, leverages machine learning to optimise real-time switching between eye and head modes, enabling fast and precise hands-free pointing. Our findings demonstrate that adopting an ML-based classification approach reduces the occurrence of false positives resulting from natural head movements while efficiently detecting head gestures for input. The results from our two user studies not only validate the effectiveness of GazeSwitch in discrete target selection but also highlight its capability for continuous interaction, as demonstrated in our tracing task. This capability is significant for hands-free gaze and head interaction, as it has traditionally been available only in manual clutch-based techniques (e.g. Eye+Head Pinpointing) or other gaze-combined manual techniques (e.g. Gaze-Shifting).
GazeSwitch facilitates a smooth transition between pointing and refinement modes without requiring manual actions, as in Eye+Head Pinpointing, or exaggerated head movements due to threshold limitations, as in BimodalGaze. The fast and adaptive mode switching facilitated by our classifier does not impose specific behaviours on users but instead allows them to act more freely. This has an impact on other parts of GazeSwitch, notably mode feedback. In both Eye+Head Pinpointing and BimodalGaze, feedback is of utmost importance in showing the current mode. In its original implementation, Eye+Head Pinpointing forces users to enter head mode for selection, as gaze mode does not display any feedback. In BimodalGaze, the cursor switches colour to signify a mode switch, which is necessary to ensure that users perform a sufficiently exaggerated movement. In our work, we also implemented mode switch feedback by changing the circle into a crosshair. However, as our findings show that users could easily and seamlessly switch between modes, the need for explicit broadcasting of modes is reduced, potentially making the technique feel more fluid and synergistic.
However, the results of our studies also revealed trade-offs between GazeSwitch and the baseline switching techniques. Compared to Eye+Head Pinpointing, we found that GazeSwitch exhibited a higher error rate and longer selection time, but no significant differences were found in terms of overall head movement or the head movement required to enter head mode. There was also no significant difference in the onset of head mode or other performance metrics. These findings suggest that GazeSwitch allowed users to naturally utilise their heads, as participants commented on its effortless operation compared to manually activating head mode. However, the manual mode switch in Eye+Head Pinpointing offered greater control and stability, resulting in quicker and more accurate selections.
In comparison to BimodalGaze, GazeSwitch was perceived as less stable, possibly due to switching to gaze mode right before selection. Further analysis showed that 82.22% (SD=23.77) of the failed selections in which participants attempted head refinement were eventually made in gaze mode, and 97% (SD=5.49) of these could have succeeded if participants had selected in head mode. In these failed trials, participants maintained a final stable head mode for 0.81 seconds (SD=0.28). However, gaze velocity rose around 0.14 seconds before selection, unlike in successful trials, where gaze velocity only increased after selection. This distinct pattern (shown in Figure 5 in Appendix C) suggests participants might have looked away before selection, triggering gaze mode an unnoticeable 0.14 seconds (SD=0.12) before selection, thus undoing head-mode refinement. This aligns with research showing fixation probability peaks before interaction [18,36], potentially leading to "Late-Trigger errors" [20].
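The temporal pattern described above (a stable head-mode period that is broken shortly before selection) can be extracted from a logged mode timeline. The sketch below is illustrative only and is not the analysis code used in the study: the sample format, the `find_late_trigger` helper, and the 0.3 s `max_lead_s` cut-off are all assumptions introduced for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

GAZE, HEAD = "gaze", "head"


@dataclass
class LateTrigger:
    head_duration_s: float  # length of the final stable head-mode period
    lead_time_s: float      # time between leaving head mode and selection


def find_late_trigger(samples: List[Tuple[float, str]],
                      selection_t: float,
                      max_lead_s: float = 0.3) -> Optional[LateTrigger]:
    """Return the late-trigger pattern for one trial, or None if absent.

    samples: (timestamp_s, mode) pairs sorted by time (assumed log format).
    max_lead_s: assumed cut-off for what counts as "just before" selection.
    """
    samples = [s for s in samples if s[0] <= selection_t]
    if not samples or samples[-1][1] != GAZE:
        return None  # no data, or the selection was made in head mode

    # Walk back to the first sample of the final gaze-mode run.
    i = len(samples) - 1
    while i > 0 and samples[i - 1][1] == GAZE:
        i -= 1
    if i == 0:
        return None  # head mode was never entered before selection
    gaze_start = samples[i][0]

    # Walk back to the first sample of the head-mode run preceding it.
    j = i - 1
    while j > 0 and samples[j - 1][1] == HEAD:
        j -= 1
    head_start = samples[j][0]

    lead = selection_t - gaze_start
    if lead > max_lead_s:
        return None  # the switch to gaze mode was not last-moment
    return LateTrigger(head_duration_s=gaze_start - head_start,
                       lead_time_s=lead)


# Hypothetical trial: head mode from 0.5 s, gaze mode resumes at 1.25 s.
trial = [(0.0, GAZE), (0.5, HEAD), (1.0, HEAD), (1.25, GAZE)]
print(find_late_trigger(trial, selection_t=1.5))
```

Averaging `head_duration_s` and `lead_time_s` over failed gaze-mode selections would give quantities comparable in kind to the 0.81 s and 0.14 s reported above, assuming a similar logging granularity.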
Moreover, participants entered head mode later with GazeSwitch compared to BimodalGaze, but also required less head movement. We also found no differences in the selection time or error rate between the two techniques, highlighting the difference between threshold-based and ML-based techniques. When using BimodalGaze, participants commented that they needed exaggerated head movements to activate head mode, which resulted in increased effort, and the early activation of head mode did not translate into shorter selection times. While the threshold-based approach demonstrated stability, it also contributed to a decrease in perceived performance, as participants may require time to familiarise themselves with the head movement necessary for activating head mode in BimodalGaze.
Our work and study findings highlight the effectiveness of the machine learning approach for classifying head movements into head-gaze and head gestures for hands-free and adaptive interaction, which we initially proposed as part of our HeadBoost paper [17]. In contrast with this prior work, where we evaluated the HeadBoost classifier in an offline context, this paper demonstrates its feasibility for real-time classification and eye-head pointing. This opens up opportunities for enabling various expressive and robust head movements for interaction, including head-based gestures, inferring user intentions based on head movements, and further exploration of other application areas.

Limitations and Future Work
When GazeSwitch performed smoothly, participants enjoyed its efficiency and seamless interaction. However, when it failed to perform optimally, participants noticed unexpected mode switches that interrupted task completion. Participants' feedback indicated mode switching instability as the main limitation of GazeSwitch, particularly noticeable around the edges of the field of view (FOV). Some found it helpful to adjust their head positioning slightly to centralise the target before attempting refinement. Further, participants favoured GazeSwitch over Eye+Head Pinpointing in the colouring application, which had a narrower scene compared to the tracing application. These observations suggest that GazeSwitch's performance may be affected by large visual angles. Given that GazeSwitch heavily relies on eye tracking for mode prediction, a decrease in eye tracking precision at extreme visual angles could contribute to instability and premature gaze shifts ("Late-Trigger errors"). In contrast, Eye+Head Pinpointing allows users to switch to head mode even if eye tracking fails, providing a fallback option to continue the task. Exploring alternatives such as defaulting to head mode when eye tracking fails, as proposed in error-aware gaze-based interaction techniques [38], or algorithms capable of identifying intended targets [e.g. 18], could mitigate these challenges.
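One way to picture the suggested fallback of defaulting to head mode when eye tracking fails is a thin wrapper around the classifier output. This is a hypothetical sketch, not part of GazeSwitch: the `FallbackSwitcher` class, the `gaze_valid` flag (assumed to be reported by the eye tracker per frame), and the `recover_frames` parameter are all invented for illustration.

```python
class FallbackSwitcher:
    """Overrides the predicted mode with head mode while gaze data is
    unreliable, and for a short recovery window after it returns."""

    def __init__(self, recover_frames: int = 5) -> None:
        # Number of consecutive valid gaze frames required before the
        # classifier's prediction is trusted again (assumed parameter).
        self.recover_frames = recover_frames
        self._valid_streak = recover_frames  # start in the trusted state

    def update(self, gaze_valid: bool, predicted_mode: str) -> str:
        """Return the mode to use for this frame ("gaze" or "head")."""
        if gaze_valid:
            self._valid_streak = min(self._valid_streak + 1,
                                     self.recover_frames)
        else:
            self._valid_streak = 0
        if self._valid_streak < self.recover_frames:
            return "head"  # fall back: head pointing needs no eye data
        return predicted_mode


# Hypothetical frame sequence with a brief tracking dropout.
switcher = FallbackSwitcher(recover_frames=2)
for valid in (True, False, True, True):
    print(valid, switcher.update(valid, "gaze"))
```

The recovery window is a design choice made here to avoid flickering between modes when tracking is intermittently lost near the edges of the FOV; whether it helps in practice would need evaluation.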
We recognised two key limitations concerning machine learning. Firstly, our user studies demonstrated that GazeSwitch can effectively operate in diverse tasks and contexts, suggesting a sufficiently diverse dataset. However, while we collected 1500 trials in the training data, these were derived from only five participants. Future investigations could explore more representative data collection methods and alternative ML models to potentially enhance head-based classification performance and improve overall user experience. Secondly, like any ML-based classifier, the performance of GazeSwitch heavily relies on the quality and diversity of the collected data. Although we gathered data from a selection task with various target sequences, expanding data collection to encompass a broader range of eye tracking quality levels and different tasks and environments may result in a more robust classification system.
In this work, we evaluated our proposed technique within a virtual reality (VR) environment, utilising a robust VR HMD equipped with accurate motion tracking using base stations. GazeSwitch, like Eye+Head Pinpointing and BimodalGaze, is intended to function across various environments. As long as gaze and head tracking capabilities are available, any of these techniques can in principle be applied. Hence, the concept could theoretically be extended to desktop-based environments by employing a remote eye tracker and a standard webcam for eye and head tracking, offering a potential direction for future research.

CONCLUSION
In this paper, we contribute GazeSwitch, an ML-based technique designed to enhance real-time mode switching for fast and accurate hands-free pointing. This approach allows users to leverage fast gaze pointing for covering long distances and efficiently switch to head pointing for refinement in various contexts, enabling the selection of discrete small targets and facilitating continuous interaction. Through our evaluation of GazeSwitch against two baseline switching techniques (Eye+Head Pinpointing and BimodalGaze), we observed that our proposed technique demands less effort for mode switching and enables users to interact seamlessly without the need for exaggerated head movements to trigger mode switching. However, our findings also highlight performance limitations of GazeSwitch, notably occasional instability in mode switching. In conclusion, GazeSwitch demonstrates substantial potential for future developments in expressive head-based interactions and other application areas, broadening the possibilities for hands-free interaction.
C "LATE-TRIGGER ERROR" VISUALISATION

Figure 5 visualises the "Late-Trigger error" observed during interaction using the GazeSwitch technique. The error is characterised by a last-minute saccade away from the target just before selection, which undoes head refinement and causes a selection error. Failed selections in gaze mode displayed a notable increase in gaze velocity approximately 140 ms before selection. In contrast, successful trials showed an increase in gaze velocity only after the selection, indicating a distinct temporal pattern associated with selection success. In trials where selection occurred in gaze mode but failed, participants maintained a final stable head mode for 0.81 seconds (SD=0.28), only breaking into gaze mode 0.14 seconds (SD=0.12) before selection. During this head refinement period, 97.47% (SD=5.49) of attempts were able to align the cursor on the target, with a minimal average cursor-target offset of 0.22° (SD=0.08). However, at the last moment before selection, the refinement was undone by breaking into gaze mode, most likely due to the increased gaze velocity indicative of a 'saccade away' from the target, causing a selection error. These results suggest the "Late-Trigger error" may be a top contributor to errors and perceived instability when using GazeSwitch.

Fig. 1. Left: Task sequence for data collection. (a-b) The participant fixates on a target, receiving green feedback. 50% of the time, a new target appears, prompting a new gaze shift, and the sequence repeats. (c-d) 50% of the time, the target centre turns red, and a black dot appears in the centre to prompt a refinement to place the cursor as close to the target as possible using head mode (thumbpad press) before selecting the target (thumbpad release). A new target then appears, and the sequence repeats. Right: Data collection setup.

Fig. 2. Trial sequence. (a) The participant aligns their eyes and head to a centred neutral position following visual feedback by placing a black dot into a blue square. (b) Target onset: the cursor is visible as a white ring, indicating that the user is currently in Gaze Mode. (c) In Gaze Mode, as the participant's gaze shifts towards the target, the cursor follows eye gaze towards the target. (d) The participant may switch to Head Mode to refine the cursor position; if so, the cursor changes to a white cross to indicate Head Mode. The target turns green when acquired by the cursor. Selection is made with a button-up event of the thumbpad. The target is selectable in both Gaze and Head modes. The next trial begins after realigning eyes and head as in the first step. The cursor fading and the arrows are for illustration only.

Fig. 4. Applications. The cursor highlighting serves as a visual distinction for the mode and is for illustration purposes only. Yellow indicates gaze mode, while red indicates head mode. Both examples shown used GazeSwitch to complete the respective task. Left: The user can leverage the eyes' saccadic movement to cover a large distance before switching to head mode to place a marker (b). The user can activate and stay in head mode to place multiple markers in close proximity to trace out details, e.g., the curvature of the wheel (c). Right: The user selects the desired colour, typically in head mode (b). The user can leverage gaze mode to saccade quickly to a target (c). If it is a big target, e.g., a wide leaf, they can select without refinement. If the target is small, e.g., the thin tree trunk, the user can activate head mode to refine the cursor position (d).

Fig. 5. Eye-in-head velocity from 500 ms before, to 500 ms after selection by selection mode and outcome.Mean over all trials is shown as the solid black line, standard deviation in blue shade.

Table 1. Performance metrics for the techniques, with mean and standard deviation (in parentheses).
4.3.4 Number of mode switches. This metric quantifies how many times head mode is entered, providing insights into the overall stability of the mode switching for each technique. A count of 0 or 1 signifies complete stability in the technique, under the assumption that participants do not intentionally execute more than one mode switch. Our analysis revealed significant main effects for Technique ($F_{1,11} = 19.70$, $p < 0.001$), Target Size ($F_{1,11} = 18.17$, $p < 0.01$), and Target Amplitude ($F_{1,11} = 15.8$, $p < 0.001$