Use of a Low-Cost Portable 3D Virtual Reality Simulator for Psychomotor Skill Training in Minimally Invasive Surgery: Task Metrics and Score Validity

Background: The high cost and low availability of virtual reality simulators in surgical specialty training programs in low- and middle-income countries make it necessary to develop and obtain sources of validity for new models of low-cost portable simulators that enable ubiquitous learning of psychomotor skills in minimally invasive surgery.
Objective: The aim of this study was to obtain validity evidence for relationships to other variables, internal structure, and consequences of testing for the task scores of a new low-cost portable simulator mediated by gestures for learning basic psychomotor skills in minimally invasive surgery. This new simulator is called SIMISGEST-VR (Simulator of Minimally Invasive Surgery mediated by Gestures - Virtual Reality).
Methods: In this prospective observational validity study, the authors looked for multiple sources of evidence (known group construct validity, prior videogaming experience, internal structure, test-retest reliability, and consequences of testing) for the proposed SIMISGEST-VR tasks. Undergraduate students (n=100, reference group), surgical residents (n=20), and experts in minimally invasive surgery (n=28) took part in the study. After answering a demographic questionnaire and watching a video of the tasks to be performed, they individually repeated each task 10 times with each hand. The simulator provided concurrent, immediate, and terminal feedback and obtained the task metrics (time and score). From the reference group, 29 undergraduate students were randomly selected to perform the tasks 6 months later in order to determine test-retest reliability.
Results: Evidence from multiple sources, including strong intrarater reliability and internal consistency, considerable evidence for the hypothesized consequences of testing, and partial confirmation for relations to other variables, supports the validity of the scores and the metrics used to train and teach basic psychomotor skills for minimally invasive surgery via a new low-cost portable simulator that utilizes interaction technology mediated by gestures.
Conclusions: The results obtained provided multiple sources of evidence to validate SIMISGEST-VR tasks aimed at training novices with no prior experience and enabling them to learn basic psychomotor skills for minimally invasive surgery.


Background
The advent of minimally invasive surgery in the mid-1980s [1] led to an increase in the number of iatrogenic bile duct injuries when many surgeons worldwide switched from open surgery to minimally invasive surgery without any prior training [2][3][4][5][6][7][8]. As a result, simulation has since become a valuable tool for learning motor skills for minimally invasive surgery. Many studies have demonstrated its usefulness for this purpose and have shown that learned skills can be transferred to the operating theatre [9][10][11][12][13][14][15][16][17][18][19].
Recent years have seen the development of low-cost gesture-based touchless devices that can interact with 3D virtual environments, among them Kinect (Microsoft Corp), the Leap Motion Controller (Leap Motion Inc), and the Myo armband (Thalmic Labs) [26].
The Leap Motion Controller was launched in May 2012. It is based on the principle of infrared optical tracking, which detects the positions of fine objects such as fingertips or pen tips in a Cartesian plane; its interaction zone is an inverted cone of approximately 0.23 m³, and it has a motion detection range between 20 mm and 600 mm [27,28]. It measures 76 mm × 30 mm × 13 mm and weighs 45 g. It has 3 infrared emitters and 2 infrared cameras that capture the movements generated within the interaction zone [29,30]. The manufacturer reports an accuracy of 0.01 mm for fingertip detection, although one independent study showed an accuracy of 0.7 mm [31]. Although the Leap Motion Controller is designed mainly to detect hand motions, it can track objects such as pencils and laparoscopic surgical forceps [32][33][34].
The Leap Motion Controller has been used for the manipulation of medical images in interventional radiology and image-guided surgery, for settings where there is a risk of contamination through contact (autopsy rooms, for example), for touchless control (operating theatre lights and tables), and for simulation (minimally invasive surgery and robotic surgery). Various authors have used the Leap Motion Controller to develop simulators that track hand or instrument movements [26,32-39]. A paper by Lahanas [35] describes using the Leap Motion Controller to simulate 3 tasks: camera navigation, instrument navigation, and bimanual operation; 28 expert surgeons and 21 reference individuals took part in the study. The experts significantly outperformed the novices in all assessment metrics for instrument navigation and bimanual operation.
Simulators for learning skills for minimally invasive surgery can be classified into 3 types: traditional box trainers, augmented reality simulators (hybrid), and VR simulators [40,41]. Simulation has become a valuable tool for learning basic motor skills in surgery, but access to simulators remains problematic, especially in low- and middle-income countries, because of their high cost. Consequently, it is necessary to develop low-cost portable simulators and to validate their metrics and scores [42][43][44].
The aim of this study was to evaluate a simulation instrument, SIMISGEST-VR (Simulator of Minimally Invasive Surgery mediated by Gestures - Virtual Reality), and to document the sources of validity evidence for task scores, relations to other variables, internal structure, consequences of testing, and response process.

Hypotheses
To that end, 3 hypotheses were formulated:

Hypothesis 1: Validity Evidence for Relations to Other Variables
The first hypothesis is that the test scores discriminate between a reference group (no prior experience), surgical residents (less experienced), and surgeons (more experienced), which would show that the experts already have the basic psychomotor skills being measured. It also states that videogaming experience is correlated with better performance in simulator tasks, regardless of the level of training and experience.

Hypothesis 2: Evidence for Internal Structure
The intrarater test-retest assumes that, if a reference individual is not exposed to simulators in the period of time between the 2 complete simulator exercises, there will be no significant differences in performance between the first and second exercises.

Hypothesis 3: Evidence for Consequences of Testing
Regarding evidence for consequences of testing, the reference group learning will be demonstrated by improvements in the metrics and the final score when comparing the first and the tenth attempt in each task.
Content evidence includes a description of the steps taken to ensure that test content reflects, in a relevant way, the construct or characteristic being measured. The results obtained from the survey assessing fidelity to the criterion and content-related validity evidence for SIMISGEST-VR showed that all 30 participants felt that most aspects of the simulator were adequately realistic and that it could be used as a tool for teaching basic psychomotor skills in laparoscopic surgery (Likert score: range 4.07-4.73). The sources of content-related validity evidence showed that our simulator was a reliable training tool and that the tasks enabled learning of the basic psychomotor skills required in minimally invasive surgery (Likert score: range 4.28-4.67) [53].
Evidence for relations to other variables refers to the statistical association between the test scores and other characteristics or external measures that have theoretical relations, such as level of training, level of experience, prior videogaming experience, and scores for other already validated instruments. One of the most common correlations is known group construct validity (ie, the correlation between performance scores and level of training and experience) [54]. Relations may be positive (convergent or predictive) or negative (divergent or discriminant) depending on the constructs being measured [55]. This study explored the relations between performance scores and the level of training, experience, and prior videogaming experience.
Evidence for internal structure includes data that evaluate the relations between the individual items of the assessment, and how they correlate to the construct. It includes measures of reliability, reproducibility, and factor analysis. Reliability is a necessary but insufficient condition for validity [56]. Intrarater reliability was obtained using the test-retest method, which assesses the stability of responses over time [57]. Test-retest reliability was explored through blinded rerating after an interval of 6 months in the reference group. The randomly selected participants were asked whether they had had additional experience of using simulators during that period of time [56]. The answer was "no" in all cases. The data produced by this second test were not taken into account in the evidence for the construct validity study. Worster and Haines [58] noted that there was no published recommendation for the proportion of data that should be checked but that 10% was common. In this study, 29% of the reference individuals were included in the test-retest study. The demonstration of reliability is mandatory before an evaluation can be shown to be valid [54].
Evidence for consequences refers to the impact, benefit, or danger of assessment itself and the resulting decisions and actions. Yet, simply demonstrating consequences, even significant and impressive ones, does not constitute validity evidence unless investigators explicitly demonstrate that these consequences have an impact on score interpretation (validity) [46,55]. Evidence for consequences falls within a spectrum ranging from high-stakes examinations, such as licensing examinations, to low-stakes examinations, such as self-assessment used for formative feedback alone [54]. In our case, we hoped to obtain evidence to demonstrate that the reference group had achieved a learning curve.
Evidence for response process includes theoretical and empirical analyses evaluating the extent to which the assessors' and respondents' responses are aligned to the construct. It includes an evaluation of safety, of quality control, and of the actors' thoughts and actions during the assessment. The response process also includes the accuracy of data collection and entry into the database [54]. This type of evidence can be difficult to demonstrate because data are often qualitative [55].
All participants completed a questionnaire providing demographic data (Multimedia Appendix 1) and information about the dominant hand, level of training, levels of minimally invasive surgery skills, prior training with simulators, and experience with videogaming or VR devices.
After the instructor had given basic instructions about using the simulator and had shown a video of each task to be performed, the study participants performed 10 repetitions of tasks 1, 2, 4, 5, and 6 with each hand. Task 3 was repeated 10 times because both hands were considered dominant. The instructor did not give additional feedback, but the simulator did provide concurrent feedback (visual and auditory feedback while performing each task), immediate feedback (displaying the results in terms of time, accuracy and errors at the end of each task), and terminal feedback (performance curve and final score). The participants were able to watch the demonstration videos again at any time. For the test-retest reliability study, 29 participants were randomly selected from the reference group. They repeated the entire exercise 6 months after the first exercise; none were exposed to any type of simulator during that period of time. One of the authors (FAL) supervised and photographically documented each exercise.

SIMISGEST-VR
SIMISGEST-VR was developed using design-based research [59][60][61][62][63]. A previously published article [53] describes in detail the development of the device and a study assessing fidelity to the criterion and content-related validity evidence.

Virtual Environment
The virtual environment consisted of the following modules: registration to collect users' demographic data and a tutorial to show demonstration videos of the tasks to be performed.
Except for task 3, all tasks had the option of configuring the dominant hand during the exercise; task 3 required the simultaneous use of both hands, and therefore both were treated as dominant. Given its level of difficulty, this task was performed last in all cases. The online virtual environment ran on Windows (Microsoft Corp) and macOS (Apple Inc) platforms.

Metrics
The metrics were established using 5 parameters: time (velocity), efficiency of movement for the right and left hands [21,74], economy of diathermy, error and accuracy (penalty) [75,76], and final score.

Feedback
Feedback is essential [77]. Training on a simulator should have 3 purposes: to improve performance, to make performance consistent, and to reduce the number of errors [78]. Haptic sensation and concurrent feedback were simulated using sound signals, color changes in the objects, and movement of the object when an undue collision occurred between the different components of the environment or when an error occurred during the task. For SIMISGEST-VR, we adopted 3 types of feedback: concurrent, which was provided while the task was being performed; immediate, which was provided at the end of each task, when the system reported the presence or absence of errors, efficiency, and the time taken; and terminal, which was provided at the end of each training session, when the system provided a series of graphs and tables showing performance over time [79][80][81][82][83]. The data generated by the program were stored on an SQL (structured query language) database engine integrated into the simulation software.

Hardware
Two simulated laparoscopic forceps, made using 3D printers, were used; they did not need to be functional. The final device with all its components assembled is shown in Figure 2. Figure 2 shows the fixing pad (1) for the Leap Motion Controller and the mounting support devices (3) for the minimally invasive surgery laparoscopic forceps (2), which allow simulation of the fulcrum effect; the Leap Motion Controller (4), responsible for detecting the movements of the instruments; and the computer, which, by means of the software programs running on it, administers the virtual environment and the metrics, and provides feedback and the final performance score on the screen (5) where the 3D virtual environment is displayed.
To perform the test, a 13-inch MacBook Pro (Apple Inc) was used, which served as a screen, ran the 3D virtual environment, and stored metrics data.

Data Analysis
Continuous data are presented in a frequency distribution table as means and standard deviations. The Shapiro-Wilk test was used to assess normality. Categorical data are also presented in a frequency distribution table. Since the metrics data were not normally distributed, nonparametric tests were used to assess the hypotheses. Regarding hypothesis 1, the differences in the scores and time taken to perform the first trial in each task between novices and experts were compared using the Wilcoxon signed-rank test. Among the novices, the final scores of the tenth trial in each task were compared by prior videogaming experience using the Kolmogorov-Smirnov test. To assess hypothesis 2, internal consistency was calculated using Cronbach α. In addition, test-retest reliability was assessed by comparing the tenth trial in each task performed initially and repeated 6 months later using the Spearman correlation coefficient. To assess hypothesis 3, the scores and time taken in the first and last trials in each task were compared by level of training using the Wilcoxon signed-rank test. In addition, excess diathermy in the first and last trials in tasks 5 and 6 was compared by level of training using the Wilcoxon signed-rank test. Statistical significance was set at P<.05. Statistical analysis was performed using Stata (version 15.0; StataCorp LLC).
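The paired comparison of first and last trials described for hypothesis 3 can be illustrated with a short sketch. This is not the authors' Stata analysis: the function name and the sample scores are hypothetical, and the P value uses a plain normal approximation without tie or continuity correction, which is only rough for small samples.

```python
import math


def wilcoxon_signed_rank(first, last):
    """Paired Wilcoxon signed-rank test (two-sided, normal approximation).

    Returns (w_plus, w_minus, p_value). Zero differences are dropped.
    Hypothetical helper for illustration only.
    """
    if len(first) != len(last):
        raise ValueError("paired samples must have the same length")
    diffs = [b - a for a, b in zip(first, last) if b != a]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank absolute differences, giving tied values the mean of their ranks.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)

    # Normal approximation of the smaller rank sum under the null hypothesis.
    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (min(w_plus, w_minus) - mean_w) / sd_w
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return w_plus, w_minus, p_value


# Made-up scores for 10 participants: trial 1 versus trial 10.
trial_1 = [10, 12, 9, 14, 11, 13, 8, 15, 10, 12]
trial_10 = [11, 14, 12, 18, 16, 19, 15, 23, 19, 22]
w_plus, w_minus, p = wilcoxon_signed_rank(trial_1, trial_10)
print(w_plus, w_minus, p < 0.05)
```

In practice, a statistics package with exact P values and tie handling would be preferable to this hand-rolled approximation.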

Demographic Profile
Regarding prior experience with simulators, 35% (7/20) of the surgical residents and 36% (10/28) of the surgeons surveyed said they did not have any. Among the surgical residents, only 15% (3/20) had experience with VR simulators, and none had any experience with hybrid ones.

Validity Hypothesis 1: Relations to Other Variables
To explore validity evidence for relations to other variables, we compared the SIMISGEST-VR test scores across experience levels (known group construct validity). No statistically significant differences were found in the scores of the first trial in each task between novices and experts; however, the times taken to perform tasks 3 (P=.006) and 6 (P=.02) were statistically significantly shorter for the experts than for the reference group (Table 3). Performance in task 5 was better for novices who had prior videogaming experience (P=.01), as shown in Figure 3. The difference between the reference group and the experts in the time taken to perform the first trial of task 3 (P=.006) is shown in Figure 4.

Validity Hypothesis 2: Internal Structure
The items demonstrated high internal consistency (Cronbach α=.81). Regarding the final scores in all the tasks, no statistically significant differences were found between the first exercises and those 6 months later for the randomly selected participants from the reference group (Table 4); when time was assessed as a metric, statistically significant differences were found for tasks 4 (trial 10: P=.048) and 6 (trial 10: P=.03). This demonstrates full evidence for the internal structure and test-retest reliability with respect to the score and partial evidence with respect to time as a metric (Table 4).

Validity Hypothesis 3: Consequences of Testing
Among the reference group, statistically significant differences were found in the scores and the time taken to perform each task between the first and tenth trials. Among the experts, statistically significant differences were found in the scores in tasks 1 (P<.001), 3 (P=.03), and 4 (P=.01), and in the time taken to perform each task. These findings demonstrate a learning curve (Table 5). In task 5, the reference group made statistically significantly fewer excess diathermy errors in the tenth trial than they did in the first trial (P=.003), which is evidence of a learning curve (Table 6).
Table 6. Excess diathermy errors when doing trials 1 and 10 in tasks 5 and 6, by level of training. Wilcoxon signed-rank test.

Response Process Validity
Study participants had the opportunity to observe each task in advance by watching a video, and they received basic instruction. The only feedback the participants received was from the simulator; they did not receive any other type of feedback from the instructor. Each of the 177 tests performed (148 initial tests and 29 test-retests) was supervised by the same person (FAL). Photographic documentation of every person performing the tasks was obtained. The final performance scores were defined in advance by using the formula described in another study [53]. The exercise results were stored in a lightweight SQL database within the simulator app itself after each test.
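As an illustration of how per-trial results might be stored in an embedded SQL database, the sketch below uses Python's built-in sqlite3 module. The choice of SQLite, the table layout, and all column and function names are assumptions made for this example; the text does not describe the actual SIMISGEST-VR schema.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS trial_result (
    id             INTEGER PRIMARY KEY,
    participant_id TEXT    NOT NULL,
    task           INTEGER NOT NULL,
    hand           TEXT    NOT NULL CHECK (hand IN ('left', 'right', 'both')),
    trial          INTEGER NOT NULL,
    time_s         REAL    NOT NULL,
    errors         INTEGER NOT NULL,
    final_score    REAL    NOT NULL
)
"""


def open_store(path=":memory:"):
    """Open (or create) the embedded database and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_trial(conn, participant_id, task, hand, trial, time_s, errors, final_score):
    """Store the metrics of one completed trial."""
    with conn:  # commits on success, rolls back if the insert fails
        conn.execute(
            "INSERT INTO trial_result "
            "(participant_id, task, hand, trial, time_s, errors, final_score) "
            "VALUES (?, ?, ?, ?, ?, ?, ?)",
            (participant_id, task, hand, trial, time_s, errors, final_score),
        )


def scores_for(conn, participant_id, task, hand):
    """Return (trial, final_score) pairs in trial order, e.g. for a performance curve."""
    cursor = conn.execute(
        "SELECT trial, final_score FROM trial_result "
        "WHERE participant_id = ? AND task = ? AND hand = ? ORDER BY trial",
        (participant_id, task, hand),
    )
    return cursor.fetchall()


conn = open_store()
save_trial(conn, "p01", 5, "right", 1, 55.0, 3, 60.0)
save_trial(conn, "p01", 5, "right", 2, 40.0, 1, 80.0)
print(scores_for(conn, "p01", 5, "right"))
```

Writing each trial inside its own transaction means a crash mid-session would lose at most the trial in progress.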

SIMISGEST-VR Simulator Cost
The Leap Motion Controller costs approximately US $130, and the hardware elements cost approximately US $70. LapSim essence (Surgical Science Sweden AB) is a portable VR simulator that enables people to learn basic skills. It does not include haptics and is not available for sale, but it can be hired for 6 months at an approximate price of US $5500. To date, there are no publications about the validity of the tasks that this device proposes.

General
The aim of this study was to evaluate a simulation instrument, SIMISGEST-VR, and to document the sources of validity evidence for task scores, relations to other variables, internal structure, consequences of testing, and response process.
Technology-enhanced simulation is defined as an "educational tool or device with which the learner physically interacts to mimic an aspect of clinical care for the purpose of teaching or assessment" [83,84]. The use of simulators for learning basic psychomotor skills in minimally invasive surgery has been supported by multiple systematic reviews [16,17,85-90].
In the current state-of-the-art conceptual framework, validity is defined as the appropriate interpretation or use of test results and therefore applies only to the scores or interpretation in a specific context. The commonly used term valid instrument is inaccurate, and validity must be established for each intended interpretation [46,48,50]. Thus, when an evaluation instrument is said to be "valid" or to have been "validated," it is essential to take into account the learning context, the performance context, the domain content, and the exigency of decisions taken on the basis of test results [91].
Validation refers to the process of collecting validity evidence to evaluate the appropriateness of the interpretations, uses, and decisions based on assessment results [52]. Validation is, therefore, a process and not an endpoint, and it involves gathering evidence and taking decisions based on the interpretation of the data obtained. In our case, validation required a series of experiments designed to provide evidence that the scores measured in SIMISGEST-VR reflected the technical skills they purported to measure [92].
The first step in any validity evaluation is to clearly define the construct. The construct we focused on validating was training and learning basic psychomotor skills in minimally invasive surgery using a low-cost portable simulator called SIMISGEST-VR. Several systematic reviews [16,[93][94][95][96] have found that basic psychomotor skills can be learned in low-cost simulation models; however, low-cost simulators are often box trainers made from cardboard boxes [97], plastic crates [98], folding portable boxes [99], and boxes that require the use of laparoscopic equipment [100,101] or even an iPad [102]. There are no low-cost VR simulators on the market.
An important finding from this study was the high percentage of surgical residents and surgeons that had no experience with simulators, and the very low percentage of surgical residents who had experience with hybrid and VR simulators. This finding can be explained by the high cost of this type of simulator, which, in many countries, prevents the creation of simulation centers for learning basic psychomotor skills in minimally invasive surgery and constitutes an argument in favor of exploring the development of models of low-cost portable VR simulators such as SIMISGEST-VR. Ucelli [44] demonstrated a comparable outcome between supervised simulator practice and unstructured free simulator access without mentoring and, therefore, that "take home" simulation was both viable and economically beneficial.

Validity Hypothesis 1: Relations to Other Variables
It is currently considered that a comparison between reference individuals and experts does not constitute an important validity argument [103,104]. However, it is the type of evidence for relations to other variables that is most often referred to in the literature [105]. The SIMISGEST-VR tasks were unable to demonstrate any difference in the performance scores between the reference group and the experts. A statistically significant difference was found between these 2 groups only in the time taken to complete tasks 3 (P=.006) and 6 (P=.02), which were the most complex.
Although some studies support the hypothesis that videogaming experience has a positive impact on minimally invasive surgery performance [106][107][108][109][110][111][112][113], in this study a significant difference was found for the reference group only in task 5 (diathermy; P=.003); in the other tasks, prior experience did not have any impact on performance. The demographic characterization made it clear that frequent videogaming (daily or weekly) was low in all population groups, which can explain the absence of impact on performance.
The lack of evidence for relations to other variables in this study can also be explained by the simplicity and ease of the proposed tasks.

Validity Hypothesis 2: Internal Structure
The items demonstrated high internal consistency (Cronbach α=.81). A test should not be used if it has a Cronbach α<0.7, and it should not be used for important decisions on an individual unless the Cronbach α>0.9 [57,[114][115][116]. In our case, therefore, the result enables us to support the use of SIMISGEST-VR tasks as a self-assessment test used for formative assessment [54].
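For readers unfamiliar with the statistic, the following sketch computes Cronbach α from a participants-by-items score matrix and applies the thresholds quoted above (below 0.7: do not use; above 0.9: suitable for important individual decisions). The function names and the example data are hypothetical; the study itself used Stata.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach alpha for a list of rows, one row of k item scores per participant."""
    if len(scores) < 2:
        raise ValueError("need at least 2 participants")
    k = len(scores[0])
    if k < 2 or any(len(row) != k for row in scores):
        raise ValueError("every participant needs the same k >= 2 item scores")

    item_variances = [variance(column) for column in zip(*scores)]
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def interpret_alpha(alpha):
    """Map alpha to the usage thresholds quoted in the text."""
    if alpha < 0.7:
        return "should not be used"
    if alpha > 0.9:
        return "usable for important individual decisions"
    return "usable for formative, low-stakes assessment"


# Made-up scores: 4 participants x 3 tasks.
example = [[70, 65, 72], [80, 78, 85], [60, 62, 58], [90, 88, 91]]
print(round(cronbach_alpha(example), 3))
print(interpret_alpha(0.81))
```

With the reported α of .81, the helper returns the formative-use category, which matches the interpretation given in the paragraph above.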
Test-retest reliability is the correlation between scores for a test administered more than once among a homogeneous group of test takers at 2 different times (temporal stability); the longer the period of time, the less likely it is that a person will remember the simulator tasks and, therefore, the greater the test-retest threat will be [116][117][118][119]. In this study, the second exercise was performed 6 months after the first one, and the results obtained demonstrate significant evidence for the temporal stability of scores in the 6 tasks. When the metric used was time, similar results were obtained in all but tasks 4 (P=.048) and 6 (P=.03) when comparing the tenth trial.
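The correlation underlying test-retest reliability can be sketched as a Spearman rank correlation between the scores of the same participants on the two occasions. This is a minimal pure-Python illustration with hypothetical function names and made-up scores, not the authors' analysis.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(test, retest):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    if len(test) != len(retest) or len(test) < 2:
        raise ValueError("need two paired samples of equal length >= 2")
    rx, ry = average_ranks(test), average_ranks(retest)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant sample has no defined rank correlation")
    return cov / (var_x * var_y) ** 0.5


# Made-up tenth-trial scores: initial session versus 6 months later.
initial = [82, 75, 90, 68, 88]
six_months = [80, 77, 91, 70, 85]
print(spearman_rho(initial, six_months))
```

A coefficient close to 1 indicates that participants kept roughly the same rank order across the two sessions, which is what temporal stability means here.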

Validity Hypothesis 3: Evidence for Consequences of Testing
The most important finding of this study is that the reference group learned in all the SIMISGEST-VR tasks. Excess diathermy error, defined as a contact time longer than 2 seconds from the moment of initial contact, fell significantly (P=.003) between the first and tenth trials for task 5 in the reference group, which also constitutes evidence for a learning curve. The expert group achieved a learning curve in all the tasks when time was taken as a metric, and for tasks 1 (P<.001), 3 (P=.03), and 4 (P=.01) when the final score of the test was taken into account. We, therefore, consider that the SIMISGEST-VR tasks can be used for the purpose of enabling novices without any prior experience to learn basic psychomotor skills in minimally invasive surgery.
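The definition of an excess diathermy error given above (contact lasting longer than 2 seconds from the moment of initial contact) can be expressed as a small counting routine. The data layout, a list of contact start and end times in seconds, and the function name are assumptions for illustration; the text does not describe how the simulator records contacts.

```python
EXCESS_CONTACT_S = 2.0  # threshold from the text: contact longer than 2 seconds


def count_excess_diathermy(contacts, limit_s=EXCESS_CONTACT_S):
    """Count contacts whose duration exceeds the limit.

    contacts: iterable of (start_s, end_s) pairs, one per diathermy contact.
    Hypothetical input format; a contact of exactly `limit_s` is not an error.
    """
    count = 0
    for start_s, end_s in contacts:
        if end_s < start_s:
            raise ValueError("contact ends before it starts")
        if end_s - start_s > limit_s:
            count += 1
    return count


# Made-up contacts with durations of 1.5 s, 2.0 s, 2.5 s, and 3.0 s.
trial_contacts = [(0.0, 1.5), (3.0, 5.0), (6.0, 8.5), (10.0, 13.0)]
print(count_excess_diathermy(trial_contacts))
```

Counting errors per trial in this way yields the paired trial 1 and trial 10 values that a first-versus-last comparison would take as input.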
This study has several strengths. The reference group sample included 100 students from 2 faculties of medicine, one public and one private; surgeons from a range of specialties; and surgical residents in general surgery and obstetric-gynecologic surgery. Physical simulators require the presence of a specialized tutor, a scarce, high-cost human resource, whereas VR simulators provide metrics and automatic feedback and allow the physical presence of a tutor to be dispensed with. At times of a pandemic such as COVID-19, this concept of education via VR takes on considerable significance because it avoids the need for learners to travel to simulation laboratories and, therefore, avoids close contact between students and instructors.
This study also has limitations. Although the size of the reference group was large, a larger expert group would have been desirable. The sample size in our study was one of availability; as such, there are relatively more participants with minimal surgical experience compared to those with a lot of experience, such as senior surgical residents and surgeons. The low number of senior residents prevented significant results from being obtained when comparing them to the other groups. Another limitation of the data analysis in this study is that there was no statistical analysis performed before the trial to evaluate the proper sample size or to determine the Likert scale.

Conclusions
This study has provided evidence to support the use of SIMISGEST-VR as a low-cost portable tool for the purpose of enabling novices without any prior experience to learn basic psychomotor skills in minimally invasive surgery. The tasks for learning basic motor skills in minimally invasive surgery demonstrated high internal consistency and high test-retest reliability among the reference group when assessing the task scores. The expert group also managed to obtain a learning curve in all the tasks when assessing the time metric. In this study, we were able to demonstrate partial evidence for relations to other variables and strong evidence for internal structure and test consequences.
Future work streams include the creation of different levels of difficulty in the tasks. We also intend to develop an app that can be downloaded online, which contains the full training program. Finally, we hope to develop simulation models using the Leap Motion Controller and other gesture-recognition devices such as the Myo armband.