1 Report - SG Models study
1.1 Description of the tests
1.1.1 Location1
1.1.2 Location2
1.1.3 Online survey
1.1.4 Age of participants
1.1.5 Familiarity with games and serious games
1.1.6 Participants per model
1.2 Quantitative data
1.2.1 ATMSG SUS scores analysis
1.2.1.1 ATMSG SUS scores, all groups
1.2.1.2 ATMSG SUS scores by location
1.2.1.3 SUS scores by familiarity with games
1.2.1.4 SUS scores by familiarity with serious games
1.2.2 LM-GM SUS scores analysis
1.2.2.1 LM-GM SUS scores, all groups (except A)
1.2.2.2 LM-GM SUS scores by location
1.2.2.3 LM-GM SUS scores by familiarity with games
1.2.2.4 LM-GM SUS scores by familiarity with serious games
1.2.3 ATMSG vs LM-GM SUS scores
1.3 Qualitative analysis
2 Annex I: Calculation of SUS scores
3 Annex II: SUS Questionnaire, by question

The tests were carried out in Location1 on 29/05/2014. Participants from Location1 were placed in Group A. Of the 18 participants, 13 handed in their questionnaires and analyses (12 males, 1 female). All participants had previously played the game “MarketPlace” during the two months of the course.

In Location2, the study was conducted over two weeks as part of an undergraduate course in Industrial Management. In total, 15 participants took part in the study (13 males, 2 females), divided into two groups (B and C).

A third study was conducted online, using a LimeSurvey questionnaire and a Google Docs implementation of the analysis templates for both LM-GM and ATMSG. Participants were asked to evaluate the same game (“Senior PM Game”) using both models. Group D used ATMSG first, while Group E used LM-GM first. Participants could stop the survey partway through and return to it later.

We recruited only participants who self-identified as very familiar with serious games or as serious games experts. They were given a 20-euro voucher as compensation for their time.

Age of all participants
vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
1	32	23.34	4.783	23	22.46	1.483	19	44	25	2.794	8.801	0.8455
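
As a quick consistency check on the age table, the se and range columns can be recomputed from the other columns. The snippet below is a minimal sketch in Python, assuming that se is the standard error of the mean (sd divided by the square root of n), which is the usual meaning in this kind of summary; the variable names are ours and do not come from the original analysis.

```python
import math

# Values copied from the "Age of all participants" table.
n, sd, age_min, age_max = 32, 4.783, 19, 44

# Assumption: se is the standard error of the mean, sd / sqrt(n).
se = sd / math.sqrt(n)
age_range = age_max - age_min

print(round(se, 4))  # 0.8455
print(age_range)     # 25
```

Both values agree with the table, which supports reading se as sd / sqrt(n) in the other summaries of this report as well.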

Familiarity of participants with games, per group
	A	B	C	D	E	Sum
Non-Gamer	4	3	7	0	0	14
Gamer	9	3	2	2	2	18
Sum	13	6	9	2	2	32

Familiarity of participants with serious games, per group
	A	B	C	D	E	Sum
Non-expert	12	6	9	0	0	27
SGExpert	1	0	0	2	2	5
Sum	13	6	9	2	2	32

The tables below show how many participants used which model to evaluate which game and the number of participants in each group.

Note that 3 participants did not deliver their analysis of Senior PM Game (two in Group B and one in Group C).

This section evaluates the SUS score given to each model by the participants.

For details on the SUS scale and how the SUS score is calculated, see Annex I. For a visualization of responses to each of the 10 questions of the SUS questionnaire, see Annex II.

ATMSG SUS scores, all groups
##   vars  n  mean   sd median trimmed   mad  min max range skew kurtosis  se
## 1    1 30 58.83 17.5     60   58.85 14.83 22.5  95  72.5 -0.1    -0.59 3.2

ATMSG SUS scores by location
## group: Location1
##   vars  n  mean    sd median trimmed   mad min  max range  skew kurtosis
## 1    1 13 58.85 12.15   62.5   59.77 11.12  35 72.5  37.5 -0.71    -0.81
##     se
## 1 3.37
## -------------------------------------------------------- 
## group: Location2
##   vars  n mean    sd median trimmed   mad  min max range skew kurtosis
## 1    1 13   55 22.31   52.5   54.32 25.95 22.5  95  72.5 0.35    -1.13
##     se
## 1 6.19
## -------------------------------------------------------- 
## group: Online
##   vars n  mean    sd median trimmed  mad  min  max range  skew kurtosis
## 1    1 4 71.25 10.51   72.5   71.25 9.27 57.5 82.5    25 -0.24    -1.93
##     se
## 1 5.25

ATMSG SUS scores by familiarity with games
## group: Non-Gamer
##   vars  n mean   sd median trimmed   mad  min max range skew kurtosis   se
## 1    1 13 47.5 15.1   47.5   47.73 18.53 22.5  70  47.5 0.01    -1.33 4.19
## -------------------------------------------------------- 
## group: Gamer
##   vars  n mean    sd median trimmed   mad  min max range skew kurtosis
## 1    1 17 67.5 14.14     70   67.67 11.12 37.5  95  57.5 0.01    -0.33
##     se
## 1 3.43

ATMSG SUS scores by familiarity with serious games
## group: Non-expert
##   vars  n mean    sd median trimmed   mad  min max range skew kurtosis
## 1    1 25 56.7 18.04     55   56.19 22.24 22.5  95  72.5  0.1    -0.62
##     se
## 1 3.61
## -------------------------------------------------------- 
## group: SGExpert
##   vars n mean   sd median trimmed   mad  min  max range skew kurtosis   se
## 1    1 5 69.5 9.91     70    69.5 11.12 57.5 82.5    25 0.06    -1.91 4.43

Considering only the ATMSG SUS scores, we wanted to know whether there were any differences between the groups depending on three factors: the game played, the participant’s familiarity with games, and the participant’s familiarity with serious games.

For that purpose, we performed an ANOVA on the ATMSG SUS scores, comparing the three conditions (familiarity with games, familiarity with serious games, and the game played).

H0 = The SUS scores are the same, independently of the game played and the participants’ familiarity with games or serious games.
H1 = The SUS scores are NOT the same, considering the game played and the participants’ familiarity with games or serious games.

## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  1    0.32   0.57
##       28

	num Df	den Df	MSE	F	ges	Pr(>F)
gamefam	1	25	230.9	10.1255	0.2883	0.0039
sgsfam	1	25	230.9	0.0305	0.0012	0.8628
game	2	25	230.9	0.2964	0.0232	0.7461

## Tables of means
## Grand mean
##       
## 58.83 
## 
##  gamefam 
##     Non-Gamer Gamer
##          47.5  67.5
## rep      13.0  17.0
## 
##  sgsfam 
##     Non-expert SGExpert
##          58.43    60.83
## rep      25.00     5.00
## 
##  game 
##     MarketPlace Senior PM Game Vikings
##           56.69           59.6   61.24
## rep       13.00            8.0    9.00

A significant difference was found in the game familiarity condition (F(1, 25) = 10.13, p = 0.0039). We performed a pairwise post-hoc Tukey test to identify which groups differ.

## group: Non-Gamer
##   vars  n mean   sd median trimmed   mad  min max range skew kurtosis   se
## 1    1 13 47.5 15.1   47.5   47.73 18.53 22.5  70  47.5 0.01    -1.33 4.19
## -------------------------------------------------------- 
## group: Gamer
##   vars  n mean    sd median trimmed   mad  min max range skew kurtosis
## 1    1 17 67.5 14.14     70   67.67 11.12 37.5  95  57.5 0.01    -0.33
##     se
## 1 3.43

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = ATMSG_data_sus_anova$lm)
## 
## $gamefam
##                 diff  lwr   upr  p adj
## Gamer-Non-Gamer   20 8.47 31.53 0.0015
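
The confidence interval in the Tukey output can be reproduced by hand from numbers already reported. The sketch below is in Python and rests on two assumptions that are ours, not statements from the original analysis: that the test used the residual mean square (230.9) and residual degrees of freedom (25) from the ANOVA table, and that the critical value 2.0595 (the 0.975 quantile of Student's t with 25 degrees of freedom, from a standard t table) applies. With only two groups, the Tukey interval coincides with the ordinary t interval for a difference of means.

```python
import math

# Values copied from the ANOVA table and the tables of means.
mse = 230.9                     # residual mean square (25 degrees of freedom)
n_non_gamer, n_gamer = 13, 17   # group sizes
diff = 67.5 - 47.5              # Gamer mean minus Non-Gamer mean

# Assumption: 0.975 quantile of Student's t with 25 degrees of freedom,
# taken from a standard t table (two-sided 95% interval).
T_CRIT_DF25 = 2.0595

# Standard error of the difference between the two group means.
se_diff = math.sqrt(mse * (1 / n_non_gamer + 1 / n_gamer))
half_width = T_CRIT_DF25 * se_diff
lwr, upr = diff - half_width, diff + half_width

print(round(lwr, 2), round(upr, 2))  # 8.47 31.53
```

The result matches the reported bounds, which suggests the two assumptions are consistent with how the interval was obtained.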

Conclusion: We reject the null hypothesis for game familiarity. The SUS scores given by students who self-identified as having medium familiarity with games (“I’ve played digital games a few times”) differ from the scores given by students who stated that they have a high familiarity with digital games (“I play digital games frequently/I’m a gamer.”). On average, gamers rated ATMSG 20 points higher than non-gamers (95% confidence interval: 8.47 to 31.53).

The same analysis of SUS scores was performed for LM-GM. Group A is not included in these scores, because it used only ATMSG.

LM-GM SUS scores, all groups (except A)
##   vars  n  mean    sd median trimmed  mad min  max range  skew kurtosis
## 1    1 18 60.69 14.57   62.5   60.31 9.27  35 92.5  57.5 -0.24    -0.04
##     se
## 1 3.43

LM-GM SUS scores by location
## group: Location2
##   vars  n  mean    sd median trimmed  mad min max range  skew kurtosis
## 1    1 14 59.82 11.87   62.5   60.62 5.56  35  75    40 -1.04     0.04
##     se
## 1 3.17
## -------------------------------------------------------- 
## group: Online
##   vars n  mean    sd median trimmed   mad min  max range skew kurtosis
## 1    1 4 63.75 24.02  63.75   63.75 25.95  35 92.5  57.5    0    -1.97
##      se
## 1 12.01

LM-GM SUS scores by familiarity with games
## group: Non-Gamer
##   vars n  mean    sd median trimmed  mad min  max range  skew kurtosis
## 1    1 9 59.17 10.53   62.5   59.17 3.71  35 72.5  37.5 -1.09     0.39
##     se
## 1 3.51
## -------------------------------------------------------- 
## group: Gamer
##   vars n  mean   sd median trimmed   mad min  max range  skew kurtosis  se
## 1    1 9 62.22 18.3     65   62.22 11.12  35 92.5  57.5 -0.18    -1.05 6.1

LM-GM SUS scores by familiarity with serious games
## group: Non-expert
##   vars  n  mean    sd median trimmed  mad min max range  skew kurtosis
## 1    1 14 59.82 11.87   62.5   60.62 5.56  35  75    40 -1.04     0.04
##     se
## 1 3.17
## -------------------------------------------------------- 
## group: SGExpert
##   vars n  mean    sd median trimmed   mad min  max range skew kurtosis
## 1    1 4 63.75 24.02  63.75   63.75 25.95  35 92.5  57.5    0    -1.97
##      se
## 1 12.01

We also performed an ANOVA on the LM-GM SUS scores to identify any differences due to the conditions (game familiarity, game played, serious games familiarity).

H0 = The SUS scores are the same, independently of the game played and the participants’ familiarity with games or serious games.
H1 = The SUS scores are NOT the same, considering the game played and the participants’ familiarity with games or serious games.

## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  1    1.44   0.25
##       16

	num Df	den Df	MSE	F	ges	Pr(>F)
gamefam	1	14	252.6	0.0186	0.0013	0.8936
sgsfam	1	14	252.6	0.1162	0.0082	0.7383
game	1	14	252.6	0.0606	0.0043	0.8091

## Tables of means
## Grand mean
##       
## 60.69 
## 
##  gamefam 
##     Non-Gamer Gamer
##         59.17 62.22
## rep      9.00  9.00
## 
##  sgsfam 
##     Non-expert SGExpert
##          60.26    62.22
## rep      14.00     4.00
## 
##  game 
##     Senior PM Game Vikings
##              60.11   61.86
## rep          12.00    6.00

Conclusion: we do not reject H0. For LM-GM, we found no significant difference in SUS scores between the participant subgroups.

Are the SUS scores from ATMSG and LM-GM significantly different?

Here we use scores from participants who evaluated both models. We also check if there is any difference between gamers and non-gamers (game familiarity) and SG experts and non-experts.

Our null hypotheses:

H01 = Main effect “game familiarity” is not significant in the resulting SUS scores.
H02 = Main effect “model used” is not significant in the resulting SUS scores.
H03 = Interaction effect between “game familiarity” and “model used” is not significant.

H03 would also be of interest, but it cannot be tested with these data, since this is an observational study with unbalanced data ¹.

First, we have a look at the box plots and the interaction plots of the data.

The interaction plot above shows the different scores that participants of different familiarity with games gave to each of the models.

We then performed an ANOVA to analyze any differences between the conditions, with participants divided by familiarity with games: Non-Gamer (scores 1-3) and Gamer (scores 4-5).

Our variables:

Between-subjects factor: familiarity with games, with two levels: Non-Gamer (scores 1-3) and Gamer (scores 4-5)
Within-subjects factor: model, with two levels (ATMSG and LM-GM)
Dependent variable: SUS Score
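
The between-subjects factor is a simple recoding of the self-reported familiarity score. A minimal sketch of that recoding in Python is shown here; the function name is hypothetical, and we assume the familiarity score is an integer from 1 to 5, as the score ranges in the text imply.

```python
def familiarity_group(score: int) -> str:
    """Map a 1-5 self-reported familiarity score to the two-level factor."""
    if not 1 <= score <= 5:
        raise ValueError("familiarity score must be between 1 and 5")
    # Scores 1-3 form the Non-Gamer level, scores 4-5 the Gamer level.
    return "Non-Gamer" if score <= 3 else "Gamer"


groups = [familiarity_group(s) for s in (1, 3, 4, 5)]
print(groups)  # ['Non-Gamer', 'Non-Gamer', 'Gamer', 'Gamer']
```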

##   model   gamefam sus_score.mean sus_score.length
## 1 ATMSG Non-Gamer          42.81                8
## 2 ATMSG     Gamer          74.38                8
## 3  LMGM Non-Gamer          57.50                8
## 4  LMGM     Gamer          65.62                8

##   model sus_score.mean sus_score.length
## 1 ATMSG          58.59               16
## 2  LMGM          61.56               16

##     gamefam sus_score.mean sus_score.length
## 1 Non-Gamer          50.16               16
## 2     Gamer          70.00               16

## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  1       0   0.96
##       30

Type II ANOVA

	num Df	den Df	MSE	F	ges	Pr(>F)
gamefam	1	14	211.8	14.8733	0.3183	0.0017
model	1	14	191.7	0.3678	0.0071	0.5539
gamefam:model	1	14	191.7	5.7306	0.1110	0.0312

Interpretations:

The plots suggest a possible interaction between the model used and the participant’s familiarity with games. In other words, ATMSG was rated as much less usable by non-gamers and as more usable by gamers, while the evaluation of LM-GM depends less on the user’s familiarity with games. However, as stated before, this is an observational study, so this interaction would have to be tested again with new data.
There is no effect of the model used in the within-subject measurement (F(1, 14) = 0.37, p = 0.55). This means that participants did not evaluate the two models in a significantly different way (those who thought the first model was hard thought the second one was hard as well).
There is an effect of familiarity with games (F(1, 14) = 14.87, p = 0.0017), meaning that the results were, in general, different for those who self-identified as gamers and those who self-identified as non-gamers.
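
The size of the suggested interaction can be read directly from the four cell means reported earlier in this section, as a difference of differences. The Python sketch below is purely descriptive: it restates the reported means and says nothing about significance. The dictionary layout and function name are ours.

```python
# Cell means copied from the model-by-familiarity summary table (8 scores per cell).
cell_means = {
    ("ATMSG", "Non-Gamer"): 42.81,
    ("ATMSG", "Gamer"): 74.38,
    ("LMGM", "Non-Gamer"): 57.50,
    ("LMGM", "Gamer"): 65.62,
}


def gamer_gap(model: str) -> float:
    """Gamer mean minus Non-Gamer mean for one model."""
    return cell_means[(model, "Gamer")] - cell_means[(model, "Non-Gamer")]


gap_atmsg = gamer_gap("ATMSG")
gap_lmgm = gamer_gap("LMGM")
# The interaction contrast is the difference between the two gaps.
interaction_contrast = gap_atmsg - gap_lmgm

print(round(gap_atmsg, 2), round(gap_lmgm, 2), round(interaction_contrast, 2))
# 31.57 8.12 23.45
```

The gap between gamers and non-gamers is about 31.6 points for ATMSG but only about 8.1 points for LM-GM, which is the pattern described in the first interpretation.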

We collected the participants’ comments from the questionnaires and coded the answers from all groups. The table below shows the number of participants in each group, split by familiarity with games.

Number of participants in Qualitative data collection, by game familiarity
	Non-Gamer	Gamer	Sum
A	1	6	7
B	3	3	6
C	6	2	8
D	0	2	2
E	0	2	2
Sum	10	15	25

The table below shows how many participants evaluated each model.

Number of participants in each location who evaluated each model
	Location1	Location2	Online	Sum
ATMSG	13	13	4	30
LMGM	0	14	4	18

The table below shows the frequency with which the comments were made (for all three studies), split by familiarity with games. Repeated comments made by the same participant were dropped.

Frequency of comments, all participants, by game familiarity
	Non-Gamer	Gamer	Sum
ATMSGHelpful	6	7	13
ATMSGMoreDetailed	7	6	13
LMGMHelpful	4	6	10
LMGMSimpler	6	1	7
ATMSGNeedsSimplifying	5	1	6
ATMSGNeedsExplanation	2	2	4
ExplanationsNotGood	1	3	4
ATMSGFillsObjective	0	3	3
ATMSGHadSampleAnalysis	1	2	3
ATMSGHard	1	2	3
LMGMEasier	1	2	3
LMGMMoreFocused	1	2	3
LMGMNeedsExample	3	0	3
ATMSGBetterDiagram	0	2	2
ATMSGBetterUnderstanding	1	1	2
ATMSGHigherLearningCurve	1	1	2
LMGMHard	0	2	2
LMGMHardGameMap	0	2	2
LMGMInvitesMoreThinking	1	1	2
LMGMLowerLearningCurve	1	1	2
ATMSGBetterTaxonomy	0	1	1
ATMSGClearer	1	0	1
ATMSGEasy	0	1	1
ATMSGHasTargetGroup	0	1	1
ATMSGNotSoHelpful	0	1	1
ATMSGRepetitive	1	0	1
ATMSGTaxonomyNeedsRevision	0	1	1
ATMSGTooDetailed	0	1	1
BothModelsSimilar	1	0	1
DifficultGame	1	0	1
LMGMClearer	1	0	1
LMGMMoreGraphical	0	1	1
LMGMMoreSuperficial	0	1	1
LMGMNeedsExplanation	0	1	1
LMGMNeedsMoreDetails	1	0	1
LMGMNotSoHelpful	0	1	1
LMGMNotUnderstood	0	1	1
LMGMSomewhatHelpful	1	0	1
LMGMTaxonomyNeedsRevision	0	1	1
SecondModelNoAddedInsights	1	0	1

From the table above, and counting the number of participants who made comments, we reached the conclusions below. Multiple answers from the same participants were counted just once.

Number of participants who thought ATMSG is helpful = 15. This is the number of participants with at least one of the following codes: “ATMSGHelpful” and “ATMSGFillsObjective”, counting each participant only once.

[1] 15

Number of participants who thought LMGM is helpful = 13. This is the sum of the following codes: “LMGMHelpful”, “LMGMInvitesMoreThinking” and “LMGMSomewhatHelpful”, removing repeated answers by the same participant.

[1] 13
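
Counting participants rather than comments explains why a total can be smaller than the sum of the code frequencies in the table (for ATMSG, 13 + 3 = 16 coded comments but 15 participants). The Python sketch below illustrates the counting rule only. The participant ids and code assignments are hypothetical and are not taken from the study data; the function name is ours.

```python
# Hypothetical coded comments: participant id -> codes assigned to that
# participant's comments. Illustrative only, not the study data.
coded_comments = {
    "P01": ["ATMSGHelpful", "ATMSGFillsObjective"],
    "P02": ["ATMSGHelpful"],
    "P03": ["ATMSGFillsObjective", "LMGMHelpful"],
    "P04": ["LMGMSimpler"],
}


def participants_with_any(coded, codes):
    """Count participants who received at least one of the given codes."""
    wanted = set(codes)
    return sum(1 for assigned in coded.values() if wanted & set(assigned))


helpful_codes = ["ATMSGHelpful", "ATMSGFillsObjective"]
# Summing code frequencies counts P01 twice, once per code.
raw_sum = sum(
    assigned.count(code)
    for assigned in coded_comments.values()
    for code in helpful_codes
)
unique_participants = participants_with_any(coded_comments, helpful_codes)

print(raw_sum, unique_participants)  # 4 3
```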

To generate the table below, we used only participants who evaluated both models. This table also shows the frequency with which the comments were made, in descending order, split by familiarity with games.

Frequency of comments, participants who evaluated both models, by game familiarity
	Non-Gamer	Gamer	Sum
ATMSGMoreDetailed	7	6	13
ATMSGHelpful	5	4	9
LMGMHelpful	3	6	9
LMGMSimpler	6	1	7
ATMSGNeedsExplanation	2	2	4
ATMSGNeedsSimplifying	4	0	4
ExplanationsNotGood	1	3	4
ATMSGHadSampleAnalysis	1	2	3
ATMSGHard	1	2	3
LMGMEasier	1	2	3
LMGMMoreFocused	1	2	3
LMGMNeedsExample	3	0	3
ATMSGBetterDiagram	0	2	2
ATMSGBetterUnderstanding	1	1	2
ATMSGHigherLearningCurve	1	1	2
LMGMHard	0	2	2
LMGMHardGameMap	0	2	2
LMGMInvitesMoreThinking	1	1	2
LMGMLowerLearningCurve	1	1	2
ATMSGBetterTaxonomy	0	1	1
ATMSGClearer	1	0	1
ATMSGEasy	0	1	1
ATMSGFillsObjective	0	1	1
ATMSGHasTargetGroup	0	1	1
ATMSGRepetitive	1	0	1
ATMSGTooDetailed	0	1	1
BothModelsSimilar	1	0	1
DifficultGame	1	0	1
LMGMClearer	1	0	1
LMGMMoreGraphical	0	1	1
LMGMMoreSuperficial	0	1	1
LMGMNotSoHelpful	0	1	1
LMGMSomewhatHelpful	1	0	1
LMGMTaxonomyNeedsRevision	0	1	1
SecondModelNoAddedInsights	1	0	1

The system usability scale (SUS) is a simple, ten-item attitude Likert scale giving a global view of subjective assessments of usability. It was developed by John Brooke at Digital Equipment Corporation in the UK in 1986 as a tool to be used in usability engineering of electronic office systems (Brooke, 1996).

The SUS yields a single score on a scale of 0-100, obtained by converting all the individual measurements to a scale from 0 to 4 (subtracting the user responses from 5 in the even-numbered items, and subtracting 1 from the user’s response for the odd-numbered items), adding them up and multiplying the total by 2.5.
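
The scoring rule above can be written as a short function. This is a minimal sketch in Python, assuming responses are given in questionnaire order on a 1-5 scale; the function name and the input validation are ours, not part of the original analysis.

```python
def sus_score(responses):
    """Compute the SUS score (0-100) from ten responses on a 1-5 scale."""
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 responses")
    if any(not 1 <= r <= 5 for r in responses):
        raise ValueError("each response must be between 1 and 5")
    total = 0
    for item_number, response in enumerate(responses, start=1):
        if item_number % 2 == 1:
            total += response - 1   # odd-numbered items: response minus 1
        else:
            total += 5 - response   # even-numbered items: 5 minus response
    return total * 2.5


print(sus_score([3] * 10))                        # 50.0
print(sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]))  # 100.0
```

A participant who answers 3 to every item gets 50, and the most favorable possible answers (5 on odd items, 1 on even items) give 100.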

In this analysis, the following questions were used. Odd-numbered items are positive statements and even-numbered items are negative ones (same as in the original SUS); the original report marked them in green and red, respectively.

1. I think that I would like to use this model if/when I study games for learning
2. I found the model unnecessarily complex
3. I thought the model was easy to use
4. I think that I would need the support of an expert to be able to use this model
5. I found the various steps in this model were well integrated
6. I thought there was too much inconsistency in this model
7. I would imagine that most people would learn to use this model very quickly
8. I found the model very cumbersome to use
9. I felt very confident using the model
10. I needed to learn a lot of things before I could get going with this model

Note: Questions 4 and 10 indicate the learnability of the system/product/model.

[Figures: responses to each question of the SUS questionnaire. Panels: MarketPlace; Senior PM Game.]

Observation: Here we do not try to infer anything about the interactions. This is an observational study, not an experimental one. Consequently, “there is no guarantee that treatments have been randomly assigned to subjects and rarely any balance causing some treatment combinations to be under-represented. All of this makes assessing interaction in observational studies dangerous. Main effects are hard enough to assess in such studies; interactions are truly pushing the envelope.” See: http://www.unc.edu/courses/2010fall/ecol/563/001/docs/lectures/lecture1.htm#interactions

What to do then? From the same author: “when I analyze observational data I start with main effects and maybe tentatively examine a few interactions that have a theoretical basis.”

Since our data do not support it, we do not try to identify any interaction between the factors (“game familiarity” and “model used”); we examine only the main effects in each group. For this reason we use Type II ANOVA, which has more power than Type III SS analyses (see: https://stat.ethz.ch/pipermail/r-help/2010-March/230280.html and http://tolstoy.newcastle.edu.au/R/help/06/08/33607.html).