Abstract
Aim
Large language models (LLMs) are increasingly integrated into medical education; however, their performance on dermatology examinations in non-English contexts has not been extensively studied. This study aimed to evaluate the performance of six LLMs in terms of accuracy, error profile, and response time on the Turkish Dermatology Society (TDS) qualifying examination.
Materials and Methods
Two hundred publicly available multiple-choice questions from the TDS exam were submitted to six LLMs (ChatGPT-4, Gemini-2.0, Claude-3.7, Grok-3, DeepSeek-R1, Qwen-2.5). Each model was tested in Turkish and in English, under both batch and single-item prompt formats. Accuracy, error patterns, and response times were compared across these conditions.
Results
Claude-3.7 and Grok-3 performed best (~83-84% correct) with low variance, whereas Qwen-2.5 and DeepSeek-R1 had lower accuracy (~75%) with more simple errors. Across all models, switching from Turkish to English increased median accuracy by 19.5% (P = 0.028). In contrast, batch vs. single-item prompting showed no overall performance difference (P = 0.280). DeepSeek-R1 was markedly slower (≥ 10 minutes per question vs ~134 seconds for others, P < 0.001). All models achieved high accuracy on common conditions but struggled with nuanced cases and negatively phrased questions.
Conclusion
Current LLMs can answer standard dermatology certification questions with moderate to high accuracy, especially in English. However, they are still susceptible to linguistic traps, negation, and nuanced clinical distinctions. Before they can be routinely used for educational or clinical purposes, optimization for Turkish language input and complex reasoning is necessary.
INTRODUCTION
Artificial intelligence (AI) technologies serve as effective tools in medical education to support the acquisition of theoretical knowledge and improve the clinical skills of medical students and residents.1 The increasing use of AI applications in medical education has potential, especially in disciplines based on visual diagnoses, such as dermatology; this trend necessitates re-evaluating conventional assessment tools, such as specialty competency exams, with AI models.2 Studies on the performance of AI models in medical education exams reveal the potential, limitations, and room for improvement of this technology.3 Recent benchmark studies show that state-of-the-art large language models (LLMs) (e.g., GPT-4, Gemini Advanced, Claude) can exceed the 60% pass mark on United States Medical Licensing Examination Step 1-style items and perform at or near the resident level in ophthalmology and orthopedic vignette sets.3-5
Despite the progress made, two critical knowledge gaps remain. First, almost all validation studies have been conducted in English, while more than half of the world’s medical students study in other languages. LLMs show significant performance declines in low-resource languages due to issues such as imbalanced training data, cultural differences, and tokenization problems.6-8 Second, there has been limited research into domain-specific, high-stakes examinations that assess nuanced clinical reasoning rather than just general medical knowledge.
The Turkish Dermatology Society (TDS) qualifying examination combines essential knowledge with image-rich clinical scenarios and is primarily administered in Turkish. To our knowledge, no published study has yet assessed contemporary LLMs using this examination, nor has any compared their performance when identical items are presented in Turkish versus professionally translated English. Addressing this gap is crucial for two main reasons. First, educators need evidence before incorporating AI into residency training. Second, algorithm developers require detailed error profiles to optimize multilingual models and prevent hallucinations or unsafe recommendations.
In this study, we conducted a comparative performance analysis of six publicly accessible LLMs using multiple-choice questions from the TDS qualifying examination. To explore the effects of linguistic and structural variations on model performance, we tested each model under four prompting conditions that varied by input language (Turkish vs. English) and delivery format (batch vs. single-item). By systematically comparing accuracy, response latency, and error characteristics, we aimed to evaluate the dermatological knowledge base as well as the language adaptability of these models. We, therefore, benchmarked six contemporary LLMs on 200 standardized text-only TDS board items to quantify language-related and prompt-related performance shifts and to characterize error profiles relevant to clinical reasoning.
MATERIALS AND METHODS
Study Design
In this study, we conducted a prospective benchmark comparing the performance of six publicly available LLMs on the dermatology specialty examination (Table 1). All analyses were performed between 15 February and 1 March 2025 to minimize version drift. Each model was initialized in a fresh “clean” account session to avoid any carryover of prior context. No plug-ins or memory features were activated.
Question Bank
Two hundred multiple-choice questions were selected from the publicly available repository of the TDS qualifying examination. Items based on clinical photographs or histopathology images were excluded to keep the assessment entirely text-based, and questions assessing epidemiology, pathophysiology, clinical diagnosis, and treatment were included.
Translation
The initial English drafts of all 200 Turkish board examination items were created using DeepL Pro (v3.5). Subsequently, a senior dermatology resident proficient in academic English (N.Ç.) and a professional dermatologist with a high level of proficiency in academic and clinical English (A.U.A.) reviewed the machine translations. Together, they reached a consensus, correcting any inaccuracies in medical terminology and addressing cultural nuances.
Prompting Conditions
Batch conditions involved multiple questions uploaded simultaneously, whereas single-item conditions involved submitting questions one at a time. During batch uploading, four separate Word files were uploaded one by one (batch Turkish 2015, batch Turkish 2017, batch English 2015, batch English 2017). In single-item (sequential) prompting, each of the 400 questions (200 items in each language) was submitted individually.
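For illustration, a minimal Python sketch of the two delivery formats is shown below. It is not the study's actual workflow (the questions were uploaded as Word files through each model's web interface), and the question strings are hypothetical placeholders.

```python
# Illustrative only: how batch vs. single-item (sequential) prompts can be structured.
questions_tr_2015 = [
    "1. <Turkish board item text> A) ... B) ... C) ... D) ... E) ...",  # placeholder
    "2. <Turkish board item text> A) ... B) ... C) ... D) ... E) ...",  # placeholder
]

# Batch format: all items of one exam/language combination in a single prompt.
batch_prompt = (
    "Answer each multiple-choice question below with a single option letter.\n\n"
    + "\n\n".join(questions_tr_2015)
)

# Single-item format: one prompt per question, submitted sequentially.
single_item_prompts = [
    "Answer the following multiple-choice question with a single option letter.\n\n" + q
    for q in questions_tr_2015
]
```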
Outcome Measures
For each method, the response times and accuracy rates of the models were analyzed. The language factor was examined by averaging the batch and single-item prompting results. The official answer key was used as a reference: each correct answer was scored 1 point and each incorrect answer 0 points, and the total number of correct answers and the success percentage of each model were calculated. Correct answer rates were reported separately for each model and method, and comparisons were made between models and methods. In addition, questions that all models answered incorrectly, questions that only one model answered correctly (superior performance), and questions that only one model answered incorrectly (simple error) were analyzed. Across all four methods, any question answered incorrectly by at least five of the six models was defined as a “difficult question”. All 200 questions were also categorized into six content domains: (1) common dermatoses and first-line management, (2) clinical case vignettes, (3) rare syndromes and eponyms, (4) disease sub-typing, (5) negatively worded stems, and (6) other. Accuracy was subsequently assessed for each category to enable a category-based performance analysis. In the batch Turkish method, the response times of the models were determined using a stopwatch.
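As an illustration of how these outcome measures can be derived, the following Python sketch scores a hypothetical answer matrix against an answer key and flags difficult questions, superior performances, and simple errors. It is not the study's actual pipeline (scoring used the official key and analyses were run in SPSS), and all values shown are placeholders.

```python
import pandas as pd

# Hypothetical data: rows = questions, columns = models, values = chosen options.
answers = pd.DataFrame({
    "ChatGPT-4":   ["A", "C", "B"],
    "Gemini-2.0":  ["A", "D", "B"],
    "Claude-3.7":  ["A", "C", "B"],
    "Grok-3":      ["A", "C", "B"],
    "DeepSeek-R1": ["B", "C", "B"],
    "Qwen-2.5":    ["A", "C", "D"],
}, index=["Q1", "Q2", "Q3"])
key = pd.Series(["A", "C", "B"], index=answers.index)   # official answer key

correct = answers.eq(key, axis=0)          # 1 point per correct answer, 0 otherwise
accuracy_pct = correct.mean() * 100        # success percentage per model
wrong_per_question = (~correct).sum(axis=1)

difficult = answers.index[wrong_per_question >= 5]    # >= 5 of 6 models wrong
superior  = answers.index[correct.sum(axis=1) == 1]   # only one model correct
simple    = answers.index[wrong_per_question == 1]    # only one model wrong ("simple error")

print(accuracy_pct.round(2))
print("Difficult:", list(difficult), "Superior:", list(superior), "Simple:", list(simple))
```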
Statistical Analysis
All statistical analyses were performed using IBM SPSS Statistics v26.0 (IBM Corp., Armonk, NY). The distribution of continuous variables was examined using the Shapiro-Wilk test, and non-parametric tests were preferred when the normality assumption was not met. The significance level was set at P < 0.05 for all tests.
Language effect: The average performance of the models in Turkish (batch Turkish + single-item Turkish) and English (batch English + single-item English) formats was compared using the Wilcoxon signed-rank test for paired data.
Method effect: The effect of the batch and single-item methods in each language group was analyzed using the Wilcoxon signed-rank test for paired data.
Response time analysis: The average response time of the DeepSeek model was compared with the response times of other models using the Mann-Whitney U test due to the non-normal distribution of the data. The Kruskal-Wallis test was used to compare models other than DeepSeek.
Difficult questions and word count: The average word count of questions answered incorrectly by all models or correctly by only one model was compared with that of the remaining questions using the Mann-Whitney U test.
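A minimal SciPy sketch of the main comparisons is given below for illustration; the study's analyses were run in SPSS v26, and all accuracy and timing values here are placeholder numbers, not the study data.

```python
import numpy as np
from scipy import stats

# Placeholder per-model accuracies (%), one value per model (n = 6).
# Turkish = mean of batch + single-item Turkish; English = mean of batch + single-item English.
turkish = np.array([78.0, 74.5, 84.0, 83.0, 75.5, 74.3])
english = np.array([88.5, 85.0, 90.0, 89.5, 86.0, 87.0])

# Normality check of paired differences (Shapiro-Wilk), then the language effect
# via the Wilcoxon signed-rank test on paired Turkish vs. English scores.
shapiro_p = stats.shapiro(english - turkish).pvalue
language_effect = stats.wilcoxon(turkish, english)

# Response time: DeepSeek vs. the pooled other models, Mann-Whitney U test
# (placeholder timings in seconds).
deepseek_times = np.array([705.0, 730.0, 745.0])
other_times = np.array([95.0, 120.0, 140.0, 150.0, 165.0])
time_effect = stats.mannwhitneyu(deepseek_times, other_times, alternative="two-sided")

# A Kruskal-Wallis comparison across the five non-DeepSeek models would take one
# array of times per model: stats.kruskal(t_chatgpt, t_gemini, t_claude, t_grok, t_qwen)

print(shapiro_p, language_effect, time_effect)
```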
Ethical Considerations
The study analyzed publicly available examination material and generated AI responses; it involved no human participants or patient data and therefore did not require Institutional Review Board approval. All procedures conformed to the Declaration of Helsinki principles for non-interventional research.
Data Availability
Full prompt templates, anonymized model outputs, and analysis scripts are available from the corresponding author upon reasonable request.
RESULTS
Overall Performance Evaluation by Model
When analyzing the overall number of correct answers and average performance of the models using the “batch Turkish” and “single-item Turkish” methods, the Claude (84.0%±0.00) and Grok-3 (83.0%±3.83) models demonstrated the most successful results, showing the highest average number of correct answers and low standard deviations. In contrast, the Qwen 2.5 (74.25%±6.13) and DeepSeek (75.5%±7.05) models displayed the lowest performance and highest inconsistency, indicated by lower average correct-answer counts and, particularly for DeepSeek, higher standard deviation values (Figure 1).
However, some models demonstrated superior performance in specific domains. For example, among the 200 questions, there were items that only Qwen 2.5 answered correctly while all other models failed, suggesting areas where it outperformed its peers.
Impact of Language Factor
English prompts conferred a significant performance advantage over Turkish prompts across models (P = 0.028), and a noticeable improvement was observed in every model when switching to English questions (Figure 2). The ChatGPT and Qwen 2.5 models benefited most from the language change.
Effect of Method Factor
The DeepSeek model showed the largest gain from single-item prompting, with a significant improvement in the Turkish single-item condition compared with the Turkish batch condition. In contrast, single-item prompting led to a performance decline in the ChatGPT and Gemini models. Overall, no statistically significant difference was observed between the batch and single-item methods (P > 0.05); however, per-model analyses revealed notable individual differences beyond this general finding (Figure 3).
Simple Errors and Inconsistencies
In the analysis of “simple errors,” the DeepSeek and Qwen 2.5 models committed the most, with 27 and 22 errors, respectively (Figure 4a). When comparing prompting conditions, the Turkish batch condition yielded the most simple errors, whereas the English single-item condition had the fewest (Figure 4b).
Word Count Analysis of Difficult Questions
The difficult questions had a significantly lower average word count than other questions (9.71±7.08 vs. 11.9±9.89 words; P = 0.019), suggesting that shorter questions tended to pose more of a challenge (Figure 5).
Category-Based Performance Evaluation
When evaluating the full set of results across 200 questions, six language models, and four prompting methods, the models achieved over 90% accuracy in most categories, including common disease presentations with their first-line diagnoses and treatments, clinical case scenarios, and questions involving specific medical terminology or rare syndromes. In contrast, the lowest accuracy rates were observed in distinguishing clinical subtypes (57.14%) and in handling negatively phrased questions (83.33%) (Table 2).
Response Time: Performance or Speed?
DeepSeek demonstrated a significantly longer response time, consistently exceeding 10 minutes per item (approximately 720 seconds), while the other five models responded within a comparable time frame (mean: 134±73.85 seconds), with no statistically significant differences observed among them (P > 0.05) (Table 3).
DISCUSSION
This study presents the first direct comparison of six contemporary LLMs using the TDS qualifying examination. Three key findings stand out. First, the language used was the primary factor influencing performance: switching from Turkish to English improved the median accuracy significantly, benefiting all tested LLMs. Second, the presentation method had minimal overall impact; however, DeepSeek R1 performed significantly better with single-item prompts. Third, even the best-performing models faced challenges with nuanced clinical differentiation, negatively worded questions, and time efficiency. This highlights ongoing limitations in contextual reasoning and practical usability.
The results of the overall performance evaluation showed that LLMs have gained significant competence in interpreting and applying medical knowledge. The consistent performance of the Claude and Grok-3 models suggests that these models have a more balanced information processing capacity. Claude has shown successful performance in studies evaluating LLMs.9 In radiology board exams, Claude outperformed Bard and Gemini Pro by achieving 62% accuracy.9 In NBME exams, it again performed similarly to GPT-3.5 and Bard with a score of 84.7%.10 Grok-3, on the other hand, is still under development, and while it shows potential in interaction skills and mathematical reasoning, its performance in medical exams has not yet been extensively evaluated.11 As both models continue to evolve, their role in medical education and examinations will likely expand, and they will need to be regularly re-evaluated and refined to ensure their reliability and relevance in the field.11 On the other hand, Qwen 2.5 and DeepSeek’s fluctuating performance and susceptibility to simple errors reflect differences in model architectures and training strategies.12, 13
It was observed that the language factor had a significant effect on the AI models. The significantly higher performance of the models on English questions compared to Turkish questions reveals the dominance of English data sets in the training processes of LLMs.14, 15 This aligns with reports in the literature that LLMs perform worse in languages other than English.6, 7 LLMs are more successful in English in part because of the vast amount of English digital content and the concentration of AI research on English, owing to that language’s global dominance.15-17 Non-English languages present unique challenges (e.g., cultural nuances, complex linguistics) that require specialized AI approaches. A lack of standardized resources and tools in these languages can lead to issues like cultural hallucinations, making it more difficult to develop effective AI models for them.8 Despite English’s privileged position in AI development, there is growing recognition of the need to improve LLM performance in other languages. Initiatives like cross-language training and multilingual model development are working to create more inclusive, culturally sensitive AI systems.14, 18
Although there was no statistically significant difference between batch and single-item prompting in the analyses regarding the method factor, model-based differences are noteworthy. The performance improvement of the DeepSeek model in the Turkish single-item prompting method suggests that some models are more sensitive to sequential processing.19 AI systems designed for sequential processing use character recognition, on-the-fly verification, and error correction mechanisms to ensure accuracy during real-time data entry.19, 20 These approaches provide high accuracy and user efficiency by reducing errors in data entry.19 This finding suggests that the prompt dependency and context management capabilities of LLMs may vary from model to model.21 Unlike DeepSeek, models like ChatGPT and Gemini experienced a decline in performance under the same conditions, underscoring the importance of tailoring LLM deployment strategies to model-specific strengths and intended use cases.
In particular, category-based analyses clearly revealed the strengths and weaknesses of AI models. High success rates in basic medical knowledge and common conditions confirm the potential of these models to provide knowledge-based support in general medical practice.22 However, high error rates in distinguishing clinical subtypes of diseases and negatively worded question stems suggest that AI models still have limitations in analyzing context in depth and overcoming linguistic pitfalls.23-25
This finding is in line with the known difficulties of negation and contextual disambiguation in natural language processing systems.26, 27 Moreover, the questions that stumped all models were notably short, suggesting that LLMs make more errors on context-free, brief, and ambiguous statements. As previous studies also suggest, LLMs are heavily context-driven, and their performance degrades when information is lacking.28
In terms of response times, the trade-off between speed and accuracy must also be considered. For AI systems used especially in clinical applications, not only accuracy but also speed in practical use is critical.29
From an educational perspective, LLMs already demonstrate near-expert-level performance on routine factual dermatology content and could be useful as supplementary tutoring tools, particularly when prompts are provided in English. However, their susceptibility to short, context-poor questions and semantic traps presents a risk if they are used uncritically for high-stakes self-assessment. Moreover, DeepSeek R1’s extremely long response time (over 10 minutes per question) makes real-time feedback impractical.
Study Limitations
Several limitations should be taken into account when interpreting our findings. First, our analysis was limited to 200 publicly available, text-only multiple-choice items. This excluded image-based and open-ended questions, which are essential in dermatology practice; therefore, the models’ performance on multimodal or free-text tasks remains unassessed. Second, due to the rapid development of LLM architectures and public interfaces, our results reflect the model versions as of February 2025 and may not apply to future iterations. Third, all assessment items were derived from a single national board examination, which restricts the external validity to other dermatology curricula or broader medical fields. Lastly, we used a binary scoring approach, giving credit only for fully correct responses. This approach may underestimate partial reasoning or nuanced understanding that could be better evaluated using a rubric-based scoring system. Addressing these limitations will require larger, multimodal test sets, ongoing reassessment of evolving model versions, and the integration of more detailed qualitative scoring frameworks.
CONCLUSION
In conclusion, this study demonstrated the potential and current limitations of AI models in medical education and assessment processes from a multidimensional perspective. Our findings indicate that, while AI systems can be valuable tools for medical decision support, they still require improvement in areas such as linguistic diversity, contextual analysis, and use-case optimization. Future research should focus on developing multilingual and culturally sensitive models, enhancing context management capabilities, and optimizing the speed-accuracy balance, particularly in clinical applications.