Abstract
Aim
This study aimed to evaluate and compare the readability and quality of information in responses generated by artificial intelligence (AI) models to patients’ frequently asked questions about systemic isotretinoin, a medication commonly prescribed in dermatology.
Materials and Methods
Thirty-four frequently asked questions from patients using isotretinoin were prepared by a team of dermatology specialists. These questions were posed to three AI-based text-generation tools (ChatGPT, Gemini 2.0, and Copilot), and the responses were analyzed. The resulting texts were compared in terms of readability levels [Flesch Reading Ease score (FRES), Flesch-Kincaid Grade Level (FKGL), Simple Measure of Gobbledygook (SMOG), Gunning Fog index (GFOG), Coleman-Liau index (CLI), and Automated Readability index (ARI)], sentence lengths, and content quality, which was evaluated by dermatologists.
Results
None of the AI models achieved the optimal readability threshold (FRES ≥ 60). Readability metrics differed significantly among models. Gemini produced responses that were significantly less readable and more complex than those produced by ChatGPT and Copilot across all readability indices, including FRES, FKGL, SMOG, GFOG, CLI, and ARI; post-hoc analyses confirmed differences between Gemini and the other models. Sentence counts also differed significantly, with Gemini generating longer responses than Copilot. In contrast, Likert-based quality scores and response appropriateness were comparable across models, with no statistically significant differences observed.
Conclusion
This study demonstrates that AI models produce academic responses that are difficult for those unfamiliar with medical terminology to understand, and can generate outputs with variable readability in health-related content. These findings highlight the need for careful evaluation of AI-based content for use in healthcare.
INTRODUCTION
Chatbots are computer programs that use various algorithms to understand and respond to speech and text in a human-like manner; large language models are used to simulate human conversation, and many companies are using this technology to develop their own chatbots.1, 2 A major milestone in this field was the introduction of ChatGPT in 2022. These chatbots have various applications, including serving as dialogue systems, providing language translation, and generating content. Alongside ChatGPT, other chatbots utilizing large language models have been launched.1, 3 Microsoft Copilot is another chatbot with different functionalities; unlike ChatGPT, it can search the internet and update its knowledge base.4 Google Gemini, developed collaboratively by teams across Google, integrates multiple types of information (text, code, audio, images, and video) to serve as a writing, planning, and learning assistant.1 Besides these, there are currently over 5,100 chatbots, approximately 100 of which are used in healthcare applications.5 Chatbots offer advantages such as improving diagnostic accuracy, supporting personalized treatment plans, facilitating the translation of the latest medical literature into clinical practice, and contributing to patient education, thereby enhancing healthcare services.5, 6
Many dermatological diseases have a chronic course and require long-term follow-up and treatment. However, patients do not always have easy access to a dermatologist. Therefore, patients increasingly turn to online platforms, including social media and artificial intelligence (AI) chatbots, to obtain information about their diseases and the medications they use. Although these assistive tools are well-designed, concerns remain regarding the accuracy, currency, and reliability of the medical information they provide, and regarding the transparency with which user data are handled. To address these concerns, various studies on AI chatbots have been conducted in different specialties.6–9 In this study, we investigated the readability and reliability of chatbot responses to the most frequently asked questions about isotretinoin, a medication commonly prescribed in dermatology outpatient clinics.
MATERIALS AND METHODS
Chatbots
Chatbots were selected on the basis of factors such as user fees and login requirements, as well as the models examined in previous similar studies. The chatbots selected for the study and their versions were ChatGPT 4, Google Gemini 2.0, and Microsoft 365 Copilot; the readability and quality of the responses from these chatbots were evaluated. In the remainder of this paper, these chatbots are referred to as ChatGPT, Gemini, and Copilot. The chatbots were accessed using a personal computer (MacBook Air M2) connected to a home broadband connection. Data were collected between July 1 and July 5, 2025.
Questions
The questions about isotretinoin, one of the most frequently prescribed active ingredients in dermatologic practice, were prepared by expert dermatologists, and 34 of them were selected. Each question was asked individually to the chatbots, and the responses were recorded in separate documents for review and analysis. All responses were examined jointly by two dermatologists specializing in the field; both the 5-point Likert scale ratings and the appropriateness classifications (appropriate, incomplete, and inappropriate) were determined by discussion and assigned by consensus. Since the ratings were not performed independently, interrater reliability analysis was not applicable. The British Association of Dermatologists’ guidelines were taken as the criterion for response accuracy (Supplementary).10 The Likert scale developed by Kumari et al.,11 shown in Table 1, was used to evaluate the accuracy of responses. Furthermore, chatbot responses were classified into three categories: “appropriate”, “incomplete”, and “inappropriate”. An appropriate response was defined as accurate, complete, and consistent with what an expert would advise a patient in the same situation; an inappropriate response was defined as inconsistent with expert opinion or containing incorrect information; and an incomplete response was defined as correct and relevant but lacking sufficient detail. The chatbot sessions were reset before each question was posed.
Readability Analysis
After the accuracy of the responses had been verified by the two dermatologists, a readability analysis was performed for each response. The following measures were used: Flesch Reading Ease Score (FRES), Flesch-Kincaid Grade Level (FKGL), Simple Measure of Gobbledygook (SMOG), Gunning Fog Index (GFOG), Coleman-Liau Index (CLI), and Automated Readability Index (ARI). In addition, sentence lengths for each response were compared.
Flesch Reading Ease Score: A measure of text readability calculated from the average number of words per sentence and the average number of syllables per word. Scores range from 0 to 100, with higher scores indicating easier readability. Scores between 70 and 80 correspond to approximately an eighth-grade reading level.12
Flesch-Kincaid Grade Level: A readability measure that determines reading difficulty according to the United States school grade level. Scores range from 0 to 18, with higher scores indicating greater difficulty. Scores above 12 indicate that the text is written in an academic style.13
Simple Measure of Gobbledygook: Designed to assess the appropriateness of a text for the reader’s age. Ten sentences are sampled from the beginning, middle, and end of the text, and the words with three or more syllables across these 30 sentences are counted. The count is then converted to a corresponding reading-grade level.14
Gunning Fog Index: This metric determines how difficult a text is to read based on sentence and word length. Scores range from 1 to 18, with each score corresponding to the number of years of education needed to understand the text. To facilitate comprehension by the general public, an average score of 8 is recommended. Scores of 17 and above are considered primarily understandable to individuals with postgraduate education.15
Coleman-Liau Index: CLI is a readability assessment that measures how challenging a text is and helps determine its appropriate grade level. It is commonly used in the USA and several other countries. Unlike many other grade-level estimators, CLI relies on the number of characters per word rather than on the number of syllables per word.16
Automated Readability Index: This measure estimates the number of years of education required to understand a text on first reading. It takes into account the average number of characters per word and the average number of words per sentence. ARI uses a specific formula to determine the grade level of the text.17
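For readers who wish to see how such indices are computed in practice, the short Python sketch below implements the six measures using commonly cited forms of their formulas (the FRES formula itself is quoted in the Discussion). The vowel-group syllable counter, the helper names, and the sample sentence are simplifying assumptions for illustration only; dedicated readability tools use more refined syllable and sentence detection, so their scores may differ slightly.

```python
import math
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: count groups of consecutive vowels.
    Dedicated readability tools use dictionaries or finer heuristics."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def readability_indices(text: str) -> dict:
    """Compute the six indices from commonly cited forms of their formulas."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    n_sent, n_words = max(1, len(sentences)), max(1, len(words))
    n_chars = sum(len(w) for w in words)
    n_syll = sum(count_syllables(w) for w in words)
    n_poly = sum(1 for w in words if count_syllables(w) >= 3)  # words with 3+ syllables

    wps = n_words / n_sent   # average words per sentence
    spw = n_syll / n_words   # average syllables per word
    cpw = n_chars / n_words  # average characters per word

    return {
        "FRES": 206.835 - 1.015 * wps - 84.6 * spw,
        "FKGL": 0.39 * wps + 11.8 * spw - 15.59,
        "SMOG": 1.043 * math.sqrt(n_poly * 30 / n_sent) + 3.1291,
        "GFOG": 0.4 * (wps + 100 * n_poly / n_words),
        "CLI": 0.0588 * (100 * cpw) - 0.296 * (100 * n_sent / n_words) - 15.8,
        "ARI": 4.71 * cpw + 0.5 * wps - 21.43,
    }


# Example: a dense, polysyllabic sentence of the kind chatbots often produce.
sample = ("Isotretinoin therapy necessitates comprehensive contraception counselling, "
          "periodic laboratory surveillance, and meticulous mucocutaneous moisturisation "
          "throughout the treatment period.")
print(readability_indices(sample))
```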
Statistical Analysis
Data were analyzed using SPSS for Windows, version 26.0. Descriptive statistics, including mean, standard deviation, median, and percentage distributions, were used to summarize the data. The Kruskal-Wallis test was applied to compare continuous and ordinal variables, such as readability scores and Likert-scale ratings, across the three AI models, and Dunn-Bonferroni post-hoc tests were used for pairwise comparisons. For categorical variables, such as response appropriateness, the chi-square test was employed. A P-value of less than 0.05 was considered statistically significant.
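As a minimal illustration of the comparisons described above, the sketch below applies the Kruskal-Wallis and chi-square tests with SciPy. The per-question scores and appropriateness counts shown are hypothetical placeholders, not study data, and the Dunn-Bonferroni post-hoc step reported in the Results would require either an additional package (e.g., scikit-posthocs) or manually Bonferroni-corrected pairwise tests.

```python
from scipy.stats import chi2_contingency, kruskal

# Hypothetical per-question FRES scores for the three chatbots
# (the actual study has 34 values per model).
fres_chatgpt = [42.1, 38.5, 45.0, 40.2, 36.7]
fres_gemini = [18.3, 22.7, 15.9, 20.4, 12.8]
fres_copilot = [44.8, 39.9, 47.2, 41.5, 38.1]

# Kruskal-Wallis test across the three independent groups
# (used here for readability scores; Likert ratings are handled the same way).
h_stat, p_value = kruskal(fres_chatgpt, fres_gemini, fres_copilot)
print(f"Kruskal-Wallis H = {h_stat:.2f}, P = {p_value:.3f}")

# Chi-square test for the appropriateness categories
# (rows: ChatGPT, Gemini, Copilot; columns: appropriate, incomplete, inappropriate).
contingency = [
    [28, 4, 2],
    [27, 5, 2],
    [29, 3, 2],
]
chi2, p_cat, dof, _ = chi2_contingency(contingency)
print(f"Chi-square = {chi2:.2f}, df = {dof}, P = {p_cat:.3f}")
```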
RESULTS
In our study, 34 questions regarding systemic isotretinoin use were asked of each of three AI chatbots. Although a FRES of ≥ 60 is required for optimal readability, none of the examined models reached this threshold. On the remaining five measures, for which higher scores indicate greater reading difficulty, all models likewise scored well beyond the acceptable thresholds, indicating generally low readability of the content (Table 1).
Readability scores differed significantly among the three AI models across all objective indices (FRES: P = 0.013; FKGL: P = 0.002; SMOG: P < 0.001; GFOG: P = 0.006; CLI: P = 0.003; ARI: P < 0.001). Dunn-Bonferroni post-hoc analyses indicated that these differences were consistently driven by Gemini. Compared with both ChatGPT and Copilot, Gemini produced responses that were significantly less readable, as reflected by lower FRES scores (vs. ChatGPT: P = 0.011; vs. Copilot: P = 0.010) and higher grade-level indices (FKGL: P = 0.002 for both comparisons; SMOG: P = 0.001 and P < 0.001; GFOG: P = 0.005 and P = 0.007; CLI: P = 0.002 and P = 0.006; ARI: P < 0.001 for both). No significant differences were observed between ChatGPT and Copilot on any readability metric (Table 1).
Sentence counts also differed significantly (P = 0.009); Gemini generated longer responses than Copilot (P = 0.003); other pairwise comparisons were not significant.
Likert scale ratings evaluating the quality of responses were similar across models (median = 4), with no statistically significant difference observed (P = 0.259). Similarly, the distribution of response appropriateness did not differ significantly among models (P = 0.701), and most responses were rated as appropriate (Table 1).
DISCUSSION
Today, increasing and unmet healthcare demands are leading individuals to seek information from alternative sources. Among these, online tools are the most frequently used due to their accessibility. Previous studies have shown that a significant proportion of patients turn to online resources for information regarding their diseases and treatments.18, 19
Our study provides insight into the performance and reliability of chatbot responses in the medical context. Chatbots are increasingly used across many fields, including medicine, and, given growing healthcare needs and various limitations, unmet demands are increasingly being addressed through such online tools. We found that the chatbots provided answers of varying length and readability to the same questions. Even though the questions and the reference responses were based on a publicly accessible online guideline, each chatbot generated content that differed in length and wording. None of the examined models reached the FRES threshold of ≥ 60 required for optimal readability.
When examining accuracy and appropriateness scores, we found that the three chatbots demonstrated similar performance. Similar to our findings, a previous study comparing ChatGPT and Google Bard on educational questions posed by patients with obstructive sleep apnea found the responses from both chatbots to be appropriate and accurate.20 However, other studies in dermatology, hematology, neurosurgery, lung cancer, and urology have shown differing accuracy rankings among ChatGPT, Gemini, and Copilot.21 These varying results may be related to differences in the algorithms used by the chatbots, in the training data, which can vary by country, and in the updates made to chatbots over time. Although accuracy rates differ across studies, a finding common to these studies and our own is that no chatbot achieved 100% accuracy.
Statistically significant differences in readability scores were most pronounced between ChatGPT and Gemini and between Gemini and Copilot. Across readability scales, Gemini consistently received significantly higher scores than the other two chatbots, indicating lower readability and increased response complexity. Examination of sentence lengths across all three AI chatbots revealed a significant difference between Gemini and Copilot. Gemini provided longer and more detailed answers with the highest number of sentences, while Copilot offered shorter answers with simpler sentence structures.
Across all readability indices (FRES, FKGL, GFOG, ARI, SMOG, and CLI), the responses generally corresponded to a university-level or higher reading difficulty. Negative FRES values, particularly pronounced in Gemini, were attributable to very long sentences and the use of multisyllabic terms, as reflected in the formula: FRES = 206.835−(1.015 × average words per sentence)−(84.6 × average syllables per word). When the sum of the subtracted terms exceeds 206.835, the score becomes negative, indicating exceptionally high textual complexity.12 These findings align with previous studies evaluating chatbot readability in lung cancer, radiology, urology, and chronic kidney disease contexts, all of which reported low readability levels.17, 21-26 However, a discipline-specific study in urology has reported different readability outcomes.27
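For illustration, a hypothetical response averaging 40 words per sentence and 2.0 syllables per word would yield FRES = 206.835−(1.015 × 40)−(84.6 × 2.0) = 206.835−40.6−169.2 ≈ −2.97, a negative score even though both averages are plausible for dense medical prose.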
Likert-scale ratings of response quality were similar across models, and no statistically significant differences were found. The quality of chatbot responses ranged from 81.8% to 87%; none achieved the perfect score of 5 and therefore cannot be considered 100% reliable. This finding is consistent with previous studies comparing AI chatbots.1,21-23,28-30 The quality and accuracy observed in our study suggest that chatbots may be useful for providing relatively accurate information about diseases. Consequently, they may provide valuable assistance to individuals concerning systemic isotretinoin, one of dermatology’s fundamental treatment options. Chatbots could educate patients about the use, side effects, and treatment process of systemic isotretinoin, and about when to seek professional medical support. Additionally, by answering simple questions, chatbots can support patients in managing their treatment, thereby reducing the workload on the healthcare system and enabling dermatologists to devote more time to complex and serious cases.
Study Limitations
This study has several limitations. First, only three chatbots were selected, excluding other online platforms accessible to patients. Second, although the accuracy of responses was evaluated jointly by two dermatologists, inter-rater agreement statistics were not reported; this could be addressed in future studies by including measures such as Cohen’s kappa. Third, the question list was limited and was prepared based on the most frequently asked questions in dermatology outpatient clinics, whereas real-life patient queries may be more diverse and multifaceted.
CONCLUSION
This study demonstrated that responses generated by chatbots about systemic isotretinoin, which is one of the most frequently prescribed treatments in dermatology, have readability levels corresponding to university education and above, making them relatively difficult to read. The highest response quality reached 87%, and no chatbot provided answers with 100% quality. Although the selected questions focused on a specific drug, the responses were based on publicly available online guidelines. Moreover, individuals of diverse ages and educational backgrounds who initiate systemic isotretinoin treatment in dermatology clinics may consult AI chatbots for information.
The high reading level of these responses could lead to misinterpretation of information, meaning that people seeking information through AI chatbots might be misled, potentially resulting in unnecessary anxiety and increased demand for consultations with doctors, thereby placing an excessive burden on the healthcare system. Conversely, when poorly understood content concerns matters requiring urgent intervention, necessary treatment may be delayed, negatively impacting patients’ health.
For healthcare professionals, content written at a higher reading level may be advantageous, as it can be more detailed and informative. In the future, it would be beneficial to program chatbots to generate responses tailored to different age groups and educational levels. This approach could make AI chatbots more accessible and effective for diverse populations. More comprehensive and large-scale studies are needed to explore this further.


