TY - GEN
T1 - Effects of Different Prompts on the Quality of GPT-4 Responses to Dementia Care Questions
AU - Li, Zhuochun
AU - Xie, Bo
AU - Hilsabeck, Robin
AU - Aguirre, Alyssa
AU - Zou, Ning
AU - Luo, Zhimeng
AU - He, Daqing
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Evidence suggests that different prompts lead large language models (LLMs) to generate responses of varying quality. Yet little is known about prompts' effects on response quality in healthcare domains. In this exploratory study, we address this gap, focusing on a specific healthcare domain: dementia caregiving. We first developed an innovative prompt template with three components: (1) system prompts (SPs) featuring 4 different roles; (2) an initialization prompt; and (3) task prompts (TPs) specifying different levels of detail, totaling 12 prompt combinations. Next, we selected 3 social media posts containing complicated, real-world questions about dementia caregivers' challenges in 3 areas: memory loss and confusion, aggression, and driving. We then entered these posts into GPT-4 with our 12 prompts to generate 12 responses per post, totaling 36 responses. We compared the word counts of the 36 responses to explore potential differences in response length. Two experienced dementia care clinicians on our team assessed response quality using a rating scale with 5 quality indicators: factual, interpretation, application, synthesis, and comprehensiveness (scoring range: 0-5; higher scores indicate higher quality). Both clinicians rated the responses from 3 to 5, with 75% agreement; consensus was reached through discussion. Overall, 44% of the responses (16/36) were rated 5; another 44% (16/36), 4; and the remaining 4 (11%), 3. We found no interaction effect of system and task prompts, nor a main effect of system prompts, on response length. Task prompts had a statistically significant effect on response length: F(2, 24) = 82.784, p < .001. Post hoc analysis showed that the significant difference was due to TP3, which led to significantly longer responses. There was no interaction or main effect of system and task prompts on response quality. 
Our clinicians' qualitative feedback provided further insight: (1) system prompts with different professional roles (neuropsychologist and social worker) did not lead to noticeable differences in response content (that is, there were no neuropsychology or social work versions of GPT-4 responses); and (2) TP3, while producing statistically longer responses, might not necessarily have produced clinically higher-quality responses: at times, the details contained in the lengthy responses seemed unnecessary from a clinical perspective. We discuss study limitations and future research directions.
AB - Evidence suggests that different prompts lead large language models (LLMs) to generate responses of varying quality. Yet little is known about prompts' effects on response quality in healthcare domains. In this exploratory study, we address this gap, focusing on a specific healthcare domain: dementia caregiving. We first developed an innovative prompt template with three components: (1) system prompts (SPs) featuring 4 different roles; (2) an initialization prompt; and (3) task prompts (TPs) specifying different levels of detail, totaling 12 prompt combinations. Next, we selected 3 social media posts containing complicated, real-world questions about dementia caregivers' challenges in 3 areas: memory loss and confusion, aggression, and driving. We then entered these posts into GPT-4 with our 12 prompts to generate 12 responses per post, totaling 36 responses. We compared the word counts of the 36 responses to explore potential differences in response length. Two experienced dementia care clinicians on our team assessed response quality using a rating scale with 5 quality indicators: factual, interpretation, application, synthesis, and comprehensiveness (scoring range: 0-5; higher scores indicate higher quality). Both clinicians rated the responses from 3 to 5, with 75% agreement; consensus was reached through discussion. Overall, 44% of the responses (16/36) were rated 5; another 44% (16/36), 4; and the remaining 4 (11%), 3. We found no interaction effect of system and task prompts, nor a main effect of system prompts, on response length. Task prompts had a statistically significant effect on response length: F(2, 24) = 82.784, p < .001. Post hoc analysis showed that the significant difference was due to TP3, which led to significantly longer responses. There was no interaction or main effect of system and task prompts on response quality. 
Our clinicians' qualitative feedback provided further insight: (1) system prompts with different professional roles (neuropsychologist and social worker) did not lead to noticeable differences in response content (that is, there were no neuropsychology or social work versions of GPT-4 responses); and (2) TP3, while producing statistically longer responses, might not necessarily have produced clinically higher-quality responses: at times, the details contained in the lengthy responses seemed unnecessary from a clinical perspective. We discuss study limitations and future research directions.
KW - dementia
KW - informal caregiving
KW - large language models
KW - prompt engineering
KW - social media
UR - https://www.scopus.com/pages/publications/85203682541
UR - https://www.scopus.com/pages/publications/85203682541#tab=citedBy
U2 - 10.1109/ICHI61247.2024.00059
DO - 10.1109/ICHI61247.2024.00059
M3 - Conference contribution
AN - SCOPUS:85203682541
T3 - Proceedings - 2024 IEEE 12th International Conference on Healthcare Informatics, ICHI 2024
SP - 412
EP - 417
BT - Proceedings - 2024 IEEE 12th International Conference on Healthcare Informatics, ICHI 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 12th IEEE International Conference on Healthcare Informatics, ICHI 2024
Y2 - 3 June 2024 through 6 June 2024
ER -