Thesis Defense - Computer Science: Austin McCutcheon

Please join the Computer Science Department for the upcoming thesis defense:
Presenter: Austin McCutcheon
Thesis Title: Large-Scale News Headline Quality Analysis: Clickbait Trends, Binary Classification, and AI-Generated Content
Abstract: Online news is characterized by massive volumes of content spanning a spectrum from high-quality professional journalism to low-quality articles. This thesis presents four empirical studies that analyze, classify, and evaluate news headlines of varying quality at scale.
The first two studies apply interrupted time series (ITS) analysis to examine associations between clickbait prevalence and major events. Analysis of 451 million headlines from worldwide news websites (2016-2023) revealed statistically significant associations for three of five events, each showing a slight pre-event decrease followed by a sustained post-event increase in clickbait levels. A complementary analysis of 7.4 million headlines from Canadian news websites (2017-2023) found similar patterns.
The third study benchmarks twelve machine learning and deep learning models for binary classification of perceived news quality on a balanced dataset of 57.5 million headlines labeled according to website-level expert consensus ratings. A CPU-based bagging classifier achieved 88.1% accuracy with stable performance across cross-validation folds, while a fine-tuned DistilBERT model achieved the highest accuracy at 90.3% but required substantially greater computational resources.
The fourth study evaluates fourteen accessible small language models (SLMs) for their willingness to generate fake news headlines when explicitly prompted, and tests whether the trained classifiers from the third study generalize to synthetic content. The models showed minimal resistance to generating false headlines, refusing requests less than 1% of the time. Both classifiers showed substantially reduced accuracy on AI-generated headlines (54-63% for DistilBERT, 35-48% for bagging), with systematic misclassification of AI-generated “high-quality” content as “low-quality,” indicating that classifiers trained on human-authored text do not generalize effectively to current AI-generated text.
This thesis contributes the application of ITS methodology to clickbait analysis at web scale, comprehensive benchmarking of model architectures for large-scale headline quality classification, and empirical evidence that quality classifiers trained on human-authored content exhibit reduced performance when applied to SLM-generated headlines.
Committee Members:
Dr. Chris Brogly (Supervisor, Committee Chair), Dr. Xing Tan, Dr. Xingwei (Nancy) Yang (Toronto Metropolitan University)
