
What are the uses for large language models in the social sciences?




Clément Gorin

Associate professor of economics

Paris 1 Panthéon-Sorbonne University



Thomas Renault*

Professor

Paris-Saclay University, RITM


*Member of the faculty of the Executive DBA Paris-Saclay / Business Science Institute



 

Introduction


Machine learning methods are transforming empirical research in the social sciences by offering new tools, particularly for prediction and for exploiting data sources that were previously difficult to use, such as natural language.


In recent years, the rise of large language models (LLMs) has marked a major advance in natural language modeling, both in terms of understanding and generation. In the context of research, these models offer the opportunity to automate certain tasks while reducing costs, such as prediction from text, analysis of similarities between documents, and data collection. However, many questions remain about their use, particularly due to the presence of biases, the difficulty of accurately assessing their level of uncertainty, and their lack of interpretability.


In this context, this article provides a brief introduction to LLMs, explaining how they work, presenting some applications in the social sciences, and highlighting certain limitations to their use, in order to provide food for thought on the conditions for their application.


What is an LLM?


LLMs are a family of machine learning models designed to process natural language. These versatile models are based on neural network architectures called transformers and are distinguished by a considerable number of parameters, estimated from large text corpora using a training method known as self-supervised learning. To understand how LLMs work, it is essential to understand the structure of natural language. Language can be represented as a sequence of units[1] and has two fundamental dimensions: semantics, which assigns meaning to the message and allows it to be interpreted, and syntax, which organizes words according to grammatical rules and ensures the structural coherence of sentences. This duality makes automated language processing particularly complex, notably due to the absence of a natural numerical representation for the semantics of words and the need to capture the numerous syntactic interactions, sometimes between distant words, that contribute to their meaning.
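To make these units concrete, the short sketch below shows how a tokenizer splits raw text into the pieces a model actually processes. It uses the Hugging Face transformers library and the publicly available bert-base-uncased checkpoint; the model choice and example sentence are purely illustrative.

```python
# Illustrative sketch: splitting raw text into the units processed by an LLM.
# Common words typically map to a single token, while rarer words are broken
# into several sub-word pieces (marked with "##" by this particular tokenizer).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

sentence = "Central banks tightened monetary policy unexpectedly."
print(tokenizer.tokenize(sentence))   # the sequence of sub-word units
print(tokenizer.encode(sentence))     # the integer ids actually fed to the model
```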


To address these challenges, neural networks compute numerical representations of language in the form of contextualized vectors, called “embeddings.” These vectors project words into a latent numerical space where proximity reflects semantic and syntactic similarities (Bengio et al., 2003; Le and Mikolov, 2014). For example, the model assigns similar vectors to words used in comparable contexts, thus reflecting their semantic proximity, while distancing words that appear in different contexts. Each dimension of the vector encodes a specific aspect of meaning, which may correspond to an abstract concept or a characteristic shared by several words, although these dimensions are not directly observable. In terms of syntax, these representations also incorporate interactions between words, reflecting both their order of appearance and their hierarchical relationships. These relationships can be simple, such as grammatical rules, or more complex and abstract, such as analogies, as well as temporal and causal structures that contribute to the coherence of a text. Language models can learn these representations by predicting a masked word from those surrounding it (Devlin et al., 2019)[2]. This task requires the model to develop a deep understanding of the semantic and syntactic dimensions of language.
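The masked-word objective can be illustrated in a few lines of code. The sketch below uses the fill-mask pipeline from the Hugging Face transformers library with the bert-base-uncased checkpoint (an illustrative choice): the model has to exploit the surrounding context to propose plausible completions for the hidden word.

```python
# Minimal sketch of the masked-word prediction task described above.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The model relies on the context to recover the masked word.
for prediction in fill_mask("The central bank raised interest [MASK] to curb inflation."):
    print(f"{prediction['token_str']:>10}  score={prediction['score']:.3f}")
```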


Among language models, the success of the transformer architecture (Vaswani et al., 2017) rests on a flexible and dynamic mechanism called attention[3], which allows these contextualized vectors to be computed efficiently. Functioning as a question-and-answer system, this mechanism allows each word to interact with those around it in order to identify relevant associations. Thus, if the question posed by a word is answered by the preceding words, part of their meaning is integrated into the representation of the target word. In a transformer module, several attention mechanisms coexist, offering words the possibility of asking various questions and obtaining as many answers. Finally, the architecture consists of a series of these modules organized hierarchically, allowing language to be represented at various levels of abstraction. The first modules capture elementary interactions such as frequent co-occurrences and basic syntactic structures, while the deeper modules represent more global and abstract concepts such as theme, emotion, or narrative structure.
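The question-and-answer intuition can be written down compactly. The toy NumPy sketch below implements scaled dot-product attention: each position emits a query (its “question”), compares it with the keys (the “answers”) of the other positions, and receives a weighted mix of their values. It is a minimal illustration of a single attention head, not a full transformer module.

```python
# Toy sketch of scaled dot-product attention.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (sequence_length, d_model) word vectors; Wq, Wk, Wv: projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv             # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # how well each question is answered
    weights = softmax(scores, axis=-1)           # attention weights sum to 1 for each word
    return weights @ V                           # contextualized representations

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                     # 5 words, 16-dimensional embeddings
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)       # (5, 16)
```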


What are their applications?


LLMs are considered foundation models, i.e., pre-trained architectures that have a general understanding of language and can be adapted to various tasks, sometimes without additional training. This section focuses on generic LLMs rather than those equipped with conversational modules for chatbot applications[4].


A first application is analyzing the sentiment expressed in financial tweets in order to assess whether investor opinion is positive, negative, or neutral towards a stock, market, or economic trend (Renault, 2017). This task usually relies on manual annotation, which is often time-consuming and costly, sometimes requiring the expertise of specialized annotators. Using LLMs reduces this dependency by relying on a pre-trained model that only needs to be adapted to the specific task. This process, called transfer learning, involves replacing the model's output module with one tailored to the target categories, here positive, negative, or neutral. The model's parameters are then fine-tuned on a task-specific sample, resulting in a high-performance model with minimal annotated data.
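As a rough illustration of this transfer-learning step (a minimal sketch, not the setup of the cited study), the code below replaces the output head of a pre-trained encoder with a three-class classifier and fine-tunes it on a tiny toy sample, using the Hugging Face transformers and datasets libraries. The model name, the toy tweets, and the labels are placeholders.

```python
# Hedged sketch: fine-tuning a pre-trained encoder for 3-class sentiment classification.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import Dataset

texts  = ["$AAPL to the moon!", "Markets flat today.", "Selling everything before the crash."]
labels = [2, 1, 0]   # 0 = negative, 1 = neutral, 2 = positive (toy annotations)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3)      # replaces the pre-trained output module

dataset = Dataset.from_dict({"text": texts, "label": labels})
dataset = dataset.map(lambda x: tokenizer(x["text"], truncation=True,
                                          padding="max_length", max_length=64),
                      batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sentiment-model", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=dataset,
)
trainer.train()   # refines the pre-trained parameters on the task-specific sample
```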


Another application is to measure the similarity between documents using embedding vectors and distance metrics. Neural representations allow consistent and structured distances to be defined by capturing the semantic and syntactic relationships between texts. Thus, two documents can be identified as similar even if they contain different words, a different sentence order, or varying lengths. For example, Kelly et al. (2021) apply this method to the analysis of technology patents to identify disruptive innovations—patents that stand out from previous work while strongly influencing future developments. Textual distances make it possible to measure the novelty of a patent by comparing it to those that preceded it, and its influence by assessing its similarity to patents filed later. By combining these two dimensions, this approach quantifies the impact of innovations and makes it possible to track the evolution of technological waves over the long term.
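A simplified sketch in the spirit of this approach (not the authors' exact measure) is shown below: documents are embedded with a sentence-encoder, and a new document's novelty is scored as its average dissimilarity to earlier documents. The sentence-transformers model name and the toy documents are illustrative assumptions.

```python
# Hedged sketch: document similarity and a simple novelty score from embeddings.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

earlier_docs = ["A method for storing data on magnetic tape.",
                "An apparatus for sorting punched cards."]
new_doc = ["A neural network accelerator implemented on a single chip."]

model = SentenceTransformer("all-MiniLM-L6-v2")
earlier_emb = model.encode(earlier_docs)
new_emb = model.encode(new_doc)

similarities = cosine_similarity(new_emb, earlier_emb)[0]
novelty = 1 - similarities.mean()   # high value = dissimilar to prior documents
print(f"Similarities to prior documents: {similarities.round(2)}, novelty: {novelty:.2f}")
```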


A final application concerns data collection. A specific architecture of generative LLMs, known as Retrieval-Augmented Generation (RAG), makes it possible to efficiently exploit vast document databases to extract relevant information. Unlike traditional generative models, whose knowledge is limited to the data acquired during training, RAGs combine text generation with information retrieval from an external document database. This approach combines the flexibility of language models with greater accuracy in responses, as it relies on external, verifiable sources rather than simple probabilistic generation. RAG thus simplifies the use of specialized databases, such as historical archives or scientific publications, while significantly reducing the risk of errors.
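The schematic sketch below illustrates the two stages of such a pipeline: retrieve the passages most similar to the question, then hand them to a generative model as context. The archive, the encoder checkpoint, and the generate_answer function are hypothetical placeholders; in practice the prompt would be sent to a generative LLM of the researcher's choice.

```python
# Schematic sketch of a retrieval-augmented generation (RAG) pipeline.
from sentence_transformers import SentenceTransformer
import numpy as np

archive = ["The 1907 census reports 2.7 million inhabitants in the capital.",
           "Grain prices doubled between 1846 and 1847 in the northern provinces.",
           "The railway reached the southern border in 1882."]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = encoder.encode(archive, normalize_embeddings=True)

def retrieve(question, k=2):
    """Return the k passages most similar to the question."""
    q_emb = encoder.encode([question], normalize_embeddings=True)
    scores = doc_emb @ q_emb.T                    # cosine similarity (vectors are normalized)
    top = np.argsort(-scores[:, 0])[:k]
    return [archive[i] for i in top]

def generate_answer(question, passages):
    # Hypothetical placeholder: in practice, send this prompt to a generative LLM.
    return "Answer using only these sources:\n" + "\n".join(passages) + f"\nQuestion: {question}"

question = "When did grain prices double?"
print(generate_answer(question, retrieve(question)))
```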


Under what conditions should they be used?


The use of pre-trained LLMs implies a loss of control over the data used for their training. These statistical models tend to replicate, or even amplify, the biases present in their training data, which can lead to biased or discriminatory representations, especially when these data lack diversity (Manvi et al., 2024). Furthermore, if the database used in the application is freely accessible, the model may already have been trained on the research sample. This can lead to overfitting, where the model memorizes the data rather than extracting general patterns, which distorts inferences and compromises the validity of the results. To limit these risks, it is recommended to use open-source LLMs whose training data are documented and whose updates are clearly dated.


Another problem lies in the difficulty of accurately quantifying the uncertainty of LLM predictions. Unlike traditional statistical models, they do not provide confidence intervals for their predictions[5]. This lack of uncertainty quantification can lead them to produce erroneous predictions with excessive confidence. For example, LLMs are trained to reproduce the distributional structure of language, which can lead them to generate representations that are plausible but false, rather than rigorously accurate. One way to manage this uncertainty is to compare predictions to an external validation sample (one that was not used during training) and to explicitly model the structure of prediction errors (Ludwig et al., 2025).
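As a minimal illustration of this validation step (a simplified sketch, not the full framework of Ludwig et al., 2025), the code below compares LLM labels with human annotations on a held-out sample and inspects the structure of the errors; the labels shown are toy data.

```python
# Sketch: validating LLM labels against held-out human annotations.
from sklearn.metrics import accuracy_score, confusion_matrix

human_labels = ["pos", "neg", "neu", "pos", "neg", "neu", "pos", "neg"]  # external validation sample
llm_labels   = ["pos", "neg", "neu", "neg", "neg", "pos", "pos", "neg"]  # model predictions

print("Agreement rate:", accuracy_score(human_labels, llm_labels))
print(confusion_matrix(human_labels, llm_labels, labels=["neg", "neu", "pos"]))
# Off-diagonal cells reveal systematic error patterns (e.g., neutral texts
# classified as positive), which can then be modeled explicitly.
```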


Finally, another challenge in certain applications is the lack of interpretability of LLMs. This opacity results from the complexity of their mechanisms, which rely on a considerable number of parameters interacting in a non-linear manner. This makes it difficult to trace precisely how a model constructs its representations and generates its predictions. Unlike humans, these models do not understand language semantically, but rely on statistical correlations derived from training data. As a result, their representations of language do not correspond to ours, which complicates their interpretation. Much work is being done to interpret the internal representations of models or to align them with those of humans, but this work mainly applies to architectures that are simpler than current LLMs.


Conclusion


LLMs open up new perspectives for social science research by facilitating language analysis, information extraction, and prediction from textual data. Their flexibility and adaptability make them powerful tools for a wide range of natural language modeling tasks, while limiting the need for manual annotations.


However, their use raises major methodological challenges, particularly in terms of bias, quantification of uncertainty, and interpretability. Thus, for rigorous application in research, their use must be based on the fundamental principles of empirical validation and transparency of training data.


References


  • Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A Neural Probabilistic Language Model. Journal of Machine Learning Research, 3, 1137–1155.

  • Le, Q., & Mikolov, T. (2014). Distributed Representations of Sentences and Documents. Proceedings of the 31st International Conference on Machine Learning, 1188–1196.
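
  • Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of NAACL-HLT 2019, 4171–4186.

  • Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving Language Understanding by Generative Pre-Training. OpenAI Technical Report.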

  • Korinek, A. (2023). Generative AI for economic research: Use cases and implications for economists. Journal of Economic Literature, 61(4), 1281–1317.

  • Manvi, R., Khanna, S., Burke, M., Lobell, D., & Ermon, S. (2024). Large language models are geographically biased. Proceedings of the 41st International Conference on Machine Learning, pp. 1–16.

  • Ludwig, J., Mullainathan, S., & Rambachan, A. (2025). Large language models: An applied econometric framework. National Bureau of Economic Research, No. w33344.

  • Kelly, B., Papanikolaou, D., Seru, A., & Taddy, M. (2021). Measuring technological innovation over the long run. American Economic Review: Insights, 3(3), 303–320.

  • Renault, T. (2017). Intraday online investor sentiment and return patterns in the U.S. stock market. Journal of Banking & Finance, 84, 25–40.

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.



[1] Depending on the application, these units can represent words, sub-words, or individual characters.

[2] This is representation learning, where the model is trained on a secondary task aimed at acquiring high-quality representations. This approach is also known as semi-supervised learning, as it uses raw text to automatically generate input and output data. In the case of generative models, prediction is performed by determining the next word based on the preceding words (Radford et al., 2018). During inference, this same mechanism allows the model to generate a response in an autoregressive manner, using the question as the initial context.

[3] LLMs rely on a specific attention mechanism called self-attention. Furthermore, this mechanism is formulated to exploit parallel computing, which allows the model to be trained on large text corpora.

[4] Conversational LLMs offer many other practical applications for speeding up certain daily research activities, such as interactive discussion to generate feedback, article summarization, text correction and translation, and assistance with computer code writing and mathematical derivations, particularly with a new generation of so-called reasoning models. However, it is essential that researchers have the necessary knowledge to validate the quality of the results obtained. For an in-depth presentation of these applications, readers can consult Korinek (2023).

[5] Several techniques inspired by Bayesian approaches can be used to estimate confidence intervals for parameters and predictions. They are based either on repeated sampling or on explicit modeling of uncertainty, taking into account both the variance of the parameters and that of the data. However, this second approach requires doubling the number of parameters in the model.

