“The ultimate purpose of the social sciences is to furnish causal explanations of classes of observable events, which are, at least in part, generated by individual and collective agency/action.” The aim of a digital assistant for social history research is therefore to support social scientists in constructing such causal explanations for observable events, that is, theories or hypotheses. Social history researchers can then test these against various aspects of society to see whether a newly found hypothesis holds. Since few structured, easily accessible hypotheses exist in the social domain to learn from, we have focused first on causal narratives in the medical domain, where we have access to a dataset of ~4000 of these. The aim is to analyse this dataset and transfer the insights we gain to the social domain (provided that they are transferable between domains). In this specification, we first briefly describe a few ways in which social history researchers discover and test hypotheses, and potential things to look out for when developing a digital research assistant. We then briefly discuss the medical domain and the hypothesis generation technologies we are developing in that area of research.
In the field of biomedicine, the common process for generating a new hypothesis that can be tested in a clinical trial is to intervene in a given biochemical process with a specific treatment. This potential outcome framework has been used for a long time, fueled by new discoveries in the lab, e.g. the discovery of a new protein. As with social science research, the generation of a new hypothesis can include the following steps:
Step 1. Protein-pathway discovery.
Example: ‘The discovery of a new oncogene participating in a cellular pathway’
Step 2. Drug discovery.
Example: ‘The development of a chemical molecule to target the new oncogene.’
Step 3. Clinical trial.
Example: ‘A significant effect on tumor growth was found administering the chemical molecule to patients with liver cancer.’
Step 4. Drug repurposing by analogy.
Example: ‘The chemical molecule treats liver cancer, which resembles kidney cancer. Can the molecule treat kidney cancer too?’
Hence, hypothesis generation can be fueled by a new scientific discovery such as the discovery of a new gene participating in a pathway, as well as by analogy, through already performed trials and their results.
A digital assistant for biomedical research:
Scientific discovery in the biomedical domain can greatly benefit from automated hypothesis generation, as finding new and interesting research questions is challenging and requires considerable background knowledge about trials, drugs, conditions and their various causal mechanisms.
Two main requirements for automated scientific discovery:
The task is often formulated as a link prediction task, in which a new link is predicted between a disease and an existing treatment, such as insulin –treats→ diabetes. Several studies argue for integrating the model with structured background knowledge about known cause-and-effect relationships within the problem domain, to support both the generation of hypotheses and their explanation.
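To make the link prediction formulation concrete, the following is a minimal sketch, not the method used in the studies referred to above. It assumes a toy set of background-knowledge triples and a single hand-written rule (a drug that targets an entity involved in a disease is a candidate treatment for that disease). All entity names, relation names and function names are hypothetical placeholders and do not represent curated medical knowledge. The path that triggers each prediction is kept as its explanation.

```python
from collections import defaultdict

# Toy background knowledge as (subject, relation, object) triples.
# All names are hypothetical placeholders, not curated medical facts.
TRIPLES = [
    ("molecule_X", "targets", "oncogene_A"),
    ("oncogene_A", "involved_in", "liver_cancer"),
    ("oncogene_A", "involved_in", "kidney_cancer"),
    ("molecule_X", "treats", "liver_cancer"),
    ("molecule_Y", "targets", "protein_B"),
    ("protein_B", "involved_in", "diabetes"),
]


def build_index(triples):
    """Map each subject to the set of (relation, object) pairs leaving it."""
    out_edges = defaultdict(set)
    for subject, relation, obj in triples:
        out_edges[subject].add((relation, obj))
    return out_edges


def predict_treats(triples):
    """Propose unknown 'treats' links together with the paths that explain them.

    Assumed rule: drug -targets-> entity -involved_in-> disease
    suggests the candidate link drug -treats-> disease.
    """
    out_edges = build_index(triples)
    known = {(s, o) for s, r, o in triples if r == "treats"}
    candidates = defaultdict(list)
    for drug, edges in out_edges.items():
        for relation, target in edges:
            if relation != "targets":
                continue
            # .get avoids adding keys to the defaultdict while iterating over it.
            for relation2, disease in out_edges.get(target, ()):
                if relation2 == "involved_in" and (drug, disease) not in known:
                    path = (drug, "targets", target, "involved_in", disease)
                    candidates[(drug, disease)].append(path)
    return dict(candidates)


if __name__ == "__main__":
    for (drug, disease), paths in sorted(predict_treats(TRIPLES).items()):
        print(f"{drug} -treats-> {disease}, explained by {paths}")
```

In this sketch the already known link between molecule_X and liver_cancer is excluded, so only new candidate links are returned, each with at least one explaining path.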
Explainable link prediction methods have proved very successful in pointing out new, interesting drug-disease pairs, specifically in being able to focus the attention of medical practitioners on those hypotheses that are explainable with current knowledge of biochemical processes. While these developments are paramount in producing explainable medical AI, such hypotheses are subject to simplification. Bodily processes are complex in nature, and by reducing hypothesis generation to a single link prediction task, a system risks missing out on interesting hypotheses. For example, adults with diabetes mellitus as well as diabetic ketoacidosis might require a completely different treatment than children without diabetic ketoacidosis. Such a task can be formulated as a graph generation task, where one predicts not only a link between a drug and a disease, but the entirety of the hypothesis: age groups, symptoms, modes of drug delivery, and others. Even though explainable link prediction is a much-researched topic, research into subgraph generation is scarce, and the research that exists focuses on machine-learned methods that are often nontransparent in their reasoning.
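The difference between a single link and a full hypothesis subgraph can be sketched as a small data structure. This is an illustrative assumption about how such a hypothesis could be represented, not the representation used in our dataset. The class name, field names, relation names and the treatment identifiers treatment_A and treatment_B are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """A hypothesis as a small graph rather than a single drug-disease link."""
    treatment: str
    disease: str
    age_group: str
    comorbidities: tuple = ()
    delivery_mode: str = "unspecified"

    def to_triples(self, hyp_id):
        """Serialise the hypothesis as (subject, relation, object) triples."""
        triples = [
            (hyp_id, "has_treatment", self.treatment),
            (hyp_id, "has_disease", self.disease),
            (hyp_id, "has_age_group", self.age_group),
            (hyp_id, "has_delivery_mode", self.delivery_mode),
        ]
        triples += [(hyp_id, "has_comorbidity", c) for c in self.comorbidities]
        return triples


def link_view(hypothesis):
    """Collapse a hypothesis to the single link a link predictor would output."""
    return (hypothesis.treatment, "treats", hypothesis.disease)


# Hypothetical hypotheses; treatment names are placeholders.
adults = Hypothesis(
    treatment="treatment_A",
    disease="diabetes_mellitus",
    age_group="adults",
    comorbidities=("diabetic_ketoacidosis",),
)
children = Hypothesis(
    treatment="treatment_B",
    disease="diabetes_mellitus",
    age_group="children",
)

if __name__ == "__main__":
    print(link_view(adults), link_view(children))
    print(adults.to_triples("h1"))
    print(children.to_triples("h2"))
```

The link view of both hypotheses only states that some treatment treats diabetes_mellitus. The age group and the comorbidity that distinguish the two cases survive only in the triple serialisation, which is the part a graph generation task would have to predict.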
Social science/social history research:
In the field of social history, the discovery of a causal narrative often arises first and foremost from the generation or discovery of a grand theory. Such grand theories can arise in a multitude of ways: from ‘rocking chair sociology’, to the discovery of certain patterns when zooming in on certain groups in society, be it inequality amongst people in a small town, or social cohesion amongst followers of a certain religion. Theories related to the latter therefore come about in a more fortuitous way. Here it is interesting to note that most finer-grained questions can be divided into three big questions or themes: those related to cohesion, inequality or rationalisation (the effect of technological developments on a society).
When a devised theory is to be tested, or an interesting use case has come to light, the process of constructing causal narratives can be roughly subdivided into three sub-questions and their output. Each step ingests the output of the previous step:
Start: either a theory, or a use case, e.g., the role of social cohesion can explain certain outcomes among social groups, or the suicide rates and religious beliefs of those living in town X have been recorded, respectively.
Even though the overarching ‘grand’ questions should remain the same, branching questions are prone to grow in a certain direction. Knowledge of social history can therefore only lift a corner of the veil.
A digital assistant for social science research:
We argue that a digital assistant for scientific discovery in the social sciences or social history domain can aid in the data-driven generation of points 1 and 2 described in the section above. By ingesting structured data, such as is available at the International Institute of Social History (IISH), a digital assistant can, first and foremost, discover trends over time (longitudinal) or among groups, and present these to the researcher in question. An example of such a trend is described in point 1 above. Illuminating bias in datasets is an important component here, as bias limits the range of a hypothesis. For instance, the hypothesis mentioned in the previous section could apply only to people who earn more than the marginal income.
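As a rough illustration of trend discovery over structured data, the sketch below fits a least-squares slope per group to a handful of records and also reports how many observations each group contributes, since sparse coverage of a group is one simple signal of dataset bias. The record layout, group names, values and function names are all made up for illustration and are not taken from IISH data.

```python
from collections import defaultdict

# Hypothetical records: (year, group, rate). Values are made up for illustration.
RECORDS = [
    (1900, "group_a", 10.0), (1910, "group_a", 12.0), (1920, "group_a", 14.0),
    (1900, "group_b", 8.0), (1910, "group_b", 8.0), (1920, "group_b", 8.0),
    (1920, "group_c", 5.0),
]


def slope(points):
    """Ordinary least-squares slope of y on x; None if x does not vary."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        return None  # a single time point carries no trend information
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


def trends_by_group(records):
    """Return {group: (slope per year, number of observations)}."""
    series = defaultdict(list)
    for year, group, value in records:
        series[group].append((year, value))
    return {group: (slope(points), len(points)) for group, points in series.items()}


if __name__ == "__main__":
    for group, (trend, count) in sorted(trends_by_group(RECORDS).items()):
        print(group, trend, count)
```

A group with a slope of None or with very few observations, such as group_c here, would be flagged to the researcher rather than used to support a hypothesis.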
A digital assistant for hypothesis generation in the social sciences, be it social history or social science research in general, should take note of the following:
Explainable. Humanities researchers increasingly turn their data into Linked Data [3,4], both to interlink their own data and to link social science data to knowledge from other domains available in the LOD cloud.
 Abell, P. (2009). History, case studies, statistics, and causal inference. European Sociological Review. https://doi.org/10.1093/esr/jcn072
Example literature related to a comparative question, as well as a data ecosystem supporting the search for causal narratives:
 van den Berg, N., van Dijk, I. K., Mourits, R. J., Slagboom, P. E., Janssens, A. A. P. O., & Mandemakers, K. (2021). Families in comparison: An individual-level comparison of life-course and family reconstructions between population and vital event registers. Population Studies, 75(1), 91–110. https://doi.org/10.1080/00324728.2020.1718186
 Hoekstra, R., Meroño-Peñuela, A., Rijpma, A., Zijdeman, R., Ashkpour, A., Dentler, K., Zandhuis, I., & Rietveld, L. (2018). The dataLegend ecosystem for historical statistics. Journal of Web Semantics, 50, 49–61. https://doi.org/10.1016/j.websem.2018.03.001
 Zapilko, B., et al. (2016). Applying linked data technologies in the social sciences. KI - Künstliche Intelligenz, 30(2), 159–162.