Introduction

A graph, in short, is a description of items linked by relations, where the items are called nodes (or vertices) and the relations are called edges (or links). Examples of graphs include social networks (e.g. Instagram) and knowledge graphs (e.g. Wikipedia). There is a growing body of research that applies Machine Learning techniques to graphs in order to solve a wide range of problems.
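To make the definition concrete, here is a minimal sketch of a graph stored as an adjacency structure. The names in the toy social network and the `build_adjacency` helper are made up for illustration.

```python
from collections import defaultdict

# Hypothetical toy social network: each pair is an undirected edge
# (two people who are connected to each other).
edges = [("alice", "bob"), ("bob", "carol"), ("alice", "dave")]


def build_adjacency(edge_list):
    """Return a dict mapping each node to the set of its neighbours."""
    adjacency = defaultdict(set)
    for u, v in edge_list:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return dict(adjacency)


graph = build_adjacency(edges)
print(sorted(graph))           # the nodes (vertices)
print(sorted(graph["alice"]))  # the nodes linked to "alice" by an edge
```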

Part 1 of this article can be found here.

In this post, we introduce two recent studies in which Machine Learning is applied to graphs.

AutoRD

Rare diseases (e.g. Hutchinson-Gilford Progeria Syndrome), also known as orphan diseases, are individually uncommon and sometimes receive less attention in medical research because of their low prevalence. Meanwhile, Large Language Models (LLMs) have demonstrated exceptional proficiency in language understanding and generation, and current research is beginning to evaluate the capabilities of the most powerful LLMs, such as ChatGPT and GPT-4, across various medical applications.

In the context of rare diseases, where resources are often limited, LLMs are emerging as valuable tools for extracting information about these conditions and for enhancing medical knowledge systems. For instance, a tool called Automated Rare Disease Mining (AutoRD) was introduced recently for extracting information about rare diseases and constructing corresponding knowledge graphs. The system takes unstructured medical text as input and outputs the extraction results together with a knowledge graph.

The AutoRD framework takes medical texts as input and outputs the extraction results: entities related to rare diseases and rare disease triples. It then constructs a knowledge graph from these results. During the entity and relation extraction steps, ontologies-enhanced large language models (LLMs) are used to improve performance. (Lang Cao et al., 2024) Article Link: https://arxiv.org/abs/2403.00953

The framework of AutoRD includes data preprocessing, entity extraction, relation extraction, entity calibration, and knowledge graph construction.

Data Preprocessing

Segmenting long documents: Split lengthy documents into smaller segments that fit within the input limit of the LLM being used.
Processing medical knowledge data: Extract disease names, symptoms, and signs, along with their definitions, from the given medical ontology files.
Preprocessing RareDis dataset: Correct errors in the original dataset annotation and split the data into training, validation, and test sets for evaluation.
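The segmentation step can be sketched as follows. This is a minimal illustration and not AutoRD's actual implementation: the `segment_document` function, the word-count limit, and the sentence-splitting regex are all assumptions. A real system would more likely count tokens with the tokenizer of the LLM in use.

```python
import re


def segment_document(text, max_words=200):
    """Greedily pack whole sentences into segments of at most max_words words.

    A single sentence longer than max_words is kept as its own segment.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    segments, current, count = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        n_words = len(sentence.split())
        # Start a new segment if adding this sentence would exceed the limit.
        if current and count + n_words > max_words:
            segments.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        segments.append(" ".join(current))
    return segments
```

Splitting on sentence boundaries rather than at a fixed character offset keeps each medical statement intact, which matters when the later steps extract entities and relations from each segment separately.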

Entity Extraction

Extract medical terms: Use a string-matching algorithm with negation detection on a medical ontology (Mondo) to find basic medical terms, and use LLMs with a prompt to find additional medical terms, including those not directly in the ontology.
Categorise entities: Use LLMs with a prompt to categorise all extracted medical terms into entity types.
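The string-matching part of this step might look like the sketch below. It is a simplified heuristic for illustration only, not the algorithm used in AutoRD: the term list stands in for a real ontology such as Mondo, and the negation cues, the `window` parameter, and the `match_terms` function are assumptions.

```python
import re

# Hypothetical mini-ontology; a real system would load terms from ontology files.
ONTOLOGY_TERMS = ["hair loss", "wrinkled skin", "failure to thrive", "fever"]
# Assumed negation cues for this sketch.
NEGATION_CUES = ("no", "not", "without", "denies")


def match_terms(text, terms=ONTOLOGY_TERMS, window=3):
    """Return (term, negated) pairs for ontology terms found in the text.

    A match is flagged as negated when a negation cue occurs among the
    `window` words that precede it in the same sentence.
    """
    results = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.lower()):
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            for match in re.finditer(pattern, sentence):
                preceding = re.findall(r"\w+", sentence[:match.start()])[-window:]
                negated = any(word in NEGATION_CUES for word in preceding)
                results.append((term, negated))
    return results
```

Negated matches (for example "no fever") can then be dropped before the remaining terms are passed to the LLM for categorisation.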

Relation Extraction

Use LLMs with a prompt to extract relations between the identified entities.

Entity Calibration

Use LLMs with a prompt to re-analyse relationships between all extracted entities.
Filter out entities identified as unrelated based on the combined information from entity extraction and relation extraction.

Knowledge Graph Construction

Use the Neo4j graph database to store and manage the relationships between entities as triples (entity-relation-entity), adding them one by one through the Neo4j API.
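Adding triples one by one could be sketched as below. This is an assumption-laden illustration rather than AutoRD's code: the `Entity` label, the `name` property, and both helper functions are hypothetical. The `session` argument is expected to be any object with a `run(query, parameters)` method, such as a session from the official `neo4j` Python driver.

```python
import re


def triple_to_cypher(head, relation, tail):
    """Build a parameterised Cypher MERGE statement for one triple."""
    # Relationship types cannot be passed as Cypher parameters, so the
    # relation is sanitised to letters, digits and underscores and inlined.
    rel_type = re.sub(r"\W+", "_", relation.strip()).strip("_").upper()
    if not rel_type:
        raise ValueError("relation must contain at least one word character")
    query = (
        "MERGE (h:Entity {name: $head}) "
        "MERGE (t:Entity {name: $tail}) "
        f"MERGE (h)-[:{rel_type}]->(t)"
    )
    return query, {"head": head, "tail": tail}


def store_triples(session, triples):
    """Add (head, relation, tail) triples to the graph one at a time."""
    for head, relation, tail in triples:
        query, params = triple_to_cypher(head, relation, tail)
        session.run(query, params)


# Possible usage with the neo4j driver (URI and credentials are placeholders):
# from neo4j import GraphDatabase
# driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
# with driver.session() as session:
#     store_triples(session, [("HGPS", "has symptom", "hair loss")])
```

Using MERGE instead of CREATE means that an entity mentioned in several triples is stored as a single node, so repeated insertions do not duplicate it.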

Experimental results show that AutoRD achieves state-of-the-art performance in terms of efficiency when constructing rare disease knowledge graphs, opening up possibilities for future research.

As an illustration, AutoRD might process a text about Hutchinson-Gilford progeria syndrome as follows:

Input medical text:

Hutchinson-Gilford progeria syndrome (HGPS) is a rare and fatal genetic disease that causes children to age rapidly. Symptoms typically begin in the first few months of life and include failure to thrive, wrinkled skin, and hair loss. There is currently no cure for HGPS, but research is ongoing.

AutoRD Output:

Entity Extraction: AutoRD starts by extracting entities from the medical text, using a combination of string matching and LLM prompts. In this example, it finds several entities related to rare diseases, including “Hutchinson-Gilford progeria syndrome”, “disease”, “wrinkled skin”, “failure to thrive”, and “hair loss”.
Relation Extraction: Next, AutoRD uses LLMs to extract relations between the identified entities. It finds that “Hutchinson-Gilford progeria syndrome” is a type of “disease” and that “disease” can have “hair loss” as a symptom.
Knowledge Graph Construction: Finally, AutoRD constructs a knowledge graph to represent the relationships between the entities. In this example, the knowledge graph shows that “Hutchinson-Gilford progeria syndrome” is a subclass of “disease” and that “disease” is associated with “hair loss”.

NEGSC

With the exponential growth in user hosts and network services, the frequency and complexity of cyberattacks are increasing. Fortunately, Network Intrusion Detection Systems (NIDSs) have proven reliable in detecting and mitigating cyberattacks. Their main tasks are to capture malicious network traffic flows and to use the identification results to respond precisely and promptly. In recent studies, Graph Neural Networks (GNNs), a powerful deep learning method for graph-structured data, have been introduced into NIDSs because they are well suited to representing network traffic flows.

Converting NetFlow-based data into a graph representation. Each arrow between two nodes indicates a network traffic flow from the source host to the destination host. Normal and attack flows are denoted by black and red arrows, respectively, and different shades of red indicate different types of attacks. (Renjie Xu et al., 2024) Article Link: https://arxiv.org/abs/2403.01501
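The conversion described in the caption can be sketched in a few lines: hosts become nodes and each flow becomes a directed edge that carries the flow's features and label. The record fields (`src`, `dst`, `bytes`, `label`) and the `flows_to_graph` helper are illustrative assumptions, not the actual NetFlow schema or the paper's code.

```python
from collections import defaultdict

# Hypothetical flow records with made-up addresses and labels.
flows = [
    {"src": "10.0.0.1", "dst": "10.0.0.2", "bytes": 1200, "label": "normal"},
    {"src": "10.0.0.3", "dst": "10.0.0.2", "bytes": 90000, "label": "dos"},
    {"src": "10.0.0.1", "dst": "10.0.0.4", "bytes": 300, "label": "normal"},
]


def flows_to_graph(records):
    """Return (nodes, out_edges): hosts as nodes, flows as directed edges.

    out_edges maps a source host to a list of (destination, features) pairs,
    so the classification target (the label) lives on the edge, not the node.
    """
    nodes = set()
    out_edges = defaultdict(list)
    for record in records:
        nodes.update((record["src"], record["dst"]))
        features = {"bytes": record["bytes"], "label": record["label"]}
        out_edges[record["src"]].append((record["dst"], features))
    return nodes, dict(out_edges)


nodes, out_edges = flows_to_graph(flows)
```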

Although GNNs are used to analyse network traffic data (represented as a graph) to detect malicious activity, they might not fully utilise the rich information present in network flow data (data about traffic flows between network nodes). To further improve the performance of GNN-based detection, the authors propose NetFlow-Edge Generative Subgraph Contrast (NEGSC), a self-supervised graph representation learning method for identifying malicious attacks and their specific types.

NEGSC does not require pre-labelled data, which can be scarce for malicious attacks. Instead, it learns by contrasting “good” and “anomalous” subgraphs within the network traffic graph. It goes beyond the basic graph structure and considers the specific features present in network flows (e.g., packet size, source and destination IP addresses), which allows for a more nuanced understanding of traffic patterns. NEGSC focuses on the immediate neighbours (directly connected nodes) of a central node, which helps identify suspicious patterns in local traffic flows that might indicate an attack. It employs a learning framework called Generative Subgraph Contrast (GSC), which generates contrastive subgraphs for a given central node. These subgraphs represent possible “good” and “anomalous” traffic patterns around that node. A loss function then compares the generated contrastive subgraphs: if the real traffic data around a node deviates significantly from the “good” subgraph, this may indicate a potential attack.
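To give a feel for what contrasting subgraphs means, the sketch below implements a generic margin-based contrastive objective in plain Python. It is not the loss used in NEGSC: the mean pooling, the cosine similarity, the margin value, and all function names are assumptions chosen only to illustrate the idea of pulling a central node towards one subgraph and pushing it away from another.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pool(vectors):
    """Average a list of equal-length feature vectors into one embedding."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def margin_contrast_loss(center, positive_subgraph, negative_subgraph, margin=0.5):
    """Hinge loss that is zero once the central node is closer to the
    positive subgraph than to the negative one by at least `margin`."""
    pos_similarity = cosine(center, mean_pool(positive_subgraph))
    neg_similarity = cosine(center, mean_pool(negative_subgraph))
    return max(0.0, margin - pos_similarity + neg_similarity)
```

In this toy setting each subgraph is just a list of neighbour feature vectors. A real model would produce these embeddings with a GNN and minimise the loss over many central nodes.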

Experimental results show that this approach significantly outperforms the existing self-supervised learning method in terms of Accuracy, Precision, Recall and F1 in a binary classification setting.
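For readers unfamiliar with these four metrics, the sketch below computes them from binary labels, where 1 stands for an attack flow and 0 for a normal flow. The `binary_metrics` function is a hypothetical helper written for this post and the labels in the usage example are made up, not results from the paper.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # attacks caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # normal flows passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed attacks
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for five flows.
metrics = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

Precision penalises false alarms, recall penalises missed attacks, and F1 is their harmonic mean, which is why intrusion detection papers usually report all of them alongside accuracy.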

Conclusion

Machine Learning has seen significant advances, particularly when combined with various kinds of graphs, and this combination opens up many research directions. In this post, we discussed two recent studies that apply Machine Learning to graphs: AutoRD, which builds rare disease knowledge graphs with LLMs, and NEGSC, which detects network intrusions with self-supervised graph learning. With continued investigation and refinement, we believe that combining graphs with different machine learning techniques will keep creating new opportunities.

References

Cao, L., Sun, J. and Cross, A. (2024). AutoRD: An Automatic and End-to-End System for Rare Disease Knowledge Graph Construction Based on Ontologies-enhanced Large Language Models. arXiv. https://doi.org/10.48550/arxiv.2403.00953
Xu, R., Wu, G., Wang, W., Gao, X., He, A. and Zhang, Z. (2024). Applying Self-supervised Learning to Network Intrusion Detection for Network Flows with Graph Neural Network. arXiv. https://doi.org/10.48550/arxiv.2403.01501
