Date: 2025-03-12 10:20

Work-in-Progress Report for Project 1: Titanic Prediction Graph in Neo4j

1. Introduction

This project aims to construct a knowledge graph (KG) from the Titanic dataset to explore relationships among passengers that could support survival prediction. The work is divided into two parts.

Graph Database Development:

Design and implement the graph database schema, including node labels, relationship types, node properties, and taxonomy. Analyze, prepare, and map the dataset (in CSV format) to the graph model. Construct the Neo4j upload scripts and set up the graph database (hosted on AuraDB).

Intelligent Functions Development:

Identify the target audience and define the user tasks supported by the KG application. Develop two intelligent functions, network analysis and interactive visualization, to enhance the KG for survival prediction support.

2. Dataset Description

2.1 Data Source and Content

The dataset selected for this project is the Titanic dataset from Kaggle. It includes detailed passenger information from the 1912 Titanic disaster. The key fields in the dataset include:

PassengerId: Unique passenger identifier

Survived: Survival status (0 = did not survive, 1 = survived)

Pclass: Passenger class (1, 2, or 3)

Name: Passenger name

Sex: Gender

Age: Age in years

SibSp: Number of siblings/spouses aboard

Parch: Number of parents/children aboard

Ticket: Ticket number

Fare: Fare paid

Cabin: Cabin number

Embarked: Port of embarkation

The dataset was collected from historical records and compiled on Kaggle. Data cleaning and preprocessing have been performed to standardize formats and handle missing values.

Citation:

The dataset is available on Kaggle.

3. Task 1 – Graph Database Development

3.1 Knowledge Graph Design

To effectively represent the Titanic dataset, we have designed the graph schema as follows:

Node Labels:

Passenger: Represents each passenger. Key properties include passengerId, name, sex, age, pclass, ticket, cabin, fare, and survived.

Relationship Types:

TRAVELS_WITH: Connects passengers sharing the same ticket.

SHARES_CABIN_WITH: Connects passengers who are in the same cabin.

IS_SPOUSE_OF: Captures marital relationships based on available spouse information.

IS_SIBLING_OF: Connects passengers with sibling or familial relationships.

PARENT_OF / IS_CHILD_OF: (If available) Represent family relationships between parents and children.

3.2 Data Preparation and Mapping

Prior to importing the data into Neo4j, the CSV file was cleaned and normalized. Key preprocessing steps include:

Data Cleaning:

Standardizing fields (e.g., ticket and cabin formats), handling missing values, and ensuring that identifiers are consistent.

Field Mapping:

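Mapping CSV columns (e.g., "PassengerId", "Ticket", "Cabin") to node properties in the graph. Correct capitalization is crucial because CSV header names and Cypher property keys are case-sensitive: row.PassengerId and row.passengerid refer to different columns, and p.passengerId and p.PassengerId are different properties.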

Relationship Construction:

Using shared attributes (such as Ticket and Cabin) and family information to define relationship creation logic.
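The three steps above can be sketched in Python before any data reaches Neo4j. This is only an illustration under stated assumptions: the inline sample rows are made up, the helper names (clean_text, clean_number, load_passengers, pairs_sharing) are hypothetical, and upper-casing plus whitespace collapsing is one possible normalization rather than the one actually applied in this project.

```python
import csv
import io
from collections import defaultdict
from itertools import combinations

# Made-up sample rows standing in for the Kaggle CSV (not real passenger data).
SAMPLE_CSV = """PassengerId,Survived,Pclass,Name,Sex,Age,Ticket,Fare,Cabin
1,0,3,Passenger A,male,22, t-100 ,7.25,
2,1,1,Passenger B,female,38,T-200,71.5,C85
3,1,1,Passenger C,male,,T-200,71.5,c85
4,0,3,Passenger D,female,4,T-100,7.25,
"""


def clean_text(value):
    """Collapse whitespace and upper-case; return None for empty or missing fields."""
    if value is None:
        return None
    value = " ".join(value.split()).upper()
    return value or None


def clean_number(value, cast):
    """Convert a numeric field with `cast`; return None when the field is empty."""
    value = (value or "").strip()
    return cast(value) if value else None


def load_passengers(text):
    """Map the case-sensitive CSV headers to the node properties of section 3.1."""
    passengers = []
    for row in csv.DictReader(io.StringIO(text)):
        passengers.append({
            "passengerId": int(row["PassengerId"]),
            "name": row["Name"].strip(),
            "sex": row["Sex"].strip().lower(),
            "age": clean_number(row["Age"], float),
            "pclass": int(row["Pclass"]),
            "ticket": clean_text(row["Ticket"]),
            "cabin": clean_text(row["Cabin"]),
            "fare": clean_number(row["Fare"], float),
            "survived": int(row["Survived"]),
        })
    return passengers


def pairs_sharing(passengers, key):
    """Return sorted (id1, id2) pairs with id1 < id2 that share a non-null `key` value."""
    groups = defaultdict(list)
    for p in passengers:
        if p[key] is not None:
            groups[p[key]].append(p["passengerId"])
    pairs = []
    for ids in groups.values():
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)


passengers = load_passengers(SAMPLE_CSV)
print(pairs_sharing(passengers, "ticket"))  # candidate TRAVELS_WITH pairs
print(pairs_sharing(passengers, "cabin"))   # candidate SHARES_CABIN_WITH pairs
```

Grouping by the normalized value avoids comparing every passenger with every other one, and passengers with a missing ticket or cabin are skipped rather than being linked to each other.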

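3.3 Neo4j Upload Scripts

The Cypher scripts used to create the uniqueness constraint, import the Passenger nodes, and build the TRAVELS_WITH relationships are listed in the Appendix.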

3.4 Database Setup

The graph database is hosted on Neo4j AuraDB. The upload scripts have been executed via the Neo4j Browser, and initial data verification queries have confirmed that nodes and relationships are correctly mapped.

4. Task 2 – Intelligent Functions Development

4.1 Target Audience and User Tasks

The KG application is designed to serve:

Data Scientists & Machine Learning Engineers:

To leverage graph-based features in survival prediction models by exploring passenger interrelationships.

Historians & Educators:

To interactively explore the social network of Titanic passengers and uncover underlying patterns.

Key user tasks include:

1. Exploratory Data Analysis and Relationship Insights:

Users can query the graph to discover clusters of passengers (e.g., those sharing tickets or cabins) and analyze how these relationships correlate with survival rates.

2. Graph Feature Engineering for Prediction:

Using graph algorithms (such as community detection and centrality measures) to derive new features from the passenger network. These features can be used to enhance traditional machine learning models for survival prediction.
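As a rough illustration of graph feature engineering, the sketch below derives three per-passenger features from an edge list: degree, the size of the passenger's connected group, and the survival rate of the other group members. It rests on assumptions. The toy data is invented, graph_features is a hypothetical helper, and connected components (computed with union-find) are used as a simple stand-in for a real community detection algorithm.

```python
from collections import defaultdict

# Invented toy data: passengerId -> survived flag, plus undirected TRAVELS_WITH pairs.
survived = {1: 0, 2: 1, 3: 1, 4: 0, 5: 1, 6: 0}
edges = [(1, 4), (2, 3), (3, 5)]


def find(parent, x):
    """Union-find lookup with path compression."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def graph_features(survived, edges):
    """Return {passengerId: {"degree", "group_size", "group_survival_others"}}."""
    parent = {pid: pid for pid in survived}
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
        parent[find(parent, a)] = find(parent, b)  # merge the two groups

    members = defaultdict(list)
    for pid in survived:
        members[find(parent, pid)].append(pid)

    features = {}
    for group in members.values():
        total = sum(survived[pid] for pid in group)
        others = len(group) - 1
        for pid in group:
            features[pid] = {
                "degree": degree[pid],
                "group_size": len(group),
                # Exclude the passenger's own label so the target does not leak into the feature.
                "group_survival_others": (total - survived[pid]) / others if others else None,
            }
    return features


features = graph_features(survived, edges)
print(features[3])
```

Leaving the passenger's own outcome out of the group survival rate matters when the feature feeds a prediction model, because otherwise the model would be given the label it is supposed to predict.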

4.2 Intelligent Function Design

To meet these user tasks, we propose the following intelligent functions:

1. Graph Clustering and Community Detection:

Objective: Identify clusters within the passenger network using algorithms like Louvain community detection (available in the Neo4j Graph Data Science library).

Implementation: Custom Cypher queries assign community identifiers to nodes, and statistics are computed to reveal survival rate trends within communities.

User Benefit: Enables data scientists to pinpoint influential clusters that might be key predictors of survival.

2. Dynamic Interactive Graph Visualization:

Objective: Provide an interactive front-end using tools such as Cytoscape.js, allowing users to zoom, filter, and explore the network in real time.

Implementation: A Node.js Web API will fetch graph data via Neo4j Cypher queries, and the results will be rendered on a web page with user controls for filtering and interaction.

User Benefit: Offers historians and educators a visually engaging method to understand the complex social relationships onboard the Titanic.
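For the visualization function, the API has to turn query results into the element list that Cytoscape.js consumes, where each element carries a "group" of "nodes" or "edges" and a "data" object with an id (plus source and target for edges). The sketch below shows one way to build that payload with a filter applied. It is written in Python purely for illustration, whereas the plan above is a Node.js API; the input rows, the "p" id prefix, and the to_cytoscape_elements helper are all assumptions.

```python
import json

# Invented rows, shaped like what a Cypher query might return.
nodes = [
    {"passengerId": 1, "name": "Passenger A", "survived": 0, "pclass": 3},
    {"passengerId": 2, "name": "Passenger B", "survived": 1, "pclass": 1},
    {"passengerId": 3, "name": "Passenger C", "survived": 1, "pclass": 1},
]
edges = [
    {"source": 2, "target": 3, "type": "TRAVELS_WITH"},
    {"source": 1, "target": 2, "type": "SHARES_CABIN_WITH"},
]


def to_cytoscape_elements(nodes, edges, keep=lambda node: True):
    """Build a Cytoscape.js-style elements list, keeping only nodes accepted by `keep`."""
    kept = [n for n in nodes if keep(n)]
    kept_ids = {n["passengerId"] for n in kept}
    elements = [
        {
            "group": "nodes",
            "data": {
                "id": f"p{n['passengerId']}",
                "label": n["name"],
                "survived": n["survived"],
                "pclass": n["pclass"],
            },
        }
        for n in kept
    ]
    for e in edges:
        # Skip edges whose endpoints were filtered out, so no edge points at a missing node.
        if e["source"] in kept_ids and e["target"] in kept_ids:
            elements.append({
                "group": "edges",
                "data": {
                    "id": f"p{e['source']}-{e['type']}-p{e['target']}",
                    "source": f"p{e['source']}",
                    "target": f"p{e['target']}",
                    "label": e["type"],
                },
            })
    return elements


# Example filter: show first-class passengers only.
first_class = to_cytoscape_elements(nodes, edges, keep=lambda n: n["pclass"] == 1)
print(json.dumps(first_class, indent=2))
```

Filtering on the server side keeps the payload small. The same predicate could instead be applied in the browser if all passengers are sent at once.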

5. Work in Progress and Next Steps

Current Progress

Data Import and Cleaning:

The Titanic dataset has been successfully cleaned and mapped to the graph schema. Passenger nodes and key relationships (TRAVELS_WITH, etc.) have been imported into AuraDB.

Graph Database Development:

The schema design has been implemented, and preliminary queries confirm correct mapping of nodes and relationships.

Intelligent Function Prototypes:

Early prototypes of the community detection and dynamic visualization functions have been developed and are currently undergoing iterative testing.

Next Steps

Data Refinement:

Enhance data mapping by incorporating additional relationships (e.g., SHARES_CABIN_WITH, IS_SPOUSE_OF) and refine transformation scripts for better accuracy.

Algorithm Optimization:

Optimize the graph algorithms to generate robust features for survival prediction and integrate these into the Node.js API.

Web Interface Development:

Develop and integrate an interactive web interface using Cytoscape.js, complete with input controls and visualization features, and deploy the solution on Google App Engine.

Comprehensive Testing:

Conduct end-to-end testing to ensure performance, scalability, and usability of the KG application.

6. Conclusion

This work-in-progress report presents the initial development of a Titanic Prediction Graph in Neo4j. By transforming the Titanic dataset into a rich knowledge graph and leveraging advanced graph algorithms, we aim to provide actionable insights into passenger survival dynamics. Our KG application, targeted at data scientists, historians, and educators, will facilitate interactive exploration and support enhanced survival prediction models. The next phases of the project will focus on data integration, algorithm refinement, and full deployment of the web interface.



Appendix

Below are Cypher scripts used for uploading data into Neo4j:

1. Create a Unique Constraint:

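// Guarantee one node per passengerId; the constraint also creates the index that MERGE uses for lookups.
CREATE CONSTRAINT unique_passenger_id IF NOT EXISTS
FOR (p:Passenger)
REQUIRE p.passengerId IS UNIQUE;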

2. Import Passenger Nodes:

// Create one Passenger node per CSV row. Properties set to null are simply not stored.
// Age is read with toFloat because the dataset contains fractional ages (e.g. infants under 1).
LOAD CSV WITH HEADERS FROM 'https://docs.google.com/spreadsheets/d/e/2PACX-1vRmy0R3vn98fWGdtsPxgnxvKItfcOl9qH6JNXBDczpwVEcFh5VqxQaUbWk2t5Ywclz0rxWtkEndmitD/pub?gid=788360882&single=true&output=csv' AS row
MERGE (p:Passenger { passengerId: toInteger(row.PassengerId) })
ON CREATE SET
   p.name = row.Name,
   p.sex = row.Sex,
   p.age = CASE WHEN row.Age = "" THEN null ELSE toFloat(row.Age) END,
   p.pclass = toInteger(row.Pclass),
   p.ticket = row.Ticket,
   p.cabin = row.Cabin,
   p.fare = CASE WHEN row.Fare = "" THEN null ELSE toFloat(row.Fare) END,
   p.survived = toInteger(row.Survived);

3. Create TRAVELS_WITH Relationship (Sharing the Same Ticket):

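// Link every pair of distinct passengers whose trimmed ticket values match.
// Both (p1, p2) and (p2, p1) satisfy the WHERE clause, so each pair is connected in both directions.
MATCH (p1:Passenger), (p2:Passenger)
WHERE p1.ticket IS NOT NULL AND p2.ticket IS NOT NULL
 AND trim(p1.ticket) = trim(p2.ticket)
 AND p1.passengerId <> p2.passengerId
MERGE (p1)-[:TRAVELS_WITH]->(p2);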


Similar scripts are used to create other relationships (such as SHARES_CABIN_WITH, IS_SPOUSE_OF, etc.).
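For relationships that depend only on a shared attribute, the TRAVELS_WITH script can be treated as a template. The sketch below generates the analogous Cypher text for another property, assuming the same matching logic applies; shared_property_query is a hypothetical helper, not part of the project's scripts. Family relationships such as IS_SPOUSE_OF need different logic that is not shown in this report, so they are not covered here. Note also that exact matching only links passengers whose cabin strings are identical, which may miss entries that list several cabins.

```python
import re

# Labels, relationship types and property keys cannot be passed as query parameters,
# so they are interpolated into the text and must be restricted to safe names.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def shared_property_query(rel_type, prop):
    """Return Cypher linking Passenger pairs whose `prop` values match after trimming."""
    for name in (rel_type, prop):
        if not IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return (
        "MATCH (p1:Passenger), (p2:Passenger)\n"
        f"WHERE p1.{prop} IS NOT NULL AND p2.{prop} IS NOT NULL\n"
        f" AND trim(p1.{prop}) = trim(p2.{prop})\n"
        " AND p1.passengerId <> p2.passengerId\n"
        f"MERGE (p1)-[:{rel_type}]->(p2);"
    )


query = shared_property_query("SHARES_CABIN_WITH", "cabin")
print(query)
```

The generated text can be pasted into the Neo4j Browser in the same way as the scripts above.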


[Figure: Example of passengers who travel together but have no specified family relationship]


[Figure: Demonstration of the final graph]

