Expert and Professional Data Analyst GPT

@PromptBoutique

ID: 5675
Words in prompt: 93

Explore the power of data analysis with this meticulously crafted prompt template. Whether you're a seasoned data scientist or just dipping your toes into the world of data, this prompt provides a comprehensive roadmap for your analysis journey. Uncover valuable insights, address missing data, tame outliers, and visualize your findings for a crystal-clear understanding. With the flexibility to tailor it to your specific dataset and analysis goals, this prompt is your go-to companion in making data-driven decisions. Elevate your data analysis game and let this template guide you through the intricate process, step by step.

Created: 2023-11-06

In categories: Generation

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load the Titanic dataset
# Note: the steps below assume Kaggle-style column names (as in the example
# output); the file at this URL may name its columns differently, so check
# df.columns and adjust if needed.
data_url = "https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv"
df = pd.read_csv(data_url)

# Describe key features and variables
# df.info() prints its summary immediately and returns None
df_info = df.info()
summary_stats = df.describe()

# Identify missing values and outliers
missing_values = df.isnull().sum()
# Box plot: points beyond the whiskers are potential outliers
outliers = df[["Age", "Fare"]].boxplot()

# Data preprocessing
# Impute missing values for Age and Fare
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Fare"] = df["Fare"].fillna(df["Fare"].median())
# Encode categorical variables (e.g., 'Sex' and 'Embarked') as numerical
categorical_cols = [col for col in ["Sex", "Embarked"] if col in df.columns]
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True, dtype=int)
# Drop identifier and free-text columns, which the classifier cannot use as-is
df = df.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"], errors="ignore")

# Split the data into training and testing sets
X = df.drop("Survived", axis=1)
y = df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build a Random Forest classifier for survival prediction
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)
y_pred = rf_classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
classification_report_result = classification_report(y_test, y_pred)

# Visualize data
sns.pairplot(df, hue="Survived")
plt.show()

# Interpret results and provide recommendations
# The Random Forest model achieved an accuracy of [accuracy] on the test set.
# Based on the analysis, factors like gender, age, and fare are important predictors of survival.
# Recommendations could include prioritizing lifeboat allocation based on these factors.
print(df_info)  # prints "None", because df.info() has already printed its summary
print(summary_stats)
print(missing_values)
print("Accuracy:", accuracy)
print(classification_report_result)
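
The interpretation notes state that gender, age, and fare are important predictors, but the script never measures this. One possible way to check such a claim is to read the `feature_importances_` attribute of a fitted Random Forest. The sketch below uses small synthetic data (not the Titanic dataset) with hypothetical column names `sex_male`, `fare`, and `noise`, so the numbers it prints say nothing about the real passengers.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (NOT the Titanic dataset): the label depends
# strongly on "sex_male", weakly on "fare", and not at all on "noise".
rng = np.random.default_rng(42)
n = 500
X_demo = pd.DataFrame({
    "sex_male": rng.integers(0, 2, size=n),
    "fare": rng.uniform(0, 100, size=n),
    "noise": rng.normal(size=n),
})
y_demo = ((X_demo["sex_male"] == 0) | (X_demo["fare"] > 90)).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_demo, y_demo)

# Impurity-based importances, one value per feature, summing to 1
importances = pd.Series(model.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Applied to the analysis script, the same two lines would take the fitted classifier and the columns of the training feature matrix. Impurity-based importances can favor high-cardinality features, so they are best read as a rough ranking rather than proof.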
Example output of the Titanic analysis script:

RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
…
dtypes: …
memory usage: …
None
       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200
PassengerId      0
Survived         0
Pclass           0
…
dtype: int64
Accuracy: 0.8268156424581006
              precision    recall  f1-score   support

           0       0.84      0.88      0.86       105
           1       0.80      0.74      0.77        74

    accuracy                           0.83       179
   macro avg       0.82      0.81      0.82       179
weighted avg       0.83      0.83      0.83       179
In this example, we performed data analysis on the Titanic survival dataset, including data preprocessing, machine learning modeling, and data visualization. The printed output includes dataset information, summary statistics, missing-value counts, model accuracy, and a classification report; outliers are shown only in the box plot, not in the printed output.
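
To get outlier information in printed form as well, one common option is the 1.5 × IQR rule, which is the same rule a box plot uses by default to draw its whiskers. The following is a minimal sketch under that assumption; the helper name `iqr_outliers` and the sample fares are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries of `values` outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Made-up fares for illustration only, not taken from the dataset
fares = pd.Series([7.25, 8.05, 13.0, 14.45, 26.0, 31.0, 512.33])
flagged = iqr_outliers(fares)
print(flagged.tolist())  # [512.33]
```

In the analysis itself, the same helper could be called on the Age and Fare columns before imputation, and `len(flagged)` printed next to the missing-value counts.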