Upload 7 files
- 1_π―_Business_case.py +36 -0
- pages/2_π_Data_handling_and_feature_engineering.py +117 -0
- pages/3_π_EDA_and_visualization.py +187 -0
- pages/4_π_Model_interpretation.py +114 -0
- pages/5_π²_Magenment_deck.py +73 -0
- pages/6_π_Callcenter_dashboard.py +235 -0
- requirements.txt +12 -0
1_π―_Business_case.py
ADDED
@@ -0,0 +1,36 @@
import streamlit as st
import pandas as pd
import numpy as np


st.set_page_config(page_title="Banking marketing campaign", page_icon=":phone:", layout="wide")

st.markdown("""
# Business problem:
### We're a bank trying to reach potential customers to subscribe to a term deposit.

Currently, the bank spends over **$220,000** contacting customers who ultimately **do not purchase** the term deposit, indicating inefficiency in campaign targeting.

Our project focuses on improving the **return on investment (ROI)** of these marketing efforts by:
- Identifying which customer segments are most likely to subscribe.
- Reducing wasted expenditure on customers unlikely to convert.
- Simulating different cost and profit scenarios to evaluate potential improvements.

The model uses a dataset that includes:
- **Demographics:** age, marital status, education level.
- **Financial details:** account balance, housing loans, personal loans.
- **Campaign history:** number of contacts, previous outcomes, and time since last contact.
- **Call information:** month and duration of the last contact.

By predicting the **probability of a new customer subscribing**, the model supports better targeting decisions.
It also informs the **call centre incentive structure**, offering higher bonuses for converting low-probability customers, encouraging efficiency and motivation among staff.

This is a direct-to-consumer marketing case, and our plan is to use the marketing budget as effectively as possible.

# Stakeholders
### - Management: Want to maximise ROI by understanding where marketing spend generates the most value and how to allocate resources effectively.
### - Marketing team leaders: Seek insights into which customer segments and campaign strategies yield higher conversion rates, enabling smarter decision-making and more efficient campaigns.
### - Call centre staff: Use model predictions to prioritise calls, improve success rates, and align bonuses with the difficulty of conversion, ensuring fair rewards for high-effort sales.

""")
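To make the incentive idea concrete, here is a small illustrative calculation of the per-call bonus used later in the call-centre dashboard page, where a successful conversion pays out in proportion to how unlikely the model thought it was. Only the `(1 - probability) * bonus` rule comes from the app itself; the probabilities and payout unit below are made up.

```python
# Illustrative only: the bonus rule mirrors the dashboard's (1 - probability) * bonus.
base_bonus = 10.0  # hypothetical payout unit set by management

for prob in (0.05, 0.50, 0.90):  # model-predicted subscription probabilities (made up)
    payout = (1 - prob) * base_bonus
    print(f"P(subscribe) = {prob:.0%} -> bonus on a successful upsell: {payout:.2f}")
# A hard sell (5% predicted) pays 9.50; an easy sell (90% predicted) only 1.00.
```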
pages/2_π_Data_handling_and_feature_engineering.py
ADDED
@@ -0,0 +1,117 @@
import streamlit as st
import time
import numpy as np

st.set_page_config(page_title="Data handling and feature", page_icon="π")

container1 = st.container(border=True, vertical_alignment="center")
container2 = st.container(border=True, vertical_alignment="center")

with container1:
    col1, col2 = st.columns([0.4, 0.6])
    with col1:
        st.markdown("""
# Explanation of dataset
The dataset can be broken down into three main components.

First, there are the client attributes. This is the standard information the bank would have, presuming that we're calling our existing customer base and not cold-calling new customers.
They are mainly categorical features, with age and balance being numerical.
Some of the categorical features are almost boolean, with NaN/unknown values.

Then we have the contact attributes. These are features describing how a customer was contacted, on what day and in which month, and how long the contact lasted.
We checked the dataset and found that collection took place from May 2008 to November 2010. We quickly realized that this gives the data a temporal structure. As we can see, we have the month and the day of month as variables. We know from the dataset that it starts in May 2008 and is ordered by contact date,
so the year is implicitly in the data: the first grouping of May belongs to 2008, the next to 2009, and so on for every month. From that we can build a datetime variable.


Lastly, there are the campaign attributes. There has been a previous campaign (which, as we will see later, causes some sample imbalance), and the `campaign` feature captures the number of contacts during the current campaign. `pdays` is the number of days since the customer was last contacted, `previous` the number of previous contacts, and `poutcome` the outcome of the previous campaign.

Importantly, `pdays` is -1 if the customer has not been contacted before, meaning that a -1 in this feature is different in kind from any other value, and we will later engineer a feature to capture this.

---
""")

    with col2:
        st.markdown("""

""")

        st.markdown("""

## Original data
### Client Attributes

| Column | Description |
| ----------- | ---------------------------------------------------------------------------------------------- |
| `age` | Age of the client (numeric). |
| `job` | Type of job (categorical). Examples: `admin.`, `technician`, `blue-collar`, `management`, etc. |
| `marital` | Marital status (categorical). Values: `married`, `single`, `divorced`. |
| `education` | Education level (categorical). Values: `primary`, `secondary`, `tertiary`, `unknown`. |
| `default` | Has credit in default? (categorical). Values: `yes`, `no`, `unknown`. |
| `balance` | Average yearly balance in euros (numeric). |
| `housing` | Has a housing loan? (categorical). Values: `yes`, `no`, `unknown`. |
| `loan` | Has a personal loan? (categorical). Values: `yes`, `no`, `unknown`. |

### Contact attributes

| Column | Description |
| ---------- | -------------------------------------------------------------------------------------------------------------------------- |
| `contact` | Communication type (categorical). Values: `cellular`, `telephone`. |
| `day` | Last contact day of the month (numeric). |
| `month` | Last contact month of the year (categorical). Values: `jan`, `feb`, `mar`, etc. |
| `duration` | Last contact duration, in seconds (numeric). |

### Campaign Attributes

| Column | Description |
| ---------- | ------------------------------------------------------------------------------------------------------------------------------------- |
| `campaign` | Number of contacts performed during this campaign for this client (numeric, includes last contact). |
| `pdays` | Number of days that passed after the client was last contacted in a previous campaign (-1 means client was not previously contacted). |
| `previous` | Number of contacts performed before this campaign for this client (numeric). |
| `poutcome` | Outcome of the previous marketing campaign (categorical). Values: `success`, `failure`, `other`, `unknown`. |


""")

with container2:
    col1, col2 = st.columns([0.4, 0.6])
    with col1:
        st.markdown("""
# Feature engineering

We did some feature engineering to make the data more suitable for modeling.
As mentioned earlier, we created a datetime variable from the day, month and year (inferred from the order of the data).

We also refined and added several new features to help the model better understand customer behavior and campaign performance.
For instance:
- We **binned `pdays`** into intervals like "No contact" or "0 - 5 months" to capture how long it's been since a customer was last reached.
- We **grouped previous contacts** (`n_previous_contacts`) to see how persistence impacts outcomes: too many calls might lower interest!
- We added **simple True/False flags** like `had_contact` and `is_single` to make the model pick up behavioral patterns more easily.
- We created an **"unknown contact" indicator** to handle cases where the contact type wasn't recorded, improving data consistency.
- We converted months into numbers (`month_num`) and inferred a **campaign year** so time-based patterns become clearer.
- We combined these into a full **datetime column (`date`)** and a **`year_month`** feature for easy trend analysis.
- Finally, we **capped extreme values** in `balance` and `campaign` to prevent outliers from distorting the model.

Together, these transformations made the dataset cleaner, more interpretable, and better aligned with real-world marketing insights.

""")

    with col2:
        st.markdown("""
### Feature Engineering

| Feature Name | Description |
|-------------------------------|---------------------------------------------------------------------------------------------------|
| `months_since_previous_contact` | Binned version of `pdays` into intervals (e.g., "No contact", "0 - 5 months", etc.) |
| `n_previous_contacts` | Binned version of `previous` into categories ("No contact", "1", ..., "More than 6") |
| `had_contact` | Boolean: True if client had previous contact (`months_since_previous_contact` ≠ "No contact") |
| `is_single` | Boolean: True if marital status is "single" |
| `uknown_contact` | Boolean: True if contact type is "unknown" |
| `month_num` | Numeric month extracted from categorical `month` |
| `year` | Year inferred from campaign sequence |
| `date` | Combined datetime column from day, month, and year |
| `year_month` | Year and month as datetime for time-based splitting |
| `balance` (capped) | Capped at 99th percentile to reduce outlier impact |
| `campaign` (capped) | Capped at 90th percentile for distribution analysis |

These features were created to improve model interpretability, handle outliers, and enable time-based splits.
""")
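The year inference and the `pdays` binning are only summarized on this page; below is a minimal sketch of how they could be implemented. It assumes a dataframe `df` with the raw `month`, `day`, and `pdays` columns in the original file order; the helper name and the bin edges are illustrative, not taken from the project's actual preprocessing code.

```python
import pandas as pd

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

def add_engineered_features(df: pd.DataFrame, start_year: int = 2008) -> pd.DataFrame:
    """Sketch of the date inference and pdays binning described above."""
    out = df.copy()

    # Month as a number, then infer the year: rows are in date order starting
    # May 2008, so the year increments whenever the month number jumps backwards.
    out["month_num"] = out["month"].map({m: i + 1 for i, m in enumerate(MONTHS)})
    year_rollover = (out["month_num"].diff().fillna(0) < 0).cumsum()
    out["year"] = start_year + year_rollover.astype(int)
    out["date"] = pd.to_datetime(
        out[["year", "month_num", "day"]].rename(columns={"month_num": "month"})
    )

    # pdays == -1 means "never contacted before", so it gets its own category before
    # binning the remaining values into coarse month ranges (edges are illustrative).
    out["months_since_previous_contact"] = pd.cut(
        out["pdays"],
        bins=[-2, -1, 150, 300, 450, float("inf")],
        labels=["No contact", "0 - 5 months", "5 - 10 months",
                "10 - 15 months", "More than 15 months"],
    )
    out["had_contact"] = out["months_since_previous_contact"] != "No contact"
    return out
```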
pages/3_π_EDA_and_visualization.py
ADDED
@@ -0,0 +1,187 @@
import streamlit as st
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import altair as alt
import requests
import zipfile
import io

# Page config must be the first Streamlit call on the page
st.set_page_config(page_title="EDA and Visualization", page_icon=":mag:", layout="wide")

# Use session state to load data only once per session
def load_data():
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip"
    r = requests.get(url)
    z = zipfile.ZipFile(io.BytesIO(r.content))
    df = pd.read_csv(z.open("bank-full.csv"), sep=";")
    return df

if "bank_df" not in st.session_state:
    st.session_state["bank_df"] = load_data()

df = st.session_state["bank_df"]

distribution_variables = ['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous']
imbalance_variables = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome']

with st.sidebar:
    st.header("Distribution visualizations")
    with st.form("Visualization options"):
        select_variable = st.multiselect("Select variable(s) to visualize", distribution_variables, default=['age', 'balance'])
        plot_type = st.radio("Plot type", options=["Histogram", "KDE"], index=0)
        submit_button = st.form_submit_button(label="Update")
    st.header("Imbalance plots")
    with st.form("Imbalance options"):
        select_imbalance = st.multiselect("Select variable(s) to visualize", imbalance_variables, default=['job', 'education'])
        submit_button2 = st.form_submit_button(label="Update")

st.title("Exploratory Data Analysis and Visualization")
st.markdown("""
This page allows you to explore the distribution of numerical variables and the imbalance of categorical variables in relation to the target variable `y` (whether the client subscribed to a term deposit). Use the sidebar to select variables and plot types.
- **Distribution Plots**: Choose numerical variables to see their distribution (histogram or KDE) segmented by the target variable `y`.
- **Imbalance Plots**: Choose categorical variables to visualize their class distribution segmented by the target variable `y`.

Depending on your selections, the plots will update accordingly, and our thoughts are attached after each plot.
""")
# Distribution plots (histogram or KDE by 'y')
if submit_button and select_variable:
    st.subheader("Distribution by Target (y)")
    n_cols = 2
    n_vars = len(select_variable)
    n_rows = (n_vars + n_cols - 1) // n_cols  # Ceiling division

    for row in range(n_rows):
        cols = st.columns(n_cols)
        for col_idx in range(n_cols):
            idx = row * n_cols + col_idx
            if idx >= n_vars:
                # If there are fewer plots than grid cells, leave empty
                continue
            var = select_variable[idx]
            col = cols[col_idx]
            if plot_type == "Histogram":
                chart = alt.Chart(df).mark_bar(opacity=0.7).encode(
                    x=alt.X(var, bin=alt.Bin(maxbins=30), title=var),
                    y=alt.Y('count()', title='Count'),
                    color=alt.Color('y', title='Subscribed'),
                    tooltip=[var, 'y']
                ).properties(
                    width=350,
                    height=250,
                    title=f"{var} histogram by subscription"
                )
                col.altair_chart(chart, use_container_width=True)
                # After plotting each distribution plot:
                if var == "age":
                    col.info("Age is right-skewed; most clients are between 30 and 50.")
                elif var == "balance":
                    col.info("Balance has a long tail, with most clients having low or negative balances. We will clip it for our model.")
                elif var == "day":
                    col.info("Day of month is fairly uniform, with slight peaks around the start and end of the month. Weekends may be affecting this.")
                elif var == "duration":
                    col.info("Duration is right-skewed, with many short calls and a few very long ones. We will clip it for our model.")
                elif var == "campaign":
                    col.info("Campaign calls are right-skewed, with most clients receiving few calls. We won't clip this for our model, as it might be informative or affect the target variable too much.")
                elif var == "pdays":
                    col.info("-1 indicates no previous contact, which is very common in the data. Other values are right-skewed, with many clients not contacted for a long time. We've decided to bin them in the model.")
                elif var == "previous":
                    col.info("Previous contacts are right-skewed, with most clients having few previous contacts. We will bin this for our model.")
            else:  # KDE
                fig, ax = plt.subplots(figsize=(4, 3))
                for label, color in zip(['yes', 'no'], ['green', 'red']):
                    subset = df[df['y'] == label][var]
                    sns.kdeplot(subset, label=f"y = {label}", color=color, fill=True, alpha=0.3, ax=ax)
                ax.set_title(f"{var} KDE by subscription")
                ax.set_xlabel(var)
                ax.set_ylabel("Density")
                ax.legend()
                col.pyplot(fig)
                plt.close(fig)
                # After plotting each distribution plot:
                if var == "age":
                    col.info("The density shows a peak around 35-40 years, with a slight difference between subscribers and non-subscribers. Notably, subscribers tend to be slightly older.")
                elif var == "balance":
                    col.info("Subscribers tend to have higher balances, as seen by the green curve shift.")
                elif var == "day":
                    col.info("The density is fairly uniform, with slight peaks, possibly around weekends or the start of the month. Subscribers show a slightly different pattern.")
                elif var == "duration":
                    col.info("Subscribers tend to have longer call durations, as seen by the green curve shift. However, using it would leak information: we can't use call duration to predict subscription, because we don't know in advance how long a call will last.")
                elif var == "campaign":
                    col.info("The density is right-skewed, with most clients having few campaign calls.")
                elif var == "pdays":
                    col.info("The density shows a peak at -1 (no previous contact). Subscribers tend to have been contacted more recently. We've binned this for our model.")
                elif var == "previous":
                    col.info("The density is right-skewed, with most clients having few previous contacts. We've binned this for our model.")



# Helper to cache imbalance proportions per variable in session state
def get_prop_df(var):
    cache_key = f"prop_df_{var}"
    if cache_key not in st.session_state:
        prop_df = (
            df.groupby(var)['y']
            .value_counts(normalize=True)
            .rename('proportion')
            .reset_index()
        )
        st.session_state[cache_key] = prop_df
    return st.session_state[cache_key]

# Imbalance plots (stacked bar by 'y')
if submit_button2 and select_imbalance:
    st.subheader("Imbalance by Target (y)")
    for var in select_imbalance:
        prop_df = get_prop_df(var)
        df_tooltip = pd.merge(df, prop_df, on=[var, 'y'], how='left')

        # Custom month order if plotting 'month'
        if var == "month":
            month_order = ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec']
            x_axis = alt.X(var, title=var, sort=month_order)
        else:
            x_axis = alt.X(var, title=var)

        chart = alt.Chart(df_tooltip).mark_bar().encode(
            x=x_axis,
            y=alt.Y('count()', stack='normalize', title='Proportion'),
            color=alt.Color('y', title='Subscribed', sort=['no', 'yes']),
            order=alt.Order('y', sort='ascending'),
            tooltip=[
                var,
                'y',
                alt.Tooltip('proportion:Q', title='Proportion', format='.2%'),
                alt.Tooltip('count():Q', title='Count')
            ]
        ).properties(
            width=350,
            height=650,
            title=f"{var} imbalance by subscription"
        )
        st.altair_chart(chart, use_container_width=True)
        # After plotting each imbalance plot:
        if var == "job":
            st.info("Certain jobs (e.g., management, retired and, perhaps surprisingly, student) have higher subscription rates, but not by much.")
        elif var == "education":
            st.info("Higher education levels seem correlated with higher subscription rates.")
        elif var == "marital":
            st.info("Subscription rates are fairly uniform across marital statuses, with slight variations.")
        elif var == "default":
            st.info("Clients with credit in default have a slightly lower subscription rate.")
        elif var == "housing":
            st.info("Clients with housing loans have a slightly lower subscription rate.")
        elif var == "loan":
            st.info("Clients with personal loans have a slightly lower subscription rate.")
        elif var == "contact":
            st.info("Contact method affects subscription rates, with cellular contacts having a higher rate.")
        elif var == "month":
            st.info("Subscription rates vary by month, with peaks in certain months (e.g., mar, sep, oct, and dec).")
        elif var == "poutcome":
            st.info("Previous campaign outcomes strongly influence subscription rates, with 'success' leading to much higher rates.")


# Thoughts based on the visualizations

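Several of the notes above mention clipping `balance` (and capping `campaign`) before modeling. Below is a minimal sketch of percentile capping, using the 99th/90th percentile cut-offs listed in the feature-engineering table on the previous page; the helper and the new column names are illustrative, not the project's actual code.

```python
# Sketch of percentile capping (assumes `df` as loaded on this page).
def cap_at_percentile(series, q):
    # Clip values above the q-th quantile to that quantile.
    return series.clip(upper=series.quantile(q))

df["balance_capped"] = cap_at_percentile(df["balance"], 0.99)   # 99th percentile per the feature table
df["campaign_capped"] = cap_at_percentile(df["campaign"], 0.90)  # 90th percentile, used for distribution analysis
```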
pages/4_π_Model_interpretation.py
ADDED
@@ -0,0 +1,114 @@
import streamlit as st
import shap
import matplotlib.pyplot as plt
import numpy as np
import joblib
import pandas as pd
from sklearn.metrics import roc_curve, roc_auc_score, precision_recall_curve, auc

st.set_page_config(page_title="Model Analysis Dashboard", layout="wide")
st.title("Model Analysis Dashboard")

# --- Load model and test data ---
@st.cache_data
def load_model_and_data():
    model = joblib.load("model_1mvp.pkl")
    df_test = pd.read_csv("test_data.csv")
    return model, df_test

model, df_test = load_model_and_data()

target = "y"
X_test = df_test.drop(columns=[target])
y_test = df_test[target]

preprocessor = model.named_steps["preprocessor"]
feature_names = preprocessor.get_feature_names_out()
X_test_transformed = preprocessor.transform(X_test)

# --- SHAP Explainer (precompute for efficiency) ---
explainer = shap.LinearExplainer(model.named_steps["classifier"], X_test_transformed, feature_names=feature_names)
shap_values = explainer.shap_values(X_test_transformed)
expected_value = explainer.expected_value

# --- Sidebar: Plot selection and controls ---
with st.sidebar.form("plot_selector"):
    st.markdown("## Select plots to display")
    show_coeff = st.checkbox("Logistic Regression Coefficients", value=True)
    show_shap_global = st.checkbox("SHAP Global (summary plot)", value=True)
    show_shap_local = st.checkbox("SHAP Local (waterfall plot)", value=False)
    show_roc = st.checkbox("ROC/PR Curves", value=True)
    top_n = st.slider("Number of top features for LogReg coefficients", 5, 30, 15)
    local_idx = st.number_input("Local SHAP sample index", min_value=0, max_value=len(X_test)-1, value=0)
    submitted = st.form_submit_button("Update plots")

# --- Logistic Regression Coefficient Plot ---
if show_coeff and submitted:
    st.header("Logistic Regression Coefficients")
    logreg_model = model.named_steps["classifier"]
    coefficients = logreg_model.coef_[0]
    importance = pd.DataFrame({
        "feature": feature_names,
        "coefficient": coefficients
    }).sort_values(by="coefficient", key=abs, ascending=False)
    fig, ax = plt.subplots(figsize=(8, 6))
    importance.head(top_n).set_index("feature")["coefficient"].plot(kind="barh", ax=ax, color="#4C72B0")
    ax.set_title("Logistic Regression Feature Importance (Coefficients)")
    ax.set_xlabel("Coefficient Value")
    ax.set_ylabel("Feature")
    st.pyplot(fig)
    st.dataframe(importance.head(top_n).style.format({"coefficient": "{:.3f}"}))

# --- SHAP Analysis ---
if (show_shap_global or show_shap_local) and submitted:
    st.header("SHAP Analysis")
    if show_shap_global:
        st.subheader("Global Feature Importance (SHAP Summary Plot)")
        fig, ax = plt.subplots(figsize=(10, 6))
        shap.summary_plot(shap_values, X_test_transformed, feature_names=feature_names, show=False)
        st.pyplot(fig)
    if show_shap_local:
        st.subheader("Local Explanation (SHAP Waterfall Plot)")
        fig2, ax2 = plt.subplots(figsize=(10, 6))
        shap.plots.waterfall(
            shap.Explanation(
                values=shap_values[local_idx],
                base_values=expected_value,
                data=X_test_transformed[local_idx],
                feature_names=feature_names
            ),
            max_display=15,
            show=False
        )
        st.pyplot(fig2)

# --- ROC and PR Curves ---
if show_roc and submitted:
    st.header("Model Performance Metrics (ROC / PR Curves)")
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    roc_auc = roc_auc_score(y_test, y_pred_proba)
    fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
    precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
    pr_auc = auc(recall, precision)

    col1, col2 = st.columns(2)
    with col1:
        st.metric("ROC AUC", f"{roc_auc:.3f}")
    with col2:
        st.metric("PR AUC", f"{pr_auc:.3f}")

    fig1, ax1 = plt.subplots(figsize=(5, 5))
    ax1.plot(fpr, tpr, color="darkorange", lw=2, label=f"ROC curve (AUC = {roc_auc:.3f})")
    ax1.plot([0, 1], [0, 1], color="navy", lw=2, linestyle="--", label="Random Guess")
    ax1.set_xlabel("False Positive Rate")
    ax1.set_ylabel("True Positive Rate")
    ax1.set_title("ROC Curve")
    ax1.legend()
    st.pyplot(fig1)

    fig2, ax2 = plt.subplots(figsize=(5, 5))
    ax2.plot(recall, precision, color="#C44E52")
    ax2.set_xlabel("Recall")
    ax2.set_ylabel("Precision")
    ax2.set_title("Precision-Recall Curve")
    st.pyplot(fig2)
pages/5_π²_Magenment_deck.py
ADDED
@@ -0,0 +1,73 @@
import streamlit as st
import pandas as pd
import numpy as np
import joblib
import matplotlib.pyplot as plt

st.set_page_config(page_title="π² Management Deck: Cost Analysis", layout="wide")
st.title("π² Management Deck: Cost Analysis of Model Decisions")

# --- Load model and test data ---
@st.cache_data
def load_model_and_data():
    model = joblib.load("model_1mvp.pkl")
    df_test = pd.read_csv("test_data.csv")
    return model, df_test

model, df_test = load_model_and_data()
target = "y"
X_test = df_test.drop(columns=[target])
y_test = df_test[target]

# --- Sidebar: Cost selection ---
with st.sidebar.form("cost_selector"):
    st.markdown("## Set Cost Parameters")
    cost_fp = st.number_input("Cost of False Positive (FP)", min_value=0, value=5, step=1)
    cost_fn = st.number_input("Cost of False Negative (FN)", min_value=0, value=30, step=1)
    submitted = st.form_submit_button("Update Cost Analysis")

if submitted:
    st.subheader("Cost Analysis Based on Model Predictions")

    # Predict probabilities and classes
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    threshold = 0.5
    y_pred = (y_pred_proba >= threshold).astype(int)

    # Confusion matrix components
    FP = np.sum((y_pred == 1) & (y_test == 0))
    FN = np.sum((y_pred == 0) & (y_test == 1))
    TP = np.sum((y_pred == 1) & (y_test == 1))
    TN = np.sum((y_pred == 0) & (y_test == 0))

    total_cost = FP * cost_fp + FN * cost_fn

    st.markdown(f"""
**Threshold:** {threshold:.2f}
- **False Positives (FP):** {FP} × {cost_fp} = {FP * cost_fp}
- **False Negatives (FN):** {FN} × {cost_fn} = {FN * cost_fn}
- **True Positives (TP):** {TP}
- **True Negatives (TN):** {TN}
---
## **Total Cost: {total_cost}**
""")

    # Optional: Show cost as a function of threshold
    st.subheader("Cost vs. Threshold")
    thresholds = np.linspace(0, 1, 120)
    costs = []
    for t in thresholds:
        y_pred_t = (y_pred_proba >= t).astype(int)
        FP_t = np.sum((y_pred_t == 1) & (y_test == 0))
        FN_t = np.sum((y_pred_t == 0) & (y_test == 1))
        costs.append(FP_t * cost_fp + FN_t * cost_fn)
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.plot(thresholds, costs, label="Total Cost")
    ax.axvline(threshold, color="red", linestyle="--", label=f"Current threshold = {threshold:.2f}")
    ax.set_xlabel("Threshold")
    ax.set_ylabel("Total Cost")
    ax.set_title("Total Cost vs. Classification Threshold")
    ax.legend()
    st.pyplot(fig)

st.caption("You can adjust the costs in the sidebar to see their impact on the total cost and optimal threshold.")
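The caption above mentions an optimal threshold, but the page only plots the cost curve. A small extension (a sketch, reusing the `thresholds` and `costs` arrays computed inside the `if submitted:` block) could report the cost-minimizing threshold explicitly:

```python
# Sketch: report the cost-minimizing threshold from the sweep above.
best_idx = int(np.argmin(costs))
st.metric(
    "Cost-minimizing threshold",
    f"{thresholds[best_idx]:.2f}",
    help=f"Total cost at this threshold: {costs[best_idx]}",
)
```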
pages/6_π_Callcenter_dashboard.py
ADDED
@@ -0,0 +1,235 @@
import streamlit as st
from streamlit_extras.let_it_rain import rain
import requests
import random
import pandas as pd
import datetime


st.title("π Callcenter Dashboard")

with st.expander("ℹ️ - About this dashboard", expanded=False):
    st.markdown(
        """
This dashboard simulates a call center environment where agents can manage a queue of customers to upsell a long term deposit bank product.
In the original paper that accompanied the dataset, the authors mention that there were inbound calls as well, but these are not present in the dataset.
The dashboard fetches customer data from an API (NocoDB with test and synthetic data), displays customer information, and uses a machine learning model to predict the likelihood of a successful upsell.

**How to use the dashboard:**
1. Set the queue size and upsell bonus in the sidebar. The bonus is simply a multiplier for the potential earnings from successful upsells.
2. View the current queue of customers and their details.
3. For each customer, see the model's predicted probability of subscription.
4. After each call, indicate whether the upsell was successful and submit the result.
5. Track your total bonus based on successful upsells.

**TIP:** see what happens when the queue is empty π
"""
    )

# --- Sidebar: Set queue size and bonus, and show model probability ---
with st.sidebar:
    st.header("Queue Settings")
    queue_size = st.number_input("Queue size", min_value=1, max_value=50, value=10, step=1)
    bonus = st.number_input("Upsell Bonus (currency/unit)", min_value=1.0, value=10.0, step=1.0)
    if st.button("Reset Queue"):
        st.session_state.queue = None  # Force re-fetch
        st.session_state.total_bonus = 0.0
    # Placeholder for model probability
    model_prob_placeholder = st.empty()


# --- Cached data fetch ---
@st.cache_data(show_spinner=False)
def fetch_customers(limit):
    API_DATA_URL = "https://dun3co-sdc-nocodb.hf.space/api/v2/tables/m39a8axnn3980w9/records"
    API_DATA_TOKEN = st.secrets["NOCODB_TOKEN"]
    HEADERS = {"xc-token": API_DATA_TOKEN}
    params = {"offset": 0, "limit": limit, "viewId": "vwjuv5jnaet9npuu"}
    res = requests.get(API_DATA_URL, headers=HEADERS, params=params)
    res.raise_for_status()
    return res.json()["list"]

# --- Initialize or reset queue and bonus ---
if "queue" not in st.session_state or st.session_state.queue is None:
    records = fetch_customers(queue_size)
    st.session_state.queue = random.sample(records, len(records))
if "total_bonus" not in st.session_state:
    st.session_state.total_bonus = 0.0

# --- Calculate maximum potential bonus for the remaining queue ---
def get_max_potential_bonus(queue, bonus):
    if not queue:
        return 0.0, []
    API_MODEL_URL = "https://dun3co-marketing-lr-prediction.hf.space/predict"
    inputs = []
    for row in queue:
        inputs.append({
            "age": int(row["age"]),
            "balance": float(row["balance"]),
            "day": int(row["day"]),
            "campaign": int(row["campaign"]),
            "job": str(row["job"]),
            "education": str(row["education"]),
            "default": str(row["default"]),
            "housing": str(row["housing"]),
            "loan": str(row["loan"]),
            "months_since_previous_contact": str(row["months_since_previous_contact"]),
            "n_previous_contacts": str(row["n_previous_contacts"]),
            "poutcome": str(row["poutcome"]),
            "had_contact": bool(row["had_contact"]),
            "is_single": bool(row["is_single"]),
            "uknown_contact": bool(row["uknown_contact"]),
        })
    try:
        response = requests.post(API_MODEL_URL, json={"data": inputs})
        response.raise_for_status()
        probabilities = response.json()["probabilities"]
        max_bonus = sum((1 - p) * bonus for p in probabilities)
        return max_bonus, probabilities
    except Exception:
        return None, None

# --- 3. Show queue visually and bonus info ---
#st.subheader("Queue")

# Layout: queue info (left), bonus info (center), (right column left empty for centering)
queue_col, bonus_col, empty_col = st.columns([2, 1.2, 0.8])

with queue_col:
    st.subheader("Queue")
    for i, row in enumerate(st.session_state.queue):
        st.write(f"Position {i+1}: {row['job']} ({row['age']} yrs, {row['education']})")

# Calculate max potential bonus and get probabilities for queue
max_potential_bonus, queue_probabilities = get_max_potential_bonus(st.session_state.queue, bonus)

# --- 4. Simulate next call ---
if st.session_state.queue:
    st.subheader("Active Call")
    active_row = st.session_state.queue[0]

    # Use current day of month if possible, fallback to API day
    today_day = datetime.datetime.now().day
    try:
        day_value = int(today_day)
    except Exception:
        day_value = int(active_row["day"])

    # Prepare model input for active call
    input_row = {
        "age": int(active_row["age"]),
        "balance": float(active_row["balance"]),
        "day": day_value,
        "campaign": int(active_row["campaign"]),
        "job": str(active_row["job"]),
        "education": str(active_row["education"]),
        "default": str(active_row["default"]),
        "housing": str(active_row["housing"]),
        "loan": str(active_row["loan"]),
        "months_since_previous_contact": str(active_row["months_since_previous_contact"]),
        "n_previous_contacts": str(active_row["n_previous_contacts"]),
        "poutcome": str(active_row["poutcome"]),
        "had_contact": bool(active_row["had_contact"]),
        "is_single": bool(active_row["is_single"]),
        "uknown_contact": bool(active_row["uknown_contact"]),
    }
    payload = {"data": [input_row]}

    # --- 5. Get model prediction for active call ---
    API_MODEL_URL = "https://dun3co-marketing-lr-prediction.hf.space/predict"
    try:
        response = requests.post(API_MODEL_URL, json=payload)
        response.raise_for_status()
        result = response.json()
        probability = result["probabilities"][0]
        # Show in sidebar
        model_prob_placeholder.metric("Model Probability (Subscribe)", f"{probability:.2%}")
    except Exception as e:
        st.error(f"Model API call failed: {e}")
        probability = None
        model_prob_placeholder.metric("Model Probability (Subscribe)", "N/A")

    # --- Customer info as tiles ---
    st.write("### Customer Information")
    keys = [k for k in active_row.keys() if k != "y"]  # Dropping the target variable "y"
    values = [active_row[k] for k in keys]
    n_cols = 4
    cols = st.columns(n_cols)
    for i, key in enumerate(keys):
        col = cols[i % n_cols]
        with col:
            # Show the current day_value for the "day" field
            display_value = day_value if key == "day" else values[i]
            st.markdown(
                f"""
                <div style="
                    border: 2px solid #e6e6e6;
                    border-radius: 16px;
                    padding: 18px 10px 14px 10px;
                    margin-bottom: 1em;
                    background: linear-gradient(135deg, #f9f9f9 80%, #eaf6ff 100%);
                    box-shadow: 0 2px 8px 0 rgba(0,0,0,0.04);
                    min-height: 80px;
                    text-align: center;
                ">
                    <div style="font-size: 1.05em; font-weight: 600; color: #2c3e50; margin-bottom: 0.3em;">
                        {key.replace('_', ' ').capitalize()}
                    </div>
                    <div style="font-size: 1.15em; color: #0074d9;">
                        {display_value}
                    </div>
                </div>
                """,
                unsafe_allow_html=True,
            )

    # --- Bonus info and worker action column ---
    with bonus_col:
        st.markdown(
            """
            <div style="border:2px solid #e6e6e6; border-radius:14px; padding:18px 14px; background:#f8fbff; margin-bottom:1em;">
                <div style="font-size:1.2em; font-weight:700; margin-bottom:1em;">Bonus KPI's</div>
                <div style="font-size:1.1em; margin-bottom:0.7em;">
                    <b>Current Bonus:</b> <span style="color:#0074d9;">{current_bonus}</span>
                </div>
                <div style="font-size:1.1em; margin-bottom:0.7em;">
                    <b>Current Call Bonus:</b> <span style="color:#28a745;">{current_call_bonus}</span>
                </div>
                <div style="font-size:1.1em;">
                    <b>Max Potential Bonus:</b> <span style="color:#ff851b;">{max_potential_bonus}</span>
                </div>
            </div>
            """.format(
                current_bonus=f"{st.session_state.total_bonus:.2f}",
                current_call_bonus=f"{(1 - probability) * bonus:.2f}" if probability is not None else "N/A",
                max_potential_bonus=f"{max_potential_bonus:.2f}" if max_potential_bonus is not None else "N/A"
            ),
            unsafe_allow_html=True,
        )

        # Plain Streamlit widgets for worker action (no custom styling)
        st.subheader("Callcenter Worker Action")
        upsell = st.radio("Did you upsell?", options=["Yes", "No"], key="upsell_radio", horizontal=True)
        submit = st.button("Submit", disabled=not st.session_state.queue, key="upsell_submit")

    if submit:
        if upsell == "Yes" and probability is not None:
            st.session_state.total_bonus += (1 - probability) * bonus
        st.session_state.queue.pop(0)
        st.rerun()

else:
    rain(emoji="πΈ", font_size=54, falling_speed=5, animation_length="infinite")
    st.success("Queue is empty! All calls handled.")
    st.markdown(
        f"""
        <div style="border:2px solid #e6e6e6; border-radius:14px; padding:18px 14px; background:#f8fbff; margin-bottom:1em;">
            <div style="font-size:1.2em; font-weight:700; margin-bottom:1em;">Total Bonus Earned</div>
            <div style="font-size:2em; color:#0074d9; text-align:center;">
                {st.session_state.total_bonus:.2f}
            </div>
        </div>
        """,
        unsafe_allow_html=True,
    )
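For reference, the prediction endpoint the dashboard calls can also be exercised directly. The sketch below mirrors the `payload = {"data": [input_row]}` schema built above; the field values are made-up examples (category labels follow the feature-engineering page), not real customer data.

```python
import requests

# Minimal standalone call to the same prediction API used by the dashboard.
# Field values are illustrative; the schema mirrors `input_row` above.
payload = {"data": [{
    "age": 35, "balance": 1200.0, "day": 15, "campaign": 2,
    "job": "technician", "education": "secondary", "default": "no",
    "housing": "yes", "loan": "no",
    "months_since_previous_contact": "No contact",
    "n_previous_contacts": "No contact",
    "poutcome": "unknown",
    "had_contact": False, "is_single": True, "uknown_contact": False,
}]}

resp = requests.post("https://dun3co-marketing-lr-prediction.hf.space/predict", json=payload, timeout=30)
resp.raise_for_status()
print(resp.json()["probabilities"])  # a list with one subscription probability
```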
requirements.txt
ADDED
@@ -0,0 +1,12 @@
scikit-learn==1.7.2
joblib==1.5.2
numpy==2.3.1
pandas==2.3.2
matplotlib
seaborn
shap
altair
requests
xgboost
optuna
plotly
streamlit-extras  # needed by pages/6 (streamlit_extras.let_it_rain); appears to be missing from the original list
|