Context-Aware Nutritional Assessment¶

Predicting Food Processing Tiers through Machine Learning¶

Notebook: 01 - Exploratory Data Analysis¶

AAI-590 Capstone Project - University of San Diego¶

Team Members:¶

  • Jamshed Nabizada
  • Swapnil Patil

Objective¶

The primary objective of this project is to develop an intelligent nutritional assessment system that predicts the level of food processing from nutritional composition, additive information, and NOVA classification data in the Open Food Facts dataset. The project also aims to support more informed dietary decisions by applying machine learning to identify foods that appear healthy based on their macronutrients but may still be highly processed.

Exploratory Data Analysis¶

In [ ]:
%pip install -q -r requirements.txt

1. Import Libraries and Configuration¶

In [1]:
import warnings
warnings.filterwarnings("ignore")

# force-reload src.eda modules so code changes are always picked up
# (without this, "Run All" reuses stale cached imports)
%load_ext autoreload
%autoreload 2

from IPython.display import display
import subprocess
import os

from src.eda import (
    COMPARE_COLS,
    DEFAULT_LIGHT_DATASET_PATH,
    FULL_DATASET_PATH,
    GRADE_ORDER,
    NOVA_ORDER,
    OpenFoodFactsEDADataLoader,
    OpenFoodFactsEDAPlotter,
    cap_outliers,
    compute_high_correlation_pairs,
    compute_kruskal_summary,
    impute_with_global_median,
    print_dataset_overview,
)

loader = OpenFoodFactsEDADataLoader()
plotter = OpenFoodFactsEDAPlotter()

DATASET_PATH = FULL_DATASET_PATH  # or DEFAULT_LIGHT_DATASET_PATH for the sampled subset

# create the light dataset if the chosen dataset does not exist;
# the final trial can run on a larger portion of the dataset, or all of it
if not os.path.exists(DATASET_PATH):
    print("Light dataset not found. Creating a sampled dataset...")

    subprocess.run([
        "python",
        "./scripts/create_light_dataset.py",
        "--local",
        "--random",
        "--target-rows",
        "500000"
    ], check=True)  # fail loudly if the sampling script errors

    print("Light dataset created.")

print(f"Loading dataset from: {DATASET_PATH}")
Loading dataset from: dataset\en.openfoodfacts.org.products.csv

The light dataset was used because it provides a smaller, more manageable subset of the full Open Food Facts data, making data cleaning, exploratory analysis, and model development more efficient. It also reduces computational overhead while preserving the key nutritional and processing-related features needed to identify patterns in NOVA classification.
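The `create_light_dataset.py` script itself is not shown in this notebook. As a rough illustration of the idea, a proportional chunked-sampling approach might look like the sketch below (the function name `sample_light_dataset` is hypothetical, and a tab-separated export is assumed, matching the delimiter detected later):

```python
import io

import pandas as pd


def sample_light_dataset(csv_buffer, target_rows, total_rows, chunksize=2, seed=42):
    """Down-sample a large delimited export by sampling each chunk proportionally.

    Hypothetical sketch: reads the file in chunks so the full dataset never
    has to fit in memory at once.
    """
    frac = min(1.0, target_rows / total_rows)  # overall fraction of rows to keep
    sampled = []
    for chunk in pd.read_csv(csv_buffer, sep="\t", chunksize=chunksize):
        sampled.append(chunk.sample(frac=frac, random_state=seed))
    return pd.concat(sampled, ignore_index=True)


# tiny in-memory TSV standing in for the full Open Food Facts export
raw = "code\tenergy_100g\n" + "\n".join(f"{i}\t{i * 10}" for i in range(100))
light = sample_light_dataset(io.StringIO(raw), target_rows=50, total_rows=100)
print(len(light))  # 50: half of the 100 rows survive
```

Chunked sampling keeps memory bounded, at the cost of an approximate (per-chunk rounded) row count for large `chunksize` values.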

2. Load and Inspect Dataset¶

In [2]:
dataset = loader.load(DATASET_PATH)
df = dataset.df
NUTRIENT_COLS = dataset.nutrient_cols
META_COLS = dataset.meta_cols

# drop ultra-sparse features
drop_cols = [
    "trans-fat_100g",
    "monounsaturated-fat_100g",
    "polyunsaturated-fat_100g",
    "starch_100g"
]

df = df.drop(columns=drop_cols, errors="ignore")

# update nutrient columns FIRST
NUTRIENT_COLS = [c for c in NUTRIENT_COLS if c not in drop_cols]

# remove invalid negative nutritional values
for col in NUTRIENT_COLS:
    df = df[df[col].isna() | (df[col] >= 0)]

print_dataset_overview(dataset)
df.head(3)
Detected delimiter: TAB
Dataset path: dataset\en.openfoodfacts.org.products.csv
Shape: (718492, 28)
Loaded columns (28): ['added-sugars_100g', 'additives_n', 'additives_tags', 'brands', 'carbohydrates_100g', 'categories_en', 'code', 'countries_en', 'energy_100g', 'fat_100g', 'fiber_100g', 'ingredients_analysis_tags', 'ingredients_text', 'monounsaturated-fat_100g', 'nova_group', 'nutriscore_score', 'nutrition_grade_fr', 'pnns_groups_1', 'pnns_groups_2', 'polyunsaturated-fat_100g', 'product_name', 'proteins_100g', 'salt_100g', 'saturated-fat_100g', 'sodium_100g', 'starch_100g', 'sugars_100g', 'trans-fat_100g']
Nutrient columns used in EDA (14): ['energy_100g', 'fat_100g', 'saturated-fat_100g', 'carbohydrates_100g', 'sugars_100g', 'fiber_100g', 'proteins_100g', 'salt_100g', 'sodium_100g', 'trans-fat_100g', 'added-sugars_100g', 'monounsaturated-fat_100g', 'polyunsaturated-fat_100g', 'starch_100g']

Nutri-Score distribution:
nutrition_grade_fr
a        58473
b        47361
c        95796
d       103079
e       107571
<NA>    306212
Name: count, dtype: Int64

NOVA distribution:
nova_group
1        37036
2        13573
3        62473
4       194012
<NA>    411398
Name: count, dtype: Int64
Out[2]:
code product_name brands categories_en countries_en ingredients_text ingredients_analysis_tags additives_n additives_tags nutriscore_score nutrition_grade_fr nova_group pnns_groups_1 pnns_groups_2 energy_100g fat_100g saturated-fat_100g carbohydrates_100g sugars_100g added-sugars_100g fiber_100g proteins_100g salt_100g sodium_100g
798 0000100000724 Ben's Pure Maple Cream Ben's Sugar Shack, Jeff de Bruges NaN France,World NaN NaN NaN NaN NaN <NA> <NA> unknown unknown 2221.960 30.280 19.300 59.120 37.720 NaN NaN 5.680 0.270 0.108
841 00001001 pasta Grappa Beverages and beverages preparations,Beverages France,Germany NaN NaN NaN NaN 5.000 c <NA> unknown unknown 697.100 6.400 1.300 20.000 1.900 NaN 0.800 6.700 0.001 0.000
877 0000101019680 Donut Milka Milka Snacks,Sweet snacks,Biscuits and cakes,Cakes,D... France NaN NaN NaN NaN 23.000 e <NA> Sugary snacks Biscuits and cakes 1928.500 28.000 13.800 46.500 18.000 NaN NaN 6.000 0.649 0.260
In [3]:
# verify that the sparse features dropped in the previous cell are gone
print("Updated nutrient columns:", NUTRIENT_COLS)
Updated nutrient columns: ['energy_100g', 'fat_100g', 'saturated-fat_100g', 'carbohydrates_100g', 'sugars_100g', 'fiber_100g', 'proteins_100g', 'salt_100g', 'sodium_100g', 'added-sugars_100g']
In [4]:
df.info(memory_usage="deep")
<class 'pandas.core.frame.DataFrame'>
Index: 718325 entries, 798 to 4436578
Data columns (total 24 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   code                       718325 non-null  object 
 1   product_name               708294 non-null  object 
 2   brands                     531381 non-null  object 
 3   categories_en              472779 non-null  object 
 4   countries_en               717894 non-null  object 
 5   ingredients_text           337385 non-null  object 
 6   ingredients_analysis_tags  346320 non-null  object 
 7   additives_n                337386 non-null  float64
 8   additives_tags             183678 non-null  object 
 9   nutriscore_score           412156 non-null  float64
 10  nutrition_grade_fr         412156 non-null  string 
 11  nova_group                 306932 non-null  Int64  
 12  pnns_groups_1              718325 non-null  object 
 13  pnns_groups_2              718325 non-null  object 
 14  energy_100g                695312 non-null  float64
 15  fat_100g                   689555 non-null  float64
 16  saturated-fat_100g         673682 non-null  float64
 17  carbohydrates_100g         689364 non-null  float64
 18  sugars_100g                677249 non-null  float64
 19  added-sugars_100g          347432 non-null  float64
 20  fiber_100g                 310560 non-null  float64
 21  proteins_100g              690356 non-null  float64
 22  salt_100g                  664787 non-null  float64
 23  sodium_100g                664787 non-null  float64
dtypes: Int64(1), float64(12), object(10), string(1)
memory usage: 664.5 MB
In [5]:
df[NUTRIENT_COLS].describe().T.round(2)
Out[5]:
count mean std min 25% 50% 75% max
energy_100g 695312.000 2510.650 1158989.640 0.000 439.000 1053.600 1648.000 966426848.380
fat_100g 689555.000 38.830 21139.360 0.000 1.000 6.800 20.900 17554003.530
saturated-fat_100g 673682.000 6.050 718.550 0.000 0.200 1.900 7.000 588000.000
carbohydrates_100g 689364.000 48.390 17439.640 0.000 3.300 14.100 51.800 14479774.250
sugars_100g 677249.000 14778.730 12151385.900 0.000 0.720 3.800 16.670 10000000000.000
fiber_100g 310560.000 35.350 17944.360 0.000 0.000 1.500 3.640 10000000.000
proteins_100g 690356.000 1463.630 1203558.460 0.000 2.000 6.300 12.500 1000000000.000
salt_100g 664787.000 332.290 269236.980 0.000 0.070 0.490 1.280 219520780.940
sodium_100g 664787.000 132.920 107694.790 0.000 0.030 0.190 0.510 87808312.380
added-sugars_100g 347432.000 287834.800 169654386.010 0.000 0.000 0.000 7.400 100000000000.000
In [6]:
# cap extreme outliers at the 1st and 99th percentiles
# the summary above shows extreme, likely erroneous, maxima
df_capped, out_df = cap_outliers(df, NUTRIENT_COLS)
df = df_capped.copy()
df[NUTRIENT_COLS].describe().T.round(2)
Out[6]:
count mean std min 25% 50% 75% max
energy_100g 695312.000 1107.910 757.790 0.000 439.000 1053.600 1648.000 3389.180
fat_100g 689555.000 13.210 16.650 0.000 1.000 6.800 20.900 91.500
saturated-fat_100g 673682.000 4.890 6.610 0.000 0.200 1.900 7.000 28.400
carbohydrates_100g 689364.000 27.210 27.280 0.000 3.300 14.100 51.800 93.000
sugars_100g 677249.000 12.980 18.870 0.000 0.720 3.800 16.670 80.000
fiber_100g 310560.000 2.920 4.290 0.000 0.000 1.500 3.640 25.700
proteins_100g 690356.000 8.800 9.070 0.000 2.000 6.300 12.500 50.000
salt_100g 664787.000 0.990 1.680 0.000 0.070 0.490 1.280 12.000
sodium_100g 664787.000 0.400 0.670 0.000 0.030 0.190 0.510 4.800
added-sugars_100g 347432.000 8.420 17.260 0.000 0.000 0.000 7.400 82.750

3. Data Quality Assessment¶

In [7]:
plotter.plot_missingness_overview(df)
[Figure: missingness overview]
Duplicate rows       : 0
Columns >50% missing : 7
Columns >80% missing : 0
In [8]:
plotter.plot_missingno_matrix(df, NUTRIENT_COLS)
[Figure: missingno matrix of nutrient columns]

The dataset shows moderate overall data quality, with most columns having 1–50% missingness, while a few key fields such as additives_tags and nova_group have much higher missing values and may need imputation or exclusion. Overall, the core nutritional variables appear relatively complete, making the dataset usable for modeling after targeted cleaning and preprocessing.
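The per-column missingness counts reported above can be reproduced with plain pandas. A minimal sketch (the helper name `missingness_summary` is hypothetical, not part of `src.eda`):

```python
import pandas as pd


def missingness_summary(df, thresholds=(0.5, 0.8)):
    """Per-column missing fraction, plus counts of columns above each threshold."""
    miss = df.isna().mean().sort_values(ascending=False)
    flags = {f">{int(t * 100)}%": int((miss > t).sum()) for t in thresholds}
    return miss, flags


# hypothetical mini-frame mimicking the sparse nova_group column
demo = pd.DataFrame({
    "nova_group": [1, None, None, None],       # 75% missing
    "energy_100g": [100.0, 200.0, None, 50.0]  # 25% missing
})
miss, flags = missingness_summary(demo)
print(flags)  # {'>50%': 1, '>80%': 0}
```

Working with fractions rather than raw counts makes the thresholds (`>50%`, `>80%`) directly comparable across datasets of different sizes.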

4. Target Variable Analysis: Nutritional Quality and Processing Level¶

In [9]:
plotter.plot_nutriscore_overview(df)
[Figure: Nutri-Score grade distribution]
Nutri-Score coverage: 57.4% of loaded rows
In [10]:
plotter.plot_nova_overview(df)
plotter.plot_category_overview(df)
[Figure: NOVA group distribution]
NOVA coverage: 42.7% of loaded rows
[Figure: product category distribution]
In [11]:
plotter.plot_nova_nutriscore_heatmap(df)
plotter.plot_nova_nutriscore_stacked_share(df)
[Figure: NOVA × Nutri-Score heatmap]
[Figure: NOVA × Nutri-Score stacked share]

The target variable analysis shows that the dataset captures both nutritional quality and food processing level in a meaningful way. Nutri-Score grades are fairly distributed, with C being the most common, followed by D, while A and E also appear in substantial proportions, indicating a good mix of healthier and less healthy products. In contrast, the NOVA classification is strongly dominated by NOVA 4, showing that most products in the dataset are ultra-processed foods, while minimally processed items are much less common. The category distribution, led by cereals, potatoes, dairy, snacks, and beverages, further reflects the packaged-food nature of the dataset. The heatmap and stacked bar chart reveal an important pattern: NOVA 1 foods are mostly concentrated in Nutri-Score A, whereas NOVA 4 products dominate the lower nutritional grades C, D, and E, although they also appear in some A and B products. This suggests that while better nutritional scores are often associated with lower processing, nutritional quality and processing level are not identical concepts, which makes them valuable complementary target variables for food assessment and machine learning prediction.
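The heatmap and stacked-share plots are, in essence, row-normalised cross-tabulations of the two targets. A minimal sketch of the underlying computation on a hypothetical mini-frame (the real plots come from `OpenFoodFactsEDAPlotter`, whose internals may differ):

```python
import pandas as pd

# hypothetical mini-frame: two NOVA-1 products graded A, four NOVA-4 products
demo = pd.DataFrame({
    "nova_group": [1, 1, 4, 4, 4, 4],
    "nutrition_grade_fr": ["a", "a", "d", "e", "e", "b"],
})

# normalize="index" makes each NOVA row sum to 1, i.e. the share of each
# Nutri-Score grade within a processing tier
share = pd.crosstab(demo["nova_group"], demo["nutrition_grade_fr"], normalize="index")
print(share.round(2))
```

In this toy example all NOVA-1 products fall in grade A, while NOVA-4 products spread across B, D, and E, mirroring the pattern described above in miniature.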

5. Nutritional Feature Distributions¶

In [12]:
plotter.plot_nutrient_distributions(df, NUTRIENT_COLS)
[Figure: nutrient distributions]
In [13]:
plotter.plot_nutrients_by_group(
    df,
    NUTRIENT_COLS,
    group_col="nutrition_grade_fr",
    order=GRADE_ORDER,
    palette=plotter.grade_palette,
    title="Nutrient Distributions by Nutri-Score Grade",
)
[Figure: nutrient distributions by Nutri-Score grade]
In [14]:
plotter.plot_nutrients_by_group(
    df,
    NUTRIENT_COLS,
    group_col="nova_group",
    order=NOVA_ORDER,
    palette=plotter.nova_palette,
    title="Nutrient Distributions by NOVA Group",
)
[Figure: nutrient distributions by NOVA group]

The nutritional feature distributions show that most nutrient variables are right-skewed, with many products clustered at lower values and a smaller number of products exhibiting very high amounts, especially for fat, saturated fat, sugars, salt, and sodium. Across Nutri-Score grades, poorer grades generally correspond to higher median levels of energy, fat, saturated fat, sugars, salt, and sodium, while better grades tend to show relatively higher fiber and more moderate nutrient profiles. A similar trend appears across NOVA groups, where more processed foods, particularly NOVA 3 and NOVA 4, display wider variability and higher concentrations of less desirable nutrients, reinforcing the link between processing intensity and nutritional imbalance.

6. Correlation and Multicollinearity Analysis¶

Correlation analysis revealed strong relationships between certain features, particularly between salt and sodium, indicating potential multicollinearity. These relationships are important to consider during feature selection to avoid redundant information in the modeling stage.

In [15]:
nutrient_data = df[NUTRIENT_COLS].dropna(thresh=int(0.7 * len(NUTRIENT_COLS)))
pearson, spearman, high_corr_df = compute_high_correlation_pairs(
    nutrient_data,
    NUTRIENT_COLS,
)

plotter.plot_correlation_matrices(pearson, spearman, NUTRIENT_COLS)

if high_corr_df.empty:
    print("No pairs exceed |r| = 0.85.")
else:
    display(high_corr_df)
[Figure: Pearson and Spearman correlation matrices]
Feature A Feature B Pearson r
0 salt_100g sodium_100g 0.992

The correlation analysis shows a near-perfect positive relationship between salt and sodium (Pearson r = 0.992), the only pair exceeding the |r| = 0.85 threshold, indicating that the two columns carry essentially the same information. Such near-perfect correlation is a textbook case of multicollinearity, which can distort coefficient-based models and reduce interpretability. Therefore, feature selection, correlation filtering, or dimensionality-reduction techniques should be considered before model training.
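The pair-finding step performed by `compute_high_correlation_pairs` can be sketched with plain pandas (the helper name `high_correlation_pairs` below is hypothetical; the library version may also return Spearman correlations):

```python
import pandas as pd


def high_correlation_pairs(df, cols, threshold=0.85):
    """Return feature pairs whose absolute Pearson r exceeds the threshold."""
    corr = df[cols].corr()
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:  # upper triangle only, so each pair appears once
            r = corr.loc[a, b]
            if abs(r) > threshold:
                pairs.append({"Feature A": a, "Feature B": b, "Pearson r": round(r, 3)})
    return pd.DataFrame(pairs)


# hypothetical mini-frame where sodium is exactly salt / 2.5
demo = pd.DataFrame({
    "salt_100g": [0.1, 0.5, 1.0, 2.0],
    "sodium_100g": [0.04, 0.2, 0.4, 0.8],
    "fiber_100g": [5.0, 0.0, 3.0, 1.0],  # weakly correlated, stays below threshold
})
result = high_correlation_pairs(demo, list(demo.columns))
print(result)
```

Scanning only the upper triangle of the correlation matrix avoids reporting each pair twice and skips the trivial diagonal.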

7. Additive and Ingredient Analysis¶

In [16]:
plotter.plot_additives_overview(df)
[Figure: additives overview]
In [17]:
plotter.plot_top_additives(df)
[Figure: top additives by frequency]

The additive analysis shows that most products contain few or no additives, but the distribution is strongly right-skewed, meaning a smaller subset of products includes a much larger number of additives. Average additive count rises steadily from Nutri-Score A to E and is dramatically highest in NOVA 4, indicating that additive use is closely associated with poorer nutritional quality and higher processing intensity. The most frequent additives, such as e330, e322, e500, and e471, appear widely across products, suggesting that certain stabilizers, acidity regulators, and emulsifiers are common markers of industrial food formulation.
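The top-additive counts can be derived by splitting and exploding the tag strings. A minimal sketch, assuming `additives_tags` holds comma-separated values with an `en:` prefix (as in the Open Food Facts export); the actual plotter logic may differ:

```python
import pandas as pd

# hypothetical mini-series standing in for the additives_tags column
tags = pd.Series(["en:e330,en:e322", "en:e330", None, "en:e471,en:e330"])

counts = (
    tags.dropna()                             # products with no recorded additives
    .str.split(",")                           # one list of tags per product
    .explode()                                # one row per individual tag
    .str.replace("en:", "", regex=False)      # strip the language prefix
    .value_counts()
)
print(counts.head())  # e330 appears in three of the four products
```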

8. Feature Relationships with Nutritional Quality and Processing Level¶

In [18]:
plotter.plot_nutrients_by_group(
    df,
    NUTRIENT_COLS,
    group_col="nutrition_grade_fr",
    order=GRADE_ORDER,
    palette=plotter.grade_palette,
    title="Nutrient Distributions (Violin) by Nutri-Score Grade",
    chart="violin",
)
[Figure: violin plots of nutrients by Nutri-Score grade]
In [19]:
kw_grade_df = compute_kruskal_summary(
    df,
    NUTRIENT_COLS,
    group_col="nutrition_grade_fr",
    group_order=GRADE_ORDER,
)
plotter.plot_kruskal_summary(
    kw_grade_df,
    "Statistical Separation of Nutrients across Nutri-Score Grades\n(red = p < 0.05)",
)
display(kw_grade_df)
[Figure: Kruskal-Wallis summary across Nutri-Score grades]
feature H-statistic p-value
0 energy_100g 102520.300 0.000
2 saturated-fat_100g 92870.400 0.000
1 fat_100g 78510.800 0.000
4 sugars_100g 57979.000 0.000
7 salt_100g 56948.800 0.000
8 sodium_100g 56669.200 0.000
9 added-sugars_100g 51389.400 0.000
3 carbohydrates_100g 24156.500 0.000
6 proteins_100g 17459.400 0.000
5 fiber_100g 7495.500 0.000
In [20]:
kw_nova_df = compute_kruskal_summary(
    df,
    NUTRIENT_COLS,
    group_col="nova_group",
    group_order=NOVA_ORDER,
)
plotter.plot_kruskal_summary(
    kw_nova_df,
    "Statistical Separation of Nutrients across NOVA Groups\n(red = p < 0.05)",
)
display(kw_nova_df)
[Figure: Kruskal-Wallis summary across NOVA groups]
feature H-statistic p-value
9 added-sugars_100g 60660.400 0.000
7 salt_100g 54803.800 0.000
8 sodium_100g 54797.900 0.000
6 proteins_100g 28022.400 0.000
4 sugars_100g 23004.500 0.000
0 energy_100g 17792.000 0.000
1 fat_100g 16116.800 0.000
2 saturated-fat_100g 14651.600 0.000
3 carbohydrates_100g 14500.500 0.000
5 fiber_100g 9242.100 0.000

Energy, fat, saturated fat, sugars, salt, and sodium rise as Nutri-Score worsens, while fiber is relatively higher in better grades. Across NOVA groups, salt, sodium, and sugars show the strongest separation, suggesting processing level is strongly linked to industrial formulation. This indicates that these nutrients are likely to be the most useful predictors for both nutritional quality and processing level. It also reinforces that Nutri-Score and NOVA capture overlapping but not identical aspects of food healthfulness.
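At its core, `compute_kruskal_summary` runs a Kruskal-Wallis H-test per nutrient: a non-parametric test of whether a value's distribution differs across groups, appropriate here because the nutrients are heavily skewed. A minimal single-feature sketch using `scipy.stats.kruskal` (the wrapper name `kruskal_by_group` is hypothetical):

```python
import pandas as pd
from scipy.stats import kruskal


def kruskal_by_group(df, value_col, group_col, group_order):
    """Kruskal-Wallis H-test: does value_col differ across the ordered groups?"""
    samples = [
        df.loc[df[group_col] == g, value_col].dropna()
        for g in group_order
    ]
    h, p = kruskal(*samples)
    return h, p


# hypothetical mini-frame: NOVA-4 products are clearly saltier than NOVA-1
demo = pd.DataFrame({
    "nova_group": [1] * 5 + [4] * 5,
    "salt_100g": [0.1, 0.2, 0.1, 0.3, 0.2, 1.5, 2.0, 1.8, 2.5, 1.2],
})
h, p = kruskal_by_group(demo, "salt_100g", "nova_group", [1, 4])
print(h, p)  # p < 0.05: salt separates the two groups
```

Because the test is rank-based, it is robust to the outliers and skew documented earlier, which is why the H-statistics above remain meaningful even before capping.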

9. Outlier Detection¶

In [21]:
df_capped, out_df = cap_outliers(df, NUTRIENT_COLS)
display(out_df)
plotter.plot_outlier_boxplots(df, df_capped, NUTRIENT_COLS)
feature outliers outlier_pct
0 energy_100g 0 0.000
1 fat_100g 0 0.000
2 saturated-fat_100g 0 0.000
3 carbohydrates_100g 0 0.000
4 sugars_100g 0 0.000
5 fiber_100g 0 0.000
6 proteins_100g 0 0.000
7 salt_100g 0 0.000
8 sodium_100g 0 0.000
9 added-sugars_100g 0 0.000
[Figure: before/after outlier-capping boxplots]

Because percentile-based capping was already applied during the initial inspection in Section 2, this second pass reports zero remaining outliers for every feature, confirming that the earlier capping removed the extreme values visible in the raw summary statistics (e.g., sugars values up to 10^10 per 100 g). Since such extremes are relatively rare, percentile-based capping preserves the overall distribution while reducing the influence of anomalous or potentially erroneous observations on model training.
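The `cap_outliers` helper lives in `src.eda` and is not shown here; a minimal percentile-capping sketch consistent with the 1st-99th percentile behaviour described above (the function name `cap_outliers_sketch` is hypothetical) might look like:

```python
import pandas as pd


def cap_outliers_sketch(df, cols, lower=0.01, upper=0.99):
    """Clip each column to its 1st/99th percentiles and report how many values moved."""
    capped = df.copy()
    report = []
    for col in cols:
        lo, hi = df[col].quantile([lower, upper])          # percentile bounds
        n_out = int(((df[col] < lo) | (df[col] > hi)).sum())  # values outside them
        capped[col] = df[col].clip(lo, hi)                 # pull extremes to the bounds
        report.append({
            "feature": col,
            "outliers": n_out,
            "outlier_pct": round(100 * n_out / df[col].notna().sum(), 3),
        })
    return capped, pd.DataFrame(report)


# hypothetical mini-column with one absurd entry, like the raw sugars_100g maximum
demo = pd.DataFrame({"sugars_100g": [1.0, 2.0, 3.0, 4.0, 1e10]})
capped, report = cap_outliers_sketch(demo, ["sugars_100g"])
print(capped["sugars_100g"].max())  # far below the raw 1e10 maximum
```

Note that with interpolated percentiles both tails are clipped, so values at the extreme low end count as outliers too; running the function a second time on already-capped data reports (near) zero outliers, which is exactly what Section 9's table shows.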

10. Geographic and Category Distribution¶

In [22]:
plotter.plot_geo_category_distribution(df)
[Figure: geographic and category distribution]

The geographic distribution shows that the dataset is heavily concentrated in France and the United States, followed by a smaller contribution from several European countries, indicating some regional imbalance in product representation. At the category level, many products fall into unknown, cereals and potatoes, and milk and dairy products, suggesting that packaged staple foods dominate the dataset and that category completeness may need further cleaning for more precise analysis.

11. Missing Data Imputation Strategy¶

In [23]:
df_imp = df.copy()

for col in NUTRIENT_COLS:
    # fill each nutrient with the median of its pnns_groups_1 food group;
    # groups that are entirely NaN remain NaN and are dropped later
    df_imp[col] = df_imp.groupby("pnns_groups_1")[col].transform(
        lambda x: x.fillna(x.median())
    )

print("Imputation Summary")
plotter.plot_imputation_comparison(df_capped, df_imp, COMPARE_COLS)
Imputation Summary
[Figure: distributions before vs. after imputation]

Missing numerical values in the core nutritional features were handled with group-level median imputation: each nutrient's median is computed within its pnns_groups_1 food group and used to fill gaps, which respects systematic differences between food groups (a simpler global-median variant is available via the imported impute_with_global_median). This step was applied after outlier capping to ensure that the imputed values are not influenced by extreme observations, preserving the central tendency of each nutrient's distribution.
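The group-wise fill with a global-median fallback can be sketched in a few lines (the helper name `impute_group_median` is hypothetical; the notebook's loop above performs only the group-wise step):

```python
import pandas as pd


def impute_group_median(df, col, group_col):
    """Fill NaNs with the group median; fall back to the global median for all-NaN groups."""
    out = df.copy()
    out[col] = out.groupby(group_col)[col].transform(lambda s: s.fillna(s.median()))
    out[col] = out[col].fillna(df[col].median())  # global fallback
    return out


# hypothetical mini-frame: 'snacks' has no observed fiber at all
demo = pd.DataFrame({
    "pnns_groups_1": ["dairy", "dairy", "dairy", "snacks"],
    "fiber_100g": [1.0, 3.0, None, None],
})
filled = impute_group_median(demo, "fiber_100g", "pnns_groups_1")
print(filled["fiber_100g"].tolist())  # [1.0, 3.0, 2.0, 2.0]
```

The dairy gap is filled with the dairy median (2.0), while the snacks row, whose group has no data, falls back to the global median (also 2.0 here).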

12. Data Cleaning¶

To address multicollinearity identified during EDA, redundant features were removed during the data cleaning stage. In particular, sodium_100g was dropped because it is deterministically derived from salt_100g (salt = sodium × 2.5), and retaining both would introduce perfect multicollinearity. Removing such features helps improve model stability and reduces redundancy in the feature space.
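The deterministic relationship can be checked directly before dropping the column. A minimal sketch on a hypothetical mini-frame:

```python
import pandas as pd

# sanity check of the salt = sodium * 2.5 identity on a hypothetical mini-frame
demo = pd.DataFrame({"sodium_100g": [0.1, 0.4, 1.0]})
demo["salt_100g"] = demo["sodium_100g"] * 2.5

# the ratio is constant, so the two columns are perfectly redundant
ratio = demo["salt_100g"] / demo["sodium_100g"]
assert (ratio - 2.5).abs().lt(1e-9).all()

# dropping the derived column removes the redundancy
demo = demo.drop(columns=["sodium_100g"])
print(list(demo.columns))  # ['salt_100g']
```

Keeping salt rather than sodium is an arbitrary but harmless choice here, since either column can be recovered exactly from the other.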

In [24]:
print(f"Rows before cleaning : {len(df_imp):,}")

df_clean = df_imp.copy()  # final dataset after cleaning + imputation

# ensure no remaining missing values in features
df_clean = df_clean.dropna(subset=NUTRIENT_COLS)

# drop rows where the target variable (nova_group) is missing.
# rows without a NOVA label cannot contribute to supervised classification.
n_before = len(df_clean)
df_clean = df_clean.dropna(subset=["nova_group"])
n_dropped_nova = n_before - len(df_clean)
print(f"Rows dropped (missing nova_group)   : {n_dropped_nova:,}")

# remove exact duplicate rows to prevent data leakage and inflated metrics.
n_before = len(df_clean)
df_clean = df_clean.drop_duplicates()
n_dropped_dupes = n_before - len(df_clean)
print(f"Rows dropped (exact duplicates)     : {n_dropped_dupes:,}")

# remove duplicate products by barcode, keeping the first occurrence.
# duplicate barcodes indicate the same product scanned multiple times.
if "code" in df_clean.columns:
    n_before = len(df_clean)
    df_clean = df_clean.drop_duplicates(subset=["code"], keep="first")
    n_dropped_code = n_before - len(df_clean)
    print(f"Rows dropped (duplicate barcodes)   : {n_dropped_code:,}")

# drop sodium_100g — it is deterministically derived from salt_100g
# (salt = sodium × 2.5), so retaining both introduces perfect multicollinearity.
if "sodium_100g" in df_clean.columns:
    df_clean = df_clean.drop(columns=["sodium_100g"])

    # update NUTRIENT_COLS as well
    NUTRIENT_COLS = [c for c in NUTRIENT_COLS if c != "sodium_100g"]
    
    print("Column dropped (redundant): sodium_100g")

print(f"\nRows after cleaning  : {len(df_clean):,}")
print(f"Total rows removed   : {len(df_imp) - len(df_clean):,}")
print(f"Columns remaining    : {df_clean.shape[1]}")

print("\nNOVA class distribution after cleaning:")
display(
    df_clean["nova_group"]
    .value_counts()
    .sort_index()
    .rename("count")
    .to_frame()
)
Rows before cleaning : 718,325
Rows dropped (missing nova_group)   : 411,392
Rows dropped (exact duplicates)     : 0
Rows dropped (duplicate barcodes)   : 0
Column dropped (redundant): sodium_100g

Rows after cleaning  : 306,932
Total rows removed   : 411,393
Columns remaining    : 23

NOVA class distribution after cleaning:
count
nova_group
1 37036
2 13573
3 62463
4 193860
In [25]:
# verify that no NaNs remain in the nutrient features
df_clean[NUTRIENT_COLS].isna().sum()
Out[25]:
energy_100g           0
fat_100g              0
saturated-fat_100g    0
carbohydrates_100g    0
sugars_100g           0
fiber_100g            0
proteins_100g         0
salt_100g             0
added-sugars_100g     0
dtype: int64

13. Save and Clean-Up¶

In [26]:
# ensure the target variable is an integer type
df_clean["nova_group"] = df_clean["nova_group"].astype(int)

# create the directory for the processed data if it does not already exist
os.makedirs("dataset/processed", exist_ok=True)

# verify final dataset shape and columns
print("Final dataset shape:", df_clean.shape)
print("Columns:", sorted(df_clean.columns.tolist()))

# rename hyphenated columns to underscores for compatibility
df_clean = df_clean.rename(columns={
    "saturated-fat_100g": "saturated_fat_100g",
    "trans-fat_100g": "trans_fat_100g",
    "added-sugars_100g": "added_sugars_100g",
    "monounsaturated-fat_100g": "monounsaturated_fat_100g",
    "polyunsaturated-fat_100g": "polyunsaturated_fat_100g",
})

# barcodes are identifiers, not numbers — store as string to avoid PyArrow overflow
if "code" in df_clean.columns:
    df_clean["code"] = df_clean["code"].astype(str)

# save cleaned dataset
output_path = "dataset/processed/open_food_facts_cleaned.parquet"
df_clean.to_parquet(output_path, index=False)

# confirmation
print(f"\nSaved to: {output_path}")
print(f"Parquet columns ({len(df_clean.columns)}): {sorted(df_clean.columns.tolist())}")
Final dataset shape: (306932, 23)
Columns: ['added-sugars_100g', 'additives_n', 'additives_tags', 'brands', 'carbohydrates_100g', 'categories_en', 'code', 'countries_en', 'energy_100g', 'fat_100g', 'fiber_100g', 'ingredients_analysis_tags', 'ingredients_text', 'nova_group', 'nutriscore_score', 'nutrition_grade_fr', 'pnns_groups_1', 'pnns_groups_2', 'product_name', 'proteins_100g', 'salt_100g', 'saturated-fat_100g', 'sugars_100g']

Saved to: dataset/processed/open_food_facts_cleaned.parquet
Parquet columns (23): ['added_sugars_100g', 'additives_n', 'additives_tags', 'brands', 'carbohydrates_100g', 'categories_en', 'code', 'countries_en', 'energy_100g', 'fat_100g', 'fiber_100g', 'ingredients_analysis_tags', 'ingredients_text', 'nova_group', 'nutriscore_score', 'nutrition_grade_fr', 'pnns_groups_1', 'pnns_groups_2', 'product_name', 'proteins_100g', 'salt_100g', 'saturated_fat_100g', 'sugars_100g']

Rows missing the target variable nova_group were removed first, as they cannot contribute to supervised model training. Exact duplicate rows and barcode-level duplicate product entries were then eliminated to prevent data leakage and overfitting. Finally, sodium_100g was dropped as a redundant feature since it is deterministically derived from salt_100g (salt = sodium × 2.5), and retaining both would introduce perfect multicollinearity into the feature set. The resulting cleaned dataframe df_clean is the final output and will be passed to downstream modeling notebooks.

Exploratory Data Analysis Summary¶

  • The dataset is usable for modeling after preprocessing, with most variables showing low to moderate missingness, while a few fields such as additives and NOVA-related columns have substantially higher missing values.

  • Nutri-Score is fairly distributed across classes, with C being the most common grade, providing a reasonable target balance for nutritional quality prediction.

  • NOVA group is highly imbalanced toward NOVA 4, indicating that the dataset is dominated by ultra-processed foods.

  • Nutritional variables such as fat, saturated fat, sugars, salt, and sodium are generally right-skewed, with many low-to-moderate values and a small number of extreme observations.

  • Poorer Nutri-Score grades tend to have higher energy, fat, saturated fat, sugars, salt, and sodium, while better grades are relatively associated with higher fiber.

  • Across NOVA groups, the strongest separation is observed for salt, sodium, and sugars, showing that processing level is closely tied to industrial formulation.

  • Correlation analysis reveals strong multicollinearity among some nutrient features, especially salt–sodium and several other highly correlated nutrient pairs, suggesting the need for feature selection or dimensionality reduction.

  • Outlier detection shows that extreme values are present but relatively rare, and percentile-based capping helps reduce their impact without heavily distorting the data.

  • Missing nutritional values were handled using median imputation within PNNS food groups, which preserved the overall distributions while filling gaps robustly in skewed variables.

  • Additive analysis shows that most products contain few additives, but additive counts increase with worse Nutri-Score and are highest in NOVA 4, reinforcing the connection between additives and ultra-processing.

  • The dataset is geographically concentrated in France and the United States, with additional representation from several European countries, which may introduce regional bias.

  • Product categories are dominated by unknown, cereals and potatoes, and milk and dairy products, indicating both strong packaged-food representation and some category-label incompleteness.

Overall, the EDA suggests that the dataset contains strong predictive signals for both nutritional quality and processing level, but it also requires careful handling of missing data, class imbalance, outliers, and multicollinearity before model development.

AI Use Disclosure¶

AI assistance tools were used in the following capacities during the development of this project:

  • Research and Planning: AI tools were used to search for code snippets, explore modeling approaches, and identify applicable machine learning techniques (e.g., NOVA classification strategies, anomaly detection methods, SHAP explainability patterns). These suggestions were reviewed, adapted, and validated by the team before implementation.

  • Copy Editing and Report Refinement: An AI assistant was used to copy edit written documentation and the final report draft, check for redundancy, and provide feedback on areas that could be tightened up or that required additional clarification. The prompt provided to the tool included context about the project purpose, target audience (academic evaluators for the AAI-590 Capstone), and formatting guidelines.

All AI-generated suggestions were critically reviewed by the team. Final decisions regarding methodology, implementation, and written content remain the work of the authors.