Objective¶
The primary objective of this project is to develop an intelligent nutritional assessment system that predicts the level of food processing from nutritional composition, additive information, and NOVA classification data in the Open Food Facts dataset. The project also aims to support more informed dietary decisions by using machine learning methods to identify foods that appear healthy based on their macronutrient profile but may nonetheless be highly processed.
Exploratory Data Analysis¶
%pip install -q -r requirements.txt
1. Import Libraries and Configuration¶
import warnings
warnings.filterwarnings("ignore")
# force-reload src.eda modules so code changes are always picked up
# (without this, "Run All" reuses stale cached imports)
%load_ext autoreload
%autoreload 2
from IPython.display import display
import subprocess
import os
from src.eda import (
COMPARE_COLS,
DEFAULT_LIGHT_DATASET_PATH,
FULL_DATASET_PATH,
GRADE_ORDER,
NOVA_ORDER,
OpenFoodFactsEDADataLoader,
OpenFoodFactsEDAPlotter,
cap_outliers,
compute_high_correlation_pairs,
compute_kruskal_summary,
impute_with_global_median,
print_dataset_overview,
)
loader = OpenFoodFactsEDADataLoader()
plotter = OpenFoodFactsEDAPlotter()
DATASET_PATH = FULL_DATASET_PATH
# create a sampled ("light") dataset if the configured dataset does not exist;
# the final trial can be run on a larger portion of the dataset, or all of it
if not os.path.exists(DATASET_PATH):
    print("Dataset not found. Creating a sampled dataset...")
    subprocess.run(
        [
            "python",
            "./scripts/create_light_dataset.py",
            "--local",
            "--random",
            "--target-rows",
            "500000",
        ],
        check=True,  # fail loudly if the sampling script errors out
    )
    print("Sampled dataset created.")
print(f"Loading dataset from: {DATASET_PATH}")
Loading dataset from: dataset\en.openfoodfacts.org.products.csv
The light dataset was used because it provides a smaller, more manageable subset of the full Open Food Facts data, making data cleaning, exploratory analysis, and model development more efficient. It also reduces computational overhead while preserving the key nutritional and processing-related features needed to identify patterns in NOVA classification.
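The sampling itself is done by `scripts/create_light_dataset.py` (not shown in this notebook). A chunked random-sampling pass over the raw TSV, roughly in the spirit of that script, might look like the sketch below; the function name and details here are illustrative, not the actual script:

```python
import pandas as pd

def sample_light_dataset(src_path, dst_path, target_rows=500_000,
                         chunksize=100_000, sep="\t", seed=42):
    """Randomly sample ~target_rows rows from a large delimited file
    without loading it into memory all at once (illustrative sketch)."""
    # first pass: count data rows so we can derive a sampling fraction
    with open(src_path, encoding="utf-8", errors="replace") as f:
        total = sum(1 for _ in f) - 1  # minus header line
    frac = min(1.0, target_rows / max(total, 1))
    # second pass: sample each chunk at that fraction and append to the output
    first = True
    kept = 0
    for chunk in pd.read_csv(src_path, sep=sep, chunksize=chunksize,
                             low_memory=False, dtype=str):
        sampled = chunk.sample(frac=frac, random_state=seed)
        sampled.to_csv(dst_path, sep=sep, index=False,
                       mode="w" if first else "a", header=first)
        first = False
        kept += len(sampled)
    return kept
```

Per-chunk sampling keeps memory bounded while yielding approximately uniform coverage of the file.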
2. Load and Inspect Dataset¶
dataset = loader.load(DATASET_PATH)
df = dataset.df
NUTRIENT_COLS = dataset.nutrient_cols
META_COLS = dataset.meta_cols
# drop ultra-sparse features
drop_cols = [
"trans-fat_100g",
"monounsaturated-fat_100g",
"polyunsaturated-fat_100g",
"starch_100g"
]
df = df.drop(columns=drop_cols, errors="ignore")
# update nutrient columns FIRST
NUTRIENT_COLS = [c for c in NUTRIENT_COLS if c not in drop_cols]
# remove invalid negative nutritional values
for col in NUTRIENT_COLS:
    df = df[df[col].isna() | (df[col] >= 0)]
print_dataset_overview(dataset)
df.head(3)
Detected delimiter: TAB Dataset path: dataset\en.openfoodfacts.org.products.csv Shape: (718492, 28) Loaded columns (28): ['added-sugars_100g', 'additives_n', 'additives_tags', 'brands', 'carbohydrates_100g', 'categories_en', 'code', 'countries_en', 'energy_100g', 'fat_100g', 'fiber_100g', 'ingredients_analysis_tags', 'ingredients_text', 'monounsaturated-fat_100g', 'nova_group', 'nutriscore_score', 'nutrition_grade_fr', 'pnns_groups_1', 'pnns_groups_2', 'polyunsaturated-fat_100g', 'product_name', 'proteins_100g', 'salt_100g', 'saturated-fat_100g', 'sodium_100g', 'starch_100g', 'sugars_100g', 'trans-fat_100g'] Nutrient columns used in EDA (14): ['energy_100g', 'fat_100g', 'saturated-fat_100g', 'carbohydrates_100g', 'sugars_100g', 'fiber_100g', 'proteins_100g', 'salt_100g', 'sodium_100g', 'trans-fat_100g', 'added-sugars_100g', 'monounsaturated-fat_100g', 'polyunsaturated-fat_100g', 'starch_100g'] Nutri-Score distribution: nutrition_grade_fr a 58473 b 47361 c 95796 d 103079 e 107571 <NA> 306212 Name: count, dtype: Int64 NOVA distribution: nova_group 1 37036 2 13573 3 62473 4 194012 <NA> 411398 Name: count, dtype: Int64
| code | product_name | brands | categories_en | countries_en | ingredients_text | ingredients_analysis_tags | additives_n | additives_tags | nutriscore_score | nutrition_grade_fr | nova_group | pnns_groups_1 | pnns_groups_2 | energy_100g | fat_100g | saturated-fat_100g | carbohydrates_100g | sugars_100g | added-sugars_100g | fiber_100g | proteins_100g | salt_100g | sodium_100g | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 798 | 0000100000724 | Ben's Pure Maple Cream | Ben's Sugar Shack, Jeff de Bruges | NaN | France,World | NaN | NaN | NaN | NaN | NaN | <NA> | <NA> | unknown | unknown | 2221.960 | 30.280 | 19.300 | 59.120 | 37.720 | NaN | NaN | 5.680 | 0.270 | 0.108 |
| 841 | 00001001 | pasta | Grappa | Beverages and beverages preparations,Beverages | France,Germany | NaN | NaN | NaN | NaN | 5.000 | c | <NA> | unknown | unknown | 697.100 | 6.400 | 1.300 | 20.000 | 1.900 | NaN | 0.800 | 6.700 | 0.001 | 0.000 |
| 877 | 0000101019680 | Donut Milka | Milka | Snacks,Sweet snacks,Biscuits and cakes,Cakes,D... | France | NaN | NaN | NaN | NaN | 23.000 | e | <NA> | Sugary snacks | Biscuits and cakes | 1928.500 | 28.000 | 13.800 | 46.500 | 18.000 | NaN | NaN | 6.000 | 0.649 | 0.260 |
# verify that the sparse features dropped in the previous block
# are no longer present
print("Updated nutrient columns:", NUTRIENT_COLS)
Updated nutrient columns: ['energy_100g', 'fat_100g', 'saturated-fat_100g', 'carbohydrates_100g', 'sugars_100g', 'fiber_100g', 'proteins_100g', 'salt_100g', 'sodium_100g', 'added-sugars_100g']
df.info(memory_usage="deep")
<class 'pandas.core.frame.DataFrame'> Index: 718325 entries, 798 to 4436578 Data columns (total 24 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 code 718325 non-null object 1 product_name 708294 non-null object 2 brands 531381 non-null object 3 categories_en 472779 non-null object 4 countries_en 717894 non-null object 5 ingredients_text 337385 non-null object 6 ingredients_analysis_tags 346320 non-null object 7 additives_n 337386 non-null float64 8 additives_tags 183678 non-null object 9 nutriscore_score 412156 non-null float64 10 nutrition_grade_fr 412156 non-null string 11 nova_group 306932 non-null Int64 12 pnns_groups_1 718325 non-null object 13 pnns_groups_2 718325 non-null object 14 energy_100g 695312 non-null float64 15 fat_100g 689555 non-null float64 16 saturated-fat_100g 673682 non-null float64 17 carbohydrates_100g 689364 non-null float64 18 sugars_100g 677249 non-null float64 19 added-sugars_100g 347432 non-null float64 20 fiber_100g 310560 non-null float64 21 proteins_100g 690356 non-null float64 22 salt_100g 664787 non-null float64 23 sodium_100g 664787 non-null float64 dtypes: Int64(1), float64(12), object(10), string(1) memory usage: 664.5 MB
df[NUTRIENT_COLS].describe().T.round(2)
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| energy_100g | 695312.000 | 2510.650 | 1158989.640 | 0.000 | 439.000 | 1053.600 | 1648.000 | 966426848.380 |
| fat_100g | 689555.000 | 38.830 | 21139.360 | 0.000 | 1.000 | 6.800 | 20.900 | 17554003.530 |
| saturated-fat_100g | 673682.000 | 6.050 | 718.550 | 0.000 | 0.200 | 1.900 | 7.000 | 588000.000 |
| carbohydrates_100g | 689364.000 | 48.390 | 17439.640 | 0.000 | 3.300 | 14.100 | 51.800 | 14479774.250 |
| sugars_100g | 677249.000 | 14778.730 | 12151385.900 | 0.000 | 0.720 | 3.800 | 16.670 | 10000000000.000 |
| fiber_100g | 310560.000 | 35.350 | 17944.360 | 0.000 | 0.000 | 1.500 | 3.640 | 10000000.000 |
| proteins_100g | 690356.000 | 1463.630 | 1203558.460 | 0.000 | 2.000 | 6.300 | 12.500 | 1000000000.000 |
| salt_100g | 664787.000 | 332.290 | 269236.980 | 0.000 | 0.070 | 0.490 | 1.280 | 219520780.940 |
| sodium_100g | 664787.000 | 132.920 | 107694.790 | 0.000 | 0.030 | 0.190 | 0.510 | 87808312.380 |
| added-sugars_100g | 347432.000 | 287834.800 | 169654386.010 | 0.000 | 0.000 | 0.000 | 7.400 | 100000000000.000 |
# cap extreme outliers (clip to the 1st–99th percentile);
# as the describe() output above shows, several nutrients have extreme outliers
df_capped, out_df = cap_outliers(df, NUTRIENT_COLS)
df = df_capped.copy()
df[NUTRIENT_COLS].describe().T.round(2)
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| energy_100g | 695312.000 | 1107.910 | 757.790 | 0.000 | 439.000 | 1053.600 | 1648.000 | 3389.180 |
| fat_100g | 689555.000 | 13.210 | 16.650 | 0.000 | 1.000 | 6.800 | 20.900 | 91.500 |
| saturated-fat_100g | 673682.000 | 4.890 | 6.610 | 0.000 | 0.200 | 1.900 | 7.000 | 28.400 |
| carbohydrates_100g | 689364.000 | 27.210 | 27.280 | 0.000 | 3.300 | 14.100 | 51.800 | 93.000 |
| sugars_100g | 677249.000 | 12.980 | 18.870 | 0.000 | 0.720 | 3.800 | 16.670 | 80.000 |
| fiber_100g | 310560.000 | 2.920 | 4.290 | 0.000 | 0.000 | 1.500 | 3.640 | 25.700 |
| proteins_100g | 690356.000 | 8.800 | 9.070 | 0.000 | 2.000 | 6.300 | 12.500 | 50.000 |
| salt_100g | 664787.000 | 0.990 | 1.680 | 0.000 | 0.070 | 0.490 | 1.280 | 12.000 |
| sodium_100g | 664787.000 | 0.400 | 0.670 | 0.000 | 0.030 | 0.190 | 0.510 | 4.800 |
| added-sugars_100g | 347432.000 | 8.420 | 17.260 | 0.000 | 0.000 | 0.000 | 7.400 | 82.750 |
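`cap_outliers` is provided by `src.eda`; as a rough sketch of what percentile capping does (assuming a simple 1st/99th-percentile clip, which matches the comment above — the actual implementation may differ):

```python
import pandas as pd

def cap_outliers_sketch(df, cols, lower=0.01, upper=0.99):
    """Clip each column to its percentile bounds and report how many values moved.
    Illustrative re-implementation, not the src.eda helper."""
    capped = df.copy()
    rows = []
    for col in cols:
        lo, hi = df[col].quantile([lower, upper])
        n_out = int(((df[col] < lo) | (df[col] > hi)).sum())
        capped[col] = df[col].clip(lo, hi)
        rows.append({
            "feature": col,
            "outliers": n_out,
            "outlier_pct": round(100 * n_out / df[col].notna().sum(), 3),
        })
    return capped, pd.DataFrame(rows)
```

Clipping (rather than dropping) keeps every row, which matters here because many rows already have missing values elsewhere.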
3. Data Quality Assessment¶
plotter.plot_missingness_overview(df)
Duplicate rows : 0 Columns >50% missing : 7 Columns >80% missing : 0
plotter.plot_missingno_matrix(df, NUTRIENT_COLS)
The dataset shows moderate overall data quality, with most columns having 1–50% missingness, while a few key fields such as additives_tags and nova_group have much higher missing values and may need imputation or exclusion. Overall, the core nutritional variables appear relatively complete, making the dataset usable for modeling after targeted cleaning and preprocessing.
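The missingness figures above come from `plot_missingness_overview`; the underlying computation is essentially the following (a sketch, not the `src.eda` implementation):

```python
import pandas as pd

def missingness_report(df, threshold=0.5):
    """Per-column missing fraction, plus the columns above a given threshold."""
    miss = df.isna().mean().sort_values(ascending=False)
    flagged = miss[miss > threshold].index.tolist()
    return miss, flagged
```

Columns returned in `flagged` (here, those more than 50% missing) are the candidates for imputation or exclusion discussed above.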
4. Target Variable Analysis: Nutritional Quality and Processing Level¶
plotter.plot_nutriscore_overview(df)
Nutri-Score coverage: 57.4% of loaded rows
plotter.plot_nova_overview(df)
plotter.plot_category_overview(df)
NOVA coverage: 42.7% of loaded rows
plotter.plot_nova_nutriscore_heatmap(df)
plotter.plot_nova_nutriscore_stacked_share(df)
The target variable analysis shows that the dataset captures both nutritional quality and food processing level in a meaningful way. Nutri-Score grades are fairly distributed, with C being the most common, followed by D, while A and E also appear in substantial proportions, indicating a good mix of healthier and less healthy products. In contrast, the NOVA classification is strongly dominated by NOVA 4, showing that most products in the dataset are ultra-processed foods, while minimally processed items are much less common. The category distribution, led by cereals, potatoes, dairy, snacks, and beverages, further reflects the packaged-food nature of the dataset. The heatmap and stacked bar chart reveal an important pattern: NOVA 1 foods are mostly concentrated in Nutri-Score A, whereas NOVA 4 products dominate the lower nutritional grades C, D, and E, although they also appear in some A and B products. This suggests that while better nutritional scores are often associated with lower processing, nutritional quality and processing level are not identical concepts, which makes them valuable complementary target variables for food assessment and machine learning prediction.
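The heatmap and stacked-share data behind these observations can be reproduced with a row-normalised crosstab; this sketch (not the `src.eda` plotting code) shows the share of each Nutri-Score grade within each NOVA group:

```python
import pandas as pd

def nova_nutriscore_share(df):
    """Row-normalised NOVA x Nutri-Score crosstab:
    share of each Nutri-Score grade within each NOVA group."""
    ct = pd.crosstab(df["nova_group"], df["nutrition_grade_fr"],
                     normalize="index")
    return ct.round(3)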
5. Nutritional Feature Distributions¶
plotter.plot_nutrient_distributions(df, NUTRIENT_COLS)
plotter.plot_nutrients_by_group(
df,
NUTRIENT_COLS,
group_col="nutrition_grade_fr",
order=GRADE_ORDER,
palette=plotter.grade_palette,
title="Nutrient Distributions by Nutri-Score Grade",
)
plotter.plot_nutrients_by_group(
df,
NUTRIENT_COLS,
group_col="nova_group",
order=NOVA_ORDER,
palette=plotter.nova_palette,
title="Nutrient Distributions by NOVA Group",
)
The nutritional feature distributions show that most nutrient variables are right-skewed, with many products clustered at lower values and a smaller number exhibiting very high amounts, especially for fat, saturated fat, sugars, salt, and sodium. Across Nutri-Score grades, poorer grades generally correspond to higher median levels of energy, fat, saturated fat, sugars, salt, and sodium, while better grades tend to show relatively higher fiber and more moderate nutrient profiles. A similar trend appears across NOVA groups, where more processed foods, particularly NOVA 3 and NOVA 4, display wider variability and higher concentrations of less desirable nutrients, reinforcing the link between processing intensity and nutritional imbalance.
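The right-skew noted above can also be quantified directly rather than read off the histograms; a small helper (illustrative, not part of `src.eda`) computes Fisher skewness per nutrient, where strongly positive values indicate long right tails:

```python
import pandas as pd

def nutrient_skewness(df, cols):
    """Fisher skewness per nutrient column; values well above 0
    confirm a long right tail (right-skew)."""
    return df[cols].skew(numeric_only=True).sort_values(ascending=False)
```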
6. Correlation and Multicollinearity Analysis¶
Correlation analysis revealed strong relationships between certain features, particularly between salt and sodium, indicating potential multicollinearity. These relationships are important to consider during feature selection to avoid redundant information in the modeling stage.
nutrient_data = df[NUTRIENT_COLS].dropna(thresh=int(0.7 * len(NUTRIENT_COLS)))
pearson, spearman, high_corr_df = compute_high_correlation_pairs(
nutrient_data,
NUTRIENT_COLS,
)
plotter.plot_correlation_matrices(pearson, spearman, NUTRIENT_COLS)
if high_corr_df.empty:
    print("No pairs exceed |r| = 0.85.")
else:
    display(high_corr_df)
| Feature A | Feature B | Pearson r | |
|---|---|---|---|
| 0 | salt_100g | sodium_100g | 0.992 |
The correlation analysis shows a near-perfect positive relationship between salt and sodium (Pearson r ≈ 0.99), the only pair exceeding the |r| = 0.85 threshold; the correlation matrices also reveal weaker but notable associations among sugars, fiber, and proteins. Such strong correlations indicate substantial overlap in the information captured by these variables and hence potential multicollinearity, which can distort coefficient-based models and reduce interpretability. Therefore, feature selection, correlation filtering, or dimensionality-reduction techniques should be considered before model training.
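`compute_high_correlation_pairs` comes from `src.eda`; the pair-extraction step it performs can be sketched as follows (an illustrative re-implementation, not the actual helper):

```python
import pandas as pd

def high_correlation_pairs(df, cols, threshold=0.85):
    """Return the unique feature pairs whose |Pearson r| exceeds the threshold."""
    corr = df[cols].corr(method="pearson")
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:  # upper triangle only, so each pair appears once
            r = corr.loc[a, b]
            if pd.notna(r) and abs(r) > threshold:
                pairs.append({"Feature A": a, "Feature B": b,
                              "Pearson r": round(r, 3)})
    return pd.DataFrame(pairs)
```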
7. Additive and Ingredient Analysis¶
plotter.plot_additives_overview(df)
plotter.plot_top_additives(df)
The additive analysis shows that most products contain few or no additives, but the distribution is strongly right-skewed: a small subset of products contains a much larger number of additives. The average additive count rises steadily from Nutri-Score A to E and peaks sharply in NOVA 4, indicating that additive use is closely associated with poorer nutritional quality and higher processing intensity. The most frequent additives, such as e330, e322, e500, and e471, appear widely across products, suggesting that certain stabilizers, acidity regulators, and emulsifiers are common markers of industrial food formulation.
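The top-additive counts can be derived from `additives_tags` directly; assuming the tags are stored as comma-separated strings (e.g. `en:e330,en:e322`, the usual Open Food Facts convention), a sketch of the computation is:

```python
import pandas as pd

def top_additives(df, n=10):
    """Count the most frequent additive tags, assuming additives_tags
    holds comma-separated tag strings (illustrative sketch)."""
    tags = (
        df["additives_tags"]
        .dropna()
        .str.split(",")
        .explode()     # one row per individual tag
        .str.strip()
    )
    return tags.value_counts().head(n)
```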
8. Feature Relationships with Nutritional Quality and Processing Level¶
plotter.plot_nutrients_by_group(
df,
NUTRIENT_COLS,
group_col="nutrition_grade_fr",
order=GRADE_ORDER,
palette=plotter.grade_palette,
title="Nutrient Distributions (Violin) by Nutri-Score Grade",
chart="violin",
)
kw_grade_df = compute_kruskal_summary(
df,
NUTRIENT_COLS,
group_col="nutrition_grade_fr",
group_order=GRADE_ORDER,
)
plotter.plot_kruskal_summary(
kw_grade_df,
"Statistical Separation of Nutrients across Nutri-Score Grades\n(red = p < 0.05)",
)
display(kw_grade_df)
| feature | H-statistic | p-value | |
|---|---|---|---|
| 0 | energy_100g | 102520.300 | 0.000 |
| 2 | saturated-fat_100g | 92870.400 | 0.000 |
| 1 | fat_100g | 78510.800 | 0.000 |
| 4 | sugars_100g | 57979.000 | 0.000 |
| 7 | salt_100g | 56948.800 | 0.000 |
| 8 | sodium_100g | 56669.200 | 0.000 |
| 9 | added-sugars_100g | 51389.400 | 0.000 |
| 3 | carbohydrates_100g | 24156.500 | 0.000 |
| 6 | proteins_100g | 17459.400 | 0.000 |
| 5 | fiber_100g | 7495.500 | 0.000 |
kw_nova_df = compute_kruskal_summary(
df,
NUTRIENT_COLS,
group_col="nova_group",
group_order=NOVA_ORDER,
)
plotter.plot_kruskal_summary(
kw_nova_df,
"Statistical Separation of Nutrients across NOVA Groups\n(red = p < 0.05)",
)
display(kw_nova_df)
| feature | H-statistic | p-value | |
|---|---|---|---|
| 9 | added-sugars_100g | 60660.400 | 0.000 |
| 7 | salt_100g | 54803.800 | 0.000 |
| 8 | sodium_100g | 54797.900 | 0.000 |
| 6 | proteins_100g | 28022.400 | 0.000 |
| 4 | sugars_100g | 23004.500 | 0.000 |
| 0 | energy_100g | 17792.000 | 0.000 |
| 1 | fat_100g | 16116.800 | 0.000 |
| 2 | saturated-fat_100g | 14651.600 | 0.000 |
| 3 | carbohydrates_100g | 14500.500 | 0.000 |
| 5 | fiber_100g | 9242.100 | 0.000 |
Energy, fat, saturated fat, sugars, salt, and sodium rise as Nutri-Score worsens, while fiber is relatively higher in better grades. Across NOVA groups, salt, sodium, and sugars show the strongest separation, suggesting processing level is strongly linked to industrial formulation. This indicates that these nutrients are likely to be the most useful predictors for both nutritional quality and processing level. It also reinforces that Nutri-Score and NOVA capture overlapping but not identical aspects of food healthfulness.
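`compute_kruskal_summary` lives in `src.eda`; conceptually it runs a Kruskal–Wallis H test per nutrient across the groups, which can be sketched with `scipy.stats.kruskal` (an illustrative re-implementation, not the actual helper):

```python
import pandas as pd
from scipy.stats import kruskal

def kruskal_summary(df, cols, group_col):
    """Kruskal-Wallis H test of each column across the groups of group_col;
    a larger H means stronger separation between group distributions."""
    rows = []
    for col in cols:
        groups = [
            g.dropna().values
            for _, g in df.groupby(group_col, observed=True)[col]
            if g.notna().sum() > 0
        ]
        if len(groups) < 2:
            continue  # the test needs at least two non-empty groups
        h, p = kruskal(*groups)
        rows.append({"feature": col, "H-statistic": round(h, 1), "p-value": p})
    return pd.DataFrame(rows).sort_values("H-statistic", ascending=False)
```

Because the nutrient distributions are heavily skewed, this rank-based test is a safer choice than ANOVA here.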
9. Outlier Detection¶
df_capped, out_df = cap_outliers(df, NUTRIENT_COLS)
display(out_df)
plotter.plot_outlier_boxplots(df, df_capped, NUTRIENT_COLS)
| feature | outliers | outlier_pct | |
|---|---|---|---|
| 0 | energy_100g | 0 | 0.000 |
| 1 | fat_100g | 0 | 0.000 |
| 2 | saturated-fat_100g | 0 | 0.000 |
| 3 | carbohydrates_100g | 0 | 0.000 |
| 4 | sugars_100g | 0 | 0.000 |
| 5 | fiber_100g | 0 | 0.000 |
| 6 | proteins_100g | 0 | 0.000 |
| 7 | salt_100g | 0 | 0.000 |
| 8 | sodium_100g | 0 | 0.000 |
| 9 | added-sugars_100g | 0 | 0.000 |
Because percentile-based capping was already applied in Section 2, this second pass detects no remaining outliers, confirming that the earlier clipping removed the extreme values. Before capping, the raw data contained implausible extremes across several nutritional features, most visibly in energy, sugars, proteins, salt, sodium, and added sugars. Since such values are relatively rare, percentile-based capping preserves the overall distribution while reducing the influence of anomalous or potentially erroneous observations on model training.
10. Geographic and Category Distribution¶
plotter.plot_geo_category_distribution(df)
The geographic distribution shows that the dataset is heavily concentrated in France and the United States, followed by a smaller contribution from several European countries, indicating some regional imbalance in product representation. At the category level, many products fall into unknown, cereals and potatoes, and milk and dairy products, suggesting that packaged staple foods dominate the dataset and that category completeness may need further cleaning for more precise analysis.
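As the sample rows earlier show (e.g. `France,World`), `countries_en` can hold several comma-separated countries per product, so per-country counts need an explode step; a sketch (not the `src.eda` plotting code):

```python
import pandas as pd

def country_counts(df, n=10):
    """Products per country; countries_en may hold comma-separated lists,
    so a multi-country product counts once toward each listed country."""
    return (
        df["countries_en"]
        .dropna()
        .str.split(",")
        .explode()
        .str.strip()
        .value_counts()
        .head(n)
    )
```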
11. Missing Data Imputation Strategy¶
df_imp = df.copy()
for col in NUTRIENT_COLS:
    df_imp[col] = df_imp.groupby("pnns_groups_1")[col].transform(
        lambda x: x.fillna(x.median())
    )
print("Imputation Summary")
plotter.plot_imputation_comparison(df_capped, df_imp, COMPARE_COLS)
Imputation Summary
Missing numerical values in the core nutritional features were imputed with group-wise medians: each nutrient's median is computed within its pnns_groups_1 food category and used to fill gaps in that category, which respects category-level differences better than a single global median. Imputation was applied after outlier capping so that the imputed values are not influenced by extreme observations, preserving the central tendency of each nutrient's distribution.
12. Data Cleaning¶
To address multicollinearity identified during EDA, redundant features were removed during the data cleaning stage. In particular, sodium_100g was dropped because it is deterministically derived from salt_100g (salt = sodium × 2.5), and retaining both would introduce perfect multicollinearity. Removing such features helps improve model stability and reduces redundancy in the feature space.
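The claimed deterministic relation can be verified empirically before dropping the column; this small check (illustrative, not part of the cleaning pipeline) reports the share of rows where salt is within a tolerance of 2.5 × sodium:

```python
import numpy as np
import pandas as pd

def check_salt_sodium_ratio(df, tol=0.05):
    """Fraction of rows where salt_100g is approximately 2.5 x sodium_100g
    (within tol relative error); values near 1.0 confirm the derivation."""
    both = df[["salt_100g", "sodium_100g"]].dropna()
    both = both[both["sodium_100g"] > 0]  # avoid division by zero
    ratio = both["salt_100g"] / both["sodium_100g"]
    return float(np.isclose(ratio, 2.5, rtol=tol).mean())
```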
print(f"Rows before cleaning : {len(df_imp):,}")
df_clean = df_imp.copy()  # final dataset after cleaning + imputation
# ensure no remaining missing values in features
df_clean = df_clean.dropna(subset=NUTRIENT_COLS)
# drop rows where the target variable (nova_group) is missing.
# rows without a NOVA label cannot contribute to supervised classification.
n_before = len(df_clean)
df_clean = df_clean.dropna(subset=["nova_group"])
n_dropped_nova = n_before - len(df_clean)
print(f"Rows dropped (missing nova_group) : {n_dropped_nova:,}")
# remove exact duplicate rows to prevent data leakage and inflated metrics.
n_before = len(df_clean)
df_clean = df_clean.drop_duplicates()
n_dropped_dupes = n_before - len(df_clean)
print(f"Rows dropped (exact duplicates) : {n_dropped_dupes:,}")
# remove duplicate products by barcode, keeping the first occurrence.
# duplicate barcodes indicate the same product scanned multiple times.
if "code" in df_clean.columns:
    n_before = len(df_clean)
    df_clean = df_clean.drop_duplicates(subset=["code"], keep="first")
    n_dropped_code = n_before - len(df_clean)
    print(f"Rows dropped (duplicate barcodes) : {n_dropped_code:,}")
# drop sodium_100g — it is deterministically derived from salt_100g
# (salt = sodium × 2.5), so retaining both introduces perfect multicollinearity.
if "sodium_100g" in df_clean.columns:
    df_clean = df_clean.drop(columns=["sodium_100g"])
    # update NUTRIENT_COLS as well
    NUTRIENT_COLS = [c for c in NUTRIENT_COLS if c != "sodium_100g"]
    print("Column dropped (redundant): sodium_100g")
print(f"\nRows after cleaning : {len(df_clean):,}")
print(f"Total rows removed : {len(df_imp) - len(df_clean):,}")
print(f"Columns remaining : {df_clean.shape[1]}")
print("\nNOVA class distribution after cleaning:")
display(
df_clean["nova_group"]
.value_counts()
.sort_index()
.rename("count")
.to_frame()
)
Rows before cleaning : 718,325 Rows dropped (missing nova_group) : 411,392 Rows dropped (exact duplicates) : 0 Rows dropped (duplicate barcodes) : 0 Column dropped (redundant): sodium_100g Rows after cleaning : 306,932 Total rows removed : 411,393 Columns remaining : 23 NOVA class distribution after cleaning:
| count | |
|---|---|
| nova_group | |
| 1 | 37036 |
| 2 | 13573 |
| 3 | 62463 |
| 4 | 193860 |
# verify NaN counts to confirm no missing values remain
df_clean[NUTRIENT_COLS].isna().sum()
energy_100g 0 fat_100g 0 saturated-fat_100g 0 carbohydrates_100g 0 sugars_100g 0 fiber_100g 0 proteins_100g 0 salt_100g 0 added-sugars_100g 0 dtype: int64
13. Save and Clean-Up¶
# ensure target is correct type
df_clean["nova_group"] = df_clean["nova_group"].astype(int)
# create (if not already exists) directory for the processed data
os.makedirs("dataset/processed", exist_ok=True)
# verify final dataset shape and columns
print("Final dataset shape:", df_clean.shape)
print("Columns:", sorted(df_clean.columns.tolist()))
# rename hyphenated columns to underscores for compatibility
df_clean = df_clean.rename(columns={
"saturated-fat_100g": "saturated_fat_100g",
"trans-fat_100g": "trans_fat_100g",
"added-sugars_100g": "added_sugars_100g",
"monounsaturated-fat_100g": "monounsaturated_fat_100g",
"polyunsaturated-fat_100g": "polyunsaturated_fat_100g",
})
# barcodes are identifiers, not numbers — store as string to avoid PyArrow overflow
if "code" in df_clean.columns:
    df_clean["code"] = df_clean["code"].astype(str)
# save cleaned dataset
output_path = "dataset/processed/open_food_facts_cleaned.parquet"
df_clean.to_parquet(output_path, index=False)
# confirmation
print(f"\nSaved to: {output_path}")
print(f"Parquet columns ({len(df_clean.columns)}): {sorted(df_clean.columns.tolist())}")
Final dataset shape: (306932, 23) Columns: ['added-sugars_100g', 'additives_n', 'additives_tags', 'brands', 'carbohydrates_100g', 'categories_en', 'code', 'countries_en', 'energy_100g', 'fat_100g', 'fiber_100g', 'ingredients_analysis_tags', 'ingredients_text', 'nova_group', 'nutriscore_score', 'nutrition_grade_fr', 'pnns_groups_1', 'pnns_groups_2', 'product_name', 'proteins_100g', 'salt_100g', 'saturated-fat_100g', 'sugars_100g'] Saved to: dataset/processed/open_food_facts_cleaned.parquet Parquet columns (23): ['added_sugars_100g', 'additives_n', 'additives_tags', 'brands', 'carbohydrates_100g', 'categories_en', 'code', 'countries_en', 'energy_100g', 'fat_100g', 'fiber_100g', 'ingredients_analysis_tags', 'ingredients_text', 'nova_group', 'nutriscore_score', 'nutrition_grade_fr', 'pnns_groups_1', 'pnns_groups_2', 'product_name', 'proteins_100g', 'salt_100g', 'saturated_fat_100g', 'sugars_100g']
Rows missing the target variable nova_group were removed first, as they cannot contribute to supervised model training. Exact duplicate rows and barcode-level duplicate product entries were then eliminated to prevent data leakage and inflated metrics. Finally, sodium_100g was dropped as a redundant feature since it is deterministically derived from salt_100g (salt = sodium × 2.5), and retaining both would introduce perfect multicollinearity into the feature set. The resulting cleaned dataframe df_clean is the final output and will be passed to downstream modeling notebooks.
Exploratory Data Analysis Summary¶
The dataset is usable for modeling after preprocessing, with most variables showing low to moderate missingness, while a few fields such as additives and NOVA-related columns have substantially higher missing values.
Nutri-Score is fairly distributed across classes, with C being the most common grade, providing a reasonable target balance for nutritional quality prediction.
NOVA group is highly imbalanced toward NOVA 4, indicating that the dataset is dominated by ultra-processed foods.
Nutritional variables such as fat, saturated fat, sugars, salt, and sodium are generally right-skewed, with many low-to-moderate values and a small number of extreme observations.
Poorer Nutri-Score grades tend to have higher energy, fat, saturated fat, sugars, salt, and sodium, while better grades are relatively associated with higher fiber.
Across NOVA groups, the strongest separation is observed for salt, sodium, and sugars, showing that processing level is closely tied to industrial formulation.
Correlation analysis reveals strong multicollinearity among some nutrient features, especially salt–sodium and several other highly correlated nutrient pairs, suggesting the need for feature selection or dimensionality reduction.
Outlier detection shows that extreme values are present but relatively rare, and percentile-based capping helps reduce their impact without heavily distorting the data.
Missing nutritional values were handled using global median imputation, which preserved the overall distributions while filling gaps robustly in skewed variables.
Additive analysis shows that most products contain few additives, but additive counts increase with worse Nutri-Score and are highest in NOVA 4, reinforcing the connection between additives and ultra-processing.
The dataset is geographically concentrated in France and the United States, with additional representation from several European countries, which may introduce regional bias.
Product categories are dominated by unknown, cereals and potatoes, and milk and dairy products, indicating both strong packaged-food representation and some category-label incompleteness.
Overall, the EDA suggests that the dataset contains strong predictive signals for both nutritional quality and processing level, but it also requires careful handling of missing data, class imbalance, outliers, and multicollinearity before model development.
AI Use Disclosure¶
AI assistance tools were used in the following capacities during the development of this project:
Research and Planning: AI tools were used to search for code snippets, explore modeling approaches, and identify applicable machine learning techniques (e.g., NOVA classification strategies, anomaly detection methods, SHAP explainability patterns). These suggestions were reviewed, adapted, and validated by the team before implementation.
Copy Editing and Report Refinement: An AI assistant was used to copy edit written documentation and the final report draft, check for redundancy, and provide feedback on areas that could be tightened up or that required additional clarification. The prompt provided to the tool included context about the project purpose, target audience (academic evaluators for the AAI-590 Capstone), and formatting guidelines.
All AI-generated suggestions were critically reviewed by the team. Final decisions regarding methodology, implementation, and written content remain the work of the authors.