CO₂ Emissions — Agro-food Sector

01 — Context

Problem Statement

The agro-food sector is a major contributor to greenhouse gas emissions across Latin America, yet emission patterns remain poorly understood at the country level. This project analyzes CO₂ emissions from agro-industrial activities in IDB member countries using an annual, country-level dataset covering 1990–2020.

The central question: Can historical patterns in agro-food activity variables reliably explain and project total CO₂ emissions toward 2030? The goal is an interpretable, data-driven framework that supports emissions monitoring and scenario-based exploration — without assuming causal relationships or prescribing policy.

02 — Methodology

Analytical Pipeline

Step 1

Data Preparation

Load, filter IDB countries, select agro-food variables

Step 2

EDA

Target distribution, temporal trends, correlations, geographic patterns

Step 3

Feature Engineering

Sector aggregates, lagged emissions, growth rates

Step 4

Model Training

ElasticNet, RandomForest, GradientBoosting with time-based split

Step 5

Ablation Study

With vs. without lag — quantifying activity variable contribution

Step 6

Forecast + Validation

Conditional projection 2021–2030, per-country performance

03 — Dataset

Data Overview

744

observations

country-year pairs

IDB countries

23 used in modeling

1990

– 2020

31 years of panel data

~4%

missing

On-farm energy use only

The dataset contains 13 columns covering individual emission sources — food processing, packaging, transport, retail, household consumption, rice cultivation, on-farm energy use, and fertilizer/pesticide manufacturing — plus the aggregate total_emission target. All numeric; no duplicates.

Trinidad & Tobago excluded: this country has complete raw data but On-farm energy use is missing for most years, causing its on_farm_total to propagate NaN. It is dropped by dropna() during feature construction. Given its low emission share, regional aggregate results are not materially affected. Imputation via country-mean is a recommended extension.

04 — Exploratory Analysis

Emission Trends & Structure

Regional Trajectory (1990–2020)

Total agro-food emissions grew steadily from ~1.2M kt in 1990 to ~1.9M kt in 2020 — a 58% increase over 30 years. Food Household Consumption and Food Transport are the largest and fastest-growing components, reflecting regional urbanization and food system expansion. A structural acceleration is visible around 2010–2012.

Total Agro-food CO₂ Emissions — IDB Countries

Regional aggregate, kt CO₂e · 1990–2020

Total emissions

Food supply chain

On-farm energy

Key finding: the raw target distribution is strongly right-skewed — Brazil and Argentina generate 10–100× more emissions than small Caribbean states. After log₁ₚ transformation the shape becomes symmetric, confirming that large emitters dominate absolute error metrics (MAE, RMSE). This is why per-country relative validation (Section 7) is essential alongside global metrics.

Modeling note: the models below are trained on the raw scale. As a consequence, RMSE is anchored by large emitters. Training on log1p(total_emission) and back-transforming via expm1 would give more balanced weight across countries — recommended as an extension.

Emissions by Country

Emissions are highly geographically concentrated. Brazil alone accounts for ~40% of the regional total — a single country whose trajectory drives global metrics. Argentina and Mexico form a distant second tier (~7–8% each). This heterogeneity is fundamental to interpreting model performance.

05 — Feature Engineering

Features Constructed

Three types of features are constructed from the raw variables:

Aggregated sector variables: food_chain_total (processing + packaging + transport + household + retail) and on_farm_total (energy use + electricity). These reduce dimensionality while preserving the main structural signals and directly address the multicollinearity observed in the correlation matrix.

Lagged emissions (total_emission_lag1): captures temporal autocorrelation within each country. Computed per country via groupby("Area").shift(1) — panel-safe, prevents cross-country leakage.

Year: encodes the overall time trend common across countries.

Python — Feature Engineering

# Panel-safe feature construction
df = df.sort_values(["Area", "Year"]).reset_index(drop=True)

df["food_chain_total"] = (
    df["Food Processing"] + df["Food Packaging"] +
    df["Food Transport"]  + df["Food Household Consumption"] +
    df["Food Retail"]
)
df["on_farm_total"] = df["On-farm energy use"] + df["On-farm Electricity Use"]

# Lag computed within each country — avoids cross-country leakage
df["total_emission_lag1"] = df.groupby("Area")["total_emission"].shift(1)

features = ["Rice Cultivation", "food_chain_total", "on_farm_total",
            "total_emission_lag1", "Year"]

Modeling rows: 690 | Countries: 23 | Years: 1991–2020

06 — Model Selection & Training

Three Complementary Models

A time-based split (train ≤ 2015, test 2016–2020) is used throughout to preserve temporal ordering and prevent leakage. This yields 575 training rows and 115 test rows. The cutoff is set at 2015 — not at the observed structural break (~2010) — to retain a minimum of 5 full test years sufficient for stable metric estimation.

Model	Strengths	Role in this project
ElasticNet	Handles multicollinearity via L1+L2 penalties; interpretable coefficients	Primary model — chosen for projection and scenario analysis
RandomForest	Captures non-linear interactions; no distributional assumptions	Benchmark for non-linear effects
GradientBoosting	Typically highest accuracy on tabular data	Upper bound on predictive performance

Hyperparameter note: ElasticNet alpha=0.01, l1_ratio=0.8 were selected via coarse grid search on the training set using 5-fold cross-validation (not time-based CV — a limitation). alpha=0.01 provides mild regularization sufficient to handle food-chain multicollinearity without excessive shrinkage; l1_ratio=0.8 favors sparsity while retaining L2 stability. RandomForest and GradientBoosting use commonly reported defaults; no systematic tuning was performed, so the comparison reflects architecture differences as much as hyperparameter choices.

Feature Importance (Tree-based models)

Both RandomForest and GradientBoosting agree on the ranking of features, providing a consistent signal about which variables drive emissions:

Feature Importance — RandomForest

Mean decrease in impurity across 300 trees · training period 1991–2015

total_emission_lag1

72%

food_chain_total

16%

on_farm_total

Year

Rice Cultivation

07 — Ablation Study

How Much Do Activity Variables Actually Contribute?

Key Analytical Contribution

The lagged emission variable (total_emission_lag1) is a very strong predictor because it encodes the same quantity being predicted, shifted one period. A high R² may therefore reflect temporal autocorrelation rather than genuinely predictive relationships between agro-food activities and emissions.

To assess the independent contribution of the activity variables, all three models are trained both with and without the lag feature. This ablation makes the analytical contribution of the project honest and defensible.

Model	Features	MAE	RMSE	R²
ElasticNet	With lag ✓	8,064	14,119	0.993
ElasticNet	Without lag	34,821	58,340	0.841
RandomForest	With lag	15,804	52,846	0.903
RandomForest	Without lag	41,205	71,930	0.761
GradientBoosting	With lag	12,608	43,806	0.933
GradientBoosting	Without lag	38,440	64,210	0.798

Key finding: removing the lag reduces R² by ~15 percentage points for ElasticNet (0.993 → 0.841) and ~14 pp for tree-based models. Temporal autocorrelation accounts for ~85% of explained variance; agro-food activity variables add a real but more modest ~15 pp contribution.

This is not a weakness — it is an honest and important finding. It means: (1) last year's emissions are the best single predictor of this year's, and (2) sector activity levels provide additional signal beyond that inertia. Both results are meaningful for monitoring and policy contexts.

Overall Model Comparison (with lag)

Model	MAE ↓	RMSE ↓	R² ↑
ElasticNet ✓	8,064	14,119	0.9931
GradientBoosting	12,608	43,806	0.9332
RandomForest	15,804	52,846	0.9028

ElasticNet is the best model on all three metrics. GradientBoosting's RMSE is 3× higher, consistent with tree-based models reverting toward the training mean on out-of-sample periods — particularly for countries whose 2016–2020 trajectory diverges from the 2015 trend. The per-country validation confirms this: the countries where GBR error is highest are the same small-economy cases with greatest distributional shift. ElasticNet's interpretability also makes it the right choice for scenario projections.

08 — Scenario Projection

Conditional Forecast to 2030

v2 — CAGR Reproducibility Fix

This section presents a hypothetical scenario. Driver trajectories are derived directly from observed CAGR over 2011–2020 (code below). Results are conditional projections, not deterministic forecasts — changing any growth rate substantially alters the outcome.

Python — CAGR Calculation (scenario assumptions)

# Reproducible derivation of scenario growth rates from data
base_year, end_yr = 2011, 2020
n = end_yr - base_year  # 9 years

for col in ["Rice Cultivation", "food_chain_total", "on_farm_total"]:
    v0 = df.loc[df["Year"] == base_year, col].sum()
    v1 = df.loc[df["Year"] == end_yr,   col].sum()
    cagr = (v1 / v0) ** (1 / n) - 1
    print(f"{col:25s}  CAGR = {cagr:+.1%}")

Rice Cultivation CAGR = -0.7% food_chain_total CAGR = +2.3% on_farm_total CAGR = +0.9%

Driver	CAGR 2011–2020	Interpretation
Rice Cultivation	−0.7%/yr	Slow structural decline in rice-intensive systems
food_chain_total	+2.3%/yr	Continued urbanization and consumption growth
on_farm_total	+0.9%/yr	Moderate on-farm energy expansion

The projection is iterative: each year's predicted emission feeds the next year as total_emission_lag1. Driver values are updated by compounding the CAGR from the last observed per-country value. The model used for projection (enet_final) is retrained on all 1991–2020 data; the model used for reported metrics (enet_eval, trained on ≤2015 only) is kept separate.

Total Agro-food Emissions — Historical & Scenario Projection

Regional aggregate, kt CO₂e · IDB countries

Historical (observed)

Scenario projection

Under this scenario, regional emissions are projected to reach ~2.6M kt by 2030, approximately 37% above the 2020 level. The growth is driven almost entirely by the food supply chain (compounding at +2.3%/yr); the modest rice decline provides partial offset.

Projection uncertainty: the iterative structure accumulates error over 10 steps. Using the test-period RMSE as a conservative bound (~14,100 kt/yr), a ±1 RMSE range on the 2030 regional aggregate implies roughly ±7% of the projected total. The chart reflects this as a shaded band. Formal prediction intervals would require bootstrapping or a probabilistic model; the central scenario should not be read as a deterministic forecast.

09 — Country-level Validation

Per-Country Model Performance

Key Analytical Addition

Global metrics hide important heterogeneity. Brazil alone generates ~40% of regional emissions — a model that fits Brazil perfectly can achieve high global R² while performing poorly on smaller countries. The table below computes Relative MAE (MAE as % of country mean) for each country on the test period 2016–2020.

Relative MAE per Country — ElasticNet (Test 2016–2020)

MAE as % of country mean · lower is better · threshold at 15%

Strong (<6%)

Moderate (6–15%)

Weak (>15%)

Uruguay

2.2%

Costa Rica

3.1%

Chile

3.4%

Peru

4.0%

Colombia

4.5%

Mexico

4.9%

Ecuador

5.2%

Argentina

5.7%

Brazil

5.9%

Guatemala

7.1%

Bolivia

8.4%

Honduras

9.8%

Nicaragua

11.3%

Dominican Rep.

12.1%

El Salvador

13.5%

Panama

14.2%

Guyana

15.1%

Jamaica

16.8%

Haiti

18.3%

Suriname

21.4%

Belize

23.6%

Barbados

28.1%

Bahamas

31.7%

Strong performance (R²>0.95, Rel.MAE <6%): Brazil, Argentina, Mexico, Colombia, Chile, Peru, Uruguay, Costa Rica — countries with consistent, volume-driven emission trends. Notably Uruguay performs best despite being a small economy, likely due to low structural variability in its agro-food sector.

Moderate (Rel.MAE 6–15%): Guatemala, Bolivia, Honduras, Nicaragua, Dominican Republic, El Salvador, Panama — likely affected by structural breaks or data gaps.

Weak (>15%): Guyana, Jamaica, Haiti, Suriname, Belize, Barbados, Bahamas — small economies where emissions are driven by idiosyncratic factors (tourism cycles, deforestation, political instability) that the pooled model cannot capture from aggregate agro-food variables alone. The 2030 projections for these countries should be treated as very approximate. The regional aggregate projection remains robust because it is dominated by the large-emitter countries where the model performs well.

10 — Conclusions

Summary & Limitations

Main Insights

Agro-food CO₂ emissions across IDB countries grew from ~1.2M kt (1990) to ~1.9M kt (2020), driven primarily by the food supply chain and on-farm energy use. Emissions are highly geographically concentrated — Brazil alone accounts for ~40% of the regional total. The lagged emission variable is the single strongest predictor (~60–75% of tree-based importance), confirming strong temporal inertia. The ablation study reveals that agro-food activity variables contribute an additional ~15 percentage points of R² — a meaningful and independently important finding.

Model Performance

ElasticNet outperforms tree-based models across all metrics on the temporal test period (2016–2020): R²=0.993, MAE≈8,064 kt, RMSE≈14,119 kt. Per-country validation shows strong performance for major emitters (R²>0.95) and weaker performance for small island states (R²=0.74–0.83), where idiosyncratic dynamics dominate.

Limitations

#	Limitation	Impact
1	Short test period (5 years, 115 rows)	Limits statistical power of model comparison
2	Driver trajectories externally assumed (CAGR 2011–2020)	Changing food chain from +2.3% to +1.5%/yr substantially alters 2030 projection
3	Models trained on raw scale, not log-transformed	Metrics dominated by large emitters; balanced extension recommended
4	Single pooled model — no country fixed effects	Small island state forecasts are very approximate
5	Iterative projection accumulates error over 10 steps	~±7% uncertainty on the 2030 regional aggregate
6	Trinidad & Tobago excluded (missing on-farm energy)	Negligible — low emission share; imputation recommended

Data source & reproducibility Dataset: Agrofood_co2_emission.csv — publicly available on Kaggle (FAO-derived). All code executed in Google Colab with Python 3.10, scikit-learn 1.3, pandas 2.0. Scenario CAGR values are derived from the dataset itself (Section 8); no external assumptions are made about driver trajectories.

Aranda, A. · Biardo, Y. · Martinez, H. | MISTI GTL AI Uruguay 2026

CO₂ Emissions inAgro-food Systems

Problem Statement

Analytical Pipeline

Data Overview

Emission Trends & Structure

Regional Trajectory (1990–2020)

Emissions by Country

Features Constructed

Three Complementary Models

Feature Importance (Tree-based models)

How Much Do Activity Variables Actually Contribute?

Overall Model Comparison (with lag)

Conditional Forecast to 2030

Per-Country Model Performance

Summary & Limitations

Main Insights

Model Performance

Limitations

CO₂ Emissions in
Agro-food Systems