Analysis, prediction, and scenario projection to 2030 across IDB member countries in Latin America and the Caribbean.
The agro-food sector is a major contributor to greenhouse gas emissions across Latin America, yet emission patterns remain poorly understood at the country level. This project analyzes CO₂ emissions from agro-industrial activities in IDB member countries using an annual, country-level dataset covering 1990–2020.
The central question: Can historical patterns in agro-food activity variables reliably explain and project total CO₂ emissions toward 2030? The goal is an interpretable, data-driven framework that supports emissions monitoring and scenario-based exploration — without assuming causal relationships or prescribing policy.
02 — Methodology
The dataset contains 13 columns covering individual emission sources (food processing, packaging, transport, retail, household consumption, rice cultivation, on-farm energy use, and fertilizer/pesticide manufacturing) plus the aggregate total_emission target. All numeric; no duplicates.
Trinidad & Tobago excluded: this country is present in the raw data, but On-farm energy use is missing for most years, so its on_farm_total evaluates to NaN and the country is dropped by dropna() during feature construction. Given its low emission share, regional aggregate results are not materially affected. Imputation via country-mean is a recommended extension.
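The country-mean imputation suggested above could look roughly like the sketch below. The helper name `impute_country_mean` and the toy numbers are illustrative assumptions, not part of the project code; only the column names `Area` and `On-farm energy use` come from the source.

```python
import numpy as np
import pandas as pd


def impute_country_mean(df: pd.DataFrame, col: str, group: str = "Area") -> pd.DataFrame:
    """Fill NaNs in `col` with the mean of that country's observed years."""
    out = df.copy()
    # transform("mean") skips NaNs and broadcasts each country's mean to its rows
    country_mean = out.groupby(group)[col].transform("mean")
    out[col] = out[col].fillna(country_mean)
    return out


# Toy panel with made-up numbers: country B is missing one year.
toy = pd.DataFrame({
    "Area": ["A", "A", "A", "B", "B", "B"],
    "Year": [2018, 2019, 2020, 2018, 2019, 2020],
    "On-farm energy use": [10.0, 12.0, 14.0, 4.0, np.nan, 6.0],
})
filled = impute_country_mean(toy, "On-farm energy use")
print(filled)
```

When most years are missing, the country mean rests on very few observations, so imputed rows should probably be flagged. A country with no observed values at all stays NaN under this approach.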
Total agro-food emissions grew steadily from ~1.2M kt in 1990 to ~1.9M kt in 2020 — a 58% increase over 30 years. Food Household Consumption and Food Transport are the largest and fastest-growing components, reflecting regional urbanization and food system expansion. A structural acceleration is visible around 2010–2012.
Key finding: the raw target distribution is strongly right-skewed, with Brazil and Argentina generating 10–100× more emissions than small Caribbean states. After a log1p transformation the distribution becomes roughly symmetric, which indicates that the skew is a matter of scale and that large emitters dominate absolute error metrics (MAE, RMSE). This is why per-country relative validation (Section 7) is essential alongside global metrics.
Modeling note: the models below are trained on the raw scale. As a consequence, RMSE is anchored by large emitters. Training on log1p(total_emission) and back-transforming via expm1 would give more balanced weight across countries — recommended as an extension.
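A minimal sketch of the log-scale extension, assuming scikit-learn's `TransformedTargetRegressor` is an acceptable way to wire `log1p` and `expm1` around the ElasticNet. The feature matrix and target here are synthetic stand-ins, not the project data.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import ElasticNet

# Synthetic stand-in: three features and a right-skewed, log-linear target.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(200, 3))
y = np.expm1(0.5 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0.0, 0.1, size=200))

log_model = TransformedTargetRegressor(
    regressor=ElasticNet(alpha=0.01, l1_ratio=0.8, max_iter=10_000),
    func=np.log1p,          # applied to y before fitting
    inverse_func=np.expm1,  # applied to predictions on the way out
)
log_model.fit(X, y)
pred = log_model.predict(X)  # already back on the original (raw) scale
```

Metrics computed on `pred` remain on the raw scale and comparable with the reported MAE and RMSE. A naive `expm1` back-transform tends to behave more like a median than a mean prediction, so aggregate totals may be biased low.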
Emissions are highly geographically concentrated. Brazil alone accounts for ~40% of the regional total — a single country whose trajectory drives global metrics. Argentina and Mexico form a distant second tier (~7–8% each). This heterogeneity is fundamental to interpreting model performance.
05 — Feature Engineering
Three types of features are constructed from the raw variables:
Aggregated sector variables: food_chain_total (processing + packaging + transport + household + retail) and on_farm_total (energy use + electricity). These reduce dimensionality while preserving the main structural signals and directly address the multicollinearity observed in the correlation matrix.
Lagged emissions (total_emission_lag1): captures temporal autocorrelation within each country. Computed per country via groupby("Area").shift(1) — panel-safe, prevents cross-country leakage.
Year: encodes the overall time trend common across countries.
    # Panel-safe feature construction
    df = df.sort_values(["Area", "Year"]).reset_index(drop=True)

    df["food_chain_total"] = (
        df["Food Processing"]
        + df["Food Packaging"]
        + df["Food Transport"]
        + df["Food Household Consumption"]
        + df["Food Retail"]
    )
    df["on_farm_total"] = df["On-farm energy use"] + df["On-farm Electricity Use"]

    # Lag computed within each country; avoids cross-country leakage
    df["total_emission_lag1"] = df.groupby("Area")["total_emission"].shift(1)

    features = ["Rice Cultivation", "food_chain_total", "on_farm_total",
                "total_emission_lag1", "Year"]
A time-based split (train ≤ 2015, test 2016–2020) is used throughout to preserve temporal ordering and prevent leakage. This yields 575 training rows and 115 test rows. The cutoff is set at 2015 rather than at the observed structural break (~2010) so that the test set keeps five full years, the minimum judged sufficient for stable metric estimation.
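The split itself is a one-line filter on `Year`. The sketch below uses a hypothetical helper `time_split` and a toy panel of 23 placeholder countries. The figure 23 is inferred from the reported row counts (115 / 5 and 575 / 25), not stated in the source, and 1990 is assumed to be lost to the lag.

```python
import pandas as pd


def time_split(df: pd.DataFrame, cutoff: int = 2015, year_col: str = "Year"):
    """Return (train, test): rows up to and including `cutoff`, and rows after it."""
    train = df[df[year_col] <= cutoff]
    test = df[df[year_col] > cutoff]
    return train, test


# Toy panel: 23 placeholder countries observed 1991-2020.
toy = pd.DataFrame(
    [(f"C{i:02d}", year) for i in range(23) for year in range(1991, 2021)],
    columns=["Area", "Year"],
)
train, test = time_split(toy)
print(len(train), len(test))  # 575 115
```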
| Model | Strengths | Role in this project |
|---|---|---|
| ElasticNet | Handles multicollinearity via L1+L2 penalties; interpretable coefficients | Primary model — chosen for projection and scenario analysis |
| RandomForest | Captures non-linear interactions; no distributional assumptions | Benchmark for non-linear effects |
| GradientBoosting | Typically highest accuracy on tabular data | Upper bound on predictive performance |
Hyperparameter note: ElasticNet alpha=0.01, l1_ratio=0.8 were selected via coarse grid search on the training set using 5-fold cross-validation. The folds were not time-based, which is a limitation. alpha=0.01 provides mild regularization sufficient to handle food-chain multicollinearity without excessive shrinkage; l1_ratio=0.8 favors sparsity while retaining L2 stability. RandomForest and GradientBoosting use commonly reported defaults; no systematic tuning was performed, so the comparison reflects hyperparameter choices as much as architecture differences.
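One way to address the cross-validation limitation is an expanding-window scheme keyed on `Year`, so that every validation block lies strictly after its training block. The generator below is a sketch under that assumption; `expanding_year_folds`, `n_folds` and `val_years` are hypothetical names and defaults, not taken from the project.

```python
import numpy as np


def expanding_year_folds(years, n_folds=5, val_years=2):
    """Yield (train_idx, val_idx) pairs where validation years follow training years."""
    years = np.asarray(years)
    unique = np.unique(years)  # sorted distinct years
    for k in range(n_folds, 0, -1):
        val_end = len(unique) - (k - 1) * val_years
        val_start = val_end - val_years
        if val_start <= 0:
            continue  # not enough history before this validation block
        train_mask = years < unique[val_start]
        val_mask = (years >= unique[val_start]) & (years <= unique[val_end - 1])
        yield np.flatnonzero(train_mask), np.flatnonzero(val_mask)


# Toy example: three countries, training years 1991-2015.
toy_years = np.tile(np.arange(1991, 2016), 3)
folds = list(expanding_year_folds(toy_years))
for train_idx, val_idx in folds:
    print(toy_years[train_idx].max(), sorted(set(toy_years[val_idx])))
```

Because the folds are plain index arrays, `list(expanding_year_folds(...))` can be passed as the `cv` argument of scikit-learn's `GridSearchCV`, which accepts an iterable of train/validation index pairs.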
Both RandomForest and GradientBoosting agree on the ranking of features, providing a consistent signal about which variables drive emissions:
Feature ranking (from the importance chart): total_emission_lag1, food_chain_total, on_farm_total, Year, Rice Cultivation.
The lagged emission variable (total_emission_lag1) is a very strong predictor because it encodes the same quantity being predicted, shifted one period. A high R² may therefore reflect temporal autocorrelation rather than genuinely predictive relationships between agro-food activities and emissions.
To assess the independent contribution of the activity variables, all three models are trained both with and without the lag feature. This ablation makes the analytical contribution of the project honest and defensible.
| Model | Features | MAE | RMSE | R² |
|---|---|---|---|---|
| ElasticNet | With lag ✓ | 8,064 | 14,119 | 0.993 |
| ElasticNet | Without lag | 34,821 | 58,340 | 0.841 |
| RandomForest | With lag | 15,804 | 52,846 | 0.903 |
| RandomForest | Without lag | 41,205 | 71,930 | 0.761 |
| GradientBoosting | With lag | 12,608 | 43,806 | 0.933 |
| GradientBoosting | Without lag | 38,440 | 64,210 | 0.798 |
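Key finding: removing the lag lowers R² by ~15 percentage points for ElasticNet (0.993 → 0.841) and by ~14 pp for the tree-based models. Read the other way, the activity variables and Year alone reach R² of roughly 0.76–0.84, and the lag supplies the remaining ~14–15 pp. Isolating what the activity variables add beyond temporal inertia would require a lag-only baseline, which this ablation does not include.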
This is not a weakness; it is an honest and important finding. It means: (1) last year's emissions are the best single predictor of this year's, and (2) sector activity levels carry substantial signal of their own, since the models still explain most of the variance when the lag is withheld. Both results are meaningful for monitoring and policy contexts.
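The with/without-lag comparison can be organised as a small loop over feature sets. The sketch below is an assumption about how such an ablation might be coded, shown for ElasticNet only and run on synthetic data. The `lag_only` variant is an extra baseline that the source does not report.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

LAG = "total_emission_lag1"
FEATURES = ["Rice Cultivation", "food_chain_total", "on_farm_total", LAG, "Year"]


def evaluate(train, test, features, target="total_emission"):
    """Fit ElasticNet on `train` and score it on `test` for one feature set."""
    model = ElasticNet(alpha=0.01, l1_ratio=0.8, max_iter=10_000)
    model.fit(train[features], train[target])
    pred = model.predict(test[features])
    return {
        "MAE": float(mean_absolute_error(test[target], pred)),
        "RMSE": float(np.sqrt(mean_squared_error(test[target], pred))),
        "R2": float(r2_score(test[target], pred)),
    }


def lag_ablation(train, test):
    """Score the same model on feature sets that differ only in how the lag is used."""
    feature_sets = {
        "with_lag": FEATURES,
        "without_lag": [f for f in FEATURES if f != LAG],
        "lag_only": [LAG],  # extra baseline, not reported in the source
    }
    return {name: evaluate(train, test, cols) for name, cols in feature_sets.items()}


# Synthetic stand-in data (made-up numbers, not the project dataset).
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "Rice Cultivation": rng.uniform(0, 10, n),
    "food_chain_total": rng.uniform(0, 100, n),
    "on_farm_total": rng.uniform(0, 50, n),
    LAG: rng.uniform(0, 500, n),
    "Year": rng.integers(1991, 2021, n),
})
toy["total_emission"] = 0.9 * toy[LAG] + 1.5 * toy["food_chain_total"] + rng.normal(0, 5, n)

train, test = toy[toy["Year"] <= 2015], toy[toy["Year"] > 2015]
results = lag_ablation(train, test)
for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Comparing `with_lag` against `lag_only` shows what the activity variables add on top of inertia, while `with_lag` against `without_lag` shows what the lag adds on top of activity.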
| Model | MAE ↓ | RMSE ↓ | R² ↑ |
|---|---|---|---|
| ElasticNet ✓ | 8,064 | 14,119 | 0.9931 |
| GradientBoosting | 12,608 | 43,806 | 0.9332 |
| RandomForest | 15,804 | 52,846 | 0.9028 |
ElasticNet is the best model on all three metrics. GradientBoosting's RMSE is about 3× higher, consistent with tree-based models being unable to extrapolate beyond the target range seen in training. This hurts most for countries whose 2016–2020 trajectory diverges from the trend up to 2015. The per-country validation confirms this: the countries where GradientBoosting error is highest are the same small-economy cases with the greatest distributional shift. ElasticNet's interpretability also makes it the right choice for scenario projections.
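The out-of-sample behaviour of tree ensembles can be seen on a toy series. The example below is illustrative only and uses a made-up, perfectly linear trend rather than the project data. Every split threshold a tree learns lies inside the training years, so all later years fall into the same leaves and receive the same prediction.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

# Made-up series: emissions rise by 10 units per year over the training window.
train_years = np.arange(1991, 2016).reshape(-1, 1)
train_y = 100.0 + 10.0 * (train_years.ravel() - 1991)  # 100 in 1991, 340 in 2015
test_years = np.arange(2016, 2021).reshape(-1, 1)

gbr = GradientBoostingRegressor(random_state=0).fit(train_years, train_y)
lin = LinearRegression().fit(train_years, train_y)

gbr_pred = gbr.predict(test_years)  # flat: stuck near the last training level
lin_pred = lin.predict(test_years)  # continues the trend: 350, 360, ..., 390

print(np.round(gbr_pred, 1))
print(np.round(lin_pred, 1))
```

In the real models the lag feature softens this effect, since it carries the level forward, but the same mechanism applies whenever a test row falls outside the feature ranges seen in training.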
08 — Scenario Projection
This section presents a hypothetical scenario. Driver trajectories are derived directly from observed CAGR over 2011–2020 (code below). Results are conditional projections, not deterministic forecasts; changing any growth rate substantially alters the outcome.
    # Reproducible derivation of scenario growth rates from data
    base_year, end_yr = 2011, 2020
    n = end_yr - base_year  # 9 years
    for col in ["Rice Cultivation", "food_chain_total", "on_farm_total"]:
        v0 = df.loc[df["Year"] == base_year, col].sum()
        v1 = df.loc[df["Year"] == end_yr, col].sum()
        cagr = (v1 / v0) ** (1 / n) - 1
        print(f"{col:25s} CAGR = {cagr:+.1%}")
| Driver | CAGR 2011–2020 | Interpretation |
|---|---|---|
| Rice Cultivation | −0.7%/yr | Slow structural decline in rice-intensive systems |
| food_chain_total | +2.3%/yr | Continued urbanization and consumption growth |
| on_farm_total | +0.9%/yr | Moderate on-farm energy expansion |
The projection is iterative: each year's predicted emission feeds the next year as total_emission_lag1. Driver values are updated by compounding the CAGR from the last observed per-country value. The model used for projection (enet_final) is retrained on all 1991–2020 data; the model used for reported metrics (enet_eval, trained on ≤2015 only) is kept separate.
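The iterative loop described above can be sketched for a single country as follows. `project_country` and the `predict` callable are hypothetical stand-ins (in the project, `predict` would wrap `enet_final`), and the toy coefficients are made up. Only the growth rates are taken from the CAGR table.

```python
from typing import Callable, Dict, List

DRIVERS = ["Rice Cultivation", "food_chain_total", "on_farm_total"]


def project_country(
    last_obs: Dict[str, float],
    cagr: Dict[str, float],
    predict: Callable[[Dict[str, float]], float],
    start_year: int = 2021,
    end_year: int = 2030,
) -> List[float]:
    """Roll one country forward, feeding each prediction back in as the next lag."""
    drivers = {d: last_obs[d] for d in DRIVERS}
    lag = last_obs["total_emission"]
    path = []
    for year in range(start_year, end_year + 1):
        # Compound every driver by one more year of growth.
        drivers = {d: v * (1.0 + cagr[d]) for d, v in drivers.items()}
        features = {**drivers, "total_emission_lag1": lag, "Year": year}
        lag = predict(features)  # this year's prediction is next year's lag
        path.append(lag)
    return path


# Growth rates from the CAGR table; the last observation and model are made up.
cagr = {"Rice Cultivation": -0.007, "food_chain_total": 0.023, "on_farm_total": 0.009}
last_2020 = {"Rice Cultivation": 100.0, "food_chain_total": 1000.0,
             "on_farm_total": 200.0, "total_emission": 5000.0}


def toy_predict(f: Dict[str, float]) -> float:
    """Placeholder linear model standing in for the fitted ElasticNet."""
    return 0.8 * f["total_emission_lag1"] + 1.0 * f["food_chain_total"]


path = project_country(last_2020, cagr, toy_predict)
print([round(v) for v in path])
```

Because each step consumes the previous prediction, any error in an early year is carried into every later year.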
Under this scenario, regional emissions are projected to reach ~2.6M kt by 2030, approximately 37% above the 2020 level. The growth is driven almost entirely by the food supply chain (compounding at +2.3%/yr); the modest rice decline provides partial offset.
Projection uncertainty: the iterative structure accumulates error over 10 steps. Using the test-period RMSE (~14,100 kt per country-year) as a rough yardstick, a ±1 RMSE range on the 2030 regional aggregate implies roughly ±7% of the projected total. Because that RMSE was measured one step ahead with observed lags, the true 10-step uncertainty is likely wider. The chart reflects the ±7% range as a shaded band. Formal prediction intervals would require bootstrapping or a probabilistic model; the central scenario should not be read as a deterministic forecast.
Global metrics hide important heterogeneity. Brazil alone generates ~40% of regional emissions — a model that fits Brazil perfectly can achieve high global R² while performing poorly on smaller countries. The table below computes Relative MAE (MAE as % of country mean) for each country on the test period 2016–2020.
Strong performance (R²>0.95, Rel.MAE <6%): Brazil, Argentina, Mexico, Colombia, Chile, Peru, Uruguay, Costa Rica — countries with consistent, volume-driven emission trends. Notably Uruguay performs best despite being a small economy, likely due to low structural variability in its agro-food sector.
Moderate (Rel.MAE 6–15%): Guatemala, Bolivia, Honduras, Nicaragua, Dominican Republic, El Salvador, Panama — likely affected by structural breaks or data gaps.
Weak (>15%): Guyana, Jamaica, Haiti, Suriname, Belize, Barbados, Bahamas — small economies where emissions are driven by idiosyncratic factors (tourism cycles, deforestation, political instability) that the pooled model cannot capture from aggregate agro-food variables alone. The 2030 projections for these countries should be treated as very approximate. The regional aggregate projection remains robust because it is dominated by the large-emitter countries where the model performs well.
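The Relative MAE behind the three tiers above can be computed with a per-country groupby. In this sketch the helper name `relative_mae_by_country`, the `pred` column and the toy numbers are assumptions. The tier cut-offs (6% and 15%) follow the text, but the R² criterion used for the strong tier is left out.

```python
import pandas as pd


def relative_mae_by_country(df, actual="total_emission", pred="pred", group="Area"):
    """Per-country MAE, MAE as % of the country's mean emission, and a tier label."""
    abs_err = (df[actual] - df[pred]).abs()
    mae = abs_err.groupby(df[group]).mean()
    mean_actual = df[actual].groupby(df[group]).mean()
    out = pd.DataFrame({"MAE": mae, "rel_mae_pct": 100.0 * mae / mean_actual})
    out["tier"] = pd.cut(
        out["rel_mae_pct"],
        bins=[0, 6, 15, float("inf")],
        labels=["strong", "moderate", "weak"],
        right=False,  # intervals are [0, 6), [6, 15), [15, inf)
    )
    return out.sort_values("rel_mae_pct")


# Toy test-period predictions for three placeholder countries (made-up numbers).
toy = pd.DataFrame({
    "Area": ["A", "A", "B", "B", "C", "C"],
    "total_emission": [100.0, 100.0, 10.0, 10.0, 1.0, 1.0],
    "pred": [98.0, 102.0, 9.0, 11.0, 1.5, 0.5],
})
report = relative_mae_by_country(toy)
print(report)
```

Country C has the smallest absolute error yet lands in the weak tier. This is the effect that global MAE and RMSE hide.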
Agro-food CO₂ emissions across IDB countries grew from ~1.2M kt (1990) to ~1.9M kt (2020), driven primarily by the food supply chain and on-farm energy use. Emissions are highly geographically concentrated: Brazil alone accounts for ~40% of the regional total. The lagged emission variable is the single strongest predictor (~60–75% of tree-based importance), confirming strong temporal inertia. The ablation study shows that removing the lag costs ~14–15 percentage points of R², while the agro-food activity variables and Year alone still reach R² of roughly 0.76–0.84, a meaningful finding in its own right.
ElasticNet outperforms tree-based models across all metrics on the temporal test period (2016–2020): R²=0.993, MAE≈8,064 kt, RMSE≈14,119 kt. Per-country validation shows strong performance for major emitters (R²>0.95) and weaker performance for small island states (R²=0.74–0.83), where idiosyncratic dynamics dominate.
| # | Limitation | Impact |
|---|---|---|
| 1 | Short test period (5 years, 115 rows) | Limits statistical power of model comparison |
| 2 | Driver trajectories extrapolated from historical CAGR (2011–2020) | Changing food chain from +2.3% to +1.5%/yr substantially alters 2030 projection |
| 3 | Models trained on raw scale, not log-transformed | Metrics dominated by large emitters; balanced extension recommended |
| 4 | Single pooled model — no country fixed effects | Small island state forecasts are very approximate |
| 5 | Iterative projection accumulates error over 10 steps | ~±7% uncertainty on the 2030 regional aggregate |
| 6 | Trinidad & Tobago excluded (missing on-farm energy) | Negligible — low emission share; imputation recommended |
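To give a sense of scale for limitation 2, the snippet below compounds the two food-chain growth rates mentioned in the table over the ten projection years. It covers the driver level only, relative to 2020. The effect on projected emissions depends on the fitted model coefficients, which are not reproduced here. `compound` is a hypothetical helper.

```python
def compound(value: float, annual_rate: float, years: int) -> float:
    """Value after `years` of growth at a constant annual rate."""
    return value * (1.0 + annual_rate) ** years


base = compound(1.0, 0.023, 10)  # scenario rate from the CAGR table
alt = compound(1.0, 0.015, 10)   # alternative rate named in limitation 2
print(f"{base:.3f} {alt:.3f}")   # 1.255 1.161
print(f"driver level in 2030 is {1 - alt / base:.1%} lower under the alternative")
```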