Understanding Household Resilience to Commodity Price Shocks

The Role of Agricultural Practices

Author

Michael Bean

Introduction

This project explores how agricultural households in developing economies respond to global commodity price shocks, with a specific focus on how different farming practices may help households maintain well-being during volatile market periods. It builds on earlier work examining how smallholder coffee farmers in Sub-Saharan Africa cope with changing prices and how production decisions influence consumption.

The goal is to better understand the intersection of agricultural economics, rural development, and risk management, as well as provide insights for policy formation surrounding support for farming households in anticipation of, and during, seconomic downturns. The research was developed over three previous studies.

Research Development

Agroforestry Practices and Global Commodity Shocks (Proposal)
The first paper presented a research proposal to investigate whether Tanzanian households adopting agroforestry practices demonstrate greater resilience to coffee price volatility. It outlined a fixed-effects methodology using LSMS-ISA data, where agroforestry adoption would be proxied by the presence of tree crops and intercropping. The proposed model would regress household consumption on agroforestry adoption, lagged global coffee prices, and their interaction, with additional analyses of coping mechanisms including entrepreneurship, employment, savings, and borrowing. The method follows that of Adhvaryu et al. in Booms, Busts, and Household Enterprise: Evidence from Coffee Farmers in Tanzania (Adhvaryu, Kala, and Nyshadham 2021). Although not presenting empirical results, the proposal identified potential measurement challenges in agroforestry proxies and the possibility for temporal misalignment between price data and household behavior.

Agricultural Practices and Price Shocks (Ethiopia, Empirical Analysis)
The second paper implemented an econometric analysis using panel data from the Ethiopia Socioeconomic Survey (in conjunction with the World Bank LSMS-ISA), focusing specifically on coffee-producing households. It employed a two-way fixed effects model with Driscoll-Kraay standard errors to estimate how various agricultural practices including crop rotation, fertilizer use, extension program participation, advisory services, oxen ownership (mechanization proxy), and credit access moderated the relationship between lagged global coffee prices and household consumption. Results indicated that extension program participation and crop rotation enhanced household ability to translate price increases into consumption gains. However, the model did not address any endogeneity surrounding practice adoption or explore simultaneous behavioral responses to price changes.

Integrated Econometric and Machine Learning Approaches
Building on the previous econometric framework, the third paper introduced a machine learning ensemble combining Random Forest, Gradient Boosting Machine, and XGBoost models through elastic net stacking with lasso regularization. This approach sought to capture complex non-linear relationships and improve predictive accuracy for consumption outcomes. The results confirmed the importance of extension services and certain agricultural practices (oxen, crop rotation) and suggested more sophisticated modeling approaches—potentially including simultaneous equations or structural models—may better represent the complex decision-making processes of agricultural households facing price volatility.

Current Research Extension

Taken together, these studies raised important questions about how to measure resilience, account for household-level differences in exposure, and model decision-making. This report attempts to extend that work by assembling a richer set of household-level variables from the panel data and applying appropriate econometric techniques to investigate how specific practices may protect household consumption in the face of global price shocks.

Expanding the variable may help address some of the limitations identified in previous analyses:

Endogeneity in practice adoption: The selection of agricultural practices isn’t random and might correlate with unobserved household characteristics or regional differences. More variables allow for better control strategies or potential instrumental variable approaches.
Joint decision processes: Households likely make simultaneous decisions about production practices, consumption, and coping strategies. This dataset supports potential exploration of multivariate outcome models, such as Seemingly Unrelated Regression (SUR) or structural equation approaches.
Heterogeneous treatment effects: The impact of price shocks and the effectiveness of mitigating practices most likely vary by household characteristics. The expanded panel enables more granular subgroup analysis and interaction modeling.

Outcome Variables

After analyzing potential outcomes for missingness and removing variables that may be structurally correlated with the exposure index, a set of primary outcome variables was selected for further analysis.

The selected outcomes reflect various dimensions of household welfare:

Consumption Outcomes:

consumption_annual_total: Total annual household consumption expenditure, combining food and non-food components.
consumption_annual_food: Annual household expenditure on food items.
consumption_annual_nonfood: Annual household expenditure on non-food goods and services.

Food Security Outcomes:

rcsi: Reduced Coping Strategies Index; measures household reliance on coping behaviors during food shortages.
hhs_lite: Household Hunger Scale “Lite” version; summarizes experiences of moderate and severe food insecurity.
worry_dummy: Indicator of whether any household member reported worrying about not having enough food.

Dietary Diversity Outcome:

hdds_7d: Household Dietary Diversity Score over the past 7 days; higher values reflect more diverse diets.

All outcome variables were checked for data quality and missingness before inclusion. Correlation analysis among the outcomes further informed selection and modeling strategies.

Transformation of Outcome Variables

Given the observed strong right-skewness in household consumption expenditure variables, all consumption measures (consumption_annual_total, consumption_annual_food, consumption_annual_nonfood) were log-transformed using the function log(x + 1) before further analysis. This transformation improves normality, reduces the influence of outliers, and facilitates interpretation of regression coefficients as approximate percentage changes.

Food security indicators (rcsi, hhs_lite, worry_dummy) and dietary diversity (hdds_7d) were modeled in their original forms, reflecting their nature as index, binary, or count measures. The modeling approach can later be selected based on the data structure of each outcome.

Figure 1: Pairwise Relationships Among Household Welfare Outcomes (All Waves Combined). Consumption variables (Total, Food, Non-Food) and food security indicators (RCSI, HHS Lite) are log-transformed to reduce skewness.The plot highlights associations among consumption, food security, and dietary diversity. Individual waves followed similar patterns.

Volatility-Based Exposure and Shock Indices

To quantify how global commodity price movements impact individual households, several indices were developed pairing household crop production shares with global commodity price dynamics. These indices aim to capture both price risk and realized shocks, tailored to each household’s cropping pattern.

The method for linking commodity fluctuations to households was originally adopted from Adhvaryu et. al (Adhvaryu, Kala, and Nyshadham 2021) and used in previous papers. However its explanatory value was diminished due to high clustering of survey dates leading to little or no variability in the commodity prices among households.

In order to better measure variability in price exposure across households, the following indices were generated by linking household-level crop area-share of globally traded crops (coffee, wheat, sorghum etc.) to monthly global commodity price data (2010–2016). Price series wer cleaned, standardized, and processed to generate lagged 12-month volatility and rolling deviation indices.

Index Construction

Each household-level index is computed using a three-step process:

1. Global Price Volatility and Shock Measures

For each commodity c at month t:

Rolling 12-month price mean:
P̄_c,t^12m = (1/12) × Σ_k=t−12^t−1 P_c,k
Monthly log returns:
r_c,t = log(P_c,t / P_c,t−1)
Expected return and volatility:
ê_c,t^12m = mean(r_{c,t−12:t−1})
σ_c,t^12m = sd(r_c,t−6:t)
Price shock (surprise component):
shock_c,t = r_c,t − ê_c,t^12m

2. Household Crop Shares

For household i, crop c, and wave w:

s_ic = area_ic / Σ_k area_ik

3. Household-Level Indices

Each household’s index is an area-weighted average of commodity-specific values:

Volatility-Based Exposure Index:
exposure_volatility_i = Σ_c s_ic × σ_c,t^12m
Price Shock Index (raw and absolute):
hf_shock_raw_i = Σ_c s_ic × shock_c,t
hf_shock_abs_i = Σ_c s_ic × |shock_c,t|
Positive and Negative Shock Components:
hf_shock_neg_i = min(0, hf_shock_raw_i)
hf_shock_pos_i = max(0, hf_shock_raw_i)

Interpretation

The volatility index reflects ex-ante risk based on historical price variability.
The shock indices capture realized unexpected price changes.
The positive and negative split allow for asymmetric modeling of shocks.

Table 1: Summary statistics for household-level exposure to negative commodity price shocks.

Wave	Mean	SD	Min
Wave 1	−2.03	1.41	−5.65
Wave 2	−2.71	1.64	−5.68
Wave 3	−2.44	1.32	−4.03
All	−2.40	1.49	−5.68

Figure 3: Relationship between IHS-transformed negative price shock index and key household outcomes (combined waves).

Forest Cover Proxy: Binary Indicator from Cultivated Land Use

To capture the intercropping dimension of agroforestry practices, I originally explored a continuous proxy based on household intercropping with trees (share_tree_mix). However, this variable was too sparse due to the rarity of mixed-use plots containing trees in the data and instead a measure adopted from Morrow et al. Morrow et al. (2024) involving theon-farm presence of forested land was adopted. Although not a direct proxy, forested fields on a farmers land may indicate usage of agroforestry techniques including windbreaks, riparian buffers, forest farming, intercropping, or even silvopasture:

forest_any: a binary indicator equal to 1 if a household classifies one of their fields as forest (pp_s3q03 == 5), and 0 otherwise.

This proxy may also reflect longer-term land use strategies such as orchards (partially distinguished in the survey), woodlots, or natural forest cultivation, and is used to assess whether such households exhibit greater resilience to commodity price shocks.

Forest Cover Prevalence Across Waves

Forest cover, as measured by the binary forest_any, is relatively uncommon, especially in later survey waves:

26.7% of households in Wave 1 cultivated at least one forest plot
15.1% in Wave 2
Only 9.0% in Wave 3
Overall, 16.6% of households ever reported forest cultivation

This trend may reflect declining forest investment, changing land use priorities, or evolving land classification practices over time. Spatial variations may need to be examined between regions. However some analyses point to the inverse trend in certain regions Morrow et al. (2024).

Table 2: Summary of household-level forest cover by survey wave and overall.

Wave	Households	With Forest	% With Forest	Mean Forest Share	Max Forest Share
Wave 1	2,783.00	743.00	26.70	0.03	1.00
Wave 2	3,086.00	466.00	15.10	0.03	0.92
Wave 3	3,050.00	273.00	8.95	0.01	0.98
All	8,919.00	1,482.00	16.62	0.02	1.00

Figure 4: Proportion of households with forest cover (binary indicator), faceted by wave.

Control Variables: Weather

Climatic conditions can both drive global commodity prices and directly affect households’ production and consumption. To isolate the effect of price volatilty exposure weather controls are included (future studies may remove these in favor of controlling for enumeration area as in Morrow et al. (2024):

Annual Precipitation (precipitation_annual_mm_hh): Total rainfall (mm) in the 12 months prior to survey.
Mean Annual Temperature (mean_temp_annual_cx10_hh / 10): Average temperature (°C) over the same period.

Table 3: Summary statistics for household-linked weather variables.

Wave	Mean Annual Precip (mm)	SD Precip	Min Precip	Max Precip	Mean Annual Temp (°C)	SD Temp	Min Temp	Max Temp
Wave 1	1,132.65	387.90	144.00	2,031.00	19.24	3.23	10.60	29.30
Wave 2	1,138.31	383.94	144.00	2,018.00	19.15	3.25	10.20	28.90
Wave 3	1,135.54	383.31	144.00	2,018.00	19.09	3.22	10.90	29.40
All	1,135.60	384.93	144.00	2,031.00	19.16	3.23	10.20	29.40

Exploratory Analysis of Final Model Variables

Before presenting the panel model results, a final pairwise relationship plot among a final set of predictors and outcomes was generated. The IHS-transformed price shock index, three log-transformed consumption outcomes, binary indicator for agroforestry practices and weather controls were included. This matrix provides a concise view of potential correlations and variable distributions, helping establish the plausibility of each covariate in the model specification.

Figure 7: Pairwise relationships among variables included in the final model: IHS-transformed price shock index, log consumption outcomes, agroforestry proxy, and weather controls.

Panel Data Diagnostic Tests and Model Selection

This section summarizes the diagnostic tests used to validate panel model assumptions and justify model selection.

Diagnostic Tests Conducted

F-test (FE vs Pooled OLS)
Tests for individual-specific effects.
→ Result: Strongly rejects Pooled OLS; Fixed Effects preferred.
LM test (RE vs Pooled OLS)
Tests for random effects presence.
→ Result: Strongly rejects Pooled OLS; Random Effects supported.
Hausman test (FE vs RE)
Tests whether regressors are correlated with entity effects.
→ Result: Rejects Random Effects; Fixed Effects preferred.
Serial Correlation (Breusch-Godfrey)
Tests for autocorrelation in idiosyncratic errors.
→ Result: Serial correlation detected (p < 0.001).
Cross-sectional Dependence (Pesaran CD)
Tests for correlation across panel units.
→ Result: Dependence present (p < 0.001).
Heteroskedasticity (Breusch-Pagan)
Tests for unequal error variances.
→ Result: Could not be performed due to wave-level constraints.
Residual Normality (Jarque–Bera)
Tests for normal distribution of residuals.
→ Result: Non-normal residuals detected (p < 0.001).

Model Selection and Implications

Model Choice: The F-test, LM test, and Hausman test collectively support the use of a Fixed Effects model over Pooled OLS or Random Effects.
Diagnostic Issues Identified:
- Serial correlation and cross-sectional dependence are both present.
- Residuals are not normally distributed.
- Heteroskedasticity could not be formally tested.

Remedial Approach: Driscoll–Kraay Standard Errors

Given these concerns, Driscoll–Kraay standard errors were applied. These are:

Robust to serial correlation and cross-sectional dependence
Valid under heteroskedasticity
Well-suited for panels with short time dimensions (3 waves)

This correction ensures more reliable inference without altering coefficient estimates and non-normal residuals are less problematic under this framework.

Table 4: Diagnostic test decisions for multiple outcome models in the 3-wave panel.

Test	Total Consumption	Food Consumption	Non-Food Consumption
F-test (FE vs Pooled OLS)	Reject Pooled OLS (FE preferred)	Reject Pooled OLS (FE preferred)	Reject Pooled OLS (FE preferred)
LM test (RE vs Pooled OLS)	Reject Pooled OLS (RE preferred)	Reject Pooled OLS (RE preferred)	Reject Pooled OLS (RE preferred)
Hausman test (FE vs RE)	Reject RE (FE preferred)	Reject RE (FE preferred)	Reject RE (FE preferred)
Serial Correlation (pbgtest, lag=1)	Serial correlation detected	Serial correlation detected	Serial correlation detected
Cross-sectional Dependence (Pesaran CD)	Cross-sectional dependence detected	Cross-sectional dependence detected	Cross-sectional dependence detected
Heteroskedasticity (Breusch-Pagan)	N/A	N/A	N/A
Residual Normality (Jarque–Bera)	Residuals non-normal	Residuals non-normal	Residuals non-normal

Interpretation: Forest Cover × Negative Price Shocks

Driscoll–Kraay FE estimates show that:

Negative price shocks significantly reduce total, food, and non-food consumption.
Forest cover (binary) is associated with lower food and total consumption overall.
The interaction term (Shock × Forest) is positive and statistically significant for all outcomes, strongest for food and total consumption.

This suggests that households with forest access are more resilient to price shocks, particularly for food security. However, forest land may reduce baseline consumption, reflecting trade-offs between resilience and productivity.

Table 5: Driscoll–Kraay standard error estimates from fixed effects models with negative price shocks and forest cover interaction, across three consumption outcomes.

	Consumption Outcomes
	Total Annual	Annual Food	Annual Non‑Food
Neg. Price Shock (IHS, z)	-0.028*	-0.029+	-0.023***
	(0.014)	(0.015)	(0.004)
Forest Cover (Binary)	-0.034***	-0.043***	0.022*
	(0.006)	(0.003)	(0.009)
Annual Precipitation (z)	-0.192	-0.216	-0.509**
	(0.142)	(0.157)	(0.178)
Mean Temperature (z)	-0.191***	-0.227***	-0.180***
	(0.046)	(0.048)	(0.042)
Shock × Forest	0.038***	0.033**	0.019+
	(0.010)	(0.010)	(0.011)
Num.Obs.	8551	8551	8551
R2	0.003	0.003	0.001
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001

Figure 8: Estimated coefficients and 95% confidence intervals from fixed effects models with Driscoll–Kraay standard errors. Interaction effects shown for negative price shocks and forest cover.

Marginal Effects of Forest Cover on Shock Exposure

To better interpret the interaction between negative price shocks and forest cover, the marginal effects of exposure across households with and without forest access is visualized. This plot is derived from a two-way fixed effects model with Driscoll–Kraay standard errors.

The estimated lines show predicted log consumption across varying levels of negative price shocks (IHS-transformed), disaggregated by forest cover status. Confidence intervals are included.

Note: This plot is based on the total consumption model.

Figure 9: Marginal effects of negative price shocks on predicted consumption, moderated by binary forest cover access. Lines represent households with and without forest cover.

Model Diagnostics Summary

To assess the validity of the fixed effects models with Driscoll–Kraay standard errors, six diagnostic plots for each of the three consumption outcomes were generated (total, food, and non-food). These visuals evaluate key model assumptions such as linearity, homoskedasticity, and residual normality. Faceting by model allows comparison across outcomes.

Diagnostic Plots and Interpretation

Residuals vs Fitted (Figure 10)
This plot checks for non-linearity or model misspecification by visualizing whether residuals scatter randomly around zero. In this case, residuals are mostly centered but show slight funneling in the non-food consumption model, indicating mild heteroskedasticity or non-linearity in that specification.
Normal Q-Q Plot (Figure 11)
The Q-Q plot assesses whether residuals are normally distributed. In this case, all models show deviation from the reference line at both tails, suggesting non-normal residuals—particularly for total and food consumption. While this violates the classical OLS assumption, our use of robust standard errors mitigates inference concerns.
Scale-Location Plot (Figure 12)
This plot tests for homoskedasticity by plotting the square root of standardized residuals against fitted values. In this case, the total consumption model shows increasing spread as fitted values rise, indicating some non-constant variance. The food and non-food models are more stable but still show mild curvature.
Residuals by Wave (Figure 13)
This plot detects wave-level differences in residual behavior. In this case, residual medians remain roughly centered for all models across waves. Slight compression of spread in Wave 2 may suggest stronger model fit or less variability in that period.
Actual vs Fitted Values (Figure 14)
This plot evaluates predictive accuracy—ideally, points should fall along the 45-degree line. In this case, the relationship is generally linear, but vertical spread (especially in food consumption) reflects moderate prediction error and unexplained variation.
Histogram of Residuals (Figure 15)
This plot provides a basic check for residual symmetry and skew. In this case, all models show roughly centered residuals, but with some skewness and peaking. Total and food consumption residuals appear leptokurtic, while non-food is slightly more dispersed.

Figure 10: Residuals versus fitted values for each model. A loess smoother is shown to detect non-linearity. Dashed lines indicate zero residual.

Figure 11: Normal Q-Q plot of residuals for each model. Deviation from the dashed line indicates non-normality.

Figure 12: Scale-location plot (√|standardized residuals| vs fitted values) for each model. A loess smoother is included to assess heteroskedasticity.

Figure 13: Distribution of residuals by survey wave for each model. Boxplots reveal potential temporal patterns in residual behavior. A dashed horizontal line marks zero residual.

Figure 14: Actual versus fitted values for each model. The 45-degree reference line indicates perfect prediction.

Figure 15: Histogram of residuals for each model. Dashed vertical line indicates zero residual. Distribution shape may indicate normality or skew.

Summary

Although the models exhibit some non-normal and mildly heteroskedastic residuals, taken together, these plots suggest there is acceptable overall fit/residual centering and no major issues with autocorrelation or structural misspecification. They support the reasonablity of the model specification and reinforce the choice to use Driscoll–Kraay robust standard errors.

References

Adhvaryu, Achyuta, Namrata Kala, and Anant Nyshadham. 2021. “Booms, Busts, and Household Enterprise: Evidence from Coffee Farmers in Tanzania.” The World Bank Economic Review 35 (3): 586–603. https://doi.org/10.1093/wber/lhz044.

Morrow, Nathan, Nancy B. Mock, Andrea Gatto, Andrea Colantoni, and Luca Salvati. 2024. “Farm Forests, Seasonal Hunger, and Biomass Poverty: Evidence of Induced Intensification from Panel Data in the Ethiopian Highlands.” Ambio 53 (3): 435–51. https://doi.org/10.1007/s13280-023-01908-4.

--- title: "Understanding Household Resilience to Commodity Price Shocks" subtitle: "The Role of Agricultural Practices" author: "Michael Bean" format: html: theme: cosmo toc: true toc-depth: 3 code-fold: true code-tools: true code-link: true fig-width: 10 fig-height: 8 embed-resources: true self-contained: true execute: echo: false warning: false editor_options: chunk_output_type: console cache: false bibliography: references.bib --- ```{r } #| label: gt-setup #| echo: false options(gt.html.tag_check = FALSE) ``` ```{r} #| label: setup # Load Packages using pacman. library(pacman) p_load(here , tidyverse , GGally , glue , tidymodels , sf , rnaturalearth , rnaturalearthdata , gt , kableExtra , htmltools , plm , modelsummary , lmtest , sandwich , car , stargazer , pglm , glmmTMB , broom.mixed , tseries , future , patchwork , furrr , FactoMineR , factoextra , ggeffects , splines , dotwhisker ) ``` ```{r} #| label: sources #| results: hide # Specify directories data_dir <- here("data") output_dir <- here("data_out") # Run individual scripts for ETL # source(here("scripts", "01-eth-import-data.R")) #Loads raw data # source(here("scripts", "01x-eth-clean_data_list_for_panel.R")) # fixes identifier issues # source(here("scripts", "03-eth-categorize-variables.R")) # Groups variables # source(here("scripts", "05-eth-construct-panel.R")) # Calls functions to clean data and construct panel ## 05-eth-construct-panel.R sources all cleaning functions from 00-eth-functions.R ## 02-eth-clean-data.R included but not used shows my original data cleaning code-pipelines (wave 1 only) before I turned them into functions with the help of chatGPT so as replicate across waves and construct the final panel # Load Saved Data instead of Sourcing Scripts df_panel <- readRDS(here(output_dir, "df_panel.rds")) # Load list of variables categorized into groups for potential analysis ls_var_groups <- readRDS(here(output_dir, "ls_var_groups.rds")) ``` ```{r} #| label: additional-transformation # Additional Transformations (mainly scaling) df_panel <- df_panel |> filter(total_area > 0) |> mutate( forest_any = as.integer(share_forest_area > 0) , consumption_annual_total_log = log(consumption_annual_total + 1) , consumption_annual_food_log = log(consumption_annual_food + 1) , consumption_annual_nonfood_log = log(consumption_annual_nonfood + 1) , rcsi_log = log(rcsi + 1) , hhs_lite_log = log(hhs_lite + 1) , mean_temp_annual_c = mean_temp_annual_cx10_hh / 10 , crop_sales_hh_log = log(crop_sales_hh + 1) , ent_sales_hh_log = log(ent_sales_hh + 1) , share_tree_mix_log = log(share_tree_mix + 1) , share_forest_area_log = log(share_forest_area + 1) , hf_shock_neg_ihs = asinh(hf_shock_neg) , hf_shock_pos_ihs = asinh(hf_shock_pos) , forest_asinh = asinh(share_forest_area) ) |> group_by(wave) %>% mutate( vol_z_wave = scale(exposure_volatility)[, 1] , forest_z_wave = scale(share_forest_area_log)[, 1] , mix_z_wave = scale(share_tree_mix)[, 1] , precip_z_wave = scale(precipitation_annual_mm_hh)[, 1] , temp_z_wave = scale(mean_temp_annual_c)[, 1] , shock_neg_z_wave_ihs = scale(hf_shock_neg_ihs)[, 1] , shock_pos_z_wave_ihs = scale(hf_shock_pos_ihs)[, 1] , forest_z_wave_asinh = scale(forest_asinh)[, 1] ) |> ungroup() ``` ```{r} #| label: panel-prep # Set panel data identifiers panel_data <- pdata.frame(df_panel, index = c("id_final", "wave")) ``` ## Introduction This project explores how agricultural households in developing economies respond to global commodity price shocks, with a specific focus on how different farming practices may help households maintain well-being during volatile market periods. It builds on earlier work examining how smallholder coffee farmers in Sub-Saharan Africa cope with changing prices and how production decisions influence consumption. The goal is to better understand the intersection of agricultural economics, rural development, and risk management, as well as provide insights for policy formation surrounding support for farming households in anticipation of, and during, seconomic downturns. The research was developed over three previous studies. ### Research Development **Agroforestry Practices and Global Commodity Shocks (Proposal)** The first paper presented a research proposal to investigate whether Tanzanian households adopting agroforestry practices demonstrate greater resilience to coffee price volatility. It outlined a fixed-effects methodology using LSMS-ISA data, where agroforestry adoption would be proxied by the presence of tree crops and intercropping. The proposed model would regress household consumption on agroforestry adoption, lagged global coffee prices, and their interaction, with additional analyses of coping mechanisms including entrepreneurship, employment, savings, and borrowing. The method follows that of Adhvaryu et al. in Booms, Busts, and Household Enterprise: Evidence from Coffee Farmers in Tanzania [@adhvaryu2021booms]. Although not presenting empirical results, the proposal identified potential measurement challenges in agroforestry proxies and the possibility for temporal misalignment between price data and household behavior. **Agricultural Practices and Price Shocks (Ethiopia, Empirical Analysis)** The second paper implemented an econometric analysis using panel data from the Ethiopia Socioeconomic Survey (in conjunction with the World Bank LSMS-ISA), focusing specifically on coffee-producing households. It employed a two-way fixed effects model with Driscoll-Kraay standard errors to estimate how various agricultural practices including crop rotation, fertilizer use, extension program participation, advisory services, oxen ownership (mechanization proxy), and credit access moderated the relationship between lagged global coffee prices and household consumption. Results indicated that extension program participation and crop rotation enhanced household ability to translate price increases into consumption gains. However, the model did not address any endogeneity surrounding practice adoption or explore simultaneous behavioral responses to price changes. **Integrated Econometric and Machine Learning Approaches** Building on the previous econometric framework, the third paper introduced a machine learning ensemble combining Random Forest, Gradient Boosting Machine, and XGBoost models through elastic net stacking with lasso regularization. This approach sought to capture complex non-linear relationships and improve predictive accuracy for consumption outcomes. The results confirmed the importance of extension services and certain agricultural practices (oxen, crop rotation) and suggested more sophisticated modeling approaches—potentially including simultaneous equations or structural models—may better represent the complex decision-making processes of agricultural households facing price volatility. ### Current Research Extension Taken together, these studies raised important questions about how to measure resilience, account for household-level differences in exposure, and model decision-making. This report attempts to extend that work by assembling a richer set of household-level variables from the panel data and applying appropriate econometric techniques to investigate how specific practices may protect household consumption in the face of global price shocks. Expanding the variable may help address some of the limitations identified in previous analyses: 1. **Endogeneity in practice adoption**: The selection of agricultural practices isn't random and might correlate with unobserved household characteristics or regional differences. More variables allow for better control strategies or potential instrumental variable approaches. 2. **Joint decision processes**: Households likely make simultaneous decisions about production practices, consumption, and coping strategies. This dataset supports potential exploration of multivariate outcome models, such as Seemingly Unrelated Regression (SUR) or structural equation approaches. 3. **Heterogeneous treatment effects**: The impact of price shocks and the effectiveness of mitigating practices most likely vary by household characteristics. The expanded panel enables more granular subgroup analysis and interaction modeling. ## Outcome Variables After analyzing potential outcomes for missingness and removing variables that may be structurally correlated with the exposure index, a set of primary outcome variables was selected for further analysis. The selected outcomes reflect various dimensions of household welfare: **Consumption Outcomes**: - `consumption_annual_total`: Total annual household consumption expenditure, combining food and non-food components. - `consumption_annual_food`: Annual household expenditure on food items. - `consumption_annual_nonfood`: Annual household expenditure on non-food goods and services. **Food Security Outcomes**: - `rcsi`: Reduced Coping Strategies Index; measures household reliance on coping behaviors during food shortages. - `hhs_lite`: Household Hunger Scale "Lite" version; summarizes experiences of moderate and severe food insecurity. - `worry_dummy`: Indicator of whether any household member reported worrying about not having enough food. **Dietary Diversity Outcome**: - `hdds_7d`: Household Dietary Diversity Score over the past 7 days; higher values reflect more diverse diets. All outcome variables were checked for data quality and missingness before inclusion. Correlation analysis among the outcomes further informed selection and modeling strategies. ### Transformation of Outcome Variables Given the observed strong right-skewness in household consumption expenditure variables, all consumption measures (`consumption_annual_total`, `consumption_annual_food`, `consumption_annual_nonfood`) were log-transformed using the function `log(x + 1)` before further analysis. This transformation improves normality, reduces the influence of outliers, and facilitates interpretation of regression coefficients as approximate percentage changes. Food security indicators (`rcsi`, `hhs_lite`, `worry_dummy`) and dietary diversity (`hdds_7d`) were modeled in their original forms, reflecting their nature as index, binary, or count measures. The modeling approach can later be selected based on the data structure of each outcome. ```{r} #| label: fig-ggpairs-outcomes-combined #| cache: true #| fig-width: 10 #| fig-height: 8 #| fig-cap: > #| Pairwise Relationships Among Household Welfare Outcomes (All Waves Combined). Consumption variables (Total, Food, Non-Food) and food security indicators (RCSI, HHS Lite) are log-transformed to reduce skewness.The plot highlights associations among consumption, food security, and dietary diversity. Individual waves followed similar patterns. df_panel |> select( consumption_annual_total_log , consumption_annual_food_log , consumption_annual_nonfood_log , rcsi_log , hhs_lite_log , worry_dummy , hdds_7d ) |> mutate(across(everything(), as.numeric)) |> rename( "Total\nConsumption\n(log)" = consumption_annual_total_log , "Food\nConsumption\n(log)" = consumption_annual_food_log , "Non-Food\nConsumption\n(log)" = consumption_annual_nonfood_log , "Reduced\nCoping\nStrategies\nIndex (RCSI)" = rcsi_log , "Household\nHunger Scale\n(HHS Lite)" = hhs_lite_log , "Worried\nAbout\nFood" = worry_dummy , "Household\nDietary\nDiversity\nScore (HDDS)" = hdds_7d ) |> ggpairs( upper = list(continuous = wrap("cor" , size = 3) ) , lower = list(continuous = wrap("points" , alpha = 0.5 , size = 0.5 , linewidth = 0.5) ) , diag = list(continuous = wrap("densityDiag")) ) + theme_bw() + theme( panel.grid.minor = element_blank() , panel.grid.major = element_blank() , axis.ticks.length = unit(0.1, "cm") , axis.ticks = element_line(color = "gray40", linewidth = 0.5) , plot.margin = margin(10, 10, 10, 10) , axis.line.x = element_line(color = "black", linewidth = 0.4) , axis.line.y = element_line(color = "black", linewidth = 0.4) , strip.background = element_blank() , axis.text.x = element_text(angle = 45, hjust = 1, size = 8) , axis.text.y = element_text(size = 8) , strip.text.x.top = element_text(size = 8) , strip.text.y.left = element_text(size = 8) ) ``` ## Volatility-Based Exposure and Shock Indices To quantify how global commodity price movements impact individual households, several indices were developed pairing household crop production shares with global commodity price dynamics. These indices aim to capture both price risk and realized shocks, tailored to each household's cropping pattern. The method for linking commodity fluctuations to households was originally adopted from Adhvaryu et. al [@adhvaryu2021booms] and used in previous papers. However its explanatory value was diminished due to high clustering of survey dates leading to little or no variability in the commodity prices among households. In order to better measure variability in price exposure across households, the following indices were generated by linking household-level crop area-share of globally traded crops (coffee, wheat, sorghum etc.) to monthly global commodity price data (2010–2016). Price series wer cleaned, standardized, and processed to generate lagged 12-month volatility and rolling deviation indices. --- ### Index Construction Each household-level index is computed using a three-step process: ##### 1. Global Price Volatility and Shock Measures For each commodity *c* at month *t*: - Rolling 12-month price mean: *P̄c,t12m = (1/12) × Σk=t−12t−1 Pc,k* - Monthly log returns: *rc,t = log(Pc,t / Pc,t−1)* - Expected return and volatility: *êc,t12m = mean(rc,t−12:t−1)* *σc,t12m = sd(rc,t−6:t)* - Price shock (surprise component): *shockc,t = rc,t − êc,t12m* ##### 2. Household Crop Shares For household *i*, crop *c*, and wave *w*: *sic = areaic / Σk areaik* ##### 3. Household-Level Indices Each household’s index is an area-weighted average of commodity-specific values: - Volatility-Based Exposure Index: *exposure_volatilityi = Σc sic × σc,t12m* - Price Shock Index (raw and absolute): *hf_shock_rawi = Σc sic × shockc,t* *hf_shock_absi = Σc sic × |shockc,t|* - Positive and Negative Shock Components: *hf_shock_negi = min(0, hf_shock_rawi)* *hf_shock_posi = max(0, hf_shock_rawi)* --- ### Interpretation - The volatility index reflects *ex-ante* risk based on historical price variability. - The shock indices capture *realized* unexpected price changes. - The positive and negative split allow for asymmetric modeling of shocks. ```{r} #| label: tbl-exposure-summary-ihs #| tbl-cap: > #| Summary statistics for household-level exposure to negative commodity price shocks. df_panel |> group_by(wave) |> summarize( Mean = mean(hf_shock_neg_ihs, na.rm = TRUE), SD = sd(hf_shock_neg_ihs, na.rm = TRUE), Min = min(hf_shock_neg_ihs, na.rm = TRUE), Max = max(hf_shock_neg_ihs, na.rm = TRUE) ) |> bind_rows( df_panel |> summarize( Mean = mean(hf_shock_neg_ihs, na.rm = TRUE), SD = sd(hf_shock_neg_ihs, na.rm = TRUE), Min = min(hf_shock_neg_ihs, na.rm = TRUE), Max = max(hf_shock_neg_ihs, na.rm = TRUE) ) |> mutate(wave = "All Waves") ) |> relocate(wave, .before = everything()) |> mutate(wave = case_when( wave == "w1" ~ "Wave 1", wave == "w2" ~ "Wave 2", wave == "w3" ~ "Wave 3", wave == "All Waves" ~ "All", TRUE ~ as.character(wave) )) |> gt() |> fmt_number( columns = where(is.numeric), decimals = 2 ) |> cols_label( wave = "Wave", Mean = "Mean", SD = "SD", Min = "Min", Max = "Max" ) |> tab_options( table.width = pct(100), data_row.padding = px(4), column_labels.font.weight = "bold", heading.title.font.size = px(14), heading.subtitle.font.size = px(12), row.striping.include_table_body = TRUE, table.font.size = px(12), table.border.top.style = "none", table.border.bottom.style = "none" ) |> tab_style( style = list( cell_borders(sides = c("top", "bottom"), color = "#a6c8e0", weight = px(1)), cell_fill(color = "#e6f0fa") ), locations = cells_column_labels() ) |> opt_table_lines(extent = "none") |> opt_row_striping() |> gt::tab_options(quarto.disable_processing = TRUE) ``` ```{r} #| label: fig-neg-shock-hist-ihs-facet #| fig-width: 10 #| fig-height: 6 #| fig-cap: > #| Distribution of IHS-transformed negative price shock index across survey waves and in the full sample. df_panel |> mutate(wave_facet = case_when( wave == "w1" ~ "Wave 1", wave == "w2" ~ "Wave 2", wave == "w3" ~ "Wave 3", is.na(wave) ~ "Unknown", TRUE ~ as.character(wave) ))|> bind_rows( df_panel |> mutate(wave_facet = "All Waves") ) |> ggplot(aes(x = shock_neg_z_wave_ihs)) + geom_histogram(bins = 30, fill = "lightblue", color = "black", linewidth = 0.3) + facet_wrap(~wave_facet, scales = "free_y") + labs( x = "Negative Price Shock (IHS-transformed)", y = "Household Count" ) + theme_minimal(base_size = 12) + theme( plot.title = element_blank(), strip.background = element_blank(), strip.text = element_text(face = "bold", size = 11), axis.title = element_text(face = "bold"), panel.grid.major = element_line(color = "gray90", linewidth = 0.25), panel.grid.minor = element_blank() ) ``` ```{r} #| label: fig-exposure-raw-vs-outcomes #| fig-width: 10 #| fig-height: 6 #| fig-cap: > #| Relationship between IHS-transformed negative price shock index and key household outcomes (combined waves). df_panel |> pivot_longer( cols = c(consumption_annual_total_log, rcsi, hhs_lite, hdds_7d), names_to = "outcome", values_to = "value" ) |> mutate( outcome = dplyr::recode(as.character(outcome), consumption_annual_total_log = "Log Consumption", rcsi = "Reduced Coping Score (rCSI)", hhs_lite = "Household Hunger Score", hdds_7d = "Dietary Diversity Score" ) ) |> ggplot(aes(x = shock_neg_z_wave_ihs, y = value)) + geom_point(alpha = 0.2, color = "black") + geom_smooth(method = "lm", color = "blue", linewidth = 1, se = FALSE) + facet_wrap(~ outcome, scales = "free_y") + labs( x = "Negative Price Shock (IHS-transformed)", y = NULL ) + theme_minimal(base_size = 12) + theme( plot.title = element_blank(), strip.text = element_text(face = "bold"), panel.grid.minor = element_blank(), panel.grid.major = element_line(color = "gray90"), axis.title.x = element_text(face = "bold", size = 11) ) ``` ## Forest Cover Proxy: Binary Indicator from Cultivated Land Use To capture the intercropping dimension of agroforestry practices, I originally explored a continuous proxy based on household intercropping with trees (`share_tree_mix`). However, this variable was too sparse due to the rarity of mixed-use plots containing trees in the data and instead a measure adopted from Morrow et al. @morrow2024farm involving theon-farm presence of forested land was adopted. Although not a direct proxy, forested fields on a farmers land may indicate usage of agroforestry techniques including windbreaks, riparian buffers, forest farming, intercropping, or even silvopasture: - **`forest_any`**: a binary indicator equal to 1 if a household classifies one of their fields as forest (`pp_s3q03 == 5`), and 0 otherwise. This proxy may also reflect longer-term land use strategies such as orchards (partially distinguished in the survey), woodlots, or natural forest cultivation, and is used to assess whether such households exhibit greater resilience to commodity price shocks. ### Forest Cover Prevalence Across Waves Forest cover, as measured by the binary `forest_any`, is relatively uncommon, especially in later survey waves: - **26.7%** of households in **Wave 1** cultivated at least one forest plot - **15.1%** in **Wave 2** - Only **9.0%** in **Wave 3** - **Overall**, **16.6%** of households ever reported forest cultivation This trend may reflect declining forest investment, changing land use priorities, or evolving land classification practices over time. Spatial variations may need to be examined between regions. However some analyses point to the inverse trend in certain regions @morrow2024farm. ```{r} #| label: tbl-forest-summary #| tbl-cap: > #| Summary of household-level forest cover by survey wave and overall. df_panel %>% group_by(wave) %>% summarize( n = n(), n_forest = sum(share_forest_area > 0, na.rm = TRUE), pct_forest = mean(share_forest_area > 0, na.rm = TRUE) * 100, mean_share = mean(share_forest_area, na.rm = TRUE), max_share = max(share_forest_area, na.rm = TRUE) ) %>% bind_rows( df_panel %>% summarize( n = n(), n_forest = sum(share_forest_area > 0, na.rm = TRUE), pct_forest = mean(share_forest_area > 0, na.rm = TRUE) * 100, mean_share = mean(share_forest_area, na.rm = TRUE), max_share = max(share_forest_area, na.rm = TRUE) ) %>% mutate(wave = "All Waves") ) %>% mutate(wave = case_when( wave == "w1" ~ "Wave 1", wave == "w2" ~ "Wave 2", wave == "w3" ~ "Wave 3", wave == "All Waves" ~ "All", TRUE ~ as.character(wave) )) %>% relocate(wave) %>% gt() |> cols_label( wave = "Wave", n = "Households", n_forest = "With Forest", pct_forest = "% With Forest", mean_share = "Mean Forest Share", max_share = "Max Forest Share" ) |> fmt_number( columns = where(is.numeric), decimals = 2 ) |> tab_options( table.width = pct(100), data_row.padding = px(4), column_labels.font.weight = "bold", heading.title.font.size = px(14), heading.subtitle.font.size = px(12), row.striping.include_table_body = TRUE, table.font.size = px(12), table.border.top.style = "none", table.border.bottom.style = "none" ) |> tab_style( style = list( cell_borders(sides = c("top", "bottom"), color = "#a6c8e0", weight = px(1)), cell_fill(color = "#e6f0fa") ), locations = cells_column_labels() ) |> opt_table_lines(extent = "none") |> opt_row_striping() |> gt::tab_options(quarto.disable_processing = TRUE) ``` ```{r} #| label: fig-forest-binary-bar-faceted #| fig-cap: > #| Proportion of households with forest cover (binary indicator), faceted by wave. df_panel %>% mutate(wave_facet = case_when( wave == "w1" ~ "Wave 1", wave == "w2" ~ "Wave 2", wave == "w3" ~ "Wave 3", is.na(wave) ~ "Unknown", TRUE ~ as.character(wave) )) %>% bind_rows( df_panel %>% mutate(wave_facet = "All Waves") ) %>% group_by(wave_facet, forest_any) %>% summarise(n = n(), .groups = "drop") %>% group_by(wave_facet) %>% mutate( pct = n / sum(n) * 100, label = ifelse(forest_any == 1, "Has Forest Cover", "No Forest Cover") ) %>% ggplot(aes(x = label, y = pct, fill = label)) + geom_col(width = 0.6, color = "gray30") + geom_hline(yintercept = 0, color = "gray30", linewidth = 0.4) + geom_text(aes(label = paste0(round(pct, 1), "%")), vjust = 2, size = 4) + facet_wrap(~ wave_facet) + scale_fill_manual(values = c( "Has Forest Cover" = "#d2e3f3", "No Forest Cover" = "#e0e0e0" )) + labs( x = "Forest Cover Status", y = "Percentage of Households" ) + theme_minimal(base_size = 12) + theme( legend.position = "none", strip.text = element_text(face = "bold"), panel.grid.minor = element_blank(), panel.grid.major = element_line(color = "gray90"), axis.text.y = element_blank() ) ``` ## Control Variables: Weather Climatic conditions can both drive global commodity prices and directly affect households’ production and consumption. To isolate the effect of price volatilty exposure weather controls are included (future studies may remove these in favor of controlling for enumeration area as in @morrow2024farm: - **Annual Precipitation** (`precipitation_annual_mm_hh`): Total rainfall (mm) in the 12 months prior to survey. - **Mean Annual Temperature** (`mean_temp_annual_cx10_hh / 10`): Average temperature (°C) over the same period. ```{r} #| label: tbl-weather-summary #| tbl-cap: > #| Summary statistics for household-linked weather variables. df_panel |> mutate(mean_temp_annual_c = mean_temp_annual_cx10_hh / 10) |> group_by(wave) |> summarize( mean_precip_annual = mean(precipitation_annual_mm_hh, na.rm = TRUE), sd_precip_annual = sd(precipitation_annual_mm_hh, na.rm = TRUE), min_precip_annual = min(precipitation_annual_mm_hh, na.rm = TRUE), max_precip_annual = max(precipitation_annual_mm_hh, na.rm = TRUE), mean_temp_annual = mean(mean_temp_annual_c, na.rm = TRUE), sd_temp_annual = sd(mean_temp_annual_c, na.rm = TRUE), min_temp_annual = min(mean_temp_annual_c, na.rm = TRUE), max_temp_annual = max(mean_temp_annual_c, na.rm = TRUE) ) |> bind_rows( df_panel |> mutate(mean_temp_annual_c = mean_temp_annual_cx10_hh / 10) |> summarize( mean_precip_annual = mean(precipitation_annual_mm_hh, na.rm = TRUE), sd_precip_annual = sd(precipitation_annual_mm_hh, na.rm = TRUE), min_precip_annual = min(precipitation_annual_mm_hh, na.rm = TRUE), max_precip_annual = max(precipitation_annual_mm_hh, na.rm = TRUE), mean_temp_annual = mean(mean_temp_annual_c, na.rm = TRUE), sd_temp_annual = sd(mean_temp_annual_c, na.rm = TRUE), min_temp_annual = min(mean_temp_annual_c, na.rm = TRUE), max_temp_annual = max(mean_temp_annual_c, na.rm = TRUE) ) |> mutate(wave = "All Waves") ) |> mutate(wave = case_when( wave == "w1" ~ "Wave 1", wave == "w2" ~ "Wave 2", wave == "w3" ~ "Wave 3", wave == "All Waves" ~ "All", TRUE ~ as.character(wave) )) |> relocate(wave) |> gt() |> fmt_number(columns = where(is.numeric), decimals = 2) |> cols_label( wave = "Wave", mean_precip_annual = "Mean Annual Precip (mm)", sd_precip_annual = "SD Precip", min_precip_annual = "Min Precip", max_precip_annual = "Max Precip", mean_temp_annual = "Mean Annual Temp (°C)", sd_temp_annual = "SD Temp", min_temp_annual = "Min Temp", max_temp_annual = "Max Temp" ) |> tab_options( table.width = pct(100), data_row.padding = px(4), column_labels.font.weight = "bold", heading.title.font.size = px(14), heading.subtitle.font.size = px(12), table.font.size = px(12), table.border.top.style = "none", table.border.bottom.style = "none", row.striping.include_table_body = TRUE, row.striping.background_color = "#f7f7f7" ) |> tab_style( style = list( cell_borders(sides = c("top", "bottom"), color = "#a6c8e0", weight = px(1)), cell_fill(color = "#e6f0fa") ), locations = cells_column_labels() ) |> opt_table_lines(extent = "none") |> opt_row_striping() |> gt::tab_options(quarto.disable_processing = TRUE) ``` ```{r} #| label: fig-precip-distribution-facet #| fig-cap: > #| Distribution of annual precipitation (mm) across waves and in the full sample. df_panel |> mutate(wave_facet = case_when( wave == "w1" ~ "Wave 1", wave == "w2" ~ "Wave 2", wave == "w3" ~ "Wave 3", is.na(wave) ~ "Unknown", TRUE ~ as.character(wave) )) |> bind_rows( df_panel |> mutate(wave_facet = "All Waves") ) |> ggplot(aes(x = precipitation_annual_mm_hh)) + geom_histogram(bins = 30, color = "black", linewidth =0.3, fill = "#dfeee0") + facet_wrap(~wave_facet, scales = "free_y") + labs( x = "Annual Precipitation (mm)", y = "Household Count" ) + theme_minimal(base_size = 12) + theme( plot.title = element_blank(), strip.text = element_text(face = "bold"), panel.grid.major = element_line(color = "gray90"), panel.grid.minor = element_blank(), axis.title = element_text(face = "bold") ) ``` ```{r} #| label: fig-temp-distribution-facet #| fig-cap: > #| Distribution of mean annual temperature (°C) across waves and in the full sample. df_panel |> mutate(wave_facet = case_when( wave == "w1" ~ "Wave 1", wave == "w2" ~ "Wave 2", wave == "w3" ~ "Wave 3", is.na(wave) ~ "Unknown", TRUE ~ as.character(wave) )) |> bind_rows( df_panel |> mutate(wave_facet = "All Waves") ) |> ggplot(aes(x = mean_temp_annual_cx10_hh / 10)) + geom_histogram(bins = 30, color = "black", linewidth = 0.3, fill = "#f8dedd") + facet_wrap(~wave_facet, scales = "free_y") + labs( x = "Mean Annual Temperature (°C)", y = "Household Count" ) + theme_minimal(base_size = 12) + theme( plot.title = element_blank(), strip.text = element_text(face = "bold"), panel.grid.major = element_line(color = "gray90"), panel.grid.minor = element_blank(), axis.title = element_text(face = "bold") ) ``` ## Exploratory Analysis of Final Model Variables Before presenting the panel model results, a final pairwise relationship plot among a final set of predictors and outcomes was generated. The IHS-transformed price shock index, three log-transformed consumption outcomes, binary indicator for agroforestry practices and weather controls were included. This matrix provides a concise view of potential correlations and variable distributions, helping establish the plausibility of each covariate in the model specification. ```{r} #| label: fig-final-ggpairs-core-vars #| fig-cap: > #| Pairwise relationships among variables included in the final model: IHS-transformed price shock index, log consumption outcomes, agroforestry proxy, and weather controls. df_panel |> transmute( `Neg. Price Shock\n(IHS, z)` = hf_shock_neg_ihs, `Total\nConsumption\n(log)` = consumption_annual_total_log, `Food\nConsumption\n(log)` = consumption_annual_food_log, `Non-Food\nConsumption\n(log)` = consumption_annual_nonfood_log, `Agroforestry\n(Binary)` = forest_any, `Mean Annual\nTemperature (°C)` = mean_temp_annual_cx10_hh / 10, `Annual\nPrecipitation (mm)` = precipitation_annual_mm_hh ) |> ggpairs( upper = list(continuous = wrap("cor", size = 3)), lower = list(continuous = wrap("points", alpha = 0.5, size = 0.5, linewidth = 0.5)), diag = list(continuous = wrap("densityDiag")) ) + theme_bw() + theme( panel.grid.minor = element_blank(), panel.grid.major = element_blank(), axis.ticks.length = unit(0.1, "cm"), axis.ticks = element_line(color = "gray40", linewidth = 0.5), plot.margin = margin(10, 10, 10, 10), axis.line.x = element_line(color = "black", linewidth = 0.4), axis.line.y = element_line(color = "black", linewidth = 0.4), strip.background = element_blank(), axis.text.x = element_text(angle = 45, hjust = 1, size = 8), axis.text.y = element_text(size = 8), strip.text.x.top = element_text(size = 8), strip.text.y.left = element_text(size = 8) ) ``` ## Panel Data Diagnostic Tests and Model Selection This section summarizes the diagnostic tests used to validate panel model assumptions and justify model selection. ### Diagnostic Tests Conducted - **F-test (FE vs Pooled OLS)** Tests for individual-specific effects. → *Result*: Strongly rejects Pooled OLS; Fixed Effects preferred. - **LM test (RE vs Pooled OLS)** Tests for random effects presence. → *Result*: Strongly rejects Pooled OLS; Random Effects supported. - **Hausman test (FE vs RE)** Tests whether regressors are correlated with entity effects. → *Result*: Rejects Random Effects; Fixed Effects preferred. - **Serial Correlation (Breusch-Godfrey)** Tests for autocorrelation in idiosyncratic errors. → *Result*: Serial correlation detected (p < 0.001). - **Cross-sectional Dependence (Pesaran CD)** Tests for correlation across panel units. → *Result*: Dependence present (p < 0.001). - **Heteroskedasticity (Breusch-Pagan)** Tests for unequal error variances. → *Result*: Could not be performed due to wave-level constraints. - **Residual Normality (Jarque–Bera)** Tests for normal distribution of residuals. → *Result*: Non-normal residuals detected (p < 0.001). ### Model Selection and Implications - **Model Choice**: The F-test, LM test, and Hausman test collectively support the use of a **Fixed Effects model** over Pooled OLS or Random Effects. - **Diagnostic Issues Identified**: - Serial correlation and cross-sectional dependence are both present. - Residuals are not normally distributed. - Heteroskedasticity could not be formally tested. ### Remedial Approach: Driscoll–Kraay Standard Errors Given these concerns, Driscoll–Kraay standard errors were applied. These are: - Robust to serial correlation and cross-sectional dependence - Valid under heteroskedasticity - Well-suited for panels with short time dimensions (3 waves) This correction ensures more reliable inference without altering coefficient estimates and non-normal residuals are less problematic under this framework. ```{r} #| label: tbl-multimodel-diagnostics #| tbl-cap: > #| Diagnostic test decisions for multiple outcome models in the 3-wave panel. # This code was primari;y generated through a dialoguie with chatGPT. It performs # a series of tests on the panel data in order to determine which model # structure is appropriate. # Initially I had it done by hand for one model. outcomes <- c("consumption_annual_total_log", "consumption_annual_food_log", "consumption_annual_nonfood_log") outcome_labels <- c("Total Consumption", "Food Consumption", "Non-Food Consumption") diagnostic_results <- map(outcomes, function(outcome) { spec_formula <- as.formula(paste0(outcome, " ~ shock_neg_z_wave_ihs * forest_any + precip_z_wave + temp_z_wave")) model_pooled <- plm(spec_formula, data = panel_data, model = "pooling") model_re <- plm(spec_formula, data = panel_data, model = "random") model_fe <- plm(spec_formula, data = panel_data, model = "within") f_test <- tryCatch(pFtest(model_fe, model_pooled), error = function(e) NULL) lm_test <- tryCatch(plmtest(model_pooled, type = "bp"), error = function(e) NULL) hausman_test <- tryCatch(phtest(model_fe, model_re), error = function(e) NULL) serial_test <- tryCatch(pbgtest(model_fe, order = 1), error = function(e) NULL) cd_test <- tryCatch(pcdtest(model_fe, test = "cd"), error = function(e) NULL) het_test <- tryCatch(bptest(model_fe$model), error = function(e) NULL) norm_test <- tryCatch(jarque.bera.test(residuals(model_fe)), error = function(e) NULL) tibble( Test = c( "F-test (FE vs Pooled OLS)", "LM test (RE vs Pooled OLS)", "Hausman test (FE vs RE)", "Serial Correlation (pbgtest, lag=1)", "Cross-sectional Dependence (Pesaran CD)", "Heteroskedasticity (Breusch-Pagan)", "Residual Normality (Jarque–Bera)" ), !!outcome := c( if (!is.null(f_test)) ifelse(f_test$p.value < 0.05, "Reject Pooled OLS (FE preferred)", "Fail to reject Pooled OLS") else "N/A", if (!is.null(lm_test)) ifelse(lm_test$p.value < 0.05, "Reject Pooled OLS (RE preferred)", "Fail to reject Pooled OLS") else "N/A", if (!is.null(hausman_test)) ifelse(hausman_test$p.value < 0.05, "Reject RE (FE preferred)", "Fail to reject RE (RE preferred)") else "N/A", if (!is.null(serial_test)) ifelse(serial_test$p.value < 0.05, "Serial correlation detected", "No serial correlation") else "Test inconclusive", if (!is.null(cd_test)) ifelse(cd_test$p.value < 0.05, "Cross-sectional dependence detected", "No cross-sectional dependence") else "N/A", if (!is.null(het_test)) ifelse(het_test$p.value < 0.05, "Heteroskedasticity detected", "Homoskedastic errors") else "N/A", if (!is.null(norm_test)) ifelse(norm_test$p.value < 0.05, "Residuals non-normal", "Residuals approx. normal") else "N/A" ) ) }) diagnostic_table <- reduce(diagnostic_results, full_join, by = "Test") colnames(diagnostic_table)[-1] <- outcome_labels diagnostic_table |> gt() |> cols_label(.list = set_names(names(diagnostic_table), names(diagnostic_table))) |> tab_options( table.width = pct(100), data_row.padding = px(4), column_labels.font.weight = "bold", heading.title.font.size = px(14), heading.subtitle.font.size = px(12), table.font.size = px(12), table.border.top.style = "none", table.border.bottom.style = "none" ) |> tab_style( style = list( cell_borders(sides = c("top", "bottom"), color = "#a6c8e0", weight = px(1)), cell_fill(color = "#e6f0fa") ), locations = cells_column_labels() ) |> opt_table_lines(extent = "none") |> opt_row_striping() |> gt::tab_options(quarto.disable_processing = TRUE) ``` ### Interpretation: Forest Cover × Negative Price Shocks Driscoll–Kraay FE estimates show that: - **Negative price shocks** significantly reduce total, food, and non-food consumption. - **Forest cover (binary)** is associated with lower food and total consumption overall. - The **interaction term (Shock × Forest)** is positive and statistically significant for all outcomes, strongest for food and total consumption. This suggests that households with forest access are more resilient to price shocks, particularly for food security. However, forest land may reduce baseline consumption, reflecting trade-offs between resilience and productivity. ```{r} #| label: model-specification # Driscoll‑Kraay SE Estimates: Fixed Effects Models (Volatility × Forest Cover) # Estimate FE models for five outcomes with exposure volatility and forest cover proxy model_total_vol <- plm(consumption_annual_total_log ~ shock_neg_z_wave_ihs * forest_any + precip_z_wave + temp_z_wave, data = panel_data, model = "within", effect = "twoways") model_food_vol <- plm(consumption_annual_food_log ~ shock_neg_z_wave_ihs * forest_any + precip_z_wave + temp_z_wave, data = panel_data, model = "within", effect = "twoways") model_nonfood_vol <- plm(consumption_annual_nonfood_log ~ shock_neg_z_wave_ihs * forest_any + precip_z_wave + temp_z_wave, data = panel_data, model = "within", effect = "twoways") # Compute Driscoll‑Kraay standard errors vcov_dk_total_vol <- vcovSCC(model_total_vol, maxlag = 1) vcov_dk_food_vol <- vcovSCC(model_food_vol, maxlag = 1) vcov_dk_nonfood_vol <- vcovSCC(model_nonfood_vol, maxlag = 1) # Format table using modelsummary models_vol <- list( "Total Annual" = model_total_vol, "Annual Food" = model_food_vol, "Annual Non‑Food" = model_nonfood_vol ) vcovs_vol <- list( "Total Annual" = vcov_dk_total_vol, "Annual Food" = vcov_dk_food_vol, "Annual Non‑Food" = vcov_dk_nonfood_vol ) ``` ```{r} #| label: tbl-model-results #| tbl-cap: > #| Driscoll–Kraay standard error estimates from fixed effects models with negative price shocks and forest cover interaction, across three consumption outcomes. modelsummary( models_vol, vcov = vcovs_vol, statistic = "std.error", stars = TRUE, gof_map = c("nobs", "r.squared"), coef_rename = c( "shock_neg_z_wave_ihs" = "Neg. Price Shock (IHS, z)", "forest_any" = "Forest Cover (Binary)", "shock_neg_z_wave_ihs:forest_any" = "Shock × Forest", "precip_z_wave" = "Annual Precipitation (z)", "temp_z_wave" = "Mean Temperature (z)" ), output = "gt", fmt = 3 ) |> tab_spanner( label = "Consumption Outcomes", columns = 2:4, id = "spanner_consumption" ) |> tab_style( style = list( cell_fill(color = "#e6f0fa"), cell_borders(sides = c("top", "bottom"), color = "#a6c8e0", weight = px(1)) ), locations = list( cells_column_labels(), cells_column_spanners(spanners = "spanner_consumption") ) ) |> tab_style( style = cell_borders( sides = "bottom", color = "black", weight = px(1) ), locations = cells_column_spanners(spanners = "spanner_consumption") ) |> tab_options( table.width = pct(100), column_labels.font.weight = "bold", table.font.size = px(12), data_row.padding = px(4), heading.title.font.size = px(14), heading.subtitle.font.size = px(12), table.border.top.style = "none", table.border.bottom.style = "none", row.striping.include_table_body = TRUE, row.striping.background_color = "#f7f7f7" ) |> opt_table_lines(extent = "none") |> tab_options(quarto.disable_processing = TRUE) ``` ```{r} #| label: fig-model-dotwhisker #| fig-cap: > #| Estimated coefficients and 95% confidence intervals from fixed effects models with Driscoll–Kraay standard errors. Interaction effects shown for negative price shocks and forest cover. # Function to explicitly tidy with robust standard errors tidy_robust <- function(model, vcov_matrix, model_label) { coeftest(model, vcov_matrix) |> broom::tidy(conf.int = TRUE, conf.level = 0.95) |> mutate(model = model_label) } # Correctly bind robust estimates dw_data <- bind_rows( tidy_robust(model_total_vol, vcov_dk_total_vol, "Total Consumption"), tidy_robust(model_food_vol, vcov_dk_food_vol, "Food Consumption"), tidy_robust(model_nonfood_vol, vcov_dk_nonfood_vol, "Non-Food Consumption") ) |> filter(term %in% c( "shock_neg_z_wave_ihs", "forest_any", "shock_neg_z_wave_ihs:forest_any" )) |> mutate(term = dplyr::recode(term, "shock_neg_z_wave_ihs" = "Neg. Price Shock (IHS, z)", "forest_any" = "Forest Cover (Binary)", "shock_neg_z_wave_ihs:forest_any" = "Shock × Forest" )) # Plot properly dwplot(dw_data, dot_args = list(size = 2.5), whisker_args = list(size = 0.6), dodge_size = 0.25 ) + geom_vline(xintercept = 0, linetype = "dashed", color = "gray40") + scale_color_brewer(palette = "Set2") + coord_cartesian(xlim = c(-0.06, 0.06)) + # Adjust based on your data range theme_minimal(base_size = 11) + theme( legend.title = element_blank(), legend.position = "bottom", axis.title.y = element_blank(), axis.text.y = element_text(size = 9), axis.text.x = element_text(size = 8), plot.title = element_text(hjust = 0.5, size = 12), panel.grid.minor = element_blank() ) + labs( x = "Estimated Coefficient (with 95% CI)" ) ``` ## Marginal Effects of Forest Cover on Shock Exposure To better interpret the interaction between negative price shocks and forest cover, the marginal effects of exposure across households with and without forest access is visualized. This plot is derived from a two-way fixed effects model with Driscoll–Kraay standard errors. The estimated lines show predicted log consumption across varying levels of negative price shocks (IHS-transformed), disaggregated by forest cover status. Confidence intervals are included. **Note:** This plot is based on the total consumption model. ```{r} #| label: fig-marginal-binary-forest #| fig-cap: > #| Marginal effects of negative price shocks on predicted consumption, moderated by binary forest cover access. Lines represent households with and without forest cover. model_binary_forest <- plm( consumption_annual_total_log ~ shock_neg_z_wave_ihs * forest_any + precip_z_wave + temp_z_wave, data = panel_data, model = "within", effect = "twoways" ) vcov_dk_binary <- vcovSCC(model_binary_forest, maxlag = 1) marginal_binary_forest <- ggpredict( model_binary_forest, terms = c("shock_neg_z_wave_ihs [all]", "forest_any"), vcov = vcov_dk_binary ) levels(marginal_binary_forest$group) <- c("No Forest Cover", "Has Forest Cover") ggplot(marginal_binary_forest, aes(x = x, y = predicted, color = group, fill = group)) + geom_line(linewidth = 1) + geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.25, colour = NA) + scale_color_manual(values = c("gray40", "steelblue")) + scale_fill_manual(values = c("gray40", "steelblue")) + labs( x = "Negative Price Shock (IHS, z-score)", y = "Predicted Log Consumption", color = "Forest Cover", fill = "Forest Cover" ) + theme_minimal(base_size = 11) + theme( legend.position = "top", plot.title = element_text(size = 13, face = "bold"), plot.subtitle = element_text(size = 11) ) ``` ```{r} #| label: model-diagnostics # Extract diagnostics data model_diagnostics <- bind_rows( tibble( model = "Total Consumption", residuals = as.vector(residuals(model_total_vol)), fitted = as.vector(fitted(model_total_vol)), time = as.character(index(model_total_vol, "time")), r_squared = round(summary(model_total_vol)$r.squared[1], 3), actual = model_total_vol$model$consumption_annual_total_log ), tibble( model = "Food Consumption", residuals = as.vector(residuals(model_food_vol)), fitted = as.vector(fitted(model_food_vol)), time = as.character(index(model_food_vol, "time")), r_squared = round(summary(model_food_vol)$r.squared[1], 3), actual = model_food_vol$model$consumption_annual_food_log ), tibble( model = "Non-Food Consumption", residuals = as.vector(residuals(model_nonfood_vol)), fitted = as.vector(fitted(model_nonfood_vol)), time = as.character(index(model_nonfood_vol, "time")), r_squared = round(summary(model_nonfood_vol)$r.squared[1], 3), actual = model_nonfood_vol$model$consumption_annual_nonfood_log ) ) # Calculate standardized residuals correctly by group model_diagnostics <- model_diagnostics %>% group_by(model) %>% mutate( std_residuals = (residuals - mean(residuals, na.rm = TRUE)) / sd(residuals, na.rm = TRUE) ) %>% ungroup() # Convert time to factor model_diagnostics$time <- factor(model_diagnostics$time, levels = unique(model_diagnostics$time)) # Create model labels with R² model_labels <- c( "Total Consumption" = paste0("Total (R² = ", round(summary(model_total_vol)$r.squared[1], 3), ")"), "Food Consumption" = paste0("Food (R² = ", round(summary(model_food_vol)$r.squared[1], 3), ")"), "Non-Food Consumption" = paste0("Non-Food (R² = ", round(summary(model_nonfood_vol)$r.squared[1], 3), ")") ) # Set common axis limits for all plots # Find the overall ranges for various metrics y_range <- range(model_diagnostics$residuals, na.rm = TRUE) x_range <- range(model_diagnostics$fitted, na.rm = TRUE) actual_range <- range(model_diagnostics$actual, na.rm = TRUE) sqrt_std_range <- range(sqrt(abs(model_diagnostics$std_residuals)), na.rm = TRUE) # Add some padding to the ranges y_pad <- diff(y_range) * 0.05 x_pad <- diff(x_range) * 0.05 actual_pad <- diff(actual_range) * 0.05 sqrt_std_pad <- diff(sqrt_std_range) * 0.05 y_limits <- c(y_range[1] - y_pad, y_range[2] + y_pad) x_limits <- c(x_range[1] - x_pad, x_range[2] + x_pad) actual_limits <- c(actual_range[1] - actual_pad, actual_range[2] + actual_pad) sqrt_std_limits <- c(sqrt_std_range[1] - sqrt_std_pad, sqrt_std_range[2] + sqrt_std_pad) # Calculate global ranges for QQ plot qq_data <- list() for(m in unique(model_diagnostics$model)) { res <- model_diagnostics$residuals[model_diagnostics$model == m] qq <- qqnorm(res, plot.it = FALSE) qq_data[[m]] <- data.frame(x = qq$x, y = qq$y, model = m) } qq_df <- do.call(rbind, qq_data) qq_x_range <- range(qq_df$x, na.rm = TRUE) qq_y_range <- range(qq_df$y, na.rm = TRUE) qq_x_pad <- diff(qq_x_range) * 0.05 qq_y_pad <- diff(qq_y_range) * 0.05 qq_x_limits <- c(qq_x_range[1] - qq_x_pad, qq_x_range[2] + qq_x_pad) qq_y_limits <- c(qq_y_range[1] - qq_y_pad, qq_y_range[2] + qq_y_pad) # Set theme theme_set(theme_minimal() + theme(strip.text = element_text(face = "bold", size = 9), plot.title = element_text(hjust = 0.5, size = 11), axis.title = element_text(size = 9), axis.text = element_text(size = 8))) ``` ## Model Diagnostics Summary To assess the validity of the fixed effects models with Driscoll–Kraay standard errors, six diagnostic plots for each of the three consumption outcomes were generated (total, food, and non-food). These visuals evaluate key model assumptions such as linearity, homoskedasticity, and residual normality. Faceting by model allows comparison across outcomes. ### Diagnostic Plots and Interpretation - **Residuals vs Fitted** ([Figure @fig-residuals-vs-fitted]) This plot checks for non-linearity or model misspecification by visualizing whether residuals scatter randomly around zero. In this case, residuals are mostly centered but show slight funneling in the non-food consumption model, indicating mild heteroskedasticity or non-linearity in that specification. - **Normal Q-Q Plot** ([Figure @fig-residuals-qq-plot]) The Q-Q plot assesses whether residuals are normally distributed. In this case, all models show deviation from the reference line at both tails, suggesting non-normal residuals—particularly for total and food consumption. While this violates the classical OLS assumption, our use of robust standard errors mitigates inference concerns. - **Scale-Location Plot** ([Figure @fig-scale-location-plot]) This plot tests for homoskedasticity by plotting the square root of standardized residuals against fitted values. In this case, the total consumption model shows increasing spread as fitted values rise, indicating some non-constant variance. The food and non-food models are more stable but still show mild curvature. - **Residuals by Wave** ([Figure @fig-residuals-by-wave]) This plot detects wave-level differences in residual behavior. In this case, residual medians remain roughly centered for all models across waves. Slight compression of spread in Wave 2 may suggest stronger model fit or less variability in that period. - **Actual vs Fitted Values** ([Figure @fig-actual-vs-fitted]) This plot evaluates predictive accuracy—ideally, points should fall along the 45-degree line. In this case, the relationship is generally linear, but vertical spread (especially in food consumption) reflects moderate prediction error and unexplained variation. - **Histogram of Residuals** ([Figure @fig-residuals-histogram]) This plot provides a basic check for residual symmetry and skew. In this case, all models show roughly centered residuals, but with some skewness and peaking. Total and food consumption residuals appear leptokurtic, while non-food is slightly more dispersed. ```{r} #| label: fig-residuals-vs-fitted #| fig-cap: > #| Residuals versus fitted values for each model. A loess smoother is shown to detect non-linearity. Dashed lines indicate zero residual. model_diagnostics |> ggplot(aes(x = fitted, y = residuals, color = model)) + facet_wrap(~ model, labeller = labeller(model = model_labels)) + geom_point(alpha = 0.3, size = 1, shape = 21) + geom_hline(yintercept = 0, linetype = "dashed", color = "black") + geom_smooth(method = "loess", se = FALSE, linewidth = 0.6) + scale_x_continuous(limits = x_limits) + scale_y_continuous(limits = y_limits) + scale_color_brewer(palette = "Set2") + labs( x = "Fitted Values", y = "Residuals" ) + theme_minimal(base_size = 11) + theme( legend.position = "none", strip.text = element_text(face = "bold", size = 9), axis.title = element_text(size = 9), axis.text = element_text(size = 8) ) ``` ```{r} #| label: fig-residuals-qq-plot #| fig-cap: > #| Normal Q-Q plot of residuals for each model. Deviation from the dashed line indicates non-normality. model_diagnostics |> ggplot(aes(sample = residuals)) + facet_wrap(~ model, labeller = labeller(model = model_labels)) + stat_qq(aes(color = model), size = 1.2, alpha = 0.3, shape = 21) + stat_qq_line(color = "black", linewidth = 0.6, linetype = "dashed") + scale_color_brewer(palette = "Set2") + scale_x_continuous(limits = qq_x_limits) + scale_y_continuous(limits = qq_y_limits) + labs( x = "Theoretical Quantiles", y = "Sample Quantiles" ) + theme_minimal(base_size = 11) + theme( legend.position = "none", strip.text = element_text(face = "bold", size = 9), axis.title = element_text(size = 9), axis.text = element_text(size = 8) ) ``` ```{r} #| label: fig-scale-location-plot #| fig-cap: > #| Scale-location plot (√|standardized residuals| vs fitted values) for each model. A loess smoother is included to assess heteroskedasticity. model_diagnostics |> ggplot(aes(x = fitted, y = sqrt(abs(std_residuals)), color = model)) + facet_wrap(~ model, labeller = labeller(model = model_labels)) + geom_point(alpha = 0.3, size = 1, shape = 21) + geom_smooth(method = "loess", se = FALSE, linewidth = 0.6) + scale_x_continuous(limits = x_limits) + scale_y_continuous(limits = sqrt_std_limits) + scale_color_brewer(palette = "Set2") + labs( x = "Fitted Values", y = "√|Standardized Residuals|" ) + theme_minimal(base_size = 11) + theme( legend.position = "none", strip.text = element_text(face = "bold", size = 9), axis.title = element_text(size = 9), axis.text = element_text(size = 8) ) ``` ```{r} #| label: fig-residuals-by-wave #| fig-cap: > #| Distribution of residuals by survey wave for each model. Boxplots reveal potential temporal patterns in residual behavior. A dashed horizontal line marks zero residual. model_diagnostics |> ggplot(aes(x = time, y = residuals, fill = model)) + facet_wrap(~ model, labeller = labeller(model = model_labels)) + geom_boxplot(alpha = 0.6, outlier.size = 0.7, linewidth = 0.4) + geom_hline(yintercept = 0, linetype = "dashed", color = "black") + scale_y_continuous(limits = y_limits) + scale_fill_brewer(palette = "Set2") + labs( x = "Survey Wave", y = "Residuals" ) + theme_minimal(base_size = 11) + theme( legend.position = "none", strip.text = element_text(face = "bold", size = 9), axis.title = element_text(size = 9), axis.text = element_text(size = 8) ) ``` ```{r} #| label: fig-actual-vs-fitted #| fig-cap: > #| Actual versus fitted values for each model. The 45-degree reference line indicates perfect prediction. model_diagnostics |> ggplot(aes(x = fitted, y = actual, color = model)) + facet_wrap(~ model, labeller = labeller(model = model_labels)) + geom_point(alpha = 0.3, size = 1, shape = 21) + geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "black", linewidth = 0.6) + scale_x_continuous(limits = x_limits) + scale_y_continuous(limits = actual_limits) + scale_color_brewer(palette = "Set2") + labs( x = "Fitted Values", y = "Actual Values" ) + theme_minimal(base_size = 11) + theme( legend.position = "none", strip.text = element_text(face = "bold", size = 9), axis.title = element_text(size = 9), axis.text = element_text(size = 8) ) ``` ```{r} #| label: fig-residuals-histogram #| fig-cap: > #| Histogram of residuals for each model. Dashed vertical line indicates zero residual. Distribution shape may indicate normality or skew. model_diagnostics |> ggplot(aes(x = residuals, fill = model)) + facet_wrap(~ model, labeller = labeller(model = model_labels)) + geom_histogram(bins = 30, alpha = 0.7, color = "gray30") + geom_vline(xintercept = 0, linetype = "dashed", color = "black") + scale_x_continuous(limits = y_limits) + scale_fill_brewer(palette = "Set2") + labs( x = "Residuals", y = "Count" ) + theme_minimal(base_size = 11) + theme( legend.position = "none", strip.text = element_text(face = "bold", size = 9), axis.title = element_text(size = 9), axis.text = element_text(size = 8) ) ``` ### Summary Although the models exhibit some non-normal and mildly heteroskedastic residuals, taken together, these plots suggest there is acceptable overall fit/residual centering and no major issues with autocorrelation or structural misspecification. They support the reasonablity of the model specification and reinforce the choice to use Driscoll–Kraay robust standard errors.