Identifying and Managing Outliers in Harvest Data for Improved Analysis
This strategy focuses on identifying and handling outliers in the harvest dataset to ensure the accuracy and interpretability of the analysis. By detecting and managing extreme data points, we can produce visualizations and insights that are both meaningful and easy to understand.
Rationale
Outliers in the dataset, whether they are abnormally large or small values, can skew the results and obscure meaningful trends. These outliers may result from:
- Data entry errors.
- Unusual farming conditions.
- Inconsistent definitions of farm areas or crop yields.
To address these challenges, we implemented a structured approach to detect and handle outliers in the harvest dataset. This process enables us to clean the data and produce insightful and accurate visualizations that highlight key trends in agricultural yields.
Implementation Strategy
Data Sources
We utilized the following key columns from the harvest dataset:
- actual_quantity: The actual yield from each farm location.
- location_area_ha: The area of the location in hectares.
- type: The type of location (e.g., field, garden, greenhouse).
- crop_group: The broader category of the crop (e.g., Vegetables and melons, Cereals).
- farm_id: The unique identifier for each farm.
Steps
1. Initial Cleaning:
- Remove rows with missing values in critical columns (actual_quantity, location_area_ha) to ensure the dataset is complete.
2. Calculate Yield per Hectare:
- Add a derived column, yield_per_ha, to calculate productivity using the formula: \[ \text{yield\_per\_ha} = \frac{\text{actual\_quantity}}{\text{location\_area\_ha}} \]
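Steps 1 and 2 can be sketched in plain Python. This is a minimal illustration rather than the project's actual code: it assumes each harvest record is a dict keyed by the column names listed under Data Sources, and the helper name clean_and_add_yield and the sample rows are hypothetical. Rows with a zero or negative area are also skipped, which is an added assumption to avoid division by zero.

```python
def clean_and_add_yield(rows):
    """Drop rows missing critical values, then add a yield_per_ha field."""
    cleaned = []
    for row in rows:
        quantity = row.get("actual_quantity")
        area = row.get("location_area_ha")
        # Step 1: skip rows with a missing value in a critical column.
        if quantity is None or area is None:
            continue
        # Assumption (not in the original text): a non-positive area
        # cannot give a meaningful ratio, so such rows are skipped too.
        if area <= 0:
            continue
        # Step 2: yield per hectare = actual quantity / area in hectares.
        cleaned.append({**row, "yield_per_ha": quantity / area})
    return cleaned


# Hypothetical sample records, for illustration only.
sample_rows = [
    {"farm_id": 1, "actual_quantity": 1200.0, "location_area_ha": 4.0},
    {"farm_id": 2, "actual_quantity": None, "location_area_ha": 2.0},
    {"farm_id": 3, "actual_quantity": 300.0, "location_area_ha": None},
]
cleaned_rows = clean_and_add_yield(sample_rows)
```

Only the first sample record survives, because the other two each lack one critical value.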
3. Initial Outlier Detection Using IQR:
- Compute the Interquartile Range (IQR) for actual_quantity within each crop_group: \[ \text{IQR} = Q_3 - Q_1 \]
- Define lower and upper bounds: \[ \text{Lower bound} = Q_1 - 1.5 \times \text{IQR} \] \[ \text{Upper bound} = Q_3 + 1.5 \times \text{IQR} \]
- Remove data points outside these bounds.
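The per-group IQR filter in step 3 might look like the following sketch, which uses only the Python standard library. The function name remove_iqr_outliers and the sample data are hypothetical. The quartile convention (method="inclusive", which interpolates linearly between data points) is also an assumption, since the text does not say how Q1 and Q3 were computed and different tools use slightly different definitions.

```python
from collections import defaultdict
from statistics import quantiles


def remove_iqr_outliers(rows, value_key="actual_quantity", group_key="crop_group"):
    """Keep rows whose value lies within [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for its group."""
    # Collect the values of each group.
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])

    # Compute the lower and upper bound of each group.
    bounds = {}
    for group, values in groups.items():
        if len(values) < 2:
            # quantiles() needs at least two points; tiny groups are left unfiltered.
            bounds[group] = (float("-inf"), float("inf"))
            continue
        q1, _, q3 = quantiles(values, n=4, method="inclusive")
        iqr = q3 - q1
        bounds[group] = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)

    # Keep only rows that fall inside their group's bounds.
    return [
        row for row in rows
        if bounds[row[group_key]][0] <= row[value_key] <= bounds[row[group_key]][1]
    ]


# Hypothetical sample: one crop group with a single extreme value.
harvest = [
    {"crop_group": "Cereals", "actual_quantity": q}
    for q in (100, 110, 120, 130, 140, 5000)
]
filtered = remove_iqr_outliers(harvest)
```

For this sample, Q1 is 112.5 and Q3 is 137.5, so the bounds are 75 and 175 and only the value 5000 is dropped.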
4. Manual Modifications:
- Apply domain-specific thresholds for further outlier detection, as detailed below:
Per Location Type:
- Field:
- Locations with location_area_ha >= 160 or actual_quantity >= 300,000 were flagged as outliers and removed.
- Garden:
- Locations with location_area_ha >= 0.8 or actual_quantity >= 1,500 were flagged as outliers and removed.
- Greenhouse:
- Locations with location_area_ha >= 0.1 or actual_quantity >= 200 were flagged as outliers and removed.
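The location-type thresholds above can be kept in a small lookup table so the rule is applied uniformly. The sketch below is illustrative only: the names TYPE_LIMITS and is_type_outlier are hypothetical, and it assumes the type column holds the values field, garden, and greenhouse (matched case-insensitively) and that types without a listed limit are kept.

```python
# Upper limits taken from the list above; a location counts as an
# outlier if it reaches EITHER the area limit or the quantity limit.
TYPE_LIMITS = {
    "field": {"location_area_ha": 160, "actual_quantity": 300_000},
    "garden": {"location_area_ha": 0.8, "actual_quantity": 1_500},
    "greenhouse": {"location_area_ha": 0.1, "actual_quantity": 200},
}


def is_type_outlier(row):
    """Return True if the row reaches either limit for its location type."""
    limits = TYPE_LIMITS.get(str(row["type"]).lower())
    if limits is None:
        return False  # Assumption: types without listed limits are kept.
    return (
        row["location_area_ha"] >= limits["location_area_ha"]
        or row["actual_quantity"] >= limits["actual_quantity"]
    )


# Hypothetical sample records, for illustration only.
locations = [
    {"type": "field", "location_area_ha": 40.0, "actual_quantity": 90_000},
    {"type": "field", "location_area_ha": 200.0, "actual_quantity": 90_000},
    {"type": "garden", "location_area_ha": 0.5, "actual_quantity": 1_500},
    {"type": "greenhouse", "location_area_ha": 0.05, "actual_quantity": 120},
]
kept = [row for row in locations if not is_type_outlier(row)]
```

The second record is dropped for its area and the third for reaching the garden quantity limit exactly, because the thresholds use >=.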
Per Crop Group:
- Fruit and Nuts:
- Locations with location_area_ha < 10 and actual_quantity <= 250 were removed.
- Vegetables and Melons:
- Locations with location_area_ha < 50 and actual_quantity < 50 were removed.
- Oilseed Crops and Oleaginous Fruits:
- Locations with location_area_ha < 2 and actual_quantity < 120 were removed.
- Other Crops:
- Locations with location_area_ha < 50 and actual_quantity < 50 were removed.
- Stimulant, Spice, and Aromatic Crops:
- Locations with location_area_ha < 15 and actual_quantity <= 220 were removed.
- Sugar Crops:
- Locations with location_area_ha < 2 and actual_quantity <= 1,000 were removed.
- Cereals:
- Locations with location_area_ha <= 15 and actual_quantity <= 3,000 were removed.
- Leguminous Crops:
- Locations with location_area_ha <= 6 and actual_quantity <= 200 were removed.
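Because the crop-group rules mix strict and non-strict comparisons, it can help to store the comparison operator alongside each limit. The following sketch shows one way to do that. The names CROP_GROUP_RULES and is_small_low_yield are hypothetical, and matching crop_group values case-insensitively is an assumption, made because the text writes both "Vegetables and melons" and "Vegetables and Melons".

```python
import operator

# (area comparison, area limit, quantity comparison, quantity limit),
# copied from the per-crop-group list above. A row is removed only when
# BOTH conditions hold.
CROP_GROUP_RULES = {
    "Fruit and Nuts": (operator.lt, 10, operator.le, 250),
    "Vegetables and Melons": (operator.lt, 50, operator.lt, 50),
    "Oilseed Crops and Oleaginous Fruits": (operator.lt, 2, operator.lt, 120),
    "Other Crops": (operator.lt, 50, operator.lt, 50),
    "Stimulant, Spice, and Aromatic Crops": (operator.lt, 15, operator.le, 220),
    "Sugar Crops": (operator.lt, 2, operator.le, 1_000),
    "Cereals": (operator.le, 15, operator.le, 3_000),
    "Leguminous Crops": (operator.le, 6, operator.le, 200),
}
# Assumption: group names are compared case-insensitively.
RULES_BY_NAME = {name.lower(): rule for name, rule in CROP_GROUP_RULES.items()}


def is_small_low_yield(row):
    """Return True if the row meets both removal conditions for its crop group."""
    rule = RULES_BY_NAME.get(str(row["crop_group"]).lower())
    if rule is None:
        return False  # Groups without a listed rule are kept.
    area_op, area_limit, qty_op, qty_limit = rule
    return (
        area_op(row["location_area_ha"], area_limit)
        and qty_op(row["actual_quantity"], qty_limit)
    )


# Hypothetical sample records, for illustration only.
records = [
    {"crop_group": "Cereals", "location_area_ha": 15.0, "actual_quantity": 3_000},
    {"crop_group": "Cereals", "location_area_ha": 15.0, "actual_quantity": 3_001},
    {"crop_group": "Sugar Crops", "location_area_ha": 2.0, "actual_quantity": 500},
    {"crop_group": "Vegetables and melons", "location_area_ha": 1.0, "actual_quantity": 10},
]
remaining = [row for row in records if not is_small_low_yield(row)]
```

The boundary cases show why the operators matter: a Cereals location at exactly 15 ha and 3,000 units is removed, while a Sugar Crops location at exactly 2 ha is kept because its area rule is strict.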
5. Advanced Filtering Using Mean ± 2 Standard Deviations:
- For some crop groups (e.g., High Starch Root/Tuber Crops, Potatoes and Yams, Beverage and Spice Crops):
- Calculate the mean and standard deviation.
- Retain data points within: \[ \text{mean} - 2 \times \text{SD} \le x \le \text{mean} + 2 \times \text{SD} \]
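A minimal sketch of the mean ± 2 standard deviations filter follows. The text does not state which column the filter was applied to or which standard deviation was used, so this example assumes it runs on the actual_quantity values of a single crop group and uses the sample standard deviation. The function name within_two_sd and the sample values are hypothetical.

```python
from statistics import mean, stdev


def within_two_sd(values):
    """Return the values lying within mean ± 2 sample standard deviations."""
    if len(values) < 2:
        return list(values)  # stdev() needs at least two points.
    mu = mean(values)
    # Assumption: sample SD; statistics.pstdev() gives the population SD.
    sigma = stdev(values)
    lower, upper = mu - 2 * sigma, mu + 2 * sigma
    return [v for v in values if lower <= v <= upper]


# Hypothetical quantities for one crop group, with one extreme value.
quantities = [50, 52, 48, 51, 49, 50, 53, 47, 50, 200]
retained = within_two_sd(quantities)
```

Note that the extreme value itself inflates both the mean and the standard deviation, so this filter is less sensitive than the IQR rule when a group contains several large outliers.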
Visualizations
1. Boxplots
- By Location Type:
- Show the distribution of actual_quantity across field, garden, and greenhouse.
- Highlight how outliers affect the spread and central tendency.
- By Crop Group:
- Compare actual_quantity across different crop_group values to identify trends and variability.
2. Scatterplots
- By Location Type:
- Plot location_area_ha vs. actual_quantity for each location type.
- Highlight the relationship between farm size and yield for field, garden, and greenhouse.
- By Crop Group:
- Show the distribution of actual_quantity vs. location_area_ha for each crop group, distinguishing between types (e.g., Field vs. Garden).
3. Histograms
- By Location Type:
- Show frequency distributions of actual_quantity for field, garden, and greenhouse before and after outlier removal.
- By Crop Group:
- Display frequency distributions of actual_quantity within each crop group before and after outlier removal.
Benefits of This Strategy
- Improved Data Quality:
- Outliers that distort trends and visualizations are removed.
- Targeted Outlier Detection:
- Combines general statistical methods (e.g., IQR) with specific thresholds based on domain knowledge.
- Enhanced Visualizations:
- Produces accurate and meaningful visualizations, enabling better decision-making.
Conclusion
This approach addresses outliers systematically, so that the dataset represents typical trends with minimal distortion from extreme values. By combining IQR-based filtering, manual domain-specific thresholds, and mean ± 2 standard deviation filtering for selected crop groups, we improved the quality of the dataset and produced clear, reliable visualizations for analysis. This strategy provides a strong foundation for understanding agricultural yields across diverse farm locations and crop groups.