Spatially simplified scatterplots for large raster datasets

Pages 81-93 | Received 10 Feb 2016, Accepted 13 Apr 2016, Published online: 24 May 2016

Figures & data

Figure 1. Scatterplot and a GAM fitted line for Normalized Difference Vegetation Index (NDVI) (Y) and ETM+ Band 4 (X). Data are standardized. The R package mgcv was used to fit the GAM (Wood 2006; Hastie and Tibshirani 1990).

Figure 2. Alpha-blended scatterplot of a subset (1000 × 1000) of the data in Figure 1, with the alpha value set to 0.05. The alpha value specifies the transparency of the data points, ranging from 1 for opaque to 0 for fully transparent. Even at such a low alpha level, excessive overplotting remains a problem.
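
To illustrate why a low alpha value does not remove overplotting, the following minimal sketch computes the combined opacity of k points drawn on top of each other. It assumes standard "over" compositing of identically colored points, which is an assumption and not a detail given in the caption; the function name `stacked_opacity` is hypothetical.

```python
def stacked_opacity(alpha: float, k: int) -> float:
    """Combined opacity of k overlapping points, each drawn with the same alpha.

    Assumes standard 'over' compositing: each new point lets a fraction
    (1 - alpha) of what lies beneath it show through.
    """
    return 1.0 - (1.0 - alpha) ** k


# Opacity at one location as more points pile up (alpha = 0.05 as in Figure 2).
for k in (1, 10, 50, 100):
    print(k, round(stacked_opacity(0.05, k), 3))
```

Under this assumption, about a hundred coincident points are enough to make a location nearly opaque, and a 1000 × 1000 subset places far more than that in the dense parts of the plot.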

Figure 3. Two-dimensional binned kernel smoothing scatterplot.

Notes: Black dots are bins with a single data point. Data are the same as Figure .

Figure 4. Scatterplot with contours. Due to the lack of sufficient memory to render the plot with the large data-set in Figure 2, 10,000 data points are randomly generated from a standard normal distribution.

Figure 5. Binned scatterplot with the same data in Figure 2.

Notes: A 30-by-30 grid was used to bin the original data (1000 × 1000). Brightness symbolizes the counts of data points in each bin. Changing the bin size can achieve different levels of generalization. The shape of the bins can be either rectangular or hexagonal.
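
The rectangular binning described in these notes can be sketched as follows. This is a minimal illustration, not the code used for Figure 5: the input points are synthetic placeholders, and the helper name `bin_counts` is hypothetical.

```python
import random
from collections import Counter


def bin_counts(xs, ys, nbins=30):
    """Count points per cell of an nbins-by-nbins rectangular grid."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)

    def index(v, lo, hi):
        if hi == lo:
            return 0
        # Clamp so that the maximum value falls into the last bin.
        return min(int((v - lo) / (hi - lo) * nbins), nbins - 1)

    return Counter(
        (index(x, x_min, x_max), index(y, y_min, y_max)) for x, y in zip(xs, ys)
    )


# Placeholder data (assumption): 10,000 correlated synthetic points.
random.seed(0)
xs = [random.gauss(0, 1) for _ in range(10_000)]
ys = [x + random.gauss(0, 0.5) for x in xs]

counts = bin_counts(xs, ys)
print(len(counts), "non-empty bins;", max(counts.values()), "points in the fullest bin")
```

Each count would then be mapped to a brightness value for display. Changing `nbins` changes the level of generalization.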

Figure 6. Nested lattice hexagon binning with the same data in Figure 2.

Notes: For the hexagons, size of the inner point represents counts, while hue represents the hierarchical categories and serves as borders between the hexagons. The graph was generated with the R package hexbin.

Table 1. Summary statistics of the simulated data-set.

Figure 7. Simulated data-set X (left) and Y (right).

Notes: Y = 0.5X + X² + X³. The size of the image is 100 × 100 cells. See Table for the statistical summaries.
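
As a small worked example of the relation in these notes, the sketch below evaluates Y for a few values of X. It reads the relation as a cubic polynomial in X; the function name `simulate_y` is hypothetical, and the way the spatial pattern of X itself was simulated is not given here and is not reproduced.

```python
def simulate_y(x: float) -> float:
    """Evaluate Y = 0.5*X + X^2 + X^3 for a single cell value X."""
    return 0.5 * x + x**2 + x**3


# Evaluate the relation for a few illustrative X values.
for x in (-1.0, 0.0, 1.0, 2.0):
    print(x, simulate_y(x))
```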

Figure 8. Scatterplot with a local regression fit for the simulated data-set X and Y.

Figure 9. Remote sensing data-set used in the experiment.

Notes: Landsat ETM Band 4 (left) and derived NDVI (right). Image dimension is 3000 rows by 3000 columns, with a spatial resolution of 30 m. Data were obtained on 6 July 2002, near Palisades, Idaho. Figures are various scatterplots for ETM Band 4 and NDVI.

Table 2. Summary statistics of the remote sensing data-set.

Figure 10. Illustrations of sample sites for three sampling schemes: random (left), regular (center), and hexagon stratified random (right). The background image shows variable X (100 rows by 100 columns); variable Y is not shown. Effective sample size was calculated based on ρx = 0.9658, ρy = 0.5136, and ρxy = 0.4688. Due to geometric restrictions, the actual sample sizes are 2809 for regular sampling and 2438 for hexagon stratified random sampling.
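
One plausible reading of the "geometric restrictions" on regular sampling is that a square lattice can only hold a perfect-square number of sites. The sketch below shows that arithmetic. It assumes the target size is the 2785 points reported for the random sample in Figure 11 and that the lattice is rounded up to the next square; both are assumptions, and `regular_grid_size` is a hypothetical helper.

```python
import math


def regular_grid_size(target: int) -> int:
    """Smallest perfect square >= target, i.e. sites on a square lattice."""
    side = math.isqrt(target)
    if side * side < target:
        side += 1  # round up so the lattice holds at least `target` sites
    return side * side


# Assumed target: 2785 points, as for the random sample in Figure 11.
print(regular_grid_size(2785))
```

Under these assumptions the result is a 53 × 53 lattice, which matches the 2809 sites reported for regular sampling. The hexagon stratified count depends on how the hexagons tile the image and is not reproduced here.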

Table 3. Summary statistics for simulated variables from the resampled sets.

Figure 11. Spatially simplified scatterplot consisting of the point cloud and local/global fitting lines, overlaid on the same graphic features from the original data.

Notes: Grey dots are the original data (n = 10,000 data points). Black dots are data points from the resampled sets (left: random sampling, n* = 2785; center: regular sampling, n* = 2809; right: hexagon stratified random sampling, n* = 2438). The red curve (with a yellow error band) and the red dashed line are from the sampled set. The white curve (with a blue error band that appears grey due to color mixing with the yellow band) and the green line are from the original data.

Figure 12. Spatially simplified scatterplots (point clouds with local/global fits) from three resampling schemes (left: random sampling; center: regular sampling; right: hexagon stratified sampling).

Figure 13. A random sample (n = 1326, right) of the hexagon stratified random sample set (n = 2438, left).

Note: Sample sites are shown as yellow dots.

Figure 14. Spatially simplified scatterplot of the simulated data using a two-stage resampling scheme.

Notes: In the left graph, the fitted curves for the original data are colored in white (with a blue error band) and green. The red curve (with a yellow error band) and the red dashed line are from the sampled set. The original data points are colored in light grey. In the right graph, the LOESS line is red with a blue error band. The dashed line is the OLS fit.

Table 4. Summary of resampling statistics for the remote sensing data-set.

Figure 15. An overlay of scatterplots from the original data and the resampled set.

Notes: Light grey point cloud: original data (3000 × 3000 points). Darker grey point cloud: hexagon stratified random sampled data (174,212 points). Black wavy curve: GAM fitted line from the original data. Violet wavy curve: GAM fitted line from sample data. Blue curve: a cubic polynomial fit to the sample data. Red curve: a quartic polynomial fit to the sample data. Green curve: a quartic polynomial fit to the original data. Black dash: linear fit to the original data. Violet dash: linear fit to the sample data.

Figure 16. Scatterplot of data from the stage-two regular resampling.

Notes: Light grey point cloud: stage-one hexagon random sampled data (174,212 data points). Darker grey point cloud: stage-two regular sampled data (42,025 data points). Black wavy curve: GAM fitted line from stage-one data. Violet wavy curve: GAM fitted line from stage-two resampled data. Blue curve: a cubic polynomial fit to the stage-two resampled data. Red curve: a quartic polynomial fit to the stage-two resampled data. Green curve: a quartic polynomial fit to the stage-one data. Black dash: linear fit to the stage-one data. Violet dash: linear fit to the stage-two resampled data. X axis = normalized Band 4 radiance; Y axis = normalized NDVI, where NDVI = (Band 4 − Band 3)/(Band 4 + Band 3). NDVI is a simple algebraic combination of remotely sensed spectral information (e.g., Bands 4 and 3 from the Landsat Enhanced Thematic Mapper) that provides meaningful information about vegetative structure (leaf/cellular structure) and condition (chlorophyll content). Generally, high NDVI values indicate that photosynthetically active plant biomass is present.
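
The NDVI formula in these notes can be written as a short per-pixel function. This is a minimal sketch: the function name `ndvi` is hypothetical, and returning NaN when both bands are zero is a choice made for this example, not something stated in the article.

```python
def ndvi(band4: float, band3: float) -> float:
    """NDVI = (Band 4 - Band 3) / (Band 4 + Band 3) for one pixel."""
    total = band4 + band3
    if total == 0:
        # Assumption: treat an undefined ratio as missing rather than raising.
        return float("nan")
    return (band4 - band3) / total


# Illustrative (made-up) band values: Band 4 well above Band 3, then equal bands.
print(ndvi(0.5, 0.1))
print(ndvi(0.2, 0.2))
```

For non-negative band values the result lies between −1 and 1, and it rises as Band 4 exceeds Band 3 by a larger margin.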