Tree ordinance monitoring - sampling from populations

Sampling from populations

Many of the evaluation techniques we describe involve collecting information from or about discrete units, such as trees, streets, blocks, or residents. In many cases, it may not be practical to perform a complete census of every unit in the overall population. However, it is still possible to obtain reliable information about the overall population by collecting data from a representative subset or sample. Sampling is simply the technique used to choose representative units for study from a larger population. Sampling is a prerequisite of several of the assessment methods discussed in section 3, including photogrammetry, ground survey, and public polling.

Statistical bias

The reason for using statistically sound sampling methods is to avoid bias in the estimates of the parameter(s) you are measuring. Although the value of any single estimate (biased or not) is unlikely to equal the true population value, the mean of a large number of unbiased estimates will approximate the true value. In contrast, the mean of a large number of biased estimates will either be higher or lower than the true population value, depending on the direction of the bias. Hence, if you are interested in knowing the actual value of a parameter from the population (e.g., actual percent tree canopy cover), you generally want to use an unbiased estimator of that parameter. In some situations, a small bias (e.g., a tendency to slightly over- or underestimate cover) can be tolerated if the bias is small relative to the standard deviation of the estimation errors (perhaps 10% to 15% or less).

Bias in estimates can come from various sources. For instance, if tree shadows are counted as canopy in aerial photo interpretation (misclassification bias), the canopy cover estimate will be biased upward. In public polling, people who fail to respond to a survey may constitute a source of sampling bias. If some segment(s) of the population (e.g., retirees, working couples, low-income households) are either more or less likely to respond than other population segments, responses may not be representative of the population as a whole. Many types of bias can be avoided through good sampling design and the careful implementation of appropriate evaluation techniques.
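The contrast between unbiased and biased estimates can be illustrated with a small simulation. The sketch below is not drawn from any real survey: the true canopy cover (30%), the chance that a non-canopy dot is misread as canopy because of shadow (10%), and the function name estimate_cover are all assumptions made for illustration.

```python
import random

TRUE_COVER = 0.30    # assumed true canopy cover (illustrative value)
SHADOW_ERROR = 0.10  # assumed chance a non-canopy dot is misread as canopy


def estimate_cover(rng, dots, shadow_error=0.0):
    """Estimate canopy cover from one simulated dot-grid sample."""
    hits = 0
    for _ in range(dots):
        if rng.random() < TRUE_COVER:
            hits += 1  # dot really falls on canopy
        elif rng.random() < shadow_error:
            hits += 1  # shadow misclassified as canopy
    return hits / dots


rng = random.Random(7)  # fixed seed so the run is repeatable
unbiased = [estimate_cover(rng, 100) for _ in range(1000)]
biased = [estimate_cover(rng, 100, SHADOW_ERROR) for _ in range(1000)]

mean_unbiased = sum(unbiased) / len(unbiased)
mean_biased = sum(biased) / len(biased)
print(f"true cover:               {TRUE_COVER:.3f}")
print(f"mean of unbiased samples: {mean_unbiased:.3f}")
print(f"mean of biased samples:   {mean_biased:.3f}")
```

Individual estimates scatter around their mean in both cases, but only the unbiased mean settles near the true value. Under these assumptions the biased mean settles near 0.37, because 10% of the 70% non-canopy area is counted as canopy.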

Random sampling and random numbers

Most statistical methods are based on the assumption of random sampling. This simply means that every unit in the population has an equal chance of being chosen for the sample. Furthermore, the selection of random units should be independent of other units that have been sampled. If you reject a sample unit because you think it is too close to one already chosen, your sample will not be random and independent. A relatively simple and reliable method for randomization is to use random numbers. Most spreadsheet, database, and statistical programs that run on personal computers have functions that generate random numbers. Although these random number generators may not be optimal, they will generally suffice. You can also obtain random numbers from online generators (e.g., https://www.random.org).

Several techniques can be used to draw a random sample from a population that consists of individual objects or records (e.g., street addresses or tree numbers). Many spreadsheet programs include tools that can produce a random sample of a specified size from a range of cells. Alternatively, you can assign a unique random number to each unit or record, sort on the random number, and pick the required number of units from the top of the sorted database.
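The assign-sort-pick approach can also be sketched in a few lines of Python. The tree ID format and the function name draw_random_sample are hypothetical; any list of unique records would work the same way.

```python
import random


def draw_random_sample(units, sample_size, seed=None):
    """Assign a random number to each unit, sort, and take the top rows."""
    rng = random.Random(seed)
    # Pair every unit with its own random number (ties are vanishingly rare).
    keyed = [(rng.random(), unit) for unit in units]
    # Sort on the random number only, as you would sort a spreadsheet column.
    keyed.sort(key=lambda pair: pair[0])
    return [unit for _, unit in keyed[:sample_size]]


# Hypothetical inventory of 500 street trees numbered T0001 to T0500.
tree_ids = [f"T{n:04d}" for n in range(1, 501)]
sample = draw_random_sample(tree_ids, 25, seed=42)
print(sample)
```

Passing a seed makes the draw repeatable, which can be useful for documenting how a sample was chosen. Python's built-in random.sample(tree_ids, 25) gives an equivalent result in one call.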

In some cases, it is necessary to take random samples across a geographic area, such as part or all of a city or forested area. In such a situation, random sample points can be assigned by randomly sampling from a coordinate grid that has been established for the area in question. This may either be an existing set of map-based coordinates, such as UTM or State Plane grids, or an arbitrary grid based on units measured on a map or aerial photograph (e.g., distances measured from the bottom and left edge of the map or photo). After you have determined the range of X and Y coordinates within the area to be sampled, X and Y coordinates can be selected randomly to generate random sample points.
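A minimal sketch of generating random sample points from a coordinate range follows. The coordinate limits are made-up map units, and the optional is_inside check is an assumption added here for study areas that do not fill the whole rectangle; points that fall outside are simply redrawn.

```python
import random


def random_sample_points(x_min, x_max, y_min, y_max, count,
                         is_inside=lambda x, y: True, seed=None):
    """Return `count` random (x, y) points within the coordinate range.

    `is_inside` is a hypothetical test for irregular study areas; by
    default every point inside the rectangle is accepted.
    """
    rng = random.Random(seed)
    points = []
    while len(points) < count:
        x = rng.uniform(x_min, x_max)
        y = rng.uniform(y_min, y_max)
        if is_inside(x, y):  # discard and redraw points outside the area
            points.append((x, y))
    return points


# Hypothetical grid: distances in mm from the left and bottom map edges.
points = random_sample_points(0, 1000, 0, 500, 20, seed=1)
for x, y in points:
    print(f"{x:7.1f} {y:7.1f}")
```

Each point is drawn without regard to the points already chosen, so the sample stays random and independent.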

Stratified sampling

In many urban forestry applications, it is desirable to have samples distributed throughout the population. For instance, you may want to ensure that trees from each of several different maintenance districts are included in the sample. In such situations, stratified random sampling will be the most efficient and meaningful method for selecting samples. In this method, the population to be sampled is first divided into meaningful subunits or strata. These may be large subdivisions, planning sectors, maintenance districts, or any other convenient management or planning unit.

If strata are assigned so that each is more or less homogeneous with respect to the characteristics being measured, fewer samples will be needed to adequately characterize each stratum. For instance, if tree cover is to be assessed in different portions of a city, visual estimates of the tree canopy cover could be used to help demarcate zones where canopy cover is relatively uniform. A sample of street trees might be stratified by tree species, size, and/or age, depending on the purpose of the evaluation. If these trees were classified in a municipal street tree database, stratification might be accomplished relatively simply from existing tree data. However, if such data are lacking, it may be necessary to conduct a preliminary survey to delineate the population before sampling occurs. For example, in a study we conducted on utility pruning, we needed to sample from a population of matched pairs of London plane (Platanus x acerifolia) street trees that were both directly under conductors and had clearances within a certain range. Because existing tree inventories did not contain all of the necessary information, we surveyed the study area to identify a population of trees that met these criteria. These trees constituted a particular stratum of the street tree population.

Once strata are assigned and delineated, samples are drawn at random from within each stratum. If the number of samples selected from each stratum is not proportional to the size of the stratum, then the averages from each will have to be weighted to obtain an overall population average.
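The weighting step can be sketched as follows. Each stratum mean is weighted by the stratum's share of the whole population. The three districts, their tree counts, and the sample values are invented for illustration, and stratified_mean is a hypothetical helper name.

```python
def stratified_mean(strata):
    """Weight each stratum's sample mean by the stratum's population size.

    `strata` is a list of (stratum_size, sample_values) pairs.
    """
    total_size = sum(size for size, _ in strata)
    weighted_sum = 0.0
    for size, values in strata:
        stratum_mean = sum(values) / len(values)
        weighted_sum += size * stratum_mean
    return weighted_sum / total_size


# Hypothetical districts: (number of trees, sampled canopy cover in %).
districts = [
    (6000, [10, 20]),  # large district, low cover
    (3000, [30, 50]),
    (1000, [60, 80]),  # small district, high cover
]
overall = stratified_mean(districts)
print(overall)  # 28.0
```

Here the same number of samples was taken from every district even though the districts differ in size. A simple average of all six values would give about 41.7%, overstating cover because the small, heavily wooded district is overrepresented in the sample.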

Sample size

Optimal sample size will vary somewhat with the characteristics being rated or tallied.

In general:

up to a point, the reliability of estimates will increase as sample size increases;
the more variable the population is with respect to the characteristic(s) being rated, the larger the sample should be;
a large sample is required to accurately estimate the frequencies of relatively rare events or characteristics;
larger sample sizes are needed in order to detect relatively small differences between means or proportions; smaller sample sizes may suffice if the differences are relatively large.

The optimum sample size represents a compromise between cost and accuracy, since both generally increase with increasing sample size. You can determine an optimum sample size by identifying the point of diminishing returns beyond which further increases in accuracy are not worth the additional costs of data collection. Optimum sample size will vary with the type of data being collected, so it is not possible to set a single number for all applications.
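The point of diminishing returns can be seen in how the standard error of an estimate shrinks as sample size grows. The sketch below uses the standard formula for the standard error of a proportion under simple random sampling, sqrt(p(1-p)/n). The 30% cover value and the sample sizes are assumptions chosen for illustration.

```python
import math


def standard_error_of_proportion(p, n):
    """Standard error of a sample proportion under simple random sampling."""
    return math.sqrt(p * (1 - p) / n)


# Assumed true proportion of 0.30 (e.g., 30% canopy cover).
for n in (25, 100, 400, 1600):
    se = standard_error_of_proportion(0.30, n)
    print(f"n = {n:5d}   standard error = {se:.4f}")
```

Because the standard error falls with the square root of the sample size, each fourfold increase in sampling effort only halves the error. At some point the added precision no longer justifies the added cost.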

However, you can use certain statistical formulas to estimate the minimum sample size needed for a specific purpose. A number of statistics web sites include online interactive calculators that allow you to estimate required sample sizes. Before you can use these sample size calculators, you will need to know several things about the data you are collecting and how they will be analyzed:

Type of data. Main data types include:

continuous - variables can take any value, e.g., tree diameters
discrete - variables can only have certain discrete values. Types of discrete data include:
- ranks - ordered ratings, e.g., low, moderate, high
- counts - e.g., number of trees by species
- binary - variables have only two outcomes, e.g., present/absent. Binary data are typically expressed as proportions or percents, such as the percent canopy cover determined from dot grid counts (canopy is rated as present or absent for each dot).

Type of analysis. Continuous data are typically analyzed using linear models, including linear regression and analysis of variance techniques. Discrete data may be analyzed in various ways, including contingency table analysis, logistic regression, and survival analysis. Different formulas are used to estimate sample sizes for various analysis methods.

Expected values. To estimate sample sizes for analyses of continuous data you will have to specify estimates of expected population means (the Greek letter mu may be used for this term) and standard deviations or variances (the Greek letter sigma symbolizes the population standard deviation; variance is the square of the standard deviation). For proportions, estimates of the expected proportions are needed; margins of error (as percents) may also be needed.

Data structure. If data are paired or arranged in blocks or other more complex designs, the structure of the statistical model should be specified.

Significance level. Denoted by the Greek letter alpha, this is the probability of Type I error, the chance that you will say that a difference is significant when it really isn't (i.e., the probability of rejecting the null hypothesis when it is true). This is typically set at a low level, often 5% (alpha=0.05), meaning that there would only be a 5% (1 in 20) chance of deciding that a spurious difference is real. The corresponding confidence level is 1-alpha (95% when alpha=0.05), which is the chance of avoiding Type I error. Sample size calculators may ask for either value.

Power. This parameter complements alpha, and is expressed as (1-beta), where beta is the probability of Type II error (failing to detect a difference that really exists). Power is the probability of detecting a real difference (i.e., the probability of rejecting the null hypothesis when it is false). If you are interested in detecting real differences, the power of a test should be high, generally at least 80% (0.8).
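To show how these inputs fit together, the sketch below applies two common textbook approximations based on the normal distribution: one for estimating a single proportion to within a given margin of error, and one for comparing the means of two equal-sized groups. These are not necessarily the formulas used by any particular calculator, and the function names and example values are assumptions for illustration. Exact methods (for example, those based on the t distribution) may give slightly larger numbers.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(p, margin, alpha=0.05):
    """Samples needed to estimate proportion p to within +/- margin."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)


def sample_size_two_means(sigma, difference, alpha=0.05, power=0.80):
    """Samples needed per group to detect a given difference in means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # guards against Type I error
    z_beta = NormalDist().inv_cdf(power)           # guards against Type II error
    n = 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / difference ** 2
    return math.ceil(n)


# Expected proportion 0.5, margin of error 5 percentage points, alpha = 0.05.
print(sample_size_for_proportion(0.5, 0.05))  # 385

# Standard deviation 10, smallest difference of interest 5, power = 0.80.
print(sample_size_two_means(10, 5))  # 63
```

Halving the difference to be detected roughly quadruples the required sample size, which is why detecting relatively small differences calls for much larger samples.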

Links to sample size calculators

Some useful web sites with sample size calculators are listed below. Additional sites can be found by following links on some of these pages or by searching on the term "sample size" on various web search engines.

http://www.stat.uiowa.edu/~rlenth/Power/ : Russ Lenth's Java applets for power and sample size - This site provides a variety of powerful but easy-to-use applets that allow you to calculate sample size and interactively see how sample size, power, alpha, and other study design factors are interrelated.

http://www.quantitativeskills.com/sisa/ : SISA: Simple Interactive Statistical Analysis - This site includes a number of statistical analysis applications that can be run interactively online. It includes sample size calculators for both continuous and binary (proportion) data.

http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/PowerSampleSize : Power and Sample Size Estimation - A downloadable application (PS) for calculating sample size and power.