What is it? Random forest consists of hundreds or more decision trees, with each tree using a random subset of data. All of the decision trees cast a vote on the classification of a sample; the majority vote wins.
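The majority-vote idea can be shown in a few lines of Python (a toy sketch; the labels and votes are invented for illustration):

```python
from collections import Counter

# Toy illustration of the voting step only: each "tree" casts one vote,
# and the label with the most votes wins.
tree_votes = ["cancer", "healthy", "cancer", "cancer", "healthy"]
majority, count = Counter(tree_votes).most_common(1)[0]
print(majority, count)  # -> cancer 3 (3 of 5 trees agree)
```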
When is it used? Random forest is one of the most commonly used models. It works well when 1) there are many variables to consider (e.g., expression of thousands of proteins), 2) you have only moderate computing capacity, 3) you don’t want to analyze a separate set of samples for cross-validation, and 4) the groups may or may not be normally distributed (random forest makes no assumption of normality).
How does it work?
Random forest: Example
We analyze the protein profile of 1,000 proteins of 100 healthy patients and 100 cancer patients using an antibody-based microarray. We want to find biomarkers that will predict which future patients are healthy or diseased.
1. Create a data table where each row represents a protein and each column represents a patient.
2. Assign groups. Here, you tell the software which samples are healthy and which have cancer.
3. Center the data by subtracting the mean of each patient’s dataset from that dataset. All datasets now have a mean of 0.
4. Scale the data by dividing each patient’s dataset by its standard deviation. All datasets now have a standard deviation of 1.
5. Set aside 1/3 of the data. These samples will be used in Step 7 for cross-validation.
6. Create decision trees, each using a subset of samples and variables at a time (Figure 1). The modeler determines the number of trees. Each tree is built from a different random subset of the data, and the same sample can be chosen more than once for a single tree (bootstrap sampling).
7. Determine the accuracy of the random forest. The samples set aside in Step 5 are run through all of the decision trees. The accuracy of the random forest is the proportion of these patients that the forest classifies correctly.
8. Apply the random forest to samples with unknown health status. Some trees will classify a patient as healthy, while others will classify the patient as diseased; the majority vote wins.
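The steps above can be sketched end to end in Python, assuming scikit-learn is available. The 100 + 100 patients and 1,000 proteins come from the example; the simulated biomarker shift and the choice of 500 trees are illustrative assumptions, not part of the original study:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in for the microarray: 100 healthy + 100 cancer patients,
# 1,000 proteins each; the first 10 proteins are shifted in the cancer group
# so there is a signal to find. (scikit-learn expects patients as rows.)
X = rng.normal(size=(200, 1000))
y = np.array([0] * 100 + [1] * 100)   # 0 = healthy, 1 = cancer
X[y == 1, :10] += 2.0

# Steps 3-4: center and scale each patient's dataset to mean 0, SD 1.
# StandardScaler works column-wise, so transpose to put patients in columns.
X = StandardScaler().fit_transform(X.T).T

# Step 5: set aside 1/3 of the samples for cross-validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, stratify=y, random_state=0
)

# Step 6: grow the forest; each tree sees a bootstrap sample of patients
# and a random subset of proteins at every split.
forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest.fit(X_train, y_train)

# Step 7: accuracy = proportion of held-out patients classified correctly.
accuracy = forest.score(X_test, y_test)

# Step 8: classify a patient of unknown status by majority vote of the trees.
new_patient = rng.normal(size=(1, 1000))
prediction = forest.predict(new_patient)
```

Per-patient centering and scaling (as in Steps 3-4) is done here by transposing before `StandardScaler`; many workflows instead standardize per protein, which is the scikit-learn default orientation.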
What does the data look like? The random forest model is an ensemble of hundreds of trees and cannot be represented easily; however, the biomarkers used to create the model can be extracted during cross-validation and evaluation of the model’s performance (e.g., via ROC curve analysis).
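One common way to extract candidate biomarkers, assuming scikit-learn, is to rank variables with the fitted forest's `feature_importances_` attribute. The data below are hypothetical, with a single planted biomarker:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

# Hypothetical data: 200 patients x 50 proteins, where protein 0 is the
# only true biomarker (strongly shifted in the diseased group).
X = rng.normal(size=(200, 50))
y = np.array([0] * 100 + [1] * 100)
X[y == 1, 0] += 3.0

forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

# Rank proteins by how much each one reduces impurity across the forest;
# the top-ranked proteins are the candidate biomarkers.
ranked = np.argsort(forest.feature_importances_)[::-1]
top_biomarker = int(ranked[0])
```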