Labeled data

From Wikipedia, the free encyclopedia
Jump to navigation Jump to search

Labeled data is a group of samples that have been tagged with one or more labels. Labeling typically takes a set of unlabeled data and augments each piece of that unlabeled data with meaningful tags that are informative. For example, labels might indicate whether a photo contains a horse or a cow, which words were uttered in an audio recording, what type of action is being performed in a video, what the topic of a news article is, what the overall sentiment of a tweet is, whether the dot in an x-ray is a tumor, etc.

Labels can be obtained by asking humans to make judgments about a given piece of unlabeled data (e.g., "Does this photo contain a horse or a cow?"), and are significantly more expensive to obtain than the raw unlabeled data.

After obtaining a labeled dataset, machine learning models can be applied to the data so that new unlabeled data can be presented to the model and a likely label can be guessed or predicted for that piece of unlabeled data.[1]


  1. ^ Johnson, Leif. "What is the difference between labeled and unlabeled data?", Stack Overflow, 4 October 2013. Retrieved on 13 May 2017.  This article incorporates text by lmjohns3 available under the CC BY-SA 3.0 license.